Interpreting the Validity of a High-Stakes Test in Light of the Argument-Based Framework: Implications for Test Improvement

Document Type: Research Article


1 Department of English Language, College of Languages, University of Human Development, Sulaimani, Kurdistan, Iraq

2 Shiraz University


The validity of large-scale assessments may be compromised, partly because of inappropriate content or construct underrepresentation. Few validity studies have examined such assessments within an argument-based framework. This study analyzed the domain description and evaluation inferences of the Ph.D. Entrance Exam of ELT (PEEE), taken by Ph.D. examinees (n = 999) in Iran in 2014. To gather evidence for the domain description inference, applied linguistics experts (n = 12) scrutinized the test content. For the evaluation inference, the reliability and differential item functioning (DIF) of the test were examined. Results indicated that the test is biased because (1) the test tasks are not fully represented in the Ph.D. course objectives, (2) an IRT analysis showed the test to be most reliable for high-ability test-takers, and (3) a logistic regression (LR) analysis flagged 4 items for nonnegligible DIF. Implications for language testing and assessment are discussed, and some suggestions are offered.

