Teacher vs. Machine Correction: Comparing Assessments of Students’ Reading Comprehension and Writing Skills

Document Type: Research Article

Authors

1 Department of Didactics and School Organization, Faculty of Education, Universidad Nacional de Educación a Distancia, Madrid, Spain

2 Department of Economics, Quantitative Methods and Economic History, Faculty of Business, Pablo de Olavide University, Sevilla, Spain

3 Department of Foreign Languages and their Linguistics, Faculty of Philology, Universidad Nacional de Educación a Distancia, Madrid, Spain

Abstract

This article presents research comparing two correction techniques applied to answers to a PISA text-summary question written by 30 Spanish students aged 14-16: one correction by automatic assessment software (G-Rubric) and the other by 30 Spanish language teachers varying in age, sex, and classroom experience. The methodology was a parametric approach based on latent class analysis, carried out with the Latent Gold 4.5 software, combined with correspondence analysis. The Euclidean distance between each teacher and the system was classified as low, medium, or high dissimilarity, according to how close the teacher's assessment was to that of the correction software. The results showed that a first cluster, composed of teachers whose correction scores correlated significantly with those of the tool, corresponded to the quartile of younger and less experienced teachers. In contrast, a second cluster, characterized by high dissimilarity, consisted of older and more experienced teachers whose corrections deviated notably from the system and yielded scores lower than those produced by the tool.
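
To make the dissimilarity measure concrete, the Python sketch below illustrates one way such a comparison could be computed: the Euclidean distance between each teacher's set of marks and the marks assigned by the automatic corrector, binned into low, medium, and high dissimilarity. The function name, the tercile cut-offs, and the synthetic scores are illustrative assumptions for this note only; the study itself derived its clusters with latent class analysis in Latent Gold 4.5, not with this code.

# Illustrative sketch, not the authors' procedure: hypothetical data and
# tercile cut-offs show how teacher-vs-machine dissimilarity could be
# computed from Euclidean distances and labelled low / medium / high.
import numpy as np

def dissimilarity_levels(teacher_scores, machine_scores):
    """teacher_scores: (n_teachers, n_students) array of human marks.
    machine_scores: (n_students,) array of marks from the automatic corrector."""
    teacher_scores = np.asarray(teacher_scores, dtype=float)
    machine_scores = np.asarray(machine_scores, dtype=float)
    # Euclidean distance between each teacher's marks and the machine's marks
    distances = np.linalg.norm(teacher_scores - machine_scores, axis=1)
    # Hypothetical cut-offs: terciles of the observed distances
    low_cut, high_cut = np.quantile(distances, [1 / 3, 2 / 3])
    labels = np.where(distances <= low_cut, "low",
                      np.where(distances <= high_cut, "medium", "high"))
    return distances, labels

# Made-up example: 30 teachers marking the same 30 summaries on a 0-10 scale
rng = np.random.default_rng(0)
machine = rng.uniform(0, 10, size=30)
teachers = machine + rng.normal(0, 1.5, size=(30, 30))
dist, level = dissimilarity_levels(teachers, machine)
print(list(zip(np.round(dist, 2), level))[:5])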

Keywords


References

ACARA NASOP research team. (2015). An evaluation of automated scoring of NAPLAN persuasive writing. Australian Curriculum, Assessment and Reporting Authority (ACARA). https://nap.edu.au/_resources/20151130_ACARA_research_paper_on_online_automated_scoring.pdf
Amorim, E., & Veloso, A. (2017). A multi-aspect analysis of automatic essay scoring for Brazilian Portuguese. In Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics (pp. 94-102). Association for Computational Linguistics, Valencia, Spain. https://doi.org/10.18653/v1/E17-4010
Arnal-Bailera, A., Muñoz-Escolano, J. M., & Oller-Marcén, A. M. (2016). Characterization of behavior of correctors when grading mathematics tests. Revista de Educación, 371, 35-60. https://doi.org/10.4438/1988-592X-RE-2015-371-307
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3), 1-21. https://doi.org/10.1002/j.2333-8504.2004.tb01972.x
Barrada, J. R., Olea, J., Ponsoda, V., & Abad, F. J. (2006). Item selection rules in a computerized adaptive test for the assessment of written English. Psicothema, 18(4), 828-834.
Barrett, C.M. (2015). Automated essay evaluation and the computational paradigm: Machine scoring enters the classroom [Unpublished doctoral dissertation]. University of Rhode Island. https://doi.org/10.23860/diss-barrett-catherine-2015
Bartholomew, D. J., Steele, F., Moustaki, I., & Galbraith, J. I. (2002). The analysis and interpretation of multivariate data for social scientists. Chapman & Hall.
Bejar, I. I. (2011). A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy & Practice, 18(3), 319-341. https://doi.org/10.1080/0969594X.2011.555329
Benítez, M. H., & Lancho, M. S. (2016). G-Rubric: Una aplicación para corrección automática de preguntas abiertas. Primer balance de su utilización [G-Rubric: An application for the automatic correction of open-ended questions. A first assessment of its use]. In Nuevas perspectivas en la investigación docente de la historia económica (pp. 473-494). Editorial de la Universidad de Cantabria.
Bennett, R. E. (2010). Cognitively based assessment of, for, and as learning (CBAL): A preliminary theory of action for summative and formative assessment. Measurement, 8(2-3), 70-91. https://doi.org/10.1080/15366367.2010.508686
Ben-Simon, A., & Bennett, R. E. (2007). Toward more substantively meaningful automated essay scoring. The Journal of Technology, Learning and Assessment, 6(1), 1-47. https://ejournals.bc.edu/index.php/jtla/article/view/1631
Blázquez, M., & Fan, C. (2019). The efficacy of spell check packages specifically designed for second language learners of Spanish. Pertanika Journal of Social Sciences & Humanities – JSSH, 27(2), 847-863.
Blumenstein, M., Green, S., Fogelman, S., Nguyen, A., & Muthukkumarasamy, V. (2008). Performance analysis of GAME: A generic automated marking environment. Computers & Education, 50(4), 1203-1216. https://doi.org/10.1016/j.compedu.2006.11.006
Bol, L., Stephenson, P., O'Connell, A., & Nunnery, J. (1998). Influence of experience, grade level, and subject area on teachers' assessment practices. Journal of Educational Research, 91, 323-330. https://doi.org/10.1080/00220679809597562
Brackett, M. A., Floman, J. L., Ashton-James, C., Cherkasskiy, L., & Salovey, P. (2013). The influence of teacher emotion on grading practices: A preliminary look at the evaluation of student writing. Teachers and Teaching, 19(6), 634-646. https://doi.org/10.1080/13540602.2013.827453
Bridgeman, B. (2009). Experiences from large-scale computer-based testing in the USA. In F. Scheuermann, & J. Bjornsson (Eds.), The transition to computer-based assessment (pp. 39-44). Office for Official Publications of the European Communities.
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27-40. https://doi.org/10.1080/08957347.2012.635502
Burstein, J., & Chodorow, M. (1999). Automated essay scoring for nonnative English speakers. Joint Symposium of the Association of Computational Linguistics and the International Association of Language Learning Technologies, Workshop on Computer-Mediated Language Assessment and Evaluation of Natural Language Processing, College Park, Maryland. https://aclanthology.org/W99-0411
Chen, C. F. E., & Cheng, W. Y. E. C. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2), 94-112.
Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater’s performance on TOEFL essays. ETS.
Crossley, S. A., Bradfield, F., & Bustamante, A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. Journal of Writing Research, 11(2), 251-270. https://doi.org/10.17239/jowr-2019.11.02.01
Csapó, B., Ainley, J., Bennett, R. E., Latour, T., & Law, N. (2012). Technological issues for computer-based assessment. In Assessment and teaching of 21st century skills (pp. 143-230). Springer. https://doi.org/10.1007/978-94-007-2324-5_4
da Cunha, I. (2020). Una herramienta TIC para la redacción del Trabajo de Fin de Grado (TFG) [An ICT tool for writing the final-year undergraduate dissertation (TFG)]. UA Revistes Cientifiques, 34, 39-72. https://doi.org/10.14198/ELUA2020.34.2
Díez-Arcón, P., & Martín-Monje, E. (2021). G-Rubric: The use of open technologies to provide personalised feedback in languages for specific purposes. In EDULEARN21 Proceedings (pp. 2635-2643). IATED. https://doi.org/10.21125/edulearn.2021.0574
Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5(1), 1-36. https://ejournals.bc.edu/index.php/jtla/article/view/1640
Eggen, T. J. H. M., & Verschoor, A. J. (2006). Optimal testing with easy or difficult items in computerized adaptive testing. Applied Psychological Measurement, 30, 379-393. https://doi.org/10.1177/0146621606288890
Ericsson, P. F., & Haswell, R. H. (2006). Machine scoring of student essays: Truth and consequences. Utah State University Press.
Fernández-Alonso, R., Woitschach, P., & Muñiz Fernández, J. (2019). Rubrics do not neutralize raters' effects: A many-faceted Rasch model estimation. Revista de Educación, 386, 89-112. https://doi.org/10.4438/1988-592X-RE-2019-386-428
Foltz, P. W. (1996). Latent semantic analysis for text-based research. Behavior Research Methods, Instruments, & Computers, 28(2), 197-202. https://doi.org/10.3758/BF03204765
Hashemian, M., & Farhang-Ju, M. (2018). Effects of metalinguistic feedback on grammatical accuracy of Iranian field (in)dependent L2 learners' writing ability. Journal of Research in Applied Linguistics, 9(2), 141-161. https://doi.org/10.22055/rals.2018.13797
He, Y., Hui, S. C., & Quan, T. T. (2009). Automatic summary assessment for intelligent tutoring systems. Computers & Education, 53(3), 890-899. https://doi.org/10.1016/j.compedu.2009.05.008
Hoomanfard, M. H., Jafarigohar, M., Jalilifar, A., & Masum, S. M. H. (2018). Comparative study of graduate students’ self-perceived needs for written feedback and supervisors’ perceptions. Journal of Research in Applied Linguistics, 9(2), 24-46. https://doi.org/10.22055/rals.2018.13792
JISC (Joint Information Systems Committee). (2007). Effective practice with e-assessment: An overview of technologies, policies and practice in further and higher education. http://www.jisc.ac.uk/media/documents/themes/elearning/effpraceassess.pdf
Jorge-Botana, G., Luzón, J. M., Gómez-Veiga, I., & Martín-Cordero, J. I. (2015). Automated LSA assessment of summaries in distance education: Some variables to be considered. Journal of Educational Computing Research, 52(3), 341-364.
Klobucar, A., Elliot, N., Deess, P., Rudniy, O., & Joshi, K. (2013). Automated scoring in context: Rapid assessment for placed students. Assessing Writing, 18(1), 62-84. https://doi.org/10.1016/j.asw.2012.10.001.
Landauer, T. K. (2003). Automatic essay assessment. Assessment in Education: Principles, Policy & Practice, 10(3), 295-308.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2-3), 259-284.
Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer et al. (Eds.), Measurement and prediction (pp. 362-472). Princeton University Press.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Houghton Mifflin.
Magidson, J., & Vermunt, J. K. (2004). Latent class models. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 175-198). Sage Publications.
Martín-Monje, E., & Barcena, E. (2024). Tutor vs. automatic focused feedback and grading of student ESP compositions in an online learning environment. Journal of Research in Applied Linguistics, 15(2), 22-42. https://doi.org/10.22055/rals.2024.45636.3198
Mirzaee, A., & Tazik, K. (2014). Typological description of written formative feedback on student writing in an EFL context. Journal of Research in Applied Linguistics, 5(2), 79-94. https://rals.scu.ac.ir/article_11013_1201.html
Noorbehbahani, F., & Kardan, A. A. (2011). The automatic assessment of free text answers using a modified BLEU algorithm. Computers & Education, 56(2), 337-345. https://doi.org/10.1016/j.compedu.2010.07.013
Perelman, L. (2012). Mass-market writing assessments as bullshit. In N. Elliot & L. Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 425-437). Hampton Press.
Perelman, L. (2013). Critique of Mark D. Shermis & Ben Hamner: Contrasting state-of-the-art automated scoring of essays: Analysis. The Journal of Writing Assessment, 6(1), 1-10. http://journalofwritingassessment.org/article.php?article=69
Pérez-Marín, D., Alfonseca Cubero, E., & Rodríguez Marín, P. (2006). ¿Pueden los ordenadores evaluar automáticamente preguntas abiertas? [Can computers automatically assess open-ended questions?]. Novática, 50, 50-53.
Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2000). Comparing the validity of automated and human essay scoring. ETS Research Report Series, 2000(2), 1-23. https://doi.org/10.2190/CX92-7WKV-N7WC-JL0A
Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2002). Stumping e-rater: Challenging the validity of automated essay scoring. Computers in Human Behavior, 18(2), 103-134. https://doi.org/10.1016/S0747-5632(01)00052-8
Redecker, C. (2013). The use of ICT for the assessment of key competences. Publications Office of the European Union. https://doi.org/10.2791/87007
Royal-Dawson, L., & Baird, J. A. (2009). Is teaching experience necessary for reliable scoring of extended English questions? Educational Measurement: Issues and Practice, 28(2), 2-8. https://doi.org/10.1111/j.1745-3992.2009.00142.x
Rudner, L., & Gagne, P. (2001). An overview of three approaches to scoring written essays by computer. ERIC Digest.
Rudner, L., Garcia, V., & Welch, C. (2005). An evaluation of Intellimetric™ essay scoring system using responses to GMAT AWA prompts. McLean, VA: GMAC. https://www.gmac.com/~/media/Files/gmac/Research/research-report-series/RR0508_IntelliMetricAWA.pdf
San Mateo, A. (2016). A bigram corpus used as a grammar checker for Spanish native speakers. Revista Signos, 49(90), 94-118. https://doi.org/10.4067/S0718-09342016000100005
Santamaría-Lancho, M., Hernández, M., Sánchez-Elvira, Á., Luzón, J. M., & Jorge-Botana, G. (2018). Using semantic technologies for formative assessment and scoring in large courses and MOOCs. Journal of Interactive Media in Education, 2018(1), 12. https://doi.org/10.5334/jime.468
Shermis, M. D., & Burstein, J. C. (2003). Automated essay scoring: A cross-disciplinary perspective. Routledge.
Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art automated scoring of essays: Analysis. In M. D. Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation (Chapter 19). Routledge. https://doi.org/10.4324/9780203122761
Soleimani, H., & Rahmanian, M. (2014). Self-, peer-, and teacher-assessments in writing improvement: A study of complexity, accuracy, and fluency. Journal of Research in Applied Linguistics, 5(2), 128-148. https://rals.scu.ac.ir/article_11016.html
Toranj, S., & Ansari, D. N. (2012). Automated versus human essay scoring: A comparative study. Theory and Practice in Language Studies, 2(4), 719-725. https://doi.org/10.4304/tpls.2.4.719-725
Tsai, M. H. (2012). The consistency between human raters and an automated essay scoring system in grading high school students' English writing. Action in Teacher Education, 34(4), 328-335. https://doi.org/10.1080/01626620.2012.717033
Usener, C. A., Gruttmann, S., Majchrzak, T. A., & Kuchen, H. (2010). Computer-supported assessment of software verification proofs. In 2010 International Conference on Educational and Information Technology (Vol. 1, pp. V1-115). IEEE. https://doi.org/10.1109/ICEIT.2010.5607766
Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research on automated essay grading. Journal of Information Technology Education: Research, 2(1), 319-330. https://doi.org/10.28945/331
Vázquez‑Cano, E., Mengual‑Andrés, S., & López‑Meneses, E. (2021). Chatbot to improve learning punctuation in Spanish and to enhance open and flexible learning environments. International Journal of Educational Technology in Higher Education, 18, 33. https://doi.org/10.1186/s41239-021-00269-8
Vázquez-Cano, E., Ramírez-Hurtado, J. M., Sáez-López, J. M., & López-Meneses, E. (2023). ChatGPT: The brightest student in the class. Thinking Skills and Creativity, 49, 101380. https://doi.org/10.1016/j.tsc.2023.101380
Vermunt, J. K., & Magidson, J. (2002). Latent class cluster analysis. In J. Hagenaars & A. McCutcheon (Eds.), Applied latent class models (pp. 89-106). Cambridge University Press.
Vermunt, J. K., & Magidson, J. (2003). Addendum to Latent Gold user's guide: Upgrade for version 3. Statistical Innovations Inc.
Vermunt, J. K., & Magidson, J. (2005). Technical guide for Latent Gold 4.0: Basic and advanced. Statistical Innovations, Inc.
Villena, J., González, B., González, B., & Muriel, M. (2002). STILUS: Sistema de revisión lingüística de textos en castellano [STILUS: A system for the linguistic revision of texts in Spanish]. Procesamiento del Lenguaje Natural, 29, 305-306. https://rua.ua.es/dspace/handle/10045/1759
Wang, H. C., Chang, C. Y., & Li, T. Y. (2008). Assessing creative problem-solving with automated text grading. Computers & Education, 51(4), 1450-1466. https://doi.org/10.1016/j.compedu.2008.01.006
Wang, J., & Brown, M. S. (2008). Automated essay scoring versus human scoring: A correlational study. Contemporary Issues in Technology and Teacher Education, 8(4), 310-325.
Ware, P. (2011). Computer-generated feedback on student writing. TESOL Quarterly, 45(4), 769-774. https://doi.org/10.5054/tq.2011.272525
Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies: An International Journal, 3(1), 22-36. https://doi.org/10.1080/15544800701771580
Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 157-180. https://doi.org/10.1191/1362168806lr190oa
Wohlpart, A. J., Lindsey, C., & Rademacher, C. (2008). The reliability of computer software to score essays: Innovations in a humanities course. Computers and Composition, 25(2), 203-223. https://doi.org/10.1016/j.compcom.2008.04.001
Zhang, M. (2013). Contrasting automated and human scoring of essays. R & D Connections, 21(2), 1-11. https://www.ets.org/research/policy_research_reports/publications/periodical/2013/jpdd.html