Can AI Outperform Students in English Writing? A Costa Rican Study Reveals Surprising Results



Redacción HC
05/05/2024

With the rapid integration of artificial intelligence (AI) into language education, a key question has emerged: Can AI write as well as, or better than, human students in English as a foreign language (EFL) courses? Equally pressing is how consistently teachers evaluate such writing. A recent study from the University of Costa Rica digs into both questions, offering timely insights for educators and policymakers across Latin America and beyond.

The research, titled "Assessing Artificial Intelligence and Professors' Calibration in English as a Foreign Language Writing Courses at a Costa Rican Public University" and published in Actualidades Investigativas en Educación (2024), reveals that AI-generated texts are not only technically proficient but sometimes outperform student writing—at least in grammar and mechanics. But perhaps more striking is the wide variation in grading among professors, raising concerns about fairness and consistency in educational assessment.

Evaluating the Evaluators: A Quasi-Experimental Approach

The study, conducted by William Charpentier‑Jiménez, employed a quasi-experimental quantitative design. Eight university students enrolled in a TESOL course wrote a paragraph each. Using the same prompts, the researcher generated eight additional paragraphs using AI tools.

Ten experienced EFL professors, each with over a decade of teaching experience, evaluated all 16 paragraphs using a standardized rubric covering five key criteria: content, organization, grammar, mechanics, and vocabulary.

The assessments were analyzed in Excel, calculating average scores, standard deviations, and ranges by paragraph, criterion, and evaluator. Although the sample was relatively small, the results were revealing.
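The descriptive statistics the study computed in Excel (means, standard deviations, and ranges by paragraph) are easy to reproduce in code. The sketch below uses invented scores, not the study's actual data, purely to illustrate the calculations:

```python
from statistics import mean, stdev

# Hypothetical rubric scores (0-10 scale) from five evaluators per paragraph;
# the actual study involved 16 paragraphs and 10 evaluators.
scores = {
    "student_paragraph_1": [7.0, 8.0, 6.5, 7.5, 9.0],
    "ai_paragraph_1": [8.0, 7.5, 8.0, 7.0, 8.5],
}

for paragraph, marks in scores.items():
    avg = mean(marks)
    sd = stdev(marks)              # sample standard deviation across evaluators
    rng = max(marks) - min(marks)  # gap between most lenient and harshest grader
    print(f"{paragraph}: mean={avg:.2f}, sd={sd:.2f}, range={rng:.2f}")
```

The same three summaries can of course be grouped by criterion or by evaluator instead of by paragraph, which is how the study examined grading consistency.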

AI's Technical Edge—and Its Limitations

Overall, the AI-generated paragraphs received a slightly higher average score (7.61) than those written by students (7.56) on a 0–10 scale. When broken down by criterion:

  • Grammar, mechanics, and vocabulary: AI outperformed students with higher scores and lower variability.
  • Content and organization: Human-authored texts demonstrated richer structure and originality, though results varied widely among students.

This suggests that AI excels in technical precision—likely due to its training on grammatically correct language patterns—but falls short in creativity and narrative cohesion, which remain human strengths.

"AI is a flawless spellchecker," the author suggests, "but students are still the ones who tell meaningful stories."

Inconsistency in Grading: The Hidden Challenge

Perhaps the most critical finding was the high variability in professor evaluations: scores for the same paragraph could differ by as much as 2 points depending on who graded it. This inconsistency points to a lack of calibration, which undermines the fairness and reliability of assessments.
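One simple way to surface this kind of calibration problem is to compute, for each paragraph, the spread between the most lenient and the harshest evaluator and flag any spread above a tolerance. The helper below is a hypothetical illustration of that idea, not a procedure from the study:

```python
def calibration_flags(scores_by_paragraph, max_spread=2.0):
    """Return paragraphs whose evaluator score range exceeds max_spread."""
    return {
        pid: max(marks) - min(marks)
        for pid, marks in scores_by_paragraph.items()
        if max(marks) - min(marks) > max_spread
    }

# Invented ratings: three evaluators each; "P3" is flagged (range 3.0 > 2.0).
ratings = {
    "P1": [7.0, 7.5, 8.0],
    "P2": [6.0, 6.5, 7.5],
    "P3": [5.0, 8.0, 6.5],
}
print(calibration_flags(ratings))  # {'P3': 3.0}
```

Flagged paragraphs are natural candidates for discussion in the calibration workshops described below, where evaluators can compare how they applied the rubric.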

The issue of calibration—the alignment of grading standards among instructors—is rarely addressed in Latin American educational contexts. However, it is crucial for ensuring equity, especially as AI becomes more integrated into teaching and assessment processes.

"It doesn't matter how well a student writes," Charpentier-Jiménez warns, "if the grade depends more on who's reading than what's written."

Practical Implications for Teachers and Institutions

The study offers several actionable takeaways for improving EFL writing assessment:

  1. AI as a benchmark: Institutions could use AI-generated texts as standardized reference points during grading workshops to enhance consistency.
  2. Regular calibration sessions: Organizing workshops where teachers collectively evaluate the same texts and discuss rubric criteria can significantly reduce subjective discrepancies.
  3. Balanced evaluation: While AI may assist in scoring grammar and mechanics, teachers should retain authority over evaluating content and creativity, where nuance is essential.
  4. Future research: Larger-scale studies and longitudinal tracking are needed to evaluate the long-term impact of combining AI tools with calibrated human evaluation.

A Hybrid Future for Language Education?

In contexts like Costa Rica—where English proficiency is increasingly important for global engagement—blending AI tools with teacher expertise and calibration protocols could revolutionize how writing is taught and assessed.

This research aligns with broader trends across Latin America, where universities and ministries of education are exploring hybrid approaches to teaching and assessment. However, the study also sounds a cautionary note: without teacher training in calibration and a nuanced understanding of AI's limitations, these tools could deepen inequalities rather than close gaps.

Conclusion: Technology Needs Human Judgment

The study reaffirms a growing consensus: AI is here to stay in education, especially in language learning. But it cannot—and should not—replace human judgment.

Ensuring fair, accurate, and effective writing assessment requires calibrated educators, thoughtful integration of AI tools, and a commitment to continuous professional development. As Charpentier-Jiménez's research shows, when paired correctly, AI and teachers can complement one another—each enhancing the other's strengths.

The future of language education may be artificial—but it must also be intelligent.


Reference: Charpentier-Jiménez, W. (2024). Assessing artificial intelligence and professors' calibration in English as a foreign language writing courses at a Costa Rican public university. Actualidades Investigativas en Educación, 24(1), 1–25. https://doi.org/10.15517/aie.v24i1.55612
