
Evaluating Original Oratory with Large Language Models: A Comparative Study of AI and Human Judging
Yatharth Sathya
04/03/2026
This paper explores the potential of pre-existing large language models (LLMs), such as ChatGPT, Claude, and Gemini, to assist in judging Original Oratory (OO), a category in competitive speech and debate, with potential future extension to other speech and debate events. Because judging in speech events often involves subjective assessments of content, structure, delivery, and rhetorical impact, this study evaluates whether current LLMs can offer reliable and consistent feedback aligned with that of human judges. The experiment involved inputting transcripts of OO speeches into several LLMs and comparing their evaluations against those of experienced human participants in the activity, with both groups scoring the speeches against a standardized rubric. The results show that while LLMs can effectively identify key structural elements, their performance is less consistent than that of their human counterparts: they tend to overrate the speeches and struggle to assess emotional appeal accurately. The findings suggest that, with refinement, LLMs could serve as valuable tools for preliminary feedback or judge training, although they are not ready to substitute for human adjudication in high-stakes competition settings. The outcome of this paper illustrates that if an LLM were deployed in the field of speech and debate, it could help reduce subjectivity and make judging more efficient, because it does not carry the same personal biases as individual human judges.
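As an illustration of the comparison described above, the sketch below shows one way the scoring pipeline could be implemented: each transcript is sent to an LLM with the rubric, and the returned scores are compared with human scores on the same rubric. The model name, the four rubric categories, and the 1-5 scale are illustrative assumptions, not necessarily the exact setup used in this study.

```python
# Illustrative sketch of the evaluation pipeline summarized in the abstract.
# Model name, rubric categories, and the 1-5 scale are assumptions for
# demonstration purposes, not the study's exact configuration.
import json
from statistics import mean

from openai import OpenAI  # pip install openai

RUBRIC = ["content", "structure", "delivery", "rhetorical impact"]  # assumed categories

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def score_speech(transcript: str) -> dict:
    """Ask an LLM to score one OO transcript on each rubric category (1-5)."""
    prompt = (
        "You are judging an Original Oratory speech. Score the transcript below "
        f"from 1 (poor) to 5 (excellent) on each of: {', '.join(RUBRIC)}. "
        "Respond with a JSON object mapping each category to an integer score.\n\n"
        f"Transcript:\n{transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat-capable model would work
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ensures parseable JSON output
    )
    return json.loads(response.choices[0].message.content)


def mean_gap(llm_scores: dict, human_scores: dict) -> float:
    """Average (LLM - human) difference across categories; positive means overrating."""
    return mean(llm_scores[c] - human_scores[c] for c in RUBRIC)
```

Under this setup, a consistently positive mean gap across many speeches would correspond to the overrating pattern reported in the results, and per-category gaps would indicate where LLM and human judgments diverge most, for example on emotional appeal.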