Human Essay Scoring Versus Automated Essay Scoring: A Comparative Study
Dr. Beata Lewis Sevcikova
Prince Sultan University, Riyadh, Saudi Arabia
Language instructors rely on a variety of writing tests, each scored differently, to determine whether students have the writing skills required of them. Most language instructors do not consider the writing test a predictor of an individual’s scores in a subject area. Nevertheless, the writing test score is the only score reported for English Language Arts. From this perspective, two major types of scoring systems are in frequent use, Human Essay Scoring and Automated Essay Scoring, both of which have advantages and disadvantages. Zhang (2013) states, “Essay scoring has traditionally relied on Human Raters, who understand both the content and the quality of writing”. From this perspective, an essay is evaluated on the basis of evidence of the following abilities, which are expected from a good writer:
Clear statement of the perspective of the issue to be discussed and the analysis of the relationship between the writer’s perspective and the perspective or perspectives of others
Development of idea or ideas
Supporting of the developed idea or ideas with reasons and examples
Organization of the idea or ideas logically and clearly
Communicating the idea or ideas effectively using standardized written English
Nevertheless, the increased use of constructed-response items, together with the growing number of students, has raised a serious question about the viability of human scoring alone. According to Zhang (2013), a scoring method that relies only on human effort is expensive and requires considerable logistical effort. Furthermore, this process of scoring usually depends upon the judgment of a less-than-perfect human. Given these concerns about the efficiency of scoring students’ written work, testing programs are turning to computers to make the process more efficient and less expensive. “The interest in automated scoring of essays is not new and has recently received additional attention from two federally supported consortia, PARCC and Smarter Balanced, which intend to incorporate automated scoring into their common core state assessments planned for 2014”. The objective of the current research is to compare the two scoring systems and their advantages and disadvantages. To that end, it reviews the opinions of researchers and scholars who favor either option and the arguments they offer. Specifically, the current study investigates the usefulness and validity of automated essay scoring versus human scoring through an in-depth comparison of the pros and cons of each. According to Zhang (2013), the debate surrounding this comparison can also be found in academia, among the general public, and in the media, concerning the use of automated essay scoring in standardized tests as well as in electronic learning environments used inside and outside of classrooms.
One of the most important things for test developers, educators, and policymakers is to have adequate knowledge of the strengths and weaknesses of both scoring methods so that misuse can be prevented. From this perspective, the present work contrasts the distinguishing features of the two scoring methods by elucidating their differences and discussing their practical implications for testing programs (Hussain, 2013).
2. Human Essay Scoring Versus Automated Essay Scoring
2.1 Automated Scoring
Dikli (2006) points out, “Automated Essay Scoring (AES) is defined as the computer technology that evaluates and scores the written prose. AES systems are mainly used to overcome time, cost, reliability, and generalizability issues in writing assessment” (p. 3). Similarly, Williamson et al. (2010) argue that, now that technology is available everywhere, human scoring is no longer the only option for scoring constructed-response (CR) items. Years of research and practical experience have revealed a large number of challenges in human scoring. Supporting human scoring through recruiting, training, monitoring, and paying human graders involves considerable expense and logistical effort. According to Williamson et al. (2010), human essay scoring is also time-consuming and thus becomes cumbersome when a large number of essays must be scored. Another notable disadvantage of human essay scoring is its limited consistency and objectivity. Streeter, Bernstein, Foltz, and DeLand (2011) argue that automated essay scoring is consistent across location and time, which enables precise trend analysis because it provides comparable feedback for use at the classroom, school, district, or state level. The rapid growth of automated essay scoring is probably due to the system’s potential to produce scores more quickly and more reliably. Moreover, it is considerably less costly. Zhang (2013) likewise points out that the noticeable shortcomings of human essay scoring can be overcome by using automated essay scoring systems such as e-Rater, Grammarly, or any other available online program.
Today’s state-of-the-art systems use a construct-relevant combination of quantifiable text features to measure the quality of a written essay by computer. Nevertheless, an automated essay scoring system works solely with variables that can be extracted and combined mathematically. Furthermore, Zhang (2013) adds a point in favour of automated essay scoring by stating that such a system has the potential to assess essays across grade levels. Examples include IntelliMetric® of Vantage Learning, Intelligent Essay Assessor, and e-Rater. According to ETS Research (2017), the specific features of the e-Rater engine include mechanics, such as capitalization; usage, such as preposition selection; grammatical errors, such as subject-verb agreement; discourse structure, such as the thesis statement or the main points; style, such as word repetition; sentence variety; vocabulary usage, such as the relative complexity of vocabulary; discourse consistency quality; and source use. One of the new features included in the new version of e-Rater is the qualitative feedback received from the writing analysis tools of Criterion. There were around fifty features in the previous version of e-Rater. According to Burstein and Wolska (2003), the output of Criterion is based on feedback covering 33 errors in grammar, sentence structure and usage, and mechanics. This feedback also includes comments on writing style. All types of output are reflected in the four basic feedback features of the new version of e-Rater. These four features are the rates of errors in a four-category structure, in which the number of errors in each category is divided by the word count of the essay.
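The idea of variables that are extracted and then combined mathematically can be illustrated with a minimal sketch. It is not the e-Rater implementation: it assumes a simple weighted sum, and the feature names, the weights, and the 1 to 6 score scale are hypothetical values chosen only for illustration.

```python
from typing import Dict

# Hypothetical weights; a real engine would estimate them from human-scored essays.
WEIGHTS: Dict[str, float] = {
    "organization": 0.30,
    "development": 0.30,
    "vocabulary": 0.20,
    "grammar": 0.20,
}


def combine_features(features: Dict[str, float],
                     low: float = 1.0,
                     high: float = 6.0) -> float:
    """Combine feature values in [0, 1] into one score on the low-high scale."""
    weighted = sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    total_weight = sum(WEIGHTS.values())
    # Normalize by the total weight, then stretch onto the reporting scale.
    return low + (high - low) * weighted / total_weight


# Hypothetical feature values for one essay.
essay = {"organization": 0.8, "development": 0.6, "vocabulary": 0.5, "grammar": 0.9}
print(round(combine_features(essay), 2))
```

Because the score is a fixed function of the extracted values, the same essay always receives the same score, which is the consistency the literature attributes to automated scoring.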
In contrast, human graders are trained and thus more authentic than automated graders, as they focus on a specific grade range that is associated with a specific set of tasks and a specific rubric.
2.2 Human Scoring
Zhang (2013) argues that Human Raters gauge the quality of an essay with the help of a scoring rubric that identifies the characteristics an essay should have. In other words, the rubric describes what an essay must show to merit a certain score level. Although human essay scoring is time-consuming and requires a lot of effort, it has several strengths. For example, Human Raters process the information in the text cognitively and connect it with prior knowledge. Human scoring is also based on an understanding of the content, which is why Human Raters can make a judgment on the quality of the text. Zhang (2013) states, “Trained Human Raters are able to recognize and appreciate a writer’s creativity and style (e.g., artistic, ironic, rhetorical), as well as evaluate the relevance of an essay’s content to the prompt”. In the same manner, a Human Rater is able to evaluate an examinee’s critical thinking skills, including the factual accuracy of the claims and the quality of the argumentation presented in the essay. In light of the reviewed literature, it can be concluded that humans are able to make holistic judgments under the influence of numerous interacting factors. By contrast, a noteworthy strength of automated essay scoring is its efficiency and consistency. Automated essay scoring systems provide fine-tuned and instantaneous feedback that helps students practice and improve their writing skills. Since an automated essay scoring system is a computer application, it is not influenced by external factors such as deadlines, nor is it emotionally attached to any piece of work. According to ETS Research (2017), a computer-based application has no bias, preconceptions, or stereotypes.
From this perspective, automated essay scoring has the potential to achieve greater objectivity than human essay scoring. Dikli and Bleyle (2014) argue that the increased reliability of automated essay scoring is a significant cause of the growing demand for it. According to Toranj and Ansari (2012), proficiency in writing is one of the most important skills for students to develop, and only a well-trained language teacher can teach writing well. Toranj and Ansari (2012) also argue that automated essay scoring has played a noteworthy role in easing the burden on language teachers, as Human Scoring is time-consuming and thus prevents language teachers from completing many of their tasks on time. “Writing always needs some kinds of application of technology, whether pencil, typewriter, or printing press, and each innovation involves new skills applied in new ways” (p. 719). On the other hand, human scores are typically used as a development target as well as an evaluation criterion for automated scoring. From this perspective, it would not be wrong to state that Human Raters cannot offer analytical feedback, because it is difficult to produce feedback on such a large number of essays immediately. Nevertheless, with the aid of an automated essay scoring system, it is quite easy to evaluate essays across different grade levels.
Offering analytical feedback is nearly impossible for Human Raters, as it is difficult to produce feedback on such a large number of essays immediately. As mentioned previously, with the aid of an automated essay scoring system, it is quite easy to evaluate essays across grade levels, as IntelliMetric® of Vantage Learning, Intelligent Essay Assessor, and the e-Rater engine do. On the other hand, human graders are most often trained to give feedback on a specific grade range that is associated with a specific set of tasks and a specific rubric. Considering these points, it can be concluded that advances in artificial intelligence technologies have made the rapid scoring of large numbers of essays a realistic option. Therefore, an automated essay scoring system, when developed carefully, is capable of contributing to the efficient delivery of essay scores, and it may be an important aid in improving educational writing skills. Furthermore, automated essay scoring has played a noteworthy role in easing the burden on language teachers, as Human Scoring is time-consuming and thus prevents language teachers from completing many of their tasks on time.
3. Automated Essay Scoring With e-Rater
The term “Automated Essay Scoring” (AES) refers to a specialized computer tool used by teachers or tutors to make objective evaluations of work, assignments, research papers, or essays written in the English language. Smolentzov (2013) notes, “Essay scores may be used for very different purposes. In some situations, they are used to provide feedback for writing training in the classroom. In other situations, they are used as one criterion for passing/failing a course or admission to higher education” (p. 1). From this perspective, the two most commonly used types of test are low-stakes tests and high-stakes tests. The former are used for training purposes, while the latter are more commonly used to grade potential applicants for a particular course. In contrast to automated essay scoring, human scoring is based on an understanding of the content, which is the main reason why Human Raters can make a judgment on the quality of the text. School authorities in different countries use standardized tests in different subjects to evaluate students individually and collectively and so ensure the quality of education in various educational settings. One of the most common components of these tests is essay writing. In some countries, teachers evaluate the essays of their students and determine their scores accordingly, while in other countries, including Canada and the United States, teachers use two blind Raters to evaluate the essays of their students. These blind Raters take part in the scoring process of the standardized tests. When the two blind Raters agree on the score, the score is considered satisfactory. When they do not agree, a third Rater is used to resolve the disagreement.
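The two-Rater procedure described above can be sketched in a few lines. This is an illustrative sketch only: the text does not say how the third Rater resolves a disagreement, so the code assumes that the third Rater’s score is simply taken as final, and the function name `resolve_score` is hypothetical.

```python
from typing import Optional


def resolve_score(first: int, second: int, third: Optional[int] = None) -> int:
    """Return the final essay score from two blind ratings and an optional third."""
    if first == second:
        # Both blind Raters agree, so the shared score stands.
        return first
    if third is None:
        raise ValueError("Raters disagree; a third rating is required")
    # Assumption: the third Rater's score settles the disagreement.
    return third


print(resolve_score(4, 4))           # the two Raters agree
print(resolve_score(3, 5, third=4))  # disagreement resolved by a third Rater
```

Other resolution rules, such as averaging the third score with the closer of the first two, would fit the same structure; the source does not specify which rule is used.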
Murray and Orii (2012) point out that standardized tests should not rely on ‘manual effort’ to score students’ written work. This is because modern machine learning methods can not only save time and effort but also reduce the chance of errors. This paper describes e-Rater, a specialized computer tool used for Automated Essay Scoring. It compares two versions of the tool, the latest one, v.2.0, and v.1.3, in terms of the reliability and validity of the scores they produce.
3.2. e-Rater and Its Specific Features
Educational Testing Service (ETS), the largest educational assessment and testing organization, has been using the specialized computer tool “e-Rater” for the last two decades. Attali and Burstein (2004) state, “the operational system put in use for scoring the Graduate Management Admission Test® Analytical Writing Assessment (GMAT® AWA) and for essays submitted to ETS’s writing instruction application, Criterion℠ Online Essay Evaluation Service”. To evaluate students’ writing skills, ETS uses a web-based service that provides instant score reporting as well as diagnostic feedback. The engine is used within Criterion®, an Online Writing Evaluation Service. Criterion is an instructor-led, web-based tool that helps students plan, write, and (if required) revise their writing. The main purpose of the tool is to give students instant diagnostic feedback as well as the opportunity to practice their writing skills at their own pace. Students use the feedback of the e-Rater engine, within the Criterion application, to evaluate their writing skills and thus identify the areas they need to improve. One significant advantage of the e-Rater engine is that it enables students to develop their writing skills independently, as they receive automated and constructive feedback. According to ETS Research (2017), the specific features of the e-Rater engine include:
Mechanics such as capitalization
Usage, such as preposition selection
Grammatical errors such as subject-verb agreement
Discourse structure, such as the thesis statement or the main points
Style, such as word repetition
Relative complexity of vocabulary
As the name itself implies, the term Desk Research refers to the type of study in which the researcher acquires the findings, as well as other data, by sitting at a desk. Crouch and Housden (2012) point out: “Desk Research is so called because it refers to that type of research data that can be acquired and worked upon mainly by sitting at a desk. That is to say, it is research data that already exists, having been produced for some other purpose and by some other person or body. It is commonly referred to as secondary research because the user is the secondary user of the data” (p. 19).
From this perspective, it can be stated that Desk Research can be an effective starting point for any research program, as it usually offers quick and easy access to the required data. Crouch and Housden (2012) strongly believe that Desk Research is extremely beneficial in improving the research process and in generating ideas about the collection of primary data, even though this approach cannot produce complete answers. Moreover, because data collection in Desk Research is based on existing resources, the approach is considered low-cost compared with field research. The current study therefore uses Desk Research, collecting data from existing resources available in electronic libraries, including books, research articles, scholarly articles, and dissertations, to locate results reported for both versions of e-Rater.
3.4 New Feature Set
Building on the old version of e-Rater, the new features and tools were developed based on feedback received from the writing analysis tools of Criterion. The roughly fifty features in the previous version of e-Rater have been substantially reworked. According to Attali and Burstein (2004), the features included in the previous version of e-Rater implicitly measure the length of the submitted essay, i.e. its word count. The reports received on the former version were based on non-monotonic logic. Brewka, Niemela, and Truszczynski (2008) explain that reasoning in which additional information invalidates conclusions is known as non-monotonic reasoning. It has been a focus of the knowledge representation community since the early 1980s. This interest was fuelled by a large number of fundamental challenges in knowledge representation, for example reasoning about rules with defaults or exceptions, and modelling and solving the frame problem. By contrast, the features of the new version of e-Rater include a standardized check of the length of a submitted essay and an altered definition that takes account of the non-monotonic relationship with the human score. Other distinguishing features of the new version of e-Rater allow it to generate standardized scores. Some of these distinguishing features are described below:
3.4.1. Errors in Usage, Grammar, Style, and Mechanics
According to Burstein and Wolska (2003), the output of Criterion is based on feedback covering 33 errors in grammar, sentence structure and usage, and mechanics. This feedback also includes comments on writing style. All types of output are reflected in the four basic feedback features of the new version. These four features are the rates of errors in a four-category structure, in which the number of errors in each category is divided by the word count of the essay.
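The error-rate computation described above amounts to one division per category. The following is a minimal sketch, assuming that the four categories are the ones named in the heading of this subsection and that word count is obtained by splitting on whitespace; the function name and the example counts are hypothetical, and detecting the errors themselves is outside the scope of the sketch.

```python
from typing import Dict

# The four categories named in the subsection heading.
CATEGORIES = ("grammar", "usage", "mechanics", "style")


def error_rates(error_counts: Dict[str, int], essay_text: str) -> Dict[str, float]:
    """Divide the error count of each category by the essay's word count."""
    # Assumption: words are whitespace-separated tokens.
    word_count = len(essay_text.split())
    if word_count == 0:
        raise ValueError("essay is empty")
    return {cat: error_counts.get(cat, 0) / word_count for cat in CATEGORIES}


# Hypothetical example: a 200-word essay with a handful of flagged errors.
essay_text = " ".join(["word"] * 200)
counts = {"grammar": 4, "usage": 2, "mechanics": 6}
for category, rate in error_rates(counts, essay_text).items():
    print(f"{category}: {rate:.3f}")
```

Dividing by word count keeps a long essay from being penalized merely for offering more opportunities for error.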
3.4.2. Development and Organization
Burstein, Marcu, and Knight (2003) are of the opinion that the Criterion feedback application automatically identifies the sentences in a student’s essay that correspond to essay-discourse categories. To do so, the application uses natural language processing methods to identify categories such as the background of the topic, the thesis statement, the main ideas, and the conclusion. The overall score is calculated from the total of the individual items, including the scores for the thesis, the main points, the supporting ideas, and the elements used in the conclusion of the essay. Other features in the new version of e-Rater are Lexical Complexity, Usage of Prompt-Specific Vocabulary, and Essay Length. Attali and Burstein (2004) conclude, “e-Rater V.2.0 uses a small and fixed set of features that are also meaningfully related to human rubrics for scoring essays”. Their study shows that the new features integrated into the latest version can be used to generate automated essay scores that are standardized across different stimuli without loss of performance. This is because the features of the new version of e-Rater have higher agreement rates than those of the human scores. Quinlan, Higgins, and Wolff (2009), for their part, hold that e-Rater scoring contains a large set of measures with a proven ability to predict human holistic scores. Because the new version groups these measures into ‘features,’ large sets of measures can be aggregated into a small set of readily recognizable categories (see Appendix A). Similarly, Quinlan, Higgins, and Wolff (2009) conclude that the prediction of human scores can be considered one type of score validity, whereas construct validity should be considered separately. Essay length is one example: a predictor may have lesser or greater construct relevance in modeling human holistic scores.
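Agreement between human and automated scores can be quantified in several ways. As a hedged illustration, the sketch below computes two common measures, exact agreement and adjacent agreement (scores within one point of each other). These particular measures and the sample scores are assumptions made for illustration; the studies cited above may report different statistics.

```python
from typing import Sequence, Tuple


def agreement_rates(human: Sequence[int], machine: Sequence[int]) -> Tuple[float, float]:
    """Return (exact, adjacent) agreement between two lists of essay scores."""
    if len(human) != len(machine) or len(human) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)
    # Exact agreement: both scores are identical.
    exact = sum(h == m for h, m in zip(human, machine)) / n
    # Adjacent agreement: scores differ by at most one point.
    adjacent = sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / n
    return exact, adjacent


# Hypothetical scores for five essays on a 1-6 scale.
human_scores = [4, 3, 5, 2, 4]
machine_scores = [4, 4, 5, 4, 3]
exact, adjacent = agreement_rates(human_scores, machine_scores)
print(f"exact: {exact:.2f}, adjacent: {adjacent:.2f}")
```

High agreement of this kind shows only that the automated scores predict human scores; as the discussion of construct validity above notes, it does not by itself show that the features measure what the rubric is meant to capture.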
3.5. Conclusion
In light of the secondary data, it can be concluded that the new version of the Automated Essay Scoring tool, e-Rater, produces scores that are closer to human holistic scores than those of the older version. The main purpose of the tool is to give students instant diagnostic feedback as well as the opportunity to practice their writing skills at their own pace. Students may use the feedback of the e-Rater engine, within the Criterion application, to evaluate their writing skills and thus identify the areas they need to improve. Studies by various researchers conclude that the new features integrated into the new version of this Automated Essay Scoring tool can be used to generate automated essay scores that are standardized across different stimuli without loss of performance. In this way, the new version of e-Rater has played a remarkable role in easing the burden on language teachers.
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V. 2. The Journal of Technology, Learning and Assessment, 4(3).
Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27-40.
Burstein, J., Tetreault, J., & Madnani, N. (2013). The e-rater automated essay scoring system. Handbook of automated essay evaluation: Current applications and new directions, 55-67.
Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5(1).
Dikli, S., & Bleyle, S. (2014). Automated Essay Scoring feedback for second language writers: How does it compare to instructor feedback? Assessing Writing, 22, 1-17.
ETS Research. (2017). ETS Research: Automated Scoring of Writing Quality. Ets.org. Retrieved 18 March 2017, from https://www.ets.org/research/topics/as_nlp/writing_quality/
Roscoe, R. D., Crossley, S. A., Snow, E. L., Varner, L. K., & McNamara, D. S. (2014). Writing quality, knowledge, and comprehension correlates of human and automated essay scoring. In 27th International Florida Artificial Intelligence Research Society Conference, FLAIRS 2014. The AAAI Press.
Shermis, M. D., & Burstein, J. C. (Eds.). (2003). Automated essay scoring: A cross-disciplinary perspective. Routledge.
Streeter, L., Bernstein, J., Foltz, P., & DeLand, D. (2011). Pearson’s automated scoring of writing, speaking, and mathematics. Pearson.
Topol, B., Olson, J., & Roeber, E. (2010). The cost of new higher quality assessments: A comprehensive analysis of the potential costs for future state assessments. Stanford, CA: Stanford Center for Opportunity Policy in Education.
Toranj, S., & Ansari, D. N. (2012). Automated versus human essay scoring: A comparative study. Theory and Practice in Language Studies, 2(4), 719.
Williamson, D. M., Bennett, R. E., Lazer, S., Bernstein, J., Foltz, P. W., Landauer, T. K., ... & Sweeney, K. (2010). Automated scoring for the assessment of common core standards. White Paper.
Zhang, M. (2013). Contrasting automated and human scoring of essays. R & D Connections, 21, 2.
Attali, Y., & Burstein, J. (2004). Automated essay scoring with e‐Rater® v. 2.0. ETS Research Report Series, 2004(2).
Brewka, G., Niemela, I., & Truszczynski, M. (2008). Nonmonotonic reasoning. Foundations of Artificial Intelligence, 3, 239-284.
Burstein, J., & Wolska, M. (2003). Toward evaluation of writing style: Overly repetitious word use in student writing. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics. Budapest, Hungary.
Crouch, S., & Housden, M. (2012). Marketing research for managers. Routledge.
MSG Experts. (2017). Desk Research - Methodology and Techniques. Managementstudyguide.com. Retrieved 20 March 2017, from http://www.managementstudyguide.com/desk-research.htm
Murray, K. W., & Orii, N. (2012). Automatic Essay Scoring. Carnegie Mellon University.
Quinlan, T., Higgins, D., & Wolff, S. (2009). Evaluating the construct coverage of the e-rater® scoring engine. ETS Research Report Series, 2009(1).
Ramineni, C., & Williamson, D. M. (2013). Automated essay scoring: Psychometric guidelines and practices. Assessing Writing, 18(1), 25-39.
Smolentzov, A. (2013). Automated Essay Scoring: Scoring Essays in Swedish. Institutionen for lingvistik Examensarbete.
Weigle, S. C. (2013). English language learners and automated scoring of essays: Critical considerations. Assessing Writing, 18(1), 85-99.
International Journals of Modern Research & Development (IJMRND)