Abstract In this report, we consider the task of automated assessment of English as a Second Language (ESOL) examination scripts written in response to prompts eliciting free text answers. We review and critically evaluate previous work on automated assessment for essays, especially when applied to ESOL text. We formally define the task as discriminative preference ranking and develop a new system trained and tested on a corpus of manually-graded scripts. We show experimentally that our best performing system is very close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally we argue that our approach, unlike extant solutions, is relatively prompt-insensitive and resistant to subversion, even when its operating principles are in the public domain. These properties make our approach significantly more viable for high-stakes assessment.
1 Introduction
The task of automated assessment of free text passages or essays is distinct from that of scoring short text or multiple choice answers to a series of very specific prompts. Nevertheless, since Page (1966) described the Project Essay Grade (PEG) program, this has been an active and fruitful area of research. Today there are at least 12 programs and associated products (Williamson, 2009), such as the Educational Testing Service's (ETS) e-Rater (Attali and Burstein, 2006), PearsonKT's KAT Engine / Intelligent Essay Assessor (IEA) (Landauer et al., 2003) or Vantage Learning's IntelliMetric (Elliot, 2003), which are deployed to assess essays as part of self-tutoring systems or as a component of examination marking (e.g. Kukich, 2000). Because of the broad potential application of automated assessment to essays, these systems focus as much on assessing the semantic relevance or ‘topicality’ of essays to a given prompt as on assessing the quality of the essay itself.
Many English as a Second Language (ESOL) examinations include free text essay-style answer components designed to evaluate candidates’ ability to write, with a focus on specific communicative goals. For example, a prompt might specify writing a letter to a friend describing a recent activity or writing an email to a prospective employer justifying a job application. The design, delivery, and marking of such examinations is the focus of considerable research into task validity for the specific skills and levels of attainment expected for a given qualification (e.g. Hawkey, 2009). The marking schemes for such writing tasks typically emphasise use of varied and effective language appropriate for the genre, exhibiting a range and complexity consonant with the level of attainment required by the examination (e.g. Shaw and Weir, 2007). Thus, the marking criteria are not primarily prompt or topic specific but linguistic. This makes automated assessment for ESOL text (hereafter AAET) a distinct subcase of the general problem of marking essays, which, we argue, in turn requires a distinct technical approach if optimal performance and effectiveness are to be achieved.
Nevertheless, extant general purpose systems, such as e-Rater and IEA, have been deployed in self-assessment or second marking roles for AAET. Furthermore, Edexcel, a division of Pearson, has recently announced that from autumn 2009 a revised version of its Pearson Test of English Academic (PTE Academic), a test aimed at ESOL speakers seeking entry to English speaking universities, will be entirely assessed using “Pearson’s proven automated scoring technologies”. This announcement from one of the major providers of such high stakes tests makes investigation of the viability and accuracy of automated assessment systems a research priority. In this report, we describe research undertaken in collaboration with Cambridge ESOL, a division of Cambridge Assessment, which is, in turn, a division of the University of Cambridge, to develop an accurate and viable approach to AAET and to assess the appropriateness of more general automated assessment techniques for this task.
Section 2 provides some technical details of extant systems and considers their likely efficacy for AAET. Section 3 describes and motivates the new model that we have developed for AAET based on the paradigm of discriminative preference ranking using machine learning over linguistically-motivated text features automatically extracted from scripts. Section 4 describes an experiment training and testing this classifier on samples of manually marked scripts from candidates for Cambridge ESOL’s First Certificate in English (FCE) examination and then comparing performance to human examiners and to our reimplementation of the key component of PearsonKT’s IEA. Section 5 discusses the implications of these experiments within the wider context of operational deployment of AAET. Finally, section 6 summarises our main conclusions and outlines areas of future research.
A full history of automated assessment is beyond the scope of this report. For recent reviews of work on automated essay or free-text assessment see Dikli (2006) and Williamson (2009). In this section, we focus on the ETS’s e-Rater and PearsonKT’s IEA systems as these are two of the three main systems which are operationally deployed. We do not consider IntelliMetric further as there is no precise and detailed technical description of this system in the public domain (Williamson, 2009). However, we do discuss a number of academic studies which assess and compare the performance of different techniques, as well as that of the public domain prototype system, BETSY (Rudner and Liang, 2002), which treats automated assessment as a Bayesian text classification problem, as this work sheds useful light on the potential of approaches other than those deployed by e-Rater and IEA.
2.1 e-Rater
e-Rater is extensively described in a number of publications and patents (e.g. Burstein, 2003; Attali and Burstein, 2006; Burstein et al., 2002, 2005). The most recently described version of e-Rater uses 10 broad feature types extracted from the text using NLP techniques; 8 represent writing quality and 2 content. These features correspond to high-level properties of a text, such as grammar, usage (errors), organisation or prompt/topic specific content. Each of these high-level features is broken down into a set of ground features; for instance, grammar is subdivided into features which count the number of auxiliary verbs, complement clauses, and so forth, in a text. These features are extracted from the essay using NLP tools which automatically assign part-of-speech tags to words and phrases, search for specific lexical items, and so forth. Many of the feature extractors are manually written and based on essay marking rubrics used as guides for human marking of essays for specific examinations. The resulting counts for each feature are associated with cells of a vector which encodes all the grammar features of a text. Similar vectors are constructed for the other high-level features.
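The counting step above can be sketched as follows. This is a minimal illustration, not e-Rater's actual implementation: the feature inventory (here, two hypothetical POS-tag-based ground features) and the assumption of pre-tagged input are ours; real ground features such as complement-clause counts would require a parser rather than tags alone.

```python
from collections import Counter

# Hypothetical ground features for the 'grammar' high-level type, identified
# here by POS tags (e.g. MD = modal, AUX = auxiliary). The real rubric-derived
# inventory is far larger and not limited to single tags.
GRAMMAR_FEATURES = ["MD", "AUX"]

def grammar_vector(tagged_tokens, features):
    """Build a count vector: one cell per ground feature of the 'grammar' type.

    tagged_tokens: list of (word, pos_tag) pairs produced by an upstream tagger.
    """
    tag_counts = Counter(tag for _word, tag in tagged_tokens)
    return [tag_counts[f] for f in features]
```

A text tagged as `[("can", "MD"), ("be", "AUX"), ("run", "VB"), ("will", "MD")]` would yield the vector `[2, 1]` for these two features; analogous vectors would be built for the other high-level feature types.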
The feature extraction system outlined above, and described in more detail in the references provided, allows any text to be represented as a set of vectors, each representing a set of features of a given high-level type. Each feature in each vector is weighted using a variety of techniques drawn from the fields of information retrieval (IR) and machine learning (ML). For instance, content-based analysis of an essay is based on vectors of individual word frequency counts drawn from text. Attali and Burstein (2006) transform frequency counts to weights by normalising the word counts to that of the most frequent word in a training set of manually-marked essays written in response to the same prompt, scored on a 6 point scale. Specifically, they remove stop words which are expected to occur with about equal frequency in all texts (such as the), then for each of the score points, the weight for word i at point p is
w_{i,p} = (F_{i,p} / MaxF_p) * log(N / N_i), where F_{i,p} is the frequency of word i at score point p, MaxF_p is the maximum frequency of any word at score point p, N is the total number of essays in the training set, and N_i is the total number of essays having word i in all score points in the training set. For automated assessment of the content of an unmarked essay, this weighted vector is computed by dropping the conditioning on p and the result is compared to aggregated vectors for the marked training essays in each class using cosine similarity. The unmarked essay is assigned a content score corresponding to the most similar class. This approach transforms an unsupervised weighting technique, which only requires an unannotated collection of essays or documents, into a supervised one which requires a set of manually-marked prompt-specific essays.
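The weighting and nearest-class scoring steps can be sketched as follows. This is our own illustrative rendering under stated assumptions, not the cited implementation: the function names and the dense-vector representation are ours, and a production system would use sparse vectors over a large vocabulary.

```python
import math

def content_weight(F_ip, maxF_p, N, N_i):
    """Weight of word i at score point p: its frequency normalised by the most
    frequent word at that point, scaled by an inverse-document-frequency term."""
    return (F_ip / maxF_p) * math.log(N / N_i)

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_essay(essay_vec, class_vecs):
    """Assign the score point whose aggregated training vector is most similar
    to the unmarked essay's vector."""
    return max(class_vecs, key=lambda p: cosine(essay_vec, class_vecs[p]))
```

For example, a word occurring 5 times where the most frequent word at that score point occurs 10 times, in a training set of 100 essays of which 10 contain the word, receives weight (5/10) * log(100/10).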
Other vectors are weighted in different ways depending on the type of features extracted. Counts of grammatical, usage and style features are smoothed by adding 1 to all counts (avoiding zero counts for any feature), then divided by essay length word count to normalise for different essay lengths, then transformed to logs of counts to avoid skewing results on the basis of abnormally high counts for a given feature. Rhetorical organisation is computed by random indexing (Kanerva et al., 2000), a modification of latent semantic indexing (see section 2.2), which constructs word vectors based on cooccurrence in texts. Words can be weighted using a wide variety of weight functions (Gorman and Curran, 2006). Burstein et al. (2005) describe an approach which calculates mean vectors for words from training essays which have been manually marked and segmented into passages performing different rhetorical functions. Mean vectors for each score point and passage type are normalised to unit length and transformed so they lie on the origin of a graph of the transformed geometric space. This controls for differing passage lengths and incorporates inverse document frequency into the word weights. The resulting passage vectors can now be used to compare the similarity of passages within and across essays, and, as above, to score essays for organisation via similarity to mean vectors for manually-marked training passages.
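The three-step count transformation (add-one smoothing, length normalisation, log transform) and the unit-length normalisation of passage vectors can be sketched as follows; these are straightforward renderings of the operations described above, with function names of our own choosing.

```python
import math

def transform_counts(counts, essay_length):
    """Add-one smooth, divide by essay word count, then log-transform, as
    described for grammatical, usage and style feature counts."""
    return [math.log((c + 1) / essay_length) for c in counts]

def unit_normalize(vec):
    """Scale a vector to unit length, controlling for differing passage lengths."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec
```

The log transform compresses abnormally high counts: in a 100-word essay, raising a feature count from 0 to 4 moves its transformed value by log(5) rather than by a factor of five.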