Lower-cased but not lemmatised lexical terms (i.e. unigrams) are extracted along with their frequency counts, as in a standard ‘bag-of-words’ model. These are supplemented by bigrams of adjacent lexical terms. Unigrams, bigrams and trigrams of adjacent sequences of PoS tags drawn from the RASP tagset and most likely output sequence are extracted along with their frequency counts. All instances of these feature types are included with their counts in the vectors representing the training data and also in the vectors extracted for unlabelled test instances.
Lexical term and ngram features are weighted by frequency counts from the training data and then scaled using tf · idf weighting (Sparck Jones, 1972) and normalised to unit length. Rule name counts, script length and error rate are linearly scaled so that their weights are of the same order of magnitude as the scaled term/ngram counts.
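The weighting just described can be illustrated with a minimal sketch. It assumes the common idf form log(N/df) and whitespace-tokenised input; the function names `ngram_counts` and `tfidf_unit_vectors` are hypothetical, and the exact tf · idf variant used in the system is not stated in this section.

```python
import math
from collections import Counter


def ngram_counts(tokens, n_max=2):
    """Count lower-cased unigrams and adjacent bigrams in one script."""
    toks = [t.lower() for t in tokens]
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(toks) - n + 1):
            counts[" ".join(toks[i:i + n])] += 1
    return counts


def tfidf_unit_vectors(scripts):
    """Weight counts by tf * idf, then normalise each vector to unit length.

    Assumption: idf = log(N / df), one of several standard variants.
    """
    counted = [ngram_counts(s) for s in scripts]
    n_docs = len(counted)
    doc_freq = Counter()
    for counts in counted:
        doc_freq.update(counts.keys())
    vectors = []
    for counts in counted:
        vec = {f: tf * math.log(n_docs / doc_freq[f]) for f, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({f: w / norm for f, w in vec.items()} if norm > 0 else vec)
    return vectors


# Two toy "scripts"; a term shared by every script receives zero weight.
vectors = tfidf_unit_vectors([
    "Then a thought".split(),
    "Then some though".split(),
])
```

Under this idf variant, a term that occurs in every script contributes nothing, so discriminative weight is concentrated on terms and bigrams that distinguish scripts from one another.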
Parse rule names are extracted from the phrase structure tree for the most likely analysis found by the RASP parser. For example, the following sentence from the training data, ‘Then some though occured to me.’, receives the analysis given in Figure 1, whilst the corrected version, ‘Then a thought occurred to me.’, receives the analysis given in Figure 2. In this representation, the nodes of the parse trees are decorated with one of about 1000 rule names, which are semi-automatically generated by the parser and which encode quite detailed information about the grammatical constructions found. However, in common with ngram features, these rule names are extracted as an unordered list from the analyses for all sentences in a given script along with their frequency counts. Each rule name
Figure 1: Then some though occured to me
Figure 2: Then a thought occurred to me
together with its frequency count is represented as a cell in the vector derived from a script. The script length in words is used as a feature less for its intrinsic informativeness than for the need to balance the effect of script length on other features. For example, error rates, ngram frequencies, etc. will tend to rise with the amount of text, but the overall quality of a script must be assessed as a ratio of the opportunities afforded for the occurrence of some feature to its actual occurrence.
The automatic identification of grammatical and lexical errors in text is far from trivial (Andersen, 2010). In the existing systems reviewed in section 2, a few specific types of well-known and relatively frequent errors, such as subject-verb agreement, are captured explicitly via manually-constructed error-specific feature extractors. Otherwise, errors are captured implicitly and indirectly, if at all, via unigram or other feature types. Our AAET system already improves on this approach because the RASP parser rule names explicitly represent marked, peripheral or rare constructions using the ‘-r’ suffix, as well as combinations of extragrammatical subsequences suffixed ‘frag’, as can be seen by comparing Figure 2 and Figure 1. These cues are automatically extracted without any need for error-specific rules or extractors and can capture many types of long-distance grammatical error. However, we also include a single numerical feature representing the overall error rate of the script. This is estimated by counting the number of unigrams, bigrams and trigrams of lexical terms in a script that do not occur in a very large ‘background’ ngram model for English which we have constructed from approximately 500 billion words of English sampled from the world wide web. We do this efficiently using a Bloom Filter (Bloom, 1970). We have also experimented with using frequency counts for smaller models and measures such as mutual information (e.g. Turney and Pantel, 2010). However, the most effective method we have found is to use simple presence/absence over a very large dataset of ngrams which, unlike, say, the Google ngram corpus (Franz and Brants, 2006), retains low frequency ngrams.
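The presence/absence test against a background ngram set can be sketched as follows. This is an illustration only: the filter size, the number of hash functions, the use of SHA-256 and the normalisation of the unseen count by the total number of ngrams are all assumptions, and `BloomFilter`, `ngrams` and `error_rate` are hypothetical names rather than parts of the system described here.

```python
import hashlib


class BloomFilter:
    """Compact set membership with no false negatives and rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(tokens, n_max=3):
    """All lower-cased unigrams, bigrams and trigrams of adjacent tokens."""
    toks = [t.lower() for t in tokens]
    return [" ".join(toks[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(toks) - n + 1)]


def error_rate(tokens, background):
    """Fraction of the script's ngrams that are absent from the background set."""
    grams = ngrams(tokens)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in background)
    return unseen / len(grams)


# Toy background model built from a single well-formed sentence.
background = BloomFilter()
for gram in ngrams("then a thought occurred to me".split()):
    background.add(gram)
```

Because a Bloom filter never reports a stored item as missing, any ngram it reports as absent is genuinely unseen in the background data; the occasional false positive can only lower the estimated error rate slightly.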
Although we have only described the feature types that we used in the experiments reported below, because they proved useful with respect to the competence level and text types investigated, it is likely that others made available by the RASP system, such as the connected, directed graph of grammatical relations over sentences, the degree of ambiguity within a sentence, the lemmas and/or morphological complexity of words, and so forth (see Briscoe 2006 for a fuller description of the range of feature types, in principle, made available by RASP), will be discriminative in other AAET scenarios. The system we have developed includes automated feature extractors for most types of feature made available through the various representations provided by RASP. This allows the rapid and largely automated discovery of an appropriate feature set for any given assessment task, using the experimental methodology exemplified in the next section.
4 The FCE Experiments
4.1 Data
For our experiments we made use of a set of transcribed handwritten scripts produced by candidates taking the First Certificate in English (FCE) examination written component. These were extracted from the Cambridge Learner Corpus (CLC) developed by Cambridge University Press. These scripts are linked to metadata giving details of the candidate, date of the exam, and so forth, as well as the final scores given for the two written questions attempted by candidates (see Hawkey, 2009 for details of the FCE). The marks assigned by the examiners are postprocessed to identify outliers, sometimes second marked, and the final scores are adjusted using RASCH analysis to improve consistency. In addition, the scripts in the CLC have been manually error-coded using a taxonomy of around 80 error types providing corrections for each error. The errors in the example from the previous section are coded in the following way:
where RD denotes a determiner replacement error, SX a spelling error, and IV a verb inflection error (see Nicholls 2003 for full details of the scheme). In our experiments, we used around three thousand scripts from examinations set between 1997 and 2004, each about 500 words in length. A sample script is provided in the appendix.
In order to obtain an upper bound on examiner agreement and also to provide a better benchmark to assess the performance of our AAET system compared to that of human examiners (as recommended by, for example, Attali and Bernstein, 2006), Cambridge ESOL arranged for four senior examiners to re-mark 100 FCE scripts drawn from the 2001 examinations in the CLC using the marking rubric from that year. We know, for example, from analysis of these marks and comparison to those in the CLC that the correlation between the human markers and the CLC scores is about 0.8 (Pearson) or 0.78 (Spearman’s Rank), thus establishing an upper bound for performance of any classifier trained on this data (see section 4.3 below).
4.2 Binary Classification
In our first experiment we trained five classifier models on 2973 FCE scripts drawn from the years 1999–2003. The aim was to apply well-known classification and evaluation techniques to explore the AAET task from a discriminative machine learning perspective and also to investigate the efficacy of individual feature types. We used the feature types described in section 3.4 with all the models and divided the training data into pass (mark above 23) and fail classes. Because there was a large skew in the training classes, with about 80% of the scripts falling into the pass class, we used the Break Even Precision (BEP) measure, defined as the point at which average precision=recall (e.g. Manning et al., 2008), to evaluate the performance of the models on this binary classification task. This measure favours a classifier which locates the decision boundary between the two classes in such a way that false positives/negatives are evenly distributed between the two classes.
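One common way to compute a break-even point is sketched below, under the assumption that scripts are ranked by classifier score and that the cutoff is placed after as many scripts as there are true passes. At that cutoff the number of false positives equals the number of false negatives, so precision and recall coincide. The function name and the toy scores are hypothetical, and the exact averaging used in the experiments is not specified in this section.

```python
def break_even_point(scores, labels):
    """Precision (= recall) when the top-k scripts are predicted 'pass',
    with k set to the number of scripts that truly pass.

    scores: classifier outputs, higher means more likely to pass.
    labels: True for pass, False for fail.
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("break-even point is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(1 for _, y in ranked[:n_pos] if y)
    # precision = true_pos / k and recall = true_pos / n_pos are equal when k == n_pos
    return true_pos / n_pos


# Hypothetical scores for five scripts, three of which pass.
bep = break_even_point([0.9, 0.8, 0.7, 0.6, 0.5],
                       [True, True, False, True, False])
```

With a skewed class distribution such as the 80% pass rate reported here, this measure is harder to inflate than plain accuracy, which a classifier could push to 80% simply by predicting pass for every script.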
The models trained were naive Bayes, Bayesian logistic regression, maximum entropy, SVM, and TAP. Consistent with much previous work on text classification tasks, we found that the TAP and SVM models performed best and did not yield significantly different results. For brevity, and because TAP is faster to train, we report results only for this model in what follows.
Figure 3 shows the contribution of feature types to the overall accuracy of the classifier. With unigram terms alone it is possible to achieve a BEP of 66.4%. The addition of bigrams of terms improves performance by 2.6% (representing about 19% relative error reduction (RER) on the upper bound of 80%). The addition of an error estimate feature based on the Google ngram corpus further improves performance by 2.9% (further RER about 21%). Addition of parse rule name features further improves performance by 1.5% (further RER about 11%). The remaining feature types in Table 1 contribute another 0.4% improvement (further RER about 3%).
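The RER figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that each RER is measured against the initial gap between the unigram baseline (66.4%) and the 80% upper bound, which is the reading that matches the quoted percentages; the constant and function names are hypothetical.

```python
UPPER_BOUND = 80.0    # upper bound on BEP given in the text
BASELINE_BEP = 66.4   # BEP with unigram terms alone

# Absolute BEP gains reported for each feature type, in order of addition.
GAINS = [
    ("bigrams of terms", 2.6),
    ("error estimate", 2.9),
    ("parse rule names", 1.5),
    ("remaining feature types", 0.4),
]


def relative_error_reduction(gain, baseline=BASELINE_BEP, upper=UPPER_BOUND):
    """Gain as a percentage of the initial gap between baseline and upper bound."""
    return 100.0 * gain / (upper - baseline)


running_bep = BASELINE_BEP
for name, gain in GAINS:
    running_bep += gain
    print(f"{name}: +{gain:.1f} -> BEP {running_bep:.1f}%, "
          f"RER about {relative_error_reduction(gain):.0f}%")
```

On this reading the full automatically derived feature set reaches a BEP of about 73.8%, leaving roughly 6.2 points between the system and the 80% upper bound.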
These results provide some support for the choice of feature types described in section 3.4. However, the final datapoint in the graph in Figure 3 shows that if we substitute the error rate predicted from the CLC manual error coding for our corpus derived estimate, then performance improves a further 2.9%, only 3.3% below the upper bound defined by the degree of agreement between human markers. This strongly suggests that the error
             CLC    Rater 1   Rater 2   Rater 3   Rater 4   Auto-mark
CLC           -      0.80      0.79      0.75      0.76      0.80
Rater 1      0.80     -        0.81      0.81      0.85      0.74
Rater 2      0.79    0.81       -        0.75      0.79      0.75
Rater 3      0.75    0.81      0.75       -        0.79      0.75
Rater 4      0.76    0.85      0.79      0.79       -        0.73
Auto-mark    0.80    0.74      0.75      0.75      0.73       -
Average      0.78    0.80      0.78      0.77      0.78      0.75
Table 3: Correlation (Spearman’s Rank)
These results suggest that the AAET system we have developed is able to achieve levels of correlation similar to those achieved by the human markers both with each other and with the RASCH-adjusted marks in the CLC. To give a more concrete idea of the actual marks assigned and their variation, we give marks assigned to a random sample of 10 scripts from the test data in Table 4 (fitted to the appropriate score range by simple linear regression).
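For readers unfamiliar with the two correlation measures used in these comparisons, the following self-contained sketch computes both. Spearman's rank correlation is implemented as the Pearson correlation of average ranks, which is the standard treatment of ties; the function names and the toy marks are hypothetical and are not taken from the experiments.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical marks from two markers for five scripts.
marker_a = [22, 30, 27, 35, 18]
marker_b = [24, 29, 29, 36, 20]
agreement = (pearson(marker_a, marker_b), spearman(marker_a, marker_b))
```

Pearson is sensitive to the actual mark values, whereas Spearman depends only on how the scripts are ordered, which is why the two measures are reported side by side.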
The training data we have used so far in our experiments is drawn from examinations both before and after the test data. In order to investigate both the effect of different amounts of training data and also the effect of training on scripts drawn from examinations at increasing temporal distance from the test data, we divided the data by year and trained and tested the correlation (Pearson) with the CLC marks. Figure 4 shows the results: clearly there is an effect of training data size, as no result is as good as those reported using the full dataset for training. However, there is also a strong effect for temporal distance between training and test data, reflecting the fact that both the type of prompts used to elicit text and the marking rubrics evolve over time (e.g. Hawkey, 2009; Cecil and Weir, 2007).
Figure 4: Training Data Effects
4.5 Error Estimation
In order to explore the effect of different datasets on the error prediction estimate, we have gathered a large corpus of English text from the web. Estimating error rate using a 2 billion word sample of text sampled from the UK domain retaining low frequency unigrams, bigrams, and trigrams, we were able to improve performance over estimation using the Google ngram corpus by 0.09% (Pearson) in experiments which were otherwise identical to those reported in section 4.3.
To date we have gathered about a trillion words of sequenced text from the web. We expect future experiments with error estimates based on larger samples of this corpus to improve on these results further. However, the results reported here demonstrate the viability of this approach, in combination with parser-based features which implicitly capture many types of longer distance grammatical error, compared to the more labour intensive one of manually coding feature extractors for known types of stereotypical learner error.
4.6 Incremental Semantic Analysis
Although the focus of our experiments has not been on content analysis (see section 2.3.3), we have undertaken some limited experiments to compare the performance of an AAET system based primarily on such techniques (such as PearsonKT’s IEA, see section 2) to that of the system presented here.
We used ISA (see section 2.3.3) to construct a system which, like IEA, uses similarity to an average vector constructed using ISA from high scoring FCE training scripts as the basis for assigning a mark. The cosine similarity scores were then fitted to the FCE scoring scheme. We trained on about a thousand scripts drawn from 1999 to 2004 and tested on the standard test set from 2001. Using this approach we were only able to obtain a correlation of 0.45 (Pearson) with the CLC scores and an average of 0.43 (Pearson) with the human examiners. This contrasts with scores of 0.47 (Pearson) and 0.45 (Pearson) obtained by training the TAP ranked preference classifier on a similar number of scripts and using only unigram term features.
These results, taken with those reported above, suggest that there isn’t a clear advantage to using techniques that cluster terms according to their context of occurrence, and compute text similarity on the basis of these clusters, over the text classification approach deployed here. Of course, this experiment does not demonstrate that clustering techniques cannot play a useful role in AAET; however, it does suggest that a straightforward application of latent or distributional semantic methods to AAET is not guaranteed to yield optimal results.
4.7 Off-Prompt Essay Detection
As discussed in section 2.4, one issue with the deployment of AAET for high stakes examinations or other ‘adversarial’ contexts is that a non-prompt specific approach to AAET is vulnerable to ‘gaming’ via submission of linguistically excellent rote-learned text regardless of the prompt. To detect such off-prompt text automatically does require content analysis of the type discussed in section 2.3.3 and explored in the previous section as an approach to grading.
Given that our approach to AAET is not prompt-specific in terms of training data, ideally we would like to be able to detect off-prompt scripts with a system that doesn’t require retraining for different prompts. We would like to train a system which is able to compare the question and answer script within a generic distributional semantic space. Because the prompts are typically quite short we cannot expect that in general there will be much direct overlap between contentful terms or lemmas in the prompt and those in the answer text.
We trained an ISA model using 10M words of diverse English text using a 250-word stop list and ISA parameters of 2000 dimensions, impact factor 0.0003, and decay constant 50 with a context window of 3 words. Each question and answer is represented by the sum of the history vectors corresponding to the terms they contain. We also included additional dimensions representing actual terms in the overall model of distributional semantic space to capture cases of literal overlap between terms in questions and in answers. The resulting vectors are then compared by calculating their cosine similarity. For comparison, we built a standard vector space model that measures semantic similarity using cosine distance between vectors of terms for question and answer via literal term overlap.
To test the performance of these two approaches to off-prompt essay detection, we extracted 109 passing FCE scripts from the CLC answering four different prompts:
1. During your holiday you made some new friends. Write a letter to them saying how you enjoyed the time spent with them and inviting them to visit you.
2. You have been asked to make a speech welcoming a well-known writer who has come to talk to your class about his/her work. Write what you say.
3. “Put that light out!” I shouted. Write a story which begins or ends with these words.
4. Many people think that the car is the greatest danger to human life today. What do you think?
Each system was used to assign each answer text to the most similar prompt. The accuracy (ratio of correct to all assignments) of the standard vector space model was 85%, whilst the augmented ISA model achieved 93%. This preliminary experiment suggests that a generic model for flagging putative off-prompt essays for manual checking could be constructed by manual selection of a set of prompts from past papers and the current paper and then flagging any answers that matched a past prompt better than the current prompt. There will be some false positives, but these initial results suggest that an augmented ISA model could perform well enough to be useful. Further experimentation on larger sets of generic training text and on optimal tuning of ISA parameters may also improve accuracy.
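The flagging procedure suggested above can be sketched using the simpler of the two models, the standard vector space comparison based on literal term overlap. The sketch does not implement ISA; the tokenisation, the absence of a stop list and the names `term_vector`, `cosine`, `best_prompt` and `is_off_prompt` are all assumptions made for illustration, and the two answer texts are invented.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector of lower-cased terms (no stop list in this sketch)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_prompt(answer, prompts):
    """Index of the prompt most similar to the answer (first index wins ties)."""
    answer_vec = term_vector(answer)
    sims = [cosine(answer_vec, term_vector(p)) for p in prompts]
    return max(range(len(prompts)), key=lambda i: sims[i])


def is_off_prompt(answer, current_prompt, past_prompts):
    """Flag an answer that matches some past prompt better than the current one."""
    return best_prompt(answer, [current_prompt] + list(past_prompts)) != 0


current = ("Many people think that the car is the greatest danger to human "
           "life today. What do you think?")
past = ["During your holiday you made some new friends. Write a letter to them "
        "saying how you enjoyed the time spent with them and inviting them to "
        "visit you."]

on_topic = "I think the car is a danger to human life because people drive too fast."
rote_learned = ("Dear friends, I enjoyed the time spent with you on holiday "
                "and I am inviting you to visit.")
flags = [is_off_prompt(on_topic, current, past),
         is_off_prompt(rote_learned, current, past)]
```

A literal-overlap model of this kind only works when answers reuse words from the prompt, which is the limitation the augmented ISA model is intended to address.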
In this report, we have introduced the discriminative TAP preference ranking model for AAET. We have demonstrated that this model can be coupled with the RASP text processing toolkit allowing fully automated extraction of a wide range of feature types, many of which we have shown experimentally are discriminative for AAET. We have also introduced a generic and fully automated approach to error estimation based on efficient matching of text sequences with a very large background ngram corpus derived from the web using a Bloom filter, and have shown experimentally that this is the single most discriminative feature in our AAET model. We have also shown experimentally that this model performs significantly better than an otherwise equivalent one based on classification as opposed to preference ranking. We have also shown experimentally that text classification is at least as effective for AAET as a model based on ISA, a recent and improved latent or distributional semantic content-based text similarity method akin to that used in IEA. However, ISA is useful for detecting off-prompt essays using a generic model of distributional semantic space that does not require retraining for new prompts.
Much further work remains to be done. We believe that the features assessed by our AAET model make subversion by students difficult as they more directly assess linguistic competence than previous approaches. However, it remains to test this experimentally. We have shown that error estimation against a background ngram corpus is highly informative, but our fully automated technique still lags error estimates based on the manual error coding of the CLC. Further experimentation with larger background corpora and weighting of ngrams on the basis of their frequency, pointwise mutual information, or similar measures may help close this gap. Our AAET model is not trained on prompt-specific data, which is operationally advantageous, but it does not include any mechanism for detecting text lacking overall inter-sentential coherence. We believe that ISA or other recent distributional semantic techniques provide a good basis for adding such features to the model and plan to test this experimentally. Finally, our current AAET system simply returns a score, though implicit in its computation is the identification of both negative and positive features that contribute to its calculation. We plan to explore methods for automatically providing feedback to students based on these features in order to facilitate deployment of the system for self-assessment and self-tutoring.
In the near future, we intend to release a public-domain training set of anonymised FCE scripts from the CLC together with an anonymised version of the test data described in section 4. We also intend to report the performance of preference ranking with the SVMlight package (Joachims, 1999) based on RASP-derived features, and error estimation using a public domain corpus trained and tested on this data and compared to the performance of our best TAP-based model. This will allow better replication of our results and facilitate further work on AAET.
The research and experiments reported here were partly funded through a contract to iLexIR Ltd from Cambridge ESOL, a division of Cambridge Assessment, which in turn is a subsidiary of the University of Cambridge. We are grateful to Cambridge University Press for permission to use the subset of the Cambridge Learner Corpus for these experiments. We are also grateful to Cambridge Assessment for arranging for the test scripts to be re-marked by four of their senior examiners to facilitate their evaluation.
References
Andersen, O. (2010) Grammatical error prediction, Cambridge University, Computer Laboratory, PhD Dissertation.
Baroni, M., and Lenci, I. (2009) ‘One distributional memory, many semantic spaces’, Proceedings of the Wkshp on Geometrical Models of Natural Language Semantics, Eur. Assoc. for Comp. Linguistics, pp. 1–8.
Bos, S. and Opper, M. (1998) ‘Dynamics of batch training in a perceptron’, J. Physics A: Math. & Gen., vol.31(21), 4835–4850.
Burstein, J. (2003) ‘The e-rater scoring engine: automated essay scoring with natural language processing’ in Shermis, M.D. and J. Burstein (eds.), Automated Essay Scoring: A cross-Disciplinary Perspective, Lawrence Erlbaum Associates Inc., pp. 113–122.
Burstein, J., Braden-Harder, L., Chodorow, M.S., Kaplan, B.A., Kukich, K., Lu, C., Rock, D.A., and Wolff, S. (2002) System and method for computer-based automatic essay scoring, US Patent 6,366,759, April 2.
Burstein, J., Higgins, D., Gentile, C., and Marcu, D. (2005) Method and system for determining text coherence, US Patent 2005/0143971 A1, June 30.
Collins, M. (2002) ‘Discriminative training methods for hidden Markov models: theory and experiments with Perceptron algorithms’, Proceedings of the Empirical Methods in Nat. Lg. Processing (EMNLP), Assoc. for Comp. Linguistics, pp. 1–8.
Coniam, D. (2009) ‘Experimenting with a computer essay-scoring program based on ESL student writing scripts’, ReCALL, vol.21(2), 259–279.
Dikli, S. (2006) ‘An overview of automated scoring of essays’, Journal of Technology, Learning and Assessment, vol.5(1).
Elliot, S. (2003) ‘IntelliMetric™: From Here to Validity’ in Shermis, M.D. and J. Burstein (eds.), Automated Essay Scoring: A cross-Disciplinary Perspective, Lawrence Erlbaum Associates Inc., pp. 71–86.
Foltz, P.W., Landauer, T.K., Laham, R.D., Kintsch, W., and Rehder, R.E. (2002) Methods for analysis and evaluation of the semantic content of a writing based on vector length, US Patent 6,356,864 B1, March 12.
Franz, A. and Brants, T. (2006) All our N-gram are Belong to You, http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html.
Freund, Y. and Schapire, R. (1998) ‘Large margin classification using the perceptron algorithm’, Computational Learning Theory, pp. 209–217.
Gorman, J. and Curran, J.R. (2006) ‘Random indexing using statistical weight functions’, Proceedings of the Conf. on Empirical Methods in Nat. Lg. Proc., Assoc. for Comp. Linguistics, pp. 457–464.
Hawkey, R. (2009) Examining FCE and CAE: Studies in Language Testing, 28, Cambridge University Press.
Joachims, T. (1998) ‘Text categorization with support vector machines: learning with many relevant features’, Proceedings of the Eur. Conf. on Mach. Learning, Springer-Verlag, pp. 137–142.
Joachims, T. (1999) ‘Making large-scale support vector machine learning practical’ in Scholkopf, S.B. and C. Burges (eds.), Advances in kernel methods, MIT Press.
Joachims, T. (2002) ‘Optimizing search engines using clickthrough data’, Proceedings of the SIGKDD, Assoc. Computing Machinery.
Kakkonen, T., Myller, N., Sutinen, E. (2006) ‘Applying Latent Dirichlet Allocation to automatic essay grading’, Proceedings of the FinTAL, Springer-Verlag, pp. 110–120.
Kanejiya, D., Kumar, A. and Prasad, S. (2003) ‘Automatic Evaluation of Students’ Answers using Syntactically Enhanced LSA’, Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, Assoc. for Comp. Linguistics.
Kanerva, P., Kristofersson, J., and Holst, A. (2000) ‘Random indexing of text samples for latent semantic analysis’, Proceedings of the 22nd Annual Conf. of the Cognitive Science Society, Cognitive Science Soc.
Krauth, W. and Mezard, M. (1987) ‘Learning algorithms with optimal stability in neural networks’, J. of Physics A: Math. Gen., vol.20.
Kukich, K. (2000) ‘Beyond automated essay scoring’ in Hearst, M. (ed.), The debate on automated essay grading, IEEE Intelligent Systems, pp. 27–31.
Landauer, T.K., Laham, D., and Foltz, P.W. (2000) ‘The Intelligent Essay Assessor’, IEEE Intelligent Systems, vol.15(5).
Landauer, T.K., Laham, D. and Foltz, P.W. (2003) ‘Automated scoring and annotation of essays with the Intelligent Essay Assessor’ in Shermis, M.D. and J. Burstein (eds.), Automated Essay Scoring: A cross-Disciplinary Perspective, Lawrence Erlbaum Associates Inc., pp. 87–112.
Larkey, L.S. (1998) ‘Automatic essay grading using text categorization techniques’, Proceedings of the 21st ACM-SIGIR, Assoc. for Computing Machinery.
Lewis, D.D., Yang, Y., Rose, T. and Li, T. (2004) ‘RCv1: A new benchmark collection for text categorization research’, J. Mach. Learning Res., vol.5, 361–397.
Li, Y., Bontcheva, K. and Cunningham, H. (2005) ‘Using uneven margins svm and perceptron for information extraction’, Proceedings of the 9th Conf. on Nat. Lg. Learning, Assoc. for Comp. Ling.
Lonsdale, D. and Strong-Krause, D. (2003) ‘Automated Rating of ESL Essays’, Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, Assoc. for Comp. Linguistics.
Manning, C., Raghavan, P., and Schutze, H. (2008) Introduction to Information Retrieval, Cambridge University Press.
Nicholls, D. (2003) ‘The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT’ in Archer, D., Rayson, P., Wilson, A. and McEnery, T. (eds.), Corpus Linguistics II, UCREL Technical Report 16, Lancaster University.
Page, E.B. (1966) ‘The imminence of grading essays by computer’, Phi Delta Kappan, vol.48, 238–243.
Page, E.B. (1994) ‘Computer grading of student prose, using modern concepts and software’, Journal of Experimental Education, vol.62(2), 127–142.
Powers, D.E., Burstein, J., Chodorow, M., Fowles, M.E., Kukich, K. (2002) ‘Stumping e-rater: challenging the validity of automated essay scoring’, Computers in Human Behavior, vol.18, 103–134.
Rosenblatt, F. (1958) ‘The perceptron: A probabilistic model for information storage and organization in the brain’, Psychological Review, vol.65.
Rosé, C.P., Roque, A., Bhembe, D. and VanLehn, K. (2003) ‘A Hybrid Text Classification Approach for Analysis of Student Essays’, Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, Assoc. for Comp. Linguistics.
Rudner, L.M. and Lang, T. (2002) ‘Automated essay scoring using Bayes’ theorem’, Journal of Technology, Learning and Assessment, vol.1(2).
Shaw, S. and Weir, C. (2007) Examining Writing in a Second Language, Studies in Language Testing 26, Cambridge University Press.
Sparck Jones, K. (1972) ‘A statistical interpretation of term specificity and its application in retrieval’, Journal of Documentation, vol.28(1), 11–21.
Sleator, D. and Temperley, D. (1993) ‘Parsing English with a Link Grammar’, Proceedings of the 3rd Int. Wkshp on Parsing Technologies, Assoc. for Comp. Ling.
Turney, P. and Pantel, P. (2010) ‘From frequency to meaning’, Jnl. of Artificial Intelligence Research, vol.37, 141–188.
Vapnik, V.N. (1995) The nature of statistical learning theory, Springer-Verlag.
Williamson, D.M. (2009) A framework for implementing automated scoring, Educational Testing Service, Technical Report.
Yang, Y., Zhang, J. and Kisiel, B. (2003) ‘A scalability analysis of classifiers in text categorization’, Proceedings of the 26th ACM-SIGIR, Assoc. for Computing Machinery, pp. 96–103.
Appendix: Sample Script
The following is a sample of an FCE script with error annotation drawn from the CLC and converted to XML. The full error annotation scheme is described in Nicholls (2003).
0100First Certificate in EnglishFCE 011German18M 134.2 Dear Mrs Ryan|,
Many thanks for your letter.
I would like to travel in July because I have got |my summer holidays from July to August and I work as a bank clerk in August. I think a tent would suit my personal life-style|lifestyle better than a log cabin because I love the nature.
I would like to play basketball during my holidays at Camp California because I love this game. I have been playing basketball for 8 years and today I am a member of an Austrian basketball-team| basketball team. But I have never played golf in my life but|though with your help I would be able to learn how to play golf and I think this could be very interesting.
I also would|would also like to know how much money I will get from you for those|these two weeks because I would like to spend some money for|on clothes.
I am looking forward to hearing from you soon.
430.0 Dear Kim
Last month I enjoyed helping at a pop concert and I think you want to hear some funny stories about the experience|experiences I made| had.
At first I had to clean the three private rooms of the stars. This was very boring but after I left the third room I met Brunner and Brunner. These two people are stars in our country... O.K. I am just kiding|kidding. I don’t like the songs of Brunner and Brunner|Brunner and Brunner’s songs because this kind of music is very boring.
I also had to clean the washing rooms|washrooms. I will never ever help anybody to organice| organise a pop concert |again.
But after this serville|servile work I met Eminem. I think you know his popular songs like "My Name Is". It was one of the greatest moments in my life. I had to bring| take him something to eat.
It was a hard but also afunny| fun work. You should try to called| call|get some experience during|at such a concert you|. You would not regret it.