Ucam-cl-tr-790 issn 1476-2986 Computer Laborator

Download 0.52 Mb.
Size0.52 Mb.
1   2   3   4   5   6

:ht+ = i (mchc



+ (1 - mc)sc

F r eq ( c )

K) (4) where i is a cons tant impact rate and mdetermines how much the history of one word influences the his tory of another word – the more frequent a context word the less it will influence the history of the target word. The m weight of c decreases as follows :

= 1

exp (


) (5)where Km

is a parameter determining rate of de cay. ISA has the advantage that it is fully incremental, do es not rely on weighting schemes that require global c omputations over contexts, and is therefore efficient to compute . It extends RI by up dating the ve ctor for t with the signature and history of c so that se cond order effe cts of the context word’s distribution are f actore d into the repre sentation of the target word.

As well as exploring improved clustering techniques over LSA or RI such as ISA, b oth the weighting functions us ed for mo delling c o o ccurrence (e.g. Gorman and Curran, 2006), and the conte xts used to as sess co o ccurre nce (e.g. Baroni and Lenci, 2009), which has b ee n exclusively base d on an entire ess ay in automated ass essment work, should b e varied. For ins tance, the b est mo de ls of semantic s imilarity often me asure co o ccurrence of words in lo cal syntactic conte xts, such as those provided by the grammatic al relations output by a parser. Finally, though prompt-sp ec ific content analysis is clearly imp ortant f or assess ment of many es says typ es, it is not s o c lear that it is a c entral as p ect of E SOL assess ment, where demonstration of communicative comp etence and linguis tic varie ty without excessive errors is arguably more imp ortant than the sp ecific topic addressed.

2 .4 E val uat io n

The e valuation of automated asses sment systems has largely b ee n base d on analys es of corre lation with human marke rs . Typically, systems are traine d on premarked ess ays f or



a s p ecific exam and prompt and their output scaled and fitted to a partic ular grade p oint scheme using regression or e xp ert rubric- based weighting. Then the Pearson correlation co efficient is calculate d for a set of test e ssays for which one or more human gradings are available. Using this me asure, b oth e-Rater, I EA and other approaches discusse d ab ove have b een show n to corre late well w ith human grades. Of te n they c orrelate as we ll as the grade s ass igned by two or more human markers on the same essays. Additionally, the rates of exact re plication of human s cores , of deviations by one p oint, and so forth can b e calculated. These may b e more informative ab out causes of large r divergences given sp e cific phenomena in e ssays (e.g. Williamson, 2009; Coniam, 2009).

A weakness of the ab ove approach is that it is clear that it is re lative ly eas y to build a system that will correlate well with human markers unde r ideal conditions. Eve n the original PEG (Page, 1966) obtained high c orrelations using ve ry sup erficial textual features such as e ssay, word and s entenc e length. Howe ver, such feature s are easily ‘gamed’ by s tudents and by instructors ‘te aching to the exam’ (asses sment regime) once it is public knowle dge w hat feature s are e xtracted for automated as sessme nt. As automated assess ment is not based on a full understanding of an essay, the f eatures extracted are to some extent proxies f or such understanding. The degree to which such proxies can b e manipulated indep endently of the features that they are intended to measure is c learly an imp ortant factor in the analysis of systems, esp ecially if they are inte nded for use in high-stakes asse ssment. Powers et al (2002) conducted an exp e riment in which a varie ty of exp erts we re invited to design and submit e ssays that they b elieved would either b e under- or over- scored by e- Rater. The res ults showed that e- Rater was relative ly robust to such ‘gaming’, though those with intimate knowledge of e-Rate r were able to trick it into assigning score s de viating from human markers, even by 3 or more p oints on a 6- p oint scale.

A furthe r weakness of comparison w ith human markers, and inde ed with training such systems on raw human marks, is that human markers are relatively inc onsis te nt and show comparatively p o or corre lation w ith each other. Alternatives, have b een prop os ed such as training and/or testing on averaged or RASCH- corrected s cores (e .g. Coniam, 2009), or evaluating by correlating system grades on one task, such es say w riting, with human scores on an inde p endent task, such as sp oken comprehension (Attali and Burstein, 2006). Finally, many non-technical prof essionals involved in asses sment ob ject to automated assess ment, arguing, f or example , that a c omputer can neve r rec ognise creativity. In the end, this typ e of philosophical ob jection tends to dissipate as algorithms b ecome more effe ctive at any given task. For e xample, few argue that computers will never b e able to play chess prop erly now that ches s programs regularly de feat grand masters, though some will argue that prowess at ches s is not in fact a sign of ‘genuine intelligence’.

Neverthe less, it is clear that very thorough evaluation of asses sment systems will b e re quired b efore op erational, es p ecially high stake s, deployment and that this should include evaluation in adversarial sce narios and on unusual ‘outlier’ data, whether this b e highly creative or deviant. From this p ers p ective it is s urprising that Powe rs et al (2002) is the sole study of this kind, though b oth e- Rater and IE A are claimed to incorp orate mechanisms to flag such outliers for human marking.



3 AAET using Discriminative Pre fe re nce Ranking

One of the key weakness es of the text classification me tho ds deployed s o far for automated assess ment is that they are bas ed on non-discriminative machine learning mo dels.

Non-discriminative mo dels often e mb o dy incorre ct as sumptions ab out the underlying prop erties of the texts to b e classified – f or e xample, that the probability of each feature (e .g. word or ngram) in a text is indep endent of the others, in the cas e of the NB classifier (se e se ction 2.3). Such mo dels als o weight fe atures of the text in ways only lo os ely connected to the clas sification task – f or e xample, p ossibly smo othed class conditional maximum like liho o d estimates of fe atures in the cas e of the NB clas sifier (se e again sec tion 2.3).

In this work, we apply discriminative machine learning metho ds , s uch as mo dern variants of the Large Margin Pe rc eptron (Freund and Schapire, 1998) and the Supp ort Ve ctor Machine (SVM, Vapnik, 1995) to AAET. To our knowledge, this is the first such application to automated es say assess ment. Disc riminative c lass ifiers make weake r assumptions concerning the prop e rties of texts , dire ctly optimiz e clas sification p erformance on training data, and yie ld optimal pre dictions if training and test material is draw n from the same distribution (se e Collins (2002) for exte nde d theoretical dis cussion and pro ofs ).

In our des cription of the classifiers, we will use the following notation: N numb er of training samples

Dν avg. numb er of unique f eatures / training sample X ∈ P Rreal D -dimens ional sample s pace Y = { +1, -1} binary target lab el space xi∈ X vector repres enting the i th training sample yi∈ { +1, -1} binary c ategory indicator f or i th training sample f : X → Y classification function

3 .1 Supp or t Vect or Machine

Linear SVM s (Vapnik, 1995) le arn wide margin classifiers based on Struc tural Ris k M inimization and c ontinue to yield state- of- the-art results in text clas sification exp eriments (e.g. Le wis et al , 2004). In its dual form, linear SVM optimiz ation equates to minimizing the f ollow ing expres sion

:= 0 where the a ’s are the weight co efficients. The prediction is given by:if (x) = sig n (  aiyixi· x + b ) (7)

where b is the bias and sig n (r ) ∈ {- 1, +1} dep e nding on the sign of the input. The prac tic al use of the SVM mo del re lie s on efficient metho ds of finding approximate

solutions to the quadratic programming (QP) problem p osed by (6). A p opular s olution is


2 sub ject to the constraint  iaiiyiai



· xj


implemented in Joachims ’ SVMl ig h t2package (Joachims, 1999), in which the QP problem is dec omp osed into small constitue nt subproblems (the ‘working s et’) and solved se que ntially. This yie lds a training complexity at each iteration of O (q1 . 5· ν ) where q is the size of the working set. The effic iency of the pro cedure lie s in the fact that q   N . The numb er of iterations is governed by the choice of q which makes it difficult to plac e a theoretic al complexity b ound on the overall optimization pro cedure, but exp erimental analysis by Yang et al (2003) sugges ts a s up er-linear b ound of approximately O (N) with resp ec t to the numb er of training samples , though in our exp erience this is quite heavily dep endent on the se parability of the data and the value of the regularization hyp erparameter.

The p er s ample time complexity for prediction in the SVM mo del is O (M · ν ) whe re M is the numb er of categories, as a separate c lass ifie r must b e trained f or each c ategory.

3. 2 T im ed Agg re gate Per cept ro n

We now present a des cription of a novel variant of the batch p e rc eptron algorithm, the Timed Aggregate Perce ptron (TAP, M edlo ck, 2010). We will first intro duce the ideas b ehind our mo del and then provide a formal description.

The online p erceptron learning mo de l has b e en a mainstay of artificial intellige nce and machine le arning rese arch since its intro duction by Rosenblatt (1958). The basic principle is to iteratively up date a vector of weights in the sample space by adding s ome quantity in the direc tion of misclassifie d samples as they are identified. The Perceptron with Margins (PAM ) was intro duce d by Krauth and Mez ard (1987) and show n to yield b ette r ge neralisation p erformance than the basic p erce ptron. More recent developme nts include the Voted Perceptron (Freund and Schapire, 1998) and the Perceptron with Uneven Margins (PAUM), applied with some succes s to text categorization and inf ormation extrac tion (Li et al , 2005).

The mo del we present is base d on the batch training metho d (e.g. Bos and Opp er, 1998) where the weight vec tor is up date d in the direc tion of al l misclassified ins tances simultaneously. In our mo de l an aggregate vector is created at each iteration by summing all misc lass ified s amples and normalising according to a timing variable which controls b oth the magnitude of the aggre gate vec tor and the stopping p oint of the training pro ces s. The weight vector is the n augme nted in the direction of the aggregate vector and the pro cedure iterates. T he timing variable is re sp onsible for protection against overfitting; its value is initialised to 1, and gradually diminishes as training progresse s until reaching zero, at which p oint the pro cedure terminates.

Given a set of N data samples paired with target lab els ( xi, yDi) the TAP learning pro cedure returns an optimized weight vector ˆ w ∈ R. The predic tion for a ne w sample x ∈ RDis given by:

f (x) = sig n ( ˆ w · x) (8) where the sig n function converts an arbitrary real numb er to +/ - 1 base d on its sign.The default de cision b oundary lies along the unbiased hyp erplane ˆ w · x = 0, though a threshold can easily b e intro duced to adjust the bias

.At each iteration, an aggregate vec tor ˜at

is constructed by s umming all mis clas sified



samples and normalising: ˜at

xi= nor m (  ∈ Qt


, t ) (9)

nor m (a, t ) normalises a to magnitude t and Qtis the set of mis clas sified samples at iteration t, w ith the misclassification condition given by:


· xiyi

governe d by: tt

The clas s-normalise d empirical los s, Lt



= 1


- L

+ t+| N

+) N

t- 1

+ |Q| N-

> Lt


)< 1 (10) A margin of +/ - 1 p erp endicular to the decis ion b oundary is required for correct class ifi-cation of training sample s. The timing variable t is s et to 1 at the start of the pro cedure and gradually diminishes ,

= tt- 1


2  |Q

0 L)ß othe rwis e (11)

, f alls within the range (0, 1) and is define d as

:with N


t- 1


  (12)denoting the numb er of c lass +/-1 training sample s re sp ectively. ß is a measure of the balance of the training distribution siz es:

ß = min(N

, N


with an upp er b ound of 0.5 repres enting p erfe ct balance. Te rmination o cc urs when e ither t or the empirical los s reaches zero. How well the TAP solution fits the training data is governed by the rapidity of the timing schedule; earlier stopping leads to a more approximate fit.

In some cases , it may b e b e neficial to tune the rapidity of the timing schedule to achieve optimal p erformance on a sp ec ific problem, particularly when cross validation is feasible. In this instanc e we prop ose a mo dified version of express ion (11) that includes a timing rapidity hyp erparameter, r


> Lt


= tt- 1


r - 1 Lr t (Lt- Lt- 1t- 1

)ß othe rwis e (14)

Note that this expression is equivalent to (11) in the case that r = 1. An ove rview of the TAP le arning pro c edure is given in Algorithm 1. The timing mechanism us ed in our algorithm is motivated by the princ iple of early stop-

ping in p erceptron training (Bos and Opp er, 1998), w he re the pro cedure is halted b e fore reaching the p oint of minimum empirical los s. In our formulation, t also governs the length of the aggregate vector, which is analogous to the learning rate in the standard p erceptron. t is decreased only when the clas s-normalised empirical los s increas es. An increase in emprical loss is an indication either that the mo del is b e ginning to overfit or that the le arning rate is to o high, and a c onsequent decrease in t works to counte r b oth p ossibilities. T he sc ale of the decrease is governed by three heuristic factors

1   2   3   4   5   6

The database is protected by copyright ©sckool.org 2022
send message

    Main page