Crossley, S. A., & McNamara, D. S. (2014). Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners. Journal of Second Language Writing, 26 (4), 66-79.
Title: Does writing development equal writing quality? A computational investigation of syntactic complexity in L2 learners.
Abstract: This study examines second language (L2) syntactic development in conjunction with the effects such development has on human judgments of writing quality (i.e., judgments of both overall writing proficiency and more fine-grained judgments of syntactic proficiency). Essays collected from 57 L2 learners in a longitudinal study were analyzed for growth and scoring patterns using syntactic complexity indices calculated by the computational tool Coh-Metrix. The analyses demonstrate that significant growth in syntactic complexity occurred in the L2 writers as a function of time spent studying English. However, only one of the syntactic features that demonstrated growth in the L2 learners was also predictive of human judgments of L2 writing quality. Interpretation of the findings suggests that over the course of a semester, L2 writers produced texts that were increasingly aligned with academic writing (i.e., texts that contain more nouns and phrasal complexity), but that human raters assessed text quality based on structures aligned with spoken discourse (i.e., clausal complexity). Thus, this study finds that the syntactic features that develop in L2 learners may not be the same syntactic features that will assist them in receiving higher evaluations of essay quality.
Syntactic development is an important component of second language (L2) acquisition and one that has received considerable attention in previous research (Hawkins, 2001; Lu, 2010) in both longitudinal and cross-sectional studies. Researchers have focused on L2 syntactic development under the notion that the ability to arrange words syntactically into phrases and phrases into clauses demonstrates the capacity to manipulate a language’s combinatorial properties, which is argued to be a strong indicator of general language acquisition. One of the primary questions addressed by syntactic research is how syntactic knowledge develops over time and, more specifically, what syntactic features develop early and which develop later for L2 learners (Hawkins, 2001). Examinations into the development of syntactic features often focus on the variation and sophistication of the phrases and clauses produced by L2 learners. The basic premise underlying such examinations is that syntactic complexity can be used to directly measure L2 learner proficiency (Foster & Skehan, 1996; Lu, 2011; Ortega, 2003; Wolfe-Quintero, Inagaki, & Kim, 1998).
While a number of studies have examined longitudinal growth in L2 learners using both spoken and written corpora, few studies have examined L2 syntactic development in conjunction with the relationships such developments have with human judgments of writing quality (both judgments of overall writing proficiency and more fine-grained judgments of syntactic proficiency). That is to say, while past research has focused on L2 learner development, it has rarely linked the effects of such development to assessments of language proficiency. However, such an approach is important because it can afford an opportunity to examine not only syntactic growth, but also the relations of such growth with the judgments of expert raters. To address this research gap, this study examines L2 writing samples using computational indices of syntactic complexity to understand how syntactic complexity changes over time in L2 writers (i.e., longitudinal growth) and to understand how changes in syntactic complexity are related to human ratings of language use in L2 writing.
As mentioned earlier, syntactic complexity refers to the sophistication of syntactic forms produced by a speaker or writer and the range or variety of syntactic forms produced (Lu, 2011; Ortega, 2003). Analysis of L2 output in terms of its syntactic complexity is a common means to investigate L2 growth because language development in L2 learners is argued to entail the acquisition and production of less frequent syntactic features along with the use of a greater variety of syntactic features. Many features related to syntactic complexity are relatively easy to investigate using both hand-coding and automated coding of texts, which allows for the sampling of a variety, but by no means all, of the available syntactic features.
The traditional method of measuring syntactic complexity is with T-units (Biber, Gray, & Poonpon, 2011), which can be defined as the shortest allowable grammatical units that can be punctuated at the sentence level (i.e., the main clause plus additional, embedded subordinated clauses; Street, 1971 as cited in Larsen-Freeman, 1978, p. 441). T-units were initially used to assess writing development in first language (L1) writers (Hunt, 1965) and were later adopted for use by the L2 research community (Casanave, 1994; Henry, 1996; Lu, 2011; Ortega, 2003; Stockwell & Harrington, 2003). The use of T-units as measures of syntactic complexity for L2 learners has provided mixed results, with some studies demonstrating no links between classic T-unit measures such as mean length of T-unit and measures of L2 syntactic growth (Bardovi-Harlig, 1992; Casanave, 1994; Ishikawa, 1995) and other studies finding strong links (Ortega, 2003; Stockwell & Harrington, 2003).
The most promising T-unit indices are error-free T-units (Larsen-Freeman, 1978), but such indices are not strictly syntactic, focusing more on accuracy than on syntactic complexity. Additionally, such indices are difficult, if not impossible, to implement computationally and require expert hand coding, which is prone to subjectivity and error. The use of T-units to investigate L2 writing has also been called into question recently by Biber et al. (2011), who found that the clausal subordination measured by T-unit indices is more common in conversation, whereas academic writing is characterized syntactically by the use of noun phrase constituents and complex phrases.
Other measures of syntactic complexity that are not specifically based on T-units but are commonly used in L2 writing studies include indices that measure the length of syntactic structures, the types and incidence of embeddings, the types and number of coordinations between clauses, the range and types of phrasal units produced, and the frequency of clauses and phrases used (Ortega, 2003). Such indices can be accessed in computational tools such as the Biber tagger (Biber, 1988) and Coh-Metrix (Graesser, McNamara, Louwerse, & Cai, 2004; McNamara & Louwerse, 2012; McNamara, Graesser, McCarthy, & Cai, 2014).
Syntactic development in L2 learners
Previous research into L2 syntactic acquisition has focused on syntactic development in both spoken and written L2 language samples and has demonstrated that L2 learners follow general patterns of syntactic development that occur in identifiable stages. For instance, English speakers learning French must acquire the rule that direct and indirect object pronouns come before the verb (as compared to after the verb in English). When learning such a rule, L2 learners generally first produce postverbal pronouns, followed by preverbal pronouns. However, when preverbal pronouns do occur, they compete with omitted objects (Selinker, Swain, & Dumas, 1975; White, 1996). L2 learners of English also generally follow the accessibility hierarchy with respect to the acquisition of relative clauses (Gass, 1979) in which L2 learners first acquire subject relative clauses followed by direct-object, indirect-object, and object-of-a-preposition relative clauses. Other syntactic patterns demonstrated by L2 learners include the development of question formations (from wh-fronting, to auxiliary verb before the subject, to the subject verb inversion found in yes/no questions; Eckman, Moravcsik, & Wirth, 1989) and negation formations (from no, to don’t, to not, to auxiliary verbs plus not; Schumann, 1979).
Patterns in syntactic development have also been noted in numerous longitudinal studies of L2 writing (e.g., Casanave, 1994; Ishikawa, 1995; Stockwell & Harrington, 2003). Casanave (1994) examined growth in syntactic complexity by examining the journal writing of intermediate Japanese English learners over the course of three semesters of instruction. Casanave found that as L2 learners developed over time, they began to produce longer and more complex syntactic clauses (as measured by T-unit indices) that were also more accurate. Ishikawa (1995) examined two groups of low proficiency L2 English learners at the beginning and at the end of a semester of instruction. Ishikawa found that two accuracy indices (total words in error-free clauses and error-free clauses per composition) best discriminated between writings produced at the beginning of the semester and end of the semester. Lastly, Stockwell and Harrington (2003) investigated L2 syntactic growth in e-mail exchanges over a five-week period. Syntactic complexity was measured using T-unit indices and human judgments of quality (but links were not made between the two). Stockwell and Harrington found that L2 learners showed differences in the average number of words per T-unit, the average number of words per error-free T-unit, and the percentage of error-free T-units as a function of time spent writing. They also reported that human ratings of syntactic complexity increased over the same five-week period.
Another approach to investigating syntactic development in L2 learners is through cross-sectional studies, which can be used to investigate differences between proficiency levels in L2 writers (e.g., Ferris, 1994; Larsen-Freeman, 1978; Lu, 2011; Ortega, 2003). Larsen-Freeman (1978) used T-unit indices to discriminate between essays based on the placement levels of L2 learners (212 learners placed into five proficiency levels). The results demonstrated that the percentage of error-free T-units and the average length of error-free T-units were the best discriminators of proficiency. Ferris (1994) examined essays written by 160 L2 learners that were divided into high and low proficiency groups. Using a variety of lexical and syntactic indices, Ferris found that high proficiency L2 writers differed from low proficiency L2 writers in their more frequent production of passives, nominalizations, conjuncts, and prepositions (see Connor, 1990 for similar findings). More proficient L2 writers also produced a greater number of relative and adverbial clauses. Ortega (2003), in a synthesis study, found that length and T-unit syntactic indices such as mean length of sentence, mean length of T-unit, mean length of clause, and clauses per T-unit were reliable indicators of proficiency level differences for L2 writers. More recently, Lu (2011) investigated the performance of 14 T-unit indices to distinguish between grade levels for essays written by university-level L2 learners. Lu found that 10 of the 14 indices discriminated between grade levels, but only seven of the 10 indices progressed linearly across proficiency levels. These indices included three indices of length production, two indices of complex nominals, and two indices of coordinated phrases.
Syntactic features and human judgments of writing quality
Another approach to assessing writing development is to examine how linguistic features in a text can predict human ratings of essay quality. Such an approach is built on the notion that syntactic features of texts are prime indicators of syntactic development because the presence of more sophisticated syntactic features will lead to higher ratings of essay quality.
Such predictions have been borne out in studies of both L2 and L1 writing. For instance, studies have indicated that higher rated L2 essays contain greater subordination (Grant & Ginther, 2000), use of passive voice (Connor, 1990; Ferris, 1994; Grant & Ginther, 2000), and instances of prepositions (Connor, 1990), while containing fewer present tense forms (Reppen, 1994) and base verb forms (Crossley & McNamara, 2012). Similar findings have been reported in L1 studies of writing quality, with higher quality L1 essays containing greater syntactic complexity (as measured by the number of words before the main verb; McNamara, Crossley, & McCarthy, 2010) and a greater incidence of verb base forms (Crossley, Roscoe, McNamara, & Graesser, 2011).
The purpose of this study is to assess syntactic development in L2 writers as a function of time spent in a writing course. To this end, we use a number of automated syntactic complexity indices to assess syntactic differences in descriptive essays written by L2 learners at the beginning and at the end of a semester-long writing course. We complement this analysis by assessing how well the same syntactic indices are able to predict the variance in human ratings of essay quality for essays written throughout the course. In doing so, we address two key questions: 1) Do L2 writers demonstrate syntactic development over the course of a semester (i.e., longitudinal growth)? and 2) Does this growth correspond to syntactic features that predict human ratings of writing proficiency?
The data for this analysis were collected from 70 university-aged L2 writers at Michigan State University during a single semester of instruction in an intensive writing class. The participants were from the two highest levels of a university ESL program and from one level of an English for Academic Purposes (EAP) program (see Connor-Linton & Polio, 2014, this volume, for additional information about the dataset used in this study). From this dataset, we selected writing samples from the 57 participants who completed all three writing assignments collected at the beginning, middle, and end of the semester. These essays were timed descriptive essays written in 30 minutes. The essays averaged 335.4 words (SD = 97.5) and 5.4 paragraphs (SD = 4.165) in length. Prior to analysis, the corpus was cleaned to eliminate formatting and spelling errors.
Two expert raters assessed the quality of each essay using a composition grading scale that required the raters to rate each essay on five different analytical features: content, organization, vocabulary, language use, and mechanics (see Connor-Linton & Polio, 2014, this volume, for additional information about the grading scale). These analytic ratings were combined into an overall rating for each essay. Of interest for this study is the combined rating for each essay and the Language Use rating, which includes assessments of syntactic properties. Briefly, the Language Use rating equates higher writing proficiency with no errors that interfere with comprehension, few morphological errors, no major errors in word or structure, the use of more complex sentences, and excellent sentence variety. The latter three properties are strongly related to syntactic complexity while the former two are linked to syntactic complexity, but are not exclusively syntactic (i.e., they also have links to grammar, morphology, and the lexicon). Interrater reliability between the two raters for the essays written by the 57 participants in this study was strong: r = .767 for Language Use ratings and r = .880 for overall ratings. These two ratings also demonstrated strong multicollinearity, r = .914.
Selected syntactic indices
We selected syntactic indices from Coh-Metrix, an advanced computational tool that measures cohesion and linguistic sophistication at various levels of language, discourse and conceptual analysis (Graesser et al., 2004; McNamara & Graesser, 2012; McNamara et al., 2014). To ensure we were assessing syntactic complexity, we selected only those syntactic indices that measure clausal and phrasal level syntactic features. These indices include incidence counts taken from the Charniak (2000) part of speech tagger/parser (normed for text length) in addition to ratio scores, raw scores, and length counts.
We used an automatic approach to assessing syntactic complexity because it affords speed, flexibility, and reliability. In addition, human raters are prone to subjectivity and require training, time to score, and monitoring, all of which consume resources (Higgins, Xi, Zechner, & Williamson, 2011). However, one potential problem with using a parser to investigate syntactic complexity is accuracy. For texts written by L1 speakers, the Charniak parser reports an average accuracy of 89% for expository and narrative texts (with greater accuracy reported for narrative texts; Hempelmann, Rus, Graesser, & McNamara, 2006). No studies that we know of have investigated the accuracy of the Charniak parser or similar parsers for L2 writing, but it can be presumed that the accuracy would decrease. Thus, a question remains about the degree to which parser accuracy is affected by L2 writing and, more importantly, how this accuracy compares with hand-coded ratings of syntactic complexity, which are also subject to accuracy limitations.
In total, we selected 11 Coh-Metrix indices that measure clausal and phrasal features of language. These indices include measurements of syntactic variety, syntactic transformations (e.g., negations and questions), syntactic embeddings, incidence of phrase types, and phrase length. Each index in Coh-Metrix is computed using the output produced by the Charniak (2000) parser for both lexical (i.e., part of speech tags) and syntactic categories (phrasal and clausal components). The indices selected are similar to those of Bulté and Housen (this issue), but are calculated automatically rather than manually. The selected indices also target clausal, phrasal, and sentential elements, while Bulté and Housen’s indices focus more on sentential elements. The selected indices are discussed in greater detail in the following section.
Sentence variety. Sentence variety is assessed in Coh-Metrix by measuring the consistency and uniformity of the clausal, phrasal, and part of speech (POS) constructions located in the text (i.e., syntactic similarity). The syntactic similarity indices in Coh-Metrix assess syntactic similarity by comparing adjacent sentences for similar clausal, phrasal, and POS constructions. More uniform syntactic constructions result in less complex syntax that is easier for the reader to process (Crossley, Greenfield, & McNamara, 2008). However, less syntactic similarity is a hallmark of advanced writers (Crossley, Weston, McClain-Sullivan, & McNamara, 2011).
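As a simplified illustration (not Coh-Metrix's actual algorithm, which compares clausal, phrasal, and POS constructions across parse trees), adjacent-sentence similarity can be sketched by measuring the overlap between the POS tags of neighboring sentences; the tag sequences below are invented:

```python
def pos_overlap(tags_a, tags_b):
    """Proportion of shared POS tags between two sentences (Jaccard overlap).
    A crude stand-in for Coh-Metrix's parse-tree comparison."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b)

# Hypothetical POS sequences for three adjacent sentences
sents = [["DT", "NN", "VBZ", "DT", "NN"],
         ["DT", "NN", "VBZ", "JJ"],
         ["PRP", "MD", "VB", "IN", "DT", "NN"]]

# Mean similarity across adjacent sentence pairs: lower values
# indicate greater syntactic variety across the text
sims = [pos_overlap(sents[i], sents[i + 1]) for i in range(len(sents) - 1)]
print(sum(sims) / len(sims))
```

Under this toy measure, a text that repeats the same construction sentence after sentence scores near 1, while a text with varied constructions scores near 0.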
Syntactic transformations. Coh-Metrix measures a number of syntactic elements related to syntactic transformations. These include negations and wh-questions. Such transformations represent a syntactic complexity beyond the use of simple declarative sentences. These indices are computed using normalized incidences of occurrences.
Syntactic embeddings. Coh-Metrix also reports syntactic embeddings as computed by the Charniak parser. The embeddings reported by Coh-Metrix are in the form of normalized incidence counts and include counts for all clauses (including matrix clauses, coordinated clauses, and embedded clauses), infinitive clauses, S-bar counts (i.e., embedded sentences that can be marked with complementizers such as that, for, who, and when, prepositions such as after and before, conditionals such as if and then, and subordinating conjunctions such as because or however), ‘that’ verb complements, and relative clauses. Such embeddings generally indicate greater syntactic complexity.
Phrase types. Coh-Metrix computes incidence scores for a variety of phrase types. These phrase types include noun phrases (NP: related to density of propositions), verb phrases (VP: related to the number of clauses in a sentence), and preposition phrases (PP: related to the number of phrases that provide adjectival and adverbial information). In the sentence The boy eats the pepperoni pizza under the tree, the phrasal count for NP would be two (i.e., The boy and the pepperoni pizza), the phrasal count for VP would be one (i.e., eats the pepperoni pizza under the tree), and the phrasal count for PP would be one (i.e., under the tree). An example of multiple clauses in one sentence is found in the example She sees that the boy is eating pepperoni pizza under the tree in which the phrasal count for NP would increase by one (i.e., with the inclusion of she) and the phrasal count for VP would increase by one (i.e., with the inclusion of sees that…).
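The phrase counting described above can be sketched with a toy tally of NP, VP, and PP labels in a Penn Treebank-style bracketed parse. Note that this sketch counts every phrase node, so the NP nested inside the PP also counts; Coh-Metrix's own counting conventions and text-length normalization may differ:

```python
import re
from collections import Counter

def phrase_counts(parse: str) -> Counter:
    """Tally phrase-type labels (NP, VP, PP) in a bracketed parse string."""
    labels = re.findall(r"\((NP|VP|PP)[ (]", parse)
    return Counter(labels)

# Hand-written parse of the example sentence from the text
parse = ("(S (NP (DT The) (NN boy)) "
         "(VP (VBZ eats) (NP (DT the) (NN pepperoni) (NN pizza)) "
         "(PP (IN under) (NP (DT the) (NN tree)))))")
counts = phrase_counts(parse)
print(counts)  # NP counted 3 times here: the NP inside the PP also counts
```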
Phrase length. Coh-Metrix reports a variety of indices related to syntactic complexity that result from phrase length calculations. These include the length of noun phrases and verb phrases (under the hypothesis that longer phrases are more difficult to process) and the number of words before the main verb (under the hypothesis that the main verb controls the arguments in the sentence and the longer it takes to access the main verb, the more complex the sentence is; McNamara et al., 2010).
Our statistical analyses address two principal questions. Our first question is whether growth in the syntactic patterns of L2 learners is evident in their writing. For this analysis, we first conducted within-subjects Analysis of Variance (ANOVA) using the selected Coh-Metrix indices focusing on the first and the last essays written over the course of the semester (n = 114). We did not focus on the middle essays because we did not expect syntactic growth to occur within an 8-week period. The ANOVA analysis provided us with information about which syntactic indices demonstrated significant growth patterns. Those indices that demonstrated significant growth patterns were then entered into a Naïve Bayes classifier to assess how well the indices predicted if an essay was written at the beginning of the semester or at the end of the semester. A Naïve Bayes classifier produces a statistical learning model that assigns a probability to each text given a number of instances (i.e., what is the probability, based on the syntactic features found within that text, that a text was written at the beginning or the end of the semester).
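As a sketch of the classification step, the following hand-rolled Gaussian Naïve Bayes classifier (not the implementation used in the study) assigns an essay to the "first" or "last" class from invented values for two syntactic features:

```python
import math

def train_gnb(X, y):
    """Fit a Gaussian Naive Bayes model: per-class prior plus
    per-feature mean and variance."""
    stats = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means = [sum(col) / len(col) for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                 for col, m in zip(zip(*rows), means)]
        stats[label] = (len(rows) / len(y), means, varis)
    return stats

def predict_gnb(stats, x):
    """Return the class with the highest posterior log-probability."""
    def log_post(label):
        prior, means, varis = stats[label]
        ll = math.log(prior)
        for v, m, s2 in zip(x, means, varis):
            ll += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return ll
    return max(stats, key=log_post)

# Invented feature vectors: [clause incidence, words before main verb]
X = [[380, 3.1], [390, 2.9], [360, 3.3], [310, 4.6], [300, 4.9], [320, 4.2]]
y = ["first", "first", "first", "last", "last", "last"]
model = train_gnb(X, y)
print(predict_gnb(model, [305, 4.7]))  # grouped with the "last" essays
```

The classifier treats each syntactic index as conditionally independent given the class, which is the "naïve" assumption that gives the method its name.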
Our second question addressed whether syntactic indices were predictive of human ratings of essay quality. To answer this question, we conducted regression analyses to examine if the selected Coh-Metrix indices were predictive of human ratings of essay quality (both language use and combined score ratings). For this analysis we used all the rated essays in the analysis (N = 171). We first conducted Pearson Product Moment Correlations between the human ratings for the essay and the syntactic indices. Those indices that demonstrated significant correlations were then included in a stepwise regression analysis to examine how well the indices could predict the variance in the human ratings.
For the Naïve Bayes analysis, we tested the predictive strength of the indices using a leave-one-out-cross-validation (LOOCV) analysis (Witten, Frank, & Hall, 2011). In this analysis, we chose a fixed number of folds that equaled the number of observations (i.e., 114 essays). In LOOCV, one observation in turn is left out for testing and the remaining instances are used as the training set (i.e., in the case of the Naïve Bayes analysis, the 113 remaining essays). We assess the accuracy of the model by testing its ability to predict the classification of the omitted instance (i.e., whether the essay was written at the beginning or the end of the semester). Such an approach affords us the opportunity to test the models generated by the Naïve Bayes classifier on an independent data set (i.e., on essays that are not used to train the model). If the LOOCV results demonstrate significant classification results as reported by a Chi-squared test, our level of confidence in the model increases, supporting the extension of the analysis to external data sets.
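The LOOCV procedure can be sketched as follows; the nearest-centroid classifier and the six invented observations are placeholders standing in for the Naïve Bayes model and the 114 essays:

```python
def nearest_centroid(train, x):
    """Classify x by the closest class mean (a stand-in for any classifier)."""
    cents = {}
    for label in {lab for _, lab in train}:
        rows = [v for v, lab in train if lab == label]
        cents[label] = [sum(col) / len(col) for col in zip(*rows)]
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min(cents, key=lambda lab: dist(cents[lab], x))

def loocv_accuracy(data):
    """Leave-one-out cross-validation: each observation is classified by a
    model trained on all the others, so no item helps predict itself."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]   # hold out observation i
        hits += nearest_centroid(train, x) == label
    return hits / len(data)

# Invented two-feature observations with class labels
data = [([1.0, 1.1], "a"), ([0.9, 1.0], "a"), ([1.1, 0.9], "a"),
        ([3.0, 3.2], "b"), ([3.1, 2.9], "b"), ([2.9, 3.0], "b")]
print(loocv_accuracy(data))
```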
For the regression analysis, we used training and test sets to assess the generalizability of the regression model to an outside corpus. We divided the corpus of 171 essays into training and test sets following a 67/33 split (Witten, Frank, & Hall, 2011). For the training set, we first conducted Pearson correlations to assess relationships between the selected variables and the human ratings. Those variables that demonstrated significant correlations with the human ratings were retained as predictors in a subsequent regression analysis. We next conducted a stepwise regression analysis using the essays in the training set only. The model from this regression analysis was then applied to the held back essays in the test set to predict their ratings.
For each analysis, we also controlled for multicollinearity by examining correlations between indices and ensuring that the indices were not strongly related (i.e., r > .70). In addition, we controlled for overfitting in the models by ensuring a 15/1 item-to-predictor ratio. Such controls allow us to use only variables that contribute uniquely to the models and to verify that the findings of the analysis are not the result of random noise in the data.
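A minimal sketch of such a multicollinearity screen, using invented per-essay index values, flags any pair of indices whose correlation exceeds the threshold:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two index vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def collinear_pairs(indices, threshold=0.70):
    """Flag index pairs whose |r| exceeds the threshold."""
    names = list(indices)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson_r(indices[a], indices[b])) > threshold]

# Invented per-essay values for three syntactic indices; the second
# tracks the first closely, so the pair should be flagged
indices = {"all_clauses": [5, 7, 6, 8, 9, 4],
           "verb_phrases": [10, 14, 12, 16, 18, 8],
           "words_before_verb": [4, 6, 3, 5, 4, 6]}
print(collinear_pairs(indices))
```

Of a flagged pair, only the index with the stronger relationship to the human ratings would be retained as a candidate predictor.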
We first conducted repeated measures ANOVAs on the human ratings assigned to the essays to examine if, according to expert raters, there was an improvement between essays written at the beginning of the semester (1st essays) and at the end of the semester (3rd essays). Language use ratings increased significantly from the first essay (M = 9.9, SD = 2.1) to the third essay (M = 11.1, SD = 1.9), F(1, 56) = 27.815, p < .001, ηp² (partial eta squared) = .332, and the combined ratings increased from the first essay (M = 47.4, SD = 10.3) to the third essay (M = 57.1, SD = 8.3), F(1, 56) = 47.378, p < .001, ηp² = .458.
Repeated measures ANOVAs were then conducted on the selected Coh-Metrix indices to examine if significant differences in syntactic features existed between essays written at the beginning of the semester (1st essays) and at the end of the semester (3rd essays). Those indices that showed significant differences were then used in a confirmatory Naïve Bayes classifier algorithm to predict whether the essays were written at the beginning or the end of the semester. Of the 11 indices, six demonstrated significant differences between the 1st and 3rd essays (see Table 1 for details): the incidence of all clauses, the number of modifiers per noun phrase, syntactic similarity, number of verb phrases, number of words before the main verb, and incidence of the negation word ‘not.’
The Naïve Bayes classifier using the six significant syntactic indices and LOOCV correctly allocated 71 of the 114 essays in the total set, χ² (df = 1, n = 114) = 7.737, p < .010, for an accuracy of 62.28% (the chance level for this analysis is 50%). The results from the Naïve Bayes classifier are reported in the confusion matrix found in Table 2. The measure of agreement between the actual essay number and that assigned by the model produced a Cohen’s Kappa of 0.246, demonstrating fair agreement.
[Insert Table 2 Here]
To illustrate the Naïve Bayes classifier results, we provide precision, recall, and F1 scores (see Table 3). Recall scores represent the number of true positive hits (i.e., correct predictions) over the number of hits + false negatives (e.g., the number of first essays that were misclassified as last essays), while precision scores represent the number of hits divided by the number of hits + false positives (e.g., the number of last essays that were classified as first essays). Thus, for the 45 of the 57 essays written at the beginning of the semester that were correctly classified, the recall score is (45/(45+12)) = 79%. For the same essays, the precision score is (45/(45+31)) = 59%. The F1 score is the harmonic mean of the precision and recall results. The model performed best at classifying essays written at the beginning of the semester. The overall accuracy of the model was .606 (the average F1 score).
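These figures can be reproduced directly from the classification counts reported above (45 first essays correctly classified, 12 first essays misclassified as last, and 31 last essays misclassified as first):

```python
# Confusion-matrix counts for the "first essay" class, from Table 2
tp = 45   # first essays correctly classified (true positives)
fn = 12   # first essays misclassified as last (false negatives)
fp = 31   # last essays misclassified as first (false positives)

recall = tp / (tp + fn)                          # 45/57
precision = tp / (tp + fp)                       # 45/76
f1 = 2 * precision * recall / (precision + recall)
print(round(recall, 2), round(precision, 2), round(f1, 2))  # 0.79 0.59 0.68
```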
[Insert Table 3 Here]
Regression Analyses: Language Use
Correlations training set. Correlations were conducted between the syntactic indices and the human ratings of language use for the 113 essays in the training set. Six Coh-Metrix indices demonstrated significant correlations with the human ratings while not demonstrating multicollinearity with one another (see Table 4).
Regression analysis training set. A stepwise regression analysis using the six indices as the independent variables to predict the human ratings of language use yielded a significant model, F(3, 109) = 13.013, p < .001, r = .514, r² = .264. Three syntactic indices were included as significant predictors of the human ratings: incidence of all clauses, infinitives, and ‘that’ verb complements. The model demonstrated that the three indices explained 26.4% of the variance in the human ratings of language use for the essays in the training set (see Table 5 for additional information).
[Insert Table 5 Here]
Regression analysis test set. We used the model reported for the training set to predict the human ratings of Language Use in the test set. To determine the predictive power of the three variables retained in the regression model, we computed an estimated rating for each essay in the test set using the B weights and the constant from the training set regression analysis. A Pearson correlation was then conducted between the estimated rating and the actual rating of each of the essays in the test set. This correlation, together with its r², was then used to determine the predictive accuracy of the training set regression model on the independent data set.
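This scoring procedure can be sketched as follows; the B weights, constant, and test-set values below are invented for illustration (the actual coefficients appear in Table 5):

```python
import math

def estimate(weights, constant, features):
    """Estimated rating = constant + sum of (B weight x index value)."""
    return constant + sum(b * x for b, x in zip(weights, features))

def pearson_r(xs, ys):
    """Pearson correlation between estimated and actual ratings."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                           sum((y - my) ** 2 for y in ys))

# Invented B weights and constant for the three retained predictors
B, constant = [-0.01, 0.05, 0.08], 11.0
# Invented test-set essays: (feature values, actual human rating)
test_set = [([400, 20, 5], 9.5), ([350, 30, 8], 10.0),
            ([300, 25, 10], 10.5), ([280, 35, 12], 11.5)]

estimated = [estimate(B, constant, f) for f, _ in test_set]
actual = [r for _, r in test_set]
r = pearson_r(estimated, actual)
print(round(r, 3), round(r ** 2, 3))  # r and r-squared on the test set
```

The r² of this correlation gives the proportion of test-set rating variance accounted for by the training-set model.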
The regression model, when applied to the test set, reported r = .457, r² = .208. The results from the test set model demonstrated that the combination of the three syntactic indices accounted for 20.8% of the variance in the language use ratings of the essays in the test set.
Regression Analyses: Combined ratings
Correlations training set. Correlations were conducted between the syntactic indices and the combined human ratings of the 113 essays in the training set. Seven Coh-Metrix indices demonstrated significant correlations with the human ratings while not demonstrating multicollinearity with one another (see Table 6).
Regression analysis training set. A stepwise regression analysis using the seven indices as the independent variables to predict the combined human ratings yielded a significant model, F(3, 109) = 19.659, p < .001, r = .593, r² = .350. Three syntactic indices were included as significant predictors of the essay ratings: incidence of all clauses, infinitives, and ‘that’ verb complements. The model demonstrated that the three indices explained 35.0% of the variance in the combined human ratings of the essays in the training set (see Table 7 for additional information).
[Insert Table 7 Here]
Regression analysis test set. The regression model, when applied to the test set, reported r = .562, r² = .316. The results from the test set model demonstrated that the combination of the three syntactic indices accounted for 31.6% of the variance in the combined ratings of the essays in the test set.
This analysis has demonstrated that significant growth in syntactic complexity occurs in L2 writers as a function of time spent in a writing class. From the beginning of the semester until the end of the semester, L2 writers in this study produced fewer incidences of all clauses, longer noun phrases, less syntactic similarity between sentences, fewer verb phrases, more words before the main verb, and more negation. However, only one of these syntactic features was also predictive of human judgments of L2 writing quality: fewer incidences of all clauses. The other two predictors of writing quality, incidence of infinitives and incidence of ‘that’ verb complements, did not demonstrate significant growth patterns between the first and last essays for the L2 writers in this analysis. Thus, while L2 learners’ writing does become more syntactically complex, most of the syntactic features demonstrating growth are not predictive of human judgments of writing quality.
In reference to L2 syntactic development, this study has shown that a number of syntactic complexity indices demonstrate growth in predicted directions from the first to the final essay of the semester. The strongest growth, as indicated by the effect size, was for the use of all clauses. Over the course of the semester, the L2 writers in this study produced fewer clauses overall (i.e., fewer matrix and embedded clauses). L2 writers also showed changes in their use of phrasal complexity, producing longer noun phrases and more words before the main verb. Conversely, while producing more complex noun phrases, L2 writers began to produce fewer verb phrases, which is indicative of fewer embedded clauses. At both the clausal and phrasal level, L2 writers produced sentences that demonstrated less syntactic similarity with each other over time, indicating the production of a greater variety of syntactic constructions. Lastly, L2 writers produced a greater number of sentences containing ‘not’ negations. Overall, the longitudinal analysis demonstrates that over the course of a semester L2 writers produced text that depended more on noun phrases than verb phrases and text that contained greater phrasal modifications (similar, supporting results are presented in Bulté and Housen, this issue). Knowing that nouns are more important than verbs in academic writing (i.e., academic writing has a nominal style; Fang, Schleppegrell, & Cox, 2006; Halliday, 1989; Halliday & Matthiessen, 1999; Wells, 1960), such a finding provides evidence that advancing L2 writers move toward the production of text that better aligns with an academic writing style. The longitudinal analysis also demonstrates that L2 writers move toward the development of more phrasal components as compared to clausal components as they advance in writing. 
This notion is evidenced by the movement away from producing a greater number of matrix and embedded clauses and toward denser phrasal components, such as longer noun phrases and an increase in the number of words before the main verb. These findings, taken in conjunction with the notion that academic writing relies more on phrasal modification (as compared to dependent clauses, which are typical in speech; Biber, 1985, 1986, 1988, 2006; Biber et al., 2011), provide further evidence that developing L2 writers move toward academic writing as they advance. Lastly, L2 writers began to produce sentences that demonstrate greater variety of structure and more transformations (e.g., negations).
We find a different pattern of results when we analyze the syntactic features that are most predictive of human judgments of syntactic properties and combined writing scores. Similar to our longitudinal analysis, we find that the overall production of clauses is the strongest predictor of both judgments of syntactic proficiency and combined writing proficiency. The negative correlation with human ratings signifies that fewer matrix and embedded clauses are indicative of increased writing quality. However, unlike longitudinal growth patterns, human judgments of writing proficiency are not strongly predicted by nominal style (i.e., a greater emphasis on nouns) or complex phrasal elements. Rather, higher ratings of writing quality are predicted by dependent clause features such as the incidence of infinitives and ‘that’ verb complements (see Table 8 for a comparison of the predictive indices found in each of the analyses). This is similar to the findings of Friginal and Weigle (this issue), who reported that higher rated essays and essays written at the end of the semester contained a greater number of complex syntactic structures, including syntactic structures related to clause complexity (‘that’ clauses and ‘to’ clauses).
What is remarkable about this contrast is that increased dependent clauses are argued to be characteristic of speech (i.e., interpersonal spoken registers) and not academic writing (Biber, 1985, 1986, 1988, 2006; Biber et al., 2011). Thus, we are left with the finding that whereas L2 writing develops to more closely match academic writing, human judgments of writing quality (at least in these essays) are not predicted by most of the syntactic features that develop longitudinally in L2 writers. A similar finding is reported in Bulté and Housen (this issue). In fact, it appears that the expert raters in this study did not evaluate L2 essays based on the syntactic features that are common in academic writing at all (i.e., nouns and phrases), but rather they evaluated writing samples syntactically based on the use and complexity of clausal components. Therefore, we find a disassociation between L2 syntactic development and judgments of L2 writing quality that leads to the conclusion that the syntactic features that develop in L2 learners are not the same syntactic features that will assist them in receiving higher evaluations for essay quality.
[Insert Table 8 Here]
In Appendix A, we provide examples of this disassociation with two essays written by one participant (participant 15) at the beginning of the semester and at the end of the semester. Table 9 provides the syntactic complexity scores reported by Coh-Metrix for these two essays. The essays do not demonstrate differences in the use of the negation ‘not.’ However, all indices related to nominal style demonstrate growth in the predicted directions, as do all phrasal indices. The samples also demonstrate differences in syntactic similarity. Conversely, the samples show no differences in reference to dependent clause indices (‘that’ verb complements and incidence of infinitives). Likewise, the samples demonstrate no differences in the human scores for Language Use and the combined scores. Thus, while participant 15 shows developmental patterns in both nominal and phrasal elements in writing, this development appears to have little influence on human judgments of essay quality.
[Insert Table 9 Here]
The question, then, is this: if academic writing is defined by a nominal style and the use of complex phrasal elements, why are the human ratings in this study predicted by clausal elements that are more indicative of speech? The answer likely lies in the genre of descriptive writing, which may not be a prototypical academic genre, especially for intermediate proficiency writers. Describing homes, campuses, teachers, family, and friends may lead to writing that is characteristic of a more interpersonal register (Biber, 1992). Such a register would likely influence how the human raters judge the quality of the writing. In the case of descriptive writing, it is likely that raters do not expect features of academic writing and thus evaluate the writing based on features more common in spoken discourse. As a result, essays containing more dependent clauses are judged to be of higher quality. In contrast, our findings indicate that lower quality essays contain more clauses in general, including matrix clauses, coordinating clauses, and dependent clauses.
From a practical perspective, such a finding suggests that descriptive writing tasks may not best assess writing development for intermediate level L2 writers. While evaluations of L2 writing development may center on clausal features rather than nominal and phrasal features (Biber et al., 2011), awareness of how different writing tasks may influence human judgments of quality should be an important pedagogical consideration undertaken before assignments are developed and assigned. If the goal of an L2 writing class is to transition L2 writers toward academic writing, then assignments that revolve around comparing and contrasting ideas, producing persuasive arguments, or integrating outside information into an essay may be better suited to evaluate developing syntactic proficiency in L2 writers than descriptive writing. Such proficiency seems to develop relatively quickly and as a result of instruction. Ortega (2003) preliminarily found that two to three months of university-level instruction would result in “negligible to small-sized change” (p. 511) and that the rate of change may be greater for L2 learners studying in an English as a second language (ESL) environment as compared to an English as a foreign language (EFL) environment. Our findings support this notion, with small effect sizes reported for learners’ syntactic development over a four-month semester in an ESL environment. Knowing that syntactic proficiency can develop in such a small window of time, it seems imperative that assessments of proficiency mirror this development.
This study has provided further evidence of syntactic development in L2 writers. Key to this study was the attempt to link such growth to human ratings of writing and syntactic proficiency. We found that, in most cases, features of syntactic complexity that demonstrated growth patterns in L2 writers were not the same features that predicted judgments of proficiency. In fact, there was a disassociation between L2 syntactic development and judgments of proficiency such that L2 learner growth was associated with greater nominal style and phrasal complexity whereas human judgments were predicted by clausal features (with the exception of the incidence of all clauses, which was an important indicator of L2 growth and human judgments of proficiency).
Some caution should be taken in interpreting these findings. While we investigated a number of indices related to syntactic complexity, we could not examine all potential syntactic complexity indices and not all features of syntactic growth are easily automated within computational tools. Thus, we analyzed a number of local metrics, which have strong links to global metrics of syntactic knowledge, but may not be inclusive of syntactic proficiency (i.e., we did not examine all elements of syntactic complexity). Additionally, we examined a small sample of writers over a relatively brief span of time in an ESL environment only. Future studies may benefit from a larger sample population that is investigated over the course of a year-long program of study. Future studies may also benefit from comparing ESL to EFL learners and instructed versus uninstructed learners. Lastly, this study focused on timed, descriptive writing. Future studies should consider assessing proficiency using a variety of different speaking and writing tasks to test the effects of genre and task. Such methodological changes would allow for falsification studies that could provide additional evidence in relation to syntactic development and its effects on human judgments of proficiency.