Develop additional item types that use student performance to assess more demanding constructs across all content areas.
Smarter Balanced has already begun the work of addressing the more demanding constructs of the CCSS in English language arts and mathematics. Its plans to use technology-enhanced items and performance tasks will give students an opportunity to demonstrate knowledge and skills in constructs that may not previously have been well measured by more traditional means such as multiple-choice items. While some individual states have used these methods, the consortia tests will be the largest application of these item types in a K-12 setting to date.
The use of performance tasks in large-scale assessments introduces the potential to enhance the assessment experience for students, expand the wealth of information on student understanding that could be accessed by educators and other interested parties, and influence in positive ways the direction of instruction and learning in the classroom. Performance tasks can take on a variety of forms that depend in part on the standards to be assessed, an assessment’s reporting goals, the extent to which the performance tasks are designed to complement other items in an assessment, and real-world considerations such as available monetary and time resources.
Standards documents such as the Common Core State Standards (for English language arts and mathematics), the Next Generation Science Standards, and the National Curriculum Standards for Social Studies all clearly communicate the importance of well-developed reasoning, analytical, and research skills, in addition to strong discipline-based content knowledge and competence. And, more generally, the Partnership for 21st Century Skills promotes “fusing the 3Rs and 4Cs (Critical thinking and problem solving, Communication, Collaboration, and Creativity and innovation)” (http://p21.org/). These standards documents along with others suggest a potentially significant role for performance tasks in the larger assessment picture.
Shorter performance tasks might, within a time-constrained interval such as 10, 15, or even 30 minutes or more, ask the student to construct a mathematical argument that synthesizes knowledge across mathematical content domains, analyze particular aspects of several literary works or historical pieces, or use his or her knowledge of science to critique the design of a system and suggest an improvement to one or more of its features. More extended performance tasks, however, offer greater opportunities to assess students’ capabilities to think deeply and may reveal new insights into their critical and creative thought processes. Consider, for example, a performance task spanning several days or even weeks in which a student must provide interim products at specific milestones as well as a final product. A potentially valuable byproduct of such a task is that it creates a path of observable behaviors from which data may be collected for later analysis.
Additionally, certain kinds of extended performance tasks might allow small groups of students to collaborate over a period of days or weeks toward a common goal, such as the submission of a product prototype developed to satisfy a particular set of design requirements. Part of such an exercise might involve withholding at the outset some of the information and resources students will need to achieve their end goal, instead having them decide what is needed to carry out their task and then how to use those materials and resources most efficiently. Extended performance events of this kind support the assessment of standards such as the “4 Cs” mentioned earlier, and of discipline-specific standards, in ways far more authentic than attempting to assess communication or creativity and innovation in a discrete, severely time-constrained item. Moreover, when thoughtfully designed into an assessment, the combination of short and extended performance tasks with discrete items and smaller item sets can support the efficient assessment of a wide range of content along with more targeted assessment of particular disciplinary habits of mind. In mathematics, for example, those habits of mind might include the extent to which a student, when confronting a problem that is not well specified mathematically, employs the same thought processes a skilled mathematician would.
One additional benefit of including performance tasks on large-scale assessments is their impact on classroom learning and instruction. If there is even a grain of truth to the statement that “what gets assessed is what gets taught,” then the goal of presenting students with opportunities to demonstrate academic competence in more real-world settings, settings that demand the integration of knowledge, skills, and thought processes consistent with those required in university-level study and in careers, argues for including a range of well-designed performance tasks in large-scale assessment.
Use artificial intelligence scoring of constructed responses when appropriately reliable, available, and beneficial.
Current visions for assessments measuring the proficiencies important in the CCSS call for constructed-response items that require students to produce written or spoken responses, draw figures representing numerical information, or write an equation detailing a specific relationship among given variables. In the past, only human scoring of such responses was available, requiring extensive rater training and considerable cost and time; recent advances in using computers to administer and score constructed-response questions, however, have made such questions more efficient to implement. Automated scoring of student-produced responses has the potential to support valid and efficient measurement of knowledge and skills that are best assessed by constructed-response questions.
The Smarter Balanced Assessment Consortium expects to use this technology in ELA and mathematics, and it would be logical for California to investigate using it in other content areas where the benefits can be shared. For example, in science and social studies there are likely to be items whose constructed responses include specific terminology from the field of study. With thoughtful and deliberate item design, artificial intelligence (AI) scoring can score such items with a high degree of reliability and efficiency.
An initial definition is in order: by “AI scoring” we mean the scoring of response types that are open-ended enough that they cannot be scored by simple rules or other deterministic procedures. Technology-enhanced items based on a drag-and-drop interface, hotspots, or text highlighting do not fall into this category; they can be scored without human intervention. Similarly, other item types (e.g., numeric entry, graphical entry, or equation entry) require algorithmic or pattern-based scoring approaches that are readily implemented with current tools. AI systems, by contrast, are used to score essays for writing quality and both short and extended written answers for content accuracy. California may well incorporate such items and scoring systems in the future: they measure key elements of the construct and are important to use if possible. However, such items come with inherent challenges.
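The boundary between deterministic and AI scoring can be made concrete with a brief sketch. The functions below are hypothetical illustrations, not part of any consortium's actual scoring system: the first shows why a numeric-entry item needs no AI at all, and the second shows how a naive keyword rubric for open-ended prose quickly reaches its limits, which is precisely where statistical AI scoring models take over.

```python
import re


def score_numeric_entry(response: str, key: float, tolerance: float = 0.01) -> int:
    """Deterministic scoring: parse the response as a number and
    compare it to the key within a tolerance. No AI is required."""
    try:
        value = float(response.strip())
    except ValueError:
        return 0  # unparseable responses earn no credit
    return 1 if abs(value - key) <= tolerance else 0


def score_short_answer_keywords(response: str, required_terms: list) -> int:
    """Naive keyword rubric: one point per required term found.
    This breaks down for open-ended prose (synonyms, negation,
    misspellings), motivating trained AI scoring models instead."""
    text = response.lower()
    return sum(
        1 for term in required_terms
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    )
```

A response of "The cell uses sunlight" would earn a point for "light" under the keyword rubric only if the rubric anticipated the word form, illustrating why pattern-based approaches suffice for constrained entry but not for genuinely open-ended responses.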
Applications of automated scoring to new item types or populations may require additional research so that the fairness, validity, and reliability of scoring — in addition to efficiency — can be supported by evidence. Future contractors should demonstrate valid, fair, and reliable scoring of constructed responses. Note that a scoring system can “work well” for an aggregate population and still introduce biases for certain key subgroups. Therefore, technical data at the sub-population level must be examined before non-algorithmic scoring systems are placed into operational use.
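The subgroup check described above can be sketched in a few lines. This is a simplified illustration with an invented function name and a bare exact-agreement metric; an operational evaluation would use richer statistics (e.g., quadratically weighted kappa) and standard-setting for the flagging threshold.

```python
from collections import defaultdict


def agreement_by_subgroup(records, gap_threshold=0.05):
    """records: iterable of (subgroup, human_score, machine_score) tuples.
    Computes the exact human-machine agreement rate overall and per
    subgroup, and flags any subgroup whose rate trails the overall
    rate by more than gap_threshold."""
    totals, matches = defaultdict(int), defaultdict(int)
    all_total = all_match = 0
    for group, human, machine in records:
        totals[group] += 1
        all_total += 1
        if human == machine:
            matches[group] += 1
            all_match += 1
    overall = all_match / all_total
    rates = {g: matches[g] / totals[g] for g in totals}
    flagged = [g for g, r in rates.items() if overall - r > gap_threshold]
    return overall, rates, flagged
```

A scoring engine whose aggregate agreement looks acceptable would still be flagged here if, say, one demographic group's agreement rate fell well below the overall figure, which is the pattern this recommendation asks reviewers to look for before operational use.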
To position California for success in large-scale use of computer-based scoring technologies, we recommend a considered approach to the future use of artificial intelligence scoring. Planning now for the use of this technology in additional content areas would place California at the front of the development field: few, if any, other states are considering AI scoring beyond ELA and mathematics, and California would play a leading role in advancing its assessment program with it.
Consider metacognitive factors in determining college and career readiness.
K-12 assessment is sometimes criticized because the instruments used measure only a portion of the knowledge, skills, and abilities necessary for success in college or a career. There is little dispute that a student who has ample command of the CCSS content in mathematics and English language arts, but minimal ability to employ other cognitive strategies successfully, may be at a disadvantage.
California has been involved in the work of the Educational Policy Improvement Center (EPIC) through projects that investigate the full breadth of college and career readiness domains. Through this work, the state is very familiar with the work of Dr. David Conley, who argues that the determination of college and career readiness should rest not solely on performance in content knowledge but on the evaluation of a profile of characteristics spanning four domains, of which content knowledge is only one: key cognitive strategies, key content knowledge, key learning skills and techniques, and key transition knowledge and skills (Conley, 2012). These four keys are depicted in the figure below.
Conley’s “Four Keys to College and Career Readiness”
Not all of these attributes can be assessed, or necessarily should be assessed, with a large-scale assessment tool. Yet California has the opportunity to build an assessment system that gathers the information needed to provide a more complete picture of a student’s college and career readiness. As it did with the introduction of the state’s Early Assessment Program, California can once more lead the nation in developing the most advanced college and career readiness evaluation for students. Investigating methodologies to determine readiness in other domains, such as those articulated at EPIC, could produce groundbreaking readiness profiles that would further strengthen the alignment between the state’s K-12 and postsecondary systems.