
 

States' Procedures for Ensuring Out-of-Level Test Instrument Quality


Out-of-Level Testing Project Report 14

Published by the National Center on Educational Outcomes

Prepared by:
Jane E. Minnema • Martha L. Thurlow • Ross E. Moen • Gretchen R. VanGetson

September 2004


Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Minnema, J.E., Thurlow, M.L., Moen, R.E., & VanGetson, G.R. (2004). States' procedures for ensuring out-of-level test instrument quality (Out-of-Level Testing Project Report 14). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web: http://education.umn.edu/NCEO/OnlinePubs/OOLT14.html


Executive Summary

The present study was initiated to gather information about states’ out-of-level testing practices. Specifically, we wanted to see whether states attempted to align out-of-level tests with grade of enrollment content standards, what processes they used to complete this task, and what psychometric information they offered as evidence of that alignment. We collected narrative data in this study from two sources: technical information about states’ large-scale statewide assessments and information gleaned from interviews with state assessment directors or other individuals knowledgeable about the state’s large-scale assessment. We used the technical information gathered prior to the interviews to provide context and support for our interview data analysis process. We then compiled the results thematically.

Five critical issues were highlighted in the results.

  1. There was an increasing need for states to provide easily accessible technical information that includes out-of-level testing information.

  2. States’ arguments supporting their decision to not equate out-of-level test scores with on-level test scores were stronger than arguments supporting this practice.

  3. States provided incomplete and inconclusive information about the psychometric properties of out-of-level test scores.

  4. States were not consistent in their opinions about the use of out-of-level tests.

  5. States made questionable assumptions about out-of-level testing.

Overall, the wide variability of out-of-level testing practices among the states raises many concerns about the practice of out-of-level testing.
 


Background

Out-of-level testing, “the administration of a test at a level above or below the level generally recommended for students based on their age-grade level” (Study Group on Alternate Assessment, 1999), first arose in the 1960s as a way to measure Title I program effectiveness (Cleland & Idstein, 1980; Crowder & Gallas, 1978; Jones, Barnette, & Callahan, 1983; Long, Schaffran, & Kellogg, 1977). The logic behind this was that out-of-level testing would yield more reliable and valid test results for students who were not achieving at grade level (Ayrer & McNamara, 1973). With the advent of standards-based reform and large-scale statewide assessments ushered in by recent legislation, including the No Child Left Behind (NCLB) Act of 2001, out-of-level testing seemed like an appealing option for fulfilling the legal requirement to include all students in statewide testing. Thus, students achieving below grade level, typically students with disabilities, who were historically omitted from large-scale assessment practices, were expected to show increased participation and performance.

The increased popularity of out-of-level testing has occurred amidst controversy within highly politicized settings (Thurlow & Minnema, 2001). States differed greatly in their use of out-of-level testing, which was reflected in their preferred terms for below grade level testing. Multiple terms were used, from instructional level testing to alternate assessment. There was variability in other areas that are of concern, such as the alignment of the test content with state content standards and the psychometric properties of out-of-level test scores.

Standards-based large-scale assessments are used as indicators that, under NCLB, all students are working toward proficiency on rigorous on-grade level curriculum guided by states’ academic content standards (Linn, Baker, & Betebenner, 2002). Curriculum and assessment are two of the three elements of education that, when joined by the third element (instruction), comprise a triad of core educational elements (Pellegrino, 2002). Alignment is the process of ensuring agreement among these three elements, defined as “the degree to which expectations and assessments are in agreement and serve in conjunction with one another to guide the system toward students learning what they are expected to know and do” (Webb, 2002, p. 3). The purpose of content standards is to provide clear and concise guidelines for instructional and curricular development, serving as the foundation of the alignment process.

Many state assessments have been custom built to align with the state’s content standards, even though this process has its shortcomings. Many states’ content standards are too broad to provide clear and concise guidelines for alignment (Popham, 2001). There are multiple processes for aligning large-scale assessments with content standards. Some are less rigorous than others, which leads to inconsistency across states in the thoroughness of alignment (Council of Chief State School Officers, 2002). There is continuing concern that failing to ensure proper alignment between assessments and content standards will result in students being “taught to the tests” and not participating in a rigorous on-grade level curriculum based on challenging content standards (Rothman, Slattery, Vranek, & Resnick, 2002).

Alignment is even more of a concern for out-of-level testing. In addition to the challenges of alignment faced by on-grade level assessments, alignment of out-of-level tests also raises the question of which grade level’s content standards the out-of-level test should be aligned with. Under federal regulations, out-of-level tests should be aligned with alternate achievement standards that are “clearly different from the achievement standards in the target grade” (Federal Register, 2003). It is important to note that data collection for this report was conducted prior to the release of these regulations; therefore, the results of this study may not necessarily reflect this federal mandate.

Along with test alignment, another issue in developing and demonstrating the quality of the test instrument is psychometric soundness. Two aspects of psychometric properties that have been emphasized in the area of out-of-level testing are the concepts of precision and accuracy. Precision is concerned with random error and accuracy is concerned with systematic error or bias. These two concepts can be thought of as roughly comparable to reliability and validity. Validity (accuracy) speaks to whether you are hitting the right target and reliability (precision) speaks to how consistently you are hitting one target.

Asserting that on-level tests yield imprecise measures for students who are instructed at levels below the grade in which they are enrolled in school, proponents of out-of-level testing claim that testing students at the level of instruction is a more precise measure of what they know and can do. Psychometric theory and research agree that there is more random measurement error when students take tests that are much too hard for them (Bielinski, Thurlow, Minnema, & Scott, 2000). It follows then that giving students a test closer to their achievement level will produce less random error in measuring students’ ability. But the picture increases in complexity when inferences are made beyond the test results, such as when below grade level tests are used to infer achievement on content standards set for grade of enrollment. When out-of-level test results are used to infer how a student would perform on an on-level test, the measurement error of both tests must be taken into account. It is imperative that states consider whether the precision gained by using an assessment closer to a student’s achievement level (out-of-level test) outweighs the added error introduced by inferring on-level test performance from out-of-level test results.
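The trade-off described above can be illustrated with a small numerical sketch. The example assumes, purely for illustration, that the error of the out-of-level test and the error added by inferring on-level performance are independent, so that their variances add. The function names and every numeric value are hypothetical and do not come from any state's data.

```python
import math


def combined_sem(sem_out_of_level: float, sem_inference: float) -> float:
    """Standard error of an on-level score inferred from an out-of-level test.

    Assumption (not from the report): the two error sources are
    independent, so their variances add.
    """
    return math.sqrt(sem_out_of_level ** 2 + sem_inference ** 2)


def out_of_level_is_more_precise(sem_on_level: float,
                                 sem_out_of_level: float,
                                 sem_inference: float) -> bool:
    """True when the inferred score carries less error than the on-level score."""
    return combined_sem(sem_out_of_level, sem_inference) < sem_on_level


# Hypothetical values in scale-score points, chosen only for illustration.
sem_on_level = 6.0       # on-level test that is much too hard for the student
sem_out_of_level = 3.0   # test matched to the student's achievement level
sem_inference = 4.0      # error added by projecting onto the on-level scale

total = combined_sem(sem_out_of_level, sem_inference)
print(f"combined error: {total:.1f}")
print("net gain in precision:",
      out_of_level_is_more_precise(sem_on_level, sem_out_of_level, sem_inference))
```

With these made-up numbers the inferred score is still slightly more precise than the on-level score. Raising the hypothetical inference error from 4.0 to 6.0 reverses the conclusion, which is the kind of comparison a state would need to make with its own data.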

The issue of psychometric test score accuracy in aligning out-of-level tests with grade of enrollment content standards falls in the realm of equating, and more particularly, vertically equating test scores. For equating in general, the Standards for Educational and Psychological Testing state that “the fundamental concern is to show that equated scores measure essentially the same construct, with very similar levels of reliability and conditional standard errors of measurement” (AERA/APA/NCME, 1999, p. 57). Measurement specialists are able to address this concern when they create parallel test forms intended to measure the same constructs at the same difficulty level for the same population. Annually produced versions of college entrance examinations such as the SAT or ACT are examples of tests that successfully meet these conditions. Moving away from any of these three conditions (comparable difficulty, construct, and population) complicates the process. For example, a National Research Council (NRC) committee was charged by Congress with finding a common scale for equating tests, such as various 4th grade reading tests. After reviewing a variety of equating issues, methods, and studies, the committee concluded that “comparing the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale, is not feasible” (Feuer, Holland, Green, Bertenthal, & Hemphill, 1999, p. 4).

Even reducing random measurement error to acceptable limits, as might be accomplished through methods such as item response theory (IRT) scaling, would not settle the issue of whether the same construct is being measured by different levels of testing. The NRC committee spoke clearly to this issue by describing an extreme case of creating a formula that links the scores from a reading test with the scores from a mathematics test (Feuer et al., 1999). Although reading and mathematics are obviously different constructs, it is arithmetically possible to link the two scores together, resulting in a deceptive representation of student performance in one of the two content areas. Developers of out-of-level tests need to show whether, for example, 5th grade students who receive a 3rd grade reading test out of level are being assessed on the same constructs as the majority of 5th graders who receive the on-level reading test. If the same constructs are not being measured, any inference about proficiency on 5th grade content standards from performance on a 3rd grade test is compromised. Of particular concern is that some students who are tested on lower level standards are prevented from demonstrating proficiency on grade of enrollment content standards because they had no opportunity to show what they could do on grade level standards.
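The point that two scores can be linked arithmetically regardless of what they measure can be shown with a minimal sketch. The example uses mean and standard deviation matching, one common textbook form of linear linking. It is not the NRC committee's formula, and the function name and score values are invented for illustration.

```python
from statistics import mean, pstdev


def linear_link(scores_x, scores_y):
    """Return a function that maps an X score onto the Y scale.

    Matches the mean and standard deviation of the two score
    distributions. Nothing here checks that X and Y measure the
    same construct, which is exactly the problem.
    """
    mean_x, sd_x = mean(scores_x), pstdev(scores_x)
    mean_y, sd_y = mean(scores_y), pstdev(scores_y)
    slope = sd_y / sd_x

    def to_y_scale(x):
        return mean_y + slope * (x - mean_x)

    return to_y_scale


# Made-up scores for illustration only.
reading_scores = [10, 20, 30]
math_scores = [100, 150, 200]

reading_to_math = linear_link(reading_scores, math_scores)
print(round(reading_to_math(30), 2))
```

The calculation runs without complaint and returns a "mathematics" score for any reading score, even though the result says nothing about a student's mathematics achievement. Evidence that both tests measure the same construct has to come from outside the formula.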

The present study was initiated to gather information about states’ out-of-level testing practices. Specifically, we wanted to see whether states attempted to align out-of-level tests with grade of enrollment content standards, what processes they used to complete this task, and what psychometric information they offered as evidence of that alignment. The specific research questions that guided this study were:

(1) What processes do states use to ensure alignment of out-of-level tests with state standards?
(2) What is the grade level of the standards with which the out-of-level tests are aligned?
(3) What evidence do states offer that their assessments are psychometrically sound?
(4) Are scores from out-of-level tests equated with scores from on-grade level tests? And what is the rationale for this process?
 


Method

We collected narrative data in this study from two different sources: technical reports that contained information about states’ large-scale test instruments and information gleaned from interviews with state assessment directors or other individuals knowledgeable about the state’s large-scale assessment program.
 

Procedure—Task One

The first task in this study was to gather the technical information. At the beginning of this study in 2002, 14 states were identified as using out-of-level testing in their statewide assessment programs (Thurlow & Minnema, 2003). We attempted to obtain technical reports, as well as any other technical information about the state’s assessment program, from each of these states. To accomplish this task, we searched each state education agency’s Web site and downloaded any technical information or test development information, including test blueprints. If detailed technical information was not posted online, we contacted the state or the state’s test contractor directly in an attempt to obtain a hard copy of the technical report. We received technical reports from two states; two other states indicated that the report was not available at that time, but would be available within the year. Unfortunately, the window of time for collecting technical reports for our study closed before these reports were completed. Two test publishers sent copies of the technical reports for two norm-referenced tests used in some states that test out of level. We reviewed all of the technical information before conducting the telephone interviews. For states from which technical information was not received, we reviewed the information we had found on the states’ Web sites.
 

Procedure—Task Two

Assessment directors in states that tested out of level were selected as prospective participants for the telephone interviews. An NCEO researcher contacted each state assessment director by e-mail. We attached a copy of the interview questions and the study’s research proposal to the e-mail to provide advance information about the study. Participants were given the option of a telephone interview or responding to the interview questions via e-mail. They could designate another individual to participate instead if someone else was more knowledgeable about out-of-level testing in that state. A follow-up e-mail identical to the initial recruitment e-mail with a brief reminder note was sent to non-respondents after one month.

Nine states agreed to participate in the interview (California, Connecticut, Delaware, Iowa, Mississippi, Oregon, South Carolina, Utah, Vermont); three states declined participation, and two states failed to respond. One assessment director completed the interview via e-mail while eight states participated in the telephone interview. Of the eight states that participated in the telephone interviews, two assessment directors, one program associate of an educational testing company, one university professor, and two assessment specialists participated individually, and two states requested group interviews. Of these group interviews, one state included the assessment director, two assessment consultants, and one manager of communications. The other state included the assessment director, one assessment consultant, and one assessment coordinator. The telephone interviews typically lasted 20 to 30 minutes and were tape recorded for transcription and qualitative data analysis.

After the transcribed interviews were read, the primary participant was contacted once again via e-mail for any additional follow-up questions or clarifications. We began data analysis by revisiting the technical information gathered prior to the interviews. This information provided context and support for our analysis process. Next, we reviewed each interview transcript and coded the narrative data into subcategories of information. We then compiled the results thematically. Each state’s final set of results was e-mailed to the primary interview participant for final review prior to drafting the report.
 


Results

States’ alignment and technical information for their large-scale statewide assessments was generally available online or in hard copy. Table 1 provides details on the availability of states’ alignment and technical information, including what information was available online and what information was available in hard copy on request. Six states (California, Connecticut, Mississippi, Oregon, South Carolina, Utah) posted online their test blueprints or a similar form of test specifications that explained the alignment between the state’s content standards and test items. Five states (Delaware, Mississippi, Oregon, South Carolina, Utah) provided more detailed alignment and test development information. This information was presented as a stand-alone document covering topics such as phases of test instrument development or descriptions of test development committees and panels (Oregon, Utah). Other stand-alone documents were in the form of technical manuals (Delaware, South Carolina) or summaries of technical information (Connecticut, Mississippi, South Carolina, Utah, Vermont).

Five states (Connecticut, Delaware, Iowa, Utah, South Carolina) indicated that test blueprints were available in hard copy, while five states (Connecticut, Iowa, Mississippi, Oregon, South Carolina) indicated that some form of more detailed technical information was also available upon request. The majority of the states (California, Connecticut, Iowa, Mississippi, South Carolina, Utah, Vermont) responded that the complete technical manual for the state’s assessment was available in hard copy from either the state educational agency or the test publisher.
 

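Table 1. Availability of States’ Large-Scale Assessment Information

|   | CA | CT | DE | IA | MS | OR | SC | UT | VT |
|---|---|---|---|---|---|---|---|---|---|
| Online information |   |   |   |   |   |   |   |   |   |
| Blueprints or test specifications | X | X |   |   | X | X | X | X |   |
| Technical manual |   |   | X |   |   |   | X |   |   |
| Test development information |   |   | X |   | X | X | X | X |   |
| Some technical information |   | X |   |   | X |   | X | X | X |
| Hard copy materials on request |   |   |   |   |   |   |   |   |   |
| Blueprints |   | X | X | X |   |   | X | X |   |
| Technical manual | X | X |   | X | X |   | X | X | X |
| Detailed technical information |   | X |   | X | X | X | X |   |   |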



States made technical information available to the public. Overall, states had responded to their consumers’ need for technical information by making the information available in consumer-friendly formats, either online or in hard copy. Three states (Connecticut, Oregon, South Carolina) attempted to put information online that was useful for a broad audience while providing instructions on how to obtain more specific information available in hard copy. While one state (Delaware) preferred to respond to the public need for such information by posting the entire technical manual online as a way to answer common questions, another state (Iowa) commented that consumers (i.e., district test coordinators) could more readily access the information in hard copy format. Making technical information, and especially test development information and blueprints, available to public consumers offered these consumers the opportunity to access states’ procedures for test alignment with state content standards.

We gathered information about states’ processes for aligning on-level tests to content standards to provide the context for a discussion of aligning out-of-level tests with content standards. Table 2 displays states’ information regarding on-level test alignment. All of the states involved in this study, with the exception of one (Iowa), had developed statewide content standards. Every state with statewide content standards also had some type of documented link between those standards and the state’s large-scale statewide assessment. These states had blueprints or test specifications based on the content standards that guided test item development. In these documents, each content standard was divided into testable portions, and each portion was assigned one or more test items to assess this section of the content standard.

States organized groups to conduct alignment procedures in a variety of ways. Many states (California, Delaware, Oregon, South Carolina, Utah) deployed special panels or committees to review test alignment, as shown in Table 2. For example, Oregon used both content and sensitivity panels to check for alignment, South Carolina developed an Education Oversight Committee to, in part, ensure test alignment to content standards, and Utah used an advisory committee for this purpose. Two states (Connecticut, Vermont) relied on stakeholders in the assessment development process to review alignment during and at the conclusion of the actual test development process. Further, one state (Iowa), because it did not have statewide standards, provided training programs to each district in the state on aligning the district assessment to district standards. One state (Mississippi) did not cite a specific review process to ensure alignment of assessments with content standards.

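Table 2. State Practices In Aligning On-Level Tests to Content Standards

|   | CA | CT | DE | IA | MS | OR | SC | UT | VT |
|---|---|---|---|---|---|---|---|---|---|
| Adopted Statewide Content Standards |   |   |   |   |   |   |   |   |   |
| Yes | X | X | X |   | X | X | X | X | X |
| No |   |   |   | X |   |   |   |   |   |
| Documented Link Between Standards and Assessment |   |   |   |   |   |   |   |   |   |
| Yes | X | X | X |   | X | X | X | X | X |
| No |   |   |   | X |   |   |   |   |   |
| Alignment Review Process in Place |   |   |   |   |   |   |   |   |   |
| Yes | X | X | X | X |   | X | X | X | X |
| No |   |   |   |   | X |   |   |   |   |
| Individuals Involved in Test Development and Alignment Process |   |   |   |   |   |   |   |   |   |
| Content specialists | X |   | X | X |   | X | X | X |   |
| State department of education staff |   | X | X |   |   |   | X | X | X |
| Test publisher staff | X | X | X | X |   |   | X | X |   |
| Educators | X | X | X | X | X | X | X | X | X |
| Administrators |   |   |   |   |   |   | X | X | X |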

 

Each state included a unique combination of individuals in their alignment and test development processes. In particular, content specialists were designated as having a role in these processes (California, Delaware, Iowa, Oregon, South Carolina, Utah) as well as state educational agency staff members (Connecticut, Delaware, South Carolina, Utah, Vermont). Additionally, test publisher staff members contributed to these processes (California, Connecticut, Delaware, Iowa, South Carolina, Utah) along with educators (California, Connecticut, Delaware, Iowa, Mississippi, Oregon, South Carolina, Utah, Vermont) and administrators (South Carolina, Utah, Vermont).

States assumed that out-of-level tests were aligned with students’ grade level of instruction. As indicated in Table 3, all the states responded that the tests used for out-of-level assessment were the same tests that were used for on-level assessment, just presented at a grade level below a student’s enrollment grade. For instance, an 8th grade student administered a 5th grade out-of-level test would take the same test as 5th grade students who were participating in the general assessment on grade level. In most cases, states indicated that students were tested out of level at their instructional level, meaning that an 8th grade student would be administered a 5th grade test out of level because instruction was delivered at the 5th grade level.

Table 3. Out-of-Level Test Characteristics

 

CA

CT

DE

IA