
Models for Understanding Task Comparability
in Accommodated Testing

 

Council of Chief State School Officers
State Collaborative on Assessment and Student Standards
Assessing Special Education Students (ASES) - Study Group III
Patricia Almond-Chairperson

 

Gerald Tindal
University of Oregon

March, 1998

 

Behavioral Research and Teaching
College of Education - 232 Education
5262 University of Oregon
Eugene, OR 97403-5262
(541) 346-1640
geraldt@darkwing.uoregon.edu


Preface

This paper was commissioned by a subgroup of the State Collaborative on Assessment and Student Standards (SCASS) focused upon Assessing Special Education Students (ASES). When students with disabilities take large scale assessments, issues arise about testing accommodations and modifications. The SCASS-ASES group sought guidance as they addressed concerns about the effect of accommodations on the validity of test performance for students taking them. Knowledge about the impact of accommodations on assessment results was limited so a paper about task comparability was conceived.

This document represents the resulting paper and it reflects the search for understanding about accommodations in large scale assessments for students with disabilities. The SCASS-ASES Study Group Three proposed the project and its members acted as advisors during the development of the paper. They hope that the ideas presented in this paper spur discussion, inspire new research, and move educators forward in implementing the intent of the 1997 Amendments to the Individuals with Disabilities Education Act (IDEA).

The group wishes to acknowledge the dedication of Dr. Gerald Tindal, who researched and wrote the paper. He saw the project to completion, posed provocative questions, probed for answers, and provided rare insights into arguments that were raised by members of the group. He synthesized a collection of questions, conversations, as well as existing theories and knowledge in the field. His work has surpassed the initial vision, producing a fundamental perspective that already has filtered into understanding large scale assessments and students with disabilities.

Martha Thurlow also should be acknowledged for the careful and insightful editing she completed with this paper. She provided extremely insightful edits that made the language and ideas much clearer and more readable. Her work was both careful and thoughtful.

Study Group Three

Patricia Almond (Chairperson, OR), Sue Bechard (CO), Jana Deming (TX), Edna Duncan (MS), John Haigh (MD), Jan Kirkland (MS), Ken Olsen (MSRRC), Paula Ploufchan (CCSSO), Martha Thurlow (NCEO), Becca Walk (WY).


Abstract

In this paper, three models are described for making decisions about the comparability of tasks when accommodations are used in large-scale assessments: (a) descriptive, (b) comparative, and (c) experimental. Within each of these models, criteria are presented that can be used to determine whether assessment tasks are considered similar (comparable) or different (noncomparable). The descriptive model involves the presentation or analysis of policy, providing historical or contextual information, and documentation. The comparative model involves the interpretation of policy, with or without data. When data are available and used, they provide post-hoc evaluations and the beginning of an empirical approach in which data are used in the decision making process. Finally, in the experimental model, not only are data used in decision making, but threats to validity are controlled, allowing for statements of relationship or cause-effect to evaluate the impact of accommodations on decisions. The use of three models should help to clarify the reasoning behind judgments of accommodations and task comparability in large-scale assessment programs.


Models for Understanding Task Comparability
in Accommodated Testing

The 1997 amendments to the Individuals with Disabilities Education Act (IDEA) contain a strong directive for the participation of students with disabilities in large-scale testing. Yet, the format of the assessments themselves is not specified. For example, the language of the reauthorized legislation includes the following statement: "Children with disabilities are included in general State and district-wide assessment programs, with appropriate accommodations, where necessary" [IDEA, section 612 (a)(17)(A)]. The question, then, is what constitutes an "appropriate" accommodation?

To what degree are accommodated and non-accommodated tasks within large-scale assessment programs comparable? If slight changes are made in the manner in which tasks are configured, administered, or responded to, perhaps these variations can be ignored and scores considered comparable. In contrast, if significant changes exist in the accommodated tasks, then the scores may not be similar and thus should not be compared. When are two tasks considered comparable and when are they considered not comparable?

The National Center on Educational Outcomes (NCEO) has reported that there "appears to be no formal consensus on the use of the terms accommodation, modification, and adaptation, [and] they are used interchangeably" (NCEO, 1993, p. 2). In a later draft of a position presented at a June, 1996, State Collaborative on Assessment and Student Standards (SCASS) meeting, Ysseldyke describes an accommodation as "an alteration in how a test is presented to or responded to by the person tested; [it] includes a variety of alterations in presentation format, response format, setting in which the test is taken, timing, or scheduling. The changes are made in order to provide a student equal access to learning and equal opportunity to demonstrate what is known" (p. 1). Further clarification is provided in that the alterations should not substantially change the level of the test, the content of the test, or the performance criteria (what the test measures–i.e., construct validity).

Phillips (1994) argues that a differential effect is needed to help determine whether an accommodation is appropriate or not appropriate: The accommodation is effective for students with disabilities but is not effective for students without disabilities. If changing the test (how it is given and how it is completed) increases performance across the board for all students, then it may not be an appropriate accommodation. She specifically poses five issues to be addressed when considering accommodations that depart from normally used testing procedures:

  1. The measures used within any given eligibility area must be technically adequate (have established reliability and validity). For example, when students are found eligible to receive special education services, we need to consistently and truthfully document a disability and its adverse educational impact.
  2. To the greatest degree possible, students should be able to adapt to the standard testing situation, and if any changes are made they should be only minor ones. That is, changes should not be made if they don’t need to be made.
  3. The skill being tested should be the same regardless of any changes made in the way the test is given or taken. Changes in testing must be made that are limited to the removal of "irrelevant sources of difficulty but still measure the same construct" (Thurlow, Ysseldyke, & Silverstein, 1995, p. 264).
  4. The meaning of scores should be the same regardless of any changes being made in the manner in which the test is given or taken. As in the third point above, not only should the skill remain the same, but its meaning and the implications for using the scores to make decisions should not be different with changes in the test administration or response.
  5. The accommodation should not have the potential for benefit for students without disabilities. Any change must, therefore, reflect differential performance among different students: Some students should be affected positively by the change and others should be unaffected.

This approach to determining appropriate accommodations is controversial for two reasons: (a) it is uncertain what the best comparison groups are - all students in general education, just low performers, or just average performers; and (b) it is not clear how students with disabilities should be sorted for comparisons - by disability category or by area in which services are provided (e.g., reading decoding, language processing, etc.). The finding of differential effects implies that there has been differential access when no accommodations are used. Equal access is achieved, for example, when an accommodation improves the performance of a student in that area where the student receives special education assistance, but does not improve the performance of average students not on IEPs.
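The differential-effect criterion can be illustrated with a small calculation. The sketch below is only an illustration and is not a procedure taken from Phillips (1994): the scores are invented, and the function names (`accommodation_gain`, `differential_effect`) are hypothetical. It computes the mean gain under accommodation for each group and the difference between those gains. A gain for students with disabilities paired with little or no gain for students without disabilities is the pattern the criterion looks for.

```python
from statistics import mean


def accommodation_gain(standard, accommodated):
    """Mean score under accommodation minus mean score under standard conditions."""
    return mean(accommodated) - mean(standard)


def differential_effect(with_disability, without_disability):
    """Return both group gains and their difference (an interaction contrast).

    Each argument is a dict holding 'standard' and 'accommodated' score lists.
    """
    gain_with = accommodation_gain(with_disability["standard"],
                                   with_disability["accommodated"])
    gain_without = accommodation_gain(without_disability["standard"],
                                      without_disability["accommodated"])
    return gain_with, gain_without, gain_with - gain_without


# Invented scores for illustration only (not data from any study).
students_with_disabilities = {"standard": [10, 12, 14],
                              "accommodated": [16, 18, 20]}
students_without_disabilities = {"standard": [20, 22, 24],
                                 "accommodated": [20, 23, 23]}

gain_with, gain_without, contrast = differential_effect(
    students_with_disabilities, students_without_disabilities)
print(gain_with, gain_without, contrast)
```

An actual evaluation would also require a test of statistical significance and a defensible choice of comparison group, which is the first source of controversy noted above.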


A Continuum of Testing Options

A framework can be constructed for placing accommodations on a continuum with the following definition used to create four groups: An accommodation is a change that (a) provides unique and differential access (to performance) so certain students may complete the tests and tasks without other confounding influences but (b) does not change the nature of the construct being tested. Such changes typically are designed for specific individuals and for particular purposes.

 

Figure 1. Accommodations and Modifications

[Graphic: Accommodations and Modifications]

 

On the left side of the continuum (Figure 1 above) are a standard test and an accommodated test. In the accommodated test, minor changes are made in the way the test is given or taken, but it may still be considered a "standard" assessment and reflects the same construct. The purpose of making accommodations is to provide access to participation in an assessment program with a primary test or instrument. The changes may be implemented within or across special education populations, referring to the unique needs of students in a manner that may or may not be disability oriented and/or related to an adverse educational impact. Because the construct has not changed, scores can be aggregated.

Changes also can be made for specific students that are substantial and modify the construct being tested. Although the assessment still focuses on documenting performance as part of an assessment program with a primary instrument or test, substantial changes are made in the administration-response of it. The net effect is that different tasks and noncomparable scores are created compared to those generated in the standard-accommodated assessment. As a result, some type of disaggregated reporting system may need to be considered.

At the most extreme end of the continuum, changes in test administration and responses lead to alternate assessments. For changes made with students with severe disabilities and with unique needs for whom the primary measure is not appropriate, such alternate assessments may be needed. Rather than documenting performance on the primary instrument or test, one might sample behavior using tasks uniquely created to meet the needs of the student. The tasks and the scores would be noncomparable to those reflected in the primary measures and a different set of measures as well as disaggregated reports may be needed. These changes in the assessment program are depicted in the figure above as falling on a continuum that quite likely reflects both decreasing numbers of students and increasing amounts of change in moving from left to right.
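The continuum can be restated as a simple decision rule. The sketch below is one possible reading of the preceding paragraphs, not a classification scheme given in the paper: the class `AssessmentChange`, its three fields, and the functions `classify` and `can_aggregate` are all hypothetical names, and judging whether a construct has changed is assumed to have been done already by whatever model of evidence is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssessmentChange:
    """Hypothetical summary of how a test administration departs from standard."""
    any_change: bool          # was anything changed in giving or taking the test?
    construct_changed: bool   # does the change alter what is being measured?
    uses_primary_test: bool   # is the primary instrument or test still used?


def classify(change):
    """Place a testing change on the continuum, moving from left to right."""
    if not change.uses_primary_test:
        return "alternate assessment"
    if change.construct_changed:
        return "modified test"
    if change.any_change:
        return "accommodated test"
    return "standard test"


def can_aggregate(category):
    """Scores are aggregated only when the construct has not changed."""
    return category in ("standard test", "accommodated test")


extended_time = AssessmentChange(any_change=True, construct_changed=False,
                                 uses_primary_test=True)
category = classify(extended_time)
print(category, can_aggregate(category))
```

The rule makes explicit that the decisive question is whether the construct changes. The other two fields only locate the change on the continuum.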

As can be seen in the table of accommodations provided by NCEO (see Table 1), tests can be administered in a variety of ways (e.g., individually or in whole class, in one session or in multiple sessions); such changes also can reflect the way in which the test is taken or behavior is sampled (e.g., orally responding for a scribe to write rather than writing). Four general classes of changes in testing practices have been used by various state departments: (a) timing/scheduling, (b) setting, (c) response, and (d) presentation (Thurlow, Ysseldyke, & Silverstein, 1995). Furthermore, responses can be changed through different test formats or with the use of assistive devices. Examples of the more obvious response accommodations include increasing the spacing on the test, using graph paper, using wider lines and/or wider margins, giving the response orally, using paper in an alternative format (word or line processed, Braille, etc.), and allowing the student to mark responses in a booklet instead of on an answer sheet. Likewise, within presentation changes, a distinction is made between changes to the test directions and the use of assistive devices or support changes. These authors also list a number of examples in which the test presentation format is changed by (a) using Braille, magnifying equipment, or large print; (b) signing directions; (c) interpreting directions; and finally, (d) orally reading the directions.

 

Table 1. Assessment Accommodations NCEO/1996-Working Draft

Timing/Scheduling
  • Flexible schedule
  • Allow frequent breaks during testing
  • Extend the time allotted to complete the test
  • Administer the test in several sessions, specify duration
  • Provide special lighting
  • Time of day
  • Administer test over several days, specify duration
  • Provide special acoustics
Setting
  • Administer the test individually in a separate location
  • Administer the test to a small group in a separate location
  • Provide adaptive or special furniture
  • Administer test in locations with minimal distractions
  • In a small group, study carrel, individually



Presentation
  • Braille edition or large-type edition
  • Prompts available on tape
  • Increase spacing between items or reduce items/page-line
  • Increase size of answer bubbles
  • Reading passages with one complete sentence/line
  • Multi-choice, answers follow questions down bubbles to right
  • Omit questions which cannot be revised, prorate credit
  • Teacher helps student understand prompt
  • Student can ask for clarification
  • Computer reads paper to student
  • Highlight key words or phrases in directions

Test Directions

  • Dictation to a proctor/scribe
  • Signing directions to students
  • Read directions to student
  • Reread directions for each page of questions
  • Simplify language in directions
  • Highlight verbs in instructions by underlining
  • Clarify directions
  • Provide cues (e.g. arrows and stop signs) on answer form
  • Provide additional examples

Presentations-Assistive Devices/Supports

  • Visual magnification devices
  • Templates to reduce visible print
  • Auditory amplification device, hearing aid or noise buffers
  • Audiotaped administration of sections
  • Secure papers to work area with tape/magnets
  • Questions read aloud to student
  • Masks or markers to maintain place
  • Questions signed to pupil
  • Dark heavy or raised lines or pencil grips
  • Assistive devices (please specify)
  • Amanuensis (scribe)
Response

Test Format

  • Increase spacing
  • Wider lines and/or wider margins
  • Graph paper
  • Paper in alternative format (word processed, Braille, etc.)
  • Allow student to mark responses in booklet instead of answer sheet

Responses-Assistive Devices/Supports

  • Word processor
  • Student tapes response for later verbatim transcription
  • Typewriter
  • Communication device
  • Alternative response such as oral, sign, typed, pointing
  • Brailler
  • Large diameter, special grip pencil
  • Copy assistance between drafts
  • Slantboard or wedge
  • Tape recorder
  • Calculator, arithmetic tables, abacus
  • Spelling dictionary or spell check

Although these sets of accommodations appear to make considerable sense in the practical world of large-scale assessment and while many states have adopted some of these and others (Siskind, 1993b), it is difficult to justify or explain why they are not uniformly adopted: Some states allow use of some of these accommodations and others do not allow them to be used (Thurlow, Scott, & Ysseldyke, 1995; Thurlow, Seyfarth, Scott, & Ysseldyke, 1997).


Models for Making Decisions about Task Comparability

In this paper, three models are presented for determining task comparability, reflecting both the current state of decision-making and providing a strategy for enhancing current practices. The first model is descriptive. It focuses on policy presentation, interpretation, and analysis. This approach is similar to that used in setting standards, in which judgments are made for passing scores. When setting standards, the judgment is about the cut-score; in identifying accommodations, the judgment is about task and response (score) comparability. The second major model in determining task comparability is comparative. This approach provides empirical, post-hoc evaluation information on implementation of accommodations to help develop or revise policies. The third model is experimental. It provides a more formal data collection system to control threats to validity (internal and external). This model implies control over the selection of participants rather than the use of intact groups and the assignment of subjects to conditions rather than post-hoc evaluations; both features help establish cause-effect relationships. Two types of experimental designs are available for studying groups and single cases.

In all three models, the common focus is on determining whether the construct being measured is modified when testing conditions or tasks are changed. Although differences exist among models in the emphasis given to data to inform the decision-making process, all three models or combinations of them can be used to make decisions on task comparability. Therefore, the models should not be viewed as completely separate. For example, with a policy model, judgments range from simple reference to the policy to a more formal analysis of it. From a post-hoc evaluation perspective, decisions about accommodations are based on multiple data sources and the relationship among variables, ranging from process-implementation data to student outcome data, though in all instances the data are within extant structures and procedures. Finally, with an experimental approach, task comparability rests on well-controlled designs for collecting data and making inferences from findings. For both types of experiments, group or single case, threats to internal validity are minimized so that cause-effect statements can be made. When groups of students are studied, the designs call for matching treatments (types of accommodations), students, and outcomes. When single cases are studied, a functional analysis is used to hypothesize and/or verify important distinctions in tasks or student responses.

In Figure 2 below, these models have been placed on a continuum from less to more data-based decisions. On the left, policy and data to inform policy are endlessly intertwined; moving to the right, data are used to justify, explicate, and validate policy, first being created from within policy and then being generated outside of the policy with increasing levels of sophistication and use.

 

Figure 2. Models and Types of Evidence

[Graphic: Models and Types of Evidence]

 

This process for making decisions should help schools fulfill the mandates of the newly reauthorized IDEA legislation. In particular, it should help states ensure inclusion, provide accommodations or modifications in large-scale assessment programs, and report outcomes. Schools now are required to include students with disabilities in all district and state testing programs, as well as provide appropriate accommodation when necessary; therefore, systematic procedures are needed for classifying accommodations. Furthermore, for students exempted from taking such tests, not only must this decision be explained but alternate, modified measures need to be provided; these measures should be sensitive to individual student needs. Finally, performance must be reported for students with disabilities, whether participating in accommodated or modified testing programs. Performances will have to be reported both aggregated with other students, and disaggregated. If performance cannot be reported in the aggregate, then it still needs to be disaggregated and reported.
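The reporting requirement can be sketched as a small computation. The example below is an assumption-laden illustration rather than a reporting format prescribed by IDEA or by this paper: the function `report`, the record layout, and the choice to disaggregate by testing condition are hypothetical, and the scores are invented. Scores judged comparable enter the aggregate, while every score still appears in a disaggregated result.

```python
from collections import defaultdict
from statistics import mean


def report(records):
    """Return (aggregate mean, means disaggregated by testing condition).

    Each record is a tuple (condition, score, comparable). Only scores
    marked comparable enter the aggregate; all scores are disaggregated.
    """
    by_condition = defaultdict(list)
    aggregate = []
    for condition, score, comparable in records:
        by_condition[condition].append(score)
        if comparable:
            aggregate.append(score)
    disaggregated = {cond: mean(scores) for cond, scores in by_condition.items()}
    aggregate_mean = mean(aggregate) if aggregate else None
    return aggregate_mean, disaggregated


# Invented records for illustration only.
records = [
    ("standard", 30, True),
    ("standard", 34, True),
    ("accommodated", 26, True),
    ("modified", 20, False),  # construct changed, so kept out of the aggregate
]
aggregate_mean, disaggregated = report(records)
print(aggregate_mean, disaggregated)
```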

In Figure 2, the standard for making judgments appears as a continuum with an increasing scale of evidence. Furthermore, as states formulate policy, a very uneven mix of evidence may co-exist across the decision areas of inclusion, accommodation-modification, and reporting outcomes. Some states may have a better evidentiary base in some areas than others. Furthermore, with states changing rapidly in their ramping up to meet the legislative mandates, a quickly changing landscape is in the offing. Therefore, in the examples noted below, actual state policy may be different now; nevertheless, at the time to which it is referenced, it provides a good example of the six types of evidence within the three models.


Descriptive Model

Three types of evidence are used in a descriptive model. They rely on current policy for decision-making. They range from a simple presentation of policy with no other information, to a justified presentation, and finally an analysis of policy. They all use a common language that informs others through policy, with varying degrees of explanation or justification. Little external information, however, is presented in the policies to ascertain the worth of either the judgments or the policies, hence the term descriptive. Examples of external information include presentation of information outside of the policy itself (e.g., from other states’ policies).

Policy Presentation

In this type of evidence, decisions about task comparability are made by referring to policy. Only slight variations in the degree of justification for the decisions are made. In one extreme of the descriptive model is a simple list of allowable accommodations that are assumed to define task comparability.

 

Policy Presentation Example One.

In an August 4th, 1997 memorandum on allowable test accommodations from the Nevada Department of Education, the following statement is made about the guidance available for accommodations: "We do not have a system for answering questions. Any questions regarding accommodations are directed to Special Education consultants and our consultant in charge of statewide assessments...We have no resources available to answer questions outside of the enclosed document and the individual consultants." An accompanying appendix is presented with the memorandum outlining the policy for testing exceptional students, including permissible accommodations for exceptional students, which "have been judged as not violating the nature, content, or integrity of the test" (p. D2): test setting, scheduling, test directions, test format, test answer mode, and use of mechanical and non-mechanical aids. Within each of these accommodations, several specific applications are presented with a directive that they should not be interpreted broadly.

 

Some states simply present policy to define comparability, with little justification provided. In general, the major guiding premise has been that the accommodation be explicitly listed as allowable in state or district policy or test publisher’s manual and that it be described on the IEP, therefore making it part of the daily instruction and test situations. In the other extreme are policy directives with explanatory notes or interpretations. In these cases, the reasoning behind assumed comparability or noncomparability is explained in the policy.

 

Policy Presentation Example Two.

The Hawaii State Test of Essential Competencies (HSTEC) contains a similar policy presentation in its Guidelines and Procedures for Students with Disabilities (Hawaii Department of Education, September, 1996). Although slightly more expansive with justifications, the decision to consider accommodation as changing the nature of the task rests on two principles: (a) "The HSTEC may never be read to a student because of variation provided by different readers" (p. 3), and (b) consideration of whether the skill being tested is changed by the use of an accommodation. For example, "essential competency #1 requires the student read and use printed material from daily life. Because of the demands of the competency, Essential Competency #1 items must always be read independently by each student" (p. 3). Essential Competency #5, Math Computation, requires that students demonstrate their use of computational skills and therefore any items within this competency must be performed without a calculator.

 

Policy Presentation Example Three.

Mississippi includes a list of allowable accommodations organized around seating/setting, scheduling, format, and recording/transferring. In general, appropriate accommodations must (a) not affect the validity of the test, (b) function only to allow the test to measure what it purports to measure, and (c) be narrowly tailored to address a specific need in order to justify the request (Mississippi Assessment System Exclusions and Accommodations, revised, 1995). All three criteria must be met for the accommodation to be allowed. Two of the four tests used in this accountability system, The Iowa Tests of Basic Skills and the Tests of Achievement and Proficiency, allow for very limited accommodations in any of those areas listed. In some specific examples of accommodations, the student’s scores are not to be included in summary statistics because the student’s results "cannot be interpreted in the same manner" as the results of students who meet the qualifications for test standardization procedures. Whenever unallowable accommodations are utilized, or when any Special Education, 504, or LEP student has met the criteria for exclusion but elects to take the test anyway, the scores will be excluded from the summary statistics. While these students’ scores are not included in the summary statistics, the policy expresses the belief that the results provide valuable information that should be used as a tool to tailor the student’s educational goals.
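The logic of the Mississippi policy as summarized above can be written as two rules. This is a minimal sketch of the text of this paragraph, not code or wording from the state's policy document; the function and parameter names are hypothetical.

```python
def accommodation_allowed(affects_validity,
                          serves_only_intended_measurement,
                          narrowly_tailored):
    """All three criteria, (a) through (c), must be met for approval."""
    return ((not affects_validity)
            and serves_only_intended_measurement
            and narrowly_tailored)


def include_in_summary(used_unallowable_accommodation,
                       met_exclusion_criteria_but_tested):
    """Scores are left out of summary statistics in either case named above."""
    return not (used_unallowable_accommodation
                or met_exclusion_criteria_but_tested)


# An accommodation meeting (a) and (b) but not narrowly tailored is refused.
print(accommodation_allowed(False, True, False))
# A student tested with only allowable accommodations is counted.
print(include_in_summary(False, False))
```

Even when `include_in_summary` returns False, the score is still reported to inform the student's educational goals. It is only withheld from the summary statistics.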

 

Policy Interpretation

When policy is interpreted, not just documented, explanations are provided that help guide the decision-making process. Although little external information is presented outside of the policy, the judgment process is clearly more transparent and can be internally cross-referenced. The two examples of such systems are from Texas and Maryland. Policy interpretation may anticipate implementation strategies, for example, with a list of allowable accommodations to be distributed to parents, students, and school personnel; training and technical support may then be provided to help educators select appropriate accommodations. Interpretations may help teachers understand which accommodations are appropriate for students: how to base them on the needs of individual students and how to ensure that accommodations have been used during instruction prior to testing.

 

Policy Interpretation Example One.

In Texas, a list of accommodations is presented in a test manual followed by a review of written comments from stakeholders suggesting the need for clearer guidance in the manual on what accommodations are allowable; it was reported that accommodations were interpreted differently across the state. On one end was the argument for individualized decisions about use of specific accommodations while the other end was based on an argument for all accommodations and modifications used in instruction to be allowed. The following revised policy on test accommodations was proposed: Disseminate widely a comprehensive list of allowable test modifications and train educators to use them. Continue to allow schools or districts to request additional accommodations. Justification: Current terminology regarding the use of accommodations that do not affect the validity of the assessment may lead to varying interpretations of what is and is not allowed. Some stakeholders asserted that teachers or administrators might be unwilling to provide accommodations currently allowed by the agency because there is little evidence available concerning the effects of accommodations on test validity (Thurlow, Ysseldyke, & Silverstein, 1995). The stakeholders also suggested that other teachers or administrators might be unaware of what is and is not permissible. This proposed change directly ties assessment accommodation to the IEP and to classroom instruction and testing. It will promote testing situations that reflect classroom practice and will preserve the ARD committee's primary role in decisions about appropriate accommodations. Also, by providing any additional detail in the list of accommodations permitted on assessments, this policy will increase the participation of students receiving special education services in the assessments, and students receiving special education services will have a better chance to demonstrate what they know and can do (p. 18).

 

Policy Interpretation Example Two.

In Maryland’s Guidelines for Accommodations, Excuses, and Exemptions (revised, 9/3/96) a listing of general principles is provided, as well as definitions and procedures for the Maryland Functional Testing Programs (MFTP), the Comprehensive Tests of Basic Skills (CTBS, 5th edition), and the Maryland School Performance Assessment Program (MSPAP). The general principles include (p. 2):

  • "accommodations are made to ensure valid assessment of a student’s real achievement."

  • "accommodations must not invalidate the assessment for which they are granted...accommodations must be based upon individual needs and not upon a category of disability, level of instruction, environment, or other group characteristics."

  • "accommodation must have been operational in the student’s ongoing instructional program and in all assessment activities during the school year; they may not be introduced for the first time in the testing of an individual."

  • "the decision of the validity or efficacy of not allowing an accommodation for testing purposes does not imply that the accommodation cannot be used for instructional purposes."

A summary is presented of five broad categories of accommodation: scheduling, setting, equipment, presentation, and response, which further include a list of specific accommodations permitted for all three types of tests. In general, few differences exist for the three types of tests, though interesting inferences can be deduced where allowable accommodations are differentiated. For example, calculators can be used with the Functional Testing Program, cannot be used with the CTBS/5, and can be used but invalidate the mathematics score with the MSPAP. A similar pattern exists for the use of electronic devices (allowed with MFTP, not allowed with CTBS/5, and invalidate the language usage score with the MSPAP).
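The differentiated rules just described amount to a lookup keyed by accommodation and test. The sketch below encodes only the two accommodations discussed in this paragraph. The table name `RULES`, the function `lookup`, and the tuple layout are hypothetical and are not part of the Maryland guidelines.

```python
# (accommodation, test) -> (permitted, score invalidated by its use, if any)
RULES = {
    ("calculator", "MFTP"): (True, None),
    ("calculator", "CTBS/5"): (False, None),
    ("calculator", "MSPAP"): (True, "mathematics"),
    ("electronic device", "MFTP"): (True, None),
    ("electronic device", "CTBS/5"): (False, None),
    ("electronic device", "MSPAP"): (True, "language usage"),
}


def lookup(accommodation, test):
    """Describe whether an accommodation may be used on a given test."""
    permitted, invalidated_score = RULES[(accommodation, test)]
    if not permitted:
        return "not allowed"
    if invalidated_score is not None:
        return f"allowed, but invalidates the {invalidated_score} score"
    return "allowed"


for test in ("MFTP", "CTBS/5", "MSPAP"):
    print(test, "calculator:", lookup("calculator", test))
```

Laying the rules out this way shows the inference the paragraph points to: the same accommodation is treated as leaving the construct intact on one test and as changing it on another.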

Finally, in explicating the decision-making process, a series of case studies is presented in which students are described with information on student background and classroom functioning. For example, for Student #3 from an elementary school, a calculator is recommended as an allowable accommodation: He is described as a student with learning disabilities, memory problems with no mastery of facts, and an Individualized Educational Plan (IEP) addressing goals and objectives in reading, mathematics, and written expression. No statement is made about participation in statewide assessments. For middle school student #4, who is identified with learning disabilities and IEPs in written communication/basic reading skills, participation is required in all classroom, system, and state testing programs. Furthermore, dictation of a response is allowed (with verbatim transcription by school personnel) for extended response tasks. In contrast, another middle school student (#7) is described with similar needs. Although it is recommended that the student participate in all classroom, system, and functional tests with similar accommodations (dictating the response to an examiner for verbatim transcription), the student is exempted from the reading, written language, and oral presentation portions of the MSPAP "due to required accommodations that invalidate the test."

 

The use of case studies to explicate the decision-making process provides a useful range of examples by focusing on minimal differences: Only slight nuances distinguish two examples from each other, one of which is a positive instance and the other a negative instance. This strategy is an efficient way to explicate the central feature of a concept (i.e., accommodation versus modification) because the irrelevant features of the concept can be ignored and attention directed to those that are minimally different.

 

Policy Implementation Analysis

At some level, task comparability may be considered by tracking the implementation of various accommodations. In conducting this kind of analysis, data are collected to understand factors associated with the use of accommodations. The focus of this perspective is to understand the degree to which accommodations are proposed and/or used within the context of other (student) mitigating factors so that one could begin to explain why some accommodations tend to be selected.

Policy Implementation Example One.

In Rhode Island, the following questions focus on student needs with respect to the assessment requirements. They represent questions the IEP team should be able to answer to identify needed accommodations:

  1. Can the student work independently?
  2. Can the student work in a room with 25 to 30 other students in a quiet setting?
  3. Can the student work continuously for 45-60 minutes?
  4. Can the student listen and follow oral directions?
  5. Can the student use paper and pencil to write paragraph length responses to open-ended questions?
  6. Based on the sample questions, can the student read and understand these questions?
  7. Can the student manipulate a tag board ruler and various tag board shapes in small sizes?
  8. Can the student operate a calculator?
  9. Can the student follow oral directions in English?
  10. Can the student write paragraph length responses to open-ended questions in English?
  11. Based on the sample questions, can the student read and understand these questions in English?

"No" to any of these questions for one or more students should result in identification of an appropriate accommodation. It is then recommended that the principal and/or relevant school and district staff be consulted on making any accommodations recommendations.
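The screening rule described above (any "No" answer signals that an accommodation should be identified) can be sketched in code. This is only a minimal illustration, not a procedure taken from the Rhode Island materials: the short question keys and the function name are hypothetical labels invented here, and the sketch flags which questions point to a need without saying which accommodation is appropriate.

```python
# Hypothetical short keys standing in for the eleven questions listed above.
QUESTIONS = [
    "works_independently",
    "works_in_room_of_25_to_30",
    "works_continuously_45_to_60_min",
    "follows_oral_directions",
    "writes_paragraph_responses",
    "reads_sample_questions",
    "manipulates_tag_board_ruler_and_shapes",
    "operates_calculator",
    "follows_oral_directions_in_english",
    "writes_paragraph_responses_in_english",
    "reads_sample_questions_in_english",
]


def questions_needing_accommodation(answers):
    """Return the questions answered "No" (False) for one student.

    `answers` maps every question key to True ("Yes") or False ("No").
    A non-empty result means the team should identify an accommodation.
    """
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"unanswered questions: {missing}")
    return [q for q in QUESTIONS if not answers[q]]


# Example: a student who cannot work continuously for 45-60 minutes.
example = {q: True for q in QUESTIONS}
example["works_continuously_45_to_60_min"] = False
flagged = questions_needing_accommodation(example)
```

Keeping the output as a list of flagged questions, rather than a single yes/no, preserves the link between the student's need and the accommodation that the team later documents.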

 

 

By analyzing the degree to which students arrive at testing situations with accompanying problems such as those noted above, it should be possible to understand why some accommodations are recommended while others are not recommended. For example, rather than simply noting that accommodations include separate testing with flexible scheduling, it would be possible to justify an accommodation based on the student’s capacity to work in a group, work independently, and remain on-task for 45-60 minutes. For students who cannot be engaged in this manner, it can be inferred that the performance is adversely affected and the test is not measuring what it should be measuring. This information, documented by the IEP team, would help teachers make decisions about accommodations.

In reporting accommodations data, meaning and the basis for understanding are provided by the relationships among the variables. For example, the same data reported for two groups of students may lead to an understanding of why an accommodation is appropriate or permitted some of the time for some students and not at other times for different students. In the data reported by CCSSO, considerable differences exist in the number of states allowing various accommodations for students with disabilities versus those with limited English proficiency (see Table 2, which reflects accommodations permitted by various states, ordered from most to least frequent for students with disabilities).

 

Policy Implementation Example Two.

The Council of Chief State School Officers conducts an annual survey of state testing practices. In their latest published report, the following data have been presented on the number of states permitting various types of accommodation for students with disabilities and students with limited English proficiency (LEP). These data have been reprinted in the NCES document by Olson and Goldstein (1997).

Table 2. Number of States that Permit Accommodation for Students

Type of Accommodation                 With Disabilities    LEP
Large Print                           34                   10
Braille or Sign Language              33                   8
Small Group Administration            33                   15
Flexible Scheduling                   33                   15
Separate Testing Session              31                   17
Extra Time                            30                   14
Audiotaped Instructions/Questions     27                   9
Multiple/Extra Testing Sessions       25                   9
Word Processor                        21                   8
Simplification of Directions          15                   11
Audiotaped Responses                  12                   4
Other Accommodation                   12                   10
Use of Dictionaries                   9                    9
Alternative Test                      6                    3
Other Languages                       2                    4

SOURCE: The Status of State Student Assessment Programs in the United States (CCSSO/NCREL 1996).

 

Further distinctions can be made comparing types of decisions (individual accountability versus group evaluation) or types of test instruments (published, norm-referenced versus state-specific, standards-based) to help inform policy about implementation, particularly the implications. In effect, task comparability is likely a function of the students being tested, the decision being made, and the type of test being used.

Clearly, policy implementation analysis begins to move task comparability judgments into a comparative realm, with data collected at various levels of the testing program. In the example cited earlier (see Policy Implementation Example Two), the primary information was the number of states permitting an accommodation for various types of students. Another strategy would be to collect data on the actual frequency of use, as was done in Missouri (see Policy Implementation Example Three).

 

Policy Implementation Example Three.

The following data are reported in the 1996 Missouri Mathematics Field Test: Accommodations and Frequency of Use for 5th grade, presented in April, 1997 at the State Collaborative on Assessment and Student Standards (SCASS) meeting. Out of 8,700 students tested, administration of the test was changed with (a) oral reading for 212 students, (b) other unknown accommodation for 96 students, (c) repeated directions for 51 students, (d) oral translation for 12 students, and (e) amplification equipment for 4 students. Various other accommodations (e.g., Braille, signing, etc.) were implemented for a few students. Finally, other accommodations in the timing of the test also were implemented with varying frequencies as noted in Table 3 below.

Table 3. Missouri Data on Timing Accommodation Frequencies

Timing                  Accom. Freq.    Percent    Cum. Freq.    Cum. Percent
Total                   8,151           89.8       8,151         89.8
Extended Time           98              1.1        8,249         90.9
More Frequent Breaks    6               0.1        8,255         90.9
Several Sessions        28              0.3        8,283         91.2
Different Days          795             8.8        9,078         100.0
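The percent and cumulative columns in Table 3 follow from the frequency column alone. As a check on that arithmetic, the following minimal sketch recomputes them. The row labels and counts are copied from the table as printed; the function and variable names are illustrative only.

```python
# Row labels and frequencies as printed in Table 3.
FREQUENCIES = [
    ("Total", 8151),
    ("Extended Time", 98),
    ("More Frequent Breaks", 6),
    ("Several Sessions", 28),
    ("Different Days", 795),
]


def timing_table(frequencies):
    """Return rows of (label, freq, percent, cumulative freq, cumulative percent)."""
    grand_total = sum(freq for _, freq in frequencies)
    rows = []
    running = 0
    for label, freq in frequencies:
        running += freq
        rows.append((
            label,
            freq,
            round(100.0 * freq / grand_total, 1),
            running,
            round(100.0 * running / grand_total, 1),
        ))
    return rows


for row in timing_table(FREQUENCIES):
    print(row)
```

The recomputed values match the printed table, with percentages taken over the 9,078 entries in the frequency column.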

 

Other perception and survey data also could have been collected. For example, how important are accommodations for helping students participate and succeed (as judged by students, parents, and teachers), and to what degree are there differences among various groups of people? How much better did certain students do when provided accommodations than comparable students who were not provided the same or similar accommodations? When questions like these are asked, judging comparability begins to move toward a comparative model involving post-hoc analyses. The shift is from collecting singular data within systems and making inferences across them to collecting multiple data within a system and correlating the information. In this shift, policy analysis moves to post-hoc evaluation.

 

Comparative Model

The comparative model shifts the definition of task comparability to include multiple data sources within a system. This model incorporates frequency data on use of accommodations, judgments about their appropriateness, and outcomes data on performance when accommodations are used. With these data sources, relational statements can be made about process and performance.

 

Comparative Model Example One.

In a study by Grise, Beattie, and Algozzine (1982), about 350 students in fifth grade took the Florida State Student Assessment Test with seven different changes made in the format of the test: (a) items were presented within a hierarchical progression of skills, (b) multiple-choice options had bubbles placed to the right, (c) the shape of the bubble was elliptical, (d) sentences were not broken to make a fully justified paragraph, (e) reading passages were placed in shaded boxes, (f) examples were included for each skill, and (g) directional symbols were used for continuing and stopping. Finally, the test also was enlarged by 30%.

They found that students with learning disabilities performed slightly higher on the regular print version (vs. the enlarged version) on only one of six subsections. Yet, they found that 20% to 30% more of the students who were administered the modified version performed at mastery levels in various subsections of the test.

In a comparable study using the same modifications, Beattie, Grise, and Algozzine (1983) investigated the effects for a third grade sample of students (n = 345). Again, they found few differences on most subsections when comparing performance on the regular print version versus the enlarged print version. And, as in the other study, more students with learning disabilities mastered most of the skills when taking the modified test; on many skills, 20% more students reached mastery levels when the modified version was used than when the test was taken under standard conditions.

 

As the next example illustrates, one of the limitations of post-hoc evaluations is the sample used, in particular its representativeness to other samples. Furthermore, little information is known about how the accommodation was related to previous instructional programs and other sources of influence that may have been operational at the time of testing. Therefore, it is difficult to determine the exact cause of performance differences with and without accommodation.

 

Comparative Model Example Two.

Several studies on test accommodations were conducted by Educational Testing Service (ETS) on the Graduate Record Examination (GRE) and the Scholastic Aptitude Test (SAT) (Willingham, Ragosta, Braun, Rock, & Powers, 1988). They analyzed test scores of students who took the tests over several years, examining the scores of those with and without disabilities to compare the effects when accommodations were used versus when they were not used. The accommodations included alternative test formats (modifying the presentation by using Braille or audio presentations), assistive devices, and separate locations. While they considered task comparability (test content, testing accommodations, and test timing), their primary concern was score comparability using indicators like reliability, factor structure, differential item functioning, prediction of performance, and admissions decisions.

In general, they found that between the standard and nonstandard (accommodated) administrations, there was (a) comparable reliability (Bennett, Rock, & Jirele, 1986; Bennett, Rock, & Kaplan, 1985, 1987); (b) similar factor structures (Rock, Bennett, & Kaplan, 1987); (c) similar item difficulties for examinees with and without disabilities (Bennett, Rock, & Kaplan, 1985, 1987); (d) noncomparable predictions of academic performance (with the nonstandard test scores less valid and test scores substantially underpredicting college grades for students with hearing impairments) (Braun, Ragosta, & Kaplan, 1986); and (e) comparable admissions decisions (Benderson, 1988). Furthermore, Willingham et al. (1988) found that although students with disabilities perceived the test to be harder, they performed comparably to peers without disabilities. They also found that college performance was overpredicted when extended time was allowed.

In the end, these researchers recommended that those analyzing any test results "(a) use multiple criteria to predict academic performance of disabled students, (b) give less weight to traditional predictors and more consideration to students' background and nonscholastic achievement, (c) avoid score composites, (d) avoid the erroneous belief that nonstandard scores are systematically either inflated or deflated, and (e) where feasible and appropriate, report scores in the same manner as those obtained from standard administrations" (ETS, 1990, Executive Summary Report).

 

Although these outcomes provide the field with a rich data source for considering accommodation, they represent post-hoc evaluation data that are confounded with several other variables. For example, the research is limited to college admission testing, which represents a limited group of tests taken by a limited group of students with disabilities (those who are both secondary students and college bound). The proportions of those with disabilities who participate in such tests are very small and may not be representative of the larger group. And, the tests are unlike those used in statewide accountability systems.

 

Comparative Model Example Three.

Koretz (1996) analyzed student outcomes in assessment that categorized accommodations into four major classes: (a) dictation, (b) oral reading, (c) rephrasing, and (d) cueing. He examined data on these accommodations singly and in combination along with actual test performance in grades 4 and 8. He also provided a detailed analysis of the population of students with disabilities including information on (a) the degree to which participation in the testing program was inclusive, and (b) the comparability of the population receiving accommodations to both a national sample and to others in the testing program who did not receive any accommodations. Based on the frequency of accommodation use, several comparisons were made to determine the effect of single and multiple accommodations. Specifically, three major analyses were used: (a) comparisons of those with disabilities receiving the accommodation to those in general education not receiving the accommodation, (b) predictions of accommodation influence on outcome when applied singly, and (c) performance on specific items and differential item functioning when receiving accommodations.

He found that when fourth grade students with mild retardation were provided dictation with other accommodations, they performed much closer to the mean of the general education population, and actually above the mean in science. Similar results occurred for students with learning disabilities. For students in grade 8, the results were similar but less dramatic. In a second analysis using multiple regression to obtain an optimal estimate of each single accommodation and then comparing predicted performance with the accommodation to that without the accommodation, dictation appeared to have the strongest effect across the subject areas of math, reading, and science, as well as across grade levels. This influence was significantly stronger than that attained for paraphrasing and oral presentation, respectively.

Finally, for the item level analyses, descriptive statistics, item-to-total test correlations, and differential item functioning were used to explain why some accommodations may have worked better than others. Students with disabilities performed more poorly on common items as did students in general education. These common items, however, were found to correlate consistently and highly with the total test, reflecting their adequacy as measures of performance. Finally, performance on test items correctly predicted group membership (receiving or not receiving accommodations). For those receiving no accommodations, statistically significant differences in correct performance appeared on 5 of 22 items while for those receiving accommodations, significant differences in correct performance appeared for 13 of 22 items (an equal number were more difficult and more easy).

Koretz asserted that the frequency of accommodation use was high (80% in 4th grade and 67% in 8th grade). Furthermore, he suggested that the four accommodations were biased in that students with disabilities scored comparably to those without disabilities: "The highest scoring group of mentally retarded students (those assessed with oral presentation, paraphrasing, and dictation) scored near the mean for nondisabled students in all subjects other than mathematics - an implausible result given that these students have generalized cognitive defects" (Koretz, 1996, p. 64).

 

As can be seen with these three examples, a post-hoc evaluation perspective provides considerable data on both process and outcome. Generally, the results can be organized so that the findings may be somewhat conclusive though the explanations for them are not as certain. The major problem with this approach for examining task comparability is that cause-effect relationships are difficult to establish using post-hoc evaluation data. For example, by using intact groups, many threats to internal validity exist about selection of subjects, as well as the mortality of subjects (those who drop out or appear with incomplete data), historical events that occur during the study period, and subject maturation (physical or psychological changes). Furthermore, subject selection may interact with any of these latter threats. For example, in many school systems, differential dropout rates occur for students from various ethnic backgrounds, reflecting a problem in which subject selection interacts with mortality. Another major limitation is the manner in which subjects are assigned to treatments (accommodation): It often is not clear and the representativeness of the subjects is uncertain. Finally, in most evaluations, no control or comparison groups are used to determine the differential effect of a treatment (accommodation). For example, while the two studies from Florida (Beattie, Grise, & Algozzine, 1983; Grise, Beattie, & Algozzine, 1982) provide important initial findings about the comparability of tasks, no general education students received the modified tests.

 

Experimental Model

In the experimental model, a research design is established before data are collected. Appropriate controls are implemented in the manner in which data are collected to ensure that conclusions have integrity. An experimental model has two components: (a) a research design for organizing the investigation and controlling threats to the validity of the findings, and (b) a technically adequate measurement system to scale behavior so that inferences can be made from the outcomes.

Either group designs or single-subject designs can be used within the experimental model. Both designs should address four specific threats to validity identified by Cook and Campbell (1979): (a) internal, to allow appropriate inferences about cause and effect; (b) statistical conclusion, to ensure that data are appropriately analyzed; (c) external, to consider other populations and settings for applying the findings; and (d) construct, to explain the theoretical network in which the findings are placed. Group designs for assessing the effects of accommodations typically involve comparing students with disabilities using and not using accommodations to either themselves or another group of students. Single-subject designs most often use the same student, comparing performance when an accommodation is used to performance when the accommodation is not used.

 

Group Designs

Three factors need to be considered in conducting a group design: (a) students, (b) treatments (accommodations), and (c) outcomes (measures). These three factors may be either crossed with each other (all factors are presented with each other) or nested (only some factors are presented with each other). Below is a study that depicts each of these relationships for two factors: students and treatment accommodations.

 

Group Designs Example One.

In a study of a response accommodation (marking the booklet versus using the standard bubble sheet) and an administration accommodation (reading the math test aloud versus the standard silently read administration), Tindal et al. randomly assigned students to various conditions. All students took the test with and without the response accommodation (crossed), but participated in only one of the two administration conditions (nested). While no differences were found between the two response conditions, both statistical and practical differences were found between the two administration conditions favoring the read-aloud accommodation (Tindal, Heath, Hollenbeck, Harniss, & Almond, in press).

The reason for using these two designs (with students taking part in both response conditions while only participating in one administration condition) was to control for two critical threats to internal validity. To be certain students did not perform better in one response condition than the other, simply because it was provided first, the order of responding was counter-balanced: Half the group marked the booklet first and half bubbled the answer sheet first. Yet, the investigators also assumed that potential drift might occur in the administration accommodation (i.e., contamination from exposure to the accommodation) and, therefore, one of the factors (subjects) was nested within another factor (treatment): Students were randomly assigned to participate in one or the other administration condition. These procedures are never used within program evaluations.
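The combination of a crossed factor and a nested factor described above can be illustrated with a small assignment routine. This is only a sketch under stated assumptions, not the procedure used by Tindal et al.: it assumes equal-sized administration groups and counterbalancing of response order within each group, and the condition labels, function name, and seed are hypothetical.

```python
import random


def assign_conditions(student_ids, seed=0):
    """Assign each student one administration condition (nested) and an
    order for taking both response conditions (crossed, counterbalanced)."""
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    ids = list(student_ids)
    rng.shuffle(ids)  # random assignment to administration condition

    half = len(ids) // 2
    groups = {"read_aloud": ids[:half], "standard": ids[half:]}

    plan = {}
    for administration, members in groups.items():
        for position, student in enumerate(members):
            # Alternate the order so half of each group marks the booklet first.
            if position % 2 == 0:
                order = ("booklet", "bubble_sheet")
            else:
                order = ("bubble_sheet", "booklet")
            plan[student] = {
                "administration": administration,
                "response_order": order,
            }
    return plan


plan = assign_conditions(range(1, 9))
```

Every student appears under both response conditions but under only one administration condition, which is the distinction between crossing and nesting drawn in the text.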

Because the study was done in fourth grade, however, generalization of the findings (external validity) may be limited to students who are learning to read versus reading to learn. Therefore, another read-aloud study was conducted. This second investigation used a video-taped read-aloud of math problems, in part to determine whether reading math helped older students and to ensure that the read-aloud condition was done consistently. In this study, 6th grade students were presented 30 math word problems in standard test booklet format and 30 problems read aloud by a trained reader (Helwig, Tedesco, & Tindal, in press). These researchers found that for math problems containing large numbers of both difficult and total words, students who had both low reading skill and high math skill performed significantly better by having the problems read aloud. For other students, the differences were either nonsignificant or in the opposite direction.

 

With an experimental model, not only must a research design be used to investigate the relationship among factors (students, treatments, and measures), but a valid measurement system must be used. Validation, however, is not of the tests or measures but of the inferences made from them (Messick, 1989). Messick employs a two-by-two matrix for understanding the construct validity of score meaning and social values. One facet is the source of justification for testing, which may consider evidence to understand score meaning or values to understand consequences. The other facet is the function or outcome from testing, which addresses how the test is to be interpreted or used. Crossing these two facets creates a four-celled table highlighting a unitary view of validity integrating meaning and values. The matrix is designed to build understanding in a progressive manner so that any construct being measured is neither underrepresented nor contaminated with irrelevant variance.

  1. The evidential basis of test interpretation reflects construct validity (CV) in addressing convergent-discriminant evidence; the focus of interpretation is primarily scientific and empirical.
  2. The evidential basis of test use focuses on the construct validity (CV) of performance in applied settings with the benefits of testing considered in relation to costs and relevance/utility (R/U).
  3. The consequential basis of test interpretation is comprised of construct validity (CV) with reference to broad theories and philosophical views, all of which address value implications (VI) and become embedded within score meaning. This block often "triggers" score-based actions.
  4. The consequential basis of test use considers construct validity (CV), relevance/utility (R/U) and value implications (VI), and potential as well as social consequences (SC) in applied settings, focusing on equity and fairness along with many other broad social interpretations, in a sense the functional worth of the test.

 

Figure 3. Facets of Validity as a Progressive Matrix

 

                       Test Interpretation                Test Use
Evidential Basis       1. Construct Validity (CV)         2. CV + Relevance/Utility (R/U)
Consequential Basis    3. CV + Value Implications (VI)    4. CV + R/U + VI + Soc. Conseq. (SC)

The validation process in this progressive matrix is never definitive but proceeds in an iterative manner with evidence interacting with values to create score meaning and interpretations. The validation process includes both evidence and value in an evaluative rhetoric of argumentation. Finally, validation is interpretative, addressing both the meaning of scores and the use of them to make decisions in applied settings (Messick, 1995).

 

Group Designs Example Two.

In this set of studies, a creative writing task was used that incorporates the statewide administration and scoring procedures with both handwritten and word-processed compositions. A total of 164 8th graders (152 general education, 12 special education) participated in the study. All students took part in two writing examinations. The first involved students composing handwritten essays over 3 days as part of the Oregon State Writing Assessment. The second occurred approximately 3 months later under identical conditions except students composed their essays on a word processor. All essays were scored by certified state raters using the state’s 6 trait analytic writing scale.

  1. CV: Handwritten and word processed essays were rated on six traits. A correlation matrix reveals stronger relationships between traits within a mode (handwritten or word processed) than between modes within traits (i.e., Ideas and Content, Voice, etc.). Factor analysis shows: (a) a handwritten factor with six traits and (b) a word-processed factor with the same six traits. The evidential basis for test interpretation is that separate trait scoring is not needed when students write their compositions by hand - they form one factor. The same is true when students write their compositions with computers - a single factor is found. However, these two factors are different from each other. One explanation is that the tasks (writing by hand versus writing with a computer) are not comparable (Helwig, Stieber, Tindal, Hollenbeck, Heath, & Almond, 1997).
  2. CV+R/U: When the original handwritten compositions are transcribed into a typed (word-processed) format, and then rated by the state judges, the handwritten composition is rated significantly higher than the typed composition on four of the six traits. The evidential basis of test use implies that the two tasks (writing compositions by hand and writing them with computers) are not comparable and should not be used in the same evaluation system. Their use is likely a function of teacher sensitivity to the format and until this confound can be removed from the judgment process, instructional programs based on the outcomes would be incorrectly recommended. For example, it may be presumed that a student not passing on Conventions, as judged on a word-processed composition, needs more instruction on this trait, when in fact, the handwritten composition reflects a passing score (Tindal, Hollenbeck, Heath, & Stieber, 1997).
  3. CV+VI: When students are allowed to use computers differentially, the judged quality of some traits appears to vary. For example, when students are allowed to use a spell checker as part of inputting the composition into the computer, ratings on Organization, Word Choice, Sentence Fluency, and Conventions are significantly higher than for students who had no such spell checker available during three days of writing with a computer. The consequential basis for test interpretation is that the theory of writing itself may need to be reconsidered. Presently, the writing process is generally viewed as linear in which brainstorming, writing, and editing are built into the test administration. Yet, with findings such as these, the process may be much more recursive in which editing and writing co-occur in a non-linear fashion. Furthermore, writing tools may help with different functions in the process - with spell checking not only serving to address problems in Conventions but also with word usage (Voice and Fluency) (Tindal, Hollenbeck, Heath, & Almond, 1997).
  4. CV+R/U+VI+SC: When the analysis of judgment reliability is not just score agreement (using both exact matches and adjacent agreements), but the consistency of making a decision about passing a standard, considerable discord is found. In fact, when disagreements occur, they are most likely to occur at the cut-score rather than at the extremes, ranging from 15% to 34%. The implication for test use from a consequential basis is that many students may be incorrectly denied a certificate of passing and be required to participate in alternative learning environments. Furthermore, curricular changes may be more systemically implemented, further limiting the options available in the language arts areas to primarily remediation rather than enrichment. As a consequence, students who exceed the cut score may not have enrichment courses available as schools focus their resources on students who have failed to pass at a minimum level (Hollenbeck, Tindal, Heath, & Almond, 1997).
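The contrast in the fourth cell, between score agreement and consistency of the pass decision, can be made concrete with a short calculation. This is a minimal sketch with invented ratings, not data from the Oregon studies: the two raters' scores, the cut score of 4, and the rule that a score at or above the cut passes are all assumptions made for illustration.

```python
def agreement_summary(rater_a, rater_b, cut_score):
    """Compare two raters on exact agreement, exact-or-adjacent agreement,
    and consistency of the pass/fail decision at a cut score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    pairs = list(zip(rater_a, rater_b))
    n = len(pairs)
    exact = sum(1 for a, b in pairs if a == b) / n
    exact_or_adjacent = sum(1 for a, b in pairs if abs(a - b) <= 1) / n
    # Assumed rule: a score at or above the cut score counts as passing.
    same_decision = sum(
        1 for a, b in pairs if (a >= cut_score) == (b >= cut_score)
    ) / n
    return {
        "exact": exact,
        "exact_or_adjacent": exact_or_adjacent,
        "decision_consistency": same_decision,
    }


# Hypothetical ratings of six compositions on one trait.
rater_a = [3, 4, 4, 5, 2, 3]
rater_b = [4, 4, 3, 5, 2, 4]
summary = agreement_summary(rater_a, rater_b, cut_score=4)
```

In this invented example every pair of ratings is within one point, yet the raters reach the same pass decision on only half of the compositions, because the one-point disagreements straddle the cut score.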

 

 

In summary, group design experimental studies need to rely not only on systematic and prospective data collection procedures but also on technically adequate measurement systems. By engaging in a prospective design, explanations of outcomes are less confounded and contain fewer threats to validity. Furthermore, experimental investigations with groups of students need to reflect a program of research in which studies are linked together not just to generate findings but also to inform decision-making. Tests need to be used and interpreted with classroom implications; they need to be validated in the context of both evidence and consequences in this interpretation and use. In the final analysis, however, group designs cannot be used to make predictions for individuals. Although the average (mean) performance may have been higher or lower in either of these groups, not all students may have been equally affected.

 

Single Case Studies

For individual decisions, only a behavioral approach, utilizing single case studies, can be considered. When the experimental research is aimed at judging task comparability for individual students, the research paradigm is anchored to the unique needs of the student and maximizes the relationship between the process and the outcomes. Using group designs, an accommodation may be found to be effective for students with disabilities and not for students without disabilities. Nevertheless, this general finding cannot be used to make predictions of effect or lack of effect for every specific student. All effects are based on group averages and reflect a likelihood for finding that effect with individual students.

Functional analysis focuses on the function of various environmental variables in controlling behavior within the context of reinforcement contingencies. In a functional analysis, the single case is used to define the comparability of a task or response. Generalizations across individual cases can proceed only tentatively, because the hallmark of a functional analysis is the focus on the individual case. When the function of behavior is experimentally analyzed, the proof of comparability is documented consistently within and across phases in which tasks have been changed. Task comparability can be considered from two perspectives with a functional analysis: either (a) to define a response class or (b) to infer the function of the behavior and response class.

Within a functional analysis approach, contingencies are considered in terms of specific reinforcement paradigms (i.e., positive and negative reinforcement, escape, avoidance, etc.). Basically, a class of responses has a common effect on the environment (the definition of an operant) and is therefore premised upon the assessment of environmental variables that influence the behavior or class of behaviors. Two caveats need to be considered in this perspective: (a) reinforcers are defined in terms of the probability of increasing or decreasing a behavior, and (b) a history of reinforcement "in turn, influences how an individual responds to current environmental contingencies" (Mace, 1994, p. 385). As a result, the same "reinforcers" may not function uniformly across individuals.

Response class analysis. In determining task comparability, it is important to entertain the possibility that different specific behaviors may function as part of the same response class. This provides a useful way to think about slightly different behaviors that are maintained in a similar manner in the environment. Response classes are used in behavior analysis to describe discrete behaviors that (a) may have different topographies but serve the same function (i.e., are controlled by the same reinforcement paradigm) or (b) may have similar topographies but serve different functions (i.e., are controlled by different reinforcement paradigms). Individual behaviors may belong to more than one response class (i.e., may have multiple controlling contingencies).

Behaviors that form response classes are easy to illustrate with social behaviors. For example, the many behaviors that reflect "compliance" include attending, making eye contact, initiating behaviors upon request, responding with speed and accuracy, etc., all of which have the same function: They reflect a connection between the "mand" (command or demand) and the response. Students who "comply" when interacting with others perform upon request and with attention to the task; those who do not "comply" delay their response or fail to perform it fully.

 

Single Case Studies Example One.

How does this apply to testing students with disabilities in large scale assessments? Consider a student with Attention Deficit-Hyperactivity Disorder (ADHD). A number of behaviors may be exhibited in the presence of academic tasks with problems of considerable difficulty (worksheets, tests, assignments, etc.). When presented with a paper-pencil task and given directions to read and complete problems independently, the student may exhibit many different behaviors, all of which result in attention from the teacher and eventually removal of the task: fidgeting, playing with pencils (tapping, chewing, throwing), talking out, making distracting noises, etc. Although these behaviors differ in topography, duration, frequency, or intensity, they all have the same function of providing the student escape from an aversive stimulus. When an IEP team meets, it may decide that the student needs to take the test in a one-to-one situation, when in fact the contingencies maintaining the behavior are embedded in the classroom. Therefore, a behavior management program also may be needed in the classroom to extinguish these problem behaviors.