Council of Chief State School Officers
State Collaborative on Assessment and Student Standards
Assessing Special Education Students (ASES) - Study Group III
Patricia Almond-Chairperson
Gerald Tindal
University of Oregon
March, 1998
Behavioral Research and Teaching
College of Education - 232 Education
5262 University of Oregon
Eugene, OR 97403-5262
(541) 346-1640
geraldt@darkwing.uoregon.edu
This paper was commissioned by a subgroup of the State Collaborative on Assessment and Student Standards (SCASS) focused upon Assessing Special Education Students (ASES). When students with disabilities take large-scale assessments, issues arise about testing accommodations and modifications. The SCASS-ASES group sought guidance as they addressed concerns about the effect of accommodations on the validity of test performance for students taking them. Knowledge about the impact of accommodations on assessment results was limited, so a paper about task comparability was conceived.
This document is the resulting paper, and it reflects the search for understanding about accommodations in large-scale assessments for students with disabilities. The SCASS-ASES Study Group Three proposed the project and its members acted as advisors during the development of the paper. They hope that the ideas presented in this paper spur discussion, inspire new research, and move educators forward in implementing the intent of the 1997 Amendments to the Individuals with Disabilities Education Act (IDEA).
The group wishes to acknowledge the dedication of Dr. Gerald Tindal, who researched and wrote the paper. He saw the project to completion, posed provocative questions, probed for answers, and provided rare insights into arguments that were raised by members of the group. He synthesized a collection of questions and conversations, as well as existing theories and knowledge in the field. His work has surpassed the initial vision, producing a fundamental perspective that already has filtered into the understanding of large-scale assessments and students with disabilities.
Martha Thurlow also should be acknowledged for her careful and insightful editing of this paper. Her edits made the language and ideas much clearer and more readable.
Study Group Three
Patricia Almond (Chairperson, OR), Sue Bechard (CO), Jana Deming (TX), Edna Duncan (MS), John Haigh (MD), Jan Kirkland (MS), Ken Olsen (MSRRC), Paula Ploufchan (CCSSO), Martha Thurlow (NCEO), Becca Walk (WY).
In this paper, three models are described for making decisions about the comparability of tasks when accommodations are used in large-scale assessments: (a) descriptive, (b) comparative, and (c) experimental. Within each of these models, criteria are presented that can be used to determine whether assessment tasks are considered similar (comparable) or different (noncomparable). The descriptive model involves the presentation or analysis of policy, providing historical or contextual information, and documentation. The comparative model involves the interpretation of policy, with or without data. When data are available and used, they provide post-hoc evaluations and the beginning of an empirical approach in which data are used in the decision-making process. Finally, in the experimental model, not only are data used in decision making, but threats to validity are controlled, allowing for statements of relationship or cause-effect to evaluate the impact of accommodations on decisions. The use of three models should help to clarify the reasoning behind judgments of accommodations and task comparability in large-scale assessment programs.
The 1997 amendments to the Individuals with Disabilities Education Act (IDEA) contain a strong directive for the participation of students with disabilities in large-scale testing. Yet the format of the assessments themselves is not specified. For example, the language of the reauthorized legislation includes the following statement: "Children with disabilities are included in general State and district-wide assessment programs, with appropriate accommodations, where necessary" [IDEA, section 612 (a)(17)(A)]. The question, then, is what constitutes an "appropriate" accommodation?
To what degree are accommodated and non-accommodated tasks within large-scale assessment programs comparable? If slight changes are made in the manner in which tasks are configured, administered, or responded to, perhaps these variations can be ignored and scores considered comparable. In contrast, if significant changes exist in the accommodated tasks, then the scores may not be similar and thus should not be compared. When are two tasks considered comparable and when are they considered not comparable?
The National Center on Educational Outcomes (NCEO) has reported that there "appears to be no formal consensus on the use of the terms accommodation, modification, and adaptation, [and] they are used interchangeably" (NCEO, 1993, p. 2). In a later draft of a position presented at a June 1996 State Collaborative on Assessment and Student Standards (SCASS) meeting, Ysseldyke describes an accommodation as "an alteration in how a test is presented to or responded to by the person tested; [it] includes a variety of alterations in presentation format, response format, setting in which the test is taken, timing, or scheduling. The changes are made in order to provide a student equal access to learning and equal opportunity to demonstrate what is known" (p. 1). Further clarification is provided in that the alterations should not substantially change the level of the test, the content of the test, or the performance criteria (what the test measures, i.e., construct validity).
Phillips (1994) argues that a differential effect is needed to help determine whether an accommodation is appropriate or not appropriate: The accommodation is effective for students with disabilities but is not effective for students without disabilities. If changing the test (how it is given and how it is completed) increases performance across the board for all students, then it may not be an appropriate accommodation. She specifically poses five issues to be addressed when considering accommodations that depart from normally used testing procedures:
This approach to determining appropriate accommodations is controversial for two reasons: (a) it is uncertain what the best comparison groups are - all students in general education, just low performers, or just average performers; and (b) it is not clear how students with disabilities should be sorted for comparisons - by disability category or by area in which services are provided (e.g., reading decoding, language processing, etc.). The finding of differential effects implies that there has been differential access when no accommodations are used. Equal access is achieved, for example, when an accommodation improves the performance of a student in that area where the student receives special education assistance, but does not improve the performance of average students not on IEPs.
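The differential-effect criterion described above can be illustrated with a small calculation. The following Python sketch compares the mean score gain under an accommodation for students with disabilities against the gain for students without disabilities. The function name `differential_boost` and all scores are hypothetical, invented only to show the arithmetic; they come from no study cited in this paper. A real analysis would also need a significance test and a defensible choice of comparison group, which is exactly the controversy noted above.

```python
from statistics import mean


def differential_boost(swd_standard, swd_accommodated, gen_standard, gen_accommodated):
    """Return (gain for students with disabilities,
               gain for general-education students,
               difference between the two gains)."""
    swd_gain = mean(swd_accommodated) - mean(swd_standard)
    gen_gain = mean(gen_accommodated) - mean(gen_standard)
    return swd_gain, gen_gain, swd_gain - gen_gain


# Hypothetical scores, invented for illustration only.
swd_standard = [10, 12, 14]        # students with disabilities, standard administration
swd_accommodated = [15, 17, 19]    # same group, accommodated administration
gen_standard = [20, 22, 24]        # general-education students, standard administration
gen_accommodated = [20, 23, 23]    # same group, accommodated administration

swd_gain, gen_gain, difference = differential_boost(
    swd_standard, swd_accommodated, gen_standard, gen_accommodated
)
print(f"Gain with disabilities: {swd_gain}, gain without: {gen_gain}, difference: {difference}")
```

Under the criterion, a large gain for students with disabilities paired with little or no gain for other students suggests the change restored access rather than altering the construct; a similar gain for both groups suggests it may not be an appropriate accommodation.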
A framework can be constructed for placing accommodations on a continuum with the following definition used to create four groups: An accommodation is a change that (a) provides unique and differential access (to performance) so certain students may complete the tests and tasks without other confounding influences but (b) does not change the nature of the construct being tested. Such changes typically are designed for specific individuals and for particular purposes.
Figure 1. Accommodations and Modifications

On the left side of the continuum (Figure 1 above) are a standard test and an accommodated test. In the accommodated test, minor changes are made in the way the test is given or taken; it may still be considered a "standard" assessment and reflects the same construct. The purpose of making accommodations is to provide access to participation in an assessment program with a primary test or instrument. The changes may be implemented within or across special education populations, referring to the unique needs of students in a manner that may or may not be disability oriented and/or related to an adverse educational impact. Because the construct has not changed, scores can be aggregated.
Changes also can be made for specific students that are substantial and modify the construct being tested. Although the assessment still focuses on documenting performance as part of an assessment program with a primary instrument or test, substantial changes are made in its administration and response. The net effect is that different tasks and noncomparable scores are created compared to those generated in the standard-accommodated assessment. As a result, some type of disaggregated reporting system may need to be considered.
At the most extreme end of the continuum, changes in test administration and responses lead to alternate assessments. For students with severe disabilities and unique needs, for whom the primary measure is not appropriate, such alternate assessments may be needed. Rather than documenting performance on the primary instrument or test, one might sample behavior using tasks uniquely created to meet the needs of the student. The tasks and the scores would be noncomparable to those reflected in the primary measures, and a different set of measures as well as disaggregated reports may be needed. These changes in the assessment program are depicted in the figure above as falling on a continuum that quite likely reflects both decreasing numbers of students and increasing amounts of change in moving from left to right.
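The continuum just described can be summarized as a short decision rule. The Python sketch below is one possible reading of it, not a procedure given in this paper: the names `AssessmentType`, `classify`, and `reporting`, and the three yes/no inputs, are assumptions introduced for illustration. In practice, judging whether the construct has changed is the hard part and cannot be reduced to a flag.

```python
from enum import Enum


class AssessmentType(Enum):
    STANDARD = "standard"
    ACCOMMODATED = "accommodated"
    MODIFIED = "modified"
    ALTERNATE = "alternate"


def classify(uses_primary_instrument: bool, changed: bool, construct_changed: bool) -> AssessmentType:
    """Place a testing situation on the continuum (hypothetical helper)."""
    if not uses_primary_instrument:
        # Tasks uniquely created for the student replace the primary measure.
        return AssessmentType.ALTERNATE
    if construct_changed:
        # Substantial changes that modify what is being tested.
        return AssessmentType.MODIFIED
    if changed:
        # Minor changes in how the test is given or taken; same construct.
        return AssessmentType.ACCOMMODATED
    return AssessmentType.STANDARD


def reporting(kind: AssessmentType) -> str:
    """Scores from the same construct can be aggregated; others may need separate reports."""
    if kind in (AssessmentType.STANDARD, AssessmentType.ACCOMMODATED):
        return "aggregated"
    return "disaggregated"


for case in [(True, False, False), (True, True, False), (True, True, True), (False, True, True)]:
    kind = classify(*case)
    print(kind.value, "->", reporting(kind))
```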
As can be seen in the table of accommodations provided by NCEO (see Table 1), tests can be administered in a variety of ways (e.g., individually or in whole class, in one session or in multiple sessions); such changes also can reflect the way in which the test is taken or behavior is sampled (e.g., orally responding for a scribe to write rather than writing). Four general classes of changes in testing practices have been used by various state departments: (a) timing/scheduling, (b) setting, (c) response, and (d) presentation (Thurlow, Ysseldyke, & Silverstein, 1995). Furthermore, responses can be changed through different test formats or with the use of assistive devices. Examples of the more obvious response accommodations include increasing the spacing on the test, using graph paper, using wider lines and/or wider margins, giving the response orally, using paper in an alternative format (word or line processed, Braille, etc.), and allowing the student to mark responses in a booklet instead of on an answer sheet. Likewise, within presentation changes, a distinction is made between changes to the test directions and the use of assistive devices or support changes. These authors also list a number of examples in which the test presentation format is changed by (a) using Braille, magnifying equipment, or large print; (b) signing directions; (c) interpreting directions; and finally, (d) orally reading the directions.
Table 1. Assessment Accommodations NCEO/1996-Working Draft
Timing/Scheduling | Setting | Presentation (Test Directions; Presentations-Assistive Devices/Supports) | Response (Test Format; Responses-Assistive Devices/Supports)
Although these sets of accommodations appear to make considerable sense in the practical world of large-scale assessment and while many states have adopted some of these and others (Siskind, 1993b), it is difficult to justify or explain why they are not uniformly adopted: Some states allow use of some of these accommodations and others do not allow them to be used (Thurlow, Scott, & Ysseldyke, 1995; Thurlow, Seyfarth, Scott, & Ysseldyke, 1997).
Models for Making Decisions about Task Comparability
In this paper, three models are presented for determining task comparability, reflecting both the current state of decision-making and providing a strategy for enhancing current practices. The first model is descriptive. It focuses on policy presentation, interpretation, and analysis. This approach is similar to that used in setting standards, in which judgments are made for passing scores. When setting standards, the judgment is about the cut-score; in identifying accommodations, the judgment is about task and response (score) comparability. The second major model in determining task comparability is comparative. This approach provides empirical, post-hoc evaluation information on implementation of accommodations to help develop or revise policies. The third model is experimental. It provides a more formal data collection system to control threats to validity (internal and external). This model implies control over the selection of participants rather than the use of intact groups and the assignment of subjects to conditions rather than post-hoc evaluations; both features help establish cause-effect relationships. Two types of experimental designs are available for studying groups and single cases.
In all three models, the common focus is on determining whether the construct being measured is modified when testing conditions or tasks are changed. Although differences exist among models in the emphasis given to data to inform the decision-making process, all three models or combinations of them can be used to make decisions on task comparability. Therefore, the models should not be viewed as completely separate. For example, with a policy model, judgments range from simple reference to the policy to a more formal analysis of it. From a post-hoc evaluation perspective, decisions about accommodations are based on multiple data sources and the relationship among variables, ranging from process-implementation data to student outcome data, though in all instances the data are within extant structures and procedures. Finally, with an experimental approach, task comparability rests on well-controlled designs for collecting data and making inferences from findings. For both types of experiments, group or single case, threats to internal validity are minimized so that cause-effect statements can be made. When groups of students are studied, the designs call for matching treatments (types of accommodations), students, and outcomes. When single cases are studied, a functional analysis is used to hypothesize and/or verify important distinctions in tasks or student responses.
In Figure 2 below, these models have been placed on a continuum from less to more data-based decisions. On the left, policy and data to inform policy are endlessly intertwined; moving to the right, data are used to justify, explicate, and validate policy, first being created from within policy and then being generated outside of the policy with increasing levels of sophistication and use.
Figure 2. Models and Types of Evidence

This process for making decisions should help schools fulfill the mandates of the newly reauthorized IDEA legislation. In particular, it should help states ensure inclusion, provide accommodations or modifications in large-scale assessment programs, and report outcomes. Schools now are required to include students with disabilities in all district and state testing programs, as well as provide appropriate accommodation when necessary; therefore, systematic procedures are needed for classifying accommodations. Furthermore, for students exempted from taking such tests, not only must this decision be explained but alternate, modified measures need to be provided; these measures should be sensitive to individual student needs. Finally, performance must be reported for students with disabilities, whether participating in accommodated or modified testing programs. Performances will have to be reported both aggregated with other students, and disaggregated. If performance cannot be reported in the aggregate, then it still needs to be disaggregated and reported.
In Figure 2, the standard for making judgments appears as a continuum with an increasing scale of evidence. Furthermore, as states formulate policy, a very uneven mix of evidence may co-exist across the decision areas of inclusion, accommodation-modification, and reporting outcomes. Some states may have a better evidentiary base in some areas than others. Furthermore, with states changing rapidly in their ramping up to meet the legislative mandates, a quickly changing landscape is in the offing. Therefore, in the examples noted below, actual state policy may be different now; nevertheless, at the time to which it is referenced, it provides a good example of the six types of evidence within the three models.
Descriptive Model
Three types of evidence are used in a descriptive model. They rely on current policy for decision-making. They range from a simple presentation of policy with no other information, to a justified presentation, and finally an analysis of policy. They all use a common language that informs others through policy, with varying degrees of explanation or justification. Little external information, however, is presented in the policies to ascertain the worth of either the judgments or the policies, hence the term descriptive. Examples of external information include presentation of information outside of the policy itself (e.g., from other states' policies).
Policy Presentation
In this type of evidence, decisions about task comparability are made by referring to policy. Only slight variations in the degree of justification for the decisions are made. In one extreme of the descriptive model is a simple list of allowable accommodations that are assumed to define task comparability.
| Policy Presentation Example One. In an August 4th, 1997 memorandum on allowable test accommodations from the Nevada Department of Education, the following statement is made about the guidance available for accommodations: "We do not have a system for answering questions. Any questions regarding accommodations are directed to Special Education consultants and our consultant in charge of statewide assessments...We have no resources available to answer questions outside of the enclosed document and the individual consultants." An accompanying appendix is presented with the memorandum outlining the policy for testing exceptional students, including permissible accommodations for exceptional students, which "have been judged as not violating the nature, content, or integrity of the test" (p. D2): test setting, scheduling, test directions, test format, test answer mode, and use of mechanical and non-mechanical aides. Within each of these accommodations, several specific applications are presented with a directive that they should not be interpreted broadly. |
Some states simply present policy to define comparability, with little justification provided. In general, the major guiding premise has been that the accommodation be explicitly listed as allowable in state or district policy or the test publisher's manual and that it be described on the IEP, therefore making it part of the daily instruction and test situations. At the other extreme are policy directives with explanatory notes or interpretations. In these cases, the reasoning behind assumed comparability or noncomparability is explained in the policy.
| Policy Presentation Example Two. The Hawaii State Test of Essential Competencies (HSTEC) contains a similar policy presentation in its Guidelines and Procedures for Students with Disabilities (Hawaii Department of Education, September, 1996). Although slightly more expansive with justifications, the decision to consider accommodation as changing the nature of the task rests on two principles: (a) "The HSTEC may never be read to a student because of variation provided by different readers" (p. 3), and (b) consideration of whether the skill being tested is changed by the use of an accommodation. For example, "essential competency #1 requires the student read and use printed material from daily life. Because of the demands of the competency, Essential Competency #1 items must always be read independently by each student" (p. 3). Essential Competency #5, Math Computation, requires that students demonstrate their use of computational skills and therefore any items within this competency must be performed without a calculator. |
| Policy Presentation Example Three. Mississippi includes a list of allowable accommodations organized around seating/setting, scheduling, format, and recording/transferring. In general, appropriate accommodations must (a) not affect the validity of the test, (b) function only to allow the test to measure what it purports to measure, and (c) be narrowly tailored to address a specific need in order to justify the request (Mississippi Assessment System Exclusions and Accommodations, revised, 1995). All three criteria must be met for the accommodation to be allowed. Two of the four tests used in this accountability system, The Iowa Tests of Basic Skills and the Tests of Achievement and Proficiency, allow for very limited accommodations in any of those areas listed. In some specific examples of accommodations, the students' scores are not to be included in summary statistics "because the students' results cannot be interpreted in the same manner as the results of students who meet the qualifications for test standardization procedures." Whenever unallowable accommodations are utilized or when any Special Education, 504, or LEP student who has met the criteria for exclusion but elects to take the test anyway, the scores will be excluded from the summary statistics. While these students' scores are not included in the summary statistics, we believe that the results provide valuable information that should be used as a tool to tailor the students' educational goals. |
Policy Interpretation
When policy is interpreted, not just documented, explanations are provided that help guide the decision-making process. Although little external information is presented outside of the policy, the judgment process is clearly more transparent and can be internally cross-referenced. The two examples of such systems are from Texas and Maryland. Policy interpretation may anticipate implementation strategies, for example, with a list of allowable accommodations to be distributed to parents, students, and school personnel; training and technical support may be provided then to help educators select appropriate accommodations. Interpretations may help teachers understand which accommodations are appropriate for students: how to base them on the needs of individual students and how to ensure that accommodations have been used during instruction prior to testing.
| Policy Interpretation Example One. In Texas, a list of accommodations is presented in a test manual followed by a review of written comments from stakeholders suggesting the need for clearer guidance in the manual on what accommodations are allowable; it was reported that accommodations were interpreted differently across the state. At one end was the argument for individualized decisions about use of specific accommodations; at the other end was the argument that all accommodations and modifications used in instruction be allowed. The following revised policy on test accommodations was proposed. Disseminate widely a comprehensive list of allowable test modifications and train educators to use them. Continue to allow schools or districts to request additional accommodation. Justification. Current terminology regarding the use of accommodations that do not affect the validity of the assessment may lead to varying interpretations of what is and is not allowed. Some stakeholders asserted that teachers or administrators might be unwilling to provide accommodations currently allowed by the agency because there is little evidence available concerning the effects of accommodations on test validity (Thurlow, Ysseldyke, & Silverstein, 1995). The stakeholders also suggested that other teachers or administrators might be unaware of what is and is not permissible. This proposed change directly ties assessment accommodation to the IEP and to classroom instruction and testing. It will promote testing situations that reflect classroom practice and will preserve the ARD committee's primary role in decisions about appropriate accommodations.
Also, by providing any additional detail in the list of accommodations permitted on assessments, this policy will increase the participation of students receiving special education services in the assessments, and students receiving special education services will have a better chance to demonstrate what they know and can do (p. 18). |
| Policy Interpretation Example Two. In Maryland's Guidelines for Accommodations, Excuses, and Exemptions (revised, 9/3/96) a listing of general principles is provided, as well as definitions and procedures for the Maryland Functional Testing Programs (MFTP), the Comprehensive Tests of Basic Skills (CTBS, 5th edition), and the Maryland School Performance Assessment Program (MSPAP). The general principles include (p. 2):
A summary is presented of five broad categories of accommodation: scheduling, setting, equipment, presentation, and response, which further include a list of specific accommodations permitted for all three types of tests. In general, few differences exist for the three types of tests, though interesting inferences can be deduced where allowable accommodations are differentiated. For example, calculators can be used with the Functional Testing Program, cannot be used with the CTBS/5, and can be used but invalidate the mathematics score with the MSPAP. A similar pattern exists for the use of electronic devices (allowed with MFTP, not allowed with CTBS/5, and invalidate the language usage score with the MSPAP). Finally, in explicating the decision-making process, a series of case studies is presented in which students are described with information on student background and classroom functioning. For example, for Student #3 from an elementary school, a calculator is recommended as an allowable accommodation: He is described as a student with learning disabilities, memory problems with no mastery of facts, and an Individualized Educational Plan (IEP) addressing goals and objectives in reading, mathematics, and written expression. No statement is made about participation in statewide assessments. For middle school student # 4, who is identified with learning disabilities and IEPs in written communication/basic reading skills, participation is required in all classroom, system, and state testing programs. Furthermore, dictation of a response is allowed (with verbatim transcription by school personnel) for extended response tasks. In contrast, another middle school student (#7) is described with similar needs. 
Although it is recommended that the student participate in all classroom, system, and functional tests with similar accommodations (dictating the response to an examiner for verbatim transcription), the student is exempted from the reading and written language and oral presentation of the MSPAP "due to required accommodations that invalidate the test." |
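The test-by-test differences summarized in this example lend themselves to a simple lookup table. The Python sketch below encodes only the two accommodations whose treatment is spelled out above (calculators and electronic devices). The dictionary layout, the key strings, and the function name `lookup` are hypothetical conventions chosen for illustration; they are not part of the Maryland guidelines.

```python
# Each entry maps (accommodation, test) to (permitted, score invalidated by its use).
# Values follow the summary in the text; None means no score is invalidated.
RULES = {
    ("calculator", "MFTP"): (True, None),
    ("calculator", "CTBS/5"): (False, None),
    ("calculator", "MSPAP"): (True, "mathematics"),
    ("electronic device", "MFTP"): (True, None),
    ("electronic device", "CTBS/5"): (False, None),
    ("electronic device", "MSPAP"): (True, "language usage"),
}


def lookup(accommodation: str, test: str) -> str:
    """Describe how an accommodation is treated on a given test (hypothetical helper)."""
    permitted, invalidated_score = RULES[(accommodation, test)]
    if not permitted:
        return "not allowed"
    if invalidated_score is not None:
        return f"allowed, but invalidates the {invalidated_score} score"
    return "allowed"


for test in ("MFTP", "CTBS/5", "MSPAP"):
    print("calculator on", test, "->", lookup("calculator", test))
```

Laying the rules out this way makes the pattern visible: the same accommodation is comparable on one test, prohibited on another, and permitted but score-invalidating on a third, which implies that comparability depends on the construct each test is meant to measure.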
The use of case studies to explicate the decision-making process provides an interesting array and range of different examples by focusing on the minimal differences (only slight nuances that distinguish the two examples from each other, one of which reflects a positive instance and the other of which represents a negative instance). This strategy is an extremely efficient and smart way to explicate the central feature of a concept (i.e., accommodation versus modification) because it is possible to ignore other (irrelevant) features of the concept except those that are minimally different.
Policy Implementation Analysis
At some level, task comparability may be considered by tracking the implementation of various accommodations. In conducting this kind of analysis, data are collected to understand factors associated with the use of accommodations. The focus of this perspective is to understand the degree to which accommodations are proposed and/or used within the context of other (student) mitigating factors so that one could begin to explain why some accommodations tend to be selected.
| Policy Implementation Example One. In Rhode Island, the following questions focus on student needs with respect to the assessment requirements. They represent questions the IEP team should be able to answer to identify needed accommodations:
"No" to any of these questions for one or more students should result in identification of an appropriate accommodation. It is then recommended that the principal and/or relevant school and district staff be consulted on making any accommodations recommendations. |
By analyzing the degree to which students arrive at testing situations with accompanying problems such as those noted above, it should be possible to understand why some accommodations are recommended while others are not recommended. For example, rather than simply noting that accommodations include separate testing with flexible scheduling, it would be possible to justify an accommodation based on the student's capacity to work in a group, work independently, and remain on-task for 45-60 minutes. For students who cannot be engaged in this manner, it can be inferred that the performance is adversely affected and the test is not measuring what it should be measuring. This information, documented by the IEP team, would help teachers make decisions about accommodations.
In reporting accommodations data, meaning comes from the relationships among the variables. For example, the same data reported for two groups of students may lead to an understanding of why an accommodation is appropriate or permitted some of the time for some students and not at other times for different students. In the data reported by CCSSO, considerable differences exist in the number of states allowing various accommodations for students with disabilities versus those with limited English proficiency (see Table 2, which reflects accommodations permitted by various states, ordered from most to least frequent for students with disabilities).
| Policy Implementation Example Two. The Council of Chief State School Officers conducts an annual survey of state testing practices. In their latest published report, the following data have been presented on the number of states permitting various types of accommodation for students with disabilities and students with limited English proficiency (LEP). These data have been reprinted in the NCES document by Olson and Goldstein (1997).
Table 2. Number of States that Permit Accommodation for Students
SOURCE: The Status of State Student Assessment Programs in the United States (CCSSO/NCREL 1996). |
Further distinctions can be made comparing types of decisions (individual accountability versus group evaluation) or types of test instruments (published, norm-referenced versus state-specific, standards-based) to help inform policy about implementation, particularly the implications. In effect, task comparability is likely a function of the students being tested, the decision being made, and the type of test being used.
Clearly, policy implementation analysis begins to move task comparability judgments into a comparative realm, with data collected at various levels of the testing program. In the example cited earlier (see Policy Implementation Example Two), the primary information was the number of states permitting an accommodation for various types of students. Another strategy would be to collect data on the actual frequency of use, as done in Missouri (see Policy Implementation Example Three).
| Policy Implementation Example Three. The following data are reported in the 1996 Missouri Mathematics Field Test: Accommodations and Frequency of Use for 5th grade, presented in April, 1997 at the State Collaborative on Assessment and Student Standards (SCASS) meeting. Out of 8,700 students tested, administration of the test was changed with (a) oral reading for 212 students, (b) other unknown accommodation for 96 students, (c) repeated directions for 51 students, (d) oral translation for 12 students, and (e) amplification equipment for 4 students. Various other accommodations (e.g., Braille, signing, etc.) were implemented for a few students. Finally, other accommodations in the timing of the test also were implemented with varying frequencies as noted in Table 3 below.
Table 3. Missouri Data on Timing Accommodation Frequencies
|
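The raw counts in the Missouri example are easier to compare as rates. The short Python sketch below converts the reported counts into percentages of the 8,700 students tested. Only the counts come from the example; the variable and function names are illustrative. Because the source does not say whether a student could receive more than one accommodation, the counts are not summed.

```python
# Counts reported for the 1996 Missouri Mathematics Field Test (5th grade).
TOTAL_TESTED = 8700
accommodation_counts = {
    "oral reading": 212,
    "other unknown accommodation": 96,
    "repeated directions": 51,
    "oral translation": 12,
    "amplification equipment": 4,
}


def usage_rates(counts, total):
    """Return each accommodation's share of all tested students, in percent."""
    return {name: 100.0 * n / total for name, n in counts.items()}


rates = usage_rates(accommodation_counts, TOTAL_TESTED)

# Print the accommodations from most to least frequently used.
for name, pct in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{name}: {pct:.2f}% of students tested")
```

Expressed this way, even the most common administration change (oral reading) was used with fewer than 3% of the students tested.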
Other perception and survey data also could have been collected. For example, how important are accommodations for helping students participate and succeed (as judged by students, parents, and teachers), and to what degree are there differences among various groups of people? How much better did certain students do when provided accommodations than comparable students who were not provided the same or similar accommodations? When questions like these are asked, judging comparability begins to move toward a comparative model involving post-hoc analyses. The shift is from collecting singular data within systems and making inferences across them to collecting multiple data within a system and relating the pieces of information to one another. In this shift, policy analysis moves to post-hoc evaluation.
Comparative Model
The comparative model shifts the definition of task comparability to include multiple data sources within a system. This model incorporates frequency data on use of accommodations, judgments about their appropriateness, and outcomes data on performance when accommodations are used. With these data sources, relational statements can be made about process and performance.
| Comparative Model Example One. In a study by Grise, Beattie, and Algozzine (1982), about 350 students in fifth grade took the Florida State Student Assessment Test with seven different changes made in the format of the test: (a) items were presented within a hierarchical progression of skills, (b) multiple-choice options had bubbles placed to the right, (c) the shape of the bubble was elliptical, (d) sentences were not broken to make a fill-justified paragraph, (e) reading passages were placed in shaded boxes, (f) examples were included for each skill, and (g) directional symbols were used for continuing and stopping; finally, this test also was enlarged by 30%. They found that students with learning disabilities performed slightly higher on the regular print version (vs. the enlarged version) on only one of six subsections. Yet, they found 20% to 30% more students who were administered the modified version performed at mastery levels in various subsections of the test. In a comparable study using the same modifications, Beattie, Grise, and Algozzine (1983) investigated the effects for a third grade sample of students (n= 345). Again, they found few differences on most subsections when comparing performance on the regular print version versus the enlarged print version. And, as in the other study, more students with learning disabilities mastered most of the skills when taking the modified test; on many skills, 20% more students reached mastery levels when the modified version was used than when taking the test under the standard conditions. |
As the next example illustrates, one limitation of post-hoc evaluations is the sample used, in particular how well it represents other samples. Furthermore, little is known about how the accommodation was related to previous instructional programs and other sources of influence that may have been operating at the time of testing. Therefore, it is difficult to determine the exact cause of performance differences with and without accommodation.
| Comparative Model Example Two. Several studies on test accommodations were conducted by Educational Testing Service (ETS) on the Graduate Record Examination (GRE) and the Scholastic Aptitude Test (SAT) (Willingham, Ragosta, Braun, Rock, & Powers, 1988). They analyzed test scores of students who took the tests over several years, examining the scores of those with and without disabilities to compare the effects when accommodations were used versus when they were not used. The accommodations included alternative test formats (modifying the presentation by using Braille or audio presentations), assistive devices, and separate locations. While they considered task comparability (test content, testing accommodations, and test timing), their primary concern was score comparability using indicators like reliability, factor structure, differential item functioning, prediction of performance, and admissions decisions. In general, they found that between the standard and nonstandard (accommodated) administrations, there was (a) comparable reliability (Bennett, Rock, & Jirele, 1986; Bennett, Rock, & Kaplan, 1985, 1987); (b) similar factor structures (Rock, Bennett, & Kaplan, 1987); (c) similar item difficulties for examinees with and without disabilities (Bennett, Rock, & Kaplan, 1985, 1987); (d) noncomparable predictions of academic performance (with the nonstandard test scores less valid and test scores substantially underpredicting college grades for students with hearing impairments) (Braun, Ragosta, & Kaplan, 1986); and (e) comparable admissions decisions (Benderson, 1988). Furthermore, Willingham et al. (1988) found that although students with disabilities perceived the test to be harder, they performed comparably to peers without disabilities. They also found that college performance was overpredicted when extended time was allowed.
In the end, these researchers recommended that those analyzing any test results "(a) use multiple criteria to predict academic performance of disabled students, (b) give less weight to traditional predictors and more consideration to students' background and nonscholastic achievement, (c) avoid score composites, (d) avoid the erroneous belief that nonstandard scores are systematically either inflated or deflated, and (e) where feasible and appropriate, report scores in the same manner as those obtained from standard administrations" (ETS, 1990, Executive Summary Report). |
Although these outcomes provide the field with a rich data source for considering accommodation, they represent post-hoc evaluation data that are confounded with several other variables. For example, the research is limited to college admission testing, which is relevant to only a limited group of students with disabilities (those who are both secondary students and college bound). The proportions of those with disabilities who participate in such tests are very small and may not be representative of the larger group. Moreover, the tests are unlike those used in statewide accountability systems.
| Comparative Model Example Three. Koretz (1996) analyzed student outcomes in assessment that categorized accommodations into four major classes: (a) dictation, (b) oral reading, (c) rephrasing, and (d) cueing. He examined data on these accommodations singly and in combination along with actual test performance in grades 4 and 8. He also provided a detailed analysis of the population of students with disabilities including information on (a) the degree to which participation in the testing program was inclusive, and (b) the comparability of the population receiving accommodations to both a national sample and to others in the testing program who did not receive any accommodations. Based on the frequency of accommodation use, several comparisons were made to determine the effect of single and multiple accommodations. Specifically, three major analyses were used: (a) comparisons of those with disabilities receiving the accommodation to those in general education not receiving the accommodation, (b) predictions of accommodation influence on outcome when applied singly, and (c) performance on specific items and differential item functioning when receiving accommodations. He found that when fourth grade students with mild retardation were provided dictation with other accommodations, they performed much closer to the mean of the general education population, and actually above the mean in science. Similar results occurred for students with learning disabilities. For students in grade 8, the results were similar but less dramatic. In a second analysis using multiple regression to obtain an optimal estimate of each single accommodation and then comparing predicted performance with the accommodation to that without the accommodation, dictation appeared to have the strongest effect across the subject areas of math, reading, and science, as well as across grade levels. 
This influence was significantly stronger than that attained for paraphrasing and oral presentation, respectively. For the item-level analyses, descriptive statistics, item-to-total test correlations, and differential item functioning were used to explain why some accommodations may have worked better than others. Students with disabilities performed more poorly on common items as did students in general education. These common items, however, were found to correlate consistently and highly with the total test, reflecting their adequacy as measures of performance. Finally, performance on test items correctly predicted group membership (receiving or not receiving accommodations). For those receiving no accommodations, statistically significant differences in correct performance appeared on 5 of 22 items, while for those receiving accommodations, significant differences in correct performance appeared on 13 of 22 items (an equal number were more difficult and easier). Koretz asserted that the frequency of accommodation use was high (80% in 4th grade and 67% in 8th grade). Furthermore, he suggested that the four accommodations were biased in that students with disabilities scored comparably to those without disabilities: "The highest scoring group of mentally retarded students (those assessed with oral presentation, paraphrasing, and dictation) scored near the mean for nondisabled students in all subjects other than mathematics - an implausible result given that these students have generalized cognitive defects" (Koretz, 1996, p. 64). |
As can be seen with these three examples, a post-hoc evaluation perspective provides considerable data on both process and outcome. Generally, the results can be organized so that the findings may be somewhat conclusive, though the explanations for them are not as certain. The major problem with this approach for examining task comparability is that cause-effect relationships are difficult to establish using post-hoc evaluation data. For example, when intact groups are used, many threats to internal validity arise from the selection of subjects, the mortality of subjects (those who drop out or appear with incomplete data), historical events that occur during the study period, and subject maturation (physical or psychological changes). Furthermore, subject selection may interact with any of these latter threats. For example, in many school systems, differential dropout rates occur for students from various ethnic backgrounds, reflecting a problem in which subject selection interacts with mortality. Another major limitation is the manner in which subjects are assigned to treatments (accommodations): It often is not clear, and the representativeness of the subjects is uncertain. Finally, in most evaluations, no control or comparison groups are used to determine the differential effect of a treatment (accommodation). For example, while the two studies from Florida (Beattie, Grise, & Algozzine, 1983; Grise, Beattie, & Algozzine, 1982) provide important initial findings about the comparability of tasks, no general education students received the modified tests.
Experimental Model
In the experimental model, a research design is established before data are collected. Appropriate controls are built into the way data are collected to ensure that conclusions have integrity. An experimental model has two components: (a) a research design for organizing the investigation and controlling threats to the validity of the findings, and (b) a technically adequate measurement system to scale behavior so that inferences can be made from the outcomes.
Either group designs or single-subject designs can be used within the experimental model. Both designs should address threats to the four types of validity identified by Cook and Campbell (1979): (a) internal, to allow appropriate inferences about cause and effect; (b) statistical conclusion, to ensure that data are appropriately analyzed; (c) external, to consider other populations and settings for applying the findings; and (d) construct, to explain the theoretical network in which the findings are placed. Group designs for assessing the effects of accommodations typically involve comparing students with disabilities using and not using accommodations to either themselves or another group of students. Single-subject designs most often use the same student, comparing performance when an accommodation is used to performance when the accommodation is not used.
Group Designs
Three factors need to be considered in conducting a group design: (a) students, (b) treatments (accommodations), and (c) outcomes (measures). These three factors may be either crossed with each other (all factors are presented with each other) or nested (only some factors are presented with each other). Below is a study that depicts each of these relationships for two factors: students and treatment accommodations.
| Group Designs Example One. In a study of a response accommodation (marking the booklet versus using the standard bubble sheet) and an administration accommodation (reading the math test aloud versus the standard silently read administration), Tindal et al. randomly assigned students to various conditions. All students took the test with and without the response accommodation (crossed), but participated in only one of the two administration conditions (nested). While no differences were found between the two response conditions, both statistical and practical differences were found between the two administration conditions, favoring the read-aloud accommodation (Tindal, Heath, Hollenbeck, Harniss, & Almond, in press). The reason for using these two designs (with students taking part in both response conditions while participating in only one administration condition) was to control for two critical threats to internal validity. To be certain students did not perform better in one response condition than the other simply because it was provided first, the order of responding was counterbalanced: Half the group marked the booklet first and half bubbled the answer sheet first. Yet, the investigators also assumed that potential drift might occur in the administration accommodation (i.e., contamination from exposure to the accommodation) and, therefore, one of the factors (subjects) was nested within another factor (treatment): Students were randomly assigned to participate in one or the other administration condition. These procedures are never used within program evaluations. Because the study was done in fourth grade, however, generalization of the findings (external validity) may be limited to students who are learning to read versus reading to learn. Therefore, another read-aloud study was conducted.
This second investigation used a video-taped read-aloud of math problems, in part to determine whether reading math helped older students and to ensure that the read-aloud condition was done consistently. In this study, 6th grade students were presented 30 math word problems in standard test booklet format and 30 problems read aloud by a trained reader (Helwig, Tedesco, & Tindal, in press). These researchers found that for math problems containing large numbers of both difficult and total words, students who had both low reading skill and high math skill performed significantly better when the problems were read aloud. For other students, the differences were either nonsignificant or in the opposite direction. |
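The mixed design in Group Designs Example One (administration condition nested, response condition crossed and counterbalanced) can be sketched as a random assignment routine. This is a minimal illustration rather than the investigators' actual procedure: the function name, the condition labels, the fixed seed, and the choice to counterbalance response order within each administration group are all assumptions made for the sketch.

```python
import random


def assign_conditions(student_ids, seed=0):
    """Randomly assign students in a mixed crossed/nested design.

    Administration (read-aloud vs. standard) is nested: each student
    receives exactly one. Response mode (booklet vs. bubble sheet) is
    crossed: each student receives both, in a counterbalanced order.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    groups = {"read_aloud": shuffled[:half], "standard": shuffled[half:]}

    plan = {}
    for administration, members in groups.items():
        for position, student in enumerate(members):
            # Alternate the starting response mode so that half of each
            # administration group begins with each mode (assumed detail).
            if position % 2 == 0:
                order = ("booklet", "bubble_sheet")
            else:
                order = ("bubble_sheet", "booklet")
            plan[student] = {
                "administration": administration,
                "response_order": order,
            }
    return plan


# Hypothetical roster of 20 students, identified by number.
plan = assign_conditions(range(1, 21))
```

Random assignment to the nested factor guards against selection effects, and counterbalancing the crossed factor guards against order effects, which are the two threats to internal validity the example describes.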
With an experimental model, not only must a research design be used to investigate the relationship among factors (students, treatments, and measures), but a valid measurement system must be used. Validation, however, is not of the tests or measures but of the inferences made from them (Messick, 1989). Messick employs a two-by-two matrix for understanding the construct validity of score meaning and social values. One facet is the source of justification for testing, which may consider evidence to understand score meaning or values to understand consequences. The other facet is the function or outcome from testing, which addresses how the test is to be interpreted or used. Crossing these two facets creates a four-celled table highlighting a unitary view of validity integrating meaning and values. The matrix is designed to build understanding in a progressive manner so that any construct being measured is neither underrepresented nor contaminated by irrelevant variance.
Figure 3. Facets of Validity as a Progressive Matrix
| | Test Interpretation | Test Use |
| Evidential Basis | 1. Construct Validity (CV) | 2. CV + Relevance/Utility (R/U) |
| Consequential Basis | 3. CV + Value Implications (VI) | 4. CV + R/U + VI + Soc. Conseq. (SC) |
The validation process in this progressive matrix is never definitive but proceeds in an iterative manner with evidence interacting with values to create score meaning and interpretations. The validation process includes both evidence and value in an evaluative rhetoric of argumentation. Finally, validation is interpretative, addressing both the meaning of scores and the use of them to make decisions in applied settings (Messick, 1995).
| Group Designs Example Two. In this set of studies, a creative writing task was used that incorporates the statewide administration and scoring procedures with both handwritten and word-processed compositions. A total of 164 8th graders (152 general education, 12 special education) participated in the study. All students took part in two writing examinations. The first involved students composing handwritten essays over 3 days as part of the Oregon State Writing Assessment. The second occurred approximately 3 months later under identical conditions, except that students composed their essays on a word processor. All essays were scored by certified state raters using the state's six-trait analytic writing scale.
|
In summary, group design experimental studies need to rely not only on systematic and prospective data collection procedures but also on technically adequate measurement systems. By engaging in a prospective design, explanations of outcomes are less confounded and contain fewer threats to validity. Furthermore, experimental investigations with groups of students need to reflect a program of research in which studies are linked together not just to generate findings but also to inform decision-making. Tests need to be used and interpreted with classroom implications; they need to be validated in the context of both evidence and consequences in this interpretation and use. In the final analysis, however, group designs cannot be used to make predictions for individuals. Although the average (mean) performance may have been higher or lower in either of these groups, not all students may have been equally affected.
Single Case Studies
For individual decisions, only a behavioral approach, utilizing single case studies, can be considered. When the experimental research is aimed at judging task comparability for individual students, the research paradigm is anchored to the unique needs of the student and maximizes the relationship between the process and the outcomes. Using group designs, an accommodation may be found to be effective for students with disabilities and not for students without disabilities. Nevertheless, this general finding cannot be used to make predictions of effect or lack of effect for every specific student. All effects are based on group averages and reflect only a likelihood of finding that effect with individual students.
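A small worked example may help show why a group average cannot be read as a prediction for each student. The scores below are hypothetical and chosen only for illustration; they do not come from any study cited here.

```python
from statistics import mean

# Hypothetical scores for six students tested under both conditions.
standard = [12, 15, 9, 20, 14, 11]
accommodated = [16, 15, 14, 18, 19, 10]

# Per-student change when the accommodation is provided.
gains = [a - s for s, a in zip(standard, accommodated)]

mean_gain = mean(gains)
improved = sum(g > 0 for g in gains)
declined = sum(g < 0 for g in gains)
unchanged = len(gains) - improved - declined

print(f"mean gain: {mean_gain:.2f}")
print(f"improved: {improved}, declined: {declined}, unchanged: {unchanged}")
```

In this made-up group the mean gain is positive, yet only half of the students improved, one was unaffected, and two scored lower with the accommodation.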
Functional analysis focuses on the function of various environmental variables in controlling behavior within the context of reinforcement contingencies. In a functional analysis, the single case is used to define the comparability of a task or response. Generalizations across individual cases can proceed only tentatively, because the hallmark of a functional analysis is the focus on the individual case. When the function of behavior is experimentally analyzed, the proof of comparability is documented consistently within and across phases in which tasks have been changed. With a functional analysis, task comparability can be considered from two perspectives: (a) to define a response class or (b) to infer the function of the behavior and response class.
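One way to picture documentation "within and across phases" is a single student tested in alternating phases with and without an accommodation (an arrangement often called a reversal design). The sketch below is only an illustration of that idea, not a full functional analysis: the scores are hypothetical, the function name is invented, and the decision rule (comparing adjacent phase means) is an assumed simplification of the visual and statistical inspection normally used.

```python
from statistics import mean

# Hypothetical percent-correct scores for one student across four
# alternating phases (standard, accommodated, standard, accommodated).
phases = [
    ("standard", [40, 45, 42]),
    ("accommodated", [60, 66, 63]),
    ("standard", [44, 41, 43]),
    ("accommodated", [65, 62, 68]),
]

phase_means = [(condition, mean(scores)) for condition, scores in phases]


def effect_replicates(phase_means):
    """Return True if, at every phase change, the accommodated phase mean
    exceeds the adjacent standard phase mean. Assumes phases alternate."""
    for (cond_a, mean_a), (_, mean_b) in zip(phase_means, phase_means[1:]):
        if cond_a == "accommodated":
            accommodated, standard = mean_a, mean_b
        else:
            accommodated, standard = mean_b, mean_a
        if accommodated <= standard:
            return False
    return True


for condition, value in phase_means:
    print(f"{condition}: {value:.1f}")
print("effect replicated at every phase change:", effect_replicates(phase_means))
```

Because the difference reappears each time the condition changes for the same student, the conclusion rests on that individual's data rather than on a group average.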
Within a functional analysis approach, contingencies are considered in terms of specific reinforcement paradigms (i.e., positive and negative reinforcement, escape, avoidance, etc.). Basically, a class of responses has a common effect on the environment (the definition of an operant) and is therefore premised upon the assessment of environmental variables that influence the behavior or class of behaviors. Two caveats need to be considered in this perspective: (a) reinforcers are defined in terms of the probability of increasing or decreasing a behavior, and (b) a history of reinforcement "in turn, influences how an individual responds to current environmental contingencies" (Mace, 1994, p. 385): The same "reinforcers" may not be considered uniformly across individuals.
Response class analysis. In determining task comparability, it is important to entertain the possibility that different specific behaviors may function as part of the same response classes. This provides a useful way to think about slightly different behaviors that are maintained in a similar manner in the environment. Response classes are used in behavior analysis to describe discrete behaviors that (a) may have different topographies but serve the same function (i.e., are controlled by the same reinforcement paradigm) or (b) may have similar topographies but serve different functions (are controlled by different reinforcement paradigms). Individual behaviors may belong to more than one response class (i.e., may have multiple controlling contingencies).
Behaviors that form response classes are easy to describe with social behaviors. For example, many behaviors that reflect "compliance" (attending, making eye contact, initiating behaviors upon request, responding with speed and accuracy, etc.) have the same function: They reflect a connection between the "mand" (command or demand) and the response. Students who "comply" when interacting with others perform upon request and with attention to the task; those who do not "comply" delay their response or fail to fully perform it.
| Single Case Studies Example One. How does this apply to testing students with disabilities in large scale assessments? Consider a student with Attention Deficit-Hyperactivity Disorder (ADHD). A number of behaviors may be exhibited in the presence of academic tasks with problems of considerable difficulty (worksheets, tests, assignments, etc.). When presented with a paper-pencil task and given directions to read and complete problems independently, the student may exhibit many different behaviors, all of which result in attention from the teacher and eventually removal of the task: fidgeting, playing with pencils (tapping, chewing, throwing), talking out, making distracting noises, etc. Although these behaviors differ in topography, duration, frequency, or intensity, they all have the same function of providing the student escape from an aversive stimulus. When an IEP team meets, it may decide that the student needs to take the test in a one-to-one situation, when in fact the contingencies maintaining the behavior are embedded in the classroom. Therefore, a behavior management program also may be needed in the classroom to extinguish these problem behaviors. |