WHO WE ARE
|
PUBLICATIONS | PRESENTATIONS | PROJECTS
| RELATED SITES | STAFF
| SITE MAP
| SEARCH
| WHAT'S NEW
| HOME
by Karen Langenfeld, Martha Thurlow, and Dorene Scott
Published by the National Center on Educational Outcomes
January 1997
This document has been archived by NCEO because some of the information it contains is out of date.
Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:
Langenfeld, K. L., Thurlow, M. L., & Scott, D. L. (1996). High stakes testing for students: Unanswered questions and implications for students with disabilities (Synthesis Report No. 26). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web: http://education.umn.edu/NCEO/OnlinePubs/Synthesis26.htm
Executive Summary
Increasingly, schools are administering tests that have important consequences for students. This literature review looks at existing research on the effects of these high stakes tests on students, with particular attention to students with disabilities. The review focuses on potential effects on the curriculum, students, student learning, attitudes, school climate, and costs.
Based on this review, we propose the following recommendations for future research and for those people who develop, implement, and evaluate large-scale testing programs:
Focus on the effects of high stakes testing on students rather than on schools and systems.
Assess the effects of high stakes testing on the curriculum for both special and regular education.
Assess the effects of high stakes testing on the dropout rates for both special and regular education.
Study the effects of high stakes testing programs on students who are excluded from testing.
Develop assessments that are more inclusive of students with disabilities (for example, including students with disabilities in state norming samples, and norming tests with some students using common accommodations such as extended time and Braille).
Study the effects of high stakes testing on the relationship between regular and special education.
Develop a framework for evaluating the costs versus the benefits of high stakes testing programs, particularly for alternative and authentic assessments.
High Stakes Testing
Over the past two decades, statewide testing of students has become more and more common.
This testing is conducted to meet a multitude of purposes and a wide variety of
instruments is used. The practice of statewide assessment has been the center of a great
deal of debate and political rhetoric (Popham 1987; Salganik, 1985; Shepard 1992).
Although some states have attached high stakes, such as graduation, to tests since the
early 1970s (e.g., Florida, New York), research findings on the effects of high stakes
testing have not been decisive. The Office of Technology Assessment (OTA) (1992)
summarized the findings by stating:
In the end, then, there appears to be consensus that innovation in school testing policies
can have profound effectsthe disagreement is over the desirability of those effects.
Although some of the evidence is contradictory, at times even confusing, one thing is
clear: test-based accountability is no panacea. Specific proposals for tests intended to
catalyze school improvement must be scrutinized on their individual merits (p. 15).
Research looking at effects on students, particularly students with disabilities, is the
focus of this report. In the past, most research has emphasized school improvement and
system level effects. It is important, therefore, to carefully examine the research for
evidence of student level effects.
As the clients of our educational system became more diverse in their characteristics
(Hodgkinson, 1992; Hodgkinson & Outtz, 1992), it becomes increasingly important to
look more specifically at differential effects of major reforms. Since we know that groups
of students perform differently (Mullis et al., 1994), testing, particularly high stakes
testing that may have important consequences for students, has the potential to affect
some groups differently from other groups. Among the groups of students for whom
differential effects might be expected are those with disabilities (Olson & Goldstein,
1996). At present there is increased pressure for states to use tests to decide whether
students will earn a high school diploma and to hold all students to the same requirements
for graduation (see Thurlow, Ysseldyke, & Anderson, 1995). It is important to
re-examine what we know about high stakes testing and what the research literature tells
us.
What is High Stakes?
A test can be considered high stakes if the results of
the test have perceived or real consequences for students, staff, or schools (Madaus,
1988). Increasingly, states, cities, and school boards are using test scores in order to
evaluate schools and allocate resources. In October 1996, Chicago put 109 schools on
academic probation. According to Hendrie (1996) scores from nationally normed standardized
tests were a chief factor in determining who would be placed on probation. Manzo (1996)
reported that Philadelphia was planning to link teacher raises and cash awards to schools
based on student test scores, attendance, and graduation rates. For schools with
chronically low-performing students, schools could be forced to replace up to
three-fourths of their staffs.
The consequences of testing can be both intended and unintended. Corbett and Wilson (1991)
state:
Stakes can become high when test results automatically trigger important consequences for
students or the school system, and also when educators, students, or the public perceive
that significant consequences accompany test results. Thus, a formal trigger of
consequences need not be built into the testing program for stakes to be high. Instead,
test results can cause the public to make an assessment of the quality of the school
system that serves them, and this judgment in turn can lead to a conclusion that
childrens choices . . . have been affected. The product of this process can be
increased public pressure to improve test scores, especially when the perception is that
the system is likely to have a negative impact on those choices. (p. 27)
While this review includes research using this broad definition of high stakes, the high
stakes tests in which we are most interested are those that determine a students
progress through and out of school. Examples are: minimum competency tests (MCTs),
graduation exit exams, and tests used to decide promotion from grade to grade.
Literature Review Procedure
This analysis of the literature is based on educational and psychological journals and
unpublished research from 1980 to the present. The literature search revealed fewer than
thirty research studies, with only five studies specifically focusing on persons with
disabilities. The articles include published and unpublished research, state-sponsored
evaluation and research reports, papers from professional meetings, and articles from
various research organizations such as CRESST (Center for Research on Evaluation,
Standards and Student Testing).
It became clear, as we reviewed the available evidence, that we would generate more
questions than answers. The review was limited not only by the lack of research in this
area, but also by the quite broad uses of testing for high stakes purposes. For example,
some of the studies focused on elementary schools and the pressure placed on schools to do
well for financial or other purposes. Other studies focused on high school exit exams and
still others focused on exams that determine promotion from grade to grade.
Since the articles contained information on a variety of situations that are considered
high stakes (and thus are not directly comparable), a meta-analysis of the
data was not attempted. Instead, a decision was made to conduct a descriptive analysis,
and to organize the data as a series of research questions, providing summary and analysis
of results and directions for future research. Most of the studies included here are based
on state tests, yet each state has a different approach to testing, and each program is
evolving. The studies and evaluation reports used in this review do not necessarily
reflect the particular states current testing practices. A high stakes testing
program in which schools are evaluated according to the results on the Iowa Test of Basic
skills is different from a program in which students must pass a test in order to graduate
from high school. While often the studies themselves are not directly comparable, we do
have enough studies in this review to see emerging patterns that require further
investigation. A summary of the studies we used in this review is provided in the
appendix.
This report is organized around a series of questions. The available literature addresses
some of these questions, but all are in need of further exploration and consideration by
the people who make testing decisions. The questions center around four basic effects of
high stakes testing: effects on the curriculum, effects on student learning, effects on
student and teacher attitudes and the climate of learning, and the costs versus the
benefits of high stakes testing. Each section of this review focuses on high stakes
testing in general, and then explores the implications for persons with disabilities.
Directions for future research also are proposed.
Curriculum and Instruction
Most research on the effects of high stakes testing has looked at the curriculum and
instruction. Still, the information from this research is relatively indirect. Seven
studies were reviewed (see first table in the appendix) that addressed the effects of high
stakes testing on the curriculum and instruction.
What Effect Does High-Stakes Testing Have on the Curriculum?
Several claims have been made as to the possible
positive and negative effects of high stakes testing on the curriculum. Some individuals,
such as Popham (1987), believe that the curriculum will improve as schools, teachers, and
students attempt to meet the challenges that testing will impose. Others (Madaus, 1988;
Shepard, 1992) fear that high stakes testing will narrow the curriculum, focus on
lower-order skills, or take control of the curriculum away from local sources.
In a study comparing a low-stakes state (Pennsylvania) and a high stakes state (Maryland),
Corbett and Wilson (1990, 1991) found that in a high stakes situation, teachers reported a
narrowing of the curriculum, but that not all teachers thought this was a bad thing.
According to Corbett and Wilson (1990), Maryland school districts focused more
directly on improving test scores, altered the curriculum to a greater extent, reported
more improvement in the curriculum, and felt the curriculum had narrowed more than their
Pennsylvania colleagues (p. 72). They also found that smaller districts were more
likely to make greater curriculum and instruction adjustments. Teachers did not always
think that these changes were necessarily in the best interests of the students. In
general, high stakes testing affected both the content and the sequence of instruction,
and efforts to affect test scores directly increased as the testing dates approached.
Rottenberg and Smith (1990) used qualitative interviews in a high stakes elementary
setting in a state (not identified) that used the Iowa Test of Basic Skills (ITBS). The
test was considered high stakes because the scores were used in the evaluation of
principals and for making curriculum decisions. The media also reported scores on the ITBS
by school and grade level. Rottenberg and Smith found that testing reduced the time
available for ordinary instruction. Schools were also neglecting material not in the
tests, while encouraging the use of instructional methods resembling testing, such as
multiple-choice exams.
Shepard and Dougherty (1991) addressed test-preparation practices and effects of testing
on instruction. They used districts from two states with high stakes testing, one in the
Southwest and one in the Southeast, and sampled teachers in the 3rd, 5th, and 6th grades.
They found that teachers gave greater emphasis to basic skills instruction, and that
nontested content suffered because of the focus on standardized tests. They also found
that teachers spent an inordinate amount of time preparing for tests rather than focusing
on the regular curriculum. They reported four weeks of intensive test preparation, plus
two weeks for administering the test itself. This emphasis on preparation was not limited
to the time surrounding the test administration: 68% of the teachers reported using
worksheets throughout the year to review expected test content and to give students
practice with the testing formats. Consistent with other studies, teachers did report that
the tests helped them to set clear instructional goals.
Herman and Golan (undated) also found that teachers spent an inordinate amount of time
preparing for tests. They surveyed upper elementary school teachers in nine states with
high stakes testing using a 136 item questionnaire designed specifically for this study.
Consistent with Shepard and Dougherty (1991), they found that teachers were spending class
time on worksheets covering test content and format. Teachers also changed the content and
sequence of instruction throughout the year. While higher order skills and nontested
subjects such as arts and science had less emphasis, teachers continued to cover these
subjects. Basic skills, however, were given the most emphasis. As a result of testing,
teachers changed the content and sequence of instruction based on prior tests and how
their classes did the previous year. Another trend found in this study was that
socioeconomic status (SES) had a great deal to do with how well students performed on the
tests and how much the tests affected instruction. The authors found that socioeconomic
status was significantly and negatively related to the amount of attention that schools
and teachers gave to test scores, curriculum planning, and time devoted to test-related
activities. According to Herman and Golan (undated), testing is more influential and
exerts stronger effects on teaching in schools serving more disadvantaged students
(p. 58).
Rodgers, Paredes, and Mangino (1991) looked at the effects of the Texas Educational
Assessment of Minimum Skills (TEAMS), a test that students needed to pass in order to
graduate from high school. The study took place over five years, using 12,404 eleventh
grade students from the Austin Independent School District. The test focused on language
arts and math. Rodgers et al. found that basic skills, as measured on the Tests of
Achievement and Proficiency (TAP), increased as a result of the minimum competency exam,
but that higher order skills remained the same. They concluded that districts should be
cautious about narrowing the curriculum and letting higher order skills suffer for the
sake of improving test scores.
In summary, the literature provides very little direct evidence about the effects of high
stakes testing on the curriculum. The results are consistent, however, in showing that
high stakes testing does affect what and how teachers teach. One can argue about whether
these effects are positive or negative. For example, Shepard and Dougherty (1991) found
that high stakes testing narrowed the curriculum; however, in Corbett and Wilson (1990),
some teachers thought that the curriculum had narrowed while others thought that the
curriculum had become more focused. Berger and Elson (1996) found that in a high stakes
environment, teachers reported a clearer mission for their schools. The authors claimed
that this supports advocates claims that measurement driven instruction adds
focus, coherence, and clarity to the mission of a school. Teachers and students know what
is expected of them and how they will be judged (p. 22). These may all be aspects of
the same phenomenon and deserve to be looked at more closely.
How Are Educational Opportunities For Persons With Disabilities Affected By High
Stakes Testing?
We could find no studies that directly assessed the
effects of high stakes testing on the curriculum for students with disabilities. Indeed,
some researchers, such as Rodgers et al. (1991), purposely excluded students with
disabilities from their studies in order to make data analysis easier. This section
addresses some of the evidence suggesting that educational opportunities for students with
disabilities are negatively affected by high stakes testing.
Bergquist, Elzie and Groves (undated) conducted an evaluation of the impact and
effectiveness of Florida graduation and competency test standards on students with
disabilities in 1986-87. They found that students with disabilities had trouble earning
the 24 credits required for a standard diploma in four years. This made it difficult for
them to accommodate vocational training in their programs. Students with disabilities even
had trouble earning 24 credits toward a special diploma with courses paralleling the
standard diploma, and these students also had difficulty including vocational training in
their programs. The authors concluded that students were more likely to leave high
school with a standard or special diploma but no marketable job skill (p. 10). Also,
accommodations for students in the academically-oriented classes varied from district to
district. As a result, children who failed in one district might have succeeded in another
district that was more willing to meet their needs. The authors reported that
districts with greater flexibility in the course requirements for the diploma were
better able to meet the unique needs of the handicapped student through the course
offerings at the high school level (p. 10).
Little is known about the effects of increased graduation standards on curriculum
offerings. Grossman, Kirst, and Schmidt-Posner (1986) found that increased graduation
requirements and entrance requirements for college in California resulted in increased
offerings in math and science, and decreased offerings in industrial arts, home economics,
business education, etc. If this is a national trend, then one can expect that as
resources decrease and become more concentrated on academic subjects, the opportunities
for students to find an appropriate education may decrease as well.
Directions
It is interesting that the majority of the research is
almost entirely focused on teachers and their perceptions rather than on students and
their performance. For example, do students study longer for tests when the stakes are
high? If they do, does this generalize to the rest of the curriculum, or are students more
likely to study only for the test, and what they believe the test will include?
For students with disabilities, it is especially important to assess how high stakes
testing affects the curriculum. Should students be concentrating on vocational and other
objectives, or should their IEPs focus on the skills measured on a high stakes test?
Educators and legislators must be made aware that the needs of students with disabilities
may not always be best met with what is offered in the regular academic curriculum. If
students with disabilities are to be included in educational reforms, then their needs
both for accommodations in testing and modification of the curriculum must also be taken
into account.
The evidence reviewed so far suggests that high stakes testing affects the curriculum.
These effects can be viewed as positive or negative, but the negative view is more
pervasive. It is equally, if not more important, however, to ask what effect these tests
have on students and learning.
Student Learning
Indeed, questions about the impact of high stakes testing on students with disabilities
are the critical ones that need answering. To date, however, there is virtually no
evidence that adequately addresses the question of how high stakes testing affects student
learning. Again, we turn to the study by Corbett and Wilson (1991) of a high stakes state
(Maryland) where teachers spent more time on test preparation, used more practice tests,
and conducted more content reviews. They noted, however, that while the numbers
visually document more intense activity on the part of educators in Maryland, missing is
any assessment regarding its value to improved learning (p. 70).
The media have touted many statewide testing programs on the basis that students
scores on the tests have improved over time. In this section, we review studies (see
second table in the appendix) that suggest this interpretation of test scores as a measure
of actual student improvement may be premature. Many factors that have nothing to do with
actual learning can affect scores on high stakes tests. This section addresses several of
these factors, including teaching to the test, reclassifying students so that they no
longer have to take the test (e.g., through referral to special education), and increased
dropout rates.
How Well Does High Stakes Testing Measure Student Learning?
Teaching to the test means a concentration
on skills that increase test scores regardless of the amount of knowledge the student
actually possesses. There are some generally accepted skills, often called
testwiseness, that many states openly encourage. Testwiseness includes things
such as getting a good nights sleep, knowing how to make educated guesses, using
relaxation techniques, etc. Other practices are considered unethical and include such
things as giving students copies of the test prior to administration, hinting at answers,
and changing answers on test papers. In the middle ground, there exist practices such as
giving students worksheets (multiple choice, five-minute essays) to familiarize students
with the format of the tests. One extreme example of this was reported by Madaus (1988) of
an entire course dedicated to five-paragraph timed argumentative essays. This course was
being offered in lieu of other advanced courses in writing.
Several studies have been conducted to determine whether increases in test scores reflect
real increases in student learning. Shepard (1990) conducted interviews with teachers in a
high stakes testing environment to investigate the claim that teaching to the
test was causing inflated test scores. Most instances of teaching to the test were
of a mild form, such as using commercially marketed programs designed to teach test-taking
skills to students. Other practices included such things as school-developed practice
tests, pep rallies to psych kids up to do well, and high school courses
specifically aimed at the competency measures. This study found evidence that questionable
practices existed, but could not quantify their effects on actual test scores.
In a different approach, Walstad (1984) looked at what kinds of practices were responsible
for increases in test scores. Controlling for other factors such as socioeconomic status
(SES), Walstad looked at three variables: pretesting students, curriculum changes based on
the states education standards, and district-sponsored workshops to increase the
skills of teachers in implementing the standards. Pretesting, a practice where students
were able to practice the test format, was the only significant variable that contributed
to an increase in test scores. Curriculum and instructional changes had no significant
impact. This suggests that increases in the test scores were not due to actual learning,
but rather to familiarity with the tests.
Another way to determine whether high stakes tests are measuring actual learning is to
look at how they generalize to other tests. In theory, if a student is learning a skill
such as spelling, this knowledge should generalize across state tests and other tests of
achievement. Koretz, Linn, Dunbar, and Shepard (1991) looked at whether performance on a
third grade high stakes test would generalize to other tests. Mathematics scores did not
generalize from one test to another, and reading scores generalized only a small amount,
of little practical significance. Koretz et al. concluded that to a substantial
degree, teachers in this district must be focusing on content that is specific to the
particular test used for accountability, rather than trying to improve achievement in the
broader sense that we would all desire (pp. 20-21).
Other factors that have nothing to do with learning also may influence test scores.
Shepard (1992) stated that:
Because of the pressure on test scores, more hard-to-teach children are rejected by the
system. There is a direct correspondence
between accountability pressure and the number of children denied kindergarten entrance,
assigned to two-year kindergarten programs, referred to special education, made to repeat
a grade, or who drop out of school (p. 5).
This kind of academic redshirtingkeeping students back a grade in order
to improve test scores (see Zlatos, 1994) has been found in several other studies as well.
Potter and Wall (1992) found evidence that, as early as preschool, children were kept back
a grade so that they would do better on the tests. Allington and McGill-Franzen (1992a)
examined test scores in districts that had claimed increases in student performance on
high stakes tests. The districts came from a variety of settings (urban, suburban, rural)
and socioeconomic status. Rather than finding evidence of increased learning and better
teaching, the authors found an increase in the proportion of students retained a grade or
placed in special education. The authors recalculated the test data by determining which
children started kindergarten together. When test scores of children who had been
identified for special education or who had been held back a year were included in the
test scores, the gains districts had been reporting disappeared.
Allington and McGill-Franzen (1992b) also looked at trends in these schools and found an
increasing number of students were being referred to special education or retained a grade
during a period of increased high stakes.
Similar factors also were found to influence test scores on the Texas minimum competency
tests of the mid-eighties. Mangino, Battaile, and Washington (1986) found that increased
test scores were probably due to factors other than actual learning. They identified
several problem areas, including students taking the tests many times, the ability of
students to gain waivers from taking the tests, and a higher percentage of students using
special education exemptions. Thus, a school with incentives to do better on the statewide
tests could improve overall scores by encouraging poor students to get waivers, or by
placing them in special education.
Dropping out of school is perhaps the most serious of the effects of high stakes testing,
but it has been very difficult to prove a causal connection between high stakes testing
and dropout rates. Indirect evidence of increased dropout rates has been found by
researchers such as Potter and Wall (1992) who looked at the effects of a graduation exit
exam on students in South Carolina. They found that more students were retained a grade as
a result of the high stakes testing. Overage students were more likely to drop out of
school. In a longitudinal study from 1982-1989, Morris (1991) looked at patterns of
changes in grade retention rates in a large urban school district in Florida, where tests
were considered high stakes due to pressure from the media. Among other factors (such as
restructuring the schools from a junior high to a middle school model), he found that high
stakes testing and increased graduation standards increased the retention rates. He found
that retention rates increased during the test years when the state introduced a
diagnostic test to identify failing students for remediation. The increases in retention
rates spread to other grades when the test was used to identify weak school programs. When
the graduation requirements increased while the time to achieve the requirements
decreased, retention rates also rose.
Another study that attempted to establish a link
between high school dropout rates and high stakes testing was conducted by Catterall
(1987). This study looked at states that had minimum competency examinations for
graduation from high school. Catterall used interviews to ask teachers and students their
perceptions of the impact of minimum competency testing on dropout rates. Few teachers and
administrators thought that minimum competency testing had much of an impact, and none
could cite any evidence to support their beliefs. They saw the minimum competency tests as
largely meaningless and innocuous (p. 22). Students, however, had very
different perceptions, and 14% of the students said that they knew someone who had failed
the test and dropped out of school as a result. Catterall also found a high correlation
between students who had failed the test at least once and who expressed doubt as to
whether they would finish school.
Two recent studies have attempted to examine the link between high stakes testing and
dropout rates directly, with very different results. The studies looked at different
populations and types of tests. Griffin and Heidorn (1996) reported on the Florida minimum
competency tests in mathematics and communication, which students must pass in order to
graduate from high school. Students may take the tests up to five times, starting in grade
10. The authors found that dropout rates increased only for students who were doing well
academically and subsequently failed the tests. Dropout rates did not increase for
students who already had poor academic records, or for minority students.
Reardon (1996) found very different results when he examined minimum competency tests that
students needed to pass in order to be promoted from eighth to ninth grade. This study
used data from the National Educational Longitudinal Study (NELS) to look specifically at
retention and dropout rates related to high stakes testing. Interestingly, Reardon found
that minimum competency testing was more prevalent in urban schools with high
concentrations of low-income and minority students.
This uneven distribution of MCT requirements may simply mean that the prevalence of MCTs
is related to the prevalence of lower-achieving studentsthe group proponents believe
the tests are most likely to help. But it raises an important concern as well: if MCTs do
influence some students to drop out who would not have otherwise, then not only are MCT
policies harmful, but their harmful effects are disproportionately concentrated on those
students with the fewest opportunities for success (p. 5).
Reardon did, in fact, find evidence that dropout rates do increase as a result of minimum
competency testing and that furthermore, the . . . data also suggests that it is the
concentrated poverty of these schools and their communities, and their concomitant lack of
resources, that link MCT policies to higher dropout rates, rather than other risk factors,
such as student grades, age, attendance, and minority group membership (p. 5).
Reardon found that eighth grade students with minimum competency testing requirements
dropped out at double the rate of students without MCT requirements (8.8% as opposed to
4.2%). Breaking down the data by socioeconomic status, Reardon found that low and
moderately low SES schools were the most related to high dropout rates. MCTs had little or
no effect on the dropout rates in higher SES schools.
While these results conflict with those of Griffin and Heidorn (1996), these differences
can be explained by the differences in the type of tests and populations used in the
studies. The Florida test, on which Griffin and Heidorn base their study, has been
described as a basic-level test that does not set particularly high standards. In
addition, students may take the test up to five times, perhaps relieving some of the
pressure of the high stakes. The Reardon (1996) study, on the other hand, uses a more
representative sample, concentrating on high stakes tests that have earlier and more
immediate consequences for students.
In summary, increased test scores on a high stakes test do not necessarily translate into
increased learning for students. The problems go beyond simple inflation of test scores,
which is a serious concern from a measurement standpoint. These studies point to a
frightening but very real possibility that children will be systematically and
deliberately labeled, excluded, and pushed out of the system altogether in order to
improve test scores. Madaus (1988) argued that teaching to the test is part of human
nature.
Some have argued strongly that if the skills are well chosen, and if the tests truly
measure them, then coaching is perfectly acceptable. This argument sounds reasonable, and
in the short term, it may even work. However, it ignores a fundamental fact of life: when
the teachers professional worth is estimated in terms of exam success, teachers will
corrupt the skills measured by reducing them to the level of strategies in which the
examinee is drilled . . . . The view that we can coach for the skills apart from the
tradition of test questions, embodies a staggeringly optimistic view of human nature that
ignores the powerful pull of self-interest. (p. 93)
How Does High Stakes Testing Affect the Learning of Students with Disabilities?
Macmillan, Balow, Widaman, and Hemsley (as cited in
Griffin and Heidorn, 1996) found that students with disabilities who failed a minimum
competency exam had higher dropout rates than regular education students. This study,
however, did not control for other factors that might also affect dropout rates, such as
academic performance or problem behavior.
As far as effects on what students with disabilities are actually learning as a result of
high stakes testing, we do not really know. If researchers have paid little attention to
effects on typical students, they have paid even less attention to students with
disabilities. Even where students with disabilities are included in statewide testing
programs, they are often left out of research and evaluation reports. This is unfortunate,
since people with disabilities have much to lose when it comes to high stakes testing,
especially for graduation purposes. As Safer (1980) pointed out, students who lack a
regular diploma may be discriminated against in employment later in life.
How Does Inclusion of Students with Disabilities Affect State and Local Test
Scores?
Inclusion of students with disabilities in statewide
assessments causes problems for many persons concerned with the accurate measurement of
students. When persons with disabilities are not included in norming samples, or
accommodations such as large print editions are not part of the initial test
standardization process, then including people, or allowing accommodations raises
legitimate questions about the reliability and validity of the test scores. According to
Bond, Roeber, and Braskamp (1996), 31 states currently use norm-referenced tests, such as
the Iowa Test of Basic Skills (ITBS) or the Stanford Achievement Test (SAT). Scores on
these tests are used for improvement of instruction (29 states), program evaluation (24
states), school performance reporting (22 states), student diagnosis (19 states), and high
school graduation (two states). Many times, students with disabilities are left out of
these tests or denied accommodations because the test was not standardized on these
populations. Current research efforts are trying to determine whether accommodations such
as large print, extra time on tests, Braille versions of tests, etc. are a valid means of
including students in these assessments. Other efforts have been made to determine whether
tests can be administered to persons with disabilities and reported using separate norms
and percentile ranks. Chin-Chance, Gronna, and Jenkins (1996) found that the state of
Hawaii had enough students taking the states norm-referenced test to report separate
norms and percentile ranks for students with and without disabilities and within
disability categories.
Research into testing for students with special needs should not stop at simply
determining the validity of using a type of accommodation. It is possible to develop tests
that are normed on all students, and using formats that would include a greater number of
people. For example, tests can be developed using larger print, so that a separate test in
large print need not be developed. Test items can be developed keeping in mind that some
people may not be able to see the figures, but will need to rely on verbal information.
Criterion-referenced tests, for which a student must simply demonstrate a proficiency in a
subject are more easily adaptable for persons with disabilities. Here, the student need
only demonstrate proficiency in the areas to be tested. For example, in the state of
Hawaii, students must pass the Hawaii State Test of Essential Competencies (HSTEC) to
receive a high school diploma. Students who fail the HSTEC once may take the test again
until they pass, or may pass the test through the Essential Competencies Certification
Center that uses an open-ended response format. According to Bond et al. (1996), 36 states
currently use criterion-referenced tests. Sixteen states use criterion-referenced tests
for graduation purposes. Thirty-four states use written assessments, 12 of them for
graduation, and 18 states use alternative forms of assessment, two of them for graduation.
While inclusion of students with disabilities in these types of assessments is more easily
accomplished than in norm-referenced tests, it is still imperative that states take the
needs of students with disabilities into account when they develop testing programs in
order to avoid costly problems later.
How Do Students with Disabilities Perform on High Stakes Assessments?
In general, the few studies that have been conducted on
students with disabilities show that they do poorly when compared to peers without
disabilities. In Hawaii, Chin-Chance et al. (1996) looked at the scores of students with
disabilities taking the Stanford Achievement Test 8th Edition (SAT8) in the 3rd, 6th, 8th,
and 10th grades. Hawaii uses the SAT8 for national, school, district, and local
comparisons. While the study was not conducted using the states graduation exit exam
(the Hawaii State Test of Essential Competencies), the SAT8 can still be considered high
stakes because the scores are public, comparisons are made among districts and schools,
and SAT8 scores are used to make judgments about a schools effectiveness. The study
looked at data for students with disabilities who took the statewide administration of the
SAT8 in 1994 and 1995. They found that students with disabilities in all categories did
more poorly on the tests than did students without disabilities; however, when looking at
difference scores from one year to the next, students in Hawaii showed more improvement
than did the national norms, and students with disabilities did as well as, or better
than, students without disabilities. This study did not look at how students with
disabilities were faring on the graduation exit exam, a test with much higher stakes than
the SAT8.
Safer (1980) found that in Florida in the late 1970s, students with disabilities were not
likely to pass the minimum competency test required for graduation (see Table 1).
According to the data reported by Safer, between six percent (educable mentally retarded)
and 71% (speech/language impaired) of students passed the communications subtest.
Altogether, only 46% of the special education students taking the Florida graduation exit
exam passed the test. Between one percent (educable mentally retarded) and 33%
(speech/language impaired) passed the math subtest, only 18% of the total.
Table 1. Percentage of High School Juniors with Disabilities Passing Two Subtests of the Florida Minimal Competency Examination in 1977. |
|||
|---|---|---|---|
| Handicapping Condition | Communications Subtest | Math Subtest | N |
| Speech/language impaired | 71 | 33 | 500 |
| Deaf | 47 | 18 | 126 |
| Hard of hearing | 65 | 29 | 49 |
| Physically impaired | 67 | 14 | 110 |
| Emotionally disturbed | 56 | 17 | 114 |
| Socially maladjusted | 49 | 25 | 79 |
| Learning disabled | 49 | 17 | 502 |
| Educable mentally retarded | 6 | 1 | 479 |
From Safer (1980), p. 289
McKinney (1983) looked at how students with disabilities performed on the North Carolina
Minimum Competency test, both with and without accommodations. He found that some groups
were more likely to benefit from test modifications than others. Still, the probability of
passing the test was low, especially for students with mild mental retardation, even with
modifications. Out of 3,043 students taking the MCT in 1978, the reported pass rates (see
Table 2) were only slightly better than those reported by Safer (1980). Even so, many
students with disabilities could not pass both subtests, a requirement for receiving a
diploma. Of the students who failed the test on the first try, 78% took the test a second
time. Of these students, only 35% passed the reading subtest and 28% passed the math
subtest. Students with visual impairments had the best retest success rates (72%) and
students with mild mental handicaps had the lowest retest success rates (21%).
| Table 2. Percent of Students Who Passed the Subtests on the North Caroline Minimum Competency Test for Graduation. | ||
| Handicapping Condition | Reading Subtest | Math Subtest |
| Educable mentally handicapped | 12 | 7 |
| Learning disabled | 56 | 47 |
| Visually impaired | 92 | 88 |
| Hearing impaired | 75 | 70 |
| Multiply handicapped | 32 | 28 |
| "Other" handicapping conditions | 66 | 57 |
From information in McKinney (1983), p. 547-548.
McKinney also looked at how test modifications affected
student performance on the test. Half of the students in the sample received modified
tests. Modifications included extended time, small group administration, audio cassettes,
large print editions, and sign language. Use of modifications was not assigned at random,
but varied across educational districts and type of school personnel (special education
teachers were seven to 17 times more likely to use test modifications). Students with mild
mental retardation who used modifications were more likely to pass the test than those who
did not receive modifications. Students with hearing impairments were actually less likely
to pass with modifications. Test modifications had no significant impact on other groups.
In interviews with school personnel, McKinney (1983) found that special educators were
concerned about the impact of the testing on students with mild mental impairments.
Teachers reported that some exceptional students, particularly educable mentally
handicapped students found the test extremely frustrating (p. 549). Even though
students with mild mental disabilities benefited from the accommodations, they were still
less likely than any of the other groups to pass the tests, and were reportedly more
likely to become frustrated by the testing process. Hall and Gallagher (as cited in
Vitello, Camilli, and Molenaar, 1987) analyzed the same data and found that while
remediation did make it possible for 50% of the students with disabilities retaking the
test to pass on the second try, students with mild mental retardation were not helped by
remediation efforts.
Vitello, Camilli, and Molenaar (1987) also found that students with mild mental handicaps
had the most difficulty passing minimum competency exams. Their study examined the scores
of the 4,299 students with disabilities who took the New Jersey competency test. At this
time, students with disabilities who completed their IEP objectives were given a standard
diploma. Of the students with disabilities who were eligible, only 40% actually took the
exam, and 12% of these students passed. Only one percent (26 of 1438) students with mild
mental disabilities took the exam, and, of this group, only four percent actually passed
the test.
Directions
The studies reviewed so far focus primarily on factors
that may inadvertently influence test outcomes. The literature says very little about
exactly what students should be learning or doing differently as a direct result of
testing. More studies, such as that of Catteral (1987), which focus specifically on
students and how testing reforms affect their lives and learning are also needed. The
conflicting results of Griffin and Heidorn (1996) and Reardon (1996) indicate that not all
testing reforms have the same results. It seems to us that every major testing reform
requires the type of analysis that Griffin and Heidorn conducted, as well as continuing
large-scale studies such as Reardons which look at a broader spectrum of tests.
There is a dearth of evidence about the effects of high stakes testing on students with
disabilities. If, for example, schools are referring more students to special education as
a result of high stakes testing, then we need to know whether these students are getting
the instruction they need, or whether we are simply putting them in positions where less
is expected of them. The effects of both inclusion and exclusion of students with
disabilities should be examined. For example, if we exclude students with disabilities, we
may then spend less time teaching them important skills that the tests cover. If we
include students with disabilities, we may then be neglecting other parts of their
education, such as vocational skills that the IEP team believes to be important.
The exclusion of any group from a high stakes testing program makes it difficult to
evaluate the effects on these students, particularly when looking at effects on dropout
rates. When some students are excluded from testing in the first place, an increase in the
dropout rate of this population would not show up in the results of studies such as
Griffin and Heidorn (1996) or Reardon (1996). Students with disabilities, particularly
learning and behavioral disabilities are already at a particularly high risk for dropping
out of school (Sinclair, Christenson, Thurlow, & Evelo, 1994). It is possible that
exclusion from high stakes testingespecially when the result is denial of a
diplomacould push even more of these students out of school.
The evidence so far suggests that students with disabilities do not fare well on minimum
competency tests. Furthermore, we still do not know whether preparing for the tests,
taking the tests, or passing the tests have any consequences, positive or negative, for
these students.
It is important for special educators to encourage new testing programs, and large testing
companies, to begin including students with disabilities in norming samples, and to
include some common accommodations (such as extended time, reading directions, large print
editions) when developing tests. This will allow educators to more accurately measure the
performance of students with disabilities, and increase the inclusion of students with
disabilities in important assessment activities.
Thus far, we have examined the influence of high stakes testing in two obvious
areascurriculum and instruction, and student learning and performance. In those
sections we found evidence that teachers focus the curriculum on the content and structure
of tests. We also found that test scores often are corrupted through artificial inflation
due to teaching to the test and reclassifying or retaining students who perform poorly so
that they do not have to take the tests. High stakes testing may have other effects as
well, effects that have nothing to do with academic performance or test scores.
Attitudes and School Climate
This section addresses the less tangible issues of school climate and the attitudes of
teachers and students toward testing (see the third table in the appendix for a summary of
the literature reviewed). In this section, we distinguish between two types of high stakes
testing: high stakes for students, and high stakes for schools and teachers. In most
instances, if a test is high stakes for one group, it will be high stakes for all groups.
Unfortunately, most of the research in this area has concentrated on the effects of high
stakes testing on teachers and schools. High stakes for students include failure to get a
diploma or grade promotion. Students may also experience high stakes when pressure is put
on schools to perform better, and the schools in turn put pressure on the students.
Schools and teachers may experience high stakes in terms of funding (for example, bonuses
for good performance, decreases in funding for poor performance), public scrutiny by the
media and politicians (for example, publishing test scores comparing schools or
districts), or job security (for example, threatening to restructure schools if test
performance does not improve).
What Effects Does High Stakes Testing Have on Teachers and Students
Attitudes, and on the Climate of Learning?
Corbett and Wilson (1991) found that staff in a high
stakes state (Maryland) reported greater impact on their students and their own
lives than did staff in a low-stakes state. Teachers in the high stakes state also
reported more stress, more paperwork, and decreased reliance on their professional
judgment.
A qualitative study using classroom observations and interviews of teachers by Rottenberg
and Smith (1990) found negative effects for both students and teachers in a high stakes
testing program. They looked at the role of external testing in elementary schools in
Arizona. The tests used in these schools, such as the Iowa Test of Basic Skills (ITBS),
were considered high stakes because the results were used in the evaluation of principals
and schools, and because the media reported ITBS scores by school and grade level.
For pupils, particularly younger ones, most teachers believe that standardized testing is
cruel and unusual punishment. Because of the length and difficulty of tests, the number of
tests, the time limits, the fine print, and the difficulty in transferring answers to
answer sheets, teachers believe tests cause stress, frustration, burn-out, fatigue,
physical illness, misbehavior and fighting, and psychological distress. Some teachers
believe that the tests cause their pupils to develop test anxiety and a failure mentality.
(p. 17)
Effects on teachers were perceived to be equally negative:
[Teachers] feel ashamed and embarrassed if their pupils score low or fail to grow by
district standards. They feel relieved rather than proud when scores are high, for they
know that test scores are weighted more by pupils socioeconomic status and level of
effort than anything teachers personally do in the classroom. (pp. 17-18)
Similar effects were found by Herman and Golan (undated) who reported that pressure to
improve test scores was negatively related to job satisfaction and pride in teaching for
upper elementary school teachers in nine states across the country.
A recent study by Berger and Elson (1996) looked at the effects of minimum competency
testing on teacher autonomy, cooperation and school mission. They used data from the 1987
Schools and Staffing Survey conducted by the U.S. Department of Education that included
surveys of over 19,000 teachers across the country. They compared teachers responses
in high stakes testing programs to those in states with low stakes testing programs. A
high stakes program was defined as one in which the diploma is withheld when the student
does not pass. As mentioned before, they found that high stakes were correlated with a
clearer sense of mission for their schools. This generally positive finding, however, was
accompanied by a reduced sense of autonomy. Contrary to other studies, Berger and Elson
did not find a correlation between high stakes testing and a reduction in teacher
cooperation. This was the only study we found that looked at a representative sample of
teachers and their attitudes toward testing.
High stakes testing appears to have both positive and negative effects on students and
teachers. Testing causes stress and frustration for both teachers and, reportedly, for
students as well. Teachers reported decreased autonomy and ability to rely on their
professional judgment. Interestingly, an increased sense of clear mission is the one
positive attitudinal change to be documented from high stakes testing. Absent from this
evidence is any direct study of how students have been affected both emotionally and
academically by testing programs.
What are the Emotional/Attitudinal Effects on Students with Disabilities?
Especially when it comes to graduation exit exams, one
can assume that the effects on students with disabilities are at least as serious, if not
more so, than on students without disabilities. McKinney (1983) discussed the possible
repercussions for students with disabilities who, according to his study, have a limited
probability of passing the test, even with modifications. The professional opinions of the
teachers involved in the study were that students, especially those with mild mental
disabilities, would experience frustration.
Many states leave the decision about the participation of students in high stakes tests up
to the Individualized Education Program (IEP) team (see Thurlow et al., 1995). These teams
need to take into account the possible negative effects of both participation and
nonparticipation in the testing programs. Again, little evidence exists about the real
effects that high stakes testing has had on students with disabilities.
Directions
As in other areas, the research into attitudes and
school climate should focus more directly on students, particularly students with
disabilities. Similarly, given the perceptions of educators, it might be beneficial to
focus training efforts toward addressing the perceptions of negative effects and how to
avoid these.
To date, there has been no research into the effects of high stakes testing on the
relationship between special and regular education. If stakes for administrators and
educators are high (for example, linked to promotions, bonuses, sanctions for poor
performance, etc.) will regular educators see special education students as a threat,
bringing down test scores and possibly resulting in the loss of funds or even jobs? Does
high stakes testing affect the inclusion of students with disabilities in the regular
classroom? Do special educators see high stakes testing as a threat to their autonomy and
the ability to individualize the educational programs of students with disabilities? Do
students with disabilities view high stakes testing as yet another barrier to their
inclusion as members of our society, or do they see just another test?
Costs Versus Benefits
So far in this report, we have examined the possible benefit or harm that may occur as a
result of high stakes testing. In the next section, we will look at the costs versus the
benefits of high stakes testing and the extent to which costs have been accurately
estimated and taken into account during the development of high stakes testing programs.
What are the Costs vs. the Benefits of High-Stakes Testing?
The issue of costs versus benefits of high stakes
testing is not an area that has been thoroughly researched. Some advocates claim that
tests can cause substantial improvement in schools at very little cost, with very little
effort on the part of legislators, government, or the public. The idea seems to be that
students and teachers will work harder to gain the rewards of scoring well on the tests,
and avoid the consequences of scoring poorly.
The February 16th Daily Report Card (DRC), an on-line newsletter put out by the National
Education Goals Panel, reported on a controversy about testing taking place in Virginia.
According to the DRC, the governor was proposing a new testing program that would be used
to determine district funding and could also be used to decide whether teachers and
administrators should keep their jobs. Teachers complained that they were being asked to
be more accountable and to achieve higher standards without the tools and training to do
so. An opponent of the proposal was quoted, saying, Testing is education reform on
the cheap. Thats the problem. Its oversimplified. This section addresses
whether testing reforms, particularly high stakes testing, are as cost-effective as
proponents such as Popham (1987) claim.
According to Popham (1987), If properly conceived and implemented,
measurement-driven instruction currently constitutes the most cost-effective way of
improving the quality of public education in the United States (p. 679). A properly
conceived and constructed test should be criterion-referenced, with defensible content
containing a manageable number of targets designed for instructional illumination. He also
provided for ample instructional support so that educators can make the best use of the
tests, which he sees as vehicles of instructional clarification (p. 681). It
is the task of the persons who design the tests to ensure that the content will drive both
instruction and the curriculum. These things can all be accomplished without other
elements of reform, such as better curriculum materials, or better paid and trained staff:
if we were able to replace mediocre instructional materials with more potent, empirically
proven alternatives, then pupils would surely benefit. Similarly, if we were to infuse
into our current teaching force a host of well-paid, highly skilled teachers, we could
surely expect major educational dividends. But such strategies, though they are surely
effective, are very costly (p. 679).
Can a test be both properly conceived and constructed and still be less costly
than other education reforms? Anderson (1977) brought up the subject of hidden costs of
high stakes testing in his background paper prepared for the Minimal Competency Workshops
sponsored by the Education Commission of the States and the National Institute of
Education. Many of the hidden costs that Anderson outlined have yet to be taken into
account when evaluating the testing programs of today. These include such things as test
development, test administration, development and maintenance of regulatory mechanisms
(bureaucracies), and compensatory programs to bring students who do not pass the minimum
competencies up to the current standards. Anderson saw this last area, remediation, as
containing the greatest hidden cost, and the evidence supports his prediction.
Potter and Wall (1992), in conducting a study of South Carolinas minimum competency
testing program, found modest gains in student performance on minimum competency tests.
Besides evidence mentioned before, that these gains may be attributable to factors other
than actual learning (such as keeping children back a grade in order to avoid testing them
for another year), the remediation efforts to raise the test scores had cost the state
over $500 million. After spending that much money, the state still had no real evidence
that the remediation had done much good.
Foshee, Davis, and Stone (1991) found that many students who in the past had failed a
minimum competency test had subsequently passed without remediation. The study calls into
question the cost-effectiveness of providing remediation to students who do not do well on
a minimal competency exam.
Singer and Balow (1987), in evaluating Californias proficiency law since 1980, found
that students did better in regular English class than in remedial classes when retaking a
minimum competency test. They were concerned by a shift in resources directed at students
who were not doing well on the test, reducing the availability of advanced courses.
The evidence suggests that remediation programs not only are expensive, but often
ineffective as well. The actual costs of state-administered high stakes tests are not
known, especially as compared to the costs of other educational reforms.
Who Benefits the Most from High Stakes Testing?
Anderson (1977) predicted that remediation programs
would be too expensive, and suggested rewarding schools that perform well on the high
stakes tests. Compensatory education programs, he believed, were emerging in the wrong
direction, providing substantial incentive for schools to do poorly in order to get
increased funds for remediation. Rewarding schools for good performance, however, has the
added problem of diverting funds to schools that are already financially better off and
doing well rather than giving the funds to schools that have greater needs. Anderson
(1977) recommended that states work to reduce the financial inequities that exist between
school districts as part of the overall effort to impose high stakes testing.
Socioeconomic status can have an impact on who benefits from high stakes testing. Tuma and
Gifford (1995) found that higher graduation standards (not necessarily accompanied by high
stakes testing) had affected the number of academic courses that college-bound students
were taking, but had not affected students who were not college-bound. Potter and Wall
(1992) found that overage students who had been withheld a grade prior to years where
testing was to occur were more likely to be male and nonwhite. Reardon (1996) found that
high stakes tests caused increased dropout rates only in schools with lower socioeconomic
status. Herman and Golan (undated) found that:
Correlations show that socioeconomic status is significantly and negatively related to the
following: school attention to test scores, teachers attention to testing and
planning their instruction, and overall time devoted to test preparation activities . . .
testing is more influential and exerts stronger effects on teaching in schools serving
more disadvantaged students. (p. 57-58)
This study was not able to determine whether the increased effects on schools with lower
SES resulted in better performance on the tests, so it is difficult to interpret these
consequences as positive or negative.
Not all studies found a relationship between SES and performance on high stakes tests.
Corbett and Wilson (1991) found that SES played a surprisingly weak role in
explaining differences between districts.
It is very difficult to determine from the available evidence whether high stakes testing
has different effects on students with low or with high SES. The evidence does suggest,
however, that we need to take a closer look at this question to make sure that we are not
using high stakes testing as a means to further inequities that already exist in our
educational system.
What About Portfolio and Authentic Assessments?
Portfolio and authentic assessments have been gaining
popularity (Bond et al., 1996). Portfolio assessments are in-depth looks into a
students learning histories. They might include all of the assessments that the
students take, as well as examples from their classroom work and other evidence of
learning. Authentic assessments are designed to gain an in depth look at the
students performance level using tasks that are instructionally relevant to the
child, and based on tasks that would normally be expected as part of a curriculum. Only
recently have these types of assessments been used by states for high stakes purposes.
These tests have many advantages over the usual multiple choice exam. They give more
information about students, and are potentially more useful to teachers, and they measure
higher order skills that are more difficult to assess with traditional paper and pencil
tests. Portfolio and authentic assessments have several imposing disadvantages, however.
In particular, they take more time to develop and implement. In addition, someone has to
judge the students responses and determine whether they meet the educational
standards. The reliability of such judgments on a large-scale assessment program has yet
to be established (Shepard, 1992).
Vermont was one of the first states to use portfolio assessments on a large-scale basis.
According to Koretz, McCaffrey, Klein, Bell, and Stecher (1993), the intent of the
assessment was to encourage high standards and good educational practices while
maintaining local autonomy. Koretz et al. (1993) evaluated the 1992 Vermont Portfolio
Assessment program and found disappointing reliability coefficients ranging from .33 to
.43 (see Table 3).
| Table 3. Reliability Coefficients for the 1992 Vermont Portfolio Assessment | ||
| Grade 4 | Grade 8 | |
| Mathematics Best Pieces | .33 | .33 |
| Writing Best Piece | .35 | .42 |
| Writing Remainder | .34 | .43 |
From Koretz et al. (1993)
According to Koretz et al. (1993):
The Vermont portfolio program faces substantial hurdles because of the unreliability of
scoring documented here. Rater reliability is low enough to undermine the utility of 1992
scores for comparing groups of students (schools, districts, or other groups). Even when
scores are aggregated enough to produce estimates with small measurement and sampling
errorfor example, statewide reporting of average scoreslow reliability
threatens the usefulness for gauging trends in performance over time, because it remains
uncertain how an increase in the reliability of ratings will affect the distribution of
scores even if true performance remains constant. (p. 18)
Teachers, however, reported that they liked the portfolio assessments and thought that
they were a valuable tool in gauging student progress, and many schools had expanded the
portfolio program beyond the grades required by the state (Koretz et al. 1993, p. 1-2).
An even more ambitious testing effort to assess student learning for purposes of school
accountability took place in England. Torrance (1993) looked at the effects of an
authentic assessment program in Great Britain. The assessments used in the U.K. are
designed to accompany a National Curriculum. The goal is to assess all children in the
National Curriculum subjects at the ages of 7, 11, 14 and 16 through teacher assessment of
course work (TA) and standard assessment tasks (SATs). Results are assembled
into individual scores with 10 levels of achievement, with expected progress at one level
per two years of schooling. Scores are reported to parents by subject and to the public by
school. At the time of this study, implementation of this plan had begun only with the
younger children (seven-year-olds).
The aspiration was to produce tasks that would contribute to a system of assessment that
was both formative and evaluativeaiding the learning of pupils by providing detailed
information to teachers, and . . . providing directly comparable data on pupil and school
achievement. At the same time, the tasks were meant to be as valid and user
friendly as possible . . . while also guiding the implementation of the National
Curriculum through exemplifying what it should look like in action (examples of tasks
include rolling toy cars down slopes of different gradients to investigate how far they
travel and why; using dice in the context of a simple game to test computational skills;
drawing and labeling a poster to illustrate how and why things grow). Thus, the 1990 pilot
. . . was launched with the intention of piloting tasks that mirrored good primary
education practice while at the same time yielding formatively useful information and
comparable summative results. (p. 83)
The evaluation of this pilot project attempted to get responses from teachers in an
informal, open-ended survey. Very few of the participating teachers responded. Of those
who did respond, most said that they were so exhausted and disillusioned with the
whole business that they very nearly did not get in touchthey could hardly face
spending yet more time on an experience they would rather forget (p. 84).
The major complaint that teachers made in their responses was about their increased
workload. According to Torrance, the most commonly reported figure was two to three
hours of extra work every evening for marking, record keeping, gathering resources, and
planning the next days work, plus six hours each weekend. All this was in addition
to a full days work in school beginning well before and ending well after the
childrens school day (p. 85). This drain on teacher time had indirect effects
on the rest of the school. Teachers of the seven-year-olds did not have time for
assemblies, playground duties and other activities, leaving the burden of these activities
to the other teachers in the school. Teachers reported that relationships with parents and
students also were affected since teachers no longer had the time to welcome parents into
the classroom, or offer extra help to students outside of class time. Other classroom
duties were neglected as well, such as changing reading books and maintaining regular
classroom routines. While students were engaged in the individual and small group tasks,
the students who were not being assessed showed signs of boredom and stress from being
ignored and subjected to busy work. Torrance asserted, It is clear,
then, that teachers did not plan their curriculum, initiate a range of activities, and go
on to assess in an opportunistic and naturalistic fashion, as some of the
claims for classroom authenticity might lead us to believe. Rather, teachers treated
assessment as a special activity, set apart from teaching, and they felt obliged to do
this by the instructions they received (p. 85).
Two major complaints about the implementation of the assessment program emerged from the
survey. One was that the assessments contained so much new material that teachers felt
deskilled and overly dependent. The second was that the materials were too
specific to be used in a naturalistic way, yet the specificity was deemed necessary in
order to use the data for comparative purposes. Teachers reported that the materials were
not flexible enough to expand upon when students showed genuine interest, and many
students were anxious about the results rather than interested in the learning.
The assessment was trimmed back in 1991 and 1992, establishing what the author
saw as a movement back toward paper-and-pencil tests. It is possible that we will see the
same trend in the United States if testing programs have difficulty establishing reliable
assessments.
One can easily criticize Torrancecertainly the self-selected nature of the survey
could have been very skewed; however, on April 17, 1996, Education Week reported that the
national assessments resulted in a major labor dispute in which teachers refused to
administer the tests, which were seen as cumbersome and unwieldy (Bradley, 1996).
What are the Costs Versus the Benefits of High-Stakes Testing for Students with
Disabilities?
Researchers need to look not only at the inequalities
that exist for students of different SES, but they also need to look at those that exist
for students with and without disabilities. As mentioned before, students with
disabilities stand a great chance of failing minimum competency tests, even when they are
given accommodations such as increased time to take the tests, interpreters, or large
print versions (McKinney, 1983; Safer, 1980). Bergquist et al. (undated) mentioned the
need to provide increased supports for students with disabilities not only to do well on
the tests, but also to maintain an appropriate program incorporating their need for
vocational as well as academic subjects.
Directions
Calculating the true costs of implementing statewide
high stakes tests can be a daunting task. It would be useful for evaluators and
researchers to have a framework to estimate the costs of testing programs, and to then
balance these against the expected and realized benefits of the programs.
We also need to think more about inclusion of students with disabilities when developing
any testing program. Portfolio assessments may make it easier for students with
disabilities to participate in statewide testing programs, but it is still not clear
whether this type of assessment can have the reliability needed to be used for statewide
comparisons, graduation requirements, or other high stakes purposes, and the costs of such
programs may be prohibitive.
Recommendations and Conclusions
Special educators need to become more involved in development, implementation and
evaluation of high stakes testing programs. The inclusion or exclusion of students with
disabilities is a problem salient to both regular and special education researchers. The
following recommendations are just some of the many research and evaluation questions that
should be taken into account in future investigations:
Focus on the effects of high stakes testing on students rather than on schools and
systems.
Assess the effects of high stakes testing on the curriculum for both special and
regular education.
Assess the effects of high stakes testing on the dropout rates for both special and
regular education.
Study the effects of high stakes testing programs on students who are excluded from
testing.
Develop assessments that are more inclusive of students with disabilities (for
example, including students with disabilities in state norming samples, and norming tests
with some students using common accommodations such as extended time and Braille).
Study the effects of high stakes testing on the relationship between regular and
special education.
Develop a framework for evaluating the costs versus the benefits of high stakes
testing programs, particularly for alternative and authentic assessments.
The research on high stakes testing is inconclusive and results vary with the type of
research questions asked, and the types of tests examined. The evidence suggests that
teachers change the curriculum based on the tests, concentrating time and effort teaching
to test contents and format. For students with disabilities, this may mean that less time
is devoted to their vocational and other nonacademic needs, even though they are less
likely to pass the test, even when given accommodations. The effects on student learning
are largely unknown, but the evidence does suggest that increasing test scores themselves
do not serve as evidence that students are learning more. Test scores can become inflated
through teaching to the test, excluding or excusing students who may not perform well, and
through increased dropout rates. High stakes testing seems to have a negative effect on
the attitudes and workloads of teachers, but little is known about the effects on students
themselves. States still do not take into account the full costs of high stakes testing
programs, and claims that testing alone can cause major educational improvements have not
been proven. Authentic and portfolio assessments hold promise, but it has not been
established that these can be done well, efficiently and with sufficient reliability to be
used as a large-scale comparative assessment method or for high stakes purposes like
graduation.
The effects of high stakes testing on students with disabilities are harder to determine.
Students with disabilities have not performed well on minimum competency tests, and it is
unclear what effects other types of tests have had on their educational outcomes, choices
and futures. The needs of students with disabilities have been largely overlooked in the
development and evaluation of statewide test-based reforms.
We do not advocate, as Corbett and Wilson (1991) suggest, that we abandon the practice of
testing for educational outcomes altogether. It is our belief that schools need to be
responsible to parents, taxpayers, and the community at large, for student outcomes. The
evidence we have reviewed in this paper supports this position by demonstrating that
exclusion of some students leads to abuses and statistical problems that make
interpretation of test scores dubious at best. The evidence also tells us that testing
alone is not a sufficient mechanism for effective school reform. The most properly
conceived and implemented tests can only be effective if they support clear standards,
solid curricula, committed and well-trained teachers, in schools with the resources and
supports necessary to provide a world-class education to all students. Without these
things, scores may continue to rise, but there is no reason to believe that our children
will be any better educated or prepared for life than they were before.
References
Allington, R., & McGill-Franzen, A. (1992a). Improving school effectiveness or
inflating high-stakes test performance? ERS Spectrum, 10(2), 3-12.
Allington, R., & McGill-Franzen, A. (1992b). Unintended effects of reform in New York.
Educational Policy, 6 (4), 397-414.
Anderson, B.D. (1977, September). The costs of legislated minimal competency requirements.
A background paper prepared for the Minimal Competency Workshops sponsored by the
Education Commission of the States and the National Institute of Education. (ERIC Document
Reproduction Services No. ED 157 947)
Berger, N. & Elson, H.H. (1996, April). What happens when MCTs are used as an
accountability device: Effects on teacher autonomy, cooperation and school mission. Paper
presented at the annual meeting of the American Educational Research Association, New
York.
Bergquist, C.C., Elzie, B., & Groves, L. (Undated). Evaluation of the impact and
effectiveness of recent changes in Floridas graduation and competency test standards
on the educational opportunities provided handicapped students. Paper presented at the
First Annual Regional Conference of the Southeast Evaluation Association, Tallahassee, FL.
Bond, L.A., Roeber, E., & Braskamp, D. (1996). Trends in statewide student assessment.
Washington, DC: Council of Chief State School Officers and NCREL.
Bradley, A. (1996, April 17). English reforms may offer model, report says. Education
Week, p.5.
Catterall, J.S. (1987). Standards and school dropouts: A national study of the minimum
competency test (CSE Technical Report 278). Los Angeles, CA: National Center for Research
on Evaluation, Standards, and Student Testing (CRESST).
Chin-Chance, S.A., Gronna, S.S., & Jenkins, A.A. (1996, March). Assessing special
education students in a norm referenced statewide testing program: Hawaii State Department
of Education. Paper presented at the meeting of the State Collaborative on Assessment and
Student Standards (SCASS) Assessing Special Education Students, Washington, DC, sponsored
by the Council of Chief State School Officers (CCSSO) March, 1996.
Corbett, H.D., & Wilson, B. (1990). Unintended and unwelcome: The local impact of
state testing. Paper presented at the Annual Meeting of the American Educational Research
Association, Boston.
Corbett, H.D., & Wilson B.L. (1991) Testing, reform, and rebellion. Norwood, NJ:
Ablex.
Foshee, D.P., Davis, M.A., & Stone, M.A. (1991). Evaluating the impact of
criterion-referenced measurement on remediation decisions. Remedial and Special Education
(RASE), 12(2) 48-52.
Griffin, B.W., & Heidorn. (1996). An examination of the relationship between minimum
competency test performance and dropping out of high school. Educational Evaluation and
Policy Analysis, 18(3), 243-252.
Grossman, P.L., Kirst, M.W., & Schmidt-Posner, J. (1986). On the trail of the
omnibeast: Evaluating omnibus education reforms in the 1980s. Educational Evaluation
and Policy Analysis, 8(3), 233-266.
Hendrie, C. (1996, October 9). 109 Chicago schools put on academic probation. Education
Week.
Herman, J.L., & Golan, S. (Undated). Effects of standardized testing on teachers and
learninganother look (CSE Technical Report 334). Los Angeles, CA: National Center
for Research on Evaluation, Standards, and Student Testing (CRESST), University of
California.
Hodgkinson, H.L. (1992, June). A demographic look at tomorrow. Washington, DC: Institute
for Educational Leadership, Center for Demographic Policy.
Hodgkinson, H.L. & Outtz, J.H. (1992, December). The nation and the states: A profile
and data book of Americas diversity. Washington, DC: Institute for Educational
Leadership, Center for Demographic Policy.
Koretz, D.M., Linn, R.L., Dunbar, S.B., & Shepard, L.A. (1991). The effects of
high-stakes testing on achievement: Preliminary findings about generalization across
tests. Presented in R. L. Linn (Chair), Effects of High-Stakes Educational Testing on
Instruction and Achievement symposium presented at the annual meeting of the American
Educational Research Association and the National Council on Measurement in Education,
Chicago.
Koretz, D., McCaffrey, D., Klein, S., Bell, R., & Stecher, B. (1993). The reliability
of scores from the 1992 Vermont portfolio assessment program (CSE Technical Report 355).
Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student
Testing (CRESST), University of California.
Madaus, G. (1988). The influence of testing on the curriculum. In L. Tanner (ed.),
Critical issues in curriculum: 87th Yearbook of the NSSE Part 1. Chicago, IL: University
of Chicago Press (ERIC Document Reproduction Service No. 263 183).
Mangino, E., Battaile, R., & Washington, W. (1986). Minimum competency for graduation
(Publication Number 84.59). Austin, Texas: Austin Independent School District. (ERIC
Document Reproduction Service No. 263 183)
Manzo, K.K. (1996, October 30). Phila. plan links student achievement, teacher pay.
Education Week.
McKinney, J.D. (1983). Performance of handicapped students on the North Carolina minimum
competency test. Exceptional Children,49, 547-550.
Morris, D.R. (1991, April). Structural patterns and change in grade retention rates: An
aggregate analysis of data from a large urban school district, 1982-1989. Paper presented
at the American Educational Research Association Conference, Chicago.
Mullis, I.V.S., Dossey, J.A., Campbell, J.R., Gentile, C.A., OSullivan, C., &
Latham, A. (1994). NAEP 1992 trends in academic progress. Washington, DC: National Center
for Education Statistics (NCES).
Office of Technology Assessment (OTA). (1992). Testing in American schools: Asking the
right questions (summary report). Washington, DC.
Olson, J.F. & Goldstein, A.A. (1996). Increasing the inclusion of students with
disabilities and limited English proficient students in NAEP. Focus on NAEP
prepublication copy. Washington, DC: National Center for Education Statistics (NCES).
Popham, W.J. (1987). The merits of measurement-driven instruction. Phi Delta Kappan, 68
(9), 679-682.
Potter, D.C., & Wall, M.E. (1992, April). Higher standards for grade promotion and
graduation: Unintended effects of reform. Paper presented at the annual meeting of the
American Educational Research Association, San Francisco.
Reardon, S. (1996, April). Eighth grade minimum competency testing and early high school
dropout patterns. Paper presented at the annual meeting of the American Educational
Research Association, New York.
Rodgers, N., Paredes, V., & Mangino, E. (1991, April). High stakes minimum skills
tests: Is their use increasing achievement? (ORE Publication Number 90.25). Paper
presented at the annual meeting of the American Education Research Association, Chicago.
(ERIC Document Reproduction Service No. ED 336 422).
Rottenberg, C., & Smith, M.L. (1990, April). Unintended effects of external testing in
elementary schools. Paper presented at the annual meeting of the American Educational
Research Association, Boston.
Safer, N.D. (1980). Implications of minimum competency standards and testing for
handicapped students. Exceptional Children, 46, 288-290.
Salganik, L.H. (1985, May). Why testing reforms are so popular and how they are changing
education. Phi Delta Kappan, 66(9), 607-610.
Shepard, L.A. (1990). Inflated test score gains: Is it old norms or teaching
the test? (CSE Technical Report 307). Los Angeles, CA: University of California, Center
for Research on Evaluation, Standards, and Student Testing.
Shepard, L.A. (1992). Will national tests improve student learning? (CSE Technical Report
342). Los Angeles: University of California, Center for Research on Evaluation, Standards,
and Student Testing.
Shepard, L.A., & Dougherty, K. (1991, April). Effects of high-stakes testing on
instruction. Paper presented at the annual meeting of the American Education Research
Association and the National Council on Measurement in Education, Chicago.
Sinclair, M., Christenson, S., Thurlow, M. & Evelo, D. (1994). Are we pushing students
in special education to drop out of school? Research Policy Brief, 6 (1). Minneapolis, MN:
University of Minnesota, Research and Training Center on Residential Services and
Community Living.
Singer, H., & Balow, I.H. (1987). Proficiency assessment and its consequences.
Riverside, CA: University of California. (ERIC Document Reproduction Service No. ED 290
127)
Thurlow, M.L., Ysseldyke, J.E., & Anderson, C.L. (1995, May). High school graduation
requirements: Whats happening for students with disabilities? (Synthesis Report 20).
Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Torrance, H. (1993). Combining measurement-driven instruction with authentic assessment:
Some initial observations of national assessment in England and Wales. Educational
Evaluation and Policy Analysis, 15(1), 81-90.
Tuma, J., & Gifford, A. (1995). Higher graduation standards and their effect on the
course-taking patterns of college- and non-college-bound high school graduates, 1969-1987.
Berkeley, CA: MPR Associates (ERIC Document Reproduction Service No. ED 318 767).
Vitello, Camilli, & Molenaar. (1987). Performance of special education students on a
minimum competency test. Diagnostique, 13 (1), 28-35.
Walstad, W.B. (1984, May/June). Analyzing minimal competency test performance. Journal of
Educational Research, 77(51) 261-266.
Zlatos, B. (1994). Dont test, dont tell: Is academic red-shirting
skewing the way we rank our schools? The American School Board Journal, 181(11) 24-28.
Appendix
Summary of Literature Reviewed for This Report
Effects on Curriculum and Instruction
| Study | Type of Test | Method | Subjects | Results |
| Berger &
Elson(1996). What happens when MCTs are used as an accountability device: Effects on teaching autonomy, cooperation and school mission (also in Attitudes and School Climate) |
Graduation exit exam | Survey. Compared responses of teachers in high stakes programs (graduation exit exam) to teachers in low stakes programs. | National Schools and Staffing Survey (SASS), Dept. of Ed. 1987. Representative national sample of teachers from low and high-stakes states. | Loss of teacher autonomy; increased clarity of mission; no effect on teacher cooperation; increased emphasis on basics. |
| Bergquist, Elzie
& Groves (undated). Evaluation of the impact and effectiveness of recent changes in Florida's graduation and competency test standards on the educational opportunities provided handicapped students. |
Increased standards and graduation requirements. Did not report on the effects of the graduation exit exam. | Program evaluation. Student records, surveys, and qualitative observations. | 93 onsite visits statewide. 300 interviews from all levels of system. Reviewed more than 100 student records. Surveyed 500 knowledgeable people. Onsite visits to nine representative districts. | Students with disabilities had difficulty meeting the higher standards and incorporating nonacademic subjects into the curriculum. |
| Grossman, Kirst,
& Schmidt-Posner (1986). On the trail of the omnibeast: Evaluating omnibus education reforms in the 1980s. |
California education reforms including increased graduation requirements, increased college entrance requirements. | Examined course offerings from 1982 to 1985. | Sampled course offerings in 200 school districts statewide. | Increased offerings in academic areas, especially math, science, and advanced placement; decreased offerings in industrial arts, home economics and business ed. |
| Herman & Golan
(undated). Effects of standardized testing on teachers and learning - another look. |
Standardized testing in nine states. | Survey | Teachers chosen in matched pairs from demographically similar schools that had shown substantial improvement on standardized tests and those which had not. Total of 24 pairs. | Substantial time and effort devoted to testing; pressure from schools to improve scores; modification of curriculum; greater impact in lowest SES schools. Gains not clear as to real improvement or teaching to the test. |
| Rottenberg &
Smith (1990). Unintended effects of external testing in elementary schools (also in Attitudes and School Climate). |
ITBS used for evaluation of principals and making curriculum decisions. Media coverage of test scores. | Interviews with teachers. Dual case study design. | 19 teachers in two school districts. Observation of four local teachers. | Negative effects on the attitude of both teachers and students. Reduced time for ordinary instruction. |
| Shepard (1990).
"Inflated test score gains." Is it old norms or teaching to the test? |
All | Survey of state directors of testing regarding narrowing of the curriculum and teaching to the test. | State testing directors in the 46 states that had testing programs. | Fournd instances of teaching to the test. Could not quantify how teaching to the test had affected test scores. Recommended new tests every year. |
| Shepard &
Dougherty (1991). Effects of high stakes testing on instruction. |
Standardized test (two states, unspecified). Third, fifth, and sixth grade testing programs. | Survey | Teachers in districts with high stakes testing. 360 teachers responded (42% return rate). | Greater emphasis on basic skills; neglect of nontested material; inordinate time on test preparation through-out year; clearer instruction goals. |
Effects on Student Learning
| Study | Type of Test | Method | Subjects | Results |
| Allington & McGill-Franzen
(1992a). Improving school effectiveness or inflating high-stakes test performances? |
Third grade testing in reading and mathematics. | Examination of student records using students' scores by kindergarten cohort rather than grade. | Seven schools (low, moderate and high poverty levels). 11 - 70 subjects in each school. Total subjects not reported. | Data suggests that reported test score gains result from retention and referral to special education. |
| Allington & McGill-Franzen
(1992b). Unintended effects of educational reform in New York. |
Third grade testing in reading and mathematics. | Analyses of trends in special education referral and grade retention. | Seven school (low, moderate and high poverty levels). 11 - 70 subjects in each school. Total subjects not reported. | Data suggest increasing trends in special education referral and grade retention during increased stakes. |
| Catterall (1987). Standards and school dropouts: A national study of the minimum competency test. |
Graduation exit exam. | Interviews with educators and students in four states with graduation exit exams. | 736 students. Six state test directors. 13 district test coordinators. 18 school principals. 21 school counselors. | Professionals did not believe that the exit tests had affected dropout rates. Students, while supportive of testing in general reported more effects on dropout rates. Found association between competency test failure and reduced beliefs that the student would finish school. |
| Corbett & Wilson (1991). Testing reform and rebellion (also in Attitudes and School Climate). Corbett & Wilson (1990). Unintended and unwelcome: The local impact of state testing. |
Graduation exit exam (Maryland). Exam to identify students in need of remediation (Pennsylvania). | Compared high-stakes state (MD) to low stakes (PA). Qualitative interviews, extended site visits, and surveys. Twelve sites visited (six from each state). | Over 250 educators. | Greater impact in high stakes state; narrowed curriculum; greater emphasis on basic skills; neglect of nontested subjects; increased preparation time for tests; increased clarity of educational goals. |
| Koretz, Linn, Dunbar, &
Shepard (1991). The effects of high-stakes testing on achievement. Preliminary findings about generalization across tests. |
Third grade high-stakes. Specific test and location not revealed. | Compared scores on high-stakes test to other tests of achievement in the same academic areas. | Third grade student in large, high-poverty district. 620 - 753 students. | Little correlation between the high stakes test and other tests of achievement, suggesting that reported improvements in test scores were due to learning specific to test rather than general academic skills. |
| Mangino, Battaile, &
Washington (1986). Minimum competency for graduation. |
Graduation exit exam. | Analyzed data from the Texas graduation exit exam 1984 - 1985. | 184 students in reading and 115 in math (total number of students receiving waivers). | Problem areas found, including taking the test many times (up to 4), waivers, special education exemptions. |
| Morris (1991). Structural pattern and change in grade retention rates: An aggregate analysis of data from a large urban school district, 1982 - 1989. |
Test used initially to identify weak students, later used to identify weak programs (from low to high stakes). | Analyzed data from Florida's minimum competency testing programs. | All students taking the exam. | Increased retention rates due to school restructuring and increased standards. |
| Potter and Wall (1992). Higher standards for grade promotion and graduation: Unintended effects of reform. |
Graduation exit exam, grade promotion. | Longitudinal study of effects of SC graduation, promotion exams, 1985 - 91. | All students taking SC exams. | Students were more likely to be retained (were overage for their grade) as a result of high-stakes testing. |
| Griffin & Heidorn (1996). An examination of the relationship between minimum competency test performance and dropping out of high school. |
Graduation exit exam. | Examined the effects of graduation exit exam on dropout rates in Florida. | Cross-sections, random sample of students in high school from 14 school districts, grades 10, 11, and 12. N=76,664 students in 75 high schools. | Students who did not pass the MCT test were not more likely to drop-out, regardless of SES, other factors. Exception: students with high GPAs who failed the exam were more likely to drop out. |
| Reardon (1996). Eighth grade minimum competency testing and early high school dropout patterns. |
Promotion from eighth grade to ninth grade. | Examined data from National Educational Longitudinal Survey (NELS). | Focused on MCT in eighth grade. Nationwide representative sample. | Increased dropout rates in programs with eighth grade MCT. |
| Walstad (1984). Analyzing minimal competency test performance. |
Test used to gauge district/student performance. | Used an educational production function model based on prior research and two years of MCT data. Surveys of district administrators. | District administrators. Used data from all students taking MCT in state. | Only pretesting (practicing the test) had a significant effect on student performance. Changes in curricula and teacher training had no effect on scores. |
| Chin-Chance, Gronna & Jenkins
(1996). Assessing special education students in a norm referenced statewide testing program. |
SAT8 (Stanford Achievement Test 8). Used for district comparisons. | Developed separate norms, percentile ranks for students with disabilities. | Students with disabilities (all taking exam). | Were able to develop separate percentile ranks. While students without disabilities, difference scores showed students with disabilities improved as much or more than students without disabilities. |
| McKinney (1983). Performance of handicapped students on the North Carolina minimum competency test. |
Graduation exit exam. | Analyzed data on how students with disabilities performed with and without accommodations (not under control of experimenter). | 3,043 students with disabilities. | Students with disabilities performed poorly on the tests, even when given accommodations. Persons with mild MR receive the most benefit from accommodations but were nevertheless the least likely group to pass the test. |
| Safer (1980). Implications of minimum competency standards and testing for handicapped students. |
Graduation exit exam. | Examined scores of students with disabilities taking graduation exit exam. | All students with disabilities taking the exam in 1977. | Students with disabilities were not likely to pass the exam. |
| Vitello, Camilli & Molenaar
(1987). Performance of special education students on a minimal competency test. |
Graduation exit exam. | Analyzed scores of students with disabilities taking the NJ proficiency exam. | All 4,299 students with disabilities who took exam 1986 - 87 (40% of the ninth grade handicapped population). | Students with disabilities were not likely to pass. |
Effects on Attitudes and School Climate
| Study | Type of Test | Method | Subjects | Results |