Universal Design Online Manual
Christopher Johnstone • Jason Altman • Martha Thurlow • Michael Moore
September 2006
All rights reserved. Any or all
portions of this document may be reproduced and distributed
without prior permission, provided the source is cited as:
Johnstone, C., Altman, J.,
Thurlow, M., & Moore, M. (2006). Universal design online
manual. Minneapolis, MN: University of Minnesota, National Center
on Educational Outcomes.
Introduction
The No Child Left Behind Act of 2001 and other recent changes
in federal legislation have placed greater emphasis on
accountability in large-scale testing. Previously exempt
students, many with disabilities, now must be included,
monitored, and reported by all states. Because large-scale
assessments have such high stakes, it is important to ensure
that assessments are an accurate measure of the knowledge and
skills of ALL students. To ensure that tests are designed from
the beginning with accessibility in mind,
Thompson, Johnstone,
and Thurlow (2002) developed seven Elements of
universally designed assessments, based on research from
a variety of fields.
By the end of this manual you will better understand how the
following design considerations improve testing for all
students:
- Providing inclusive assessment populations
- Measuring what they are intended to measure
- Reducing bias to a minimum
- Having clear and understandable instructions and procedures
- Ensuring amenability to accommodations
- Having comprehensible language
- Being legible.
This tool
outlines steps that states can take to ensure universal design
of assessments. The recommendations can be used for both
computer and paper-based assessments. The National Center on
Educational Outcomes (NCEO) recommends that states follow the
steps provided in chronological order. Including any step in the
design and review of tests may improve the design features of a
state assessment. This online document is accompanied
by a more detailed "How-To" manual. See
A State Guide to the Development of Universally Designed
Assessments.
Overall Universal Design Principles
- Universally designed assessments DO NOT change the standard
of performance – they are not watered down or made easier
for some groups.
- Universally designed assessments are not meant to replace
accommodations. Even when the elements of universal design are
incorporated in assessment design, accommodations may
still be needed for some students in the areas of
presentation, response, setting, timing, and scheduling.
- There is a reason we are calling the ideas we will be working with
“considerations.” They are steps that should be considered
when developing an assessment. They should be talked about
openly, and decisions should be made after weighing the pros
and cons of different design elements.
- In addition to English language learners, ALL students benefit
from having more accessible tests.
Step 1: Ensure the Presence of Universal Design in RFPs
All students must have the opportunity to demonstrate their
achievement of the content standards. Therefore, to satisfy
state and federal requirements for universal design, vendors
should design state tests that allow the maximum number of
students possible (and students with diverse characteristics) to
take the same assessments without threat to the validity and
comparability of the scores.
To this end, vendors must demonstrate how they will develop
"universally designed assessments." Such assessments are
designed from the beginning to allow participation of the widest
range of students and result in valid inferences about the
performance of all students, including students with
disabilities, students with limited English proficiency, and
students with other special needs.
Step 2: Review Teams
Once the assessment is designed and in a format suitable for
previewing, it is important for states to let sensitivity review
teams examine the assessment (in the format in which students
will see the test). Reviews by these teams are common practice
in states, and are often encouraged by test vendors. When
creating bias and content review teams, it is important to
involve members of major language groups, disability groups, and
support groups. Grade level experts, representatives of major
cultural and disability groups, researchers, and teaching
professionals all make up an effective review team.
Reviewers will need the following information to perform a
careful and comprehensive review:
- Purpose of the test, and content standard tested by each
item
- Description of test takers (e.g., age, geographic
region)
- Field test results by item and subgroup
- Test instructions
- Overall test and response formats
- Use of technology
- State accommodation policies
Bias and design issues may arise in test development and are
not problematic if caught by review teams. Sensitivity reviewers
are charged to flag items that may cause “problems” for certain
subgroups, where the “problems” are due to their subgroup status
rather than their knowledge of the content. An efficient way to
“flag” items is to use a review sheet, which provides reviewers
an opportunity to mark potential issues with items, thus
providing opportunities for further discussion among reviewers.
By using a structured form, reviewers are more likely to provide
specific feedback to test vendors. Such feedback allows for
items to be re-examined for design issues, rather than (as is
often the case) summarily rejected for unclear reasons. When
using structured forms, item reviewers then create a “win-win”
situation for advocates and vendors. In other words, they are
able to give test vendors specific information about what may be
an issue in items. Vendors can then determine whether changes
can be made to items without having to remove items from item
banks entirely. Item-specific review and whole test review
forms can be used for item reviews.
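As an illustration of how a structured review sheet can yield specific feedback, the sketch below records each flag as a small record and groups reviewer comments by item. The field names (item_id, reviewer, consideration, comment) and the sample entries are hypothetical; they are not taken from NCEO's review forms.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Flag:
    """One entry on a hypothetical review sheet (field names are illustrative)."""
    item_id: str
    reviewer: str
    consideration: str  # e.g., "bias", "language", "legibility"
    comment: str


def summarize_flags(flags):
    """Group reviewer comments by item, then by the consideration flagged."""
    summary = defaultdict(lambda: defaultdict(list))
    for flag in flags:
        summary[flag.item_id][flag.consideration].append(
            f"{flag.reviewer}: {flag.comment}"
        )
    # Convert to plain dicts so the result is easy to print or export.
    return {item: dict(by_issue) for item, by_issue in summary.items()}


# Invented example data, for illustration only.
sheet = [
    Flag("item_07", "Reviewer A", "language", "Idiom may confuse English language learners."),
    Flag("item_07", "Reviewer B", "language", "Sentence is long; consider splitting."),
    Flag("item_12", "Reviewer A", "legibility", "Graph labels are too small."),
]

for item, issues in summarize_flags(sheet).items():
    print(item, issues)
```

Grouping by item and consideration keeps each concern attached to a concrete comment, which is the kind of detail a vendor would need to revise an item rather than drop it.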
Step 3: Using Think Aloud Methods to Analyze Flagged and Unflagged Items
In an effort to validate the findings of experts, a series of
items can be examined by students themselves using cognitive
lab, or think-aloud methods.
Think aloud methods were first used in the 1940s and have
since been used for a variety of “end user” studies in the
fields of ergonomics, psychology, and technology. In the case of
statewide assessments, the end users are students who will take
tests. Think aloud methods tap into the short-term memory of
students who complete assessment items while they verbalize. The
utterances produced by students are the data that researchers or
states can use to better understand items.
The verbalizations produced in think aloud studies provide
excellent information because they are not yet in the long-term
memory. Once experiences enter our long-term memory, they may be
tainted by personal interpretations. Therefore, an excellent way
of determining whether design issues really do exist for
students is to have students try out items themselves in “live
time.”
NCEO typically videotapes all think aloud activities, but
states can also either audiotape or have several observers
review field notes. Inter-rater agreement is important for
making decisions based on think aloud activities, so some
strategy for confirming what is viewed or heard during think
aloud activities should be undertaken. In addition, it is useful
to include students who achieve at a variety of levels on
statewide achievement tests. To this end, a sample population
might include students without disabilities and majority culture
children as well as students with disabilities, English language
learners, and students of low socioeconomic status. A recent NCEO
research report on think aloud methods can be accessed at
http://education.umn.edu/nceo/OnlinePubs/Tech44/
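One possible strategy for confirming inter-rater agreement is to have two observers code the same sessions independently and compute Cohen's kappa, which corrects raw agreement for agreement expected by chance. The manual does not prescribe a particular statistic, so the choice of kappa, the function name, and the code labels below are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters coding the same observations."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of observations")
    n = len(rater_a)
    # Proportion of observations on which the two raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal code frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two observers to four think aloud sessions.
observer_1 = ["issue", "issue", "none", "none"]
observer_2 = ["issue", "none", "none", "none"]
print(cohens_kappa(observer_1, observer_2))
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance, which would suggest the coding scheme or observer training needs attention before conclusions are drawn.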
An example of a process for selecting and conducting think aloud
studies is described in Vignette #1.
Vignette #1
State X has recently conducted an expert review on its fourth grade mathematics test. Reviewers found that most items only had minor formatting issues that they would like to see improved, but that three of the items had major issues pertaining to bias, presentation, and comprehensible language. State X’s assessment director was concerned that these items might cause students with a variety of descriptors to incorrectly answer these items because of design issues, thus reducing the validity of inferences that could be drawn from the test. State X then decided to conduct a think aloud study on the three items in question, as well as three items that generally met the approval of item reviewers.
Overview of Activities
State X opted to conduct the think aloud study with its own staff (alternatively, they may have decided to offer a subcontract to a local university or research organization to conduct the study). The study took place in a quiet room, where State X staff members could videotape the procedures.
Sample
Because State X’s assessment director was concerned about the effects of bias, presentation, and language on students with particular disabilities and English language learners, she targeted these students, as well as students who were deemed “typically achieving, non-disabled, English proficient” students. In total, 50 Grade 4 students were contacted. Among these were: 10 students with learning disabilities, 10 students with mild mental retardation (who took the general education assessment), 10 students who were deaf, 10 students who were English language learners (but did not have a disability), and 10 non-disabled, English proficient students.
Procedures
Each student was then individually brought to the quiet room. First, State X staff members explained the process. Then, students practiced “thinking aloud” by describing everything they do when they tie their shoes (sign language interpreters were present for students who are deaf). Once students understood the process, they were asked to think aloud while they answered mathematics items. The only time State X staff spoke was when students were silent for more than 10 seconds, at which time staff encouraged students to “keep talking.” Each item took approximately 10 minutes per student.
After students completed items, State X staff asked post-hoc questions, simply to clarify any issues they did not understand. Data derived from post-hoc questions are not as authentic as think aloud data, but they can help to clarify issues that were unclear to staff.
Analysis
Once all think aloud activities were completed, State X staff reviewed all the videotapes they had taken. Using NCEO’s think aloud coding sheet, staff were easily able to determine if design issues were problematic for particular populations. The data they collected helped them to make recommendations for Step 4.
Step 4: Revisit Items Based on Information from Steps 2 and 3
Steps 2 and 3 are likely to produce rich data that identify
concerns about particular items or the entire test. Prior to
field testing (Step 5) it is important to analyze the data
produced in Steps 2 and 3 and make any possible changes that can
be made to the test. Some changes may be impossible prior to
field testing, while others (such as formatting changes) may be
quite easy to make. Regardless of whether changes are made to
tests or not, data from Steps 2 and 3 are important sources for
recommendations and cross-analyzing with field test results.
Step 5: Field Test
It is common practice for states to field test potential exam
items well in advance of their actual inclusion in statewide
testing systems. Somewhat less common is taking potential exam
items and transferring them into accommodated formats and then
field testing them for potential differential item functioning.
It is important that test administrators are aware of item
statistics as well as the effects of accommodations on each test
item when making decisions on which items to include on exams.
Step 6: Analyze Field Test Data
A useful method for ensuring Universal Design of assessments
is to conduct large-scale statistical analyses on test item
results. Many methods exist for examining data to detect design
issues related to Universal Design. Approaches range from simple
methods based on classical test theory to more contemporary item
response theory (IRT) techniques with increasing complexity.
Helpful statistical techniques include: Item Ranking, Item Total
Correlation, Differential Item Functioning (DIF) using
Contingency Tables, and DIF using Item Response Theory (IRT)
approaches.
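Three of the classical techniques named above can be sketched in a few lines. The code below is a minimal illustration, assuming items are scored 0/1 and each student is a list of item scores. It ranks items by proportion correct within a group, computes a corrected item-total correlation, and estimates a Mantel-Haenszel common odds ratio from contingency tables stratified by total score. The function names and data layout are assumptions, and the Mantel-Haenszel statistic is one common contingency-table approach rather than necessarily the one NCEO used.

```python
from math import sqrt


def difficulty_ranks(responses):
    """Rank items from easiest (1) to hardest within one group of students."""
    n_items = len(responses[0])
    p = [sum(r[i] for r in responses) / len(responses) for i in range(n_items)]
    order = sorted(range(n_items), key=lambda i: -p[i])  # ties keep item order
    ranks = [0] * n_items
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_total_correlation(responses, item):
    """Correlate one item with the total of the remaining items."""
    item_scores = [r[item] for r in responses]
    rest_totals = [sum(r) - r[item] for r in responses]
    return pearson(item_scores, rest_totals)


def mantel_haenszel_odds_ratio(reference, focal, item):
    """Common odds ratio across total-score strata; values far from 1 suggest DIF."""
    strata = {}
    for group, rows in (("ref", reference), ("focal", focal)):
        for r in rows:
            cell = strata.setdefault(sum(r), {"ref": [0, 0], "focal": [0, 0]})
            cell[group][0 if r[item] else 1] += 1  # [correct, incorrect]
    num = den = 0.0
    for cell in strata.values():
        a, b = cell["ref"]    # reference group: correct, incorrect
        c, d = cell["focal"]  # focal group: correct, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den if den else float("nan")
```

Comparing difficulty_ranks for a reference group and a focal group (for example, students with a particular disability) highlights items whose relative difficulty shifts between groups. Real analyses would also need adequate sample sizes and significance tests, which this sketch omits.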
The analyses listed above will almost certainly produce
disparate results because they are examining slightly different
item functions. In fact, between disability groups and analyses,
it is likely that many items on a test will be flagged at least
once. Such a result does not necessarily mean that an entire
test is flawed.
Rather, a reasoned approach to sorting through large amounts
of data is the rule of halves. If an item is flagged by at least half of
the analysis methods (n ≥ 2 of the four analyses listed above), that
item is a candidate for re-examination.
Furthermore, if data are disaggregated by the disability category of
particular students (e.g., students with learning disabilities,
hearing impairments, etc.), and an item has been flagged across
more than half of those categories, it may also be a candidate
for revision. One can conduct similar examinations on
populations who took tests with specific accommodations (if items
are flagged for half of the accommodations tested, they may have
universal design issues). An NCEO report demonstrating
item review methods is available at
http://www.nceo.info/OnlinePubs/Technical41.htm.
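The rule of halves described above can be expressed as a simple threshold check. The sketch below is illustrative only: the function name, the `strict` switch (for the "more than half" variant applied to disability categories), and the item and analysis labels are assumptions, not part of the NCEO materials.

```python
def rule_of_halves(flags_by_item, n_checks, strict=False):
    """Return items flagged in at least (or, if strict, more than) half of the checks.

    flags_by_item maps an item id to the set of analyses, disability
    categories, or accommodations in which that item was flagged.
    """
    half = n_checks / 2
    candidates = []
    for item, flagged in flags_by_item.items():
        count = len(flagged)
        if (count > half) if strict else (count >= half):
            candidates.append(item)
    return sorted(candidates)


# Hypothetical results from four analysis methods.
analysis_flags = {
    "item_03": {"item_ranking", "dif_contingency"},
    "item_07": {"item_total_correlation"},
    "item_12": set(),
}
print(rule_of_halves(analysis_flags, n_checks=4))               # at least half
print(rule_of_halves(analysis_flags, n_checks=4, strict=True))  # more than half
```

Items returned by the check are candidates for re-examination, not automatic rejections; the flagged analyses themselves indicate where reviewers should look first.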
Step 7: Final Revision
After experts have
reviewed items, students have explained how they approached
items using think aloud methods, and field test results have
been reviewed, states and contractors can discuss the final
revisions that need to be made to tests. It is possible that no
changes at all will be made. On the other hand, the “final
revision” stage is the last time states and contractors can
address design issues before tests are administered with “high
stakes” attached. This stage is one that should be
approached with caution, but in a cooperative spirit that makes
sense for all students as well as the needs of the state’s
finances and timelines.
Step 8: Testing
Step 8 is the culmination of months (or
years) of hard work on the part of both the state and the
contractor. During testing periods in states, students take the
assessments designed by contractors under standard and
accommodated conditions. Results are used for accountability
purposes, and are monitored at both the school and district
levels. Designing a test for accessibility is a challenging
process, and culminates when students take the “live” test.
Step 9: Post-test Review
Once test results are
available, the process starts again. States can examine results
statistically, and begin the expert review and think aloud
processes for the following year’s test. When contractors
develop a test that states deem acceptable for use for more than
one year, the universal design process is streamlined because
many of the potential problems with a test were caught during
the design and field test stages. Universal design processes are
then used as a tool for ongoing item improvement.
It is possible that there
will never be a test that is accessible to all students for all
items. While a perfectly accessible test may not be possible, a
more accessible assessment is possible. Hard work, cooperation,
and following the
steps in this guide can help
the process. In addition, states may develop their own universal
design processes. As universal design research emerges,
processes will become more succinct, efficient, and effective.
States that have made commitments to
accessible assessments for all students will find their efforts
rewarded in better measurement of what their students know and
can do. It is our hope that this manual will help current and
future processes and make the commitment to universal design an
easier one to make.