Note: The Buros Center for Testing plans to update the document below so that it aligns with 2014 Standards for Educational and Psychological Testing.
The Standards for Educational and Psychological Testing (1999) (hereafter called the Standards), established by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, are intended to provide a comprehensive basis for evaluating tests. These standards reflect several important changes from earlier versions, especially in the area of validity. This paper summarizes several key standards applicable to most test evaluation situations.
TEST COVERAGE AND USAGE
There must be a clear statement of recommended uses, the theoretical model or rationale for the content, and a description of the population for which the test is intended.
The principal question to ask when evaluating a test is whether it is appropriate for its intended purposes. The use intended by the test developer must be justified by the publisher on technical or theoretical grounds. Questions to ask:
- What are the intended uses of the test scores? What score interpretations does the publisher feel are appropriate? What limitations or restrictions of interpretations apply?
- Who is the intended population for testing? What is the basis for considering whether the test applies to a particular situation?
- How was content/coverage decided? How were items developed and selected for the final version?
APPROPRIATE SAMPLE FOR TEST VALIDITY EVIDENCE
The sample used to collect evidence of the validity of score interpretation and norming must be of adequate size and must be sufficiently representative to substantiate validity claims, to establish appropriate norms, and to support conclusions regarding the intended use of the scores.
The individuals in the samples used for collecting validity evidence and for norming should represent the population of potential examinees in terms of age, gender, ethnicity, psychopathology, or other dimensions relevant to score interpretation. Questions to ask (a brief representativeness check is sketched after these questions):
- What methods were used to select the samples in test development, validation, and norming? How are the samples related to the intended population? Were participation rates appropriate?
- Was the sample size large enough to develop stable estimates of test statistics with minimal fluctuation due to sampling error? Where statements are made concerning subgroups, are there enough examinees in each subgroup?
- Is the range of scores on the test and any relevant criterion measure sufficient to provide an adequate basis for making the validity claims and for norming? Is there sufficient variation in the test scores to make intended distinctions among examinees (e.g., those with and without symptoms)?
- How representative is the norm group of the intended population in terms of the relevant dimensions (e.g., age, geographical distribution, cultural background, and disabilities)?
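As a companion to the representativeness questions above, the sketch below shows one simple way to compare a norm sample's composition with the intended population on a single dimension. All category labels, counts, and proportions are invented for the example, and the chi-square goodness-of-fit comparison (using numpy and scipy) is only a first screen; it does not by itself establish that the norms are adequate.

    import numpy as np
    from scipy.stats import chisquare

    # Hypothetical norm-sample counts by age band and hypothetical population
    # proportions (e.g., from census data).  All values are invented.
    age_bands = ["18-29", "30-44", "45-59", "60+"]
    norm_counts = np.array([420, 310, 180, 90])
    population_props = np.array([0.28, 0.30, 0.24, 0.18])

    # Expected counts if the norm sample mirrored the population exactly.
    expected = population_props * norm_counts.sum()
    stat, p = chisquare(f_obs=norm_counts, f_exp=expected)

    for band, obs, exp in zip(age_bands, norm_counts, expected):
        print(f"{band:>6}: observed {int(obs):4d}, expected {exp:6.1f}")
    print(f"chi-square = {stat:.1f}, p = {p:.4f}")
    # A small p-value suggests the sample departs from the population
    # distribution; it says nothing about other dimensions of representativeness.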
VALIDITY
Test scores should yield valid and reliable interpretations. All sources of validity evidence should support the intended interpretations and uses of the test scores.
The current Standards suggest there is only one dimension of validity and that is construct validity. A variety of methods may be used to support validity arguments related to the intended use and interpretation of test scores. Such evidence may be obtained by systematically examining the content of the test, by considering how the test scores relate to other measures, or by examining whether the relationships between test scores and other variables are consistent with the theoretical predictions on which test development was based. Invalidity arises to the extent that scores reflect systematic, rather than random, error; random error is instead the source of unreliability, which also undermines valid interpretation. Thus, a test may produce scores that are reliable (little or no random error) but invalid because of some systematic error, while scores that are unreliable may not be validly interpreted.
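The distinction between systematic and random error can be made concrete with a small simulation. The sketch below is illustrative only; the numbers are made up, it relies on the numpy library, and it is not an analysis prescribed by the Standards. Two administrations share the same systematic error (construct-irrelevant variance), so they correlate highly with each other (high reliability) while correlating only modestly with the construct the test is meant to measure (weak validity).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    construct = rng.normal(50, 10, n)   # the trait the test is meant to measure
    nuisance = rng.normal(0, 10, n)     # construct-irrelevant influence (systematic error)

    def observed_scores():
        # Each administration shares the same systematic component but gets
        # fresh random error; the random error is small relative to the rest.
        return 0.4 * construct + 0.6 * nuisance + rng.normal(0, 3, n)

    form_a = observed_scores()
    form_b = observed_scores()

    consistency = np.corrcoef(form_a, form_b)[0, 1]            # reliability-like coefficient
    relation_to_construct = np.corrcoef(form_a, construct)[0, 1]

    print(f"consistency across forms/occasions: {consistency:.2f}")            # high
    print(f"correlation with the construct:     {relation_to_construct:.2f}")  # much lower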
Content-Related Evidence
Evidence that the content of the test is consistent with the intended interpretation of test scores (e.g., a test to be used to assist in job selection should have content that is clearly job-related) may be collected in a variety of ways. One method often used to examine the evidence of score validity for achievement tests is having content experts compare the test content to the test specifications. Content consistency may also be demonstrated by such analyses as a factor analysis (either exploratory or confirmatory, or both); a minimal illustration of this kind of structural check follows the questions below. Questions to ask include:
- Did the test development process follow a rational approach that ensures that the content matches the test specifications?
- Did the test development process ensure that the test items would represent appropriate knowledge, behaviors, and skills?
- Is there a clear statement of the universe of knowledge, behaviors, and skills represented by the test? What research was conducted to determine desired test content and/or evaluate content?
- What was the composition of expert panels used in evaluating the match between the test content and the test specifications? How were the experts' judgments elicited?
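The illustration mentioned above is given here. It uses simulated item responses and only the numpy library to examine one aspect of internal structure: the eigenvalues of the inter-item correlation matrix, which give a rough indication of how many dimensions the items span. A full exploratory or confirmatory factor analysis, as mentioned in the text, would go further; this is only a first-pass sketch under invented conditions.

    import numpy as np

    rng = np.random.default_rng(1)
    n_examinees, n_items = 500, 8

    # Simulated item scores that all tap a single underlying trait plus noise.
    trait = rng.normal(0, 1, (n_examinees, 1))
    loadings = rng.uniform(0.5, 0.8, (1, n_items))
    responses = trait @ loadings + rng.normal(0, 0.7, (n_examinees, n_items))

    # Inter-item correlation matrix and its eigenvalues.
    corr = np.corrcoef(responses, rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

    print("eigenvalues:", np.round(eigenvalues, 2))
    # One eigenvalue far larger than the rest is consistent with a single
    # dominant dimension; a blueprint-based expert review and a confirmatory
    # model would still be needed to support specific content claims.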
Relationship to Other Tests
Some tests are designed to assist in making decisions about an examinee's future performance. Validity evidence for such tests might include the content-related information described above and, in addition, statistical information regarding the correlation (or another appropriate statistical relationship) between the test used as a predictor and a relevant criterion variable; a brief illustration follows the questions below. Questions to ask include:
- What criterion measure has been used to provide validity evidence? What is the rationale for using this criterion measure? Is the psychometric quality of the criterion measure adequate?
- Is the distribution of scores on both the criterion measure and the test in question adequate to minimize the statistical problems associated with restriction in the range of scores?
- What is the overall predictive accuracy of the test? How accurate are predictions for examinees who score near the cut point(s)?
- Has differential prediction (using separate regression equations for each group, criterion, or treatment) been considered?
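The sketch below illustrates, with simulated data and the numpy library, both a criterion-related validity coefficient and the range-restriction problem raised in the questions above: when the correlation is computed only among high-scoring (e.g., already selected) examinees, it understates the relationship that holds in the full group. All numbers are invented for the example.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5000

    # Simulated predictor (test) scores and a criterion correlated about .50
    # with the predictor in the full group.
    predictor = rng.normal(100, 15, n)
    criterion = 0.5 * (predictor - 100) / 15 + rng.normal(0, np.sqrt(1 - 0.5**2), n)

    full_r = np.corrcoef(predictor, criterion)[0, 1]

    # Restrict the range: keep only examinees above the 70th percentile,
    # as if the criterion were observed only for those already selected.
    selected = predictor > np.percentile(predictor, 70)
    restricted_r = np.corrcoef(predictor[selected], criterion[selected])[0, 1]

    print(f"validity coefficient, full range:       {full_r:.2f}")
    print(f"validity coefficient, restricted range: {restricted_r:.2f}  (attenuated)")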
Other Forms of Validity Evidence
There are many other forms of evidence that may be provided to assist the test user in making a decision about the appropriateness of the test for the user's purposes. A few such forms of evidence are noted and additional questions proposed. A multi-trait, multi-method matrix that examines the relationship between the scores from the test being reviewed and scores from other tests that are both similar in intent and dissimilar in measurement strategy may provide evidence of the nature of score interpretations. Similarly, a test that is designed to assess a particular construct (e.g., creativity) may be shown to produce scores that correlate with scores from measures of other related constructs (e.g., artistic ability). Other strategies may include experimental studies that could be used to aid in theoretical confirmation of the use of the scores. For example, a test designed to aid in diagnosing a psychological problem may provide evidence of validity by showing that individuals who are known to have the targeted pathology obtain predictably higher (or lower) scores than individuals who are known not to have it; a brief sketch of such a known-groups comparison follows the questions below. Many other methods that may be used to provide validity evidence are described in the Standards. Questions to ask include:
- Is the conceptual framework for each tested construct clear and well founded? What is the basis for concluding that the construct is related to the purposes of the test?
- Does the framework provide a basis for testing hypotheses supported by empirical data?
- In a multi-trait, multi-method matrix correlational study, are the different tests really different and are the different methods really different?
- Are experimental studies well designed and are the conclusions consistent with the findings of the study?
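One of the strategies described above, the known-groups comparison, can be sketched briefly. The example below uses simulated scores and the numpy and scipy libraries; the group means, standard deviations, and sample sizes are hypothetical. A large, theory-consistent difference between the groups is one strand of validity evidence, not proof that the scores are valid for individual diagnosis.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(3)
    clinical = rng.normal(65, 10, 120)     # simulated scores: group with the targeted condition
    comparison = rng.normal(50, 10, 150)   # simulated scores: group without it

    t, p = ttest_ind(clinical, comparison, equal_var=False)
    pooled_sd = np.sqrt((clinical.var(ddof=1) + comparison.var(ddof=1)) / 2)
    cohens_d = (clinical.mean() - comparison.mean()) / pooled_sd

    print(f"t = {t:.2f}, p = {p:.2g}, Cohen's d = {cohens_d:.2f}")
    # A large, theory-consistent difference supports the intended interpretation;
    # it does not by itself justify using the scores to diagnose individuals.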
RELIABILITY
The test scores are sufficiently reliable to permit stable estimates of the construct being measured.
Fundamental to the evaluation of any instrument is the degree to which test scores are free from (random) measurement error and are consistent from one occasion to another. Sources of measurement error fall into three broad categories: factors intrinsic to the test, factors intrinsic to the examinee, and factors extrinsic to both the test and the examinee. Test factors include such things as unclear instructions, ambiguous questions, and too few questions to cover the domain of the construct of interest. Factors intrinsic to the individual may include the examinee's health at the time of testing, fatigue, nervousness, and willingness to take risks. Extrinsic factors may include such things as extraneous noise or other distractions and mis-entering a response choice onto an answer sheet. These illustrations are not intended to be an exhaustive list of factors that influence the reliability of examinee scores on a test.
Different types of reliability estimates should be used to estimate the contribution of different sources of measurement error. Tests whose scores depend on raters must provide evidence that raters interpret the scoring guide in essentially the same way. Tests that may be administered on multiple occasions to examine change must provide evidence of the stability of scores over time. Tests that have more than one form that can be used to make the same decision must demonstrate comparability (in content and statistical characteristics) across the forms. Almost all measures, except those that are speeded, should provide an estimate of internal consistency reliability as an estimate of the error associated with content sampling; a minimal internal consistency calculation is sketched after the questions below. Questions to ask:
- How have reliability estimates been computed? Have appropriate statistical methods been employed (e.g., internal consistency estimates computed for speeded tests will result in artificially high estimates)?
- If test-retest reliability has been computed, are time intervals between testing occasions reported? Are these time intervals appropriate given the stability of the trait being tested?
- What are the reliabilities of the test for different groups of examinees? How were they computed?
- Is the reliability estimate sufficiently high to warrant using the test as a basis for making decisions about individual examinees?
- Are reliability estimates provided for all scores for which interpretations are indicated?
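The internal consistency estimate mentioned above is sketched here as coefficient (Cronbach's) alpha, computed from a simulated examinee-by-item score matrix using the numpy library. The formula is the standard one; the data are invented, and, as the questions note, such an estimate would be inappropriate for a speeded test.

    import numpy as np

    def cronbach_alpha(item_scores):
        """Coefficient alpha for an examinees-by-items matrix of item scores."""
        k = item_scores.shape[1]
        item_variances = item_scores.var(axis=0, ddof=1)
        total_variance = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

    # Simulated data: 400 examinees, 10 items of moderate quality.
    rng = np.random.default_rng(4)
    trait = rng.normal(0, 1, (400, 1))
    items = trait + rng.normal(0, 1, (400, 10))

    print(f"coefficient alpha: {cronbach_alpha(items):.2f}")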
TEST ADMINISTRATION
Detailed and clear instructions outline appropriate test administration procedures.
Statements concerning test validity and the accuracy of the norms can only generalize to testing situations that replicate the conditions used to establish validity and obtain normative data. Test administrators need detailed and clear instructions to replicate these conditions.
All test administration specifications, including instructions to test takers, time limits, use of reference materials and calculators, lighting, equipment, seating, monitoring, room requirements, testing sequence, and time of day, should be fully described. Questions to ask:
- Will test administrators understand precisely what is expected of them? Do they need to meet any specified qualifications?
- Do the test administration procedures replicate the conditions under which the test was validated and normed? Are these procedures standardized?
TEST REPORTING
The methods used to report test results, including scaled scores, subtest results, and combined test results, are described fully along with the rationale for each method.
Test results should be presented in a manner that will help users (e.g., teachers, clinicians, and employers) make decisions that are consistent with appropriate uses of test results; a brief illustration of commonly reported scores follows the questions below. Questions to ask:
- How are test results reported? Are the scales used in reporting results conducive to proper test use?
- What materials and resources are available to aid in interpreting test results and are these materials clear and complete?
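The sketch below shows, with invented numbers and the numpy library, how raw scores might be expressed as percentile ranks and T-scores (mean 50, standard deviation 10) relative to a norm group. In practice such conversions come from the publisher's norm tables and documented scaling procedures, not from ad hoc code like this.

    import numpy as np

    rng = np.random.default_rng(5)
    norm_group_raw = rng.normal(30, 6, 2000)      # hypothetical norm-group raw scores
    examinee_raw = np.array([22.0, 30.0, 41.0])   # three examinees to report on

    percentile_rank = np.array(
        [np.mean(norm_group_raw <= x) * 100 for x in examinee_raw]
    )
    t_score = 50 + 10 * (examinee_raw - norm_group_raw.mean()) / norm_group_raw.std(ddof=1)

    for raw, pr, t in zip(examinee_raw, percentile_rank, t_score):
        print(f"raw {raw:5.1f} -> percentile rank {pr:5.1f}, T-score {t:5.1f}")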
TEST AND ITEM BIAS
The test is not biased or offensive with regard to race, sex, native language, ethnic origin, geographic region, or other factors.
Test developers are expected to exhibit sensitivity to the demographic characteristics of test takers. Steps can be taken during test development, validation, standardization, and documentation to minimize the influence of cultural dependency, such as using statistics to identify differential item difficulty and examining the comparative accuracy of predictions for different groups; a simplified item-level check is sketched after the questions below. Some traits are manifested differently in different cultural groups. Explicit statements of when it is inappropriate to use a test for particular groups must be provided. Questions to ask:
- Were the items analyzed statistically for possible bias? What method(s) was/were used? How were the items selected for inclusion in the final version of the test?
- Was the test analyzed for differential validity across groups? How was this analysis conducted?
- Was the test analyzed to determine the English language proficiency required of test takers? Should the test be used with non-native speakers of English?
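As noted above, statistical methods can be used to identify differential item difficulty. The sketch below, using simulated data and the numpy library, computes a simplified Mantel-Haenszel common odds ratio for a single item after matching examinees on a total-score proxy; values far from 1 flag the item for review. Operational differential item functioning (DIF) analyses are considerably more elaborate, and all data here are invented.

    import numpy as np

    def mantel_haenszel_odds_ratio(item, total, group):
        """item: 1 = correct, 0 = incorrect; total: matching score;
        group: 0 = reference group, 1 = focal group."""
        numerator, denominator = 0.0, 0.0
        for score in np.unique(total):
            stratum = total == score
            ref = stratum & (group == 0)
            foc = stratum & (group == 1)
            a = item[ref].sum()          # reference correct
            b = ref.sum() - a            # reference incorrect
            c = item[foc].sum()          # focal correct
            d = foc.sum() - c            # focal incorrect
            n = stratum.sum()
            if n > 0:
                numerator += a * d / n
                denominator += b * c / n
        return numerator / denominator if denominator > 0 else float("nan")

    # Simulated data with DIF built in: the item is harder for the focal group
    # at the same ability level.
    rng = np.random.default_rng(6)
    n = 2000
    group = rng.integers(0, 2, n)
    ability = rng.normal(0, 1, n)
    total = np.clip(np.round(ability * 5 + 20), 0, 40).astype(int)
    p_correct = 1 / (1 + np.exp(-(ability - 0.5 * group)))
    item = (rng.random(n) < p_correct).astype(int)

    print(f"MH common odds ratio: {mantel_haenszel_odds_ratio(item, total, group):.2f}")
    # Values near 1 suggest little DIF; here the ratio should be well above 1,
    # flagging the item as favoring the reference group after matching.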