Using a Mental Measurements Yearbook Review to Evaluate a Test

Anthony J. Nitko
Professor Emeritus of Psychology in Education, University of Pittsburgh
Adjunct Professor of Educational Psychology, University of Arizona
 

Introduction

Once you have located a test, you will want to read its Mental Measurements Yearbook (MMY) review. You need to use the review to make judgments about the quality of the test. This short lesson will help you get started in understanding how to use the MMY review to make these judgments. It discusses how the review is organized and how to use each part of the review to judge the usefulness of the test to you. Each part of the MMY review that is discussed in this lesson is illustrated by an extract from an actual review of the SAQ-Adult Probation III assessment.

The Meaning of the Parts of the MMY Review

Most recent MMY reviews are composed of the following parts:

  • Test Entry

  • Description

  • Development

  • Technical

  • Commentary

  • Summary

The different parts of the review give you different information about the test. The Test Entry part is prepared by the staff and editors of the Mental Measurements Yearbook. External reviewers of the test write the other parts. The MMY editors try to have two reviewers for each test. In the following sections we describe each of these parts and how to use information in them to evaluate your test.

Test Entry

This part of the review is first. It summarizes certain facts about the test, taken from the test manual or other materials provided by the publisher:

  • Title of the test

  • Purpose

  • Population for which the test is intended

  • Publication dates

  • Acronym used to identify the test

  • Scores the test provides

  • Whether the test is an individual or group test

  • Forms, parts, and levels the test provides

  • Whether the test has a manual

  • Whether there is restricted test distribution

  • Price data

  • Foreign language and other special editions

  • Time allowed examinees and total administration time

  • Comments about the test

  • Test author(s)

  • Test publisher

  • Foreign adaptations of the test (if any)

  • Sublistings (if any) of parts of the test that are separately available

  • Cross references to previous MMY reviews

Example. Here is an example of a test entry:

SAQ-Adult Probation III.

Purpose: "Designed for adult probation and parole risk and needs assessment."
Population: Adult probationers and parolees.
Publication Dates: 1985-1997.
Acronym: SAQ.
Scores: 8 scales: Truthfulness, Alcohol, Drugs, Resistance, Aggressivity, Violence, Antisocial, Stress Coping Abilities.
Administration: Group.
Price Data: Available from publisher.
Time: (30) minutes.
Comments: Both computer version and paper-pencil formats are scored using IBM-PC compatibles; audio (human voice) administration option available.
Author: Risk & Needs Assessment, Inc.
Publisher: Risk & Needs Assessment, Inc.
Cross Reference: For a review by Tony Toneatto, see 12:338.

Interpretation.

You can see that this information is important for your practical evaluation of the test. For example, if you want an individually administered test, then this test is not for you because it is a group test. If you want a test that gives scores for several areas, this test may be useful. If you want a test to use with teens, this test is not for you because it is specifically intended for adults.

Summary.

The Test Entry information gives you a quick overview of a test, allowing you to quickly eliminate or retain the test in your list of possibilities based simply on practical considerations. However, information in the MMY review's Test Entry section does not evaluate the content and quality of the test. For those types of evaluations you need to read the other parts of the review. We discuss these parts in the following sections of this lesson.

Description

The Description is written by an external reviewer, not by the MMY staff itself. The reviewer gives a general description of the instrument, explaining its purposes, identifying its target population, and describing its intended uses. In addition, the reviewer summarizes information about administering the instrument and about the scores and scoring procedures. The reviewer expands on what is listed in the Test Entry to describe what the content of the instrument is like and generally how the publisher intends the instrument to be administered and used. A reviewer may also compare the current edition of the instrument with previous editions.

Example. Here is an example of a reviewer's statement in the Description section of the review for the SAQ-Adult Probation III that we are using as our example for this lesson:

DESCRIPTION.

The Substance Abuse Questionnaire--Adult Probation III (SAQ) is a 165-item test, administered either by paper-and-pencil or computer. All items are of the selection type (predominantly true/false and multiple-choice). Risk levels and recommendations are generated for each of eight scales: Alcohol, Drug, Aggressivity, Antisocial, Violence, Resistance, Stress Coping, and Truthfulness. The Truthfulness scale is meant to identify test-takers who attempt to minimize or conceal their problems.

Nonclinical staff can administer, score, and interpret the SAQ. Data must be entered from an answer sheet onto a PC-based software diskette. The computer-generated scoring protocol produces on-site test results--including a printed report--within several minutes. For each of the eight scales, the report supplies a percentile rank score, a risk categorization, an explanation of the risk level, and (for most scales) a recommendation regarding treatment or supervision. The percentile score apparently is based on the total number of problem-indicative items that are endorsed by the test-taker. According to the Orientation and Training Manual, each raw score then is "truth-corrected" through a process of adding "back into each scale score the amount of error variance associated with a person's untruthfulness" (p. 8). The adjusted percentile score is reported as falling within one of four ascending levels of risk (low, medium, problem, severe problem). The responsible staff person is expected to use information from the report, along with professional judgment, to identify the severity of risk and needs and to develop recommendations for intervention.

Interpretation.

After reading the Description, you have more information about a test. For example, you can see that the SAQ-Adult Probation III items are multiple-choice or true-false. If you were looking for a test that has a constructed response format, this would not be the instrument for you to use. You can see, too, that this instrument can be administered and interpreted by nonclinical staff members. You can see that the scores are interpreted through the use of normative data (percentile ranks) rather than through absolute meaning of the items in the instrument.

Summary. The Description section gives you a brief overview of how the test is intended to be used and the types of scores and reports the publisher provides.

Development

In this part of the MMY review, a reviewer discusses how the instrument was developed, what underlying assumptions or theory guided the publisher's decisions about how to define the construct the instrument is supposed to measure, and details on item development. Results of pilot testing the instrument would be discussed in this section. In addition, a reviewer might comment on any steps that were undertaken in the selection of the final set of items for the test and give evaluations of the appropriateness of these items for measuring the construct(s) of interest.

A test developer has the obligation to provide evidence that the test was developed in a technically sound way. This means that the items should be selected using both sound theoretical reasoning and sound empirical information about how well the items function. Sound test development improves the validity of the resulting scores, helping you interpret and use the scores in the way the publisher recommends. You should be wary of using a test when the test publisher has not demonstrated the proper theoretical and scientific basis for its development.

Example. Here is an example of how an MMY reviewer has evaluated the development of the SAQ-Adult Probation III that we are using as our example:

DEVELOPMENT.

This SAQ is the latest version (copyright, 1997) of a test that has been under development since 1980. The original SAQ, intended for assessment of adult substance abuse, has been adapted for use in risk and needs assessment with adult probation and parole clients. Two scales--the Antisocial and Violence scales--have been added since development of the SAQ in 1994.

Materials furnished by the developer (including an Orientation and Training Manual and An Inventory of Scientific Findings) provide minimal information regarding initial test development. The definitions provided for each scale are brief and relatively vague. The constructs underlying several scales appear to overlap (e.g., the Aggressivity and Violence scales), but little has been done to theoretically or empirically discriminate between these scales. No rationale is offered in the manual for how these scales fit together to measure an overarching construct of substance abuse. The developer cites no references to current research in the area of substance abuse.

Interpretation. You can see that the test reviewer studied the publisher's manual and other materials to try to determine whether this test was theoretically and scientifically developed. In this example, the publisher rather vaguely described the theory behind this instrument. The publisher provided no information about whether some of the scales measure distinctly different constructs or whether they really measure the same construct under a different name. If you were planning to use this instrument to develop a profile of different dimensions of substance abuse for a client, then you would want some evidence from the publisher that the different scores in your abuse profile have some distinct meaning.

If you made the interpretation, for example, that the client was aggressive rather than violent, you might be making an error: According to the reviewer, the publisher provides little information to support your interpretation that these are two distinctly different attributes of a client. The reviewer is saying that the publisher does not provide enough evidence to make a substance abuse profile using the scale scores. The reviewer is saying also that it is not clear how all the scales in the instrument actually work together to give you a sound overall substance abuse score.

Summary. The Development section of an MMY review evaluates how well the publisher used appropriate theoretical reasoning and technical scientific procedures to craft the instrument.

Technical

The Technical section covers three main points: standardization, reliability, and validity. In the standardization part, information about the norm sample is described and evaluated, including how well the sample used by the publisher matches or represents the intended population. The appropriateness of the norms for different gender or ethnic/cultural groups may also be discussed.

In the reliability part, evidence for score consistency is discussed. The types of reliability estimates and their magnitudes are presented only in a summary fashion. (You will need to read the publisher's technical manual for details.) The reviewer comments on the acceptability of the levels of reliability, the sample used for determining these estimates, and other related issues.

In the validity part, the reviewer addresses interpretations and potential uses of test results. The reviewer summarizes validity studies and evaluates the test content and the adequacy of the test for measuring the intended construct. If the publisher intends the test to be used to make classifications or predictions, the reviewer summarizes the evidence in this section. In addition, the reviewer evaluates the differential validity of the test across gender, racial, ethnic, and cultural groups (including differential item functioning). The reviewer describes the acceptability of the evidence presented by the test publisher to support test interpretation and use.

You should note that, consistent with current measurement standards, a test is not deemed "valid" in and of itself. Rather, it is the uses of the test results that should be shown to be valid. This includes how well test results meet the publisher's intended purposes of the test.

Example. The extract below shows how the reviewer evaluated the technical aspects of the SAQ-Adult Probation III:

TECHNICAL.

Information describing the norming process is vague. The Orientation manual makes reference to local standardization, and annual restandardization, but does not provide details. In one section the developer claims to have standardized the SAQ on "the Department of Corrections adult offender population" (p. 7). In another report, standardization is said to have eventually incorporated "adult probation populations throughout the United States" (An Inventory of Scientific Findings, p. 5). One might assume, based on the citing of SAQ research studies involving literally thousands of probationers that the recency and relevance of norms is beyond question. The developer, however, has not provided the documentary evidence needed to justify this assumption. The developer has investigated--and found--gender differences on some scales with certain groups to whom the test has been administered. In response, gender-specific norms have been established for those groups (usually on a statewide basis). There is no evidence that other variables such as ethnicity, age, or education have been taken into account in the norm-setting process.

The items selected for use in the test have several commonalities. Most items focus on personal behaviors, perceptions, thoughts, and attitudes and are linked in a direct and very obvious way to the content of associated scales (e.g., "I am concerned about my drinking," from the Alcohol scale). Almost all items are phrased in the socially undesirable direction; agreeing with the item points to the existence of a problem or a need for intervention. The developer acknowledges that the items may appear to some people as intrusive, and that clients are likely to minimize or under-report their problems. In the SAQ, the response to this concern has been the inclusion of the Truthfulness scale and calculation of "truth-corrected" scale scores. Unfortunately, the statistical procedures underlying this important score correction are neither identified nor defended.

Internal consistency for the individual subscales of the SAQ has been well-established by a large number of developer-conducted studies that report Cronbach alpha estimates generally in the .80s to .90s. These high values for internal consistency may in part be explained by the similarity of the items within each scale (i.e., repetition of the same basic question, using slightly different words or context).

Evidence of other reliability estimates (other than for internal consistency) to support this instrument generally is lacking. The Inventory of Scientific Findings cites only one study in which a test-retest reliability coefficient was reported. Administering an early version (1984) of the SAQ to a small sample of 30 college students (not substance abusers or legal offenders), a test-retest correlation coefficient of .71 was found across an interval of one week.

Evidence to support the validity of the SAQ is limited. Some concurrent validity evidence is presented, in the form of multiple studies showing modest correlations between some SAQ scales and subscales of the Minnesota Multiphasic Personality Inventory (MMPI). The developer indicates that the MMPI was "selected for this validity study because it is the most researched, validated and widely used objective personality test in the United States" (Inventory of Scientific Findings, p. 14). This explanation, however, does not suffice as a rationale for use of the MMPI to support concurrent validity; and no theoretical framework is provided about how the SAQ subscales relate to the personality constructs underlying the MMPI.

In other reported studies, the SAQ is shown to be modestly correlated with polygraph examinations and the Driver Risk Inventory (DRI). Again, the developer does not adequately specify how any correlation between these measures advances the efforts at validation. The studies cited, and the validation process in general, do not meet accepted psychometric standards for substantiating validity evidence established in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999). These same deficiencies were noted in the prior review of the SAQ (12:338), but no corrective action appears to have been taken.

Interpretation. The reviewer comments on the adequacy of the norms. Norm-referencing is the main way the publisher intends the scores to be interpreted, yet the reviewer points out that the publisher's description of the norming process is vague, so the meaning of the norm-referenced scores is not clear. Further, the reviewer notes that there were gender differences in the norms, something the publisher has tried to address on a state-by-state basis. Because you would be interpreting the scores using the publisher-provided norms, the reviewer's comments should be a warning to you to be very cautious.

The reviewer also points out that the way the items are phrased may lead to persons underreporting their substance abuse on this scale. The publisher tries to compensate for this by creating "truth-corrected" scale scores, but the reviewer points out that the publisher has not explained or defended the validity of doing this. This means that from the reviewer's perspective there is not enough acceptable evidence supplied by the publisher to support the valid use of the scores on the SAQ-III for its claimed purpose.

The next question addressed is the reliability of the test scores. Internal consistency reliability tells you how consistently clients respond from one item to the next within each of the SAQ scales. The reviewer points out that these are well-documented and generally high, in the .80s and .90s. However, internal consistency reliability does not tell you whether the clients' scores on the scales are consistent from one day to the next. Test-retest reliability coefficients tell you that. But the reviewer points out that this type of reliability information is very limited. Without these reliability estimates, you do not have sufficient evidence that your clients' scores on the SAQ III will be the same over any reasonable period of time. This raises the question of how you can diagnose a client's abuse problems or plan a treatment program when you have little evidence of whether the client's scores indicate a stable substance abuse pattern.
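To make the distinction concrete, here is a minimal sketch in Python, using invented toy scores (not SAQ data), of how a test-retest reliability coefficient is computed: it is simply the Pearson correlation between the same examinees' scores on two testing occasions.

```python
# Illustrative only: toy raw scores for six hypothetical examinees,
# tested on two occasions one week apart (invented data, not from the SAQ).
occasion_1 = [12, 18, 25, 9, 30, 22]
occasion_2 = [14, 17, 24, 11, 28, 21]

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# The test-retest reliability estimate is this correlation: values near 1.0
# mean examinees keep roughly the same rank order from one occasion to the next.
r = pearson_r(occasion_1, occasion_2)
```

A high internal consistency coefficient says nothing about this kind of stability, which is why the reviewer treats the single reported test-retest study (r = .71 on 30 college students) as insufficient.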

The reviewer then turns to questions about the validity of the SAQ III scores. The publisher provides concurrent validity studies, showing correlations of the SAQ III scores with the MMPI (see the review). In this evaluation, the reviewer points out that there is no rationale or theoretical reason given for computing these correlations. You are not given any information about why those studies would support your interpretation or use of the SAQ III as a measure of substance abuse. The reviewer makes similar comments about the other correlational studies the publisher reports. Thus, the reviewer concludes that the validation studies of the SAQ III do not meet acceptable professional standards for validity evidence. This means that you have no acceptable evidence from the publisher for using the scores on the SAQ III. It does not mean the instrument is invalid; it means the evidence that it is valid is incomplete. You run the risk of using a test that may not be valid if you select this instrument for your agency.

Summary. The Technical section of the MMY review focuses on three main points: norms, reliability, and validity. The instrument reviewer evaluates the documents and statistical data a test publisher provides. This evaluation provides you with the reviewer's professional judgment of whether the test publisher's technical information meets the professional standards set forth by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education in the Standards for Educational and Psychological Testing. If a test publisher provides theoretical and empirical research to demonstrate the technical adequacy of the instrument, you may feel more confident about using the test. As often happens, however, publishers do not provide this information, leaving you with no external support for the valid use of the test.

Commentary

In this section, the reviewer addresses the overall strengths and weaknesses of the test. The reviewer summarizes the adequacy of the theoretical model supporting test use and the impact of current research on the test's assumptions.

Example. Here is an example of what is included in this section. Again, we use the SAQ III review.

COMMENTARY.

The value of the SAQ as a measure of substance abuse severity with criminal justice populations seems to be compromised on a number of levels. First, the test lacks a clear focus. Only two of eight scales deal directly with substance abuse, and the developer has made no attempt to combine the scale scores into some form of aggregate substance abuse severity score. Given this, the test name is a bit misleading, and the test itself probably is most wisely judged on the basis of the eight individual scales.

Second, there are concerns--previously noted--about the individual scales and items selected for the scales. Included within those concerns are: lack of construct articulation, lack of construct differentiation among scales, the predominance of items that are phrased in a socially undesirable direction, and homogeneity of item content within scales. Item phrasing and the bluntness of the items (e.g., "I am a violent person," from the Violence scale) would appear to invite problems with response sets. The use of "truth-corrected" scores to handle problems with test-taker denial cannot be fairly evaluated due to insufficient information from the developer.

Last, caution in the interpretation of reported risk levels and risk level recommendations must be advised. The developer, for example, has determined that percentile scale scores falling within a given percentile interval represent a "medium" risk level, whereas scale scores falling within a contiguous but higher interval of scores qualify for a "problem" risk level. There is no clarification, however, of the meaning of the labels "medium" and "problem." Further, there are no statements regarding how the two risk levels are to be discriminated from one another, and no identification of outcomes (or probabilities of outcomes) that are tied to the levels. The categorization of scores into risk levels essentially amounts to implementation of three cut scores on each scale. Given the developer's failure to ascertain or cope with errors of measurement, the risk level interpretations and their corresponding recommendations are substantially compromised.

Interpretation. The reviewer comments on three issues regarding this instrument. First, the reviewer sees the instrument as lacking focus: only two of its eight scales directly measure substance abuse. If substance abuse is your main concern, you may not need the other scales.

Second, the reviewer sees the scales as poorly built. The reviewer sees them as lacking a sound theoretical and technical basis for their construction. This is a caution to you that the instrument is poorly developed and is likely to have low validity for the identification of specific substance abuse and the determination of the degree or severity of a client's abuse.

Finally, the reviewer questions the wisdom of the publisher's recommendation for how to determine the risk level of substance abuse by a particular client. The publisher had failed to document the basis for the recommendations for risk and seems to have not taken possible errors of measurement into account.
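The reviewer's point is that categorizing scores into risk levels amounts to applying cut scores to each scale. A minimal sketch of that idea follows; the percentile cutoffs below are hypothetical, invented for illustration, since the SAQ materials do not publish or justify the actual boundaries.

```python
# Hypothetical cut scores -- invented for illustration only. The SAQ's real
# boundaries between risk levels are not documented by the developer.
RISK_CUTS = [
    (39, "low"),
    (69, "medium"),
    (89, "problem"),
    (100, "severe problem"),
]

def risk_level(adjusted_percentile):
    """Map an adjusted percentile score to one of four ascending risk bands."""
    for upper_bound, label in RISK_CUTS:
        if adjusted_percentile <= upper_bound:
            return label
    raise ValueError("percentile must be between 0 and 100")

# Scores of 69 and 70 fall in different bands even though the one-point
# difference is almost certainly within measurement error -- which is the
# reviewer's objection to unexamined cut scores.
```

Without estimates of measurement error around these boundaries, a client near a cutoff could plausibly belong to either adjacent category, which is why the reviewer calls the risk-level interpretations "substantially compromised."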

Taken together, the reviewer's commentary raises serious questions about using this instrument for evaluating a client's level of substance abuse. It would seem wise for you to continue your search for other instruments at this point.

Summary. The Commentary section of the MMY review is an opportunity for the reviewer to speak directly to the reader about the strengths and possible cautions in using the instrument under evaluation. You should consider this commentary as the reviewer's evaluation of the most important issues facing the instrument's users. Taken together with the Development and Technical sections, you should have a good idea of the quality of the instrument and some of the concerns about it that you must face when deciding whether to use it.

Summary

This section consists of roughly six or seven concise sentences in which the reviewer provides conclusions and recommendations about the quality of the test, stated as explicitly as possible. If another test is more appropriate, the reviewer may name, cite, and reference that test.

Example. Here is the Summary section from the SAQ-Adult Probation III instrument review:

SUMMARY.

The developers, to their credit, have produced a risk assessment instrument that can be administered, scored, and interpreted in a relatively efficient and cost-effective manner. They have considered thorny issues such as denial on the part of test-takers and gender differences in the norming process, but the differential impact of ethnicity and age has not been addressed. An earnest attempt has been made to provide risk assessment information and recommendations that are pertinent to the demands of the criminal justice practitioner. On balance, however, the SAQ falls far short of the mark. Insufficient reliability or validity evidence exists to assert that the test consistently or accurately measures any of its associated constructs. There is continued doubt, in the words of the prior reviewer of the SAQ, that the test "conveys any useful information additional to simply asking the client if they have an alcohol-drug problem, if they are violent, and how they cope with stress" (Toneatto, 1995, p. 891). Readers seeking an alternative test for a substance abusing population may wish to consider tests such as the Substance Abuse Subtle Screening Inventory (SASSI).

Interpretation. The reviewer summarizes the quality of the instrument concisely. You can see that the reviewer does not recommend using the instrument because the publisher has not provided sufficient reliability and validity evidence to justify its use. The reviewer suggests that you consider another instrument, the Substance Abuse Subtle Screening Inventory (SASSI), instead.

Summary. The Summary section of the MMY review gives a concise summary of the reviewer's evaluation of the instrument.

Conclusion


In this lesson we discussed each part of the MMY review and the kinds of information contained in them. We illustrated the explanation with an example from an actual MMY review. We explained how you should interpret the information in each part of the review and how to use it to help you evaluate a test.

When you evaluate a test, you should use the MMY review along with other information you have about the agency or school in which the test will be used. How to integrate the MMY review into this broader context is explained in another lesson entitled, "Using an MMY Review and Other Materials to Evaluate a Test."

Important Vocabulary

absolute meaning

Interpreting the meaning of an examinee's responses on an instrument by studying the content of the items without regard to what other examinees have scored on the instrument.

constructed response format

A test question in which the examinee is expected to respond to an item using his or her own ideas and words rather than choosing an answer from among two or more options.

concurrent validity

The extent to which individuals' current status on a criterion can be estimated from their current performance on an assessment instrument. For example, we can sample students already in college, give them a special aptitude test, and collect their current grade point averages. The relationship between the grades and the test is concurrent validity evidence, because the two measures were collected at about the same time.

construct

The theoretical concept or idea that explains the meaning of an examinee's response to the items in a test.

differential item functioning

An approach to studying test fairness at the level of individual test items rather than looking simply at the average difference between groups in an item's performance. The approach studies whether persons of the same ability perform differently on the item. For example, you would compare boys of low ability with girls of low ability, boys of average ability with girls of average ability, and boys of high ability with girls of high ability. If these comparisons show that examinees in the two groups who are of the same ability perform differently on an item, this may indicate that the item is biased. However, although test items may function differently in two groups, differential item functioning does not prove bias.
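The within-stratum comparison described above can be sketched in Python. The data below are invented toy records, and the crude pass-rate comparison stands in for the more formal DIF statistics (such as Mantel-Haenszel) used in practice.

```python
# Invented toy data for one test item: each record is
# (group, ability_stratum, answered_correctly).
responses = [
    ("boys", "low", 0),  ("boys", "low", 1),
    ("girls", "low", 1), ("girls", "low", 1),
    ("boys", "avg", 1),  ("boys", "avg", 1),
    ("girls", "avg", 1), ("girls", "avg", 0),
    ("boys", "high", 1), ("boys", "high", 1),
    ("girls", "high", 1),("girls", "high", 1),
]

def pass_rate(group, stratum):
    """Proportion of a group, within one ability stratum, answering correctly."""
    results = [r for g, s, r in responses if g == group and s == stratum]
    return sum(results) / len(results)

# Compare equal-ability groups stratum by stratum. Large within-stratum
# gaps flag the item for closer scrutiny; they suggest, but do not prove, bias.
gaps = {
    stratum: pass_rate("boys", stratum) - pass_rate("girls", stratum)
    for stratum in ("low", "avg", "high")
}
```

The key design point is the stratification: comparing raw group averages without matching on ability would confound item bias with genuine ability differences.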

differential validity

The term used when the scores of a test have significantly different degrees of validity for different groups of people (e.g., males vs. females, Black persons vs. White persons).

empirical information

Refers to the research data used to support the development of the test, the selection of items, claims to reliability, and claims to validity.

internal consistency reliability

A procedure for studying reliability when the focus of the investigation is on the consistency of scores on the same occasion and on similar content, but when conducting repeated testing or alternate forms testing is not possible. The procedure uses information about how consistent the examinees' scores are from one item (or one part of the test) to the next to estimate the consistency of examinees' scores on the entire test.
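A minimal sketch of this computation, using Cronbach's alpha (the internal consistency estimate cited in the SAQ review) on an invented toy item-response matrix:

```python
# Invented toy data: rows are examinees, columns are items;
# 1 = endorsed the problem-indicative response, 0 = did not.
scores = [
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
]

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(matrix):
    """Cronbach's alpha: (k / (k-1)) * (1 - sum of item variances / total variance)."""
    k = len(matrix[0])
    item_vars = [variance([row[i] for row in matrix]) for i in range(k)]
    total_var = variance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

alpha = cronbach_alpha(scores)
```

Note that alpha rises when items covary strongly, which is why the SAQ reviewer cautions that near-duplicate items can inflate it without adding measurement information.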

items

The questions, exercises, and activities appearing on an assessment procedure. Typically, the term "item" is used to refer to paper-and-pencil test exercises.

multiple-choice

This format consists of a stem that poses a question or sets a problem and a set of two or more response choices for answering the question or solving the problem. Only one of the response choices is the correct or best answer.

norm sample

A well-defined group of examinees who have been given the same assessment under the same conditions (same time limits, same directions, same equipment and materials, etc.). The local norm group would include examinees in the local agency or school district only; the national norm group would include examinees in the publisher's sample chosen to represent all examinees in the nation of the type who would take the test.

normative data

The statistical data resulting from administering the test to a nationally representative sample of examinees.

norm-referencing

A framework for interpreting an examinee's score by comparing the examinee's test performance with the performance of other examinees in a well-defined group who took the same test. Norm-referencing answers the question, "How did this examinee do compared to other examinees?"

percentile ranks

A norm-referenced score that tells the percentage of persons in a norm group scoring lower than a particular raw score.
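A minimal sketch of this computation on invented toy data, following the definition above (the percentage of the norm group scoring strictly lower than the raw score):

```python
# Invented toy norm-group raw scores, sorted for readability.
norm_scores = [5, 8, 8, 10, 12, 15, 15, 17, 20, 25]

def percentile_rank(raw_score, norm_group):
    """Percentage of the norm group scoring lower than the given raw score."""
    below = sum(1 for s in norm_group if s < raw_score)
    return 100 * below / len(norm_group)
```

Note that other references sometimes count scores at or below the raw score, or add half the ties; which convention a publisher uses should be stated in its technical manual.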

pilot testing

The process of trying out the test or its items with a small sample of examinees to obtain preliminary information about how well the test is functioning. The results of a piloting exercise are used to make adjustments that improve the test or its items.

profile of different dimensions

An assessment approach that identifies an examinee's pattern of scores in several different but related areas. The approach is useful when you know little about an examinee and want to get a rough idea of the examinee's needs. The rough profile of the examinee provides only general guidance and must be followed up with more detailed diagnosis.

reliability

The consistency of examinees' assessment results (scores) when the test procedure is repeated for a group of examinees. Reliability refers to the consistency of scores rather than to the assessment instrument itself. A reliability coefficient is any one of several statistical indices that quantify the amount of consistency in assessment scores.

standardization

A process in which the procedures, administration, materials, and scoring rules are fixed so that as far as possible the assessment is the same at different times and places. Sometimes the term is restricted to mean the process by which the final version of a test is administered to a nationally representative sample of examinees for purposes of developing various types of norm-referenced scores.

technical manual

A publication prepared by a test developer that explains the technical details of how the test was developed, how the norms were created, the procedures for selecting test items, the procedures for equating different forms of the test, and the reliability and validity studies that have been completed for the test.

test-retest reliability

A procedure for estimating reliability when the focus of the study is the consistency of the examinees' scores from one occasion to the next and when they are administered the same items on both occasions. The purpose of such studies is to identify the amount of error in the scores that can be attributed to the occasions on which the examinees took the test when the items' content stayed the same.

true-false

This type of item format consists of a statement or proposition that the examinee must judge as true or false.

validity

Evidence and theory that support the soundness of one's interpretations and uses of examinees' assessment results. To validate interpretations and uses of examinees' assessment results, one must combine evidence from a variety of sources that collectively demonstrate that the intended interpretations and uses of the test scores are appropriate and defensible. Evidence must also demonstrate that examinees would experience no serious negative consequences when their test scores are used as intended.