Using a Mental Measurements Yearbook Review and Other Materials to Evaluate a Test

Anthony J. Nitko
Professor Emeritus of Psychology in Education, University of Pittsburgh
Adjunct Professor of Educational Psychology, University of Arizona


Tests play important roles in decisions about people, agencies, and institutions, so you should select a test carefully. The Mental Measurements Yearbook reviews should be one part of your comprehensive evaluation of a test before you adopt it. A committee should carefully examine and evaluate any test under consideration, and its members should represent those who will be required to use and interpret the test results.

This lesson describes a systematic procedure for conducting a test review and evaluation. No test can perfectly match a particular user's needs, so comparing the merits of one test with another is an important step in choosing the better product.

Steps to Follow When Evaluating a Test

There are six steps you should follow:

  1. Clarify Your Purpose
  2. Use Your Local Context as the Basis for Evaluating a Test
  3. Study Professional Reviews of the Test Materials
  4. Obtain Copies of the Test You are Evaluating
  5. Summarize the Strengths and Weaknesses of the Test
  6. Come to a Conclusion about the Test and Support that Conclusion

Clarify Your Purpose

The first step in evaluating a test is to pinpoint the specific purpose(s) for obtaining examinee information and to find out who will be using the information to make decisions. The clearer you are about the purposes and conditions under which test information will be used, the better you will be able to select a test that yields valid scores. Your goal is to select the test that gives the most valid results within your budget constraints.

Things you need to keep clearly in mind as you begin your test evaluation process include:

1. The general community setting in which the assessment will be used: type of community, ages or grades of clients or students, persons who will be helped by using an appropriate assessment, and persons who will be in charge of interpreting the assessment results.

2. The specific decisions, purposes, and/or uses intended for the assessment results: such as identifying specific reading skills needing remediation, appraising a client's emotional needs or areas of anxiety as a prelude to counseling, appraising a client's aptitude for mechanical activities that a counselor will discuss during guidance sessions, or surveying a client's general levels of substance abuse so a treatment plan can be developed. Be as specific as you can.

3. The way you believe that using test scores or other assessment information will help to improve the decisions, serve the purpose, or solve the problem: The better you can articulate, from the outset, what you expect to accomplish by using an assessment procedure, the better you will be able to evaluate the many options open to you, and to choose the most satisfactory one.

4. The need to strike a balance between the strengths and limitations of different types of testing formats. You need to have in mind a balance between such factors as time, cost, in-depth assessment of narrow areas, and less in-depth assessment of broad areas. The assessment procedure you select will be the result of compromises on several dimensions, so it is helpful to think about these early in the process. Some tests will be individually administered: These will require a skilled test administrator, take time to administer, and cost more per examinee than group administered tests. Individually administered tests, however, usually allow the test administrator to evaluate an examinee in depth, whereas a group administered test usually does not allow this in-depth assessment.

Use Your Local Context as the Basis for Evaluating a Test

What Assessments Are Already Used? Before you set out to select a new test, you should take stock of the assessments already being used in your local situation (in your practice, your agency, your school district, etc.). In your agency, for example, what information is already collected as part of the intake process? How does this information overlap with what a new assessment would provide? Are there any advantages to using a new assessment to replace what is already collected by other assessments, or will the new assessment procedure be redundant? In a school district, for example, what type of assessments do teachers already do, of what quality are these assessments, and do they serve the perceived need for the new test?

External Assessments vs. Locally Crafted Assessments. You will need a perspective on what an external assessment contributes beyond the locally crafted assessments currently being used by your agency, staff, or school. External assessments do not match the local needs framework exactly. You may have talented professionals in your own agency who could develop assessment procedures that will serve your local needs better than the assessment procedures you could purchase. A school district may decide, for example, that it will be wiser and have more instructional payoff to spend the district's money to improve teacher-crafted assessment procedures than to purchase an external assessment procedure such as a standardized test. Locally crafted assessments, however, often do not yield as reliable or valid scores as some products you can buy, especially for assessing complex psychological constructs. Further, locally crafted assessments do not have national norms to help you interpret the test scores.

Educational State-Mandated Assessments vs. Standardized Tests. Testing in schools presents some special issues. In many states, there is a mandated state assessment program. This may be a basic skills assessment, an accountability program, or a more complex assessment. To reduce redundancy, the assessment you purchase for your local school should supplement this mandated assessment and serve other, nonduplicating purposes. Content and performance standards are often defined by the states, and states must attend to these standards to receive federal funding. Thus, you need to select a standardized achievement test that is compatible with your state-mandated assessment.

Evaluating an Agency or a School District. Sometimes an agency or a school district wishes to use an external assessment instrument to evaluate itself. Agency and school officials should be aware that relying on a test alone is an especially weak foundation for evaluating personnel and programs. Program evaluation itself is a technical area requiring a well-prepared professional evaluator. One suggestion is to search the Internet for "program evaluation" or, if you are in the United States, to contact organizations such as the American Evaluation Association, the Evaluation Center at Western Michigan University, and the American Educational Research Association's Division H.

Qualifications of the Staff. Another consideration is the qualifications of your agency's or school district's staff in relation to the assessment procedure proposed. For example, individually administered tests of intelligence and personality require specially trained professionals to administer and interpret them. Group administered tests of personality and scholastic ability require a different type of training to administer and interpret. Failure to use specially trained test administrators for these types of tests usually means the test results will have very low validity. If such professionals are in short supply in your agency or school district, you will want to use assessment procedures that do not require a high level of professional training. Similarly, using open response formats, performance assessments, and portfolios requires the time and expense of educating staff members about scoring procedures and interpreting the test results.

Study Professional Reviews of the Test Materials

If you have done your homework as just described, you will be in a good position to locate assessments and to begin reviewing them. The information sources below will help you identify assessments that approximate your needs. Of particular help in identifying tests and obtaining descriptions and professional reviews of them are the following Buros publications:

Test Reviews Online
Tests in Print
Mental Measurements Yearbooks

Other sources include:

ETS Test Collection Database
Practical Assessment, Research & Evaluation
Test publishers' catalogues
Professional journals
Internet web searches using the test name or variable assessed
Other publications (listed in standard educational and psychological measurement textbooks or provided by a college or university librarian)

Obtain Copies of the Test You Are Evaluating

After narrowing your choices to a few assessment procedures that appear to suit your needs, you should obtain copies of the materials and assessment tasks; detailed descriptions of the assessment content and rationale behind its selection; materials related to scoring, reporting, and interpreting assessment results; information about the cost of the assessment materials and scoring service; and technical information about the assessment. (Some test publishers restrict ordering materials to persons who are qualified to administer them. This practice is intended to prevent test abuse and other inappropriate test usage.)

Much of this material may be bundled together in a specimen set, which is designed as a marketing tool as well as for critical review of materials. As a result, not all the materials you need to make an informed review of a test are included. For example, some publishers' specimen sets do not include complete copies of the assessment booklets or scoring guidelines. You will need to order these separately.

Technical Information About a Test's Quality. Technical information about a test's quality is not found in a publisher's catalogue. A test's technical manual should contain information about how the test was developed, its reliability coefficients, its standard errors of measurement, correlational and validity studies, equating methods, item analyses procedures, and norming-sample data. Technical manuals are not typically included in specimen sets and must be ordered separately from the tests. Although an agency's testing directors should have copies of the technical manuals for the tests it uses, too often they do not. Some colleges and universities that maintain test collections for their faculty and students to evaluate may have technical manuals. Usually, however, you will need to order the technical manual directly from the test's publisher.
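
Reading a technical manual critically is easier if you know roughly how its indices are computed. As one illustration, a common reliability index, coefficient alpha, can be computed directly from examinees' item scores; the sketch below uses tiny, entirely hypothetical data.

```python
def cronbach_alpha(item_scores):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of totals).

    item_scores is a list of per-item score lists, one inner list per item,
    aligned so position i in every inner list belongs to the same examinee.
    Population variances (divide by n) are used throughout.
    """
    k = len(item_scores)
    n = len(item_scores[0])

    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    totals = [sum(item[i] for item in item_scores) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(variance(it) for it in item_scores) / variance(totals))

# Three items answered by four examinees (hypothetical scores):
alpha = cronbach_alpha([[2, 4, 3, 5], [3, 4, 2, 5], [2, 5, 3, 4]])
print(round(alpha, 2))  # 0.89
```

Real reliability estimates require far larger samples than this; the point is only that the coefficients reported in a manual are ordinary computations that a reviewer can verify.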

A Committee Should Review All Materials. Once you obtain the materials, a committee should review them. Be sure to compare assessments against the purposes you had in mind for using them. It might be helpful for the committee to obtain input from noncommittee members for certain parts of an assessment (e.g., in a school district, mathematics teachers for the mathematics section, reading teachers for the reading assessment). You could also call upon a college or university faculty member in a relevant psychology or education department to help: A testing and measurement faculty member may be helpful in reviewing and/or explaining technical material; clinical psychology faculty may help with clinical assessments; etc. Examples of the organizations in the United States you might contact for names of local specialists to help are: the National Council on Measurement in Education, Association for Assessment in Counseling and Education, American Psychological Association, state and provincial psychological associations, and American Counseling Association. If you are not in the United States, you may have to search the Internet to find addresses of professional associations.

Tests Must Match the Local Program Goals. It is important to match the test's viewpoint and coverage with the local agency's or school's goals and objectives. The assessment procedures used in a clinic, for example, should be consistent with the clinic's philosophy and approaches to its clients, and with all applicable legal requirements. The types of questions posed to a client and the variables scored for a client should be relevant to the treatment interventions that the clinic uses and to the way the clinic judges the client's success.

In education, it is important to match each test item with your state's standards and state or local curriculum. You do this by obtaining the complete list of standards or learning objectives, organized by grade level. Two persons could, for example, independently read each test item and record which standard or learning objective it matches. When all items have been matched, they could compare their results and reconcile the differences. The findings are summarized in a table that lists each standard and the I.D. numbers of the test items matching each. The number of mismatching items is also recorded. This should be done separately for each grade, because a test's items may appear at a grade level that is different from the grade at which the corresponding learning target is taught. If there are many of these grade-sequence mismatches, the test will not be suitable for your school district. Be sure to note especially the match between the kinds of thinking and performance activities implied by the standards and the test items. Often the items' topical contents match the curriculum, but the types of thinking processes and performances required of students to respond appropriately to the items do not. Finally, find out the month of the school year during which the district plans to administer the test. Then, determine what proportion of the test's items contain content that will have been taught before testing begins.
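
The two-rater matching procedure just described can be sketched in a few lines of code; every item number and standard code below is hypothetical.

```python
from collections import defaultdict

# Each rater independently records the standard code that each test item
# matches; "NONE" marks an item that matches no standard.
rater_a = {1: "M.1", 2: "M.1", 3: "M.2", 4: "NONE", 5: "M.3"}
rater_b = {1: "M.1", 2: "M.2", 3: "M.2", 4: "NONE", 5: "M.3"}

# Items on which the raters disagree must be reconciled by discussion.
disagreements = sorted(i for i in rater_a if rater_a[i] != rater_b[i])

# Suppose discussion settles on rater A's judgment for the disputed item.
reconciled = dict(rater_a)

# Summary table: the item numbers matching each standard, plus a count
# of items that match no standard at this grade level.
by_standard = defaultdict(list)
for item, standard in sorted(reconciled.items()):
    by_standard[standard].append(item)
mismatch_count = len(by_standard.pop("NONE", []))
```

Building a table like this separately for each grade level makes grade-sequence mismatches visible: an item that matches a standard taught at a different grade shows up as a mismatch in that grade's table.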

Pilot the Assessment. If possible, you should administer the assessment to a few clients or students to get a feel for how they respond. This would be especially important if performance activities or individually administered assessment tasks are included. It may be that for some otherwise appealing performance tasks, time limits or instructions are not sufficient and confusion results. Such problems are less likely if the assessment was professionally developed and standardized on a national sample.

Summarize the Strengths and Weaknesses of the Test

It will help your test evaluation if you systematically organize relevant information in one or two pages. Using a form is a concise way of sharing information among committee members or with others who may help make decisions about the choice. Figure 1 suggests what information to record for your review and shows a format you can use to summarize this information.

Figure 1.
A Suggested Outline for Organizing Information in a Test Evaluation

Identifying Information
1. Title, publisher, copyright date
2. Your purpose for wanting a test and the decisions you want to make using it, including the types of examinees with whom you will use the test
3. Purpose of the test as described by the publisher
4. Age, gender, grade level(s), other important characteristics for the population for which the publisher says the test is designed
5. Types of scores and norms the test provides
6. Cost of materials per examinee and cost for other needed services such as scoring and reports

Content Evaluation
1. Publisher's description and rationale for including the specific tasks or questions on the test
2. Quality and clarity of the tasks or questions themselves
3. Currency of the content included in the tasks or questions on the test
4. Match of the content of the test questions to the local situation (e.g., agency's approach to diagnosis and therapy, school district's curriculum, or state's standards)
5. The extent to which the tasks or questions are fair to members of ethnic and gender groups

Usefulness of the Results for Professional Practice
1. Publisher's description and rationale for how the test results may be used by practitioners to improve professional practice (e.g., therapists to improve treatment, teachers to improve instruction)
2. Local practitioners' evaluations of how they could use the test results to improve their practice (e.g., therapists to improve treatment, teachers to improve instruction)
3. Overlap of the proposed assessment with the assessment tools currently used by local practitioners

Technical Evaluation
1. Representativeness, recency, and local relevance of the national norms
2. Types of reliability coefficients and their values (use average values for each type if necessary)
3. Summary of the empirical evidence for the validity of the test for the specific purpose(s) you have in mind for the test
4. Likelihood that scores from the proposed test will be used in a way that has adverse effects on persons with disabilities, minority group members, or females

Practical Evaluation
1. Quality of the manual and practitioner-oriented materials
2. Ease of administration and scoring
3. Cost and usefulness of scoring services
4. Estimated annual costs (time and money) if the assessment procedure is adopted for the agency or school district
5. Likely reaction of the public if the test is adopted

Overall Evaluation
1. Evaluative comments of published reviews (e.g., MMY)
2. Your conclusions about the positive aspects of the test
3. Your evaluative conclusions about the negative aspects of the test
4. Your summary and recommendations about adopting this test for each of the specific purposes you stated earlier

List of References Used for This Test Evaluation
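
If the committee shares its reviews electronically, the outline above can be captured as a simple record. A minimal sketch, in which every field name is illustrative rather than prescribed:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TestEvaluation:
    # Identifying Information
    title: str
    publisher: str
    copyright_date: str
    local_purpose: str            # your purpose and intended decisions
    publisher_purpose: str
    intended_population: str
    scores_and_norms: str
    cost_per_examinee: float
    # One list of evaluative notes per remaining Figure 1 section
    content_notes: List[str] = field(default_factory=list)
    practice_notes: List[str] = field(default_factory=list)
    technical_notes: List[str] = field(default_factory=list)
    practical_notes: List[str] = field(default_factory=list)
    overall_recommendation: str = ""
    references: List[str] = field(default_factory=list)
```

One completed record per candidate test keeps the committee's comparisons side by side and within the one- or two-page limit suggested above.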


In this lesson we have shown you how to perform a thorough review and evaluation of a test for use in your local situation. We explained six steps you should follow: (1) Clarify Your Purpose, (2) Use Your Local Context as the Basis for Evaluating a New Test, (3) Study Professional Reviews of the Test Materials, (4) Obtain Copies of the Test You are Evaluating, (5) Summarize the Strengths and Weaknesses of the Test for Your Local Situation, and (6) Come to a Conclusion about the Test and Support That Conclusion. We have provided a format for recording and writing your evaluation so you may share and discuss it with others.

The reviews in the Mental Measurements Yearbooks play an important part in your test evaluation process. Studying these reviews, and paying careful attention to how their evaluative comments relate to the local decisions and purposes for which you want to use a test, will help you choose a test that yields more valid scores.

Important Vocabulary

analytic scoring

Scoring in which evaluators first rate or score the separate parts or traits (dimensions) of an examinee's product or process and then sum these part scores to obtain a total score. A piece of writing, for example, may be rated separately as to ideas and content, organization, voice, choice of words, sentence structure, and use of English mechanics. These separate ratings may then be combined to report an overall assessment.

content standards

Statements about the subject-matter facts, concepts, principles, and so on that students are expected to learn. For example, a standard for life science might be, "Students should know that the cell nucleus is where genetic information is located in plants and animals."

correlational studies

Empirical research that results in one or more correlation coefficients. A correlation coefficient is a statistical index that quantifies the degree of relationship between the scores from one assessment and the scores from another. The index is reported on a scale of -1 to +1, reflecting that the relationship between two sets of scores can be negative or positive in direction.
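
As a small illustration, the coefficient can be computed directly from two lists of scores (all scores below are hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den

# Five examinees' scores on two assessments:
r = pearson_r([10, 12, 14, 16, 18], [11, 12, 16, 15, 20])
print(round(r, 2))  # 0.93
```

A value near +1 or -1 indicates a strong relationship between the two sets of scores; a value near 0 indicates little or none.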

equating methods

The statistical methods that a publisher uses to assure that the scores from one form of a test have the same meaning as the scores from another form of the test.

external assessment

An assessment procedure that comes from outside the local agency or school district and was not crafted by local agency professionals or teachers in the district. A standardized test and a state's assessment are two examples.

holistic scoring

Scoring that requires rating an examinee's product or the process an examinee uses as a whole without first scoring parts or components separately.

group administered tests

Tests that are administered to several examinees simultaneously in a standardized manner.

individually administered tests

Tests that are administered personally to one examinee at a time.

item analyses procedures

The process of collecting, summarizing, and using information from examinees' responses to make decisions about how each item is functioning.

locally crafted assessments

Tests and other assessment tools that are developed in the local agency or school district by its employees.

mandated tests

Specific tests that examinees must take because the law requires them to do so. For example, state educational assessment programs are usually mandated for students; local governments may require firefighter, police, and other candidates to take a specific test.

multiple assessment

Combining the results from several different types of assessments (such as homework, class performance, quizzes, projects, and tests) to improve the validity of decisions about a student's attainments.

national norms

Test results from a well-defined group of students who have been given the same assessment under the same conditions (same time limits, same directions, same equipment and materials, etc.). The local norms would include only examinees served by a local agency or in the local school district; the national norms would include examinees in the publisher's sample chosen to represent all examinees in the nation.

norming-sample data

The statistical data resulting from administering the test to a nationally representative sample of examinees.

open response formats

A type of test question that has many possible correct answers. A closed response format, on the other hand, asks a question for which there is only one correct answer.

performance assessments

Any assessment technique that requires examinees to physically carry out a complex, extended process (e.g., present a demonstration, present an argument orally, play a musical piece, or climb a knotted rope) or produce an important product (e.g., assemble an object, write a poem, report on an experiment, or create a painting). To qualify as an assessment there must be both (1) a hands-on task presented to a student to complete and (2) clearly defined criteria to evaluate how well a student achieved specific learning objectives.

performance standards

Statements about the things students can perform or do once the content standards are learned. For example, "Students can identify the cell nucleus in microscopic slides of various plant and animal cells."

portfolio

A limited collection of a person's work that is used for purposes of assessment to either present the person's best work(s) or demonstrate the person's improvement over a given time span. Items included in a portfolio are carefully and deliberately selected so the collection as a whole accomplishes its purpose.

professional program evaluator

A person who has the education, training, and experience to plan and carry out the evaluation of an agency's or school district's program. A program is a set of procedures and plan of action for achieving one or more of the agency's or school district's goals.

program evaluation

The process of collecting information and making judgments about the worth of an agency's or school district's program. A program is a set of procedures and plan of action for achieving one or more of the agency's or school district's goals.

reliability coefficients

Any one of several statistical indices that quantify the amount of consistency in assessment scores (see reliability).

reliability

The consistency of examinees' assessment results (scores) when the test procedure is repeated for a group of examinees. Reliability refers to the consistency of the scores rather than to the assessment instrument itself.

scoring rubrics

A coherent set of rules used to assess the quality of a student's performance. These rules guide the evaluator's judgments and ensure that evaluators apply the rules consistently from one student to the next. The rules may be in the form of a rating scale or a checklist. Complex performances require evaluation of several learning objectives or several parts of the performance. To do this, the evaluator must use several scoring rubrics: one for each learning objective or each part of the performance.

specimen set

A packet of materials from a test publisher containing a sample of the test, sample score reports, promotional materials, and (occasionally) a technical report of the test's quality.

standard error of measurement

An estimate of the standard deviation (spread) of the hypothetical distribution of obtained scores that would result from repeatedly testing the same person with the same assessment. This index estimates how much examinees' obtained scores are likely to differ from their true scores.
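
In classical test theory this index is computed as SEM = SD x sqrt(1 - reliability); a small sketch with illustrative numbers:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical estimate: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# A scale with standard deviation 15 and reliability coefficient .91:
print(round(standard_error_of_measurement(15, 0.91), 1))  # 4.5
```

Roughly two-thirds of a person's hypothetical repeated scores would fall within one SEM of that person's true score.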

standardized test

A test for which the procedures, administration, materials, and scoring rules are fixed so that as far as possible the assessment is the same at different times and places.

state assessment program

Tests and other assessments that the law requires to be administered to all students at designated grade levels. Requirements vary by state and may include one or more of the following: a standardized multiple-choice test, a writing assessment, performance activities, portfolios, etc. The purposes of the testing program may be to hold schools accountable, to remediate school programs, or to determine whether students should be certified.

technical manual

A publication prepared by a test developer that explains the technical details of how the test was developed, how the norms were created, the procedures for selecting test items, the procedures for equating different forms of the test, and reliability and validity studies that have been completed for the test.

valid scores or validity

Evidence and theory that support the soundness of one's interpretations and uses of examinees' assessment results. To validate interpretations and uses of examinees' assessment results, one must combine evidence from a variety of sources that collectively demonstrate that the intended interpretations and uses of the test scores are appropriate and defensible. Evidence must also demonstrate that examinees would experience no serious negative consequences when their test scores are used as intended.

validity studies

Empirical research that provides evidence that the scores from a particular test are appropriately interpreted and used in the way they are intended, and produce no negative consequence for the examinees when so used.