The U.S. Equal Employment Opportunity Commission

Meeting of May 16, 2007 - Employment Testing and Screening

Statement of James L. Outtz, Ph.D., President, Outtz and Associates

Good morning. I would like to thank the Commission for providing me the opportunity to speak today on the important topic of employment testing.

I am going to discuss three topics today. First, I will address how employment tests and other selection procedures are used. Next, I will examine methods for minimizing the likelihood of discrimination. Finally, I will explore emerging trends and challenges. Special emphasis will be given to the interrelationship between sound personnel selection practices and scientific research, particularly research in the field of industrial and organizational psychology.

The Use of Employment Tests and Other Selection Procedures

Employment testing is quite prevalent in America. Employers, both public and private, utilize a variety of standardized selection devices and procedures to make staffing decisions. Employment selection instruments commonly used today include cognitive ability tests, assessment centers, work samples, structured interviews, situational judgment tests, background checks, integrity tests and other personality measures. Tests are used for such staffing decisions as hiring, promotion, job assignment, selection for training, and determining minimum qualifications. It should be noted that the use of tests extends far beyond employment. College admissions, professional licensure and other high-stakes selection decisions also rely heavily on tests of one sort or another.

The number and variety of tests and selection procedures in use by employers is actually expanding. This is due in large measure to advances in technology designed to improve efficiency in administration and scoring. These technologies include:

Minimizing the Likelihood of Discrimination

Historical Perspective

Before discussing specific ways to minimize the likelihood of employment discrimination, it may be helpful to describe the history of the “employment testing-employment discrimination” controversy. I will begin by looking back several decades at the evolution of a number of key employment testing issues.

In 1975, a mere 12 years after Dr. Martin Luther King delivered his “I Have a Dream” speech, the personnel selection practices of employers were being subjected to intense scrutiny due to enactment of the 1964 Civil Rights Act and, more specifically, Title VII of the Act, which deals with equal employment opportunity. Title VII applied primarily to private sector employers. In 1972, the Equal Employment Opportunity Act expanded coverage of Title VII to public sector employers, including state and local governments.

Governmental and legal involvement in personnel selection issues increased throughout the 1970s. This involvement began to have a profound effect on personnel selection practices as well as personnel selection research. Personnel selection practices had been the purview of human resource professionals and personnel psychologists. Federal guidelines and regulations as well as mushrooming litigation changed that.

Supreme Court cases such as Griggs v. Duke Power (1971), Albemarle Paper Company v. Moody (1975), and Washington v. Davis (1976) spelled out the legal standards that were to be applied to the personnel practices of employers. In 1978, the Uniform Guidelines on Employee Selection Procedures (hereafter, the Guidelines) were adopted by the EEOC and the other federal agencies with primary responsibility for enforcement of Title VII.

One result of this mushrooming government and legal involvement was to focus attention on the nexus between employment discrimination case law and what was considered best practice in employment selection. Within the field of industrial and organizational psychology, monitoring of legal developments in personnel selection became the order of the day. As an example, Lodvinka and Shoenfeldt (1978) compared the EEOC guidelines, as interpreted by the courts, with the APA standards and Division 14 principles. Kleiman and Faley (1978) reviewed 31 court cases to determine the standards set by the courts in assessing the validity of paper-and-pencil tests.

External influences on personnel selection created a number of concerns. One concern was that employers would abandon “valid”, objective, merit-based selection practices in favor of highly subjective, arbitrary procedures designed to comply with legal requirements. Another concern was that scientific advances in selection would be overshadowed by programs and practices aimed at balancing selection outcomes on the basis of demographic characteristics rather than increasing the accuracy of selection decisions.

The external influences that took hold in the 1970s resulted in a greater focus on subgroup outcomes or adverse impact than had been the case prior to that time. There was an increase in the number of research studies investigating not only the validity evidence associated with different selection devices, but also the degree to which those devices produced adverse impact and thus were possibly discriminatory.

Schmidt et al. (1977) published a study in which they compared job sample and paper-and-pencil tests with regard to validity evidence and adverse impact. They found that the job sample tests had considerably less adverse impact. In addition, both minority and non-minority examinees saw the job sample tests as more fair. Field et al. (1977) compared minority and non-minority employees at a large paper company with regard to mean scores on four predictors and two job performance measures. They reported that two of the four predictors did not produce a statistically significant difference in subgroup means. They also found statistically significant differences between subgroups on both criterion measures. All of the predictors had a statistically significant correlation with both criterion measures.

Kesselman and Lopez (1979) compared the validity and adverse impact of a written accounting job knowledge test with a commercially available mental ability test. They found that the job knowledge test produced validity evidence comparable to that produced by mental ability tests. However, the job knowledge test produced less adverse impact.

Two important findings that emerged from the selection research conducted during the seventies were that:

  1. Some tests had less adverse impact than others. However, the average score for minority applicants was almost always lower than that of non-minority applicants.
  2. The average criterion (job performance) score of minority group applicants was typically lower than that of non-minority applicants.

Some chose to interpret these findings as indicating that employment tests were valid and non-discriminatory. That is, the tests simply reflected real differences between groups in their ability to perform the job. Others chose to interpret the findings as indicating the possibility that both the tests and job performance measures were biased and unfair to minority group members. Thus began a debate, both legal and scientific, over what validity means and what constitutes a fair test or selection procedure.

One aspect of this debate focused on the issue of test bias. Some researchers interpreted the average score differences between minority and non-minority applicants on employment tests as an indication that employment tests were culturally biased. That is, test content was geared toward information more prevalent among non-minority group members, giving those persons an unfair advantage. As a result, they argued, the tests produced systematic errors in measurement or prediction (Murphy and Davidshofer, 1994).

Issues such as test bias led to a broader discussion of fairness. The fairness debate was more than a scientific exercise. The EEOC’s Uniform Guidelines stipulate that the fairness of a selection procedure should be investigated whenever feasible.

Issues of test bias and test fairness have not been resolved, partly because there is no clear consensus regarding the meaning of these concepts. Bias in selection is generally thought to mean predictive bias; that is, the systematic under-prediction or over-prediction of performance. This definition was proposed by Cleary (1968) as a measure of fairness. A technical definition of predictive bias, however, is not necessarily an acceptable definition of fairness.
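The idea of predictive bias can be made concrete with a small sketch. One common way to look for it is to fit a single regression line of job performance on test score for all applicants pooled, and then average the prediction errors within each group: a group whose mean residual is positive is being under-predicted, and a group whose mean residual is negative is being over-predicted. The function name, group labels, and toy numbers below are illustrative assumptions, not data from any study cited here, and a real analysis would also test slope and intercept differences statistically.

```python
from statistics import mean

def mean_residual_by_group(scores, performance, groups):
    """Fit one pooled least-squares line, then average the residuals per group."""
    x_bar, y_bar = mean(scores), mean(performance)
    sxx = sum((x - x_bar) ** 2 for x in scores)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(scores, performance))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = {}
    for x, y, g in zip(scores, performance, groups):
        # Positive residual: actual performance exceeded the pooled prediction.
        residuals.setdefault(g, []).append(y - (intercept + slope * x))
    return {g: mean(r) for g, r in residuals.items()}

# Hypothetical toy data: same slope in both groups, different intercepts.
scores = [50, 60, 70, 80, 40, 50, 60, 70]
performance = [3.0, 3.5, 4.0, 4.5, 3.0, 3.5, 4.0, 4.5]
groups = ["A"] * 4 + ["B"] * 4

bias = mean_residual_by_group(scores, performance, groups)
for group, value in sorted(bias.items()):
    print(group, round(value, 3))
```

In this toy data the pooled line under-predicts group B and over-predicts group A, which is the pattern the predictive-bias definition is meant to detect.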

Fairness is in part a social term that encompasses consideration of testing outcomes. Thorndike (1971) proposed a model of fairness predicated on the relationship between test outcomes and performance outcomes. According to this model, a selection procedure is fair only if the proportion of minority applicants selected on the basis of the procedure is equal to the proportion that would be selected on the basis of actual job performance. There were other fairness models offered during the 1970s. Agreement on a given model has proven elusive.
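The comparison at the heart of the Thorndike model can be sketched in a few lines. For one applicant group, the sketch below computes the proportion who would be selected by a test cutoff and the proportion who would be selected if actual job performance were the basis, and reports the gap between the two. The cutoffs, function names, and scores are hypothetical and chosen only for illustration; this is a simplified reading of the model, not a reproduction of Thorndike's own procedure.

```python
def proportion_at_or_above(values, cutoff):
    """Share of values that meet or exceed the cutoff."""
    return sum(v >= cutoff for v in values) / len(values)

def thorndike_gap(test_scores, performance, test_cutoff, performance_cutoff):
    """Proportion selected by the test minus proportion selected on performance.

    A gap near zero is what the model treats as fair for this group;
    a negative gap means the test selects fewer than performance would.
    """
    selected = proportion_at_or_above(test_scores, test_cutoff)
    successful = proportion_at_or_above(performance, performance_cutoff)
    return selected - successful

# Hypothetical scores for ten applicants from one group.
test_scores = [55, 62, 48, 71, 66, 59, 45, 80, 52, 68]
performance = [3.1, 3.8, 2.9, 4.2, 3.6, 3.4, 3.0, 4.5, 3.5, 3.9]

gap = thorndike_gap(test_scores, performance, test_cutoff=60, performance_cutoff=3.2)
print(round(gap, 2))
```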

The debates during the 1970s revealed an overarching concern beyond the specific issues. It became clear that there was a sense within the scientific community that employment tests, and other selection procedures, were under attack by advocates of social change.

Most within the scientific community believed that the scientific issues in selection should be kept separate from social issues. I, for one, did not see how this would be possible. As long as selection practices had real world consequences, I did not see how issues regarding the type of selection procedures in use, or the methods of use, could be resolved solely on the basis of science.

Thus far I have focused the fairness discussion on employment tests. Measures of job performance produced just as much concern. In the mid-1970s (just as today), the most common criterion measure in validation research was supervisor ratings. This brought increased attention to subgroup differences on such measures. Note that a consistent finding in validation studies in the 1970s was a difference in the average supervisor ratings of minority and non-minority employees, with racial minorities receiving lower ratings. An issue arose as to whether the race of the rater and ratee contributed to this difference. The hypothesis was that ratees may receive higher (or positively biased) ratings from raters of the same race. Since most of the raters in organizations were non-minority, the lower ratings of minorities may have been due to this bias. Initial research on this question produced mixed results. Landy and Farr (1980) concluded, however, that raters generally give higher ratings to members of their own race. The results of a meta-analysis by Kraiger and Ford (1985) supported Landy and Farr’s conclusion.

Sackett and DuBois (1991) compared the Kraiger and Ford (1985) results with data from a large-scale military study and a large-scale civilian study. Their findings indicated that the “Whites rate Whites higher/Blacks rate Blacks higher” conclusion may be premature. They found that performance differences may depend in part on the dimension of job performance that is measured. The Sackett and DuBois study drew attention to the multidimensionality of job performance and its relationship to subgroup differences. Borman and Motowidlo (1993) distinguish between two elements of job performance: task performance and contextual performance. Task performance includes activities that relate directly to productivity, such as selling merchandise in a retail store or operating a production machine in a manufacturing plant. Contextual performance includes activities that support the broader organizational environment in which task performance must occur.

We are beginning to find that different predictors (tests) are needed to best predict specific aspects of job performance. The relative importance of the various aspects of job performance may be job-specific. Therefore, subgroup differences in test performance may or may not reflect subgroup differences in job performance, depending upon the dimension of job performance that is of interest. Given that subgroup differences in job performance may be influenced to some degree by rater-ratee race effects, identification of the most appropriate selection devices or procedures is more complicated than previously thought.

Reducing Adverse Impact by Expanding the Selection Process

Hunter and Hunter (1984) proposed that using different predictors in conjunction with cognitive ability tests might improve validity and reduce adverse impact (and by implication reduce the likelihood of discrimination). However, there was at that time no database for studying this possibility. That database exists today and it is growing. Considerable research has been conducted since 1984 on combining different selection devices to improve validity and reduce adverse impact. One approach is to broaden the spectrum of attributes or applicant characteristics measured. As an example, Griffin et al. (1996) investigated the relative validity of personality testing and the assessment center for predicting managerial performance. They found that personality testing resulted in significant incremental validity over an assessment center alone in predicting managerial performance.

Sackett and Ellingson (1997) developed tables for estimating the level of adverse impact that will result from combining two or more predictors into a composite. Bobko et al. (1999) discuss the implications of combining cognitive ability, the structured interview, conscientiousness, and biodata. They conclude that adverse impact can be reduced by using different combinations of predictors. They caution, however, that some adverse impact was present for all predictor combinations. Nevertheless, these findings indicate real progress.
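The arithmetic behind estimates of this kind can be sketched briefly. When k standardized predictors are added together with equal weight, the standardized subgroup difference on the sum equals the sum of the individual differences divided by the standard deviation of the sum, which is the square root of k + k(k - 1) times the average intercorrelation. The sketch below applies that identity; the d values and the intercorrelation are hypothetical, and it is offered as an illustration in the spirit of such tables rather than a reproduction of them.

```python
from math import sqrt

def composite_d(ds, mean_intercorrelation):
    """Standardized subgroup difference on an equally weighted sum of
    standardized predictors.

    ds: subgroup difference (in SD units) on each predictor.
    mean_intercorrelation: average correlation among the predictors.
    """
    k = len(ds)
    return sum(ds) / sqrt(k + k * (k - 1) * mean_intercorrelation)

# Hypothetical values: one predictor with a large difference, one with a small one.
cognitive_only = composite_d([1.0], 0.0)
combined = composite_d([1.0, 0.2], 0.2)
print(round(cognitive_only, 3), round(combined, 3))
```

Under these assumed values the composite difference is smaller than the difference on the first predictor alone but is not zero, which is consistent with the caution that some adverse impact remains for all combinations.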

Using Alternative Testing Media

A number of researchers have investigated the effects of varying the testing medium on validity and adverse impact. Pulakos and Schmitt (1996) investigated the validity and subgroup differences associated with a video-administered writing test and a written problem exercise compared to a traditional multiple-choice verbal ability measure. They found that the video-based test resulted in significant reductions in adverse impact.

Chan and Schmitt (1997) compared video-based versus paper-and-pencil situational judgment tests in terms of adverse impact and applicant reactions to the tests. The video-based test was found to produce less adverse impact and to elicit more positive reactions from applicants. Outtz (1998) explored the possibility of developing a taxonomy of test characteristics associated with different levels of adverse impact.

Differentially Weighting Selection Procedure Components

A number of researchers have proposed that the multidimensionality of job performance necessitates evaluation of the weights assigned to the different components of a selection procedure (De Corte, 1999; Doverspike, Winter, Healy, & Barrett, 1996; Hattrup, Rock, & Scalia, 1997; Murphy & Shiarella, 1997).

Murphy and Shiarella (1997), for example, proposed that weighting of predictors and criteria provides a better understanding of the relationship between selection and job performance. They show that the validity of a predictor composite can vary substantially depending upon the weight given to predictors and criterion measures. The 95% confidence interval for the validity coefficients ranged from .20 to .78.

Understanding the effect of weighting predictors in a composite is the next stage in the advancement of our knowledge of the relationship between selection strategies and adverse impact. The key question with regard to predictor weighting is, “How should the weights be determined?” Some researchers propose that predictor weights should be determined by the importance of the facet of job performance with which the predictor correlates (Murphy & Shiarella, 1997). Thus, if an organization places substantial importance on a facet of performance such as teamwork or organizational citizenship, then predictors of those dimensions of performance should be given the most weight. Murphy and Shiarella (1997) proposed that the appropriate weight for a particular component of a selection procedure should be determined by (a) the relative importance of each dimension of job performance, (b) the selection devices needed to predict each performance dimension, (c) the degree to which the selection devices and/or criterion dimensions are highly intercorrelated, and (d) whether each criterion dimension is most strongly related to a different selection device. Most, if not all, of the published research on weighting of predictors has dealt with the simple case of two applicant groups. In reality, the selection situation is likely to be far more complex, involving three or even four applicant groups.
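To see why the weights matter, it may help to work through the standard formula for the correlation between two weighted composites of standardized variables. With predictor weights w, criterion weights v, predictor intercorrelations Rxx, criterion intercorrelations Ryy, and predictor-criterion correlations Rxy, the composite validity is w'Rxy v divided by the square root of (w'Rxx w)(v'Ryy v). The sketch below implements that formula with plain Python lists. All correlations and weights are hypothetical, the variable names are illustrative, and the sketch is not a reproduction of the Murphy and Shiarella analyses.

```python
from math import sqrt

def bilinear(a, matrix, b):
    """Return a' * matrix * b for plain Python lists."""
    return sum(a[i] * matrix[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def composite_validity(w, v, r_xx, r_yy, r_xy):
    """Correlation between a weighted predictor composite and a weighted
    criterion composite, assuming all variables are standardized."""
    covariance = bilinear(w, r_xy, v)
    return covariance / sqrt(bilinear(w, r_xx, w) * bilinear(v, r_yy, v))

# Hypothetical correlations for two predictors and two performance dimensions.
r_xx = [[1.0, 0.1], [0.1, 1.0]]   # predictor intercorrelations
r_yy = [[1.0, 0.3], [0.3, 1.0]]   # criterion intercorrelations
r_xy = [[0.5, 0.2], [0.1, 0.4]]   # rows: predictors, columns: criteria

# Same predictor weights, different views of what performance means.
task_focus = composite_validity([1, 1], [1, 0], r_xx, r_yy, r_xy)
balanced = composite_validity([1, 1], [1, 1], r_xx, r_yy, r_xy)
print(round(task_focus, 3), round(balanced, 3))
```

Changing only the criterion weights changes the validity of the same predictor composite, which is why the importance an organization assigns to each performance dimension bears directly on how predictors should be weighted.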

It should be noted that weighting predictors falls under the “alternative methods of use” provision of the Uniform Guidelines. This provision states that:

“In conducting a validation study, the employer should consider available alternatives which will achieve the legitimate business purpose with lesser adverse impact.” (Uniform Guidelines, Section v.)

“The evidence of both the validity and utility of a selection procedure should support the method the user chooses for operational use of the procedure, if that method of use has a greater adverse impact than another method of use.” (Uniform Guidelines, Section 5G)

Therefore, an employer could be called upon to explain why an alternative weighting method that has less adverse impact and similar validity could not have been used.

Emerging Trends

Greater Stakeholder Consistency about the Problem

There appears to be greater consensus among the legal, scientific, and professional communities regarding the need to address subgroup differences on employment tests. The Uniform Guidelines have always emphasized the significance of adverse impact in determining whether a selection procedure is discriminatory. Recent updates of professional standards and best practices with regard to employment testing emphasize the need to address subgroup differences and the possible implications of such differences whenever possible. For example, both the Standards for Educational and Psychological Testing (1999) and the Society for Industrial and Organizational Psychology (SIOP) Principles for the Validation and Use of Personnel Selection Procedures (2003) address the possible implications of subgroup differences more directly. The Standards and the Principles relate adverse impact to construct-irrelevant variance or test bias. Construct-irrelevant variance is defined as excess, reliable variation in test scores that is irrelevant to the interpreted construct (Messick, 1989). Figure 1 provides a comparison of the Uniform Guidelines, the Standards and the Principles with regard to the treatment of adverse impact.

FIGURE 1. Perspectives on Adverse Impact

The Guidelines: Adverse Impact
A substantially different rate of selection in hiring, promotion or other employment decision which works to the disadvantage of members of a race, sex or ethnic group. (Definitions)

The Standards: Measurement Bias
…evidence of mean test score differences between relevant subgroups of examinees should, when feasible be examined… for construct-irrelevant variance (7.10)
Construct Irrelevance: The extent to which test scores are influenced by factors that are irrelevant to the construct that the test is intended to measure.
Construct Under-representation: The extent to which a test fails to capture the important aspects of the construct that the test is intended to measure. (Definitions)

The Principles: Measurement Bias
…sources of irrelevant variance that result in systematically higher or lower scores for members of particular groups, is a potential concern for all variables, both predictors and criteria. (P. 33)

Figure 1 shows that there is greater consistency across these documents with regard to the conceptualization of adverse impact.
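The Guidelines' definition of adverse impact rests on a comparison of selection rates, and that comparison is easy to sketch. The example below computes each group's selection rate and divides it by the highest group's rate. The 0.8 threshold reflects the four-fifths rule of thumb that the Uniform Guidelines describe for enforcement purposes; that rule is not quoted in this statement, and the function name, group labels, and applicant counts are hypothetical.

```python
def impact_ratios(applicants, selected):
    """Selection rate of each group divided by the highest group's rate."""
    rates = {group: selected[group] / applicants[group] for group in applicants}
    highest = max(rates.values())
    return {group: rate / highest for group, rate in rates.items()}

# Hypothetical applicant flow data.
applicants = {"Group A": 100, "Group B": 50}
selected = {"Group A": 40, "Group B": 12}

ratios = impact_ratios(applicants, selected)

# Groups whose ratio falls below four-fifths of the highest rate.
flagged = [group for group, ratio in ratios.items() if ratio < 0.8]
print(ratios, flagged)
```

A ratio below the threshold is generally treated as a signal to examine the procedure further, not as a finding of discrimination in itself.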

Estimating the Likelihood of Adverse Impact

A second trend in employment testing is the emergence of a scientific database that allows more accurate estimates of the likely adverse impact and validity of various combinations of employment tests or selection procedures. This in turn will result in greater scrutiny of the kinds of tests chosen for a selection procedure. If it can be determined a priori that certain test combinations and/or methods of use have greater adverse impact than others with the same validity, employers will find it more difficult to justify discriminatory tests on the grounds of business necessity.
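An a priori estimate of this kind can be sketched under simple assumptions. If scores in two groups are normally distributed with equal spread and differ by d standard deviations, and everyone above a single cutoff is selected, then the cutoff implied by the higher-scoring group's selection rate determines the other group's expected selection rate, and the ratio of the two rates follows directly. The normality, equal-variance, and single-cutoff conditions are assumptions of this sketch, as are the function name and the example values.

```python
from statistics import NormalDist

def expected_impact_ratio(d, majority_selection_rate):
    """Expected ratio of lower-scoring to higher-scoring group selection rates.

    d: standardized mean difference between the two groups.
    majority_selection_rate: share of the higher-scoring group selected.
    """
    standard_normal = NormalDist()
    # Cutoff in the higher-scoring group's standard-score units.
    cutoff = standard_normal.inv_cdf(1.0 - majority_selection_rate)
    # The other group's mean sits d units lower, so the cutoff is d units further away.
    minority_rate = 1.0 - standard_normal.cdf(cutoff + d)
    return minority_rate / majority_selection_rate

# Hypothetical comparison of two procedures at the same selection rate.
larger_difference = expected_impact_ratio(1.0, 0.5)
smaller_difference = expected_impact_ratio(0.5, 0.5)
print(round(larger_difference, 3), round(smaller_difference, 3))
```

Comparing candidate procedures this way before they are used is one way an employer could document that a lower-impact alternative of similar validity was considered.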

Although minimizing adverse impact and sustaining validity have traditionally been considered “conflicting” goals, the trend in thinking today is that the simultaneous achievement of these two goals may be possible, at least to a significant degree. Advances in understanding the multidimensionality of job performance have made it possible to identify the best predictors for the facet(s) of performance most important to the organization. This has created the possibility that “low-adverse-impact” selection devices may in fact be the best predictors of those aspects of performance that are most important to the employer. The outcome would be the selection of highly qualified applicants from diverse applicant groups.

Future Challenges

There is a critical need for more accurate descriptions of the employment tests in use today. Most tests described in the research literature are more akin to testing categories (e.g., work sample, assessment center, situational judgment, etc.) than actual descriptions of test content. This makes it difficult to compare specific testing methodologies in terms of validity and adverse impact. This lack of specificity also makes it difficult to determine the construct that the tests measure.

Some researchers bemoan the fact that despite progress in reducing adverse impact, it is difficult to eliminate it entirely. However, if a selection procedure is shown to be valid, minimization of adverse impact is all that is legally required. Nonetheless, many employers may consider any adverse impact unacceptable.

Finally, the existence of three and sometimes four racial/ethnic applicant groups may make it extremely difficult to eliminate or even minimize adverse impact for all such groups simultaneously.


American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for Educational and Psychological Testing. Washington, DC.

Bobko, P., Roth, P.L., & Potosky, D. (1999). Derivation and implications of a meta-analytic matrix incorporating cognitive ability, alternative predictors, and job performance. Personnel Psychology, 52, 561-589.

Borman, W.C., & Motowidlo, S.J. (1993). Expanding the criterion domain to include elements of contextual performance. Personnel Selection in Organizations, San Francisco, CA: Jossey-Bass Publishers.

Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology 82, 143-159.

Cleary, T.A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement 5, 115-124.

Day, D.V., & Silverman, S.B. (1989). Personality and job performance: Evidence of incremental validity. Personnel Psychology 42, 25-37.

De Corte, W. (1999). Weighing job performance predictors to both maximize the quality of the selected workforce and control the level of adverse impact. Journal of Applied Psychology, 84, 695-702.

Doverspike, D., Winter, J.L., Healy, M.C., & Barrett, G.V. (1996). Simulations as a method of illustrating the impact of differential weights on personnel selection outcomes. Human Performance, 9(3), 259-273.

Equal Employment Opportunity Commission (1978). Uniform Guidelines on Employee Selection Procedures. Washington, DC.

Field, H., Bayley, G.A., & Bayley, S. (1977). Employment test validation for minority and non-minority production workers. Personnel Psychology 30, 37-46.

Griffin, R.D., Rothstein, M.G., & Johnston, N.D. (1996). Personality testing and the assessment center: Incremental validity for managerial selection. Journal of Applied Psychology 81, 746-756.

Hattrup, K., Rock, J., & Scalia, C. (1997). The effects of varying conceptualizations of job performance on adverse impact, minority hiring, and predicted performance. Journal of Applied Psychology, 82, 656-664.

Hunter, J.E., & Hunter, R.F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.

Kesselman, G.A. & Lopez, F.E. (1979). The impact of job analysis on employment test validation for minority and non-minority accounting personnel. Personnel Psychology 32, 91-108.

Kleiman, L.S., & Faley, R.H. (1978). Assessing content validity: Standards set by the court. Personnel Psychology 31, 701-713.

Kraiger, K., & Ford, J.K. (1985). A meta-analysis of ratee race effects in performance ratings. Journal of Applied Psychology 70, 56-65.

Landy, F.J., & Farr, J.L. (1980). Performance rating. Psychological Bulletin 87, 72-107.

Lodvinka, J. & Shoenfeldt, L.F. (1978). Legal developments in employment testing: Albemarle and beyond. Personnel Psychology 31, 1-13.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational Measurement (pp. 13-103). New York: American Council on Education and Macmillan.

Murphy, K.R., & Davidshofer, C.O. (1994). Psychological Testing: Principles and Applications (3rd Edition). Englewood Cliffs, NJ. Prentice Hall.

Murphy, K.R., & Shiarella, H.A. (1997). Implications of the multidimensional nature of job performance for the validity of selection tests: Multivariate frameworks for studying test validity. Personnel Psychology, 50, 823-854.

Outtz, J.L. (1998). Testing medium, validity and test performance: Beyond multiple choice. In M. Hakel (Ed.), Beyond Multiple Choice: Evaluating Alternatives to Traditional Testing for Selection (pp. 41-57). Mahwah, NJ: Lawrence Erlbaum.

Pulakos, E.D., & Schmitt, N. (1996). An evaluation of two strategies for reducing adverse impact and their effects on criterion-related validity. Human Performance, 9, 241-258.

Robinson, G., & Dechant, K. (1997). Building a business case for diversity. Academy of Management Executive 11, 21-31.

Sackett, P.R., & DuBois, C. L.Z. (1991). Rater-ratee race effects on performance evaluation: Challenging meta-analytic conclusions. Journal of Applied Psychology 76, 873-877.

Sackett, P.R., & Ellingson, J.E. (1997). The effects of forming multi-predictor composites on group differences and adverse impact. Personnel Psychology, 50, 707-721.

Sackett, P.R., Schmitt, N., Ellingson, J.E., & Kabin, M.B. (2001). High-stakes testing in employment, credentialing, and higher education: Prospects in a post-affirmative-action world. American Psychologist 56, 302-318.

Schmidt, F.L., Greenthal, A.L., Berner, J.G., Hunter, J.E., & Seaton, F.W. (1977). Job sample vs. paper-and-pencil trades and technical tests: Adverse impact and examinee attitudes. Personnel Psychology 30, 187-197.

Schmidt, F.L., Pearlman, K., Hunter, J.E., & Hirsh, H.R. (1985). Forty questions about validity generalization and meta-analysis. Personnel Psychology 38, 697-777.

Society for Industrial and Organizational Psychology, Inc. (1987). Principles for the Validation and Use of Personnel Selection Procedures (3rd edition). College Park, MD.

Supreme Court of the United States (1971). Griggs v. Duke Power Company Decision. Washington, DC.

Thorndike, R.L. (1971). Concepts of culture fairness. Journal of Educational Measurement 8, 63-70.

This page was last modified on May 16, 2007.
