U.S. Equal Employment Opportunity Commission
Meeting of July 1, 2015 - EEOC at 50: Progress and Continuing Challenges in Eradicating Employment Discrimination
Madam Chair and Commissioners, thank you for the opportunity to talk with you today about the rapidly growing role of so-called "big data" in employment decisions. We are particularly grateful that the Commission has allowed us to provide joint testimony and would like to introduce ourselves in turn.
Our names are Solon Barocas and Andrew Selbst. Dr. Barocas is currently a Postdoctoral Research Associate at the Center for Information Technology Policy, a research center that is a joint venture between the computer science department and the public policy school at Princeton University. He completed his doctorate in the department of Media, Culture, and Communication at New York University, during which time he was also a Student Fellow at the Information Law Institute. Mr. Selbst is currently a clerk on the United States Court of Appeals for the Third Circuit, and has previously clerked for the Honorable Dolly M. Gee of the Central District of California, worked at Public Citizen, and served as a Privacy Research Fellow at the Information Law Institute. Before law school he worked as an electrical engineer, designing circuits.
Dr. Barocas studies emerging computational techniques for data analysis, particularly those based on machine learning, and their implications for privacy, fairness, and other values. In his ongoing research, he grapples with the theoretical underpinnings of data mining and engages with data miners directly to pinpoint policy-relevant problems with the data mining process itself. Mr. Selbst's research interests focus on the relationship between technology and civil rights and liberties issues such as privacy, discrimination, speech, and media.
We are authors of a forthcoming paper in California Law Review that explains why data-driven decision-making can discriminate unintentionally and how such cases would be understood under Title VII.
What is Big Data?
"Big Data" first gained purchase as a term of art among practitioners who found that they were pushing up against the limits of the traditional ways of storing, handling, and processing data. In particular, the phrase offered a shorthand way to refer to the infrastructural and computational challenges posed by datasets of unprecedented scale, and the need to develop or adopt new approaches for storage, retrieval, and analysis. The most commonly cited definition of big data-characterized as datasets of enormous volume, assembled at great velocity, with information in a variety of formats ("The 3 Vs")-retains this focus on datasets that overwhelm existing technologies.
In practice, however, the term has taken on a more expansive meaning. Big data is now commonly invoked when describing a wide variety of data practices, including those that would not meet the formal technical definition described above. However unwieldy the term has become, it points to two trends that warrant attention. First, the term refers to the routine data collection that is a byproduct of the digital mediation now core to many aspects of everyday life. Data is big in this case because it is an increasingly comprehensive record of our goings-on. Second, the term also seems to apply when the datasets lend themselves to useful analysis, whether computer-aided or automated. Increasingly, this is the distinguishing feature of big data: the ability to detect useful patterns in datasets that can inform or automate future decision-making. Understood in this way, data is big when it can function as the grist for the analytics mill. For the purposes of this testimony, we focus on the "mining" of big data.
Why are employers turning to Big Data?
Many of today's employers turn to data mining to develop ways of improving and automating their search for good employees.1 Understandably, employers are drawn to data mining because it can vastly improve their ability to recognize "good" candidates by instructing them to search for those characteristics that data mining identified as the distinguishing features of prior employees that excelled on the job. Data mining also has the potential to help reduce discrimination by forcing decisions onto a more reliable empirical foundation and by formalizing decision-making processes, thus limiting the opportunity for individual bias to influence individual assessments. In the testimony that follows, we explain the many reasons why data mining may not always deliver on this promise.
The Persistence of Discrimination
While discrimination certainly endures in part due to decision-makers' irrational prejudice, a great deal of modern-day inequality can be attributed to what sociologists call "institutional" discrimination. Much of the disparate impact observed is not the result of intentional choices, but rather unconscious, implicit biases, entrenched institutional routines, or the lasting economic effects of prior discrimination. Many are rightfully hopeful that data mining will help to identify and address not only intentional discrimination, but these more persistent and no less pernicious forms of unintentional discrimination. As we discuss today, however, data mining also has the potential to fall victim to many of the same dynamics behind institutional discrimination, and to result in a disparate impact on protected classes.
There are good reasons to see big data as a potential boon to civil rights, and we commend employers who look to data mining as a tool to combat discrimination. It is dangerously naïve, however, to believe that big data will automatically eliminate human bias from the decision-making process.
Stated simply, a data-driven decision-making procedure ("an algorithm") is only as good as the data upon which it has been developed. Data is frequently imperfect in ways that allow these algorithms to inherit the prejudices of prior decision-makers. In other cases, they may simply reflect the widespread biases that persist in society at large. In still others, data mining can discover surprisingly useful regularities that are really just preexisting patterns of exclusion and inequality. Adopted or applied without care, data mining can deny historically disadvantaged and vulnerable groups full participation in society. Worse still, because the resulting discrimination is almost always an emergent property of the data mining process rather than a conscious choice by its programmers, it can be unusually hard to identify the source of the problem, to explain it to a court, or to remedy it technically or through legal action.
In our paper, we introduce the computer science literature on data mining and walk through the data mining process step-by-step. Our testimony will describe the potential problems at each step that can result in discrimination.
Our paper also includes a discussion of how data mining can be used to discriminate intentionally in ways that would be difficult or impossible to detect, which we will discuss only briefly today.
The problem with defining the problem
Solving problems with data is far less straightforward than one might imagine. While employers generally turn to data mining to try to find "good" candidates, the "best" applicants, or "standout" employees, these qualities are not self-evident, and data mining requires much more formal and precise definitions of those terms. Data miners must translate a rather amorphous demand for a tool that will find "good" candidates into a question about some very specific and measurable property. Only once this property has been defined can data mining go about its work of finding correlated factors that will allow a computer to infer whether a candidate, applicant, or current employee possesses this quality. Data mining does not offer objective assessments of "good"; it provides a way to estimate what has been subjectively defined as "good."
In most cases, there will be many possible ways to define "good," about which reasonable people will disagree. Crucially, each of these definitions can have a potentially greater or lesser impact on protected classes. Sometimes those impacts will be immediately obvious, but often they will not. To observe that these judgments in the definition of "good" are by nature subjective does not, however, suggest that designers cannot take a predicted disparate impact into account. To the contrary, it means that such predictions must be taken into account from the very outset to be at all avoidable.2
The problem with learning by example
Data mining is a process of learning by example. To teach a computer to recognize "good" job applicants, data miners expose the computer to many examples of such employees. And rather than telling the computer in advance which details from the employee records matter, data miners instruct the computer to figure out which details from these records best distinguish the "good" from the not "good." The computer draws a general lesson about the characteristic features of "good" employees from many specific examples in order to apply that lesson to future applicants.
Learning by example is vulnerable to prejudice and bias because the examples themselves may be tainted by prejudice or bias. This can happen in two ways, both of which can be very easy to overlook. First, prejudice and bias may have affected the prior assessments that data miners feed the computer as examples. In these cases, the computer can learn from the bad example set by these prior decisions. For example, the computer may learn to discriminate against certain female or black applicants if trained on prior hiring decisions in which an employer has consistently rejected jobseekers with degrees from women's or historically black colleges. Similarly, where employers have discounted, undervalued, or penalized employees who belong to a protected class, the resulting records will teach the computer to believe that employees with these and related characteristics perform less well than their peers. On its own, a computer cannot know that the employer has assessed its disabled workers less favorably when performing as well as others. The computer necessarily takes its training examples as ground truth. In so doing, it can learn to replicate the same prejudice or bias that data-driven decision-making is meant to stamp out.3
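The first of these dynamics can be sketched in a few lines of code. The following is a minimal illustration only: the records, the field names, and the 0.5 threshold are all hypothetical, and the "model" is a deliberately simple frequency count rather than any particular vendor's method. Every applicant in the example has the same performance score, yet the learned rule reproduces the prior decision-maker's rejections.

```python
from collections import defaultdict

# Hypothetical historical records: (college_type, performance_score, was_hired).
# All applicants are equally qualified, but the prior decision-maker rejected
# everyone who attended a women's college.
history = (
    [("coed", 8, True)] * 6
    + [("womens_college", 8, False)] * 4
)


def learn_hire_rates(records):
    """Estimate the share of past applicants hired for each college type."""
    hired = defaultdict(int)
    total = defaultdict(int)
    for college, _score, was_hired in records:
        total[college] += 1
        hired[college] += int(was_hired)
    return {college: hired[college] / total[college] for college in total}


def recommend(model, college, threshold=0.5):
    """Recommend an applicant when the learned hire rate meets the threshold."""
    return model.get(college, 0.0) >= threshold


model = learn_hire_rates(history)
print(model)
print(recommend(model, "coed"), recommend(model, "womens_college"))
```

Nothing in the procedure refers to sex, and nothing in it can tell that the past rejections were unjustified; it simply treats the prior decisions as ground truth.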
Second, the set of examples that employers use to teach a computer to assess applicants may be skewed. In the language of statistics, the dataset might suffer from a sampling bias, offering a disproportionate representation of the constituent parts of the overall population. Contrary to its boosters' claims, big data is rarely exhaustive.4 In fact, the quality and representativeness of records might vary in ways that correlate with class membership. Most obviously, an employer's records might contain very few examples of members of protected groups in certain roles because the employer has discriminated against applicants with these traits in the hiring process. Likewise, certain communities may have been perceived as a less important source of potential employees and therefore less worthy of attention in recruitment efforts. Certain parts of the population might also have faced greater hurdles in accessing and making use of the technologies that generate the records that facilitate recruitment and hiring. It is therefore easy to imagine that an employer's historical records might contain disproportionately low numbers of members of protected classes relative to the number of qualified candidates in the overall labor market. This under-counting will compromise the validity of the lessons that data mining is able to draw from these examples. This is especially worrisome in situations where the traits of an already underrepresented minority that best predict success on the job do not overlap with those of the overrepresented majority. For example, if the disproportionately small number of older employees who excel at some job do not share the same credentials or background as the other and far more numerous employees who also perform well in these roles, data mining is unlikely to learn to value the qualities associated with the productive older employees as much as the distinguishing qualities of the more common, younger employees.
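The older-employee example can be made concrete with a small sketch. The data and feature names below are invented for illustration, and the "learner" is assumed to be the simplest possible one: it picks the single feature whose presence best matches the "good" label across all records. Because younger employees dominate the records, the feature that predicts their success wins, and the rule is markedly less accurate for the older group.

```python
# Hypothetical records: (group, has_credential, has_experience, is_good).
# Among younger employees a credential marks the good performers; among the
# few older employees, experience does.
records = (
    [("younger", 1, 0, True)] * 8
    + [("younger", 0, 0, False)] * 8
    + [("older", 0, 1, True)] * 2
    + [("older", 0, 0, False)] * 2
)

# Position of each candidate feature within a record.
FEATURES = {"has_credential": 1, "has_experience": 2}


def accuracy(rows, index):
    """Share of rows where 'feature present' agrees with the 'good' label."""
    correct = sum(1 for row in rows if bool(row[index]) == row[3])
    return correct / len(rows)


# Pick the single feature that is most accurate over the whole dataset.
best_feature = max(FEATURES, key=lambda name: accuracy(records, FEATURES[name]))

# Measure how well that one rule serves each group separately.
by_group = {
    group: accuracy([r for r in records if r[0] == group], FEATURES[best_feature])
    for group in ("younger", "older")
}
print(best_feature, by_group)
```

Real systems weigh many features at once, but the underlying pressure is the same: the lessons drawn from the data are dominated by whichever group supplies most of the examples.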
The problem with granularity
Each example of a "good" employee includes a corresponding string of features: all the details that the company has amassed about the employee. Like humans, computers will never have access to all the factors that account for variation in job performance, but data mining is expressly designed to find those features among the set under consideration that best distinguish "good" employees from others. Unfortunately, the level of detail that is necessary to reliably classify the majority of employees may be insufficient when dealing with members of historically disadvantaged groups.
To take an obvious example, hiring decisions that consider credentials tend to assign enormous weight to the reputation of the educational institutions from which an applicant has graduated, despite the fact that such credentials may communicate very little about the applicant's job-related skills. Consequently, employers routinely overlook those who possess the desired competencies because they seem to lack the putatively necessary credentials. Worse, because members of certain protected classes tend to graduate from these schools at lower rates than others (not only in absolute terms, but relative to their size in the overall population), employers are even less likely to find qualified candidates among these groups. At this level of detail, members of protected classes will be subject to systematically less accurate assessment; the variation in competency that matters to the employer will only become obvious at a lower level of granularity.
If this problem stems from a lack of information, then why would big data not be a potential solution? A number of companies have already discovered that novel sources of publicly accessible data can furnish the kinds of details necessary to discover skilled employees in traditionally overlooked populations.5 These companies' tools are attractive to employers who struggle to find talent in especially competitive or tight labor markets because they can help to uncover qualified candidates that others might miss. Unsurprisingly, because members of protected classes are more likely to have nontraditional backgrounds, these same tools have helped to increase the diversity of employers' workforces.
Unfortunately, even if employers have rational incentive to look beyond credentials and focus on criteria that allow for more precise assessment, they may continue to favor credentials because they communicate pertinent information at no cost to the employer. Indeed, members of protected classes may find that they are incorrectly passed over more frequently than others because the level of detail necessary to achieve equally accurate determinations is too costly for most employers to pursue. While big data may bode well here, the question remains whether the relatively higher costs involved in gaining more data about marginalized groups justify subjecting them to less effective assessment.
The problem with proxies
Even successful applications of data mining can be problematic if the factors discovered to be most predictive of future job performance also happen to be highly correlated with membership in a protected class.
Take, for example, the widely reported tools developed by Evolv (now OnDemand).6 Catering to employers who must deal with rapid employee turnover, Evolv developed a tool to estimate attrition by looking at a wide variety of measured employee activities and personal details. By its own admission, Evolv discovered that one of the best predictors of employee turnover happens to be the distance between the employee's residence and workplace. Evolv was well aware that distance from work could be highly correlated with membership in a protected class, given racial and ethnic residential segregation, and therefore removed this factor from consideration. Other companies have not held back from considering this information for the very same purposes.7
As we conclude in our paper, "[s]ituations of this sort can be quite vexing because there is no obvious way to determine how correlated a relevant attribute must be with class membership to be worrisome, nor is there a self-evident way to determine when an attribute is sufficiently relevant to justify its consideration, despite the fact that it is highly correlated with class membership."8
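One simple way to begin examining a candidate proxy is to measure how strongly it is correlated with class membership. The sketch below uses invented commuting distances and a hand-written Pearson correlation; the variable names and figures are hypothetical and are not drawn from Evolv or any other company. Consistent with the passage quoted above, the code reports a number but deliberately sets no threshold, because there is no obvious value at which the correlation becomes worrisome.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical employees: 1 marks membership in a protected class.
in_protected_class = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical distance from home to workplace, in miles.
commute_miles = [18, 22, 15, 25, 5, 8, 12, 6]

r = pearson(in_protected_class, commute_miles)
print(round(r, 2))
```

A high value signals that a decision turning on commuting distance will fall unevenly across groups; whether the factor is nonetheless relevant enough to justify its use is a legal and policy judgment that the calculation cannot make.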
Masking problematic behavior
We have, thus far, described a number of ways that data mining can give rise to discrimination in the absence of malicious intent. Data mining could, of course, also breathe new life into traditional forms of intentional discrimination. Most obviously, employers could rely on data mining to purposefully uncover reliable but nonobvious proxies for legally proscribed features and then set in place hiring decisions that turn on these seemingly innocuous factors, effectively masking their prejudicial intent. Such scenarios have been the main preoccupation of those expressing fears about big data and discrimination.9 While these fears are technically well founded, they strike us as less in need of attention than unintentional discrimination. Most cases of employment discrimination are already sufficiently difficult to prove that employers motivated by conscious prejudice would have little to gain by pursuing these complex and costly mechanisms to further mask their intentions. When it comes to data mining, unintentional discrimination is the more pressing concern because it is likely to be far more common and easier to overlook.
Because we focus here on unintentional discrimination, our discussion of Title VII is limited to disparate impact. As you know, a disparate impact case breaks down into a prima facie showing of disproportionate impact, followed by an analysis of the business necessity defense put forth by an employer, and finally, a plaintiff's rebuttal showing a less discriminatory alternative employment practice. Assuming the dataset used by the employer results in the discriminatory effects that we suggested earlier, we are left with the latter two considerations, and we sketch that analysis here.
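For readers less familiar with the first step, the comparison behind a showing of disproportionate impact can be sketched as a ratio of selection rates. The applicant counts below are hypothetical, and the 0.8 cutoff reflects the "four-fifths" rule of thumb in the Uniform Guidelines, which this testimony does not otherwise rely on; it is used here only as an assumed screening heuristic, not as the legal test a court would apply.

```python
def selection_rate(selected, applicants):
    """Fraction of a group's applicants who were selected."""
    return selected / applicants


def impact_ratio(rate_group, rate_reference):
    """Selection rate of one group relative to the most-selected group."""
    return rate_group / rate_reference


# Hypothetical outcomes of an algorithmic screen for two groups of applicants.
rate_reference = selection_rate(selected=48, applicants=80)
rate_group = selection_rate(selected=12, applicants=40)

ratio = impact_ratio(rate_group, rate_reference)

# Assumed rule-of-thumb cutoff (four-fifths); a flag here is a prompt for
# closer review, not a legal conclusion.
FOUR_FIFTHS = 0.8
flagged = ratio < FOUR_FIFTHS
print(ratio, flagged)
```

The calculation looks only at outcomes, which is why it applies in the same way whether the selections were made by a manager or by a model.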
Let us consider how data mining works within this framework. For now, assume a court does not apply a strict "necessity" standard, but uses a lower "job-related" standard. The first issue to consider, then, is whether the sought after trait is job-related, regardless of the fact that data is used to predict that trait. If the trait is not sufficiently job-related, a business necessity defense will fail, regardless of the fact that the decision was made by algorithm. Thus, disparate impact liability can be found for improper care in problem specification. For example, in a job for which criminal background is not relevant, it would be difficult for an employer to justify an adverse determination triggered by the appearance of an advertisement suggesting a criminal record alongside the results of a Google search for a candidate's name.10
Once we pass that initial inquiry, though, we must examine the doctrine more closely. As you know, Title VII requires that an employment practice with a disproportionate impact be "job related for the position in question and consistent with business necessity."11 Thus the next questions are whether the model is actually predictive of the traits in question, and whether it is accurate enough. Courts are all over the map on how strictly to consider the "necessity" portion of that test, with some courts requiring a "manifest relationship"12 between the trait and job in question or that the trait be "significantly correlated" to job performance.13 The upshot is that if a test is predictive, it need not be perfect.
Under the Uniform Guidelines on Employment Selection Procedures, the use of data mining to screen job applications or promotion candidates is a selection procedure that must be validated in order to be considered job related.14 Because data mining does not directly test skills related to the job, the resulting model must demonstrate either criterion-related validity, if the data is used to predict job performance directly, or construct validity, if used to predict a human trait, such as honesty. According to the Guidelines definitions, criterion-related validity "consist[s] of empirical data demonstrating that the selection procedure is predictive of or significantly correlated with important elements of job performance,"15 and a user of a construct "should show by empirical evidence that the selection procedure is validly related to the construct and that the construct is validly related to the performance of critical or important work behavior(s)."16 Either way, the important part is that there be statistical significance showing that the result of the model correlates to the trait in question (which was already determined to be an important element of job performance).
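The kind of empirical showing described above can be illustrated with a small sketch that checks whether a model's scores are correlated with later job performance, and whether that correlation is distinguishable from chance. The scores and ratings are invented, and the permutation test is simply one assumed way to estimate significance with the Python standard library; the Guidelines set out their own, more detailed technical standards for validation studies.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Estimate how often shuffled data yields a correlation at least as strong."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (trials + 1)


# Hypothetical model scores at hiring and performance ratings a year later.
model_scores = [0.2, 0.35, 0.4, 0.5, 0.55, 0.6, 0.7, 0.8, 0.85, 0.9]
performance = [2.1, 2.8, 2.5, 3.4, 3.0, 3.9, 4.1, 4.0, 4.6, 4.8]

r = pearson(model_scores, performance)
p = permutation_p_value(model_scores, performance)
print(round(r, 2), p < 0.05)
```

A check of this kind says nothing about whether the performance ratings themselves are free of bias; it establishes only that the model predicts whatever the ratings measure.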
We can sort the problems identified earlier into essentially two groups: discrimination resulting from models that are "too good" and "not good enough." Proxy effects are the example of the first group, as the real problem with them is that they are too accurate. Though harmful in terms of foreclosed opportunities, what proxies demonstrate is the uneven real-world distribution of certain traits that the employer seeks. This is not a problem that disparate impact doctrine handles cleanly. If a model is accurate and predicated on legitimately job-related traits, there is good reason to believe it may be valid under the Guidelines. The entire purpose of data mining is to predict future outcomes based on statistical relationships. If an employer actually goes forward and uses the models as part of the selection procedure, it is because they were predictive of something. So the question comes down solely to whether the trait sought is important enough to job performance to justify its use in any context.17
Discrimination based on faulty or insufficiently accurate models, the second group, can theoretically be addressed by current doctrine more easily. If a model is faulty, the statistical relevance of its predictions is simply worse than it could be if it were not. Thus, under the Guidelines, if a model is bad enough, it will not show the necessary statistical relationship for validity. A more likely legal authority, though, is the alternative employment practice question. If it can be shown that an alternative, less discriminatory practice that accomplishes the same goals exists, and that the employer "refuses" to use it, the employer can be found liable.18 In this case, the obvious alternative employment practice would be to fix the problems with the models.
Practically, however, such a finding may be more difficult than it first appears. First, it is not always obvious in which ways a model is discriminatory or how badly biased it is. A model that takes biased data as its inputs does not have unbiased data to compare it to; standard tests of validity will report good performance. Nor is the under- and over-representation of protected classes in a dataset always clear. Where it is clear, the mechanism by which such misrepresentation occurs may also be difficult to detect and therefore remedy. If a court is able to see discriminatory results, but unable to figure out the responsible mechanism, it is difficult to imagine the court rejecting the initial showing of statistical relevance, even if it is not perfect, or finding that an employer refused to use an unidentifiable alternative employment practice. Second, an employer may, and often will, rely on a third party or packaged software and will thus not have the access necessary to improve the decision procedure. Even assuming the disparate impact mechanism is clear, it will be difficult to claim that a less discriminatory alternative exists and that the employer refuses to use it; the choice is to rely on outside experts or use an off-the-shelf product or to not engage in data analysis at all. Under current doctrine, only a stricter form of business necessity, under which a test must truly be a minimum requirement for a job to be valid, can ensure that these faulty models will be rejected. Of course, such a requirement will all but halt the use of data mining entirely.
Our testimony today should be taken as a call for well considered use of data mining, not its abandonment. That data mining is vulnerable to a number of problems that can render its results discriminatory is not a reason to discount its use altogether. This would be a perverse outcome, given how much big data can do to help reduce what remain outrageously high rates of discrimination against certain members of society in the workplace. Where there is uncertainty about the potentially discriminatory effects of data mining, employers will be reluctant to rely on it in their decision-making. Greater clarity is therefore needed about these risks to enable employers to make appropriate use of big data. We recommend that the EEOC cultivate the necessary expertise to offer this clarity. In so doing, we echo the White House's recommendation that the "Equal Employment Opportunity Commission […] expand [its] technical expertise to be able to identify practices and outcomes facilitated by big data analytics that have a discriminatory impact on protected classes".19
Education is equally important. Right now, these problems are relatively unknown. But the more employers and data miners understand the pitfalls, the more they will strive to create better models on their own. Many employers switch to data-driven practices for the express purpose of eradicating bias; if they discover that they are introducing new forms of bias, they can correct course. Companies that do so could also distinguish themselves from competitors by announcing that they've taken these additional remedial steps. Even employers seeking only to increase efficiency or profit may find that their incentives align with error correction. Faulty data and data mining will lead employers to overlook or otherwise discount what are actually "good" employees. In these cases, correcting for unintentional discrimination could have enormous benefits for historically disadvantaged populations while also increasing company profit. Where the cost of addressing these problems is at least compensated for by a benefit of equal or greater value, employers may have natural incentives to do so. This also suggests, however, that there may be limits to how far employers will go voluntarily. These cases may require more aggressive solutions or perhaps Congressional action.
A small but growing community of computer scientists has also begun to develop tools to facilitate the process of detecting and correcting for skewed or biased examples.20 Researchers have also developed ways to reason formally about fairness in cases where certain factors that are demonstrably relevant to some decision also happen to correlate with class membership, with the goal of not having to strike these factors from consideration while still withholding proscribed information. These efforts would benefit tremendously from the input of employers, regulators, and policymakers. Researchers are eager to develop a better grasp of the practical administration of discrimination law and to understand what kinds of tools various stakeholders would find most useful. Researchers have also struggled with limited access to real-world datasets and would benefit from collaborations that make such data available.
It is also important to realize that solutions exist entirely outside data mining itself. By its very nature, data mining takes the world as a given and attempts to sort employees as if their predicted performance is entirely independent of the conditions under which they will work. Limiting the issue of discrimination to fairness in the sorting or selection procedure obscures the fact that employers are in a position to alter some of the conditions that account for variation in employee success. As we argue in our paper, "[a] more family-friendly workplace, greater on-the-job training, or a workplace culture more welcoming to historically under-represented groups could affect the course of employees' tenure and their long-term success in ways that undermine the seemingly prophetic nature of data mining's predictions."21 These are all traditional goals for reducing discrimination within the workplace, and they continue to matter even in the face of the eventual widespread adoption of data mining.
1 Claire Cain Miller, Can an Algorithm Hire Better Than a Human? N.Y. Times (Jun. 25, 2015), available at http://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html
2 And because there is a potentially infinite number of ways of making sound hiring decisions, data miners can experiment with multiple definitions that each seem to serve the same goal, even if these fall short of what they themselves consider ideal.
3 One more thing to consider is that workplaces that have been unwelcoming or outright hostile to protected classes, but subject members of these groups to the same formal assessments, may not generate records that wrongly understate these employees' contributions. They may nonetheless create biased data because the employees' performance itself has been affected by the hostile work environment.
6 How might your choice of browser affect your job prospects? The Economist (Apr. 10, 2013), available at http://www.economist.com/blogs/economist-explains/2013/04/economist-explains-how-browser-affects-job-prospects; Don Peck, They're Watching You at Work, The Atlantic (Nov. 20, 2013), available at http://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/; Will A Computer Decide Whether You Get Your Next Job? Planet Money (Jan. 15, 2014), available at http://www.npr.org/sections/money/2014/01/15/262789258/episode-509-will-a-computer-decide-whether-you-get-your-next-job
7 Joseph Walker, Meet the New Boss: Big Data, Wall Street Journal (Sep. 20, 2012), available at http://www.wsj.com/news/articles/SB10000872396390443890304578006252019616768
9 Alistair Croll, Big Data Is Our Generation's Civil Rights Issue, and We Don't Know It, Solve for Interesting (July 31, 2012) at http://solveforinteresting.com/big-data-is-our-generations-civil-rights-issue-and-we-dont-know-it/.
13 Gulino v. New York State Educ. Dep't, 460 F.3d 361, 383 (2d Cir. 2006)("significantly correlated with important elements of work behavior which comprise or are relevant to the job or jobs for which candidates are being evaluated" (quotation omitted)).
17 Assuming validation remains a hurdle, there is a potential for additional challenges in enforcement. Some courts ignore the Guidelines' recommendation that an unvalidated procedure be rejected, preferring to rely on "common sense" or finding a "manifest relationship" between the criteria and successful job performance. This tendency might actually work in favor of enforcement, however, given the unintuitive nature of the correlations on which data mining often relies.
19 Executive Office of the President, Big Data: Seizing Opportunities, Preserving Values (2014), at http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_5.1.14_final_print.pdf
20 We refer to the work of "Fairness, Accountability, and Transparency in Machine Learning," an ongoing series of workshops organized by Solon Barocas and a group of computer scientists, which bring together researchers working on these problems. http://www.fatml.org/