Artificial intelligence has a copyright problem, and this problem is deeply related to questions of ethics and justice. Increasingly, AI is adopted by our banks and our bosses, by our cars and our courts. Across the board, implicit bias remains a significant and complex problem.
Several examples have become emblematic of the ways in which implicit bias can channel AI in a prejudiced direction. The Nikon camera that kept asking whether Taiwanese American blogger Joz Wang and her family members were “blinking” while they were taking photographs, for instance, or the time when Google Photos tagged two black friends as “gorillas.”
Or take the example of Google search results. In 2015, the first image of a woman that a Google Images search for “CEO” returned was of a Barbie. Now, while it’s true that there isn’t gender parity among CEOs, these Google Images results give the false impression that there are no real women who are also CEOs. In fact, there are real women who run—or previously ran—companies that perhaps you’ve heard of: Lockheed Martin, Hershey, General Motors, PepsiCo, Procter & Gamble, and IBM, just to name a few. So, where are those women?
As the 2016 Obama White House report on AI put it: “AI needs good data. If the data is incomplete or biased, AI can exacerbate problems of bias.” What computer scientists say is: “Garbage in, garbage out.”
That brings me to the legal question, because AI learns by reading, viewing, and listening to copies of human-generated works—many of which are protectable by copyright law. I am a copyright lawyer by training. To me, of course, part of AI’s implicit bias problem felt like a copyright problem. As I dug into this project, I actually realized that the rules of copyright law are playing an enormous hidden role in the development of AI: everything from limiting the number of companies that can afford to compete to build artificial intelligence systems to privileging certain works to train AI—on the grounds that those works are easily accessible and legally low-risk, even when they are also demonstrably biased.
There’s a unique part of copyright law called the fair use doctrine, which mediates among the normative values of competition, innovation, access, and fairness in the context of other kinds of innovative computational technologies. I wondered whether fair use was just as capable of addressing those concerns in the context of AI.
There is one particularly eloquent example that demonstrates how challenging this problem is and how copyright law can channel AI in a biased direction. That example is the Enron emails.
The Enron emails remain one of the largest publicly accessible data sets of real emails in the world. These were 1.6 million real emails that were sent between Enron employees before being publicly uploaded by the Federal Energy Regulatory Commission, in 2003. They’re also one of the most widely used sources of AI training data.
When companies choose training data for AI systems, they are often looking for data that’s easily accessible and, as a copyright matter, legally low-risk. Data sets featuring the Enron emails are freely available online in machine-readable formats. The chances of being sued by a former employee of Enron for copyright infringement are exceedingly remote.
You might think that the Enron emails are good for some applications. You can imagine why they might be an effective teaching tool for an AI system that identifies spam or automates how people organize their emails into folders. But it’s also worth remembering why we have these Enron emails in the first place; if you think there might be some significant biases embedded in the emails sent among the executives of an energy-trading company that collapsed under federal investigation for fraud—an infraction stemming from that company’s unethical culture—you would be right. The Enron emails are simply not representative of the US population—not geographically, not socioeconomically, and not in terms of race and gender. In fact, researchers have used the Enron emails specifically to analyze gender bias and power dynamics.
So, we have empirical data showing that the Enron emails reflect the implicit biases of the employees who sent them. And yet these emails remain a go-to data set for training AI systems because they’re easily accessible and legally low-risk, even though we know that AI systems learning from the Enron emails are picking up these implicit biases.
The issue of implicit bias in AI is certainly not new. I’m in the company of amazing computer scientists, social scientists, and legal scholars, including Batya Friedman, Helen Nissenbaum, Latanya Sweeney, Kate Crawford, Safiya Noble, and Danielle Citron, all of whom have examined the many sources and consequences of computational bias, from the unexamined assumptions of creators to technologically flawed algorithms. My work builds on that scholarship to talk about the most powerful law that’s contributing to this implicit bias problem: copyright law.
How does copyright law contribute to AI’s implicit bias problem? And, more importantly, can copyright play a role in fixing it?
Now, targeting implicit bias is only going to be the start of a conversation about justice and AI. Addressing issues of fairness or bias is the first step toward necessary conversations about what is ethical and what is just when it comes to developing, training, and releasing algorithms. Still, bias is a good place to start. And there’s one easy way to talk about the implicit bias problem that also offers insight into the copyright issue. That example is cats.
Imagine that these images of cats will be used as training data for an AI system designed to recognize cats. (Most commercial AI systems require thousands, if not hundreds of thousands, of images as training data.) Depending on what features an AI system fixates on, you’re going to get two very different and equally problematic results. This data set includes only tortoiseshell cats. In one instance, the AI system could fixate on the color of these cats—a mélange of orange, black, and white—and learn that this particular coloring is the mark of cat-ness. In that case, you might get a false positive result for a brindle dog with the same coloring. The other possibility is that the system learns that pointy ears, fluffy fur, and long tails are marks of cat-ness, making it likely to overlook outlier breeds like Scottish Folds, Devon Rexes, and Manx cats. The data set also only includes domestic cats; wild cats are completely outside of the equation.
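The two failure modes can be made concrete with a deliberately tiny sketch. Everything in it is hypothetical: the feature names, the numeric scores, and the "learn the range seen in training" rule are assumptions chosen for illustration, not a description of how any real image recognition system works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Animal:
    name: str
    is_cat: bool
    tortie_coloring: float  # hypothetical score, 0 (none) to 1 (full tortoiseshell)
    ear_pointiness: float   # hypothetical score, 0 (folded) to 1 (very pointy)


# Hypothetical training set: tortoiseshell cats with pointy ears, and nothing else.
TRAINING = [
    Animal("tortie_1", True, 0.90, 0.90),
    Animal("tortie_2", True, 0.80, 0.95),
    Animal("tortie_3", True, 0.95, 0.85),
]


def learn_range(feature: str) -> tuple[float, float]:
    """Return the min and max of one feature across the training cats."""
    values = [getattr(animal, feature) for animal in TRAINING]
    return min(values), max(values)


def predict_cat(animal: Animal, feature: str) -> bool:
    """Call something a cat if the chosen feature falls inside the trained range."""
    low, high = learn_range(feature)
    return low <= getattr(animal, feature) <= high


brindle_dog = Animal("brindle_dog", False, 0.85, 0.30)
scottish_fold = Animal("scottish_fold", True, 0.10, 0.10)

# Fixating on color: the brindle dog is wrongly accepted as a cat.
print(predict_cat(brindle_dog, "tortie_coloring"))
# Fixating on ear shape: the Scottish Fold is wrongly rejected.
print(predict_cat(scottish_fold, "ear_pointiness"))
```

Neither rule is wrong about the data it saw. Both are wrong about the world, because the training set never showed the system anything else.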
Maybe these types of errors aren’t a big deal if you’re just looking for some images to throw into your slide deck. But if you’re building a system that is trying to identify cats entering Washington Square Park or the National Mall, now you have a problem. And if we consider the New Zealand citizen who wasn’t able to renew his passport because an AI system identified his eyes as being closed in his photo, then it’s clear these errors are more than just technical goofs. They can have dangerous consequences.
This is a data problem, which is where copyright law comes in. Because even though the internet is full of cats, that doesn’t mean the images we looked at are free for anyone to use.
Lurking behind the mechanics of training AI is a much hairier matter—that of copyright law. Because most of the photographs of cats that might be used to train an AI system can be protected by copyright law, we have to think about the friction that comes with using those images as training data. I’ll demonstrate that with another cat.
Getty Images is one of the largest repositories of copyrighted images in the world. They make pricing out a license pretty easy. To use this stock image of a tortoiseshell cat costs $375. So, when training AI requires hundreds of thousands of images, we’re talking about investing millions and millions of dollars in instructional data alone. That sum doesn’t include the money invested in paying engineers to build the algorithm, marketers to get it out into the field, or managers who supervise the work.
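To see how quickly that adds up, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every image is licensed at the same flat $375 price with no volume discount or bulk agreement; the function name and the data set sizes are hypothetical.

```python
PRICE_PER_IMAGE_USD = 375  # the stock image price quoted above


def licensing_cost(num_images: int, price_per_image: int = PRICE_PER_IMAGE_USD) -> int:
    """Total cost in dollars, assuming a flat per-image price (an assumption)."""
    return num_images * price_per_image


# Hypothetical data set sizes, from "thousands" to "hundreds of thousands" of images.
for num_images in (1_000, 10_000, 100_000):
    print(f"{num_images:>7,} images -> ${licensing_cost(num_images):,}")
```

At 100,000 images the total under this assumption is $37,500,000, before a single engineer has been paid.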
In the article this essay is based on, I talk about the build-it and buy-it models that you can use to lawfully acquire data sets to train AI systems. The build-it model would be something like Facebook, where your status updates and selfies become training data for natural language processing and face recognition algorithms. Or you can adopt the buy-it model, like IBM, which partners with organizations like the Memorial Sloan Kettering Cancer Center to get access to data that can then be used to train its Watson for Oncology system.
Given the friction created by copyright law, it’s not surprising that many AI creators choose to use low-friction data, such as public domain works or Wikipedia articles. But some kinds of low-friction data can lead to copyright issues that widen out into complicated ethical questions.
Hacked emails, like those of the Democratic National Committee, have a lot in common with the Enron emails: they’re easily available in machine-readable formats, they’re perceived as legally low-risk, and they are being used to train AI systems. But using these emails to train AI highlights how the rules of copyright law that privilege biased, low-friction data as useful for training AI can also have implications for both ethics and privacy.
Tort law has long grappled with how to deal with private information made public without consent. In my previous research, I examined how copyright law struggles with distinguishing the public from the private in the context of nonconsensual intimate images, more commonly known as “revenge porn.” But popular sources of biased low-friction data, from hacked emails to publicly accessible profiles on dating websites, are further complicating this discussion. So, what’s the alternative?
To create fairer AI, you can invoke fair use, codified in the Copyright Act as a four-factor balancing test that permits the use of copyrighted works without permission from the copyright holder. Unfortunately, when it comes to asking whether using copyrighted works to train AI systems qualifies as fair use, we’re at something of a legal disadvantage. No court has yet weighed in on the issue. In fact, the last time the Supreme Court took up the question of fair use was in 1993–94, when 2 Live Crew was topping charts. In Campbell v. Acuff-Rose, the Court acknowledged that “Pretty Woman,” the group’s parody of a Roy Orbison classic hit, could provide “social benefit, by shedding light on an earlier work, and, in the process, creating a new one.” Sounds a little bit like artificial intelligence.
Fair use can feel complicated and fact-dependent, and that’s because it is. But, as research by Barton Beebe and Pamela Samuelson has shown, it’s not random. Fair use has long been employed to consider and promote the public benefit of innovative computational technologies. Fair use has given us digital plagiarism detectors, reverse engineered videogames, and Google Books. Taken together, the legal analysis that underpins these and other fair use cases suggests that using copyrighted text and images to train AI systems is likely to be a fair use under copyright law.
But, as Kate Crawford has observed, data will always bear the mark of its history. AI is biased because humans are biased and AI systems learn to be all too human from reading, viewing, and listening to our creative works.
I’ve outlined how copyright law has the power to channel AI in a fundamentally biased direction by privileging biased works as training data. But it also has a profound power to un-bias AI. The values embedded in the tradition of fair use—innovation, competition, and accessibility—ultimately align with the equally important value of fairness. Fair use could, in some circumstances, quite literally promote the creation of fairer AI—by offering a legal way to protect AI creators who are dedicated to challenging structures of power and inequality with their technology. Those AI creators could use a more inclusive range of books, photographs, films, and other works as training data for AI systems and, if challenged in court by copyright holders, defend using copyrighted works as training data by invoking the fair use doctrine.
But the conversation can’t really end there. Just because a use is legally permissible doesn’t mean that it’s advisable. Take IBM’s recent incorporation of Creative Commons–licensed images into an AI training data set. While the use did not pose a copyright issue, it nevertheless caused outrage among some of those whose pictures became part of IBM’s data set without their consent. As Ryan Merkley, the outgoing CEO of Creative Commons, noted in the aftermath, “Copyright is not a good tool to protect individual privacy, to address research ethics in AI development, or to regulate the use of surveillance tools employed online.” Distinguishing between what is legally permissible and what is ethically acceptable remains an urgent question that demands rigorous engagement and thoughtful reflection by courts, companies, and, hopefully, people like you.
For a more detailed exploration of the ideas in this article, see Amanda Levendowski, “How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem,” Washington Law Review, vol. 93, no. 2 (2018).
This article was commissioned by Mona Sloane.