Bad Copyright Laws Are Creating Junky, Biased AI

Machine learning systems need lots of data to overcome bias — but copyright limits their menu

Photo Illustration: Diana Quach
Apr 05, 2017 at 6:18 PM ET

One of the best ways for artificial intelligence to learn is to be fed a steady diet of data — whether it’s pictures of faces so it can learn to detect them, or voices so it can recognize speech.

But as the saying goes, you are what you eat. And since copyright restrictions often limit access to high-quality, easily obtained data like vast portrait databases and audiobooks, machine learning systems are often hampered by their reliance on widely available but less healthy options.

That’s the finding of a new draft paper presented at the recent We Robot conference in New Haven, Connecticut. It identified intellectual property hangups as a major hitch for machine learning.

“To avoid legal liability for copying works to use as training data, researchers and companies generally have two options: License an existing database of copyrighted works from a third party or create a database of works they own,” writes Amanda Levendowski, a teaching fellow at NYU’s Technology Law & Policy Clinic and the paper’s author. “When designing facial recognition software, for example, it would be prohibitively expensive and time consuming to negotiate [a] license with a company like Getty Images, the world’s largest repository of photographs, or build a platform like Facebook or Instagram, to which users regularly upload photographs.”

This pushes small companies and AI researchers to avoid copyrighted material altogether and flock instead to what Levendowski calls “low-friction data,” or training datasets that are readily available for public use. One example is the Enron email dataset, a popular repository of email communications from the energy conglomerate infamously brought down for mass fraud and corruption in the early 2000s. There’s also ImageNet, a database of 1.2 million images commonly used to train object recognition algorithms.

There’s just one problem: publicly available datasets typically aren’t tailored to the AI designer’s goals, and many have been shown to introduce bias because of the types of data they contain.

Researchers have repeatedly shown that algorithms trained on skewed and limited datasets can have disastrous impacts, especially when their decisions are used to determine things like which applicants get offered a job or receive a loan, or whether criminal defendants are released on bail.

Decision-making systems often treat certain groups of people unfairly because their training data simply reflects past injustices, like an algorithm that predicts criminals based entirely on who the criminal justice system has targeted in the past. Other times, the dataset under-represents certain types of examples. A lack of racial diversity in publicly available face databases, for example, has caused many commercial face recognition systems to be less accurate in identifying black people.

“Say you’re trying to pick a Supreme Court nominee. If you show [the algorithm] a bunch of pictures of past Supreme Court justices, it’s probably going to spit out a white man from Harvard or Yale,” Levendowski said.

The solution, she argued, is to make this biased, low-friction data less important by using copyrighted materials to fill in the gaps where datasets are skewed or incomplete, such as when a training set for face recognition doesn’t represent African-American women.

Ultimately, that means establishing in the law that using copyrighted works to train machine learning algorithms is a non-infringing “fair use” of those works. Fair use is a doctrine of U.S. copyright law that allows use of a copyrighted work for purposes like scholarship or parody.

“In the article, I argue that transforming copyrighted works into training data for AI is a fair use. Because using copyrighted works can counteract some of the perverse effects of privileging demonstrably biased low-friction data, I also conclude that it’s a use that can quite literally promote fairness,” Levendowski told Vocativ.

Peter Eckersley, the chief computer scientist for the Electronic Frontier Foundation, agrees there’s a clear case for applying the fair use rule to machine learning data. One possible avenue would be to introduce an exemption for machine learning development to the Digital Millennium Copyright Act, a U.S. law that dictates the boundaries of copyright infringement online. But such exemptions need to be renewed every three years by the Librarian of Congress, meaning that without a court ruling or change in the law, it would only be a temporary fix.

“The basic structure of copyright law never contemplated computers, let alone the idea that computers might need to learn from the world,” Eckersley told Vocativ. “We need a clear doctrine holding that collecting and using training data is fair use for the same reason that humans are always permitted to learn from and build upon the books we read.”