Web-scraping and copyright law

An overview of one legal regime being asserted against scrapers with increasing frequency

Oct 19, 2023

Countless companies and investment firms rely on web-scraping—that is, the automated collection of data from the internet—to power their offerings and inform decision-making. While data scraping is becoming more prevalent, websites and individuals whose data is scraped are asserting a variety of legal claims against data scrapers. In this article, we provide an overview of one legal regime being asserted against scrapers with increasing frequency—copyright law.

Why copyright law is relevant to web-scraping

Copyright is a form of intellectual property that gives its owners the exclusive right to copy, distribute, display, adapt or perform their original works of authorship. Copyright law follows an idea/expression dichotomy in which copyright protects the particular expression of a work, such as the words in a book, notes in a piece of music or lines of computer code, but does not protect the ideas, concepts, principles or themes that may be present in a work. Copyright comes into play in the scraping context if a data vendor scrapes copyrighted content without its owner’s authorization.

One question that often arises when a website or individual accuses a data provider of scraping copyrightable works is: “who owns the copyright?”

For example, in Facebook v. Power Ventures, the Northern District of California held that Facebook stated a claim for copyright infringement against Power Ventures for allegedly scraping Facebook web page materials (including graphics and video files) over which Facebook held copyright ownership. Power Ventures correctly noted that Facebook did not have copyright ownership over user content, and argued that it was only seeking to extract user content.

However, to collect user content, Power Ventures needed to make a copy of a user’s entire Facebook profile page, and Facebook claimed copyright ownership of the page as a whole. Denying a motion to dismiss, the court held that scraping entire Facebook pages in this way might infringe Facebook’s copyrights.

In addition to “ordinary” copyright infringement claims, data hosts have also asserted claims against web scrapers for violations of the Digital Millennium Copyright Act (“DMCA”). Congress passed the DMCA in 1998 to govern the relationship between copyright law and a fledgling internet. Although the DMCA is better known for its “safe harbors,” it also prohibits, as relevant to web-scraping, the circumvention of technological measures that control access to copyrighted works on websites, such as paywalls, requirements to enter login credentials, CAPTCHA and limitations on access rates.

In at least one case, the court found that a plaintiff had alleged a plausible (though ultimately unsuccessful) DMCA claim against a data scraper who allegedly circumvented the website’s firewall in order to scrape copyrighted information.

The DMCA also prohibits the intentional removal or alteration of copyright management information (“CMI”), that is, information conveyed in connection with a copyrighted work that informs the public that the work is copyrighted. Examples of CMI include the title of a work, the names of the creator and copyright owner, the copyright registration number, and the terms and conditions for use of the work. If content is scraped and posted without the accompanying CMI, a website host may assert a claim under this provision of the DMCA. Such claims have been rare historically, but are alleged in several cases addressed below.
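For readers who build scraping pipelines, the CMI examples listed above can be pictured as fields that stay attached to a scraped record rather than being discarded. The following is a minimal Python sketch under stated assumptions: the class names, field names and the render_with_cmi helper are all hypothetical, chosen only for illustration, and the sketch shows record-keeping only, not a determination of what the DMCA requires in any particular case.

```python
from dataclasses import astuple, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CopyrightManagementInfo:
    """Hypothetical container for the CMI examples named in the text.

    Every field is optional because a given page may expose only some of them.
    """
    title: Optional[str] = None
    creator: Optional[str] = None
    copyright_owner: Optional[str] = None
    registration_number: Optional[str] = None
    terms_of_use: Optional[str] = None

    def is_empty(self) -> bool:
        # True when no CMI field was captured for this work.
        return all(value is None for value in astuple(self))


@dataclass(frozen=True)
class ScrapedRecord:
    """Hypothetical scraped item that carries its CMI alongside the content."""
    source_url: str
    content: str
    cmi: CopyrightManagementInfo = field(default_factory=CopyrightManagementInfo)


# Display labels for each CMI field (illustrative wording only).
_LABELS = {
    "title": "Title",
    "creator": "Creator",
    "copyright_owner": "Copyright owner",
    "registration_number": "Registration number",
    "terms_of_use": "Terms of use",
}


def render_with_cmi(record: ScrapedRecord) -> str:
    """Return the content followed by every CMI field that was captured."""
    lines = [record.content, ""]
    for name, label in _LABELS.items():
        value = getattr(record.cmi, name)
        if value is not None:
            lines.append(f"{label}: {value}")
    lines.append(f"Source: {record.source_url}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = ScrapedRecord(
        source_url="https://example.com/article",
        content="Body text of the article.",
        cmi=CopyrightManagementInfo(title="Example Article", creator="A. Author"),
    )
    print(render_with_cmi(example))
```

Keeping the CMI in the same record as the content means any downstream step that republishes or forwards the content has the notice information available to it; whether and how that information must be displayed remains a legal question outside the scope of the sketch.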

Generative AI and fair use

A new wave of plaintiffs has filed cases against generative AI developers, claiming that the developers infringe copyright when they use a website’s content to train their models.

For example, in July 2023, comedian Sarah Silverman filed a complaint on behalf of a class of plaintiffs against ChatGPT creator OpenAI, alleging that the company infringed the copyright of books including her memoir “The Bedwetter” by scraping them from unauthorized online libraries and then using them to train the tool.

In these cases, a central issue will be whether scraping web content to train generative AI qualifies for the “fair use” exception to copyright infringement, an issue that no court has yet ruled on directly. The fair use defense, which involves a fact-intensive inquiry, permits use of copyrighted works without permission under certain circumstances.

In Associated Press v. Meltwater, a federal court held that the defendant’s publication of whole passages from the plaintiff’s copyright-protected news articles on the defendant’s website created a competing substitute for the original articles and thus was not fair use. By contrast, in Authors Guild v. Google, the Second Circuit held that Google Books’ snippet-producing function was fair use because it provided minimal excerpts of books and thus did not create a competing substitute for the original works.

Conclusion

Because any analysis of exposure for copyright infringement is necessarily case-specific, particularly when the fair use defense is involved, data vendors should consider seeking legal advice with respect to their specific scraping practices. Taking a thoughtful approach to copyrighted materials may help web scrapers avoid liability while continuing to provide invaluable data to their clients and businesses.

About the authors

Renita Sharma, Hope Skibitsky, and Jonathan Abrams are experienced litigators in Quinn Emanuel’s New York office. They have extensive experience representing clients in the data scraping space, including on issues relating to the intersection of data scraping/aggregation and the Computer Fraud and Abuse Act, violations of terms of service and related torts. They also frequently advise clients on ways to mitigate their legal risk profile with respect to data collection and usage.

1 Facebook, Inc. v. Power Ventures, Inc., No. C 08-5780 JF (RS), 2009 WL 1299698, at *1 (N.D. Cal. May 11, 2009).

2 DHI Grp., Inc. v. Kent, No. CV H-16-1670, 2017 WL 8794877, at *6 (S.D. Tex. Apr. 21, 2017), report and recommendation adopted, No. CV H-16-1670, 2017 WL 4837730 (S.D. Tex. Oct. 26, 2017). In 2021, a jury returned a verdict in favor of the defendant.

3 Silverman et al. v. OpenAI, Inc. et al., No. 4:23-CV-03416 (N.D. Cal.).

4 Associated Press v. Meltwater U.S. Holdings, Inc., 931 F. Supp. 2d 537 (S.D.N.Y. 2013). The court granted summary judgment for the plaintiff.

5 Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015). The court affirmed the district court’s grant of summary judgment in Google’s favor.
