Data scraping in 2024: What we are watching
We predict that 2024 will be a critical year for the data scraping industry.
Mar 4, 2024
While 2023 was largely focused on the exciting achievements of AI technologies, stakeholders are now tuned in to how these models work, including how they are trained. This article provides a brief overview of the regulatory and legal developments we will be following closely in 2024—developments that are likely to have meaningful impacts on the web extraction industry.
The European Union’s Monumental AI Act
On February 2, 2024, the Council of the European Union voted unanimously to confirm the final text of the EU’s AI Act. In relevant part, the AI Act imposes transparency obligations on companies operating general purpose AI (“GPAI”) models that are marketed in the EU and offers certain protections for copyrighted works used to train AI. For example, under the AI Act, GPAI providers must “make publicly available a sufficiently detailed summary about the content used for training of the [GPAI] model.” These summaries are to be provided regardless of whether the training data was copyrighted. The AI Act also solidifies protections for copyrighted works used for training purposes. For example, covered developers must honor opt-out requests made by rightsholders and, in some cases, may be required to obtain authorization from rightsholders to conduct text and data mining. The AI Act imposes additional obligations relating to training data for “high-risk” AI systems. Specifically, “[t]raining, validation and testing data sets [for high-risk AI systems] shall be subject to appropriate data governance and management practices,” including practices surrounding “data collection.”
The AI Act’s requirements will be rolled out gradually, and we will be watching how these requirements, and others like them, affect the way data vendors collect data that is used to train AI models.
Potential U.S. Regulatory Framework for AI
This year will feature ongoing deliberation by U.S. legislators on the contours of a federal AI regulatory framework. In September 2023, Senators Richard Blumenthal (D-CT) and Josh Hawley (R-MO) introduced a bipartisan framework for a comprehensive U.S. AI Act. The framework would establish a licensing regime for GPAI providers involving compulsory registration with an independent oversight body, and would require providers to disclose information about the training data, limitations, accuracy, and safety of their AI models. Other proposed legislation is likewise focused on training data. For example, in December 2023, the “AI Foundation Model Transparency Act” was introduced in Congress, with a focus on greater transparency into the data used to train AI models and on copyright holders’ interests.
In October 2023, President Biden issued a sweeping executive order intended to enhance oversight of AI development. Among other provisions, the executive order directs the Copyright Office, by July 26, 2024, to recommend “potential executive actions relating to copyright and AI … including … the treatment of copyrighted works in AI training.”
Thus, as in the EU, much of the U.S. AI-related legislative agenda centers on the data used to train these systems, and it could therefore affect the practices underlying the collection of that training data.
Litigation in the U.S.
Already in Q1 of 2024, we have seen significant rulings from U.S. courts that are likely to impact the data scraping industry, and we expect to see more rulings this year that could provide greater clarity on the legal regimes web scrapers continue to confront.
Last month, a court in the Northern District of California granted summary judgment in favor of data vendor Bright Data on Meta’s claim that Bright Data’s scraping of its data amounts to a breach of contract (i.e., a breach of Meta’s Terms of Service). The court rejected Meta’s breach of contract claim, holding that “the Terms only prohibit logged-in scraping, and not logged-off scraping,” because a “use” of Facebook or Instagram only refers to actions taken by users while they are logged in. The court also held that a survival clause in Facebook’s terms of service was unenforceable to the extent it purported to prohibit scraping of Facebook data that occurs after a user terminates its Facebook account.
We will also closely follow litigation asserting copyright claims against AI developers arising from the alleged use of copyrightable training data. For example, in a putative class action lawsuit filed by author Sarah Silverman and other authors, plaintiffs assert claims of direct and vicarious copyright infringement on the theory that AI developers scraped their copyrighted materials to train generative AI models that produce allegedly infringing derivative works. In February of this year, a federal judge dismissed certain claims made by the plaintiffs, including claims for vicarious copyright infringement, violation of the Digital Millennium Copyright Act, negligence, and unjust enrichment. The court rejected the plaintiffs’ contention that “every output of the OpenAI Language Models is an infringing derivative work” by virtue of being trained on copyrighted materials. Still pending are the plaintiffs’ claim that OpenAI’s training on copyrighted works violates copyright law and OpenAI’s argument that the use of works to train AI systems constitutes permissible “fair use.” Any ruling clarifying the application of the “fair use” doctrine to material used to train AI is likely to have far-reaching consequences for data vendors who help power AI technologies.
Conclusion
While 2024 is already shaping up to be a critical year for the data scraping industry, stakeholders are left with many unanswered questions about how regulators, lawmakers, and courts will treat data scraping in the age of AI. With courts, legislators, and potentially regulators set to break new legal ground this year, web scrapers should stay informed about the latest legal developments.
The authors
Renita Sharma, Hope Skibitsky, and Daniel Sisgoreo are experienced litigators in Quinn Emanuel’s New York office. They have extensive experience representing clients in the data scraping space, including on issues relating to the intersection of data scraping/aggregation and the Computer Fraud and Abuse Act, violations of terms of service and related torts. They also frequently advise clients on ways to mitigate their legal risk profile with respect to data collection and usage.