The rise of generative AI in 2023 has been nothing short of remarkable. Seemingly overnight, tools like OpenAI’s ChatGPT have gone from obscure research projects to massively popular applications used by millions worldwide. However, the rapid pace of development has also raised serious questions around the data and content being used to train these AI systems.
In late December, tensions came to a head as The New York Times filed a major lawsuit against OpenAI and Microsoft – alleging the unauthorized use of NYT content. This high-stakes legal battle could have huge implications for the future of AI development and set new precedents around copyright in the age of generative models.
Background on Generative AI Growth
To understand this lawsuit, it’s important to first recognize the massive growth and impact of generative AI over the past year.
Tools like DALL-E 2, Stable Diffusion, and ChatGPT produce remarkably human-like text, images, and other media. ChatGPT is built on the generative pre-trained transformer (GPT) architecture, while the image tools rely on diffusion models; all of them work by ingesting massive datasets during the training process to learn patterns and associations.
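As a toy illustration of "learning patterns and associations" from ingested text, here is a minimal bigram model, vastly simpler than a real transformer but resting on the same idea of predicting the next word from training data (the corpus and function names are illustrative, not from any actual system):

```python
import random
from collections import defaultdict

def train_bigram_model(corpus: str) -> dict:
    """Record which words follow which in the training text."""
    model = defaultdict(list)
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model

def generate(model: dict, start: str, length: int = 8, seed: int = 0) -> str:
    """Emit text by repeatedly sampling a word seen after the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        options = model.get(out[-1])
        if not options:
            break  # never saw anything follow this word in training
        out.append(rng.choice(options))
    return " ".join(out)

model = train_bigram_model("the cat sat on the mat and the dog sat on the rug")
print(generate(model, "the"))
```

Scaled up from one toy sentence to hundreds of billions of words, and from bigram counts to billions of learned parameters, this is why the provenance of the training text matters so much: whatever the model ingests is what it learns to reproduce.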
For example, ChatGPT launched in late 2022 and reached an estimated 100 million monthly users within roughly two months – making it one of the fastest growing consumer apps in history. Its human-like conversational abilities have disrupted everything from customer service to school assignments.
However, much of the hype has overlooked the question of where these AIs get their intelligence from in the first place. Generative models are only as good as their training data. In ChatGPT’s case, researchers estimate it was trained on hundreds of billions of words of text pulled from publicly available sources online – including copyrighted news content, books, web pages and more.
“The unlawful use of The Times’s intellectual property undermines the tremendous investment we have made in high quality journalism,” said NYT executive David Perpich.
The New York Times Lawsuit
On December 27th, 2023, The New York Times fired a shot across the bow – filing a major federal copyright lawsuit against OpenAI and its partner Microsoft.
The core allegation is that OpenAI and Microsoft used NYT content like news articles, reviews and investigative pieces to train AI systems without permission or compensation.
Specifically, the lawsuit claims:
- Generative models like ChatGPT can memorize and reproduce NYT content “verbatim”
- This undermines NYT’s business by creating “substitute products”
- It threatens the financial viability of NYT’s public interest journalism
The lawsuit seeks unspecified damages that could amount to billions of dollars under statutory copyright infringement penalties.
It also calls for OpenAI and Microsoft to destroy any AI systems and training data derived from NYT content. If this were to actually occur, it would be a massive setback that could require rebuilding key models from scratch without the disputed data.
Why Generative AI Relies Heavily on News Content
To understand why The New York Times content factors so heavily into models like ChatGPT, it helps to recognize what makes news articles uniquely valuable for training generative AI:
Trust and Accuracy – As a reputable news source, NYT content meets a high standard for factual reliability and accuracy that AI models seek to emulate.
Timeliness – Daily news coverage allows models to stay updated on current events and modern language/phrasing.
Range of Topics – NYT covers politics, business, technology, science, arts and more – helping models build broad knowledge.
Volume – With thousands of articles published daily, NYT provides a steady stream of high quality training data.
Without access to trustworthy, timely and wide-ranging news content, generative models would lose much of their capability. NYT and other publishers provide immense value to AI companies, and the lawsuit alleges OpenAI and Microsoft have unfairly benefited without proper compensation.
Implications of the Lawsuit Moving Forward
Because this lawsuit is unprecedented, it’s impossible to predict exactly how it will resolve. However, legal experts and commentators have flagged a few key implications to watch for as the case proceeds:
Impact on OpenAI’s Valuation – OpenAI was valued at over $29 billion in a 2023 funding round, with later reports suggesting far higher figures. However, an outcome forcing it to purge training data or pay large damages could significantly reduce that valuation.
More Lawsuits From Other Publishers – Early indications suggest NYT may be just the first publisher filing suit related to AI copyright issues. Groups representing European newspapers and authors have already announced plans to follow suit.
- Two nonfiction authors, Nicholas Basbanes and Nicholas Gage, filed suit against OpenAI and Microsoft in Manhattan federal court, alleging the companies misused their work to train AI models, Reuters reported.
- Attorney Matthew Butterick is leading a series of lawsuits against firms such as Microsoft, OpenAI and Meta, seeking to defend the copyrights of artists, writers and programmers, El País reported.
- Lists naming more than 16,000 artists allegedly used to train the Midjourney image generator have gone viral online, reinvigorating debates on copyright and consent in AI image creation, The Art Newspaper reported.
- Three major music publishers – Universal Music Publishing Group, Concord Music Group and ABKCO – sued AI company Anthropic, alleging it copied copyrighted song lyrics to train its models and that those models can generate text similar or identical to the lyrics, The Hollywood Reporter reported.
- The U.S. Copyright Office issued a notice of inquiry (NOI) in the Federal Register on copyright and AI, part of a broader initiative examining the scope of copyright in works generated using AI tools and the use of copyrighted materials in AI training. The Office will use the gathered information to “advise Congress; inform its regulatory work; and offer information and resources to the public, courts, and other government entities considering these issues.”
- Stock photo provider Getty Images sued Stability AI in the United States for copyright infringement, Reuters reported, after commencing similar proceedings against Stability AI in the High Court of Justice in London in January 2023. Stability AI said at the time that it does not comment on pending litigation.
- A group of visual artists sued AI companies including Stability AI, Midjourney and DeviantArt for copyright infringement, Reuters reported.
Status of Other Tech Partnerships – OpenAI and Microsoft have a close partnership central to products like Bing AI. The lawsuit could strain that relationship if Microsoft is also impacted by any legal fallout.
Push Towards Licensed Content Models – Some experts believe the best path forward is for AI companies to proactively partner with publishers to license content for training data, rather than scraping it without permission. Lawsuits could accelerate this shift.
The coming months promise to be pivotal as the first major legal battle around AI copyright takes shape between The New York Times, OpenAI and Microsoft. All sides will be closely watched by other tech firms and publishers considering their own positions around generative AI.
The Full Timeline of Key Events
While the NYT lawsuit captured headlines in late 2023, this conflict had actually been simmering behind the scenes for most of the year.
Here is a more complete timeline of how tensions escalated between publishers and Big Tech regarding generative AI:
Late 2022
- ChatGPT officially launches to the public and almost instantly gains viral popularity.
- OpenAI chief scientist Ilya Sutskever notes in an interview that models like GPT-3 were trained on “essentially every piece of text on the internet,” adding, “It’s very indiscriminate.” This fuels speculation that copyrighted content is being used without permission.
2023
- Various publishers quietly begin testing ChatGPT and notice it can recite full passages from their articles verbatim. NiemanLab, for example, finds that ChatGPT can reproduce entire sections of articles without any attribution.
- A group of prominent authors, organized by the Authors Guild, files a lawsuit against OpenAI alleging copyright infringement in its AI training process, seeking class action status to represent any authors whose work may have been used without permission.
- NYT begins urgently meeting with OpenAI executives about its concerns over content usage.
Early December 2023
- OpenAI CEO Sam Altman preemptively announces a new policy to properly credit third-party content reproduced by ChatGPT.
- “We’re hoping for a cooperative relationship with news publishers rather than an adversarial one,” says an OpenAI spokesperson.
- Negotiations between NYT and OpenAI reach an impasse over use of content and licensing issues, and NYT threatens formal legal action if its concerns aren’t addressed.
Late December 2023
- With no agreement reached, The New York Times officially files its federal lawsuit against OpenAI and Microsoft on December 27th, alleging copyright infringement.
- Altman expresses surprise at the lawsuit after believing productive negotiations were underway.
Assessing the Strength of Both Legal Arguments
Like any complex legal battle, both sides in the NYT vs. OpenAI case have reasonable arguments around copyright law and its application to AI systems.
The New York Times’ Position
The crux of the Times’ case is that OpenAI and Microsoft clearly used vast amounts of NYT content without permission to build generative models like ChatGPT. In doing so, it alleges they:
- Directly infringed on NYT copyright protections
- Created AI systems that actively compete with NYT’s own products
- Damaged the NYT brand through AI “hallucinations” that attribute fabricated quotes and reporting to the paper
The landmark Supreme Court case Feist Publications v. Rural Telephone Service established that facts themselves can’t be copyrighted – but the “creative expression” of those facts can.
So while generative AI can legally state basic news facts, NYT argues the current systems go much further by reproducing entire articles and passages with the paper’s creative expression intact.
“We seek only to stand up for principle – copyright protection and fair compensation for creators,” said NYT publisher A.G. Sulzberger.
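The complaint’s central “verbatim” claim is, at bottom, a measurable property of text. Here is a minimal sketch of how one might flag verbatim overlap between a source article and model output (purely illustrative; the function, the choice of n = 8, and the sample strings are assumptions, not anything from the filing):

```python
def ngram_overlap(source: str, generated: str, n: int = 8) -> float:
    """Fraction of the generated text's word n-grams that appear verbatim in the source."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    source_grams, generated_grams = ngrams(source), ngrams(generated)
    if not generated_grams:
        return 0.0  # output too short to contain any n-gram
    return len(generated_grams & source_grams) / len(generated_grams)
```

A high score on long n-grams suggests memorized passages rather than independently composed text; the exhibits attached to the NYT complaint present essentially this kind of side-by-side comparison.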
OpenAI and Microsoft’s Defenses
OpenAI and Microsoft do have legal arguments of their own in response:
Fair Use – There is an exception under copyright law for “fair use” that allows reproduction of material for research, commentary and other transformative purposes. OpenAI contends that training AI has valid fair use grounds.
Fact vs Expression – As noted above, copyright does not cover pure facts. OpenAI may claim any verbatim passages were purely factual in nature rather than creative expression, though the line between fact and expression can be a gray area.
De Minimis Use – Even if some copyrighted material was used, OpenAI may argue it was an insignificant amount relative to the full training dataset, though NYT alleges its content was given particular weight.
No Harm – Finally, OpenAI can claim that its systems ultimately drive more traffic, subscriptions and attention to NYT rather than displacing it, a notion NYT clearly disputes.
Evaluating all these aspects will come down to complex legal arguments around AI copyright issues that are largely unprecedented in courts so far.
Why a Landmark Ruling Could Take Years
Based on the stakes and complexities involved, legal experts caution that the OpenAI lawsuit could potentially drag out for years before a definitive ruling is reached:
Time Consuming Process
- Federal lawsuits often take over a year just to reach the trial stage as motions and filings are presented.
- Appeals to any initial rulings can also draw out the timeline by months or longer.
Novel Legal Territory
- There is very little legal precedent so far on AI copyright infringement and fair use standards.
- Simply educating judges on the technical details around generative models poses a challenge.
Incentives to Settle
- Ultimately both sides may look to settle out of court to avoid a long, costly battle. But early posturing can still take time.
- NYT has incentive to set a strong precedent though, rather than a quick settlement.
Ongoing Model Improvements
- The AI landscape itself could shift significantly in a few years with next generation upgrades.
- Courts will have to rule on a “moving target” of improving technology.
Big Tech court battles like the decade-long Oracle v. Google fight over Android’s use of Java demonstrate how even billion-dollar cases can meander through the court system for over a decade. Both sides will need patience for the final outcome.
How Publishers and Big Tech Could Find Common Ground
Rather than commit to a scorched earth legal conflict, there may still be a path for publishers, OpenAI and Big Tech to forge a mutually beneficial way forward:
Pursue Licensed Access Models
- OpenAI has already struck licensing deals with publishers such as the Associated Press and Axel Springer, paying for access to content to train generative AI models.
- A transparent, licensed content marketplace could be better for all.
Develop Attribution Standards
- Clear policies around attributing any third-party source material used in AI responses would help ease publisher concerns while still allowing reference to copyrighted works.
Utilize Abstractive Summarization
- More advanced abstractive summarization techniques could help models reference published information without verbatim reproduction that raises infringement flags.
Support Premium Publisher Integrations
- Special integrations exclusively showing original publisher content within conversational interfaces could provide value to users while also directing traffic and subscriptions back to creator sites.
Fund Journalism Grants
- OpenAI and Big Tech partners have tremendous financial resources that could be directed towards grants and programs supporting the future of journalism as a public good.
Broader partnerships and business model alignment could ultimately prove better for users than rivalrous legal action. But cooperation takes willingness from both publishers and tech firms to find common ground built on trust and transparency around content usage.
What Happens if OpenAI Loses?
There remains the possibility that OpenAI and Microsoft could outright lose the copyright case as it proceeds. What would the fallout look like if the companies face a complete legal defeat?
Massive Financial Penalties
- Statutory damages for willful copyright infringement can theoretically climb as high as $150,000 per work.
- With millions of articles potentially at play, theoretical damages could exceed billions.
- Realistically the final penalties would likely fall far short of the maximum, but could still be hugely consequential.
Forced Training Data Purges
- If forced to purge associated training data, OpenAI may have to effectively “turn back the clock” on key models by years until suitable non-infringing replacement data can be utilized.
- For example, ChatGPT could revert to a much less advanced version while new data sets are secured.
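The statutory damages arithmetic above can be made concrete with a rough, purely hypothetical calculation (the one-million-article figure is illustrative; actual liability would depend on which works, if any, a court finds were infringed):

```python
# U.S. statutory damages per infringed work (17 U.S.C. § 504(c)):
# $750 minimum per work, up to $150,000 per work for willful infringement.
MIN_PER_WORK = 750
MAX_WILLFUL_PER_WORK = 150_000

def statutory_damages_range(num_works: int) -> tuple[int, int]:
    """Theoretical low and high statutory damages for a given number of infringed works."""
    return num_works * MIN_PER_WORK, num_works * MAX_WILLFUL_PER_WORK

# Hypothetical: one million articles found infringing
low, high = statutory_damages_range(1_000_000)
print(f"${low:,} to ${high:,}")  # $750,000,000 to $150,000,000,000
```

Even at the statutory minimum the hypothetical total reaches hundreds of millions of dollars, which is why commentators describe the theoretical exposure as reaching into the billions.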
Loss of Competitive Advantage
- Such product rollbacks would be highly damaging to OpenAI’s position as the market leader in generative AI against fast-moving rivals like Anthropic.
- Rebuilding could take years and erase any first mover advantage.
Chilling Effect on Innovation
Some critics argue that if taken too far, the legal penalties could make tech companies hesitant to pursue ambitious AI initiatives if they carry unquantified legal risks around unintended copyright infringement. Striking the right balance will be key.
A Defining Case for AI’s Future
The New York Times vs. OpenAI lawsuit will have an enormous impact on setting expectations, responsibilities and incentives as AI becomes further enmeshed into digital products and services.
Issues of financial liability, appropriate content usage and publisher relationships hang in the balance as the case plays out. How these complex questions around copyright and generative models are ultimately resolved promises to shape the AI landscape for years to come by establishing firm legal precedent.
Both startups and tech giants will be closely monitoring for any guidance the messy court battle helps provide around what is and isn’t permissible as AI capabilities continue expanding at a torrid pace.
For OpenAI specifically, the case represents an inflection point after a meteoric rise to prominence in 2022. Setbacks from an adverse ruling could derail momentum and position. But a favorable outcome would further cement their status as the top AI trailblazer to watch entering 2024 and beyond.