The Internet Archive has long been a valuable resource for journalists, from finding records of deleted tweets to providing academic texts for background research. However, the advent of AI has created new tensions between the two. Several major publications have begun blocking the nonprofit digital library’s access to their content, citing concerns that AI companies’ bots are using the Internet Archive’s collections to indirectly scrape their articles.
“A lot of these AI businesses are looking for easy-to-use, structured content databases,” Robert Hahn, head of business and licensing affairs for The Guardian, told Nieman Lab. “The Internet Archive’s API would have been an obvious place to plug in their own machines and absorb the IP.”
The New York Times took a similar step. “We blocked the Internet Archive bot from accessing the Times because the Wayback Machine provides unrestricted access to Times content – including from AI companies – without permission,” a representative from the newspaper confirmed to Nieman Lab. Subscription-focused publication Financial Times and social forum Reddit have also taken steps to selectively limit how the Internet Archive catalogs their material.
Several publishers have sued AI businesses over how they access content used to train large language models. To name just a few from the field of journalism:
- The New York Times sued OpenAI and Microsoft
- The Center for Investigative Reporting sued OpenAI and Microsoft
- The Wall Street Journal and New York Post sued Perplexity
- A group of publishers that includes The Atlantic, The Guardian and Politico sued Cohere
- The New York Times and the Chicago Tribune sued Perplexity
Some media outlets seek financial deals before offering their libraries as training material, although such arrangements seem to pay publishing companies more than writers. And that’s not even examining the copyright and piracy disputes over AI tools in other creative fields, from fiction writers to visual artists to musicians. The full Nieman Lab story is well worth reading for anyone following the creative industries’ responses to artificial intelligence.








