
No one’s jokes, memes, or Pokémon are safe from the relentless pursuit of AI training data.
What happened: Someone compiled one million Bluesky posts into a dataset for machine learning and AI training. It became one of the most popular datasets on the repository website Hugging Face, but the creator later took it down and apologized for violating “principles of transparency and consent in data collection.”
- Bluesky has been adamant that it won’t train AI on user data, but it has few ways to stop others from doing so; it is exploring technological and legal barriers.
Catch-up: Niantic, the company behind mobile games like Pokémon Go, revealed that it was using years of data from players to train what it calls a “large geospatial model” — like a large language model, but with location data.
- Most geolocation and navigation data comes from vehicles. Pokémon Go players walking on sidewalks and paths, and identifying landmarks along the way, provide data that could be used in fields from mixed reality headsets to robotics to the military.
And also: Publisher HarperCollins signed a deal with an unnamed AI company to train models on its books, though it is a fully opt-in program for authors.
Why it matters: Developers are hungrier than ever for AI training data, and they’re willing to turn to any sources available — even if the people the data is harvested from didn’t necessarily agree to it.
- One reason cited for a slowdown in AI improvements has been a lack of new data. AI companies have already scraped what was available online, and website owners are exerting greater control over what’s left.
- Obtaining data legitimately also doesn’t mean people will be happy about it. X updating its terms to allow AI training is one reason users are jumping to Bluesky. Pokémon Go players didn’t know their data would one day be sold, and some journalists aren’t happy outlets are selling their stories to AI companies.
Zoom out: Publishers and website owners are trying various tactics to ensure AI companies keep their sticky fingers to themselves. These range from updating their robots.txt (a file that tells automated crawlers which parts of a site they may and may not access) to updating the copyright pages in books.
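To see how voluntary that system is, here is a minimal sketch of how a well-behaved crawler checks robots.txt before fetching a page, using Python's standard library. The bot name "ExampleAIBot" and the example.com URLs are hypothetical stand-ins, not any real company's crawler.

```python
# Minimal sketch: a polite crawler consulting robots.txt before scraping.
# "ExampleAIBot" and example.com are hypothetical; nothing in the protocol
# technically forces a scraper to respect the answer.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt rules

url = "https://example.com/articles/some-story"
if robots.can_fetch("ExampleAIBot", url):
    print("robots.txt allows crawling this page")
else:
    print("robots.txt disallows it; a polite crawler stops here")
```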
Yes, but: The ways to prevent AI from scraping data essentially amount to asking nicely, unless a publisher goes the New York Times route and files a lawsuit. Developers have even figured out how to circumvent protections that can be embedded into artwork to force an AI to hallucinate.