AI Training Data Results — Stack Overflow CEO Prashanth Chandrasekar poses on May 21, 2024, in Cambridge, Mass. Chandrasekar said the company is trying to balance rising demand for instant chatbot-generated coding assistance with the desire for a community "knowledge base" where people still want to post and "get recognized" for what they've contributed. (AP Photo/Matt O'Brien)

AI is learning from what you said on Reddit or Facebook. Are you OK with that?

July 9 04:14 pm JST

By MATT O'BRIEN

CAMBRIDGE, Mass.

Post a comment on Reddit, answer coding questions on Stack Overflow, edit a Wikipedia entry or share a baby photo on your public Facebook or Instagram feed and you are also helping to train the next generation of artificial intelligence.

Not everyone is OK with that — especially as the same online forums where they've spent years contributing are increasingly flooded with AI-generated commentary mimicking what real humans might say.

Some longtime users have tried to delete their past contributions or rewrite them into gibberish, but the protests haven't had much effect. A handful of governments — including Brazil's privacy regulator on Tuesday — have also tried to step in.

“A more significant portion of the population just kind of feels helpless,” said Reddit volunteer moderator Sarah Gilbert, who also studies online communities at Cornell University. “There’s nowhere to go except just completely going offline or not contributing in ways that bring value to them and value to others.”

Platforms are responding — with mixed results. Take Stack Overflow, the popular hub for computer programming tips. First, it banned ChatGPT-written responses due to frequent errors, but now it's partnering with AI chatbot developers and has punished some of its own users who tried to erase their past contributions in protest.

It’s one of a number of social media platforms grappling with user wariness — and occasional revolts — as they try to adapt to the changes brought by generative AI.

Software developer Andy Rotering of Bloomington, Minnesota, has used Stack Overflow daily for 15 years and said he worries the company “could be inadvertently hurting its greatest resource” — the community of contributors who’ve donated time to help other programmers.

“Keeping contributors incentivized to provide commentary should be paramount,” he said.

Stack Overflow CEO Prashanth Chandrasekar said the company is trying to balance rising demand for instant chatbot-generated coding assistance with the desire for a community “knowledge base” where people still want to post and “get recognized” for what they've contributed.

“Fast forward five years — there’s going to be all sorts of machine-generated content on the web," he said in an interview. "There’s going to be very few places where there’s truly authentic, original human thought. And we’re one of those places."

Chandrasekar readily likens Stack Overflow's challenges to one of the “case studies” he learned about at Harvard Business School: how a business survives, or doesn't, after a disruptive technological change.

For more than a decade, users typically landed on Stack Overflow after typing a coding question in Google, and then found the answer, copied and pasted it. The answers they were most likely to see came from volunteers who'd built up points measuring their credibility — which in some cases could help land them a job.

Now programmers can simply ask an AI chatbot — some of which are already trained on everything ever posted to Stack Overflow — and it can instantly spit out an answer.

ChatGPT's debut in late 2022 threatened to put Stack Overflow out of business. So Chandrasekar carved out a special 40-person team at the company to race to launch its own specialized AI chatbot, called Overflow AI. Then, the company made deals with Google and ChatGPT maker OpenAI, enabling the AI developers to tap into Stack Overflow's question-and-answer archive to further improve their large language models.

That kind of strategy makes sense but may have come too late, said Maria Roche, an assistant professor at Harvard Business School. “I’m surprised that Stack Overflow wasn’t working on this earlier," she said.

When some Stack Overflow users tried to delete their past comments after the OpenAI partnership was announced, the company responded by suspending their accounts due to terms that make all contributions “perpetually and irrevocably licensed to Stack Overflow."

“We quickly addressed it and said, ‘Look, that’s not acceptable behavior’,” said Chandrasekar, describing the protesters as a small minority in the “low hundreds” of the platform's 100 million users.

Brazil’s national data protection authority on Tuesday took action to ban social media giant Meta Platforms from training its AI models on the Facebook and Instagram posts of Brazilians. It established a daily fine of 50,000 reais ($8,820) for non-compliance.

Meta in a statement called it a “step backwards for innovation” and said it has been more transparent than many industry counterparts doing similar AI training on public content, and that its practices comply with Brazilian laws.

Meta has also encountered resistance in Europe, where it recently put on hold its plans to start feeding people’s public posts into training AI systems — which was supposed to start last week. In the U.S., where there's no national law protecting online privacy, such training is already likely happening.

“The vast majority of people just have no idea that their data is being used,” Gilbert said.

Reddit has taken a different approach — partnering with AI developers like OpenAI and Google while also making clear that content can't be taken in bulk without the platform’s approval by commercial entities “with no regard for user rights or privacy.” The deals helped bring Reddit the money it needed to debut on Wall Street in March, with investors pushing the value of the company close to $9 billion seconds after it began trading on the New York Stock Exchange.

Reddit hasn't tried to punish users who protested — nor could it easily do so given how much say voluntary moderators have on what happens in their specialty forums known as subreddits. But what worries Gilbert, who helps moderate the “AskHistorians” subreddit, is the increasing flow of AI-generated commentary that moderators must decide whether to allow or ban.

“People come to Reddit because they want to talk to people, they don’t want to talk to bots,” Gilbert said. “There’s apps where they can talk to bots if they want to. But historically Reddit has been for connecting with humans.”

She said it's ironic that the AI-generated content threatening Reddit was sourced from the comments of millions of human Redditors, and “there’s a real risk that eventually it could end up pushing people out.”

24 Comments

virusrex
July 10 04:12 pm JST

Endless loop of unnatural or mistaken content written by AI being harvested and used to train AI until all becomes just AI talking with itself about nonsense.

3 ( +4 / -1 )

dagon
July 10 01:29 pm JST

poor AI, if forced to learn from the garbage posted on public forums, it will be renamed AD (from dumbness)

How about Artificially Subsidized and Supplemented Simplicity (ASSS).

That could be a new data point for LLM web crawlers hungry for something new.

With all the bot farm spam it could be a death spiral feedback loop.

0 ( +1 / -1 )

timeon
July 10 12:33 pm JST

poor AI, if forced to learn from the garbage posted on public forums, it will be renamed AD (from dumbness)

0 ( +1 / -1 )

spinningplates
July 10 10:59 am JST

Lol.

AI aint ‘learning’ nothing from the junk I post! That’s for sure. Hahaha!

1 ( +1 / -0 )

Sven Asai
July 10 09:28 am JST

Of course it's not OK, because then people like me and many others who are not on Reddit or Facebook will be dramatically underrepresented if not even completely excluded from the training data and this leads to the next one of so many AI failures.

-1 ( +0 / -1 )

dagon
July 10 07:12 am JST

Not everyone is OK with that — especially as the same online forums where they've spent years contributing are increasingly flooded with AI-generated commentary mimicking what real humans might say.

Look at this site. Over the past week the number of stories about Japan or American politics has dropped off and engagement looks to be way down, judging by the number of comments.

Combined with moderation, which was often inscrutable but has gotten even more overzealous.

Top stories about cyclones in Bangladesh, a minor UN peacekeeping mission and other stories have zero comments at the top.

If one of the mods' duties is to drive engagement and comment generation, they are doing a poor job.

3 ( +3 / -0 )

kintsugi
July 10 06:42 am JST

I don't post on Reddit or Facebook.

2 ( +2 / -0 )

tora
July 10 06:27 am JST

Once your stuff is up in the public sphere, it's potentially there in perpetuity. You have to realise that. Google et al. already 'own' your information and openly sell it to advertisers and do what they like with it as they see fit.

Don't want it out there, then don't post in the first place. This includes JT.

1 ( +1 / -0 )

Peter Neil
July 10 05:41 am JST

it isn’t even ai. it’s bots, which have been around for decades, and simple programming to copy the content.

ai at this point is a hoax to manipulate stock prices and valuations. all you ever see are samples and examples and tests, all of which are crap.

0 ( +1 / -1 )

JRO
July 10 04:50 am JST

Problem is that it's learning on our stuff to replace us at whatever we are currently doing to make a living. All this to transfer the money we are making to a few rich elites. Training on unlicensed content should absolutely not be legal, that's basically getting a non-exclusive resell license to our content for free. I actually do have a contract where a company is allowed to train on some of my content which they pay for, which I'm totally okay with. Every person who is part of the process needs to be paid, otherwise it doesn't work.

Even then at some point there will be a need for some type of UBI, not everyone can become a plumber.

2 ( +2 / -0 )

GBR48
July 10 03:25 am JST

I have no problem with AI scraping my posts. I like to think it will improve the tech, which would be a public benefit.

But given the nature of so much web 2.0 content, I'm not sure you will get a reliable AI out of it. You may want your kids to study at Oxbridge. You probably wouldn't want them to study at the University of Facebook.

0 ( +1 / -1 )

DatAss
July 10 01:39 am JST

Let's all just hope AI isn't learning from my shenanigans.

0 ( +1 / -1 )

falseflagsteve
July 10 12:29 am JST

Fed up with AI, enough of this sort of thing I say. It’s dangerous and before you know it you’ve ended up with Skynet controlling the world. It’s happening too fast without enough controls and needs to be banned until more is known about it and how it can end up.

-9 ( +1 / -10 )

NCIS Reruns
July 9 11:45 pm JST

I don't partake in social networks, which I find moronic to the extreme. I don't even critique books I've read from Amazon. Emails to friends are as far as I'm willing to go.

5 ( +5 / -0 )

smithinjapan
July 9 10:29 pm JST

I don't care all that much. I already get targeted ads all the time. I would actually delete Facebook and all related apps immediately if it weren't my only way of keeping in touch with some people. I wouldn't mind seeing it go the way of MySpace and Mixi.

3 ( +3 / -0 )

blue in green
July 9 10:02 pm JST

What could possibly go wrong?

5 ( +5 / -0 )

owzer
July 9 09:51 pm JST

I had best keep posting common sense so AI will be a force for good!

-2 ( +2 / -4 )

sual
July 9 09:24 pm JST

Garbage in = Garbage out

6 ( +6 / -0 )

Blackstar
July 9 08:34 pm JST

“Fast forward five years — there’s going to be all sorts of machine-generated content on the web," he said in an interview. "

Forget the 5 years bit. There's a lot of that going on already, actually. For the companies that use it, that comes in as a saving. For professional writers that basically means a career change. "C'est la vie" you might say.

No one can fully comprehend where we're going with AI, but it's generally agreed that machines will be smarter than us (self aware, able to make decisions independently of us) within 15-20 years at the latest. It goes without saying that "the insights and ramblings of the J expat community 2024" will not be on its radar.

The scariest thing about the phenomenal growth of AI's potential is that the people who express the deepest worries about it are those who know the most about it (but have withdrawn from the production side of it)!

In my opinion, the best thing you can do is limit/end your social media use. If you need to "share", send emails to people you actually know.

2 ( +3 / -1 )

WoodyLee
July 9 07:35 pm JST

Wow, I am all for it as long as the content is Accurate and not misleading or paid for by certain interests$$$.

0 ( +1 / -1 )

Peter14
July 9 07:07 pm JST

Then I hope people are saying that humans and AI can live and work together, and that AI should never do anything that could harm or cause humans harm.

3 ( +3 / -0 )

Daninthepan
July 9 06:17 pm JST

If AI is learning from what I said, we are doomed.

10 ( +10 / -0 )

Ricky Kaminski13
July 9 05:34 pm JST

I actually have no problem with AI using our comments to train itself. Almost feel like it's an honor to be the generation that trained baby AI. Hope it remembers who we all were, what we said and tried to stand for. Even the comments on our beloved JT may be fair game! 'The insights and ramblings of the J expat community 2024, the golden era', etched in AI memory for infinity! Howzaaat!!

-4 ( +3 / -7 )

TokyoOldMan
July 9 05:13 pm JST

AI doesn’t really create new stuff but instead regurgitates previously written stuff, maybe even reworded. So if r/conspiracy is used as the source for AI “wisdom” then things will certainly become interesting.

-1 ( +3 / -4 )