OpenAI Inks Deal to Train AI on Reddit Data

OpenAI has finalized an agreement with Reddit to use the social news site’s data to train its AI models.

In a blog post on its press relations site, OpenAI announced that the partnership will grant it access to “real-time, structured, and unique content”—including posts and replies—from Reddit. This data will help improve the capabilities of OpenAI’s tools and models to better understand and showcase that content. Reddit content will be integrated into ChatGPT, OpenAI’s popular conversational AI, and both companies will collaborate on introducing unspecified new “AI-powered features” for Reddit users and moderators.

Additionally, OpenAI will become a Reddit advertising partner.

“Reddit will be building on OpenAI’s platform of AI models to bring its powerful vision to life,” OpenAI stated in the post. “Using LLMs, ML, and AI allows Reddit to improve the user experience for everyone.”

OpenAI has established several similar licensing agreements with various content providers, ranging from stock media libraries to news publishers. However, a notable aspect of this partnership is that Sam Altman, OpenAI’s CEO and a former member of Reddit’s board of directors, holds an 8.7% stake in Reddit, making him its third-largest shareholder.

To address potential scrutiny, OpenAI’s press release emphasized that, while Altman remains a Reddit shareholder, the partnership “was led by OpenAI’s COO [Brad Lightcap]” and “approved by [OpenAI’s] independent board of directors.”

Reddit has increasingly made data licensing agreements a central part of its growth strategy as it navigates the market as a public company.

In its IPO prospectus, Reddit disclosed contractual agreements to license its data to customers, including Google, with a combined worth of over $200 million. In its first earnings report as a public company, Reddit reported a 450% year-over-year increase in non-ad revenue, largely due to these agreements.

Following the announcement of the OpenAI deal, Reddit’s stock rose 11% in extended trading.

“The paradox I see is that, as more content on the internet is written by machines, there’s an increasing premium on content that comes from real people,” Reddit CEO Steve Huffman said during the company’s earnings call in March. “And we have nearly two decades of authentic conversation.”

Reddit’s platform, which holds over 1 billion posts and more than 16 billion comments (figures that grow daily thanks to its hundreds of millions of active users), is a valuable resource for generative AI companies. Generative AI models learn from examples of content, such as text and images, to generate new, similar content.

However, the company may face pushback from users concerned about the monetization of their data.

For instance, Stack Overflow, a Q&A forum for software developers, recently struck an agreement with OpenAI to provide data for model training. In protest, some users deleted their top-rated answers on the platform. Stack Overflow responded by restoring the deleted posts and banning those users, citing non-compliance with its terms of service.

Reddit has already shown resistance to one effort aimed at giving users more control over their data.

Vana, a startup built on blockchain technology, is trying to launch a data DAO (decentralized autonomous organization) to let Reddit users pool their data and collectively decide how it’s used or sold. Reddit banned Vana’s subreddit, which was dedicated to discussing the DAO, accusing the company of “exploiting” its data export controls.
