Cloud services provider Cloudflare has unveiled a new, free tool designed to stop bots from scraping websites hosted on its platform for data to train AI models.

Some AI vendors, including Google, OpenAI, and Apple, give website owners the option to block their data-scraping bots by modifying the site's robots.txt file, which tells bots which pages of a site they may crawl. However, as Cloudflare points out in a blog post announcing the new tool, not all AI scrapers respect these directives.
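For illustration, a robots.txt opt-out for these crawlers might look like the snippet below. The user-agent tokens shown (GPTBot, Google-Extended, Applebot-Extended) are the ones the vendors have published for AI training opt-outs; this is a sketch rather than an exhaustive list, and the tokens can change, so site owners should confirm them against each vendor's documentation.

    # Block OpenAI's AI training crawler
    User-agent: GPTBot
    Disallow: /

    # Block Google's AI training opt-out token (does not affect Search indexing)
    User-agent: Google-Extended
    Disallow: /

    # Block Apple's AI training opt-out token
    User-agent: Applebot-Extended
    Disallow: /

Compliance with these rules is voluntary, which is precisely the gap Cloudflare's tool is meant to cover.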

“Customers don’t want AI bots visiting their websites, especially those that do so dishonestly,” the company writes on its official blog. “We are concerned that some AI companies, intent on bypassing rules to access content, will persistently adapt to evade bot detection.”

To address this issue, Cloudflare has analyzed AI bot and crawler traffic to refine its automatic bot detection models. These models assess various factors, including whether an AI bot might be trying to evade detection by mimicking the behavior and appearance of a regular web browser user.

“When bad actors attempt to crawl websites at scale, they typically use tools and frameworks that we can fingerprint,” Cloudflare explains. “Based on these signals, our models can appropriately flag traffic from evasive AI bots as bots.”

Cloudflare has also established a form for website hosts to report suspected AI bots and crawlers. The company plans to continue manually blacklisting AI bots over time.

The problem of AI bots has become more prominent as the demand for model training data increases with the generative AI boom. Many websites, wary of AI vendors training models on their content without notification or compensation, have chosen to block AI scrapers and crawlers. According to one study, around 26% of the top 1,000 websites have blocked OpenAI’s bot, while another study found that over 600 news publishers had done the same.

Blocking AI bots, however, is not a foolproof solution. Some vendors appear to be ignoring standard bot exclusion rules to gain a competitive edge in the AI market. For instance, AI search engine Perplexity was recently accused of impersonating legitimate visitors to scrape content from websites, and both OpenAI and Anthropic have reportedly ignored robots.txt rules at times.

In a letter to publishers last month, content licensing startup TollBit mentioned that it sees “many AI agents” ignoring the robots.txt standard.

While tools like Cloudflare’s new offering could help detect and block clandestine AI bots, their effectiveness will depend on how accurately they identify such bots. They also do nothing to resolve a thornier trade-off for publishers: AI tools like Google’s AI Overviews may exclude sites from their summaries if those sites block specific AI crawlers, costing them referral traffic.
