BLOG

Generative AI Security and Unauthorized Content Scraping Protection

Published April 24, 2024

Imagine being an expert who sells information for a living; people pay to ask you questions. Suddenly, a machine sucks that expertise from your mind, learns to answer questions faster than you, and takes away your livelihood. With the rise of generative AI, enterprises with business models reliant on content face precisely this dilemma.

We may not know where Gen AI will take us, but clearly it will transform how we consume content and, in so doing, disrupt enterprises that rely on the distribution of content—similar to how the Internet transformed our consumption of news and entertainment, devastating many traditional news outlets.

The Gen AI disruption will impact businesses that sell content, such as media, news, and stock photos, as well as businesses that rely on content to attract viewers to paid advertisements. It will equally impact businesses that use content to draw in prospects, promote products and services, build brand identity and customer relationships, or move customers toward a call to action.

While the crawling of content for search was born in the early days of the web, LLM-based apps like ChatGPT function in a fundamentally different way. Search engines provide summaries with links back to original content, adding value by making content discoverable. Conversely, chat-based apps powered by LLMs do not necessarily provide links back to original content; rather, they invite users to remain within the chat, learning more through further prompts, draining all value from the enterprise that created the content.

Organizations cannot rely solely on the robots.txt file, which lets a site declare which parts of it crawlers may visit, because not every organization crawling content to train LLMs will respect it. It is up for debate whether LLMs merely copy and reproduce content or synthesize it like any other creator, and the application of copyright law to LLM scraping is now being tested in the courts. How laws and norms will evolve is hard to tell, but organizations should begin thinking now about how to protect the content their businesses rely upon.
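
For illustration, a robots.txt that asks AI training crawlers to stay away might look like the sketch below. The user agents shown are examples of crawlers that publish identifiers; compliance with these directives is entirely voluntary.

```
# Illustrative robots.txt entries asking AI training crawlers to stay out.
# Crawlers that ignore robots.txt are unaffected by these directives.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Other crawlers may still visit public pages
User-agent: *
Disallow:
```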

Scraping can be mitigated, although not easily. Indeed, it’s hardly a new problem. Scrapers have long sought competitive data on airlines, retail chains, and hotels through fare, price, and rate scraping. Not only do these enterprises want to avoid the loss of competitive data, but the traffic load from scrapers, especially those seeking the most up-to-date data, can account for as much as 98% of all traffic to a site in some cases, degrading performance and even crashing sites.

Scrapers use bots to automate data collection. Unfortunately, traditional mechanisms for mitigating bots, such as CAPTCHA and IP address deny lists, are ineffective against scraper bots. Because scraping is generally considered legal, numerous online services are available for bypassing CAPTCHA. Using machine learning or click farms to solve the CAPTCHAs, these services are fast and cheap, and far more efficient than most of us at cracking those irritating puzzles. The easiest alternative to CAPTCHA, IP deny lists, is also ineffective because of services available to scrapers. These services enable scrapers to issue their requests through tens of millions of residential IP addresses—a number so large and growing that maintaining deny lists is completely infeasible.

Even many specialized bot management solutions struggle with scraping because those solutions depend on instrumentation for signal collection. A typical example is login. The browser first issues an HTTP GET request to retrieve a web page containing a login form. On that page, JavaScript runs in the background, collecting data about the browser and the user’s typing and mouse movement patterns. When the user submits their credentials, JavaScript inserts the signal data into the HTTP POST request, which the bot management solution, acting as a reverse proxy, intercepts and analyzes to determine whether the agent making the request is a bot.
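
As a rough sketch of that pattern (not any vendor’s actual implementation; the form id, field names, and signal structure are invented for illustration), client-side instrumentation might look something like this:

```typescript
// Hypothetical client-side instrumentation: collect coarse browser and
// interaction signals, then attach them to the login POST as a hidden field.
// Field names, structure, and the form id are illustrative only.

interface Signals {
  userAgent: string;
  screen: { width: number; height: number };
  keyTimings: number[];   // inter-keystroke intervals in ms
  mouseMoves: number;     // count of mousemove events observed
}

const signals: Signals = {
  userAgent: navigator.userAgent,
  screen: { width: window.screen.width, height: window.screen.height },
  keyTimings: [],
  mouseMoves: 0,
};

let lastKey = 0;
document.addEventListener("keydown", () => {
  const now = performance.now();
  if (lastKey > 0) signals.keyTimings.push(now - lastKey);
  lastKey = now;
});
document.addEventListener("mousemove", () => { signals.mouseMoves++; });

// On submit, serialize the signals into a hidden input so the reverse proxy
// can inspect them alongside the credentials in the POST body.
const form = document.getElementById("login-form") as HTMLFormElement;
form.addEventListener("submit", () => {
  const field = document.createElement("input");
  field.type = "hidden";
  field.name = "client_signals";           // illustrative field name
  field.value = JSON.stringify(signals);
  form.appendChild(field);
});
```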

Many content sites, however, do not require a combination of GET and POST to access content, whether that’s blog posts, news items, or pricing. Rather, a single HTTP GET request returns everything the scraper wants, eliminating the chance for instrumentation.
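
To see why, consider a minimal scraper sketch (the URL is a placeholder): a single request retrieves the full page, nothing is ever rendered in a browser, so any signal-collecting script embedded in the page never executes and no behavioral data is produced.

```typescript
// Minimal scraper sketch: one GET fetches the full HTML of a content page.
// Because the page is never rendered, instrumentation JavaScript never runs.
// URL and header values are placeholders.

async function scrape(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: { "User-Agent": "Mozilla/5.0" }, // impersonate a browser
  });
  return res.text();                          // raw HTML, ready for parsing
}

scrape("https://example.com/blog/some-article").then((html) => {
  console.log(html.length, "bytes retrieved with a single GET");
});
```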

We know that many bot management solutions fail to stop scraping because several services provide easy API access to scraped content. ZenRows, for example, lists the anti-bot vendors that it can bypass.

Fortunately, F5 Distributed Cloud Bot Defense resolves this problem through a technique called an interstitial—a page that loads rapidly, collects data quickly, and then loads the content of the requested page. Over several years of defending the largest airlines and retailers from scraping, F5 has refined the technique to be fast, efficient, and effective. The interstitial executes only once per user session because once an agent is identified as human, further checks are unnecessary, except for guarding against replay abuse.
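
Conceptually, the pattern works like this (a generic sketch of an interstitial check, not F5’s implementation; the endpoint, parameter names, and signal set are invented): the first GET returns a lightweight page whose script collects signals, submits them for verification, and then forwards the verified browser to the originally requested URL.

```typescript
// Generic interstitial sketch (browser side). The endpoint, parameters, and
// signals are illustrative; a real product collects far richer data and
// protects the verification result against replay.

async function runInterstitial(): Promise<void> {
  const target = new URLSearchParams(location.search).get("next") ?? "/";

  const signals = {
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    timezoneOffset: new Date().getTimezoneOffset(),
  };

  // Submit signals; the server decides human vs. bot and sets a session
  // cookie so the check runs only once per session.
  const res = await fetch("/__verify", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(signals),
  });

  if (res.ok) {
    location.replace(target);   // load the originally requested content
  }
}

runInterstitial();
```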

As the most effective bot management solution available, Distributed Cloud Bot Defense gives content creators the strongest defense against the scraping of their content for LLM training. For organizations that want to protect their content and their business, it is the best option.