Recently, there’s been a lot of talk about how to opt your SEO content out of being used to train large language models (LLMs) like ChatGPT. While it can be done, it isn’t straightforward, nor is it proven to work.
How AIs Learn from Your SEO Content
Experts train LLMs with data from multiple sources (e.g., crawled websites, Wikipedia, government court records, emails, and books). Many of these are open-source datasets that are frequently used for training AIs. Some websites also give away large datasets full of different types of information. Amazon and Google are just two of the companies that offer such a portal. According to Wikipedia, there are at least 28 other portals.
These portals contain datasets that are obtained from Common Crawl (filtered), WebText2, Books, and Wikipedia. Of these, both Common Crawl and WebText2 are based on crawling the Internet. Therefore, they’re the ones that you’ll want to pay the most attention to.
What WebText2 Is
WebText2 is a private OpenAI dataset built by crawling links that were shared and upvoted on Reddit. The idea is that upvoted links tend to point to trustworthy URLs with high-quality content. You can think of WebText2 as an extended version of the original WebText dataset, which was also developed by OpenAI. Whereas WebText2 has 19 billion tokens, the original dataset only contained around 15 billion tokens. The original WebText is the dataset that was used to train GPT-2.
What Common Crawl Is
The Common Crawl dataset is one of the most commonly used datasets. It is created by a non-profit organization named Common Crawl, whose crawler is called CCBot. The bot crawls the whole Internet, and organizations that wish to use the data download it. Those organizations can then filter spammy sites out of the dataset.
Introducing CCBot
Since CCBot (user agent string CCBot/2.0) obeys the robots.txt protocol, you can block it from crawling your site by adding a rule for the CCBot user agent with "Disallow: /" to your robots.txt file. Doing so keeps your data out of future Common Crawl datasets, so your SEO content isn’t used to train models like ChatGPT. Unfortunately, this must be done before your site is crawled; otherwise, your content has probably already been included in the dataset. That doesn’t mean you shouldn’t still add the rule, because doing so opts you out of being crawled again when CCBot updates its dataset.
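As a rough illustration of the robots.txt approach, the sketch below uses Python's standard urllib.robotparser module to confirm that a set of rules blocks CCBot while leaving other crawlers alone. The rule text, the is_blocked helper, and the example.com URL are assumptions made for this example; adapt the rules to your own site before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Example rules (an assumption for this sketch): disallow CCBot site-wide.
# Crawlers without a matching group are not restricted by this file.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules forbid user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"  # hypothetical URL
    print("CCBot blocked:", is_blocked(ROBOTS_TXT, "CCBot/2.0", page))
    print("Googlebot blocked:", is_blocked(ROBOTS_TXT, "Googlebot", page))
```

A check like this only tells you what a compliant crawler would do with your rules. It cannot confirm that a crawler actually honors them.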
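CCBot also obeys the nofollow directive of the robots meta tag, which asks crawlers not to follow the links on a page. As mentioned earlier, though, this process is neither straightforward nor guaranteed to work: none of these settings removes content that has already been collected.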
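To check whether a page already carries the nofollow directive, a small script along these lines can help. It is a minimal sketch using Python's standard html.parser module; the RobotsMetaParser class, the has_nofollow helper, and the sample page are hypothetical names and data invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # The content attribute is a comma-separated list of directives.
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_nofollow(html: str) -> bool:
    """Return True if the page's robots meta tag includes nofollow."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "nofollow" in parser.directives


if __name__ == "__main__":
    # Sample page (an assumption for this sketch).
    sample = '<html><head><meta name="robots" content="nofollow"></head></html>'
    print("nofollow present:", has_nofollow(sample))
```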
How to Block ChatGPT from Using Your Website
With search engines, you can choose not to have your website crawled. While this is also possible with Common Crawl, you can’t remove your website’s content from any dataset that already exists. Additionally, research scientists who crawl the web don’t appear to offer a way to opt out of being crawled.
Things to Consider Before Blocking ChatGPT
There are many companies that are using these datasets to filter and categorize URLs. They hope to use them to create lists of websites that they want to target with their advertising. One such company is Alpha Quantum. They use the Interactive Advertising Bureau Taxonomy to offer a categorized dataset of URLs. Doing so enables them to use it for contextual advertising and AdTech marketing. If you choose to be excluded from such datasets, you could lose potential advertisers.
How to Get Help with ChatGPT and Your SEO Content
At the Local SEO Tampa Company in Tampa, FL, we keep up to date with information regarding SEO content, which includes ChatGPT. So, instead of worrying about finding time to do this yourself, get in touch with us so we can take care of it for you.