Website Rank in Common Crawl's C4 AI Dataset

Apr 20th, 2023

The Washington Post published an article, “Inside the secret list of websites that make AI like ChatGPT sound smart”, that visualizes part of the C4 dataset by showing which websites were crawled for it.

Scroll down a lot to “Is your website training AI?” (direct anchor link), and you can enter a domain (e.g. yours!) to see how much that site has contributed to the dataset.

Rank	Domain	Tokens	Percent of all Tokens
11	washingtonpost.com	55M	0.04%
106,350	forum.zettelkasten.de	190k	0.0001%
328,137	christiantietze.de	71k	0.00005%
718,385	zettelkasten.de	32k	0.00002%

Can’t beat a community effort like a forum, no matter how much or how long I post :)
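As a rough sanity check on the table, each row implies a total corpus size: divide the token count by its percentage share. Here is a minimal Python sketch of that arithmetic. The values are copied from the table above and are already rounded by the Post's tool, so the results should only agree roughly; `ROWS` and `implied_total` are names made up for this sketch.

```python
# Rows copied from the table above: (domain, tokens, percent of all tokens).
# The Post's tool rounds both numbers, so treat the results as ballpark figures.
ROWS = [
    ("washingtonpost.com", 55_000_000, 0.04),
    ("forum.zettelkasten.de", 190_000, 0.0001),
    ("christiantietze.de", 71_000, 0.00005),
    ("zettelkasten.de", 32_000, 0.00002),
]


def implied_total(tokens: int, percent: float) -> float:
    """Return the total token count implied by a single table row."""
    return tokens / (percent / 100)


if __name__ == "__main__":
    for domain, tokens, percent in ROWS:
        billions = implied_total(tokens, percent) / 1e9
        print(f"{domain}: roughly {billions:.1f} billion tokens implied")
```

All four rows land in the same order of magnitude, somewhere between 100 and 200 billion tokens, which suggests the percentages and token counts in the table are at least consistent with each other.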

Got the link and the idea to check my sites from jwz.org.

Update 2023-04-20: I should’ve checked the Post’s claims first; they say it’s Google’s dataset, but that seems to be wrong. Here’s a short summary of what I could find about the C4 dataset:

What is C4?

The Post links to a research paper. Both the paper and the article itself reference C4, which stands for “Colossal Clean Crawled Corpus”, from Common Crawl (https://commoncrawl.org/). The dataset is available on Hugging Face.

Hugging Face’s “Models trained or fine-tuned on c4” sidebar contains Google-owned repositories, but Common Crawl doesn’t appear to be affiliated with Google directly. Wikipedia indicates that the Common Crawl foundation was founded by Gil Elbaz, whose company “Applied Semantics” was acquired by Google. But the Common Crawl foundation itself wasn’t, as far as I can tell.

The dataset is likely used to train AI models, including Google’s, but that’s different from it being Google’s dataset.

To be fair, I first got the idea that this is Google’s from the title of JWZ’s post, “I’m the Googlebot. I’m here to index you. Please hold still.”, and then didn’t get rid of it when writing this post.

Sorry for the confusion – I updated the post’s title!