Website Rank in Common Crawl’s C4 AI Dataset

The Washington Post published an article, “Inside the secret list of websites that make AI like ChatGPT sound smart”, that visualizes a bit of the C4 dataset via the websites that have been crawled.

Scroll down a lot to “Is your website training AI?” (direct anchor link), and you can enter a domain (e.g. yours!) to see how much that site has contributed.

Rank     Domain                  Tokens  Percent of all tokens
11       washingtonpost.com      55M     0.04%
106,350  forum.zettelkasten.de   190k    0.0001%
328,137  christiantietze.de      71k     0.00005%
718,385  zettelkasten.de         32k     0.00002%
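Taken at face value, the washingtonpost.com row lets you back out the size of the whole corpus; a quick sanity check in Python (numbers straight from the table, rounding of the published percentages aside):

```python
# Each row's percentage should equal tokens / total. Solving the
# washingtonpost.com row (55M tokens at 0.04%) for the total:
total_tokens = 55_000_000 / 0.0004  # = 137.5 billion tokens implied

# The other rows should be roughly consistent with that total
# (the published percentages are heavily rounded).
rows = {
    "forum.zettelkasten.de": (190_000, 0.0001),
    "christiantietze.de": (71_000, 0.00005),
    "zettelkasten.de": (32_000, 0.00002),
}
for domain, (tokens, listed_percent) in rows.items():
    implied_percent = 100 * tokens / total_tokens
    print(f"{domain}: listed {listed_percent}%, implied {implied_percent:.5f}%")
```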

Can’t beat a community effort like a forum, no matter how much or how long I post :)

Got the link and the idea to check my sites from jwz.org.

Update 2023-04-20: I should’ve checked the Post’s claims first: they say it’s Google’s dataset, but that seems to be wrong. Here’s a short summary of the C4 dataset I could find:

What is C4?

The Post links to a research paper. That, and the article itself, reference C4: this stands for “Colossal Clean Crawled Corpus”, built from Common Crawl data (https://commoncrawl.org/). The dataset is available on Hugging Face.

Hugging Face’s “Models trained or fine-tuned on c4” sidebar contains Google-owned repositories, but Common Crawl doesn’t appear to be affiliated with Google directly. Wikipedia indicates that the Common Crawl foundation was founded by Gil Elbaz, whose company “Applied Semantics” was acquired by Google. But the Common Crawl foundation itself wasn’t, as far as I can tell.

The dataset is likely used to train AI models, including Google’s, but that’s different from being Google’s dataset.

To be fair, I first got the idea that this is Google’s dataset from JWZ’s post title, “I’m the Googlebot. I’m here to index you. Please hold still.”, and then didn’t shake that idea while writing this post.

Sorry for the confusion – I updated the post’s title!

Fetch Personalized Command Explanations with ‘um’ from Your Terminal


I stumbled upon this page: http://ratfactor.com/cards/um. There, Dave Gauer describes how he has a shell script, um, that he can use as a man replacement to help remember how to use a command. Dave’s implementation uses the cards from his own Wiki, because the um pages there are “consolidated, I won’t forget about them, it’s easy to list, create, and update pages.” (To be honest, though, I can’t figure out where his um cards actually are, and what they look like.)
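Based on Dave’s description, such a script only needs a directory of plain-text cards plus a man fallback. Here’s a minimal sketch of the idea; the ~/.um directory, the card format, and the subcommands are my assumptions, not Dave’s actual setup:

```shell
# um: personal command cheat sheets, falling back to man.
um() {
    dir="${UM_DIR:-$HOME/.um}"
    case "$1" in
        ""|list) ls "$dir" 2>/dev/null ;;          # no args: list all cards
        edit)    "${EDITOR:-vi}" "$dir/$2" ;;      # um edit tar: create/update a card
        *)
            if [ -f "$dir/$1" ]; then
                cat "$dir/$1"                      # show my own notes on the command
            else
                man "$1"                           # no card yet: show the real manual
            fi ;;
    esac
}
```

Usage would be `um tar` to print your notes on tar (or the manpage if you have none), and `um edit tar` to jot down the invocations you keep forgetting.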

Continue reading …

Create FastMail Masked Email Addresses with maskedemail-cli

I’m a happy FastMail user. If you want to be a happy user, too, use my referral code for 10% off your first year (and I’ll get a discount, too!) → https://ref.fm/u21056816 I never used their Masked Email feature, though, because it’s so cumbersome to create these addresses from the web UI. I all but forgot about this feature until today, when I looked for something else in my settings.

Continue reading …

VIPER Added to the Wiki

I was adding “tech stacks” to my CV and figured I might as well link the tech to articles or overviews on my page.

The ‘wiki’ pages I added some time ago are the best places to summarize topics and embed a list of related posts. So I added a page about VIPER and briefly had a look at my old posts.

On a side note, it’s funny how this approach looked kind of popular for a while, but never really caught on. Another example that programming is a pop culture. VIPER never made the Top 10. While it’s still an approach that does what it set out to do, the shorter ‘VIP’ tried to supplant it later, but the much less opinionated [view] coordinators really took the stage. (Crazy that Soroush’s post is from January 2015, which is 8 years ago.)

Copilot for Xcode Works Okay

I’ve never touched GitHub Copilot in all these years, but everyone seems to be very happy with it. People recommend Copilot for all kinds of refactorings and repetitive tasks. So I figured I might give it a try and see how it works. Just yesterday, I used the Copilot Xcode plugin to write a lot of boilerplate for me. I can confirm it does its job.

Continue reading …