11 or so Android Ebook Reader Apps for Academic Writing Workflows: Annotations are Hard

Here’s my personal comparison of Android ebook readers for my Boox eink tablet.

I would love to add drawings as annotations. Ratta Supernote devices do this splendidly by storing the pencil input directly, without handwriting recognition. (Example here.) This is the gold standard. Everything else requires multiple apps (to draw a diagram, for example) and import/export (of notes or EPUB book locations), which is also acceptable, but not ideal.

There’s a one month-old overview on Reddit that summarizes the situation nicely: you apparently have to use a proprietary cloud/sync subscription, or you’re out of luck, because all annotation exports suck in some way.

{{TOC}}

Android Apps

Some links to reader recommendations

Onyx Boox Neo Reader (built-in)

Annotations from the Neo Reader app are exported like this, with placeholders for the actual content:

Reading Notes | <<BOOK FILE NAME>>AUTHOR
TIMESTAMP | Page No.: PAGENUMBER
MULTI-LINE QUOTE OF HIGHLIGHT
 【Note】 ANNOTATION TEXT
-------------------
...

The Onyx Boox device is a Chinese product, so of course there are these “thick” square brackets, called “lenticular brackets”.

Here’s some actual reading notes:

Reading Notes | <<Out of the Software Crisis - Baldur Bjarnason_2714c74f-8f8d-4b96-af08-d25255acc9f6>>Baldur Bjarnason
2023-03-29 15:48  |  Page No.: 22
Everything becomes harder and harder until the system is effectively thrown out and replaced. It happens in a number of different ways
-------------------
2023-03-29 15:48  |  Page No.: 23
Our systems die because we keep killing them
【Note】Lists of reasons
-------------------
2023-03-29 15:49  |  Page No.: 26
managers think their job is to extract work from the team. They think variations in performance are because of variations in the work they can extract from each employee
【Note】interessante Behauptung für die vielleicht noch Belege o. ä. im Buch folgen?
-------------------
2023-03-29 15:52  |  Page No.: 28
If something keeps going wrong, that means there isn’t a feedback loop telling the team that it’s going wrong. Shouting at them doesn’t count
【Note】change system pieces for productivity

Parsing this would be simple enough. The first line describes the book, and 19 (!) dashes separate annotations. tail -n +2 starts at the 2nd line of the text, and csplit can then split the rest into multiple files.

Example using gcsplit, which has a couple more options:

$ tail -n +2 export.txt | \
    gcsplit --prefix="note" - '/-------------------/1' "{*}"

I end up with 80 note*.txt files that way; it works.
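If you'd rather get structured data than a pile of files, the same format can be parsed in a few lines of Python. This is only a rough sketch based on the export shown above: the function name parse_neoreader_export is made up, and I'm assuming every annotation starts with a "TIMESTAMP | Page No.: N" line and that the note, if any, starts with 【Note】.

```python
import re

# Assumption: annotations are separated by a line made only of dashes
# (19 in the export above); 10 or more are accepted to be lenient.
SEPARATOR = re.compile(r"^-{10,}\s*$", re.MULTILINE)
# Header line of each annotation, e.g. "2023-03-29 15:48  |  Page No.: 22"
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s*\|\s*Page No\.:\s*(\d+)\s*$"
)
NOTE_MARKER = "【Note】"


def parse_neoreader_export(text):
    """Return a list of dicts with timestamp, page, quote, and note."""
    # Drop the first line, which describes the book (like `tail -n +2`).
    _, _, body = text.partition("\n")
    annotations = []
    for chunk in SEPARATOR.split(body):
        lines = [line.strip() for line in chunk.splitlines() if line.strip()]
        if not lines:
            continue
        match = HEADER.match(lines[0])
        if match is None:
            continue  # not an annotation block this sketch understands
        quote, note = [], []
        target = quote
        for line in lines[1:]:
            if line.startswith(NOTE_MARKER):
                # Everything from the marker onwards belongs to the note.
                target = note
                line = line[len(NOTE_MARKER):].strip()
            if line:
                target.append(line)
        annotations.append({
            "timestamp": match.group(1),
            "page": int(match.group(2)),
            "quote": "\n".join(quote),
            "note": "\n".join(note),
        })
    return annotations
```

Feed it the contents of export.txt and you get one dictionary per highlight, with an empty note where I didn't write one.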

Screenshot of exported scribble

The Boox’s killer feature is the amazing pen input. Visual annotations are exported as PDFs with screenshots of the pages.

In the picture above, you see a page with 2 regular highlights (the inverted texts), plus a scribble at the bottom.

Because the export doesn’t contain information about the page the annotation is on, I just tried a handful of these.

This limitation, again, makes the export useless to me. I can use the Neo Reader app to jump to the page with the annotation, and that works fine.

All in all, the annotation features are good, but as I wrote in another post, the export is meh, and I’m better off processing notes from the device on the device. In other words, like with paper-based books: open the book on the reader and then process all annotations one by one.

Handwriting tab in the book navigation view vs annotations

For this workflow, the most annoying thing is that handwriting/scribbles and textual annotations are on different tabs.

There’s no unified view, so I can’t go through all my annotations in one pass. If a scribble is on a page without any highlights, I could process all my highlights one by one and miss the scribble. In the screenshots, you’ll see that I have annotations on pages 30 and 32, and a scribble on page 31. I’d miss the scribble if I jumped from annotation to annotation. In other words, I need to keep track of the page numbers in both tabs to process them in sequence. Not a fan.

Moon+ Reader

I also tried the apparently very popular 3rd party app Moon+ Reader. Its export is of the following format, with placeholders in all-caps:

BOOK TITLE (Highlight: N, Note: M)
----------
◆ CHAPTER TITLE

◼︎ HIGHLIGHTED TEXT .... (ANNOTATION)

Instead of positions or page numbers, you get the current chapter title above the exported annotation. That can be helpful to find a highlighted section in the ePub file, if the chapters are short enough. Good luck with huge fantasy novels.

The TXT export also contains these (odd) unicode characters at the beginning of lines to separate things. Moon+ Reader also is a Chinese product as far as I know.

KOReader

KOReader and Moon+ are mentioned the most inside the Boox community, I’d say. I can see why KOReader is a fan favorite.

The UI is minimal and works very well with eink devices, that’s a big plus. No animations, clear shapes and lines, large touch targets. They nailed that.

Custom KOReader keyboard. It's high-contrast, but not the native one.

But KOReader decided to use a custom keyboard, so you can’t use the Boox keyboard switcher which supports pen input and handwriting recognition. Why?

List of all bookmarks, notes, highlights in one place

Textual annotations are okay. You can check them out in one place (via “Bookmarks”, where you can filter for highlights, notes, and page bookmarks) and editing notes works well enough.

KOReader shines when it comes to exporting highlights. There’s support for Joplin and Readwise, but I’m mostly interested in Markdown, TXT, HTML, and JSON.

Screenshot of KOReader’s HTML output

Markdown is serviceable:

# Out of the Software Crisis
##### Baldur Bjarnason

## It was great until it wasn’t
### Page 12 @ 25 April 2023 08:00 AM
*Churn is devastating for software quality as it destroys institutional memory and sabotages many of the fundamental mechanisms of programming, which require stability and consistency. Churn in manufacturing or physical product design isn’t nearly as disruptive as in software*

---
keep the programmers

TXT looks like it’d be easy to parse:

 Out of the Software Crisis

 It was great until it wasn’t

  -- Page: 12, added on Tue Apr 25 08:00:56 2023
Churn is devastating for software quality as it destroys institutional memory and sabotages many of the fundamental mechanisms of programming, which require stability and consistency. Churn in manufacturing or physical product design isn’t nearly as disruptive as in software
---
keep the programmers
-=-=-=-=-=-

But, of course, JSON is the easiest to process programmatically:

{
    "file": "/storage/emulated/0/Books/Out of the Software Crisis - Baldur Bjarnason_2714c74f-8f8d-4b96-af08-d25255acc9f6.epub",
    "created_on": 1684747456,
    "entries": [
        {
            "chapter": "It was great until it wasn’t",
            "page": 12,
            "time": 1682402456,
            "sort": "highlight",
            "text": "Churn is devastating for software quality as it destroys institutional memory and sabotages many of the fundamental mechanisms of programming, which require stability and consistency. Churn in manufacturing or physical product design isn’t nearly as disruptive as in software",
            "note": "keep the programmers",
            "drawer": "lighten"
        }
    ],
    "title": "Out of the Software Crisis",
    "number_of_pages": 250,
    "author": "Baldur Bjarnason",
    "md5sum": "d3f1220d162570c6b2256e12de868873",
    "version": "json/1.0.0"
}

Using jq syntax here to denote fields, '.entries[0].text' is the highlighted text, '.entries[0].note' is the note, and '.entries[0].drawer' the “style” of the annotation (there’s highlighting as “lighten”, underline, strike-through, and inverting for stark contrast).
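For my own processing, I'd flatten the export into one record per annotation. Here's a minimal Python sketch that relies only on the field names visible in the sample above; the function name koreader_entries is made up, and I'm assuming that "time" is a Unix timestamp in seconds (which fits the "08:00:56" local time in the TXT export).

```python
import json
from datetime import datetime, timezone


def koreader_entries(json_text):
    """Flatten a KOReader JSON export into one dict per annotation."""
    data = json.loads(json_text)
    rows = []
    for entry in data.get("entries", []):
        # Assumption: "time" is a Unix timestamp in seconds.
        stamp = entry.get("time")
        added = (
            datetime.fromtimestamp(stamp, tz=timezone.utc).isoformat()
            if stamp is not None
            else None
        )
        rows.append({
            "title": data.get("title"),
            "chapter": entry.get("chapter"),
            "page": entry.get("page"),
            "added": added,
            "text": entry.get("text", ""),
            "note": entry.get("note", ""),
            "style": entry.get("drawer"),  # e.g. "lighten"
        })
    return rows
```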

But again – page numbers!

At least the JSON output contains '.number_of_pages' so you can compute the percent offset yourself. Page 12/250 is 4.8%; Calibre’s web reader insists that the same book has 200 “pages”, so the location to go to would be 4.8% × 200 = 9.6, but that’s not the location I’m looking for. Searching for the phrase (“Churn is devastating…”), 6.0 is the location. So all in all, the page numbers are just as useless unless you find an ebook reader for your real computer that can make sense of the page numbering of KOReader. (Spoiler: Apple’s Books app doesn’t, and Emacs ereader-mode doesn’t either.)
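The arithmetic from the paragraph above, spelled out in Python. The function name is made up, and the whole thing assumes that page numbers scale linearly between readers, which, as I found out, they don't.

```python
def estimated_location(page, number_of_pages, target_total):
    """Scale a page number to another reader's location count.

    Returns the fraction of the book and the estimated target location.
    Assumption: both readers paginate proportionally (they don't, really).
    """
    fraction = page / number_of_pages
    return fraction, fraction * target_total


# KOReader says page 12 of 250; Calibre's web reader counts 200 "pages".
fraction, location = estimated_location(12, 250, 200)
print(f"{fraction:.1%} of the book, estimated location {location:.1f}")
```

The estimate lands at 9.6, while the phrase actually sits at 6.0, so treat the result as a rough neighborhood at best.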

The more time I spend on this, the more I wonder if I should’ve used PDFs (and a device with a larger screen) instead.

KOReader’s issue tracker lists 2 issues when you search for “cfi” (EPUB Canonical Fragment Identifiers), but CFI support is not going to happen at the moment.

For what it’s worth, KOReader also comes with

  • a terminal emulator
  • a text editor

eLibrary Manager

eLibrary Manager Basic is free but doesn’t offer annotations; the eLibrary Manager (Pro) version costs EUR 1.59 at the moment (2023-05-22) so I tried that. We’re still in “cost of a cup of coffee” territory here.

Its ePub reader tells me that Bjarnason’s book has 123 pages in total. That’s almost half of what the others say!

By default, page turns are animated. You can turn this off (good). Page refresh then still looks weird, though, as if the screen clears and then renders in vertical lines from left to right. I wondered if disabling animations just removes half of the transition, and this is a bug, but the 5.0.1 release notes revealed:

  • Due to reader rendering and animation issues that occur on different WebView versions, the following updates have been made:
  • Disable software layer rendering for WebView versions greater than or equal to 110.
  • Allow toggling of software layer WebView rendering through Advanced ePub Reader Setting, in case the default behaviour is not satisfactory.

The “Software Layer” setting was set to “Default”. (It does not indicate whether this means ‘on’ or ‘off’, confusingly.) I tried to disable it; no change. Then I tried to enable it, and now the animations, or rather the page turn glitches, are gone for good.

Reading is nice. It renders the book well, period.

Unlike the Neo Reader application,

  • it correctly displays list bullets vertically centered next to the first line of a list item (which I shouldn’t even need to mention, but the Neo Reader app is that wonky), and
  • it also correctly interprets a bit more advanced CSS, e.g. to display attributes from heading tags as chapter numbers. (Which is likely not useful that much, but Bjarnason’s book does use these.)

So as a good ebook reader, the money is not wasted.

(But KOReader does everything better in my opinion.)

List of bookmarks, highlights, and annotations; no export from there, though. You need to go to another menu for that

Selecting text is clumsy: long-press to open a context menu, select “Highlight”, then select text. That’s not going to age well. It’s the same process for “Bookmark/Note”: the text is highlighted and a note is added. You need to long-press and select “Bookmark/Note” again to edit the highlight, though. The primary interaction doesn’t seem to be on the textual level for some reason. It’s like I only have right-click menus to do anything, and that’s super weird for an ebook reader because what else do you interact with? Links, probably, but distinguishing tap from long-press would do the trick there.

You do notice how eink tablets aren’t the primary use case of this app.

Export options: Does not look like exporting annotations was the primary use case

Export of book information to Calibre requires the app author’s “Calibre Documents Provider” app, which costs the same as the reader. It bridges Calibre’s library to Android’s file system, more or less? I didn’t try that, because I already do sync my Calibre books with, well, Calibre Sync (on iOS and Android). If annotations were exported, I might be interested in switching. But updating book metadata isn’t useful (to me). So I passed.

The export is a JSON file with these contents:

[
   {
      "title": "Out of the Software Crisis",
      "creator": [
         "Baldur Bjarnason"
      ],
      "bookmarks": [
         {
            "page": 12,
            "excerpt": "build software development",
            "note": "bookmark note test"
         }
      ],
      "highlights": [
         {
            "page": 11,
            "excerpt": "software ship begins to si",
            "text": "software ship begins to sink. ",
            "colour": 0
         }
      ]
   }
]

On the plus side, the JSON is very minimal. I don’t care about the root-level separation of “bookmarks”/“highlights”, though. And again, it has useless page numbers.

Please do note that the root object of this file is a JSON array. The file is called export.1684755174801.json. So it will contain all book annotations, I guess? Oof.
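Since I don't care about the bookmarks/highlights split, I'd merge both lists into one per book. A minimal Python sketch, using only the keys from the sample above; the function name and the output keys are made up for illustration.

```python
import json


def flatten_elibrary_export(json_text):
    """Merge bookmarks and highlights of every book into one sorted list."""
    items = []
    # The root object is an array with one element per book.
    for book in json.loads(json_text):
        creators = ", ".join(book.get("creator", []))
        for kind in ("bookmarks", "highlights"):
            for entry in book.get(kind, []):
                items.append({
                    "title": book.get("title"),
                    "creator": creators,
                    "kind": kind.rstrip("s"),  # "bookmark" or "highlight"
                    "page": entry.get("page"),
                    # Highlights carry "text"; bookmarks only an "excerpt".
                    "text": entry.get("text") or entry.get("excerpt", ""),
                    "note": entry.get("note", ""),
                })
    # Sort by book, then by the (sadly not very meaningful) page number.
    items.sort(key=lambda item: (item["title"] or "", item["page"] or 0))
    return items
```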

ReadEra

ReadEra highlights in one list; the dark-by-default UI doesn't work well on eink

This app does install, it does find ebooks, and it does do highlights. But the format is:

BOOK TITLE
AUTHOR

HIGHLIGHTED TEXT 1
--
ANNOTATION 1

*****

HIGHLIGHTED TEXT 2

*****

HIGHLIGHTED TEXT 3
--
ANNOTATION 3

No page or location indicators at all. The line-based output and the separators (two dashes -- between highlighted text and the annotation, and five asterisks ***** between highlights/quotes) would make it simple to split the output into pieces. That’s somewhat useful. But without any location indicators, this is mostly good for exporting highlighted quotes. Otherwise, you’d have to add the location into your annotation manually.
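Splitting really is simple. Here's a rough Python sketch under the assumptions that the first two lines are always title and author, that ***** sits alone on its line between highlights, and that -- sits alone on its line before an annotation. The function name parse_readera_export is made up.

```python
import re


def parse_readera_export(text):
    """Split a ReadEra TXT export into title, author, and highlights."""
    lines = text.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least a title and an author line")
    title, author = lines[0].strip(), lines[1].strip()
    body = "\n".join(lines[2:])

    highlights = []
    # Five asterisks on their own line separate the highlights.
    for block in re.split(r"^\*{5}\s*$", body, flags=re.MULTILINE):
        block = block.strip()
        if not block:
            continue
        # Two dashes on their own line separate quote and annotation.
        parts = re.split(r"^--\s*$", block, maxsplit=1, flags=re.MULTILINE)
        quote = parts[0].strip()
        note = parts[1].strip() if len(parts) > 1 else ""
        highlights.append({"quote": quote, "note": note})
    return {"title": title, "author": author, "highlights": highlights}
```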

Foliate

The open source GTK app Foliate was mentioned on HackerNews. I don’t know why it was mentioned in that thread, because it’s not available on mobile/Android. But it’s looking like a great option for Linux. Annotations are stored as JSON with EPUB Canonical Fragment Identifiers instead of page numbers. Great idea. I’d love to have that.

EPUB reader (by Bum bum apps)

The name, “Foliate”, brought me to the FolioReader GitHub Team: they offer Kotlin/Android and Swift/iOS SDKs. Not a lot of releases in recent years, but the absence of PDF support and focus on EPUB and highlighting got me interested. The FolioReader-Android repository does collect Play Store links to apps using the library, fortunately.

EPUB reader (by Bum bum apps) is a more recent addition to that list, uses the FolioReader library, and is not a comics reader app.

Reading is limited to scrolling; there are no pagination features. Scrolling can work on the Boox Nova Air2 if you change the refresh mode a bit. But it’s not optimal.

You can highlight text, but grayscale eink isn’t supported well, and export is per-highlight: so you can only share quotes, more or less. From the book overview, highlights behave like bookmarks.

There are no textual annotations.

Lithium

This app was recommended here and there, and there’s a very promising HTML annotation export (yes, the export is a structured format!), plus open-source tools to extract info from the export – but the app is not compatible with the Nova Air2, it seems. I can’t install it from the Play Store.

Installing the .apk from elsewhere works, and I can open the app, but I can’t read any book. It won’t display content and remains stuck in the “loading” phase.

Study Comfort

From the app announcement on Reddit and the screenshots, I was looking forward to testing this, but apparently this never left the first public beta stage. Annotations are very bare-bones, and inserting a textual note is error-prone (e.g. “invalid note position” error). So this didn’t work.

Subscription/cloud services I didn’t even try

  • Google Play Books. Because Google.
  • BookFusion: can’t miss this since the founder posts links all over HN and Reddit. (Nothing wrong with that.) CSV and Markdown export sound good. Subscription pricing doesn’t.
  • Readwise’s Reader, because it’s also a SaaS.
  • Zotero, a platform I don’t dislike, syncs your library and annotations if you find an app that speaks ‘Zotero’. I already have my ebooks synced on my NAS and made available via Calibre, so I don’t really want another library cloud storage. Also, Zotero’s annotation support seems to be limited to PDF, so no ePub and thus no “reflow” of content.
  • A lot of apps do way more than just book reading and highlighting; so I wasn’t interested.
  • iOS apps are aplenty;
    • MarginNote 3, for example, would be available on iOS and macOS (also via Setapp). That looks like a well-made application, but doesn’t help in this situation.
    • Polar doesn’t do EPUB, yet, but has Anki flash-card export. Not available on Android.

Emacs

… yeah, sorry.

The Boox tablet does run Termux, and you can install Emacs 28 there and use the terminal version.

There’s also the real native Emacs port, offering version 30 (currently being worked on) in a GUI.

Emacs is useless without a keyboard, of course. You can use the GUI version to open .epubs with Emacs from the toolbar, but you can hardly switch the mode to e.g. ereader-mode. If you could, you could annotate your ebook using org-mode notes, maybe.

This is a very nerdy idea, and I’d love to try that once, sometime, but it’s not solving any of my problems. (Not like Emacs ever does, you might say …)

Conclusion

Ryan West was/is on a similar journey. Zotero seems to be superb for PDF annotations. But for ePub, there’s no annotation standard, and each app is doing things differently.

I checked Open Annotation in EPUB, Draft Specification (23 July 2015): it looked like it never went anywhere, but folks on the W3C EPUB 3 Community Group responded kindly on the mailing list. It’s being discussed on the public GitHub repo a lot, and Hypothes.is is mentioned as a notable implementation. But no wide-spread adoption, that’s certain.

On large (or rather: huge) Boox devices, you could use split screen to combine highlights in your EPUB/PDF files on one side with hand-written notes on the other. This would be similar to reading a real book with a real note-pad. But you need absolute page numbers as references to make both media work together. EPUB readers don’t gel with this, so that’s not the workflow I’m looking for. You could use a reader that produces the Calibre-compatible location identifiers instead of page numbers, but it’s still a chore. Page numbers in books work better for this.

So what’s the verdict, then?

I have no happy answer.

KOReader is a very good eink-enabled EPUB reader. It’s open source, so maybe one can tinker with that. So it’s either that, with the custom keyboard I hate, or the Boox’s native Neo Reader, with the weird vendor lock-in and proprietary format.

The Supernote is unmatched, software-wise, thus far. But the Boox hardware, color pencil input, and Android base are promising.

BOOX Vector PDFs Are Already Colored

After my interesting journey into replacing greyscale values in the vector PDF export of Onyx BOOX Notes, some brainiac on Mastodon pointed out that there are non-grey colors I could use. (Seriously, thank you Jeroen :))

Some default pencil colors, exported from the greyscale device

So the solution is to simply use colors, not shades of gray!

What the BOOX thinks I'm seeing (screenshot to the left) vs. what I actually see (photo to the right).

That’s even less work for me, which is good, but I’m still a bit sad that the PDF recoloring experiments have now ended.

Colorize Onyx BOOX Notes Vector PDFs (Really Rough Edition)

Sacha Chua has this nice Python script to colorize her SuperNotes sketches. Can’t be that hard to apply color replacement to the notes from BOOX devices, can it?!?!

The example note

I had this exported vector PDF lying around:

Screenshot of the original PDF file in all its greyscale glory

A single page, two brush colors. Ideal testing ground.

If you’re interested, you can download the original PDF here.

Failure: Replace colors in rasterized image

Unlike the app behind Sacha’s sketches, the BOOX Notes app can produce vector PDFs (which I used here) instead of rasterized images. Converting the PDFs to rasterized images produces smooth, antialiased edges. Her notes have pixelated, hard edges that are easier to color match.

That’s not a great start to perform color-based replacements!

My color picker tells me the mid-tone gray is ~0.52 grayscale, or #858585 hex RGB. And black is black.

Digging through a couple of StackOverflow answers, I found that ImageMagick can perform color replacements with fuzzy input color matching, which sounded ideal for the antialiasing everywhere:

$ magick grayscale.png \
    -colorspace sRGB -type truecolor \
    -fuzz 5% -fill blue -opaque "#858585" \
    -fuzz 10% -fill red -opaque black \
    colorized.png

Replacing the mid-tone gray with blue and the black shapes with red doesn't look good and overlaps in some places already

The result isn’t that good-looking. Fuzzy matching on tonal value alone is hard because a gray fading into the background color is similar to a black fading into the background color in some places.

ImageMagick would be a great choice to colorize a monochrome picture with one other color, or a color gradient: then you could match the tonal values from 0.0 (black) to 1.0 (white) to positions on that gradient using the +level-colors operation.
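To illustrate what “matching tonal values to positions on a gradient” means, here's a tiny Python sketch. It is a plain linear interpolation between two RGB colors, not ImageMagick's actual implementation, and the function name is made up.

```python
def tone_to_color(tone, dark, light):
    """Map a tonal value (0.0 = black, 1.0 = white) onto a two-color gradient.

    `dark` and `light` are (r, g, b) tuples with components from 0 to 255.
    Assumption: a straight linear blend per channel is good enough.
    """
    tone = min(1.0, max(0.0, tone))  # clamp out-of-range input
    return tuple(round(d + (l - d) * tone) for d, l in zip(dark, light))


blue, white = (0, 0, 255), (255, 255, 255)
for tone in (0.0, 0.2, 1.0):
    print(tone, tone_to_color(tone, blue, white))
```

With a gradient like this, antialiased edge pixels simply end up as lighter tints of the same color, which is why it works for one color but not for telling gray strokes from black ones.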

Success: Replacing vector shape colors in the PDF

Selecting a brush stroke

I noticed in Preview that I could select brush strokes, so they weren’t baked into the PDF “flatly”.

It took me a while to figure out how to treat these “sub-images”, or “embedded images”, or whatever they were. Preview’s inspector revealed that they are annotations, and that the fill color is stored in the PDF somehow.

With a gray shape selected, I found that the PDF is comprised of a ton of annotations. And apparently I modified some by touching them, oops.

Annotations can be added and edited by the user, too; but Preview wouldn’t let me change the color of these annotations. Popular open source PDF annotation tool Skim doesn’t show any of these shapes as annotations at all. Affinity Designer 2 doesn’t recognize these, either. I then sent them to my fiancée: Adobe Illustrator on Windows also doesn’t recognize any of these, but looking at the PDF in Thunderbird reveals the strokes animated. Stroke by stroke. That looked pretty cool. To be frank, I suspect this was merely incrementally rendering all the strokes using a slow engine. Still, nice effect! I didn’t find any obvious timing information that controls this, so exporting video from this is a task for another day.

Next, I did the only sensible thing to inspect PDFs: open the file in a text editor.

Editing PDFs in Emacs

In doc-view-mode, which renders PDFs, press C-c C-c to edit the PDF’s source code. (Ok, Emacs is not necessary; TextMate would be just as fine to display the contents.)

Looking through the syntax that was not binary garbage, I looked for a repeating pattern that might be the annotation object syntax. There, I eventually found a string of numbers next to the letter “C”, 0 0 0, and that looked suspiciously like a color triplet for “black”. Scanning further down, I found 0.501961 0.501961 0.501961, which surely was the mid-tone gray color!

Replacing these instances and then saving the result broke the PDF. So to edit the PDF file, one needs to uncompress it and then perform replacements.

$ brew install pdftk-java
$ pdftk orig.pdf output uncompressed.pdf uncompress

That blew a 700 KiB PDF up to 12 MiB.

Searching for 0 0 0 to indicate “black”, I found a lot of previously compressed object streams that started like this:

/GS gs
0 0 0 RG
1 j 1 J
2.02625 w 376.49 1730.42 m 375.853 1730.42 l S
2.10902 w 375.853 1730.42 m 375.215 1730.42 l S
...

“GS”?

GhostScript?

Chapter 4 (“Graphics”) of the GhostScript reference (PDF) says:

The CS and SC operators select the current stroking color space and current stroking color separately; RG sets them in combination.

So this sets the color space “DeviceRGB” and the stroke color “black”. (This also teaches us that these are strokes with varying line widths, not filled forms. The lines that follow remind me of SVG and could well be path “move” commands at a certain width with a stroke action.)

Replacing 0 0 0 RG with 0 0 1 RG and CS 0 0 0 with CS 0 0 1 was simple enough to change everything to “blue”. (I don’t yet understand why >90% of the strokes use the 0 0 0 RG format, but a handful have CS 0 0 0 or the color array /C [0 0 0] form.) Anyway, replacing these occurrences in the uncompressed state did the trick.

On this journey, I also found a Python script to replace annotation colors using Regex. And a Python 2 version. Both replace any color with one single color value. I want to perform a mapping, so a simple string replacement is currently the easiest option. (Until I write a script in the next installment, of course!)
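Until then, a color mapping for the most common case could be sketched like this in Python. Loud assumptions: it only touches the "R G B RG" stroke operator at the start of a line (not the CS or color array forms), the triplets must be spelled exactly as they appear in the uncompressed file, and the COLOR_MAP values are placeholders I picked for the demo. Shorter replacements are padded with spaces so the stream keeps its byte length; I haven't tested how pdftk copes when lengths change.

```python
import re

# Hypothetical mapping from source stroke colors to replacement colors,
# written exactly as the triplets appear in the uncompressed PDF.
COLOR_MAP = {
    b"0 0 0": b"1 0 0",                        # black -> red
    b"0.501961 0.501961 0.501961": b"0 0 1",   # mid-tone gray -> blue
}

# Three operands at the start of a line, followed by the RG operator.
STROKE_COLOR = re.compile(rb"(?m)^(\S+ \S+ \S+)( RG\b)")


def recolor_strokes(pdf_bytes, color_map=COLOR_MAP):
    """Replace 'R G B RG' stroke colors according to color_map."""
    def swap(match):
        triplet = match.group(1)
        replacement = color_map.get(triplet, triplet)
        # Pad with spaces (insignificant between PDF tokens) so that a
        # shorter replacement does not change the length of the stream.
        replacement = replacement.ljust(len(triplet))
        return replacement + match.group(2)

    return STROKE_COLOR.sub(swap, pdf_bytes)
```

You'd read uncompressed.pdf in binary mode, pass the bytes through recolor_strokes, write them back out, and compress the result afterwards.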

To save space, compress the result:

$ pdftk uncompressed.pdf output recompressed.pdf compress

This is actually 100 KiB smaller than the original. Nice.

Result

Replacing the color data in the uncompressed PDF produces this:

Before and after the color conversion

This looks really good! I don’t want to keep these colors, but as a tech demo, that’s great.

The mid-tone gray stroke was implemented somewhat translucent, and this still shows through in the re-colored output where the arrow pieces overlap. Ideally, I want flat, non-transparent colors for these.

Or maybe I don’t? If I remember correctly, I used the text marker tool, which should probably continue to render in a see-through manner.

I don’t understand half of the PDF’s “source code”, so I’ll be digging into that a bit more in the future to avoid making stupid mistakes with my PDFs. I do know now that I’m not interested in writing a PDF renderer from scratch, though!

Up next: automating this to color-match the non-black strokes. The BOOX Notes app has e.g. a “yellow” stroke, which I can’t see because it’s a greyscale device, but matching these named colors to real, well, colors would be nice.

Update 2023-05-25: Turned out that I was right and that the color names like “navy blue” or “yellow” are exported as colors. So use these!


If you want to check out these PDFs in your own text editor, here they are:

Boox NeoReader Annotation Export Is Meh

When you use the built-in “NeoReader” on a Boox tablet, you get the best pencil input and quite good highlighting and annotation support.

If you don’t have an Onyx Boox eink tablet with that app installed, don’t bother looking for it on the Android/Google Play Store – that app is not available anywhere else, it seems. And the app of the same name on the Play Store is a QR Code Reader.

The built-in NeoReader app’s annotation export is very disappointing, though.

Here’s the one highlighted section from “Do Androids Dream of Electric Sheep” I have. It’s the TXT export of one highlighted section from the ePub version of that book:

Reading Notes | <<Do Androids Dream of Electric Sheep_ - Dick, Philip K__33d2f489-4536-43a1-ab54-5b581fb9efb1>>Dick, Philip K.
2023-01-19 21:50  |  Page No.: 148
How can I save you," the old man said, "if I can't save myself?" He smiled. "Don't you see? There is no salvation."
 "Then what's this for?" Rick demanded. "What are you for?"
 "To show you," Wilbur Mercer said, "that you aren't alone. I am here with you and always will be. Go and do your task, even though you know it's wrong."
 "Why?" Rick said. "Why should I do it? I'll quit my job and emigrate."
 The old man said, "You will be required to do wrong no matter where you go. It is the basic condition of life, to be required to violate your own identity. At some time, every creature which lives must do so. It is the ultimate shadow, the defeat of creation; this is the curse at work, the curse that feeds on all life. Everywhere in the uni
-------------------

It’s nice that there is a timestamp (2023-01-19 21:50) per highlight.

But this page number is useless – this is an ePub, which means the number of pages 100% depends on the font size. At least in this case, because there’s also the convention of denoting positions as blocks of 1024 characters in the absence of a page-map file.

For example, when I use the Calibre server to read the ePub edition of the book, the cover page is denoted as Current position: 1.0 / 291. When I jump to position ‘148’, I’m nowhere near the highlighted section. If I do go to the highlighted section, Calibre reports this as 213.3 / 291. (The decimal point is a bit confusing to me; why is that kind of precision even needed?)

This means:

  • Annotation export is useful for quotes. I can live with not knowing the location in an ebook for quotes. Just copy the part into a note in my Zettelkasten, reference the source, done.
  • But the annotation export is useless for ‘academic workflows’. I can’t use the export to get to the highlighted location in another ebook reader app, so I need to use the device to check what’s on the page and around it for context.

So I’m stuck with processing information in an ‘academic’ workflow by using the device itself. Just like I would have a paper-based book open next to me.

I did a whole series on processing David Epstein’s book “Range” if you’re interested.

That doesn’t defeat the purpose of the reader. It’s perfectly fine to prop up this digital book the way I’d prop up a physical one. But it diminishes its potential.

If only the export used character offsets, I could imagine writing my own scripts to jump to highlights in .epub files. But since I have no clue how to make sense of a page location, the TXT export doesn’t help at all.

None of this is an issue with PDF files, of course: they have fixed page contents and thus page numbers do make sense. (But print PDFs don’t fit on the tiny Nova Air2’s screen.)

Safari (for Mac) URL Scheme

This will open my website in Safari on macOS, no matter your default browser:

x-safari-https://christiantietze.de

Try it!

Depending on your platform and browser, this may not be clickable or not do anything (I have no clue what Android browsers would do with this).

I was looking for this because I was dissatisfied with how I tabbed into Safari in a screencast recently.

With a link like this, I can open Safari from my notes, e.g. from within The Archive during demonstrations.

For mobile Safari on iOS, it’s apparently com-apple-mobilesafari-tab:https://christiantietze.de, but I haven’t tested that yet. Using Shortcuts you can reveal this.

All of Safari’s URL Schemes

There haven’t been a lot of useful references on the web, so I looked myself; how do you figure out which URL Scheme an app supports?

  1. Locate the .app bundle; for Safari, since it’s in a protected system folder, it’s easiest to search for “Safari” in Spotlight (⌘+␣ (that’s the space key)) and then reveal the match in Finder (⌘+R).
  2. Right-click the app, “Show Package Contents”;
  3. inside the Contents/ folder,
  4. view the Info.plist file in a text editor of your choice.
  5. Search for CFBundleURLTypes. This key is associated with an array of values (you’ll notice these as they are indented a bit to the right).

This is the relevant section from Safari’s Info.plist as of today in Ventura:

<key>CFBundleURLTypes</key>
<array>
  <dict>
	<key>CFBundleURLName</key>
	<string>Web site URL</string>
	<key>CFBundleURLSchemes</key>
	<array>
	  <string>http</string>
	  <string>https</string>
	</array>
	<key>LSHandlerRank</key>
	<string>Default</string>
	<key>LSIsAppleDefaultForScheme</key>
	<true/>
  </dict>
  <dict>
	<key>CFBundleURLName</key>
	<string>Local file URL</string>
	<key>CFBundleURLSchemes</key>
	<array>
	  <string>file</string>
	</array>
	<key>LSHandlerRank</key>
	<string>Default</string>
  </dict>
  <dict>
	<key>CFBundleURLName</key>
	<string>Safari Start Page</string>
	<key>CFBundleURLSchemes</key>
	<array>
	  <string>x-safari-https</string>
	</array>
  </dict>
  <dict>
	<key>CFBundleURLSchemes</key>
	<array>
	  <string>prefs</string>
	</array>
	<key>LSHandlerRank</key>
	<string>None</string>
  </dict>
  <dict>
	<key>CFBundleURLSchemes</key>
	<array>
	  <string>x-webkit-app-launch</string>
	</array>
	<key>LSHandlerRank</key>
	<string>None</string>
  </dict>
</array>

Apart from http and https, Safari also registers for the file protocol.

Then you see x-safari-https, prefs, and x-webkit-app-launch.
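If you don't want to read the XML by hand, the lookup can be sketched as a small shell function. This assumes the Info.plist is stored as plain XML with one tag per line, like the excerpt above; the name list_url_schemes is made up, and a binary plist would need to be converted first:

```shell
# Print every URL scheme listed under CFBundleURLSchemes in an XML Info.plist.
# Assumes one XML tag per line, as in the excerpt above.
list_url_schemes() {
    awk '
        # The key line starts a block of scheme names.
        /<key>CFBundleURLSchemes<\/key>/ { in_schemes = 1; next }
        # The closing </array> ends that block.
        in_schemes && /<\/array>/        { in_schemes = 0 }
        # Inside the block, strip the <string> tags and print the value.
        in_schemes && /<string>/ {
            gsub(/^[ \t]*<string>|<\/string>[ \t]*$/, "")
            print
        }
    ' "$1"
}
```

Usage: list_url_schemes "/path/to/Some.app/Contents/Info.plist" prints one scheme per line.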

Detach Xcode Console -- via Terminal

I have been complaining on social media about the Xcode Console sticking to the bottom, and how I’d prefer a horizontal split etc.

And there have been good suggestions to e.g. open a new Console tab or window. (Thanks Dominik Hauser for the tip!) That didn’t stick because I needed to adjust the Xcode behaviors and sometimes that didn’t compose well with what I’ve been doing.

But what about using Terminal?

Terminal is a perfect console and it’s a separate app and its window is, well, ‘detached’ from Xcode by default.

The option is under the app Scheme settings, of all locations, below the localization options. I never looked there until I fiddled with localization today, and only when something did not work did I check out all the settings here in earnest. (If “banner blindness” on the web is a thing, there surely must be a “settings blindness” in Xcode – because there are so many!)

Scheme editor ▶ Run ▶ Options -- there, at the bottom, you can switch to Terminal

As always, there are downsides, of course:

  • Xcode opens a new Terminal window every time, so it starts with the default settings, font, and size;
  • Quitting/Terminating the app will close the Terminal window;
  • The Xcode Console will not get the output, so there’s no log in your build history’s “Run” entries (see “Report navigator”, ⌘+9), i.e. the thing you could open in another Xcode window to detach the Console.
  • LLDB still runs in the Xcode Console, not the Terminal window. That means you’d need to check the Terminal for any debug log output, but for debugging you’re still stuck in the small console pane at the bottom. (Separate “Run” tabs/windows show output but aren’t interactive.)
Terminal output silences the console (bottom right) and any “Run” windows you might have open (right).

I’m not using Terminal.app, I’m using iTerm 2 or the eshell, so I might get by with customizing the default Terminal profile to start with a palatable font and window size.

But the limitations above really make this more annoying than useful to me.

Also, from 6 test runs, one failed. Frank Gregor had no success getting this to work at all.

All in all, this is not the saving option we’ve been looking for.

Wrap HTML Tables in Figures using Nanoc and Kramdown

I noticed that on mobile phones, wide tables wouldn’t scroll horizontally – instead, they broke out of the content container and everything looked a bit wonky.

My goal: wrap <table> in <figure> and add figure { overflow-x: scroll; } to make the table scrollable inside its container.

Initially, I wanted to extend kramdown and, after the Markdown block element for tables has been added, wrap the whole thing in a <figure>. Maybe add a new block element to the representation of the Markdown document.

Even though kramdown is a very mature library, I couldn’t find an example for that, though. The next best thing I found was to add my own parser and then decorate the #parse function with a post-processing step.

I was way too lazy for that (my own website takes a while to compile so I would need a test project and then fiddle with the setup and interpret errors during compilation etc. etc.) – so instead I wrote a simple post-processor that takes the generated HTML and wraps any <table> tag in a <figure> tag using Nokogiri:

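# lib/filters/wrap_table_in_figure.rb
require 'nokogiri'

class FigurizeTableFilter < Nanoc::Filter
  identifier :figurize_table

  def run(content, params={})
    # Skip parsing entirely when the page contains no table.
    return content unless content.include?("<table")
    doc = Nokogiri::HTML(content)
    # Wrap every <table> node in its own <figure> element.
    doc.search('table').wrap('<figure/>')
    doc.to_html
  end
end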

I exit early when no tables are present to avoid parsing the document for nothing.

The rest is a standard Nanoc Filter that I call for all my posts using filter :figurize_table in my Rules file.

Website Rank in Common Crawl's C4 AI Dataset

The Washington Post published an article, “Inside the secret list of websites that make AI like ChatGPT sound smart”, that visualizes a bit of the C4 data model via websites that have been crawled.

Scroll down a lot to “Is your website training AI?” (direct anchor link), and you can enter a domain (e.g. yours!) to see how much that page has contributed.

Rank     Domain                 Tokens  Percent of all Tokens
11       washingtonpost.com     55M     0.04%
106,350  forum.zettelkasten.de  190k    0.0001%
328,137  christiantietze.de     71k     0.00005%
718,385  zettelkasten.de        32k     0.00002%

Can’t beat a community effort like a forum, no matter how much or how long I post :)

Got the link and the idea to check my sites from jwz.org.

Update 2023-04-20: I should’ve checked the Post’s claims, first; they say it’s Google’s dataset, but that seems to be wrong. Here’s a short summary of the C4 dataset I could find:

What is C4?

The Post links to a research paper. That paper and the article itself reference C4, which stands for “Colossal Clean Crawled Corpus” and is built from Common Crawl data (https://commoncrawl.org/). The dataset is available on Hugging Face.

Hugging Face’s “Models trained or fine-tuned on c4” sidebar contains Google-owned repositories, but Common Crawl doesn’t appear to be affiliated with Google directly. Wikipedia indicates that the Common Crawl foundation was founded by Gil Elbaz, whose company “Applied Semantics” was acquired by Google. But the Common Crawl foundation itself wasn’t, as far as I can tell.

The dataset is likely used by AI, including Google’s, but that’s different.

To be fair, I got the idea that this is Google’s from JWZ’s post title, “I’m the Googlebot. I’m here to index you. Please hold still.” first, and then didn’t get rid of this idea when writing this post.

Sorry for the confusion – I updated the post’s title!

Fetch Personalized Command Explanations with 'um' from Your Terminal

I stumbled upon this page: http://ratfactor.com/cards/um

Dave Gauer describes how he has a shell script, um, that he can use as a man replacement to help remember how to use a command. Dave’s implementation uses the cards from his own Wiki, because the um pages there are “consolidated, I won’t forget about them, it’s easy to list, create, and update pages.” (To be honest, though, I can’t figure out where his um cards actually are, and what they look like.)

Since I’m using a Zettelkasten for all my stuff, I have notes that contain examples for commands I don’t use a lot, like ffmpeg. That’s just as powerful.

File Naming Scheme

Let’s say I want a result for um sed that produces helpful sed explanations and examples.

How do I identify the corresponding note?

  • By file name:
    • I should ignore the note’s ID;
    • the note’s file name should contain explicit markers to identify the command;
    • the file name should also be human-readable.
  • Via index:
    • Map the command name to an overview note.
    • I could create a (Multi)Markdown table of key–value-pairs.

The file name option sounds simplest to implement: No need for a Markdown table parser. Then again, Marin Todorov has a package for that.

The naming scheme? Basically this:

ZETTELID § `COMMANDNAME` rest of readable title

For example:

202304181909 § `sed` command.md
  • The §, per my convention, denotes an overview of sorts.
  • Surrounding the program name in backticks works with my Markdown expectations.
  • The human-readable string can be anything else.

I’d need to grep for "§ " followed by the search term in backticks. That’s simple:

$ find . -name "*§ \`sed\`*"
./202304181909 § `sed` command.md

Making a function of this:

um() {
    find . -name "*§ \`$1\`*"
}

Then um sed produces the same filename.

But I don’t want filenames; I want content!

Pretty Printing of Markdown

The potato version would simply use cat or less (or the PAGER environment variable, if set).
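A minimal sketch of that potato version could look like this. The function name um_plain and the UM_NOTES_DIR variable are made up for illustration, and I'm assuming PAGER holds a single command name without extra arguments:

```shell
# Potato version: show the first matching note with $PAGER, or cat as fallback.
# UM_NOTES_DIR is a hypothetical variable; it defaults to the current directory.
um_plain() {
    note=$(find "${UM_NOTES_DIR:-.}" -name "*§ \`$1\`*" | head -n 1)
    if [ -z "$note" ]; then
        echo "um: no note found for '$1'" >&2
        return 1
    fi
    # Assumes PAGER is a bare command such as "less", without arguments.
    "${PAGER:-cat}" "$note"
}
```

Usage would be the same as before, e.g. um_plain sed.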

I could also use mdcat to show rendered Markdown inside the Terminal. (Notably absent: tables and footnotes.)

um() {
    find ~/Archiv/ -name "*§ \`$1\`*" | head -n 1 | xargs -I {} mdcat {}
}

I’m limiting the output of find to the first result, just to be sure, and then pass the filename to mdcat.

Note that I changed the search path from . (the current directory) to my notes directory. The trailing slash is required.

Surprising output? Read on

The output of um sed on my machine then is this:

┄202304181909 § sed command

#sed #shell

{{202211120933 Use sed and xargs -n2 to rename files.md}} {{202106211122 Use sed to prefix STDOUT.md}} {{201805011641 Use sed for stateful
replacements in text files.txt}}

This is just a concatenation of 3 filenames, you say?

You’re correct!

That’s because I need to figure out a way to transclude this overview to show all the atomic notes at once in a concatenated output.

Here’s a better example: um tmux.

That contains a couple of examples in the note itself.

File transclusion is tricky and even less standardized, so I’ll keep that as an exercise for future-me.
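As a starting point, here is a rough sketch of what a naive transclusion could look like. It is not a finished solution: the name um_transclude is made up, and it assumes every {{...}} reference sits on a single line and names a file in the same directory as the overview note:

```shell
# Naive transclusion: print the overview note, then append every note it
# references via {{filename}}.
# Assumes each reference is on one line and points into the overview's folder.
um_transclude() {
    overview=$1
    notes_dir=$(dirname "$overview")
    cat "$overview"
    grep -o '{{[^}]*}}' "$overview" \
        | sed 's/^{{//; s/}}$//' \
        | while IFS= read -r name; do
            # Silently skip references that don't resolve to a file.
            if [ -f "$notes_dir/$name" ]; then
                printf '\n--- %s ---\n' "$name"
                cat "$notes_dir/$name"
            fi
        done
}
```

The result could be piped into mdcat or a pager instead of printing it directly.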

Create FastMail Masked Email Addresses with maskedemail-cli

I’m a happy FastMail user.

If you want to be a happy user, too, use my referral code for 10% off of your first year (and I’ll get a discount, too!) → https://ref.fm/u21056816

I never used their Masked Email feature, though, because it’s so cumbersome to create these addresses from the web UI. I all but forgot about this feature until today, when I looked for something else in my settings.

A quick research then produced a CLI application to manage Masked Emails: https://github.com/dvcrn/maskedemail-cli

Here’s how to set this up to make creating generated email addresses as easy as mm create!

Create a FastMail API Token

First, create an API token for the CLI app: https://app.fastmail.com/settings/security/integrations

Heads up: the FastMail settings have changed recently and API Tokens aren’t where the docs say they are. Go to Settings > Privacy & Security and pick the Integrations tab. API tokens are at the bottom.

Update 2023-04-18: FastMail support reacted fast and updated their docs to better reflect the new location.

I called my token “dvcrn/maskedemail-cli”. Grant it read-write access to “Masked Email”. (Read-write is the default.)

Now install the CLI tool.

Install maskedemail-cli

It requires Go:

$ brew install golang  # if you need Go
$ go install github.com/dvcrn/maskedemail-cli@latest

Go, by default, installs these binaries to ~/go/bin. I needed to add this to my path in my ~/.zshrc:

export PATH=$PATH:$HOME/go/bin

You can check that maskedemail-cli works by experimenting with examples from its README. I’m going ahead with the setup instead.

Store your FastMail API Token in Keychain

Add a password for maskedemail-cli to your Keychain like so:

$ security add-generic-password -a $USER -s "maskedemail-cli" -w "<<API TOKEN HERE>>"

Verify the password is readable:

$ security find-generic-password -ws "maskedemail-cli"

I asked for ideas on Mastodon and got the excellent advice not to pass the token (or any password for that matter) as a parameter, ever. Instead, favor the environment variables offered by maskedemail-cli.

$ maskedemail-cli
Usage of maskedemail-cli:
Global Flags:
  -accountid string
    	fastmail account id (or MASKEDEMAIL_ACCOUNTID env)
  -appname string
    	the appname to identify the creator (or MASKEDEMAIL_APPNAME env) (default: maskedemail-cli)
  -token string
    	the token to authenticate with (or MASKEDEMAIL_TOKEN env)
...

So we’ll be using MASKEDEMAIL_TOKEN.

Access the Keychain to populate the environment variable just before calling the actual command:

$ MASKEDEMAIL_TOKEN=$(security find-generic-password -ws "maskedemail-cli") maskedemail-cli session
your@email.com [deadbeef] (primary: true, enabled: true)

If everything worked, this should show details about your account session.

For brevity, you could reach for an alias:

# ⚠️ Don't do this!
$ alias mm="MASKEDEMAIL_TOKEN=$(security find-generic-password -ws "maskedemail-cli") maskedemail-cli"
$ mm list
...

But the alias evaluates the expression first, then stores the resulting string, including the plain text token. which mm would then print the token for everyone! Not good.
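You can see the effect without touching the Keychain. In this small sketch, the made-up function fake_token stands in for the security lookup, and the alias names eager and lazy are just for illustration:

```shell
# fake_token is a hypothetical stand-in for the Keychain lookup.
fake_token() { echo "s3cret"; }

# Double quotes: $(...) runs right now, so the token is baked into the alias.
alias eager="TOKEN=$(fake_token) true"

# Single quotes: the $(...) text is stored verbatim and not run at definition time.
alias lazy='TOKEN=$(fake_token) true'

# Print both definitions: only the first one contains the literal token.
alias eager
alias lazy
```

The single-quoted variant avoids storing the token, but it is easy to get the quoting wrong once the command grows.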

Use a function instead and forward parameters via $@:

mm() {
    MASKEDEMAIL_TOKEN=$(security find-generic-password -ws "maskedemail-cli")\
        maskedemail-cli "$@"
}

Update 2023-04-21: Added missing quotes around $@, which makes passing whitespace in quoted parameters possible instead of flattening everything. Thanks @teilweise@layer8.space!

With that, you can use the shorthand mm session or mm list just like you could with an alias, but the function will evaluate the sub-expression every time – like it should!

Getting a Masked Email is as simple as: mm create

To put the email onto your clipboard on macOS:

$ mm create | pbcopy

Nice! That’s already great for automation.

For fine-grained control, the create command also accepts

  • -domain "<domain>", e.g. “facebook.com”
  • a human-readable -desc "<description>"

To have an overview in the onslaught of newly generated email addresses, supplying these parameters is likely a good idea!
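A small wrapper could make supplying both parameters a habit. This is only a sketch: the name mm_for is made up, it relies on the mm function from above, and it uses pbcopy, so it is macOS-only:

```shell
# Hypothetical helper: create a Masked Email for one site and copy it.
# Relies on the mm function defined earlier and on macOS's pbcopy.
mm_for() {
    domain=$1
    # Fall back to a generated description when none is passed.
    desc=${2:-"Created $(date +%Y-%m-%d) for $domain"}
    mm create -domain "$domain" -desc "$desc" | pbcopy
}
```

Usage: mm_for "facebook.com" "Facebook login" puts the new address on the clipboard.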

Now I’ll experiment with creating some Keyboard Maestro macros or Siri Shortcuts to make Masked Email creation simpler.

Join FastMail using my referral code for 10% off of your first year → https://ref.fm/u21056816

