Parse Boolean Search Expressions in Swift

null

Ever wanted to implement a full-text search in your app? Didn't find a boolean search expression parser that works? Look no further!

For the initial release of my note-taking app The Archive I had to create a couple of open source libraries that I have yet to talk about. Today, I want to show you my search expression parser. It powers the app's Omnibar to find notes quickly using simple boolean expressions.

Searching in a Pile of Notes

A prerequisite for me was to have an interface that'd work well with C Strings. strstr is just the fastest way to look for a needle in a haystack. It's crazy. To utilize the full potential of full-text C String search, though, you should keep an all-caps or all-lowercase version of the note around so you can do a case-insensitive search. While case-insensitive searches are slower when you perform them on-the-fly, the results are stunning on a bre-built index that doesn't distinguish upcase and downcase anymore.

The actual objects in my app have a bit more information, but this is a good-enough approximation of what a note representation in the in-memory index looks like:

struct IndexableNote {
    private let cString: [CChar]

    init(text: String) {
        self.cString = text
            // Favor simple over grapheme cluster characters
            .precomposedStringWithCanonicalMapping
            .cString(using: .utf8)!
    }
}

The original String doesn't matter for the purpose of indexing. All we want to do is perform fast searches.

My first naive implementation of a full-text search separated the search string at every space. So foo bar baz would search for foo, bar, and baz in every note. To make the parts easier to search for, convert them to C Strings as well.

Here's the implementation that served me well through the beta and into the v1.0 release:

struct IndexableSearchString {
    private let cStringWords: [[CChar]]

    init(_ string: String) {
        self.cStringWords = string
            // Favor simple over grapheme cluster characters
            .precomposedStringWithCanonicalMapping
            .lowercased()
            .split(separator: " ")
            .flatMap { $0.cString(using: .utf8) }
    }

    func matchesAll(in haystack: [CChar]) -> Bool {
        for needle in cStringWords {
            if strstr(haystack, needle) == nil { return false }
        }

        return true
    }
}

extension IndexableNote {
    func matches(searchString: IndexableSearchString) -> Bool {
        searchString.matchesAll(in self.cString)
    }
}

You prepare the search once, then match it against each note in the index as a filter. That's it. It's super simple, but it works well for 99% of all use cases. This is pretty close to a Google search already: You enter a list of terms and want search results that contain all of them.

Then there are tech-savvy people, though, who want to exclude search terms and use boolean OR to search for variants, like "banana OR apple OR fruit". To cater to their needs, I wrote a search expression parser that does just that, and which provides the C String matching that proved to be so useful.

The Boolean Search Expression Parser

I wrote the SearchExpressionParser library with note-taking apps in mind. Search terms had to be human-readable enough for a layperson to understand what's going on. That's why operators are all caps: AND, OR, and NOT/!.

The library behaves as follows:

Note that a lowercase "and" will not be treated as an operator, only all-caps "AND" will. So there's no need to escape a lowercase \and, for example.

You can parenthesize expressions:

!(foo OR (baz AND !bar))

That evaluates to an equivalent of:

!foo OR !baz AND !foo OR !bar

As of yet, there is no real operator precedence implementation. I didn't need that, and I discovered not every full-text search implements this correctly at all. So instead of operator precedence that satisfies math-nerds, logicians, and programmers, I roll with a strict left-to-right approach.

The Expression object tree of the nested term above looks like this, by the way:

// !(foo OR (baz AND !bar))
NotNode(
    OrNode(lhs: ContainsNode("foo"), 
           rhs: AndNode(lhs: ContainsNode("baz"), 
                        Rhs: NotNode(ContainsNode("bar")))))

That's what you'll get from the parser. It's a self-evaluating expression node tree.

Expressions are not optimizing themselves to abort quickly; instead, the whole expression tree will be traversed and checked if the operators permit. Since an AndNode simply combines the result of the left-hand side with the right-hand side using Swift's && operator, the right-hand side could be skipped if the left-hand side evaluates to false already. In the best-case scenario, this is just as efficient as regular boolean expressions in Swift.

This also means that !(a OR b) will result in:

NotNode(OrNode(lhs: ContainsNode("a"), rhs: ContainsNode("b")))

Since the underlying || operator always evaluates both sides, this is less efficient than the equivalent term !a AND !b.

But does it matter in your case?

If so, pull requests with boolean expression normalization are welcome, of course!

It didn't matter to me. When people compose intricate expressions, well, then I think they're using their note archive in a very peculiar way. I don't see the benefit of spending extra work on a normalizer when I could be adding features that benefit a much wider audience.

Using the SearchExpressionParser API

The SearchExpressionParser API exposes Parser.parse(searchString:) that you'll be using:

import SearchExpressionParser
guard let expr = try? Parser.parse(searchString: "Hello") else { fatalError() }
expr.isSatisfied(by: "Hello World!") // true
expr.isSatisfied(by: "hello world!") // false

The parser can potentially throw an error, but all errors you'll get are programmer errors on my side. There are no regular error conditions. When an error gets thrown here, please report it, because it's a bug.

The library provides a CStringExpressionSatisfiable protocol to perform my beloved strstr search instead of the more literal and much slower String.contains. It will also make the search case-insensitive.

To implement this, take the IndexableNote from above and modify it to meet the API criteria:

struct IndexableNote {
    private let cString: [CChar]

    init(text: String) {
        self.cString = text
            // Favor simple over grapheme cluster characters
            .precomposedStringWithCanonicalMapping
            .cString(using: .utf8)!
    }
}

import SearchExpressionParser

extension IndexableNote: CStringExpressionSatisfiable {
    func matches(needle: [CChar]) -> Bool {
        return strstr(self.cString, needle) != nil
    }
}

That's all you need to get lightning-fast case-insensitive search:

guard let expr = try? Parser.parse(searchString: "Hello") else { fatalError() }
expr.isSatisfied(by: IndexableNote(text: "Hello World!")) // true
expr.isSatisfied(by: IndexableNote(text: "hello world!")) // true

Let the boolean expression parser do its magic for you:

let warAndPeace = IndexableNote(text: String(contentsOf: "books/Tolstoy/War-and-Peace.txt"))
let protagonist = try! Parser.parse(searchString: "\"Pierre Bezukhov\" OR \"Pyotr Kirillovich\"")
protagonist.isSatisfied(by: warAndPeace) // true

That's all there is to the power of The Archive's search expressions! They were pretty fun to implement and make searching for relevant notes in your note archive much easier, e.g. hyperlink (#zettelkasten OR #note-taking). With The Archive's saved search feature, you can compose boolean queries once and then get back to an accurately reduced subset of your thousands of notes in a split-second.

Find SearchExpressionParser on GitHub and feel free to open issues, pull requests, and ask questions anytime!

Browse the blog archive