Natural Language Toolkit Word Counter

May 10th, 2020

Back in January when I drafted this post, I had just discovered Apple’s NaturalLanguage.framework. I still don’t know how powerful it really is, but it’s useful for a very simple task already:

Counting words.

In English and German, I can get pretty accurate results with a String.split at punctuation marks and whitespace. In French, you will get skewed results because these nice folks decided to put whitespace between quotation marks and quoted text.

“Quote in English”
« Citation en français »

Splitting the strings at whitespace will produce 3 for the English line and 5 for the French, even though both contain 3 words.
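To reproduce these numbers, here is a minimal sketch of the naive approach. The helper name naiveWordCount is made up for this illustration; it only splits at whitespace and is not the counter used later in this post:

```swift
/// Hypothetical helper: counts the chunks between runs of whitespace.
func naiveWordCount(_ text: String) -> Int {
    return text.split(whereSeparator: { $0.isWhitespace }).count
}

let english = "“Quote in English”"
let french = "« Citation en français »"

print(naiveWordCount(english)) // 3
print(naiveWordCount(french))  // 5: both guillemets are counted as "words"
```

The quotation marks stick to their neighboring words in English, but stand alone in French, which is why the second count is off by two.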

That’s where the linguistic tagging of the Natural Language toolkit comes into play!

It also works for a lot of other languages, even Chinese. (Guess what I’ll change the international algorithm for the WordCounter to.)

To get it to work, create an NSLinguisticTagger with the scheme set to .tokenType: that’s for detecting words and punctuation instead of, say, sentences.

let wordCountTagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)

Since macOS 10.13, the API to enumerate over all words is more useful for this purpose because you can specify that you’re interested in NSLinguisticTaggerUnit.word directly. With older versions of the OS, you have to tweak the counting a bit and filter matched tags for NSLinguisticTag.word.

Don’t ask me what the purpose of NSLinguisticTag is over NSLinguisticTaggerUnit, and how you can combine these with great success. That still beats me.

func wordCount(tagger: NSLinguisticTagger, text: String) -> Int {
    let range = NSRange(location: 0, length: text.utf16.count)
    let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]
    tagger.string = text

    var count = 0

    // Since macOS 10.13, you can limit the enumeration to words directly:
    if #available(OSX 10.13, *) {
        tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType, options: options) { _, _, _ in
            count += 1
        }
    } else {
        tagger.enumerateTags(in: range, scheme: .tokenType, options: options) { tag, _, _, _ in
            guard tag == NSLinguisticTag.word else { return }
            count += 1
        }
    }

    return count
}
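If you can require macOS 10.14 or newer, the NaturalLanguage framework also ships NLTokenizer, which might be a simpler fit for plain counting. The following is a minimal sketch under that assumption and not what the function above does; tokenizerWordCount is a made-up name, and you should check how the tokenizer treats punctuation in your target languages before relying on it:

```swift
import NaturalLanguage

/// Hypothetical alternative to the tagger-based function above:
/// asks NLTokenizer for all word-level token ranges and counts them.
@available(macOS 10.14, iOS 12.0, *)
func tokenizerWordCount(_ text: String) -> Int {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text
    // tokens(for:) returns one String.Index range per detected token.
    return tokenizer.tokens(for: text.startIndex..<text.endIndex).count
}
```

Calling tokenizerWordCount with both quoted lines from the beginning of this post is a quick way to compare it against the tagger-based count.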

Keep the counter off your main queue

I don’t know how hard linguistic tagging hits your performance. In The Archive, the word counting stats aren’t mission-critical; a delay, a.k.a. “eventual consistency”, is totally acceptable. Asynchronous processing keeps the main queue free for important work.

Without RxSwift or other reactive setups, you can start by calling the wordCount(tagger:text:) function above from a background queue. It’s generally advised not to use a single NSLinguisticTagger from multiple threads at the same time, but it should be fine to use it from one background queue exclusively. Do exercise caution with all this.

Heads up: You will need a serial dispatch queue. Without a serial queue, results from callbacks may arrive out of order. For example, if you dispatch the counting on a concurrent background queue with a string that is 10 MiB large, and then immediately dispatch another piece of work with just a couple of characters, the callback for the shorter string will likely finish first. If you display counts for multiple strings in a table, the order doesn’t matter. If you want to abort the previous request and only get back the results for the shorter string, you need a different approach.
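The ordering difference can be made visible with a small experiment. This is only a sketch: slowWordCount is a hypothetical stand-in that sleeps in proportion to the text length instead of doing real linguistic tagging, and completionOrder is a made-up helper that records which job finishes when:

```swift
import Foundation

/// Hypothetical stand-in for an expensive count: longer texts take longer.
func slowWordCount(_ text: String) -> Int {
    Thread.sleep(forTimeInterval: Double(text.count) * 0.005)
    return text.split(whereSeparator: { $0.isWhitespace }).count
}

/// Dispatches one job per input and records the order in which jobs finish.
func completionOrder(on queue: DispatchQueue, inputs: [String]) -> [Int] {
    let group = DispatchGroup()
    let lock = NSLock()
    var finished: [Int] = []

    for (index, text) in inputs.enumerated() {
        queue.async(group: group) {
            _ = slowWordCount(text)
            lock.lock()
            finished.append(index)
            lock.unlock()
        }
    }

    group.wait() // Blocks until every job is done; fine for a demo.
    return finished
}

let inputs = [String(repeating: "word ", count: 40), "hi"]
let serialQueue = DispatchQueue(label: "demo.serial")
let concurrentQueue = DispatchQueue(label: "demo.concurrent", attributes: .concurrent)

print(completionOrder(on: serialQueue, inputs: inputs))     // [0, 1]
print(completionOrder(on: concurrentQueue, inputs: inputs)) // usually [1, 0]
```

The serial queue always finishes the long job first because it was enqueued first; on the concurrent queue, the short job typically overtakes it.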

With serial dispatch queues, results do come back in order, and a longer running but outdated request blocks execution of later requests. If you always want to display the current word count for a document, for example, cancelling previous requests is the sensible option. For this, you will have to use DispatchWorkItem or come up with your own concoction of a cancellable dispatched block.

Here’s what a work item approach may look like:

let wordCountQueue = DispatchQueue(label: "wordCounting", qos: .background)

/// Keep track of the previous work so you can cancel it.
private var _previousWorkItem: DispatchWorkItem?

func wordCount(text: String) -> Int { /* Expensive calculation here */ }

func asyncWordCount(text: String,
                    completion displayCount: @escaping (Int) -> Void) {
    _previousWorkItem?.cancel()
    let workItem = DispatchWorkItem {
        let count = wordCount(text: text)
        DispatchQueue.main.async {
            displayCount(count)
        }
    }
    _previousWorkItem = workItem
    wordCountQueue.async(execute: workItem)
}
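One caveat: DispatchWorkItem.cancel() only prevents a work item from starting. It does not interrupt a block that is already running, so an outdated count can still reach the completion handler. One possible "own concoction" is a small cancellation flag that is checked before and after counting. In this sketch, CancellationToken and LatestWordCounter are hypothetical names, the whitespace split merely stands in for the real counting, and the configurable callback queue is an assumption to keep the example testable:

```swift
import Foundation

/// Hypothetical helper: a thread-safe flag the background work can poll.
final class CancellationToken {
    private let lock = NSLock()
    private var cancelled = false

    var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return cancelled
    }

    func cancel() {
        lock.lock(); defer { lock.unlock() }
        cancelled = true
    }
}

/// Hypothetical wrapper that only reports the count of the latest request.
final class LatestWordCounter {
    private let workQueue = DispatchQueue(label: "wordCounting.latest", qos: .utility)
    private let callbackQueue: DispatchQueue
    /// Call count(text:completion:) from a single thread, e.g. the main thread.
    private var currentToken: CancellationToken?

    init(callbackQueue: DispatchQueue = .main) {
        self.callbackQueue = callbackQueue
    }

    func count(text: String, completion: @escaping (Int) -> Void) {
        currentToken?.cancel()
        let token = CancellationToken()
        currentToken = token
        let callbackQueue = self.callbackQueue

        workQueue.async {
            // Skip requests that were superseded while waiting in the queue.
            guard !token.isCancelled else { return }
            // Stand-in for the expensive tagger-based count.
            let count = text.split(whereSeparator: { $0.isWhitespace }).count
            callbackQueue.async {
                // Drop results that became outdated while counting.
                guard !token.isCancelled else { return }
                completion(count)
            }
        }
    }
}
```

A long-running count still occupies the serial queue until it finishes; to abort it early, the counting loop itself would have to poll the token.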

In The Archive, a lot of UI code is written based on RxSwift, so I had Rx event streams at hand anyway. I wrapped the counting in RxSwift Observables, quite similar to the following:

struct WordCountViewModel {
    let wordCountTagger: NSLinguisticTagger
    
    // Like the function from above
    private func wordCount(text: String) -> Int { /* ... */ }
    
    // Input port
    let textChanges: Observable<String>

    // Output port
    var counts: Observable<Int> {
        return textChanges.flatMapLatest { text in
            Single<Int>
                .create { single -> Disposable in
                    // jiggleThread(minimumDelay: 0.5, maximumDelay: 2)
                    single(.success(self.wordCount(text: text)))
                    return Disposables.create()
                }
                // Perform request on other queue
                // (RxSwift 6 renamed subscribeOn(_:) to subscribe(on:))
                .subscribeOn(ConcurrentDispatchQueueScheduler(qos: .userInitiated))
        }
    }
}