Making Your App Extensible with JavaScriptCore: Annotated Presentation with Full Transcript

Aug 15th, 2023

Last year, I posted my presentation video and slides for the CocoaHeads Aachen talk “Making Your App Extensible with JavaScriptCore”.

Today I read about Simon Willison’s presentation annotation tool. It’s a simple HTML file where you can put your slides, have Tesseract generate alt-text for each, and annotate the whole thing with simple Markdown. The generated output is HTML. It is genius. I love it, and here are the slides + transcript (which I happen to have from editing the video anyway) of the JSCore talk.

Making Your App Extensible with JavaScriptCore (Full Text Version)

Make your app securely extensible with JavaScriptCore

christiantietze.de // @ctietze CocoaHeads Aachen

Thanks again for having me. What I want to show you today is how to securely use JavaScriptCore in your own Mac applications.

The idea here is that JavaScript is basically the only way nowadays that you can ship plugins or scripts with your apps on macOS that is not broken since Big Sur, I think. Because Ruby and Python and all these other scripting languages were yeeted from the operating system.

The JavaScriptCore framework is still available, and it doesn’t depend on an actual runtime. So you can use the framework from within the apps. And the bad thing about this is you have to use JavaScript. The nice thing is it’s built in and basically works since forever because the framework is so old.

So it’s not like new technology. It works with really old stuff. And the thing I’ve found is that you can actually make plugins for your applications that don’t expose the app’s internals or the user’s file system or any sensitive data unless you actually decide to expose this. Which means that plugin installing can become rather safe. It’s easy to break things, but it’s also easy to make things secure.

And today I want to show you how I approach this to ship application plugins with my app in a week or two, to beta testers. And the app I’m talking about is The Archive. It’s a note-taking app.

Screenshot of note-taking app 'The Archive'

Here you see a demo note on the right side with some syntax highlighting for Markdown. On the left, there’s the list of search results, which is currently showing just everything because I didn’t search for anything.

And yeah, the layout of the application is this: there’s search to the left, there’s content to the right. I’m developing macOS applications first and foremost, but the approach is possibly universal or cross-platform because I’m not using anything Mac-specific. As long as you have access to the JavaScriptCore framework.

We will be exploring how to make this available to plugins: how can a script access selected notes on the left, or the selected text on the right, and how can a plugin insert text on behalf of the user? How can a plugin create a new file that is then managed by the app? How can a plugin change a file that is not visited at the moment, that is not visible in the editor, but still managed by the app itself? For example: to always update the same statistics file, which is a use case I’m going to demonstrate to you.

Schematic representation of an app: buttons, clicked on, produce change via JavaScript

That’s the very, very basic abstract flow of things. It can be a toolbar button as I tried to sketch here. It could be a shortcut. It could be a main menu entry, it could be a proper button. It could maybe even automatically be triggered when the user types certain text, like macro expansions, things like that.

So on some trigger, do execute JavaScript, so that some change happens.

This is the basic approach of making plugins work in the application. This means that I have this JavaScript and for specific user interaction, I have to execute the JavaScript to do something.

I initially had no clue how to approach this the best way because I never used the JavaScriptCore framework before. So like any good programmer, I invented the wheel all over again, possibly.

But well, this is the topic of my talk today, and I want to show you how I did it, and I do hope you have some feedback for this.

So the simplification of user interaction is that we have inputs and we have outputs and stuff happens in between.

This brings us to a very well known abstraction, which is usually called a black box.

A black box could be a function, a black box could be a class, a module, a black box can be a whole program.

You don’t need to know what the program is doing when you execute it. The important thing is that you kind of know that the program is doing the correct thing, given certain inputs to produce outputs you expect, and I try to approach the JavaScript stuff the same way.

So the one possible black box that we’re going to explore is the basic function: given some input to a function, we expect some output from the function. This is the most fundamental thing we have in Swift. You don’t need classes. You don’t need modules.

Thinking about functions, that already is your black box. Applied to a very, very, very, very basic language feature, we are going to approach this whole thing with them.

I try not to introduce too many object trees or complex magic interactions, and instead try to make functions that perform certain tasks and then produce a change.

Screenshot of JavaScriptCore.framework documentation

This is a screenshot from the documentation. If you look at the Apple Docs as of this month, November, 2022 this is what you see. The JavaScriptCore framework has a couple of classes and one protocol, and then there’s a C API, which is a bit different.

And then there’s JavaScriptCore constants, which we are also not going to look at. Most things I found can already be accomplished by using the JSContext class.

Here’s an “execution environment”. What does this mean?

An execution environment is basically running the JavaScript code. It’s where variables are stored. It knows what kind of free functions are available, where the objects live that the script can access. Basically everything lives inside the context. It’s just this thing wherein the script is running. As for the virtual machine, I don’t even recall what exactly the VM does. I know that I needed it for something to get a context, but I can’t tell you more about that part.

The most important piece that I found is the context, and the JSValue that you see below, and the JSManagedValue, which is something we will completely ignore. If you have worked with CoreData, for example, you will know the NSManagedObject, and the JSManagedValue is kind of similar: It’s bridging into the JavaScript context and automatically updates when you mutate an object from within the script. And the JavaScript side sees values reflected as you change them in your Swift code on these managed values. This is very handy, I guess, but it’s something that I find, well, utterly unnecessary for my purposes. And too hard to control.

So we stick to the basic values and the context, and if we look at the context in more detail then we will see that the context has some functions.

Screenshot of JavaScriptCore.framework documentation focusing on the 'evaluate' function

But there’s one thing that is reflecting what we’ve just talked about and what I just demonstrated: that’s this evaluateScript method. It takes an input that’s the JavaScript code, and it produces an output. So usually, the script returns a value.

I would rather not return a value because I wanted my plugins to be written in a way that they perform side effects. So if you think about a function that returns a value, but want to make it, let’s say, asynchronous, then the next step you would do is to pass a completion handler and not return a value, and then pass the same value into the completion handler for later processing.

And that’s basically what we are going to do. We’re using this evaluateScript function as the main entry point, as the core player here. But we are going to ignore the return value. The actual results of the script will be captured differently. I’m going to demonstrate this in a bit.

Now imagine this is the JavaScript file, the contents, or the script code.

What we’re going to do is this: we are taking this, and then we are putting this into the evaluateScript function to make the context do something with the script. But still, even when the context is executing the JavaScript code string, it would need inputs to see, in the case of my app, the notes and the selected text, and we have to teach the JavaScript code itself how to access this.

Even with the JSContext method that I’ve just shown you, we still need to figure out inputs and outputs to the script itself.

As I hinted at, my approach is quite functional, which means functional input and output.

I’m not leaning into the Haskell IO Monad, if you’re familiar with that, because I’m not familiar with that. I just know that it exists and that you’re basically able to do functional programming without any side effects, and that the IO thingy captures all these side effects.

Since the JavaScript code itself has no clue about the rest of the app, I’m introducing similar things. At least from my superficial understanding.

I’m introducing a function for input, and a function for output.

And I’m teaching the script to call a function for the input and to call a function for the output. And then the script isn’t blind anymore and isn’t a useless waste of computing power, but can actually produce effects and outputs.

const aValue = input();
...
output(anotherValue);

With this input and output in mind, this is what we’re going to do in the evaluation loop:

We are getting a value. This is inside the JavaScript that you write, and we’re calling some kind of input. input() is a function that is injected into the JavaScript code. You call the input, and then you get whatever this function returns.
And then you can work with the value to do some meaningful tasks or value transformation.
And when you’re finished and want to do something with the result, then you call another function that is taught to the JavaScript context and thus the JavaScript code, the output function, and you pass the result in there.

And this is what I meant when I said I don’t want the context to return the result. I want to forward another value to a function inside the script, because the app doesn’t know what your script produces and what to do with that.

Additionally, you can call the output function multiple times for multiple outputs. Or you could not call the output function at all, which is kind of useless because then the script doesn’t do anything. It would compute something, but then it’s not reporting the results back.
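A minimal sketch of this input/output pattern in plain JavaScript could look like the following. The `input` and `output` functions are stand-ins defined inside the snippet so that it runs on its own; in the app, the host would inject them. The sample text and the `collected` array are made up for illustration.

```javascript
// Stand-ins for what the host app would inject (assumption: plain functions).
const collected = [];
function input() {
  return "one two three";
}
function output(value) {
  collected.push(value);
}

// The plugin script itself: read, transform, report.
const aValue = input();
const words = aValue.split(" ");
output(words.length);          // first output: the word count
output(aValue.toUpperCase());  // second output: the transformed text
```

Calling `output` twice leaves two entries in `collected`; a script that never calls it leaves the host with nothing to act on.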

So you would need at least one call for this, but you don’t need to return this synchronously. You can also do asynchronous work. (I believe that’s the case at least, because I haven’t tried to do anything too fancy with my JavaScript files.) But the idea here is that this pattern of output callbacks and input functions already allows writing interesting scripts.

Since the output is a callback that is injected into the script, you could copy and paste whole JavaScript libraries from the web that do complex things. Maybe that even, let’s say, do statistical analysis and then create images or some other kind of visualization, which maybe even takes some time, and then pass the result to the output function as a parameter.

And there you go. So this is the basic approach. To make this a bit more concrete, I want to show you a rough simplification of the code I’m using.

So the plugins in my app are really simple. A plugin object has a manifest, we’re going to look at this in a second. It has the actual JavaScript and a checksum. The manifest declares the kinds of inputs and outputs that the plugin requests from the application.

struct Plugin {
    let manifest: Manifest
    var name: String { manifest.name }
    let javaScript: JavaScript
    let checksum: Checksum
}

I’m going to show you what this looks like and why in a second. But it’s basically all the metadata: the inputs, the outputs, the author, so everything about the plugin. The plugin itself, so to speak, is the JavaScript code. That’s just a string. The JavaScript type is a struct, but it just has one value. It’s a string for the code, and it doesn’t do anything else.

It’s just a very thin wrapper. And the checksum is important. It’s a SHA256 hash, I believe. I can’t even recall. But it’s a computed hash from the file to make sure that when the user enables a plugin, this enablement is then stored with a checksum. So if you change the file on disk, the app will notice that the checksum no longer matches and can prevent the plugin from automatically loading again. In the demo that will follow I will show you what this looks like. It will feel rather straightforward when you see it, but the checksum is important for that part.
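The enable-then-verify flow can be sketched roughly like this. This is not the app’s Swift code: it assumes SHA-256 (the talk only says “I believe”), uses Node.js’s `crypto` module purely for illustration (it is not available inside JavaScriptCore), and the names `checksum`, `enablePlugin`, `canAutoload`, and `enabledPlugins` are hypothetical.

```javascript
const crypto = require("crypto"); // Node.js only; an assumption for this sketch

// Hypothetical helper: hex-encoded SHA-256 of the plugin's source code.
function checksum(code) {
  return crypto.createHash("sha256").update(code, "utf8").digest("hex");
}

// Hypothetical store of user consent: plugin name -> checksum at enable time.
const enabledPlugins = new Map();

function enablePlugin(name, code) {
  enabledPlugins.set(name, checksum(code));
}

// A plugin may load automatically only if its code is unchanged since it was enabled.
function canAutoload(name, code) {
  return enabledPlugins.get(name) === checksum(code);
}

enablePlugin("word-count", "output(1);");
const unchanged = canAutoload("word-count", "output(1);");       // true
const tampered = canAutoload("word-count", "output(1); evil()"); // false
```

A plugin that was never enabled has no stored checksum, so `canAutoload` returns false for it as well.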

struct Manifest {
    struct Author { ... }
    struct Input { ... }       // Focus on this
    struct Output { ... }      // And this, with

    let name: String
    let title: String
    let description: String
    let version: Version
    let releaseDate: Date
    let appVersion: Version
    let authors: [Author]
    let input: Input           // this property
    let output: Output         // and this one
    let dependencies: [String]

    ...
}

The manifest declares required inputs and outputs. These are subtypes and this whole thing is Codable, so you can load this from an adjacent file and it contains all the metadata.

It’s well, metadata for the plugin, like the name, the title, the ID. The title is a user-facing title, but all of this is just potentially interesting metadata for the user and configures the app. So the app can present something nice in a plugin list, let’s say. At the same time, the inputs and outputs are where the real complicated stuff is happening.

It’s not really that complicated, but it’s the most complicated. The rest is really, really dumb.

struct Manifest {
    ...
    struct Input {
        enum Selection: String, Hashable {
            case selected 
            case all
        }

        let notes: Set<Selection>
        let text: Set<Selection>
    }
    ...
}

If we look at the Input type, that’s the declaration of the expected input functionality. It’s basically saying this plugin needs, for example, access to all notes or to the selected text.

It can also declare that it wants the combination of all notes, and the selected notes, and all text, and the selected text. What does this mean?

“All notes” in the context of my app means you get, well, all notes that are visible in the folder and that are known to the app.
The “selected notes” input would just return the currently visited file, aka the note the user is currently editing. And if there’s no note open at the moment, then this would return nil. Same for text.
“All text” means: give me the complete content of the note that is currently being visited.
And “selected text” means just the marked region, which could be useful for replacements.
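How the four input declarations could translate into what a script actually sees is sketched below. The shape of the `input` object, the `appState` values, and the `makeInput` helper are assumptions for illustration; the talk does not show the real structure.

```javascript
// Hypothetical app state; in the real app this would come from the UI.
const appState = {
  allNotes: ["a.md", "b.md"],
  selectedNote: "a.md",
  allText: "Hello world",
  selectedText: "world",
};

// Build the `input` object from the manifest's declarations.
// Anything the manifest did not ask for is simply absent.
function makeInput(manifestInput, state) {
  const input = { notes: {}, text: {} };
  if (manifestInput.notes.includes("all")) input.notes.all = state.allNotes;
  if (manifestInput.notes.includes("selected")) input.notes.selected = state.selectedNote;
  if (manifestInput.text.includes("all")) input.text.all = state.allText;
  if (manifestInput.text.includes("selected")) input.text.selected = state.selectedText;
  return input;
}

// A manifest that only declares access to the selected text.
const input = makeInput({ notes: [], text: ["selected"] }, appState);
```

With this manifest, the script can read `input.text.selected`, while `input.text.all` and both note inputs stay undefined.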

One example that I haven’t implemented in the samples, but one that will make intuitive sense to you, I think, is refactoring.

Think about the use case of selecting a part of text in the note, right click, refactor, then extract into a new file. (It’s basically what Xcode is supposed to offer, but usually doesn’t.) This would perform a couple of effects:

It would create a new note,
add the text that was selected into it (to do this, it would have to read the text selection)
and after the new file is created, it’ll also replace the selected text with a link to the new file.

And for stuff like that, it’s important to not just get all of a note’s text, but also the selected text, or a representation of the selection, so that you can simulate user input.

But that was just the input part. Now the output has to offer similar facilities.

struct Manifest {
    ...
    struct Output {
        enum File {
            case change(Filename)
            case new
        }

        enum Text {
            case insert
        }

        let file: File?
        let text: Text?
    }
    ...
}

So this is the output. The options I currently allow are either:

A file output, which can mean change a certain file, a known file with a static file name, like statistics.txt, or create a new file according to the rules of the app. Which means that the plugin doesn’t have a say which file is going to be created. The app is creating the file, and then yeah, the plugin has to live with the result.
And the text output is just one case that’s possible at the moment: It’s “insert”, and this will replace either the selection (if there’s any) or type text at the current cursor position, which is, in a text view, the same operation. Inserting means: at the point of the insertion point, insert the string as if pasted or typed. And if the insertion point is actually not a point, but a region or range, then overwrite this with some other text.
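Why inserting at the cursor and replacing a selection are the same operation can be shown with a small sketch. The `insertText` helper and the `{ location, length }` range shape (borrowed from NSRange’s naming) are illustrative assumptions, not the app’s code.

```javascript
// Replace `length` characters starting at `location` with `text`.
// With length 0 this is a plain insertion at the cursor.
function insertText(content, range, text) {
  const before = content.slice(0, range.location);
  const after = content.slice(range.location + range.length);
  return before + text + after;
}

// Insertion point after "Hello": nothing is selected, so nothing is removed.
const typed = insertText("Hello world", { location: 5, length: 0 }, ",");
// Selection covering "world": the selected range is overwritten.
const replaced = insertText("Hello world", { location: 6, length: 5 }, "there");
```

Here `typed` becomes "Hello, world" and `replaced` becomes "Hello there".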

So there’s just this one interesting case for text manipulation at the moment. There could be more like “append” or “prepend” to the file. But yeah, I haven’t explored that. This is complicated enough already. And useful enough for a first release. These are the basics that I’m going to show you and that are shipping in our app.

So what happens when we have the inputs and outputs? The inputs and outputs are going to be executed. And I wrote a wrapper for the JavaScript context here.

/// A context defines available inputs and outputs for a `JavaScript` during execution.
class Context {
    let input: Environment.Input
    let output: Environment.Output
    let utils: Utils
    private let jsContext: JSContext
    
    ...

    func execute(javaScript: JavaScript) throws -> Effects {
        let outputCollector = output.outputCollector()
        configure(input: input.jsValue(context: jsContext),
                  output: outputCollector,
                  utils: utils)

        jsContext.evaluateScript(javaScript.code)
        if let exception = jsContext.exception {
            throw JavaScriptError(message: exception.toString())
        }

        return outputCollector.effects()
    }

    private func configure(input: JSValue?, output: JSExport?, utils: JSExport?) {
        jsContext["input"] = input
        jsContext["output"] = output
        jsContext["utils"] = utils
    }
}

First, I’ll just give you a moment to orient yourselves.

There’s this execute function. It takes the JavaScript, which is just the code, and then it calls the configure method here. And the configure method injects into the JavaScript context from the JavaScriptCore framework.

It injects a reference to input, output and utils. The input and output is interesting for our case here. So what does this jsContext subscript with the input and output strings do? It creates a new global variable inside of the JavaScript execution context in the environment. So the script starts with access to these global variables.

That’s how we are injecting input and output into the context for execution.

And then we have access to this afterward. We are evaluating the script. That’s the method. We looked at it in the documentation earlier. You see here, the string of the JavaScript code is being fetched, and it’s being evaluated.

After the three global helper variables, the global references have been declared. So you can think of this as input, output, and utils being declared first on the JavaScript side, then the rest of the script is being copy-and-pasted afterward, and then the resulting code is being executed.

And in the end, if everything is finished, we return from this wrapper function.

From this execute function, we return the effects of the “output collector”. The output collector is a reference to an object that is being bound to the JavaScript context output. And the effects are being computed at the very end after the script evaluation is complete.
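The sequence described here (inject names, run the script, check for an exception, compute effects afterward) can be emulated in plain JavaScript as a rough sketch. It uses `new Function` as a stand-in for JSContext, which the app does not do, and the shape of the effects array is an assumption.

```javascript
// Emulates the Swift `execute` wrapper in plain JavaScript (an assumption for
// illustration; the app uses JSContext instead of `new Function`).
function execute(code, input) {
  // Fresh collector per run; the script mutates it, the host reads it afterward.
  const output = { insert: { text: "" } };
  try {
    // `input` and `output` become names the script can use, similar to
    // globals that were declared before the script body.
    new Function("input", "output", code)(input, output);
  } catch (error) {
    throw new Error("JavaScriptError: " + error.message);
  }
  // Effects are computed only after evaluation is complete.
  const effects = [];
  if (output.insert.text !== "") {
    effects.push({ kind: "insertText", text: output.insert.text });
  }
  return effects;
}

const effects = execute(
  "output.insert.text = input.text.selected.toUpperCase();",
  { text: { selected: "hello" } }
);
```

A script that throws never produces effects; a script that sets nothing produces an empty list.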

Let’s have a look at the output collector and the effects. So this is a JSExport protocol. This is the only JavaScript magic binding, let’s say, that I’m using, as far as I can remember at least. This output collector JSExport protocol declares the available object properties, attributes, functions, whatever that JavaScript has access to. Let’s look at each one of these.

protocol NoteContentCollectorJSExports: JSExport {
    var filename: String? { get }
    var content: String { get set }
}

So there’s this NoteContentCollectorJSExports. It’s not public API, so I picked a very technical name. This filename and content string, these are two properties of this collector and these correspond to the manifest output file declaration that we’ve looked at earlier.

That’s the input file name, and the content is the supposed new content of a file once it’s going to be created. It’s also going to be used when the contents of an existing file are being replaced. So this is the new content of a new note or the new content of an existing note after the script execution has finished.

protocol InsertCollectorJSExports: JSExport {
    var text: String { get set }
}

The other collector of outputs is the InsertCollectorJSExports collector, and it corresponds to the “insert text” case, which is the only case for the text output that I’ve shown you. Its text property is mutable, so the script can set this text value to some string, and at the end of the script’s execution, we will know what the supposed replacement text is going to be.

For the refactoring example, extract something from the current note and replace the selection: after the extraction is complete, replace the selected text with a reference to the new note. This means that this text property will be set to the reference to the new note. And this is then in turn being evaluated by the app, and it will know what to do with it. It will know how to manipulate the NSTextView to change the content at the insertion point or at the selected range.

So this is all the output collector stuff. And you see the combined output collector here:

protocol OutputCollectorJSExports: JSExport {
    var newFile: NoteContentCollectorJSExports? { get }
    var changeFile: NoteContentCollectorJSExports? { get }
    var insert: InsertCollectorJSExports? { get }
}

Its three properties correspond to the note content collector for a new file, the note content collector for a change to a file, and the insert collector for a text change. This is why NoteContentCollectorJSExports was a bit weird: because it’s not a one-to-one match to just “create a new file” and to just “change a file”. I do consider splitting this up into two protocols. I haven’t yet, though. It worked so far, but this API isn’t finalized.
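From the script’s point of view, the refactoring example mentioned earlier could use these collectors roughly as follows. The `input` and `output` objects are stand-ins for what the app would inject, the filename is made up, and both the `[[…]]` link syntax and the idea that the app provides the new file’s name up front are assumptions.

```javascript
// Stand-ins for the objects the app would inject (assumption for illustration).
const input = { text: { selected: "Some paragraph to extract." } };
const output = {
  newFile: { filename: "202211151200 Extracted.md", content: "" },
  changeFile: null, // not declared in this hypothetical plugin's manifest
  insert: { text: "" },
};

// The plugin script: move the selection into a new note, leave a link behind.
output.newFile.content = input.text.selected;
output.insert.text = "[[" + output.newFile.filename + "]]";
```

After evaluation, the host would find the new note’s content in `output.newFile.content` and the replacement for the selection in `output.insert.text`.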

So this is probably the best point in time for a short break so you can ask questions.

Questions

Will this approach work for more complex applications than text input and output?

Maybe I have a question. So basically this is probably really good for text processing, I think, right?

So if you would want to make a very complex plugin, let’s say maybe it has a UI, it’s maybe very complicated with JavaScript. But what do you think are the limits here? So, I mean, yes, maybe with your inputs and outputs here, you could define a lot there. And maybe you could also define UI components there.

But as far as I can see now, to me it looks like it’s very well fitted to, let’s say, text processing or editor stuff, but for more complex plugins, maybe it could get too complex. What do you think?

I’m thinking of the design app “Sketch”. I guess every one of you knows this. They support plugins that can either display windows with controls or maybe sidebar content.

I’m not really sure, but plugins can display their own UI and then do stuff. I haven’t tried, so I don’t know if the evaluation has to be changed. I think the context has to be kept alive for while the plugin is active. So it can be stateful. I think that’s a change that needs to be made.

And then also, as you said, how do you make the UI stuff available? I believe the Sketch people offer to basically bridge or import the Cocoa framework, the AppKit stuff, and offer all the NSWindows, NSButtons, stuff like that. You can access this directly in the JavaScript code.

You can write UI in JavaScript, and this is all being bridged to Objective-C land and then being executed there, and then you get some UI. But you also have to, well, write all your view controllers. Then it’s getting a lot.

The examples I saw were interesting, but they were also quite complicated. So I don’t believe that it’s simple to write these plugins, but it’s possible. So you’re right, the approach I’m showing is evaluate, finish evaluation, and then report the result after evaluation is finished. You could also make this observable: instead of collecting the outputs in the end, assume there is no end, and then just collect the outputs as the objects are changing.

The JSExport protocol, which I used for my output collectors, supports this. So this would work, but I’d rather not go in that direction too far. With SwiftUI, it could be interesting because it’s more declarative, resulting in shorter plugin UI code and simpler bridging.
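The observable variant of collecting outputs could be sketched like this in plain JavaScript. The `makeObservableInsert` helper and its `onChange` callback are hypothetical; in the app, the notification would cross the bridge to Swift instead of pushing into an array.

```javascript
// Hypothetical observable collector: notifies the host on every change.
function makeObservableInsert(onChange) {
  let text = "";
  return {
    get text() { return text; },
    set text(newValue) {
      text = newValue;
      onChange(newValue); // host reacts right away instead of after evaluation
    },
  };
}

const seen = [];
const insert = makeObservableInsert((value) => seen.push(value));
insert.text = "first";
insert.text = "second";
```

Each assignment is reported immediately, so a long-lived, stateful plugin would not need a defined “end” of evaluation.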

Maybe you can bridge this with custom code, with the custom bridging, but exposing? Not sure.

What’s making things secure?

Your presentation title points out the aspect of secure extensibility. I understand the checksum implementation. Are there other aspects?

Yes. If the manifest is not declaring the inputs and outputs, the script doesn’t have access to them. That’s the rule of my framework, of the library that I’m writing.

You only get access to the output collectors that you declared in the manifest file, which means when the user activates the plugin, you can display what the plugin has access to. And then you can also protect users from, maybe, file access. Let’s say that unless users explicitly give consent, they’re only allowed to enable plugins that do text changes; all the other plugins can’t be enabled.

I’m thinking about this for the app myself, actually. This is like tier zero of access to dangerous features. And then it gets more and more dangerous, like create a file in the background, potentially overwriting existing files.

This is already getting a bit dangerous, but it’s not as dangerous as deleting files or maybe accessing files outside the application’s directory of notes. So if you have one directory full of notes and the plugin tries to write to a file, or you teach your plugin system to write to a file in the ~/Downloads or ~/Documents folder, this can get really dangerous if you replace user files outside the current scope of the application.

So I’m not even allowing this by not offering an output part that can do this. But we are actually considering adding tiered access restrictions. Let’s say a “danger zone” where these things are allowed. Like if people really, really switch the lever, pull the plugs, do whatever, and scribble their consent on paper and fax it into our offices, then they get access to this stuff, and then they can do the most dangerous things like perform any file I/O. Until then the plugins will start innocently enough.
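Such tiered gating could be sketched as follows. The talk only says tiers are being considered, so everything here is hypothetical: the tier names, the `anyFileIO` flag, and the `requiredTier` and `canEnable` helpers.

```javascript
// Hypothetical tiers, ordered from least to most dangerous.
const TIERS = ["textChanges", "managedFiles", "dangerZone"];

// Map a manifest's declared outputs to the tier it requires.
function requiredTier(manifestOutput) {
  if (manifestOutput.anyFileIO) return "dangerZone"; // made-up flag for illustration
  if (manifestOutput.file) return "managedFiles";
  return "textChanges";
}

// A plugin can be enabled only if the user consented to its tier or a higher one.
function canEnable(manifestOutput, consentedTier) {
  return TIERS.indexOf(requiredTier(manifestOutput)) <= TIERS.indexOf(consentedTier);
}
```

For example, `canEnable({ file: "new" }, "textChanges")` is false, while the same plugin can be enabled once the user has consented to "managedFiles" or "dangerZone".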

I can teach the framework more outputs. I can teach it to output to user-selected folders with security scoped bookmarks, maybe produce exports on the desktop or wherever people put their HTML exports, or PDF conversions.

I really don’t want to go into the direction to teach the plugin authors or the plugins to do web requests. I’d rather leave that out, because the web is a dangerous place, but it’s also a technical possibility.

The security part is that the app can gate features. For other applications, you could even do this within different pro subscription tiers. The free version could access the basic stuff, and you have to pay to unlock the pro tier or whatever, to get access to the more advanced plugins.