Before starting, I want to lay out my intent with this blog post.
I'm frustrated with much software I use every day. I want to outline a kind of software I would like to use.
I hope that in putting this out into the world, someone will tell me this already exists, and I can stop thinking about building it.
Unfortunately I have to eat 😔, which means that I cannot build whatever I want, and "wouldn't this be cool" is not a good monetization strategy. In fact, for me "wouldn't this be cool" often sits in direct opposition to monetization. The software I find most exciting removes centralized control, provides data ownership, and enables interoperability.
So, say money were not an issue. Maybe I become unimaginably rich. Maybe I find a nice VC who wants to throw away some money. What would I build?
First, a grab bag:
These are split into two groups:
First I would make solving these problems easy (make a library), and then easily solve these problems (make some apps).
All of the problems I want to solve can be structured as block storage with transactions and peer-to-peer sync.
Define a Block to be composed of:
- id.
- type.
- created_at, modified_at, deleted_at.
- blob, which is just a bunch of opaque bytes used by an app.

Define a Machine as a unique identifier which is stable for a given user-machine-app triplet.
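A minimal sketch of these two definitions in Python. The field names come from the list above; everything else is an assumption made for illustration: timestamps as Unix seconds, `deleted_at` acting as a tombstone, and the hypothetical `machine_id` helper that hashes the triplet with `uuid5`.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    id: str
    type: str
    blob: bytes  # opaque to the library; only the app interprets it
    # Assumption: timestamps are Unix seconds as floats.
    created_at: float = field(default_factory=time.time)
    modified_at: float = field(default_factory=time.time)
    deleted_at: Optional[float] = None  # assumption: None means "not deleted"


def machine_id(user: str, machine: str, app: str) -> str:
    """Hypothetical helper: a stable ID for a user-machine-app triplet."""
    # uuid5 is deterministic, so the same triplet always yields the same ID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{user}/{machine}/{app}"))


note = Block(id="b1", type="note", blob=b"hello")
```

Deriving the identifier from the triplet is only one option; a random ID generated once and persisted on the device would satisfy the same "stable" requirement.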
The library would be built out of layers:
At rest a Block is stored in the storage layer. The storage layer needs to be able to do two things:
- Read a Block by its ID.
- Write multiple Blocks at once.

You can imagine implementing this in any number of ways: a SQL database, a filesystem, cloud object storage.
This structure must give O(1) access to a given block based on its ID. Using this and transactional writes, you can support other more interesting kinds of lookups. You could create a Block which is an index. Or you could store a number of other blocks' keys.
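To make the two storage operations concrete, here is a toy in-memory version. It is a sketch, not a proposal for the real storage layer: `MemoryStore`, `get`, and `write_many` are made-up names, the Block is trimmed to three fields, and "transactional" here only means that a bad Block aborts the whole batch.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Block:
    id: str
    type: str
    blob: bytes


class MemoryStore:
    """Toy storage layer: a dict gives O(1) average-case lookup by ID."""

    def __init__(self) -> None:
        self._blocks: Dict[str, Block] = {}

    def get(self, block_id: str) -> Optional[Block]:
        return self._blocks.get(block_id)

    def write_many(self, blocks: Iterable[Block]) -> None:
        # Stage and validate everything first, so one bad Block aborts the batch.
        staged: Dict[str, Block] = {}
        for block in blocks:
            if not block.id:
                raise ValueError("Block is missing an id")
            staged[block.id] = block
        self._blocks.update(staged)


store = MemoryStore()
task = Block(id="task-1", type="task", blob=b"buy milk")
# An index is just another Block whose blob lists other Blocks' IDs.
index = Block(id="index:tasks", type="index", blob=json.dumps(["task-1"]).encode())
store.write_many([task, index])
```

Writing the task and its index in the same batch is what keeps the index from ever pointing at a Block that does not exist.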
To start, this would be very simple. Each Machine would catalog a set of peers it wants to connect to, along with a mechanism to trust each peer. Imagine a manifest which lists the IP addresses to connect to and each peer's base64-encoded public key. You can imagine implementing auth on top of this with an Ed25519 challenge-response.
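A sketch of what loading such a manifest could look like. The JSON layout (`peers`, `address`, `public_key`) is an assumption, not a defined format, and the key below is an all-zero placeholder. The challenge-response itself is left out because Python's standard library has no Ed25519 implementation; the idea would be that a peer signs a random nonce and the other side verifies the signature against the key from the manifest.

```python
import base64
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Peer:
    address: str       # "host:port" to dial
    public_key: bytes  # raw Ed25519 public key


def load_manifest(text: str) -> List[Peer]:
    """Parse a (hypothetical) JSON peer manifest and sanity-check each key."""
    peers = []
    for entry in json.loads(text)["peers"]:
        key = base64.b64decode(entry["public_key"], validate=True)
        if len(key) != 32:  # Ed25519 public keys are exactly 32 bytes
            raise ValueError(f"{entry['address']}: expected a 32-byte public key")
        peers.append(Peer(address=entry["address"], public_key=key))
    return peers


manifest = json.dumps({
    "peers": [
        {
            "address": "192.0.2.10:7000",  # documentation-only IP range
            "public_key": base64.b64encode(bytes(32)).decode(),  # placeholder key
        }
    ]
})
peers = load_manifest(manifest)
```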
Later I would like to enable end-to-end encryption, so that one could share their data over an untrusted relay knowing that only their nodes can access their data. I know that this is possible, but I don't know how to do it, so I'm ignoring it for now.
I would want this system to support multiple transport mechanisms. It should work on any transport that can send a logical stream of bytes. To start I would target good old-fashioned TCP (for most nodes) and WebSockets (for the web). It should be easy to support a new transport protocol.
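One way to keep transports swappable is to have the library depend only on "something that can send and receive bytes" and put message framing on top. The sketch below assumes a 4-byte length prefix, which is a choice made here and not something the design specifies; `ByteStream`, `send_frame`, and `recv_frame` are hypothetical names. A local socket pair stands in for a real TCP connection.

```python
import socket
import struct
from typing import Protocol


class ByteStream(Protocol):
    """Anything that can move a logical stream of bytes (TCP socket, adapter, ...)."""

    def sendall(self, data: bytes) -> None: ...
    def recv(self, n: int) -> bytes: ...


def send_frame(stream: ByteStream, payload: bytes) -> None:
    # 4-byte big-endian length prefix, then the payload itself.
    stream.sendall(struct.pack(">I", len(payload)) + payload)


def _recv_exact(stream: ByteStream, n: int) -> bytes:
    # A stream may return fewer bytes than asked for, so loop until we have n.
    chunks = []
    while n > 0:
        chunk = stream.recv(n)
        if not chunk:
            raise ConnectionError("stream closed mid-frame")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_frame(stream: ByteStream) -> bytes:
    (length,) = struct.unpack(">I", _recv_exact(stream, 4))
    return _recv_exact(stream, length)


left, right = socket.socketpair()
send_frame(left, b"hello")
send_frame(left, b"")
first, second = recv_frame(right), recv_frame(right)
left.close()
right.close()
```

A WebSocket transport would need a small adapter exposing the same two methods, since WebSockets deliver whole messages and not a raw byte stream.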
The sync algorithm would allow two peers to symmetrically exchange data, with either end initiating a connection. The sync algorithm would choose the minimum number of Blocks to send over.
At a high level the sync algorithm would look like:
- […] [0, 1].
- […] json.
- […] Machine?
- […] Blocks to stream based on type and recency of modification.
- […] Blocks and the other reading.

I am a bit afraid of sync algorithms, because I know the cost of under-selecting Blocks to share. I'm sure the final version of this would be better thought through. Perhaps this would be a good place to use deterministic testing like Antithesis.
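As a rough illustration of "send the minimum number of Blocks, ordered by type and recency", here is one possible selection step. It rests on an assumption not spelled out above: that a peer can report the `modified_at` it already holds for each Block ID (`peer_versions`). The `type_priority` map and the function name are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Block:
    id: str
    type: str
    modified_at: float


def blocks_to_send(
    local: Iterable[Block],
    peer_versions: Dict[str, float],  # assumption: block id -> modified_at the peer has
    type_priority: Dict[str, int],    # lower number streams first
) -> List[Block]:
    # Only send Blocks the peer lacks or holds an older copy of.
    needed = [b for b in local if b.modified_at > peer_versions.get(b.id, float("-inf"))]
    lowest = len(type_priority)  # types without a priority go last
    # Within a type, the most recently modified Blocks go first.
    return sorted(needed, key=lambda b: (type_priority.get(b.type, lowest), -b.modified_at))


local = [
    Block("idx", "index", 5.0),
    Block("a", "task", 10.0),
    Block("b", "task", 20.0),
    Block("n", "note", 30.0),
]
peer_versions = {"a": 10.0, "b": 15.0}
order = [b.id for b in blocks_to_send(local, peer_versions, {"index": 0, "task": 1})]
```

Under-selection is exactly the failure mode to test for here: any bug in the `needed` filter silently leaves a peer stale.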
This is the part where you expect me to say "everything is a CRDT," but I actually don't think that's the right approach!
Merging data should be an application-level concern. The core library only understands Blocks as containing an opaque series of bytes. The only conflict-free algorithm we could implement is "last-write-wins," which is not suitable for many applications.
Instead, I want to let each application define a set of merging algorithms. Each one would apply to one or more types of Block. If the application contains a merging algorithm for a Block's type, it is applied. If it does not, we use last-write-wins.
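A sketch of how per-type merge functions with a last-write-wins fallback might be wired up. The `MergeRegistry` API and the `tag-set` example type are invented for illustration; the Block is trimmed to the fields the merge needs.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Block:
    id: str
    type: str
    modified_at: float
    blob: bytes


Merger = Callable[[Block, Block], Block]


def last_write_wins(local: Block, remote: Block) -> Block:
    # Ties keep the local copy; a real system would need a deterministic
    # tie-breaker so that both peers converge on the same Block.
    return remote if remote.modified_at > local.modified_at else local


class MergeRegistry:
    def __init__(self) -> None:
        self._mergers: Dict[str, Merger] = {}

    def register(self, *types: str) -> Callable[[Merger], Merger]:
        def decorator(fn: Merger) -> Merger:
            for t in types:  # one algorithm may cover several Block types
                self._mergers[t] = fn
            return fn
        return decorator

    def merge(self, local: Block, remote: Block) -> Block:
        return self._mergers.get(local.type, last_write_wins)(local, remote)


registry = MergeRegistry()


@registry.register("tag-set")
def union_tags(local: Block, remote: Block) -> Block:
    # App-specific merge: the blob is a comma-separated set of tags, so take the union.
    tags = sorted(set(local.blob.split(b",")) | set(remote.blob.split(b",")))
    newest = max(local.modified_at, remote.modified_at)
    return Block(local.id, local.type, newest, b",".join(tags))
```

The library never looks inside the blob; only `union_tags`, which belongs to the app, knows it is a tag list.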
I know this isn't perfect! There are a couple of things I'm thinking about, even right after this stream of consciousness blog post.
How does live multiplayer work? I'm thinking you just have a keepalive connection to a peer where you continuously stream updates to a set of Blocks. Who knows! 🤷
How do you evolve schemas (Block and blob)? Block schema is hard, but at least for the blob perhaps we include a version tag. If someone sends a version that you do not support, you could just ignore the Block update entirely.
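For the blob half of that question, a version tag could be as simple as the envelope below. The JSON envelope with `version` and `data` keys is an assumption made for the sketch; the blob could just as well be a binary format with a leading version byte.

```python
import json
from typing import Any, Optional

SUPPORTED_VERSIONS = {1, 2}  # versions this build of the app understands


def decode_blob(blob: bytes) -> Optional[Any]:
    """Return the payload, or None if the update should be ignored."""
    try:
        envelope = json.loads(blob)
    except ValueError:  # covers malformed JSON and invalid UTF-8
        return None
    version = envelope.get("version") if isinstance(envelope, dict) else None
    if not isinstance(version, int) or version not in SUPPORTED_VERSIONS:
        return None  # unknown version: skip the Block update entirely
    return envelope.get("data")


current = json.dumps({"version": 1, "data": {"title": "x"}}).encode()
future = json.dumps({"version": 99, "data": {}}).encode()
```

Ignoring an update means an old client keeps showing stale data until it upgrades, which seems like an acceptable trade for a handful of personal devices.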
How does it scale? I don't think it does! I don't know if it needs to. I'm concerned about unique identifiers for Machines, and each Machine having to track each other Machine. I plan to deploy these apps with at most 5 or 6 nodes (all of my and my wife's devices).
How do you do permissions? I think this can be solved at the application layer. If you have listed a peer as trusted, presumably you trust the code it's running. You can mark certain Blocks as being owned by certain users, and prevent other users from updating them. A particular node could reject modifications to Blocks from Machines belonging to users who do not have write permissions, but this leaves the machine in an inconsistent state.
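The rejection check described here could be sketched roughly as follows. The two lookup tables, the rule that unowned Blocks are writable by anyone, and all of the names are assumptions made for the example.

```python
from typing import Dict, Optional


def accepts_write(
    block_id: str,
    from_machine: str,
    owners: Dict[str, str],         # block id -> owning user
    machine_users: Dict[str, str],  # Machine id -> user it belongs to
) -> bool:
    user: Optional[str] = machine_users.get(from_machine)
    if user is None:
        return False  # unknown Machine: not one of our trusted peers
    owner = owners.get(block_id)
    # Assumption: Blocks without an owner stay writable by every known user.
    return owner is None or owner == user


owners = {"budget-2024": "ana"}
machine_users = {"ana-laptop": "ana", "ben-phone": "ben"}
```

Note that this only decides whether the receiving node applies the update; it does nothing about the sender, which still holds its rejected version.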
If you have more questions or critiques, please share them! I want this idea to be as good as it can be :)
What could I build with this?
A relay would blindly collect and relay Blocks from every Machine it can peer with. It would need to act a little differently, as it would have no "application layer" to perform merges. Instead it would need to store the most recent version of each Block from each Machine that has synced to it.
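A minimal sketch of the relay's bookkeeping, assuming "most recent" is judged by `modified_at` and that versions are keyed by a (Machine, Block ID) pair. The `Relay` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Block:
    id: str
    modified_at: float
    blob: bytes


class Relay:
    """Keeps the newest copy of each Block per Machine and never merges."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str], Block] = {}

    def receive(self, machine: str, block: Block) -> None:
        key = (machine, block.id)
        current = self._latest.get(key)
        # Only replace when the incoming copy is at least as new.
        if current is None or block.modified_at >= current.modified_at:
            self._latest[key] = block

    def versions(self, block_id: str) -> Dict[str, Block]:
        """Every Machine's latest copy of one Block, for a peer to merge itself."""
        return {m: b for (m, bid), b in self._latest.items() if bid == block_id}


relay = Relay()
relay.receive("laptop", Block("b1", 1.0, b"v1"))
relay.receive("laptop", Block("b1", 3.0, b"v3"))
relay.receive("laptop", Block("b1", 2.0, b"stale"))
relay.receive("phone", Block("b1", 2.0, b"p2"))
```

Storage grows with Machines times Blocks, which fits the "5 or 6 nodes" scale but would not fit much beyond it.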
A budgeting app would be structured as 3 kinds of Blocks, with supporting indices. An account, a transfer, and a recurring transfer (which would generate a transfer on a recurring basis). One can import data from their financial providers using SimpleFIN.
A task manager and planner would be structured as two kinds of Blocks. A task and a graph. A task would contain all of the metadata typically associated with a task in task management (title, description, due date, etc.). A graph would just contain connections between task IDs.
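The two kinds of Block might decode into something like the following. The field list follows the paragraph above; treating connections as undirected pairs, storing `due_date` as an ISO 8601 string, and serializing the graph to JSON are all assumptions.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Task:
    id: str
    title: str
    description: str = ""
    due_date: Optional[str] = None  # assumption: ISO 8601 date, e.g. "2025-01-31"


@dataclass
class Graph:
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def connect(self, a: str, b: str) -> None:
        self.edges.add((a, b))

    def connected_to(self, task_id: str) -> List[str]:
        # Assumption: connections are undirected, so look at both ends.
        out = {b for a, b in self.edges if a == task_id}
        out |= {a for a, b in self.edges if b == task_id}
        return sorted(out)

    def to_blob(self) -> bytes:
        # Sorted so the same graph always serializes to the same bytes.
        return json.dumps(sorted(self.edges)).encode()


graph = Graph()
graph.connect("t1", "t2")
graph.connect("t3", "t1")
```

Because the graph only stores task IDs, a task's title or due date can change without touching the graph Block at all.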