CRDTs: The Hard Parts by Martin Kleppmann
Transcript
[00:00] Hello everybody, my name is Martin Kleppmann. I'm a researcher at the University of Cambridge and today I'd like to talk about some of our work on conflict-free replicated data types or CRDTs. So I'm going to start this talk with a very brief introduction as to what CRDTs are and why they are useful, but most of this talk is going to be about new results that we have figured out only within the last year or two. So you might know me from this book here, which I wrote a couple of years ago. It's called Designing Data Intensive Applications, and it's a broad overview of the architecture of data systems and how to choose your database for your particular application.
[00:40] But that's not what we're talking about today, so today we're talking about collaboration software. So in collaboration software, I'm taking a fairly broad view of this. This is any kind of software where several users can come in and update a document or database or some kind of state, shared state, and various examples of software that fits in this category. Google Docs is an obvious example where you can have several people editing a document at the same time, be it a text document or a spreadsheet, similar things you get with Office 365. Figma is an example of graphics software that supports multi-user real-time collaboration.
[01:19] Trello is an example of project management software where several users can update their status of various tasks, assign the tasks, add comments and so on. So all of these have in common that the users are updating some kind of shared state. Now I'm going to use as example collaborative text editing. For now, there will be some examples from other types of applications later as well. So In collaborative text editing, you can end up with a scenario something like this.
[01:49] You have two users who, at the beginning of the time that we're looking at, they have the same document, which is Hello! And then the red user comes along and changes this document to hello world. So by inserting the word world before the exclamation mark and concurrently while the red user is doing this, simultaneously the blue user changes the document to add a smiley face after the exclamation mark. So these two changes happen concurrently without knowledge of each other and then later on as the network communicates those changes, so these users communicate over the internet or whatever network they are using and they should converge to the same state. So we want that at the end, both users have the same document on their screen and that document reflects all of the changes that were made.
[02:40] And in this case, the merged outcome is quite reasonable. What we expect, we expect the word world before the exclamation mark and the smiley face after the exclamation mark. Okay, and so there are broadly two families of algorithms for achieving this kind of collaborative text editing. One is operational transformation or OT, And that's kind of the classic of multi-user collaborative editing that has been around for a long time and that is used, for example, by Google Docs. And then there's a newer family of algorithms called CRDTs, which is what I work on, which take a bit of a different approach to solving a similar problem and we will compare these two in a minute.
[03:22] So I will start with a brief summary of how operational transformation works because that is then useful as a comparison point for CRDTs. So with operational transformation, we have to first of all describe how changes are made to a document. And so when a user changes a document, they insert or they delete a character at some position in the document. And so we can describe the positions at which those insertions and deletions happen by using indexes. So we simply number the first character in the document to be H, the first character to be 0, the second character to be 1, and so on.
[04:06] And so now you can have two users updating their document, and here on the left-hand side the user changes the document H-E-L-O by adding a second L at position 3. And on the right-hand side we have the user, the purple user, adding an exclamation mark after the letter O. And so now, after these two changes have happened, these changes need to be communicated over the network so that one user finds out about the changes made by the other user. And so from the left-hand side here we have this insertion of the letter L at position 3, or at index 3 in this document, and we have the insertion of the exclamation mark at index 4 on the right-hand side. So let's first of all take the insertion from the left-hand side and move it over to the right-hand side.
[04:59] And so if there's a server, for example, that forwards these messages from one user to the other, the server can just forward this insert L at position 3 operation over to the right-hand side, and on the right-hand side we insert L at position 3 and we end up with hello with two Ls and an exclamation mark, just as we wanted. Now what happens if we go in the other direction, so taking the exclamation mark insertion from the right-hand side to the left-hand side? Note that if we simply forward this operation, insert exclamation mark at position 4, if we simply forward that unchanged, then the user on the left-hand side would end up with the document saying, hell! O, which is not what we want. We want both users to end up in the same state, and for that state to reflect the intention of what the users were writing.
[05:50] And so instead what has to happen is instead of inserting at position 4, the exclamation mark needs to be inserted at position 5. The reason for that is that concurrently there has been another insertion at position 3 before the position 4 insertion. And so the insertion at position 4 needs to be transformed into position 5 due to the concurrent insertion of the L earlier on. So this is why the algorithm is called operational transformation. We have to take these insertion and deletion operations and depending on which other operations happen concurrently we have to transform the indexes or the positions at which those operations take place.
[06:33] So this algorithm does work, but it requires one very particular assumption, and the fundamental assumption in most operational transformation systems is that all of the communication goes via a single central server. So in Google Docs case, that server is provided by Google. And the server is, you know, it takes a really key role in this algorithm by sequencing all of the operations. And the assumption of operational transformation is that because we have this central server that is required to sequence all of the operations, there cannot be any other communication channels in the system. So Even if two users who are collaborating are actually sitting in the same room and they could just be using the local Wi-Fi or even Bluetooth to be communicating the changes between their two devices, using that local communication channel is not allowed because it would undermine the assumptions that are made by the operational transformation algorithm.
[07:34] So OT requires all of the communication to go via the central server, even if that server is on a different continent and even if the two users are actually sitting in the same room. So this is the difference of where CRDTs come in. CRDTs allow this kind of multi-user collaboration without making any assumptions about the network topology. So it doesn't assume anything about servers or what kind of network is used. Any type of communication channel can be used to communicate the operations or the updates from one user to another.
[08:06] And the main correctness criterion that we have for both OT and CRDTs is what we call convergence. So what we require is that whenever any two users have seen the same set of operations, then those users must be in the same state. And this happens even if the users actually saw the operations in a different order. So What we do in CRDTs is we make the operations commutative, which means that even if they are swapped around, their order is swapped around, the end result remains the same. And this is how we can achieve convergence.
[08:41] It's just ensuring that after everyone has communicated, everyone has the same document on their screen. And this makes sense, it's of course a very useful property. If we did not have this, we would have some kind of permanent inconsistency between the users, and that would be bad. So convergence is clearly a minimum requirement for any of these sort of collaboration software. But as we shall see in this talk, convergence by itself is not really enough because convergence doesn't say anything about what that final state is about after the nodes have communicated.
[09:15] So is that final merged state actually the state that we wanted? And this is a little bit more difficult to define because what state is desirable or not desirable is really a matter of human judgment. It's what are, as humans, what is our expectations of the software, of how should the software behave. And unfortunately it turns out that several of these algorithms for collaborative editing sometimes have behavior that is not really desirable, that behaves weirdly in some ways that we as humans would not expect. And I have come to the conclusion after several years of working on CRDTs that basically these problems are quite hard, And a simple version of CRDTs is very easy to implement, but actually getting it right in such a way that it actually satisfies the user expectations is difficult.
[10:10] So CRDTs are easy to implement badly. And what we shall have a look at in this talk is how we can implement CRDTs well in a way that actually behaves in a way that we want. So today I want to look at these four topics. These are all areas in which CRDTs are difficult in some way or another, and for most of them we have solutions as well so I'm going to just talk through those one by one and the one that I want to start with is interleaving. So interleaving is a problem that can happen in collaborative text editors if several people insert text at the same position in the document.
[10:49] Now, I'm going to, in order to explain why this interleaving happens, I'm going to first give a little introduction as to how CRDTs for text editing work. And I'm going to start this with an example. So as I said earlier, when we were looking at operational transformation, we identify the position where we want to insert or delete a character by its index, simply by counting from the start of the document. And that has the problem that those indexes need to be transformed. So CRDTs avoid the need for this kind of operational transformation by instead taking a different approach.
[11:26] Instead, CRDTs create a unique identifier for every single character in the document. And that unique identifier remains stable, it remains the same, even if stuff is added and deleted elsewhere in the document. So each character keeps its own permanent unique identifier. And there are several possible ways of constructing such identifiers. One way is we can choose a number between 0 and 1, so some kind of fractional number, a rational number, and we say, for example, 0 is the beginning of the document, 1 is the end of the document, and for every position somewhere, every character somewhere within the text document, we will just pick a number that is appropriately positioned within that number line.
[12:12] So here in our example of H, E, L, O, we could say, OK, let's assign the number 0.2 to the h, the number 0.4 to the e, the number 0.6 to the l, the number 0.h to the o. And so this way we have spaced out our characters over our 0 to 1 number interval. And now when we want to insert a second letter L and an exclamation mark, we pick numbers in between the intervals. So we want to insert a second L in between the first L and the O, so in between 0.6 and 0.8. OK, let's pick 0.7 as the midpoint, and likewise for the exclamation mark we pick a number between 0.8 and 1.0, so we pick 0.9 for example.
[12:55] And so this is a simple enough scheme, and we can use this to implement a CRDT in quite a simple way. So what we can now do is we can think of our text document as a set of tuples. And each of these tuples contains the character at a particular position. It contains the numeric value that tells us where this goes. And then finally here, the A that I've put in these triples is just a way of referring to the node ID that created this particular insertion operation.
[13:31] So this is just some kind of unique identifier for a particular user or a particular node in the system. And the reason we have that is just in case two different nodes pick the same number, then we can define the ordering on these characters based on the node ID as a tiebreaker. And so we now just have this set of tuples representing the document and we can make edits, we can insert new characters into this text document just by picking numbers for them and then calculating the set union and we get a set including the new tuples, and we can reconstruct exactly what the state of the document is by just sorting those tuples by their numbers, tie-breaking based on node IDs, and then we get the text document. Several CRDTs for text take essentially this approach. They tend to not actually use numbers between zero and one.
[14:24] Instead, they use things like a path through a tree, but the effective idea is still the same. They are just essentially picking positions on a number line. And so when we implement a CRDT in this way, we can have some unfortunate behavior. And this is an example of behavior that can occur in some text editing CRDTs. So here we have two users.
[14:47] On the left-hand side, user 1, who starts off with a text document saying, hello! And user 1 inserts the word Alice between hello and the exclamation mark. And concurrently on the right-hand side, user 2 inserts the word Charlie in between hello and the exclamation mark. Now the two users communicate, they merge their changes, and we got what we get out at the end is hello and then some kind of weird jumble of letters that we can't read. What does this even mean?
[15:19] How did this happen? We started with Alice and Charlie, and we ended up with some kind of mess of letters. Well, what happened is this here. So we started off with a text document where Each character has a number on the number line. And hello, the O had 0.64 and the exclamation mark had 0.95.
[15:40] And so then the insertion of Alice or space Alice just chose some numbers in between that interval, 0.64 and 0.95, and likewise Charlie insertion independently chose some numbers within that interval. And so what a lot of CRDTs do for this kind of thing is they actually randomize the numbers a little bit, because that reduces the probability that two different nodes will pick exactly the same position, pick exactly the same number for a particular character, so reduce the risk of collisions. So what has now happened with this insertion is that, well, we have taken all of the numbers from Alice, which are coded in red here, and the insertions of the Charlie letters, which are in blue, and we simply ordered them based on the numbers. And the result, unfortunately, is that we've now interleaved those two character sequences that have come from the two different users. And the result is, unfortunately, a complete unreadable mess.
[16:44] Now, This problem is bad enough if we're just inserting a single word, as we are here. It gets even worse if what the two users have done is concurrently inserted an entire paragraph, or maybe an entire section, into their text document. If you have a paragraph or a section in which the characters have been interleaved on a basically random basis, you will not be able to use that text anymore, you will just have to delete it and rewrite it again, and the users are not going to be happy about this. Unfortunately, this behaviour really does occur in real CRDTs, and we have examined a number of different CRDT algorithms for text editing, and we have demonstrated this problem at least in the algorithms called Logut and LSEQ. And so in these algorithms, unfortunately, this interleaving problem is very deeply baked into the fundamental way how the algorithm is constructed.
[17:42] I cannot see any way of fixing this problem, fixing this bug in the algorithms, because the way how these numbers are assigned or the way the paths through the trees are assigned, this interleaving idea is just very inherently baked in to the way these algorithms work. Some other algorithms we have not found this problem. So in TreeDock and Woot, for example, we did not find this interleaving problem. This does not entirely guarantee that it can never happen, but we think it won't happen. And so In a way, these algorithms are safe.
[18:17] However, unfortunately, these two algorithms are also the ones that are least efficient. The entire reason why the LSEQ algorithm was designed, for example, is because the designers wanted a CRDT that was more efficient than TreeDoc and Root, because TreeDoc and Root do unfortunately take a lot of space for their identifiers. And so unfortunately this performance optimization has now introduced bad behavior in the algorithm, and in my opinion that makes the algorithm like LSEQ and Loggood basically unsuitable for use in many text editing applications. Moreover, there is a specification of collaborative text editing called ASTRONG here, which was published by Atiyah and others in 2016. This specification also allows interleaving as a valid behaviour, though we have a paper in which we fix this specification to rule out this interleaving behaviour.
[19:14] Finally RGA is an interesting one. So RGA is another text editing CRDT algorithm which does not quite have this same interleaving problem. It has a bit of a less of an interleaving problem, and so I will mention that briefly because it's an interesting case study. So in RGA, the interleaving problem that can happen is like this. So in this example here, on the left-hand side, we've got user 1, who starts off with the document saying hello, then the user inserts the word reader between hello and the exclamation mark, then the user moves their cursor back again after the word hello and types the word dear.
[19:54] So what we have here is typing the word reader, moving the cursor back, and then typing the word dear. On the right-hand side we have user 2, who just types the word Alice between hello and the exclamation mark. And what can now happen in RGA is that the word Alice can get interleaved in between the word Dear and the word Reader. So even though the text Dear Reader was typed by the one user and the word Alice was typed by the other user, we can end up with the two mixed up a bit. So we don't get the same character by character interleaving as we get with some of the other CRDTs, but we do still have the ability for text to be inserted into the middle of a different user's insertion.
[20:36] And the reason this happens is the data structure that RGA uses looks a bit like this. So it's a kind of tree data structure with very deep subtrees and a small branching factor usually. And what happens here is that each node in this tree is a character that was inserted and the parent of each node is the character that was the immediate predecessor character at the time of insertion. So we start off with a single special node at the root, which is the beginning of the document, and then when the user types H-E-L-L-O! Each of those letters has as its predecessor the character that was just typed, and so you end up with this chain of H-E-L-L-O, like a linked list almost, where each character's parent is just its immediate predecessor in this tree.
[21:34] And then the users come and position their cursors in between the hello and the exclamation mark, and one types space reader, another types space Alice, and the first user who typed reader moved their cursor back and typed dear. And so you can see every time the cursor gets moved, that results in a new subtree being formed here. And then otherwise, the order of the characters in this document is a depth-first traversal over this tree. And so we can see that in the interleaving that has happened here, hello, dear Alice reader, what has happened is we visit hello, then first the dear subtree, then the alice subtree, then the reader subtree, and then the exclamation mark. And the order in which we visit those subtrees is defined by a timestamp that is attached to every node in this tree.
[22:28] And so if we have, in this case here, several siblings, so the O node has several children. Then we visit those children in descending order of timestamp. Because the timestamps are unpredictable in concurrent editing cases, you can end up with these three possible outcomes as the merge results. So it's not as bad as the character by character interleaving that you get with the others. I should be precise.
[22:56] It is possible to get character by character interleaving in RGA. The circumstances in which that happens is if the users type their document back to front. So if the users start at the end of the document and first type the last letter, then move their cursor back to the beginning, type the second to last letter, move the cursor, type the third to last letter, and so on, and type the entire document from the end to the beginning. If they do that, then they would actually get character by character interleaving in RGA. Because in that case, in this tree here, all of our characters would be siblings.
[23:30] They would all be children of the root node. And in that case, the ordering is just defined by timestamps again. But fortunately, in practice, people tend to not actually type documents like that. People tend to type documents in a reasonably linear fashion from beginning to end. And of course occasionally people will backtrack a bit, delete a bit, copy and paste some sections and so on, but as a general rule, typing still happens in a fashion from beginning to end.
[23:58] And so for that reason the interleaving problem is not actually as bad in RGA as it is in the other algorithms. However, we could still say we want to be strict and we want to not allow any interleaving at all, and we did actually devise an algorithm. So for RGA we were able to find a fix, a modification to the algorithm which changes it such that this kind of interleaving of concurrent insertion does not happen regardless of how the cursor was moved. I don't have time in this talk to talk about this particular algorithm, but if you're interested there's a paper that we published in 2019 called Interleaving Anomalies in Collaborative Text Editors, and in this paper the algorithm is described exactly. So let's move on to our next topic.
[24:48] Our next topic today is reordering list items. That is, we have a list of things and we want to move an item from one position in the list to another. So one example in which, an example application in which we might want this is a to-do list. So in a to-do list, often you can drag and drop items into whatever order you want. And so what happens here, for example, is you have a to-do list consisting of firstly, buy milk, secondly, water the plants, thirdly, phone Joe, and the user is dragging and dropping the phone Joe item and moving it to the top of the list, so that this is now at position one.
[25:27] So, how do we do this? Well, We can represent the list using any of the CRDTs that we talked about earlier. So all of the CRDTs for text editing are also CRDTs for lists. And so we can use those to represent this ordered list of items. However, all of those CRDTs don't actually support a move operation.
[25:48] All they have is operations for inserting and deleting items, but no operation for moving. Now, you might think, OK, well, that's not too bad, because we can always implement moving by just deleting the item from its old location and inserting it at its new location. And this works, except it has a problem if several people concurrently move the same item. So in this example here, replica A and replica B both move phone Joe to the top of the list. And so what happens if we implement this moving by deleting and reinserting?
[26:23] Well, they both delete PhoneJoe from position 3, so it's definitely gone from position 3, and then They both insert PhoneJoe at position 1. But the result now is that these two insertions at position 1, they both happen. And so we end up with two copies of PhoneJoe at position 1 and 2. So this item that we wanted to only move has accidentally got duplicated in the process. And this is really not the kind of behavior that a user would expect, I think, of a to-do list application.
[26:54] Just because they reordered some items on their list doesn't mean they suddenly want two copies of those items appearing. So what is the behavior actually that we want? What is the desirable behaviour? Let's look at a similar example here. In this case, we've also got two replicas moving phone Joe, but we've got them moving that item to different positions.
[27:18] So replica A moves phone Joe to the top of the list, whereas replica B moves phone Joe to be at the second position, so after buying milk. So what is the desirable behavior in this case? Well, we could say that now phoneJoe should appear in both positions, before buyMilk and after buyMilk, but then we've got duplication again and we did not want duplication. So really what we want is phoneJoe to appear in one of the two positions, I guess, maybe show a warning saying, hey, user, there's a warning. We picked one of the two positions.
[27:51] Or maybe it's just fine to just pick one of the two and forget that there were two different list positions. So I'm just going to say now, we're going to pick one of the two concurrent positions as arbitrarily but deterministically as the winner and so in this example here we've got PhoneJoe being at the top of the list as the winner so from the point of view of the first user nothing changes but from the point of view of the second user, after they've moved phone Joe to position two, then additionally the phone Joe is then subsequently moved to position one. And now let's think about what is happening here. So we've got two concurrently written values for the two different positions. And we want to deterministically, but arbitrarily, but deterministically, pick one of them as the result, as the winner, as the outcome.
[28:47] And this may look familiar to you if you have studied existing CRDTs because it's just what happens in a last writer wins register. In a last writer wins register, we've got a register which is a variable effectively, which can concurrently have a value assigned to it by different users, and if we have those concurrent assignments, the end result after everybody has merged their state is that we pick one of those values that has been written as the winner and we throw away any other values that have been written concurrently. And so you can think of here as the position of the list item phoneJoe as having a lastWriterWinsRegister semantics to it. That is, first one user assigns the position of phonejoe to be head of the list, the other user assigns the position of phonejoe to be after buymilk, and then as the two users communicate, then the end result is the winner is one of those two values, either head of the list or after buymilk. So let's make this a little bit more formal.
[29:53] What we need is, if we want to implement this here, we need a last writer wins register, okay. Second, We need some kind of unambiguous way of referencing positions in the list. But if you think about what we've just done earlier, when we were talking about unique identifiers for the characters in a text document, what the list CRDTs and these text CRDTs are already doing is providing us with a stable and unique way of referencing particular positions in the list. So we can just use one of these existing algorithms and use its unique identifiers for list elements in order to reference the position of a particular item in the list. Now different CRDTs have different ways of constructing those identifiers as we saw, like TreeDoc uses a path through a binary tree, Logoot uses a path through a tree with a higher branching factor, RGA and causal trees use some kind of timestamps, but the basic idea is still the same.
[30:51] We've got unique identifiers for particular positions in the list, and so we can take those identifiers and put them as the value inside our LastRightToWins register. Secondly, what we need now, if we've got several list items, is we need a separate register for each list item to hold the position for that particular list item. And so, well, we can just construct a set. We can, for example, use a at wins set, set CRDT. And the contents of that set are going to be all of the items on our list.
[31:24] And so we're going to say that each item on the list is going to be a pair consisting of a value, which is like the text on the to-do list item, for example, and a lastWriter wins register containing the position of that particular list item. And so now we can add new items to the list by adding them to the add wins set, we can move item from one position to another by updating the value of their LastWriterWins register. In particular, we use a ListCRDT to create a position identifier for the location where we want to move a particular item, and then we take that position identifier and use it as the value in the LastWriterWins register. So here we've constructed a new CRDT operation now, a move operation for lists, just by using three existing CRDTs, any of these range of different list CRDTs, plus an AddWinSet, plus a LastWriterWins register, and we've constructed a new operation. And because we've just composed these existing CRDTs we know that the end result is also going to be a CRDT.
[32:26] So, this is very nice and it allows us to do these atomic moves of a single list item at a time. Now things get a little bit more difficult if we want to move more than one item at a time. And so in a to-do list you really only need to move one item at a time because typically drag and drop just lets you pick up one item at a time in the user interface. So that's not really a problem. But in text editing, you could imagine other circumstances arising.
[32:55] So in this example here, we've got a to-do list, this time represented just as plain text, as a list of bullet points, each bullet point starts with a bullet point character and ends with a newline character. And so here replica B is taking the item milk and wants to move that in front of bacon. And so that means taking all of the characters, taking the character bullet point space M-I-L-K U line, take that whole range of characters and move the whole range to be in front of bacon. Concurrently with that happening, a different user is updating the milk item to say soy milk and so they do that by deleting the uppercase M and inserting the text soy and a lowercase m. Okay, so what do we expect to have happen here?
[33:44] So firstly, the soy milk, sorry, the change from milk to soy milk takes effect, and the move from milk to be in front of bacon takes effect. So we would expect the desired outcome here is that we have soy milk first in the list in front of bacon. So that means the two changes have become merged together cleanly. This is what we want to have happen. Unfortunately, it's not what actually happens.
[34:10] So if we look at the algorithm that we just constructed for moving a single list item, and if we just move each character individually according to this algorithm, the end result that we get is this here. So we end up with milk is correctly moved in front of bacon, that happens exactly as we wanted, but the change from milk to soy milk remains attached to the old position of where milk was previously, not the new position where milk has now been moved to. And so the end result now is that the list reads milk, new line, bacon, new line, soy M. And that soy M is just standing there without any context because its context has been moved away in the meantime. And this is a problem that I do not know how to solve for the time being.
[34:58] I've spent a while thinking about it, come up with a couple of half solutions that don't really work. A couple of other people are thinking about it as well, so if you're interested in this, feel free, think about it and do publish the result if you manage to figure it out. So, this is the problem of moving items in lists. So we've figured out how to move a single list item at a time, but these ranges of characters, moving them in a way that behaves nicely is still an open challenge. Let's move on to point three, which is again about moving, but now rather than moving elements in lists, we're going to move subtrees in a tree.
[35:39] So let's again look at an example. Let's say we have a tree structure which might be, say, a file system, And like in the case of moving elements in lists, we can think about what happens if the same node in a tree is concurrently moved by two different users to two different locations. And so you can see here we start off with the tree on both replicas where A, B and C are children of the root, and on replica A we're going to take that node A and move it to be a child of B, whereas on replica 2 we're going to take the same node A and move it to be a child of C. So now A has been moved to two different locations on two different replicas. In this case, what do we expect the outcome to be?
[36:28] And it's kind of Similar to the case of moving elements in lists, so one option is to duplicate it, and so what we can have is that we have A appearing as a child of B and also A appearing as a child of C. This means then further any children of A will also have to be duplicated, and that's what you get here in this box labelled A. That is, you have the A subtree with A1, A2, A3 children, and then another whole copy of the whole subtree. So this is duplication like in the case of the lists. I think duplicating nodes on concurrent moves is not a good behavior.
[37:11] So we don't want that. Well another option is for A to be actually shared as a common child of both B and C. This works, but it's no longer a tree. So if what we expect our data structure to be is a tree rather than a DAG or some other kind of graph, then this is not an acceptable outcome, because here A has two parents and in a tree no node ever has two parents. So that leaves C and D as the possible outcomes.
[37:43] C is simply we take the replica 1 operation as our outcome and ignore the move made by replica2, or vice versa, we can choose replica2 to be the winner and to ignore the move made by replica1. So in this case, either A appears as a child of B or A appears as a child of C, but not both. So like in the case of moving atoms in lists, what we have is a kind of last writer wins behaviour here, and I think that is reasonable. Now, in trees, additional complications appear. And so trees are definitely a harder case than lists.
[38:26] And so another example of tricky things that can happen in trees can be illustrated with your file system. So if you want you can try this on your computer and just create a directory called A and then create a sub-directory within A called B and then try moving A into B. So move A to be a child of A slash B. This may sound weird. What you are doing here is effectively trying to move a directory into itself.
[38:57] I don't know if you've ever tried this. I did try it on Mac OS. I get this response here from the shell saying invalid argument. I am not allowed to do this move. And this makes sense because if we did perform a move where we move a directory into itself we would kind of be creating a cycle and then our data structure would no longer be a tree, it would be some kind of graph with cycles in it, and this would be very confusing for a file system.
[39:25] So, if we build CRDTs for trees, it also makes sense that if we allow moving sub-trees, we also prevent this kind of cycle, prevent somebody moving an object inside itself. But unfortunately with CRDTs, we have the additional complication of concurrent changes. So consider this example here, where we start off with the same tree again on both replicas. On replica 1, we take the node A, sorry, we take the node B and move it to be a child of A. So here B appears as a child of A, while C remains as an existing child of A.
[40:05] On replica 2, we start off with the same tree, and in this case we move A. We move A to be a child of B. So we end up here with A as a child of B, and C as a child of A as before. What happens now if we merge these two? So we've got here two move operations, each of which by itself is perfectly fine.
[40:26] There's nothing wrong with moving B to be a child of A and there's nothing wrong with moving A to be a child of B. But if we combine both of these, we end up with exactly the same problem as we just had, where effectively one directory gets moved inside itself. And so if we're not careful with this algorithm, we end up with A and B in this kind of cycle, disconnected from the root of the tree, and well, this is again no longer a tree, it's some kind of more general graph data structure. So we don't want this if we want to preserve the fact that we have a tree. Another option, as before, is we can duplicate nodes, so we can say that, well, we move B to be a child of A, and we move A to be a child of B, so we have both options, both B is a child of A, and A is a child of B, both simply exist in our directory graph, in our tree, and so again we have to duplicate any of the children of those, and as before again I think duplication is the wrong way of doing this.
[41:26] So it seems as before what we really want is a last-writer-wins semantic here where either this option here that is the outcome of moving B into A and ignoring the other move, or we choose the outcome of moving A into B and ignoring the other move. So in this case we want to pick one of the two conflicting operations and we want to just ignore the other one. But we have to ensure that all of the replicas end up picking the same operation as the winner and ignoring the same other operation because of course otherwise they won't converge and then it's not a CRDT. So how do we actually achieve this now where we want to ensure that of these conflicting operations exactly one gets picked? Well, I don't know, we can try some existing software.
[42:18] I tried this with Google Drive, for example, and I got this wonderful error message here. I tried exactly just having two directories, A and B, move A into B and move B into A, concurrently on two different computers. What happens? I get some kind of internal unknown error, try again, but this error never goes away, it just keeps sitting there in a loop trying to sync. So Google Drive hasn't solved this problem either.
[42:44] Okay, I guess we'll have to think about it for ourselves. So the way that we solve this is by thinking about the sequence of operations that get applied on each replica and having a timestamp on each operation. And We're just going to assume some globally unique timestamp as usual, so you can use Lamport timestamps, for example, that works fine. And of these operations, some of them are going to be move operations and some might be some other kind of operations. And here we have on replica one, we have moving A into B and moving B into A on replica 2.
[43:22] And as I said before, each of these operations by itself is safe and fine to execute, but we need to ensure that we detect the case when two operations are in conflict with each other in the sense that executing both of the operations would create a cycle. So we have to ensure that no matter what we do, we do not create a cycle. Now, what can happen now if these two replicas merge, their sequences of operations is, well, we take the union of the operations from both of the replicas, we can put them in timestamp order, and now if we say that the operations get executed exactly in increasing timestamp order, then the first move is fine because by this point nothing bad has happened yet, but then by the time we get to the second move operation, that move is unsafe. And So if we want to ensure that there are no cycles in our tree, then we have to skip that operation and just pretend it didn't happen. But this is now tricky because Replica2 has already executed this operation.
[44:28] So Replica2 will somehow have to undo this operation that it has already performed in order to ensure that it converges again with replica1. So the basic principle we're going to use is to take these sequences of operations where we have a timestamp on each operation and we're going to merge them such that we get a single sequence of operations in timestamp order. And so, for example, here on replica 1 we have timestamps 1, 3, 4, and 5. On replica 2 we have timestamps 2, 6, 7, and 8. And so if we imagine you're replica1 and you've executed 1, 3, 4, 5, and now you receive the operation with timestamp 2 from replica2, well what you need to do is you need to take this timestamp2 operation and insert it somewhere earlier into your sequence of operations that you've executed.
[45:24] But, you know, time has already moved on by this point and we've already executed the operations with timestamps 3, 4, and 5, So how do we handle this? Well, we can just take a fairly simple approach, which is to undo and redo operations. And this is exactly what we do in our algorithm. And so that is, if you are replica 1 here that wants to apply operation 2 with timestamp 2, you've got your log of operations that you've performed with timestamps 1, 3, 4, and 5. And we're first going to undo operations until we are back at the point, until we've undone all of the operations with timestamps greater than the new operation.
[46:03] So that is, we're going to undo op5, we're going to undo the move, we're going to undo op3. Now we're only left with op1, so now we can apply op2 on top of that with timestamp2, and now we redo the three operations that we've undone. And the end result is now that we have this log of operations where op2 has been inserted in the right place. And if we keep performing this kind of logic on each of the replicas, then this will actually ensure that all of the replicas end up with the same operations executed in the same order. It just means we have a bit of extra work to do, because if some operation needs to be inserted somewhere quite early on in the order, we have to do a lot of undo's to do that one operation, and then a lot of redo's to replay all of the operations that have happened.
[46:50] So you might wonder what is the performance of this? Isn't the performance going to be terrible? Well, we did some performance experiments and it was okay. Of course, there was certainly a cost in all of performing all of these undo's and redo's. And you can see that here.
[47:05] So we set up a system with three replicas on three different continents. And so there was a delay of over 100 milliseconds around trip time between these different replicas. And we then just started generating move operations at a high rate. And the higher the rate of move operations in the system, the more undo's and redo's need to be done. And so the slower it becomes to execute each individual operation.
[47:34] And we were able to get this system to at least 600 operations per second, which is not massive on big data scales, but on the other hand, for a single user updating their file system, I think a single user is not going to perform more than 600 move operations per second on their own local file systems. So for a lot of collaboration software, actually, this performance is perfectly fine and we don't have to worry too much. So I will explain a little bit more about how this algorithm works in terms of the data structures that we use. And so we have to first find a way of describing the move operation that we want to perform. And we're going to say move operation is a structure here with the following fields.
[48:21] First of all, like I said earlier, a move operation has a globally unique timestamp, such as a Lamport timestamp. Then a move operation has got to have a child, which is the node in the tree that we are moving, and the parent which is the new location, the new parent of that child in the tree. Now we're not going to note in this operation here what the old parent of this child was, So this is like, we can call this a stealing move, that is, we're simply going to take a child from wherever it currently is in the tree and move it to be a parent of this new parent here. And further we can have a bit of metadata associated with the operation. So for example in a file system you have one file within a directory and a file has a name within a directory and so that can be the metadata that is associated with a particular file in a particular directory.
[49:15] And then we have to construct this log of operations in order to perform the undo and redo. And in order to perform this undo, we need to associate a little bit of extra data with each entry in the log. So each entry here has several fields that come straight from the move operation. So the timestamp is from the move operation as before. The new parent, the new metadata, and the child, these come directly from the move operations.
[49:41] So in addition to these, we now also record the old parent and the old metadata. So that is what was the parent of child before we performed this move and what was its old file name, for example, if we're using the metadata for file names. And so Now this allows us to perform the undo because what we do in order to undo this operation is just to set child's parent and metadata back to its old parent and its old metadata. And given this log of log entries of move operations, we can now construct the current state of the tree. The tree is just a set of triples, of parent metadata and child triples, whenever child is a parent or parent in the tree.
[50:31] And so we can now define this ancestor relation here, so we say that A is ancestor of B if either A is a direct parent of tree, as in A, M, B triple exists in the tree, or if there exists some tree node C such that A is a parent of C and C is an ancestor of B. So this defines just a transitive closure over the parent-child relationship. And So given this definition of ancestry now, we can define what it makes for a move operation to be safe. So for a move operation, if we want to perform this move, we look at the child that is being moved and the parent, which is the destination of where that child is being moved to. And if the child is currently already an ancestor of the destination where it's being moved to, then this would be unsafe, we would create a cycle, so we do nothing.
[51:30] Similarly, if the child and the parent are in fact the same node, that would also be unsafe. So again, we do nothing. But in all other cases, it's fine. So we can update the tree by removing from the tree the node, the existing location of C. That is, this is exactly what I said, we steal the child node from wherever it currently is in the tree for any P, M here, and then we add this new parent metadata child relationship to the tree, and this defines our updated state of the tree.
[52:03] And so given this function here for performing a move operation, and given the undo-redo procedure that I explained previously, we now have an algorithm for safely performing move operations on a tree CRDT. And we proved several theorems about this algorithm. So in particular, we showed that this algorithm does indeed preserve the tree structure. So to preserve a tree structure, what we need is that every node has a unique parent and that the tree contains no cycles. That is the definition of a tree.
[52:35] And so we prove that these things hold. So for any sequence of operations that are executed, the data structure remains a tree. And then obviously we have to prove that it's a CRDT, which means that it converges. So given any two sequences of operations on this tree, if one sequence of operation is a permutation of the other sequence of operations, then applying those sequences of operations leads to the same state. In other words, operations are commutative.
[53:04] So this algorithm works and we have implemented and proved these properties about performing move operations on trees. OK, moving on from moving subtrees in CRDTs, let's talk about efficiency and making CRDTs more efficient. So one particular problem that occurs with a lot of CRDT algorithms, especially the text editing CRDTs, is the overhead that the CRDT metadata incurs. So if you think about a text document, what we said is we give every single character in the text document a unique identifier. And so what you end up with is each character in English text would be one byte for the ASCII character, the Unicode UTF-8 encoding of that character.
[53:55] But then you have additional metadata, so you might have, for example, some path through a tree, which might take a couple of dozen bytes to encode, then you might have some identifier for the node that inserted a particular character, which, if that is a UID, will be at least another 16 bytes. If you encode it in hexadecimal, it'll be like 36 bytes or something like that. And so very quickly you end up with this really disproportionate situation where you have one byte for the actual data that you want and something like 100 bytes of metadata or even more than that just for the CRDT metadata in order to make the CRDT work. And that's not a particularly great situation to be in. So what I would like to talk about now is some work that we've been doing as part of the AutoMerge project, which is a CRDT implementation that I work on, in order to bring down that metadata overhead.
[54:49] And we've had some really good successful results by using some carefully designed algorithms and data structures. So let me start with the results of what we have here. So the old version of AutoMerge uses a JSON encoding in order to encode all of the metadata and send it over the network and write it to disk. And this JSON encoding is incredibly verbose. So what we've done here is as an example to take a particular case study, which is the editing history of the latex source of a paper.
[55:25] So in order to generate this editing history, a colleague and I actually wrote this paper using a homegrown text editor, and so we were able to capture every single keystroke that went into the writing of this paper. So the final file is about 100 kilobytes of plain text, so this is just ASCII text containing LaTeX markup, and we captured over 300,000 operations, that is the 300,000 individual keystrokes of the editing history going into that. And so that includes all of the individual characters that were typed in order to create this document, all of the characters that were deleted for any stuff that was added or removed, any typos that were corrected and so on. And we also recorded all of the cursor movements. So every single time the cursor was moved from one position in the document to a different one that was also recorded.
[56:18] And so if you take this history of 300,000 or so changes and you encode it in the current AutoMerge data format, this JSON data format, the file size is about 150 megabytes. So that is, we're using almost 500 bytes per change here, which is pretty terrible. And so you can see now in the new binary data format that we've designed in order to bring down that metadata overhead, we've actually made some really big improvements on that. And so what we have done is first of all designed a binary format that encodes the full editing history, including every single insertion and deletion, and including when that happened, and encodes that in a compact form. So it doesn't lose any information at all.
[57:13] This is a completely lossless encoding. And what we've managed to do is reduce this file size by over a factor of 200, down to about 700 kilobytes, containing exactly the same information just encoded differently. So I should emphasize if you would just take the JSON history and encode it using one of the binary equivalents for JSON, you would not get anywhere near this. So if you use protocol buffers or message pack or something like that, you know, you might reduce it by a factor of two or three at the most. But in order to get this factor of 200 improvement, you have to design a data format that is specific to CRDTs and specific to the particular editing patterns that tend to occur in these sorts of applications.
[58:03] Now you can see that by gzipping these files, you can reduce it. So you can reduce the 150 megabytes down to about 6 megabytes, but the 6 megabytes is still huge compared with the 700 kilobytes that we were achieved uncompressed in our binary format and that still does gzip down further to about 300 kilobytes without losing any information at all. So note that by this point here now we've been able to encode a change history. The change history consists of over 300,000 changes. We've encoded that in 300 kilobytes.
[58:41] So we're now actually using less than one byte per change in order to encode the full change history. So we can look back at any past version of this document. We can diff any two versions at any point in time. And all of that data is still there encoded in this very compact form, which I find very exciting. Now, of course, we can still compress it further by throwing away some of the history.
[59:06] So one of the first things we might choose to throw away is the cursor movements, because to be honest, who cares about exactly where the cursor was at which point in the editing history? So we can save about 22% by throwing away the cursor movements, and we can go down to about 230 kilobytes by also throwing away all of the change history for the text. So by this point here, where we're at 230 kilobytes, we still have all of the CRDT metadata, including all of the tombstones. So this means at this point we're still able to perform arbitrary merges between different replicas of this data, and yet we're down to only about twice the raw data set size. So about 100 kilobytes is the raw text document with no CRDT metadata at all.
[59:59] That's just the final state of the text document. And all of the CRDT metadata is adding only a bit more than 100% overhead to this without gzip. If we actually apply gzip, then the file sizes are almost the same. So the file containing the CRDT metadata gzipped is about the same size as the plain text file, non-gzipped. Now, we can get even further down by removing the tombstones.
[01:00:27] The tombstones are the markers for deleted characters in the document. And if we do so, then we save, what is that, another 70 kilobytes or something like that for the tombstones. Whether you want to do that or not depends because if you throw away the tombstones it means you can no longer merge with edits that happened concurrently. So any edits that happened before that tombstone removal or concurrently with the tombstone removal now can't be merged anymore. But if you do remove the tombstones, then just the raw CRDT metadata, that is just the unique identifiers for the characters, at that point, takes only 50 kilobytes.
[01:01:13] So it's only about 48 percent overhead on top of the raw text document using our data format. Very exciting, I find this. So, I should tell you a bit about how this actually works and how we were able to achieve such good compression. The basic idea is, as I said, we want to still maintain all of the change history as much as possible, and so we're going to store the full set of all of the insertion operations, all of the deletion operations, and all of the cursor movements if necessary. Each operation is given a unique ID, which is just a Lamport timestamp.
[01:01:48] Lamport timestamp is just a pair of a counter and the ID of the node that generated it, or we call it actor ID in AutoMerge. And the algorithm that we use for text editing is based on CRDT, so it's based on whenever you perform an insertion, you reference the identifier, that is the Lamport timestamp, of the character that is immediately before the inserted position. And We keep this set of all of the operations and store it in document order. That is, in the order in which the characters appear in the document. And what we get is this kind of table here.
[01:02:28] So you can think of this almost like a table in a relational database where each row is one operation. You can see this is typing H-E-L-L-L-O, so it's hello with three L's, and then somebody came along and deleted one of the three L's so that the final result is just hello. And so the way this is encoded is each row here is an insertion operation into this document. And you can see that each row has an operation ID, which consists of a Lamport timestamp. So it consists of a counter that just increments and the actor ID.
[01:03:06] I just wrote A here, but in reality, our actor IDs are actually UUIDs, so they're actually like 16 bytes or even 32 bytes long. And so we have this unique identifier for each character that was inserted. Here we have the reference element that is the predecessor character, the ID of the predecessor character. And so this is just saying here, the E comes after the H. So the H has ID 1A, the E has ID 2A and comes after 1A.
[01:03:37] And so we can then encode all of the operations this way. Furthermore, if an operation is deleted, we do that by having these additional deletion columns that record the operation ID of the operation that deleted this particular character. In fact, it could happen that there are multiple deletion operations for the same character if multiple users deleted the same character. And now we can take this table and encode it in a very compact form, using some fairly simple tricks. So we're going to first of all encode each column of this table individually.
[01:04:15] And so, for example, we can look at this first column, the counter column, which consists of the numbers 1, 2, 3, 4, 5, 6. How do we encode that efficiently? Well, first of all, we delta-encode it, which means for each number we calculate the difference between that number and its predecessor. So 1, 2, 3, 4, 5, 6, the difference between any two subsequent numbers is 1, so this delta encodes to 1, 1, 1, 1, 1, 1, which we can then run length encode to 6 times the number 1. And then we can use a variable length binary encoding for integers to encode 6,1 in 2 bytes.
[01:04:50] So this variable length encoding just ensures that small numbers get represented in 1 byte, slightly bigger numbers get represented in 2 bytes, bigger numbers still get represented in 3 bytes, and so on. And so here, we've now encoded this entire counter column here in two bytes. Similarly, we can proceed to the actor ID column. So for this actor column, well, first of all, we make a lookup table. We don't need to keep repeating the UIDs.
[01:05:15] So we just map A to 0, B to 1, for example. And so now this translates into 0, 0, 0, 0, 0, 0. We can run length encode before, as before, and then use the variable length binary encoding again. And again, we've encoded this entire column in two bytes. Moving on, let's look at this actual text column here.
[01:05:37] What I've done actually is represented the text in two separate columns. So here, this is the UTF-8 byte string for that particular character that is being inserted, and in this length column here we have the number of bytes for that particular Unicode character. So if what we're doing is just typing English text, then every single character is going to be within the one byte encoding of UTF-8. And so this column here is just going to consist of lots of ones. It's just all of the characters have length one.
[01:06:09] So that encodes again, super compactly. And now that we've encoded the length columns for the actual characters, we can just concatenate them, because we can still then later split out where the boundaries are between the characters by just looking at the length column. And so we're simply going to concatenate all of the bytes in the UTF-8 column, hello, with triple L, which is encoded in six bytes. So what we've done here, just by encoding each column individually, we can represent each column extremely compactly in just a couple of bytes. And it turns out then that Because of the editing patterns that you tend to get with text, you tend to get these increasing sequences because people just tend to type letters in sequential order, which means you get these increasing sequences of counters that always increase by one each time.
[01:07:03] So the whole data has a structure which compresses really nicely, and we can take advantage of our knowledge of this structure to encode it very compactly. So this gives us the set of operations. Now, in order to have the full change history we need some a little bit of extra metadata that is for each set of changes that were made at some point in time we need the timestamp of when that was occurred and the range of operation IDs and with this extra metadata we can now reconstruct exactly what the document looked like at every past point in time, even while using a very small amount of additional storage space. So if people tell you that with CRDTs, you know, you have to put effort into tombstone collection because tombstones waste a lot of space, you know, just think back at this table here. The tombstones actually just cost us 48% of the space.
[01:08:00] It's a fairly small amount of costs to maintain here, whereas choosing the difference between a simple JSON format and an optimized binary format made a factor of 200 difference. So I think this means we have the ability, if we want, to store the full change history and to allow this to do, you know, diffs and interesting visualizations of the change history for our documents with very low cost and we don't actually have to throw away all of this extra metadata that we have. So this brings me to the end. As I said at the beginning, I think CRDTs are very easy to implement badly. So they're very easy to implement in a way that is inefficient and that has all sorts of weird behaviors like the interleaving anomaly that we saw, and there are all these problems of how do we do a move operation both on lists and on trees in a way that is safe and that is reasonably intuitive to our users.
[01:09:03] But, as you've seen from this talk, we have made a lot of progress in just the last few years on these topics. And I think it's going in the direction where really CRDTs are starting to become something that can be really the foundation for a very large set of applications. So we can start to build like actual real practical user software, not just research prototypes on top of these basic principles. So I hope you will join me in finding this very interesting and exploring where this ends up going in the future. So I will end with some references.
[01:09:41] That is here first of all the list of the various text editing CRDTs that I talked about in the context of interleaving, and finally the publications that I've worked on for the various algorithms and anomalies and problems that we talked about in the context of this talk. All of the slides are online so you can find them there. Thank you very much for listening. I hope you enjoyed this talk, and I'll see you again soon.