Written: 2017-10-24
Author: Bo Waggoner
Given a set of emails (in MIME format), the "jwz" algorithm organizes them into threads (conversations). It's super awesome, but it doesn't address two issues:
Either of these alone turns out to be pretty easy with minor tweaks to jwz's algorithm. Unfortunately, doing both together requires giving up on correctness or adding complexity. I'll explain why and outline my best-effort solution.
First thing to know -- every email comes with a globally unique identifier, a message-id. It also has an "in-reply-to" header saying which message-id is its "parent", if any; and a list of "references", message-ids that (probably) also belong to the same thread. However, your database might not have some or all of these emails. This makes threading pretty tricky, which is why we need jwz's algorithm in the first place.
Note 1: I don't even bother with the subject-line portion of the algorithm, as I like the results without it; but that's personal preference and shouldn't affect the below too much.) Note 2: As jwz discusses, we can assume (and I've found this to work fine so far) that a list of references in an email is an ordered list of "parent of parent of ...".
Dynamic updates. Let's say you want to handle dynamic updates, but you don't mind loading your entire email database initially. The idea is to view jwz's algorithm in two stages: First, it computes threads, then it "prunes" them. (The pruning step merges down threads by cutting messages that you know exist due to the email headers, but which you don't actually have.)
All you need to do here is store the data structure at the end of stage 1, before pruning. (a) When a new email arrives, run jwz's algorithm from Step 1 onward, but only on the new email (keeping all the other data you already have). However, for any messages touched in the course of Step 1 (i.e. the new email's parent and references), run them through Step 1 as well, along with any they touch, and so on. (b) When an email is deleted, remove it but keep its "container" around, just make it empty. That's all. In either case (a) or (b), after doing this update, store the end-of-stage-1 data for future use, then re-run the pruning step on it to produce a pruned list of threads.
Partial-database threads. Let's say you have a couple million archived emails and a search query comes back with a list of twenty. You want to get the entire thread for each of these twenty emails without loading the entire archive. The problem is that, while these emails alone will tell you who their "parents" are, you have no way to know who their "children" are (the emails that reply to them).
My solution here is to store just a little bit extra metadata. For each email A, we add a "referenced-by" header (inside the email file itself, or however you'd like to store it) which lists the message-ids of every email in our database that is a child of A or references A in some way. (I'm also assuming we store a list of (message-id, file name) pairs so we can look up the email file for any message id, if we have it.) Now, given an email A (one of our small list), we can do a local search for all messages that could possibly be in its thread: Use the headers of A to collect all the files that are either parents of A, references of A, or in the new "referenced-by" header of A. Now recursively do the same with this larger list of emails we have, expanding our "local search" until we've collected all possible relevant email files. Now call jwz's algorithm on just these emails.
Notice that, given a set of emails, we can build this extra metadata in advance with just a single pass through the archive: For each email A, go to its parent and every email it references and add A to their "referenced-by" header. But if a new email arrives, life isn't necessarily so simple; that's what we'll get to next.
Let's say we're implementing the partial-database approach above, and a new email "A" arrives. We update its parent and all its references to include "A" in their referenced-by header. Then, we apply the dynamic update algorithm described above. Good enough?
Unfortunately, no. If A is actually "new" in that it was just sent, then we're fine. But if we're importing some email that's older than others in our archive, we're in trouble. To figure out whether there are emails in the database that are replies to A, we have to search the entire archive (or at least all files with a newer date than A's).
So I see three options:
The second approach is perfectly fine if you don't mind managing the database, but here I'm going to take the third. There are two key points: (1) as above, if a newly arrived email A really is new, then there's no problem (and this is by far the most common case); (2) oversights will get quickly/easily corrected, e.g. by loading all emails once.
We're going to create a dynamic data structure. The resources we have available are, first, a list of emails. For our purposes, an email contains:
msgid (a globally unique message-id) filename (name of email file if we have it, or None if we don't) parent (msgid that this email is replying to, or None) refs (a list of msgids probably in the same thread, or None) ref-by (a list of msgids that reference this one, or None)
If we have an email file, then we can get all this information from it. Notice we may know that a particular email (msgid) exists even though we don't have it in a file. A thread, for simplicity here, is just a list of emails that (hopefully) make up a conversation. The reply structure is also possibly important, and you can modify the algorithm very easily to get that if you want.
I also assume we have a list of pairs (message id, file name), loaded into a hashmap, so that given a message id, we can immediately tell whether we have the corresponding file and if so, get the file.
Our data structure must support the following operations:
// Get list of all threads containing the files in file_list get_threads(file_list) Input: list of email files Output: list of threads // Update state to include new files in file_list add_files(file_list) Input: list of new email files Output: none // Update state to reflect that file_list is no longer available rm_files(file_list) Input: list of old email files Output: none
get_threads() should use lazy memoization: It should not load threads until requested, but once loaded, it should cache them so that it can quickly answer future queries. Here I'm not requiring add_files() or rm_files() to return any newly-updated threads, just that subsequent calls to get_threads() must be correct. For instance, it would make sense after adding a new file to immediately call get_threads(new_file).
(All of this is roughly based on jwz's algorithm which is where all the credit should go.)
First, we have the Container object which stores:
msgid filename (name of email file if we have it, or none if we don't) parent (pointer to Container) children (list of pointers to Containers)
Next, we have the main hashmap, id_to_container which maps message-ids to their containers. Note that the Containers themselves will form a forest: a collection of trees. The root of each tree is a Container with no parent. Each tree is a separate thread (roughly -- the pruning operation can sometimes merge them).
It will turn out to be useful to implement two helper functions:
// Create and link together Containers in forest, // adding them to id_to_container build_containers(file_list) Input: list of email files Output: none // Prune out containers whose files are missing, // and rearrange remainders to make sense prune(Container list) Input: list of Container roots (separate threads) Output: list of Container roots such that all sub-Containers have email files present
Next, implementations.
First, get the message-id of each file. Then simply look up its container in id_to_container and, if present, set its filename to None. The idea is that the tree structure is still there and useful; indeed, even if we never had the file, this Container might have been created because its message-id was referenced. So don't mess with it, just indicate that we don't have the actual file any more so it will get filtered by prune().
First, we call add_files(new_files) on any files in the list whose message-ids aren't in id_to_container. That's where the thread structures will actually get built, by creating Containers and linking them together in parent-child relationships.
Next, we get a list of roots of all threads containing files in file_list. This is pretty easy, we just iterate over file_list, get their message-ids and then their containers via id_to_container, and go up the Container tree to the root.
Now, since we want to keep our data intact, we create a complete copy of these Container trees, then call prune() on the copy's list of roots to produce a separate, pruned version of the thread forest for thse files. (My implementation does something a bit more efficient, storing both pruned and un-pruned information at each node instead of making a copy, but this is easier to describe.)
prune() returns a list of roots, each of which is a separate thread tree. We can now walk each tree to produce a list of files to return for that thead, for instance.
This is actually where the main action happens. First, given the file list, we build up a bigger list by recursively searching all parents, references, and "children" (referenced-by) of the files in our initial list. Then, we call "stage 1" of jwz's threading algorithm, build_containers(big_file_list), to build their containers and place them in id_to_container.
Also, while we have all these files loaded, for each file A we touch, we ensure that all of its references are up-to-date in that they include A in their "referenced-by" headers.
For each file in the list, we take the following steps:
Again this follows jwz's instructions. It's a slightly tricky recursive algorithm which treats the top-level call slightly differently, because at the top level we just have a list of roots, while at lower levels we will be calling with a list of Containers which are all children of the same node.
Again, the above algorithm can be incorrect in the case where we call add_files() on a file A that has no parent, references, or referenced-by information in its headers, but our database does contain emails that are replies to A. In one sense this is a pretty normal case, as by default emails do not come with a magic foresight that knows which message-ids will eventually respond to them. However, it is not very normal in that a reply is unlikely to arrive in our database before the original message A.
To make the algorithm correct, we could e.g. in add_files() first go through the database searching for emails that reference A and add them to A's "referenced-by" list. But this would be slow and load the whole archive, so instead, the above algorithm just runs this check on the files it does load.
The good news is that the algorithm always makes progress toward correctness, as each time we touch a file, we make sure every file it references "knows this" by having its referenced-by list updated. In particular, pragmatically, if one imports a new email dataset that contains old files, one can ensure correctness by a single pass through the entire archive, which lets the archive update those "new old files" with referenced-by information. The algorithm will then be correct at least until the next time an "old" email is imported.
For additional robustness, one can add this referenced-by metadata check at other points in the email ecosystem.