Written: 2017-10-18
Author: Bo Waggoner
Emails, as you may know, are stored and communicated as files in the MIME format. The actual MIME format, as you may or may not know, is a huge mess. In particular, the type of data in the file is described by its "Content-Type" (one of the headers, along with From, To, CC, etc.), which can be plain text, HTML, multipart, and so on. But multipart messages can contain other content types within them, leading to complex or unpredictable structures.
Given a particular structure, like a multipart with three subparts the second of which is another multipart, what do the parts represent and which one(s) are the actual message? Not obvious. It turns out that each email client or tool can just do it whatever way seems like a good idea at the time (which I'll grant you is better than the alternative, but only barely).
When creating an email client for fun, I realized that reading RFCs wasn't helping me understand what the different parts of the emails actually were being used for, and there was a surprising lack of other info available.
So I wrote a simple Python tool, mimelyze, that scans through an email dataset and analyzes the structures of the files. It also lets you print some random emails with a given structure, to see how that structure is actually being used. I ran it on a medium-sized email dataset and share some results below.
Actually, even finding email datasets seems to be a big challenge. The Enron dataset turns out not to consist of MIME files, i.e. no Content-Type information. Clinton's recently-leaked emails seem to mostly not have Content-Type info, except for a small number which are all plain text. (Neither of these is probably representative of most emails people send and receive anyways.) And those are pretty much the only email data sets I could find.
So, I apologize, but the only data I have to present for you here is my own. (If this post were to ever get traction, perhaps I could ask readers to run mimelyze on their own emails and submit frequency data to compile a more useful resource.)
mimelyze prints out the structures it encounters, from most frequent to least. Here are the results, with slightly smoothed numbers, on a dataset of all my own emails, about 107,000 of them compiled from about 2005 (age 16) through 2017.
Notes: The program makes some simplifications. In particular, if an email has multiple attachments or images in a row, it shortens them to "attachment(s)" or "image(s)", so that these are counted as the same structure. Each result lists the number of emails with that structure, followed by a description of the structure.
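To make the simplification concrete, here is a minimal sketch of how a structure string like the ones below could be computed with Python's standard email package. This is not mimelyze's actual code: the function names and the rule for what counts as an "attachment" (an attachment disposition, or any non-text, non-image leaf) are my assumptions for illustration.

```python
from collections import Counter
from email.message import EmailMessage, Message

COLLAPSIBLE = ("attachment(s)", "image(s)")


def labels(part: Message) -> list:
    """Flatten one MIME part and its children into a list of labels."""
    if part.is_multipart():
        found = [part.get_content_type()]
        for child in part.get_payload():
            found.extend(labels(child))
        return found
    # Assumption: these three rules decide what is an image or an attachment.
    if part.get_content_maintype() == "image":
        return ["image(s)"]
    if part.get_content_disposition() == "attachment":
        return ["attachment(s)"]
    if part.get_content_maintype() != "text":
        return ["attachment(s)"]
    return [part.get_content_type()]


def structure(msg: Message) -> str:
    """Join the labels, collapsing runs of attachments or images into one."""
    collapsed = []
    for label in labels(msg):
        if label in COLLAPSIBLE and collapsed and collapsed[-1] == label:
            continue
        collapsed.append(label)
    return " ".join(collapsed)


# A made-up message: plain + HTML body with two attachments in a row.
msg = EmailMessage()
msg.set_content("plain body")
msg.add_alternative("<p>html body</p>", subtype="html")
msg.add_attachment(b"one", maintype="application",
                   subtype="octet-stream", filename="a.bin")
msg.add_attachment(b"two", maintype="application",
                   subtype="octet-stream", filename="b.bin")

print(structure(msg))

# Counting structures over a whole dataset, most frequent first.
counts = Counter(structure(m) for m in [msg])
for name, count in counts.most_common():
    print(count, name)
```

The two attachments collapse into a single "attachment(s)" label, so this message is counted under the same structure as one with a single attachment.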
1. 58700 multipart/alternative text/plain text/html
Over half the emails had this structure. "multipart/alternative" means that the sub-parts are alternatives to each other; you can read any of them. Read this as "the sender prefers that you read the HTML version of the email (the last sub-part), but if you can't, we've provided a text version as an alternative."
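As a rough sketch of that rule, assuming a client that knows which content types it can display: walk the alternatives in order and keep the last one you can render. The function name and the can_render parameter are hypothetical, not part of any library.

```python
from email.message import EmailMessage, Message


def pick_alternative(container: Message, can_render=("text/plain",)):
    """Return the last sub-part this client can render, or None."""
    if container.get_content_type() != "multipart/alternative":
        raise ValueError("expected a multipart/alternative part")
    chosen = None
    for part in container.get_payload():
        if part.get_content_type() in can_render:
            chosen = part  # later alternatives are the sender's preference
    return chosen


# A made-up message with structure #1.
msg = EmailMessage()
msg.set_content("plain body")
msg.add_alternative("<p>html body</p>", subtype="html")

text_only = pick_alternative(msg)
html_capable = pick_alternative(msg, can_render=("text/plain", "text/html"))
print(text_only.get_content_type(), html_capable.get_content_type())
```

A text-only client ends up with the text/plain part, while one that can render HTML gets the text/html part. The standard library's EmailMessage.get_body() does a more thorough version of this search with a preference list.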
2. 19900 text/plain
My favorite.
3. 7900 text/html
These are a bit annoying if you're trying to write a text-based email client -- they're emails in HTML format only. These tended to be big annoying emails from institutions like American Airlines, the New York Times, and my alma mater. Ideally, these should have included a text/plain option as well, i.e. been like format #1.
4. 7700 multipart/mixed multipart/alternative text/plain text/html text/plain
"multipart/mixed" means a sequence of components, each of which is needed. You're supposed to read the first sub-part, then the second, and so on if there are more. In this case, there are two subparts, but the first one is a multipart/alternative: it's a standard-looking email we've seen in #1 with an HTML part, or a plain-text part as a fallback alternative.
To understand what the second subpart (the last text/plain component) was doing, I had to look at some random emails with this structure. The first example was from a mailing list where the message that the person had sent to the list was the multipart/alternative, and the mailing list software had added a boilerplate footer message to the bottom. This footer was the text/plain part. This was the common pattern, which makes sense because most or all of the mailing lists at my grad school used the same software.
5. 2700 multipart/mixed multipart/alternative text/plain text/html attachment(s)
This is a standard email with attachments. Here multipart/mixed is used to append attachments to a text email message.
6. 1300 multipart/related multipart/alternative text/plain text/html image(s)
This is a case where the HTML email contains embedded images inline with the text, and these images are attached (as opposed to containing images by linking to them from elsewhere on the web). "multipart/related" is used when all of the pieces are necessary to view that part of the email and are related to each other, in contrast to multipart/mixed.
Here, the HTML option of the email will embed the image with an <img> tag, as you'd expect.
In the plain text version, the space where the image would be tends to be replaced by a line
[cid:image001.jpg@XXXXXXXX.XXXXXXXX]
Here cid stands for "Content-ID". In the MIME file, we should see attached an image named image001.jpg with Content-ID "image001.jpg@XXXXXXXX.XXXXXXXX".
But sometimes in the plain text version, instead of this helpful line, we just see:
[image: Inline image 1]
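Resolving a cid: reference back to its image can be sketched as follows, assuming the Content-ID header holds the same identifier wrapped in angle brackets. The helper name and the example.invalid identifier are made up, and the sketch ignores percent-encoding in the reference.

```python
from email.message import EmailMessage, Message
from typing import Optional


def find_by_cid(msg: Message, reference: str) -> Optional[Message]:
    """Return the part whose Content-ID matches a 'cid:' reference, or None."""
    wanted = reference[4:] if reference.lower().startswith("cid:") else reference
    for part in msg.walk():
        content_id = str(part.get("Content-ID", ""))
        # The header value is normally written as <id>, so drop the brackets.
        if content_id.strip().strip("<>") == wanted:
            return part
    return None


# A made-up HTML-only message with one embedded image (structure #7).
msg = EmailMessage()
msg.set_content('<p>Logo: <img src="cid:logo@example.invalid"></p>',
                subtype="html")
msg.add_related(b"not really image bytes", maintype="image",
                subtype="png", cid="<logo@example.invalid>")

image = find_by_cid(msg, "cid:logo@example.invalid")
print(image.get_content_type())
```

A client that renders the HTML would do this lookup for every cid: URL in an <img> tag; a text client could use it to name the image that a [cid:...] placeholder stands for.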
7. 1200 multipart/related text/html image(s)
Same as above, but HTML-only. These were mostly from my grad school's spam filter telling me about quarantined messages. They added an image logo at the bottom of the email.
8. 880 multipart/mixed multipart/alternative text/plain text/html
These seemed to be mostly automated emails from Amazon or certain mailing lists. There seems no need to encase the email in the outer multipart/mixed.
9. 680 multipart/mixed text/plain
Ditto.
10. 670 multipart/mixed multipart/alternative text/plain text/html attachment(s) text/plain
This was the mailing list software striking again. If someone sent a message to the list with attachments (i.e. structure #5), it added its text/plain footer to the very end. Note that an email client shouldn't stop looking for text to display as part of the message just because it reaches some attachments.
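Here is a minimal sketch of that last point for a text-only client: keep walking every sub-part of a multipart/mixed, even after an attachment, and collect whatever plain text shows up. The function name is hypothetical, and choosing only the text/plain alternative is an assumption about the client.

```python
from email.message import EmailMessage


def displayable_text(part: EmailMessage) -> list:
    """Collect the plain-text pieces of a message in reading order."""
    ctype = part.get_content_type()
    if ctype == "multipart/alternative":
        # Assumption: this client only displays the text/plain alternative.
        for child in part.get_payload():
            if child.get_content_type() == "text/plain":
                return displayable_text(child)
        return []
    if part.is_multipart():
        pieces = []
        for child in part.get_payload():
            # Do not stop at attachments: a footer may follow them.
            pieces.extend(displayable_text(child))
        return pieces
    if ctype == "text/plain" and part.get_content_disposition() != "attachment":
        return [part.get_content()]
    return []


# A made-up message with structure #10.
msg = EmailMessage()
msg.set_content("plain body")
msg.add_alternative("<p>html body</p>", subtype="html")
msg.add_attachment(b"data", maintype="application",
                   subtype="octet-stream", filename="a.bin")
footer = EmailMessage()
footer.set_content("list footer")
msg.attach(footer)  # what the mailing list software does, roughly

texts = [text.strip() for text in displayable_text(msg)]
print(texts)
```

Both the body and the footer come back, in order, even though an attachment sits between them.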
11. 640 multipart/mixed text/plain attachment(s)
My second-favorite.
12. 550 multipart/related text/html
These were mostly from my college's athletic program. There seems no need for the outer multipart/related as no images are attached.
13. 450 multipart/mixed multipart/related multipart/alternative text/plain text/html image(s) text/plain
This is another mailing list pattern. When someone sends a message with structure #6 to the list, wrap it in multipart/mixed and append a plaintext footer.
14. 400 multipart/mixed text/html
Some people aren't looking for anything logical. They can't be bought, bullied, reasoned, or negotiated with.
15. 300 multipart/mixed multipart/alternative text/plain text/html text/plain text/plain
These were replies to the mailing lists mentioned above. I have no idea why, but the reply the person sent would apparently include the original plain text footer as an attached part, and the mailing list would then append its footer again.
16. 210 multipart/mixed multipart/related multipart/alternative text/plain text/html image(s) text/plain text/plain
This. This right here is what I'm talking about.
17. 200 multipart/alternative text/plain
"You can have any format you like as long as it's plain." These seemed to mostly come from Amazon.
18. 180 multipart/alternative text/plain multipart/related text/html image(s)
An interesting take (compare to #6). This is sort of saying the images aren't as vital -- if you can view HTML, you get the embedded images, but if you can't and only have plain text, then never mind.
19. 140 multipart/alternative text/html
Apparently email formatting is like politics: it's all about offering the illusion of choice.
20. 140 multipart/mixed text/plain attachment(s) text/plain
Seems to be messages with attachments forwarded to a mailing list, which adds its plain text footer.
21. 130 multipart/mixed multipart/signed text/plain attachment(s) text/plain
One of my mailing list correspondents digitally signs emails. (A digital signature is computed from a cryptographic hash of the email content together with the sender's private key, so it [presumably] can't be faked and you know the message really came from the sender.) The text/plain part is the email message, the attachment is the digital signature, and the mailing list wraps it in multipart/mixed and adds a footer.
22. 130 multipart/mixed multipart/related multipart/alternative text/plain text/html image(s) attachment(s)
If you can figure out what's going on here, you've passed Bo's MIME academy with flying colors. So to speak.
The other kind of Content-Type that occasionally showed up but we haven't discussed is message/rfc822. This marks a part that is itself a complete email message, headers and all, embedded inside a larger MIME file. (RFC 822 is the older message format standard that MIME extends.) Anyway, emails with this kind of message in them, in my dataset, tended to be from mailing list software to me, as the moderator of the list, asking me to approve posts to the list. I'm tempted to generalize this and say that most rfc822 messages are automated emails of some type, but I don't know.
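For what it's worth, Python's email package exposes a message/rfc822 part as a container holding one nested message object, so pulling out the embedded email can be sketched like this. The helper name and the example subjects are made up.

```python
from email.message import EmailMessage


def embedded_messages(msg):
    """Yield each complete message carried inside a message/rfc822 part."""
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload is a one-element list holding the inner message.
            yield part.get_payload(0)


# A made-up moderation request that carries the held post inside it.
inner = EmailMessage()
inner["Subject"] = "Post awaiting approval"
inner.set_content("original post")

outer = EmailMessage()
outer["Subject"] = "Moderator request"
outer.set_content("Please approve the attached post.")
outer.add_attachment(inner)  # stored as a message/rfc822 part

found = list(embedded_messages(outer))
print([m["Subject"] for m in found])
```

The inner message keeps its own headers and its own structure, so the same structure analysis can be applied to it recursively.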
The above structures represented a bit over 98% of my emails, but there's a fairly long tail, with over 100 different kinds of structures. Of course, it reveals to my shame that I don't send or receive encrypted emails, but we all have areas to improve.
I would guess that the above is quite representative in the sense that, if you understand the structures above, most email structures you see will not surprise you too much. (Though I have nothing at all on which to base that claim.) However, the actual frequencies that an individual person will see probably vary greatly, mostly depending on the kinds of mailing lists and automated email sources they're subscribed to, which can end up dominating the absolute numbers.