If MIMEs could talk: Email structures in the wild

Written: 2017-10-18
Author: Bo Waggoner


Emails, as you may know, are stored and communicated as files in the MIME format. The actual MIME format, as you may or may not know, is a huge mess. In particular, the type of data in the file is described by its "Content-Type" (one of the headers, along with From, To, CC, etc.), which can be plain text, HTML, multipart, and so on. But multipart messages can contain other content types within them, leading to complex or unpredictable structures.

Given a particular structure, like a multipart with three subparts the second of which is another multipart, what do the parts represent and which one(s) are the actual message? Not obvious. It turns out that each email client or tool can just do it whatever way seems like a good idea at the time (which I'll grant you is better than the alternative, but only barely).

When creating an email client for fun, I realized that reading RFCs wasn't helping me understand what the different parts of the emails actually were being used for, and there was a surprising lack of other info available.

So I wrote a simple python tool, mimelyze, that scans through an email dataset and analyzes the structures of the files. It also lets you print some random emails with a given structure, to see how that structure is actually being used. I ran it on a medium email data set and share some results below.


A Dataset of One

Actually, even finding email datasets seems to be a big challenge. The Enron dataset turns out not to consist of MIME files, i.e. no Content-Type information. Clinton's recently-leaked emails seem to mostly not have Content-Type info, except for a small number which are all plain text. (Neither of these is probably representative of most emails people send and receive anyways.) And those are pretty much the only email data sets I could find.

So, I apologize, but the only data I have to present for you here is my own. (If this post were to ever get traction, perhaps I could ask readers to run mimelyze on their own emails and submit frequency data to compile a more useful resource.)


The Walkthrough

mimelyze prints out the structures it encounters, from most frequent to least. Here are the results, with slightly smoothed numbers, on a dataset of all my own emails, about 107,000 of them compiled from about 2005 (age 16) through 2017.

Notes: The program makes some simplifications. In particular, if an email has multiple attachments or images in a row, it shortens them to "attachment(s)" or "image(s)", so these are all viewed as the same structure. Each result lists the number of emails with that structure, followed by a description of the structure.
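
The structure listing and the collapsing rule from the notes can be sketched with Python's standard email package. This is only a sketch, not mimelyze's actual code: the names label and structure are hypothetical, and the rule for what counts as an "attachment" (a Content-Disposition of attachment, or any leaf that is neither text nor an image) is my assumption.

```python
from email import message_from_string
from email.message import Message

COLLAPSED = ("attachment(s)", "image(s)")


def label(part: Message) -> str:
    """Name a leaf part, lumping attachments and images together (assumed rule)."""
    maintype = part.get_content_maintype()
    if part.get_content_disposition() == "attachment" or maintype not in ("text", "image"):
        return "attachment(s)"
    if maintype == "image":
        return "image(s)"
    return part.get_content_type()


def structure(part: Message, depth: int = 0) -> list:
    """Return the indented structure of a message, one line per part."""
    indent = "    " * depth
    if not part.is_multipart():
        return [indent + label(part)]
    lines = [indent + part.get_content_type()]
    for child in part.get_payload():
        child_lines = structure(child, depth + 1)
        # Skip a leaf that repeats the previous line, so a run of
        # attachments (or images) shows up as a single line.
        if (len(child_lines) == 1 and child_lines[0] == lines[-1]
                and child_lines[0].strip() in COLLAPSED):
            continue
        lines.extend(child_lines)
    return lines


# A made-up message: an alternative body followed by two PDF attachments.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain

Hello
--inner
Content-Type: text/html

<p>Hello</p>
--inner--
--outer
Content-Type: application/pdf
Content-Disposition: attachment; filename="a.pdf"

%PDF
--outer
Content-Type: application/pdf
Content-Disposition: attachment; filename="b.pdf"

%PDF
--outer--
"""

print("\n".join(structure(message_from_string(SAMPLE))))
```

Joining the lines for each message into one string and feeding those strings to collections.Counter would then give a frequency table like the one below.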


1. 58700
multipart/alternative
    text/plain
    text/html

Over half the emails had this structure. "multipart/alternative" means that the sub-parts are alternatives to each other; you can read any of them. Read this as "the sender prefers that you read the HTML version of the email (the last sub-part), but if you can't, we've provided a text version as an alternative."
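
As a rough illustration of the "last one you can read wins" rule, here is a small sketch using Python's standard email package. The helper name pick_alternative and the sample message are made up for this example, and real clients may well apply more elaborate preferences.

```python
from email import message_from_string

# A made-up message with structure #1.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="alt"

--alt
Content-Type: text/plain

Hello in plain text
--alt
Content-Type: text/html

<p>Hello in <b>HTML</b></p>
--alt--
"""


def pick_alternative(part, renderable):
    """Return the last sub-part whose content type we can display, else None."""
    chosen = None
    for child in part.get_payload():
        if child.get_content_type() in renderable:
            chosen = child  # later alternatives are the ones the sender prefers
    return chosen


msg = message_from_string(RAW)
text_client = pick_alternative(msg, {"text/plain"})
html_client = pick_alternative(msg, {"text/plain", "text/html"})
print(text_client.get_content_type())
print(html_client.get_content_type())
```

A text-only client ends up with the text/plain part, while a client that can render HTML ends up with the text/html part.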


2. 19900
text/plain

My favorite.


3. 7900
text/html

These are a bit annoying if you're trying to write a text-based email client -- they're emails in HTML format only. These tended to be big annoying emails from institutions like American Airlines, the New York Times, and my alma mater. Ideally, these should have included a text/plain option as well, i.e. been like format #1.


4. 7700
multipart/mixed
    multipart/alternative
        text/plain
        text/html
    text/plain

"multipart/mixed" means a sequence of components, each of which is needed. You're supposed to read the first sub-part, then the second, and so on if there are more. In this case, there are two subparts, but the first one is a multipart/alternative: it's a standard-looking email we've seen in #1 with an HTML part, or a plain-text part as a fallback alternative.

To understand what the second subpart (the last text/plain component) was doing, I had to look at some random emails with this structure. The first example was from a mailing list where the message that the person had sent to the list was the multipart/alternative, and the mailing list software had added a boilerplate footer message to the bottom. This footer was the text/plain part. This was the common pattern, which makes sense because most or all of the mailing lists at my grad school used the same software.


5. 2700
multipart/mixed
    multipart/alternative
        text/plain
        text/html
    attachment(s)

This is a standard email with attachments. Here multipart/mixed is used to append attachments to a text email message.


6. 1300
multipart/related
    multipart/alternative
        text/plain
        text/html
    image(s)

This is a case where the HTML email contains embedded images inline with the text, and these images are attached. (As opposed to containing images by linking to them from elsewhere on the web.) "multipart/related" is used when all of the pieces are necessary to view that part of the email and are related to each other, in contrast to multipart/mixed.

Here, the HTML option of the email will embed the image with an <img> tag, as you'd expect. In the plain text version, the space where the image would be tends to be replaced by a line like [cid:image001.jpg@XXXXXXXX.XXXXXXXX]. Here cid stands for "Content-ID". In the MIME file, we should see attached an image named image001.jpg with Content-ID "image001.jpg@XXXXXXXX.XXXXXXXX". But sometimes in the plain text version, instead of this helpful line, we just see: [image: Inline image 1].
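
To make the cid lookup concrete, here is a small sketch using Python's standard email package. The helper parts_by_cid, the sample message, and the Content-ID value in it are all invented for illustration, and the regular expression is just one plausible way to spot cid references.

```python
import re
from email import message_from_string

# A made-up multipart/related message: HTML that refers to an image by cid.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary="rel"

--rel
Content-Type: text/html

<p>Logo: <img src="cid:image001.jpg@EXAMPLE.0001"></p>
--rel
Content-Type: image/jpeg; name="image001.jpg"
Content-ID: <image001.jpg@EXAMPLE.0001>
Content-Transfer-Encoding: base64

AAAA
--rel--
"""


def parts_by_cid(msg):
    """Map each Content-ID (without its angle brackets) to the part carrying it."""
    found = {}
    for part in msg.walk():
        cid = part.get("Content-ID")
        if cid:
            found[cid.strip().strip("<>")] = part
    return found


msg = message_from_string(RAW)
images = parts_by_cid(msg)
html = msg.get_payload(0).get_payload()  # body of the first sub-part, as a string
for cid in re.findall(r'cid:([^"\'\s>\]]+)', html):
    part = images.get(cid)
    print(cid, "->", part.get_content_type() if part else "missing")
```

Note that the Content-ID header wraps the identifier in angle brackets, while the cid reference in the body does not, so the brackets are stripped before matching.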


7. 1200
multipart/related
    text/html
    image(s)

Same as above, but HTML-only. These were mostly from my grad school's spam filter telling me about quarantined messages. They added an image logo at the bottom of the email.


8. 880
multipart/mixed
    multipart/alternative
        text/plain
        text/html

These seemed to be mostly automated emails from Amazon or certain mailing lists. There seems to be no need to encase the email in the outer multipart/mixed.


9. 680
multipart/mixed
    text/plain

Ditto.


10. 670
multipart/mixed
    multipart/alternative
        text/plain
        text/html
    attachment(s)
    text/plain

This was the mailing list software striking again. If someone sent a message to the list with attachments (i.e. structure #5), it added its text/plain footer to the very end. Note that an email client shouldn't stop looking for text to display as part of the message just because it reaches some attachments.
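
Here is a sketch of what "don't stop at the attachments" might look like for a text-only client, using Python's standard email package. The function visible_text and the sample message are hypothetical, and picking the first text/plain alternative is just my simplification for a text-based client.

```python
from email import message_from_string, policy


def visible_text(part):
    """Collect, in order, the plain-text pieces a text-only client would show."""
    if part.get_content_disposition() == "attachment":
        return []
    ctype = part.get_content_type()
    if ctype == "multipart/alternative":
        # Show only one alternative: here, the first text/plain one.
        for child in part.iter_parts():
            if child.get_content_type() == "text/plain":
                return [child.get_content().strip()]
        return []
    if part.is_multipart():
        # mixed, related, etc.: keep going through every sub-part.
        pieces = []
        for child in part.iter_parts():
            pieces.extend(visible_text(child))
        return pieces
    if ctype == "text/plain":
        return [part.get_content().strip()]
    return []


# A made-up message with structure #10: body, attachment, then a list footer.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain

See the attached report.
--inner
Content-Type: text/html

<p>See the attached report.</p>
--inner--
--outer
Content-Type: application/pdf
Content-Disposition: attachment; filename="report.pdf"

%PDF-placeholder
--outer
Content-Type: text/plain

You are subscribed to example-list.
--outer--
"""

msg = message_from_string(RAW, policy=policy.default)
pieces = visible_text(msg)
print("\n\n".join(pieces))
```

The footer after the attachment still makes it into the output, because the walk over multipart/mixed visits every sub-part instead of stopping at the first non-text one.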


11. 640
multipart/mixed
    text/plain
    attachment(s)

My second-favorite.


12. 550
multipart/related
    text/html

These were mostly from my college's athletic program. There seems to be no need for the outer multipart/related as no images are attached.


13. 450
multipart/mixed
    multipart/related
        multipart/alternative
            text/plain
            text/html
        image(s)
    text/plain

This is another mailing list pattern. When someone sends a message with structure #6 to the list, wrap it in multipart/mixed and append a plaintext footer.


14. 400
multipart/mixed
    text/html

Some people aren't looking for anything logical. They can't be bought, bullied, reasoned, or negotiated with.


15. 300
multipart/mixed
    multipart/alternative
        text/plain
        text/html
    text/plain
    text/plain

These were replies to the mailing lists mentioned above. I have no idea why, but the reply the person sent would apparently include the original plain text footer as an attached part, and the mailing list would then append its footer again.


16. 210
multipart/mixed
    multipart/related
        multipart/alternative
            text/plain
            text/html
        image(s)
    text/plain
    text/plain

This. This right here is what I'm talking about.


17. 200
multipart/alternative
    text/plain

"You can have any format you like as long as it's plain." These seemed to mostly come from Amazon.


18. 180
multipart/alternative
    text/plain
    multipart/related
        text/html
        image(s)

An interesting take (compare to #6). This is sort of saying the images aren't as vital -- if you can view HTML, you get the embedded images, but if you can't and only have plain text, then never mind.


19. 140
multipart/alternative
    text/html

Apparently email formatting is like politics: it's all about offering the illusion of choice.


20. 140
multipart/mixed
    text/plain
    attachment(s)
    text/plain

Seems to be messages with attachments forwarded to a mailing list, which adds its plain text footer.


21. 130
multipart/mixed
    multipart/signed
        text/plain
        attachment(s)
    text/plain

One of my mailing list correspondents digitally signs emails. (A digital signature is a cryptographic value computed from the email content and the sender's private key that [presumably] can't be faked, so you know the message really came from the sender.) The text/plain part is the email message, the attachment is the digital signature, and the mailing list wraps it in multipart/mixed and adds a footer.


22. 130
multipart/mixed
    multipart/related
        multipart/alternative
            text/plain
            text/html
        image(s)
    attachment(s)

If you can figure out what's going on here, you've passed Bo's MIME academy with flying colors. So to speak.


The other kind of Content-Type that occasionally showed up but we haven't discussed is message/rfc822. This marks a part that is itself an entire email message, headers and all, embedded inside a larger MIME file. (RFC 822 is the older standard for the basic email message format, which MIME builds on.) Anyway, emails with this kind of part in them, in my dataset, tended to be from mailing list software to me, as the moderator of the list, asking me to approve posts to the list. I'm tempted to generalize this and say that most emails containing rfc822 parts are automated emails of some type, but I don't know.
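
For what it's worth, Python's standard email package parses a message/rfc822 part into a nested message object that you can inspect like any other. The sketch below shows this on an invented moderation-style email; the subjects and addresses are made up.

```python
from email import message_from_string

# A made-up moderation notice that carries the held post as message/rfc822.
RAW = """\
MIME-Version: 1.0
Subject: Post awaiting approval
Content-Type: multipart/mixed; boundary="b"

--b
Content-Type: text/plain

Please approve.
--b
Content-Type: message/rfc822

Subject: Original post
From: someone@example.com
Content-Type: text/plain

Hi list
--b--
"""

msg = message_from_string(RAW)
# The payload of a message/rfc822 part is a one-element list holding the inner message.
embedded = [
    part.get_payload(0)
    for part in msg.walk()
    if part.get_content_type() == "message/rfc822"
]
for inner in embedded:
    print(inner["Subject"], "|", inner.get_content_type())
```

The inner message has its own headers and its own Content-Type, so the same structure analysis can be applied to it recursively.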


Summary

The above structures represented a bit over 98% of my emails, but there's a fairly long tail, with over 100 different kinds of structures. Of course, it reveals to my shame that I don't send or receive encrypted emails, but we all have areas to improve.

I would guess that the above is quite representative in the sense that, if you understand the structures above, most email structures you see will not surprise you too much. (Though I have nothing at all on which to base that claim.) However, the actual frequencies that an individual person will see probably vary greatly, mostly depending on the kinds of mailing lists and automated email sources they're subscribed to, which can end up dominating the absolute numbers.