What's up with all those equals signs anyway?
238 points
4 hours ago
| 17 comments
| lars.ingebrigtsen.no
| HN
ruhith
1 hour ago
[-]
The real punchline is that this is a perfect example of "just enough knowledge to be dangerous." Whoever processed these emails knew enough to know emails aren't plain text, but not enough to know that quoted-printable decoding isn't something you hand-roll with find-and-replace. It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't, and then you get congressional evidence full of mystery equals signs.
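To make the find-and-replace point concrete, here is a minimal Python sketch. The sample bytes and the `naive_decode` helper are invented for illustration and are not a reconstruction of whatever tool actually processed the emails. It only shows that a replace keyed on "=" followed by CRLF stops matching once line endings have been normalised to LF, while the standard library's `quopri` module copes with either form.

```python
import quopri

# A quoted-printable body as it might travel on the wire: CRLF line endings,
# "=" + CRLF as a soft line break, "=XX" for an escaped byte.
wire = b"What's up with all those equals si=\r\ngns? caf=C3=A9 =3D coffee\r\n"


def naive_decode(data: bytes) -> bytes:
    """Hypothetical hand-rolled 'decoder' that only strips '=' + CRLF."""
    return data.replace(b"=\r\n", b"")


# Same message after a CRLF -> LF conversion.
normalised = wire.replace(b"\r\n", b"\n")

# The naive pattern no longer matches, so the soft break (and every =XX
# escape) is left in the text.
print(naive_decode(normalised))

# A real decoder removes the soft break and decodes the escapes.
print(quopri.decodestring(normalised).decode("utf-8"))
```

The design point is simply that the decoder has to understand both soft breaks and `=XX` escapes, and should not assume one particular line-ending convention.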
reply
lvncelot
57 minutes ago
[-]
> It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't

I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454

reply
josefx
38 minutes ago
[-]
I prefer the question about CPU pipelines that gets explained using a railroad switch as an example. That one does a decent job of answering the question instead of going off on a, how to best put it, mentally deranged one-page rant about regexes, with the lazy throwaway line at the end being the only thing that makes it qualify as an answer at all.
reply
MrGilbert
4 minutes ago
[-]
For anyone wondering about the railroad switch post: https://stackoverflow.com/questions/11227809/why-is-processi...
reply
kapep
13 minutes ago
[-]
The regex answer is from the very old days of Stackoverflow, before fun was banned. I agree it barely qualifies as an answer, but considering that the question has over 4 million page views (which almost puts it in the top 100 questions with the most all-time views), it has reached a lot of people. The answer probably had much more influence than any serious answer on that topic. So I'd say the author did a good job.
reply
bayesnet
10 minutes ago
[-]
I know this is grumpy, but I’ve never liked this answer. It is a perfect encapsulation of the elitism in the SO community: if you’re new, your questions are closed and your answers are edited and downvoted. Meanwhile this is tolerated only because it’s posted by a member with high rep and username recognition.
reply
Cthulhu_
15 minutes ago
[-]
HE COMES
reply
V__
1 hour ago
[-]
They have top men working on it right now.
reply
xg15
1 hour ago
[-]
I'm just wondering why this problem shows up now. Why do lots of people suddenly post their old emails with a defective QP decoder?

> For some reason or other, people have been posting a lot of excerpts from old emails on Twitter over the last few days.

At the risk of having missed the latest meme or social media drama: does anyone know what this "some reason or other" is?

Edit: Question answered.

reply
SCdF
1 hour ago
[-]
Presumably the Epstein files, but I'm not on twitter so not sure
reply
xg15
1 hour ago
[-]
Ooh, that reason. Sorry for having been dense. Thanks!
reply
ropp
1 hour ago
[-]
the DOJ published another bunch of Epstein emails
reply
jychang
1 hour ago
[-]
That's clearly a joke, pretending to be unaware of all the emails from the Epstein files that were posted in the past 3 days.
reply
thedanbob
1 hour ago
[-]
I wrote my own email archiving software. The hardest part was dealing with all the weird edge cases in my 20+ year collection of .eml files. For being so simple conceptually, email is surprisingly complicated.
reply
heikkilevanto
1 hour ago
[-]
I thought the article would be about the various meanings of operators like = == === .=. <== ==> <<== ==>> (==) => =~=
reply
direwolf20
1 hour ago
[-]
What is this, a Haskell for ants?
reply
dkga
1 hour ago
[-]
It has to be at least… three times bigger than this
reply
tiborsaas
2 hours ago
[-]
> We see that that’s a quite a long line. Mail servers don’t like that

Why do mail servers care about how long a line is? Why don't they just let the client reading the mail worry about wrapping the lines?

reply
direwolf20
1 hour ago
[-]
SMTP is a line-based protocol, including the part that transfers the message body.

The server needs to parse the message headers, so it can't be an opaque blob. If the client uses IMAP, the server needs to fully parse the message. The only alternative is POP3, where the client downloads all messages as blobs and you can only read your email from one location, which made sense in the year 2000 but not now when everyone has several devices.

reply
fluoridation
11 minutes ago
[-]
Hey, POP3 still makes sense. Having a local copy of your emails is useful.
reply
direwolf20
6 minutes ago
[-]
If you want it to be the only copy and not sync with anything
reply
layer8
1 hour ago
[-]
Mails are (or used to be) processed line-by-line, typically using fixed-length buffers. This avoids dynamic memory allocation and having to write a streaming parser. RFC 821 finally limited the line length to at most 1000 bytes.

Given a mechanism for soft line breaks, breaking already at below 80 characters would increase compatibility with older mail software and be more convenient when listing the raw email in a terminal.

This is also why MIME Base64 typically inserts line breaks after 76 characters.
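The 76-character wrapping is easy to observe with Python's standard library. This is just a small sketch with arbitrary sample data: `base64.encodebytes` emits MIME-style Base64 with line breaks, and `quopri.encodestring` inserts soft line breaks (a trailing "=") when a text line is too long.

```python
import base64
import quopri

payload = bytes(range(256)) * 2  # arbitrary binary data, 512 bytes

# MIME-style Base64: the encoder breaks its output into lines.
b64_lines = base64.encodebytes(payload).splitlines()
print(max(len(line) for line in b64_lines))  # → 76

# Quoted-printable: one 200-character text line gets soft line breaks.
text = ("x" * 200 + "\n").encode("ascii")
qp_lines = quopri.encodestring(text).splitlines()
print([len(line) for line in qp_lines])

# Decoding removes the soft breaks again.
print(quopri.decodestring(quopri.encodestring(text)) == text)
```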

reply
citrin_ru
1 hour ago
[-]
Back in the 80s-90s it was common to use static buffers to simplify implementation: you allocate a fixed-size buffer and reject a message if it has a line longer than the buffer size. The SMTP RFC specifies a 1000-character limit (including \r\n), but it's common to wrap at around 78 characters so the source is easy to examine (on a small screen).
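The "fixed buffer, reject anything longer" approach can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular mail server is written; `read_lines`, `LineTooLong` and `MAX_LINE` are names made up for the example, and the 1000-byte figure is the limit cited above.

```python
import io

MAX_LINE = 1000  # bytes per line including CRLF (assumed limit, as cited above)


class LineTooLong(Exception):
    """Raised when a line does not fit in the fixed-size budget."""


def read_lines(stream, limit=MAX_LINE):
    """Yield lines from a binary stream, refusing any line over `limit` bytes."""
    while True:
        # readline(limit + 1) never returns more than limit + 1 bytes, so
        # memory use stays bounded even if no newline ever arrives.
        line = stream.readline(limit + 1)
        if not line:
            return
        if len(line) > limit:
            raise LineTooLong(f"line exceeds {limit} bytes")
        yield line


ok = io.BytesIO(b"HELO example.org\r\n" + b"a" * 998 + b"\r\n")
print([len(line) for line in read_lines(ok)])  # → [18, 1000]
```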
reply
thephyber
2 hours ago
[-]
The simplest reason: Mail servers have long had features which will send the mail client a substring of the text content without transferring the entire thing. Like the Gmail inbox view, before you open any one message.

I suspect this is relevant because Quoted Printable was only a useful encoding for MIME types like text and HTML (the human-readable email body), not binary (e.g. attachments, images, videos). Mail servers (if they want) can effectively treat the binary types as an opaque blob, while the text types can be read for more efficient transfer of message listings to the client.

reply
liveoneggs
1 hour ago
[-]
This is how email work(ed) over SMTP. Each command sent would get a '200'-class message (success) or a 400/500-class message (failure). Sound familiar?

telnet smtp.mailserver.com 25

HELO

MAIL FROM: me@foo.com

RCPT TO: you@bar.com

DATA

blah blah blah

how's it going?

talk to you later!

.

QUIT
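The reply classes can be sketched as a lookup on the first digit of the code. The function name and labels below are made up for illustration; the sample codes (250, 354, 450, 550) are standard SMTP replies, with the 3xx class being the "go ahead, send more" answer a server gives after DATA.

```python
def reply_class(code: int) -> str:
    """Classify an SMTP reply code by its first digit (illustrative labels)."""
    classes = {
        2: "success",
        3: "intermediate, send more",
        4: "temporary failure",
        5: "permanent failure",
    }
    return classes.get(code // 100, "unknown")


for code in (250, 354, 450, 550):
    print(code, reply_class(code))
```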

reply
Telemakhos
30 minutes ago
[-]
This brings back some fun memories from the 1990s when this was exactly how we would send prank emails.
reply
josefx
1 hour ago
[-]
RFC 822 explicitly says it is for readability on systems with simple display software. Given that the protocol is from 1982 and systems back then had between 4 and 16 KB of RAM in total, it might have made sense to give the lower-end thin client systems of the day something preprocessed.
reply
sumtechguy
34 minutes ago
[-]
Also it is an easy way to stop a denial-of-service attack. If you allow an unlimited amount in that field, I can remotely exhaust your system memory. With a limit, the mail system can just error out and hang up on the person trying the attack instead of crashing.
reply
fluoridation
7 minutes ago
[-]
Surely you don't need the message to be broken up into lines just for that. Just read until a threshold is reached and then close the connection.
reply
Pinus
2 hours ago
[-]
As far as I can remember, most mail servers were fairly sane about that sort of thing, even back in the 90’s when this stuff was introduced. However, there were always these more or less motivated fears about some server somewhere running on some ancient IBM hardware using EBCDIC encoding and truncating everything to 72 characters because its model of the world was based on punched cards. So standards were written to handle all those bizarre systems. And I am sure that there is someone on HN who actually used one of those servers...
reply
tiborsaas
1 hour ago
[-]
Thanks, I really expected a tale from the 70's, but did not see punch cards coming :)
reply
jibal
1 hour ago
[-]
The influence of 80 column punch cards remains pervasive.
reply
codingdave
2 hours ago
[-]
Keep in mind that in ye olden days, email was not a worldwide communication method. It was more typical for it to be an internal-only mail system, running on whatever legacy mainframe your org had, and working within whatever constraints that forced. So in the 90s when the internet began to expand, and email to external organizations became a bigger thing, you were just as concerned with compatibility with all those legacy terminal-based mail programs, which led to different choices when engineering the systems.
reply
liveoneggs
1 hour ago
[-]
This is incorrect
reply
beejiu
2 hours ago
[-]
> So what’s happened here? Well, whoever collected these emails first converted from CRLF (i.e., “Windows” line ending coding) to “NL” (i.e., “Unix” line ending coding). This is pretty normal if you want to deal with email. But you then have one byte fewer:

I think there is a second possible conclusion, which is that the transformation happened historically. Everyone assumes these emails are an exact dump from Gmail, but isn't it possible that Epstein was syncing emails from Gmail to a third party mail server?

Since the Stackoverflow post details the exact situation in 2011, I think we should be open to the idea that we're seeing data collected from a secondary mail server, not Gmail directly.

Do we have anything to discount this?

(If I'm not mistaken, I think you can also see the "=" issue simply by applying the Quoted-Printable encoding twice, not just by mishandling the line-endings, which also makes me think two mail servers. It also explains why the "=" symbol is retained.)
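The double-encoding hypothesis is easy to try out in Python. This is a sketch with an invented sample sentence, and it says nothing about what actually happened to the published files; it only shows that text encoded twice and decoded once still carries "=" soft breaks and `=XX` escapes.

```python
import quopri

original = ("A sentence long enough that quoted-printable has to wrap it, "
            "with one non-ASCII word: caf\u00e9.\n").encode("utf-8")

once = quopri.encodestring(original)   # what a first mail system would send
twice = quopri.encodestring(once)      # a second system encoding it again

decoded_once = quopri.decodestring(twice)

# One round of decoding only undoes the second encoding: the soft line
# break and the =C3=A9 escape from the first encoding are still there.
print(decoded_once.decode("ascii"))

# A second round of decoding recovers the original text.
print(quopri.decodestring(decoded_once) == original)
```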

reply
MoltenMan
2 hours ago
[-]
This seems like the most likely reason to me!
reply
maartin0
1 hour ago
[-]
Fun how the archive.today article near the top has this exact issue

https://pastes.io/correspond

https://news.ycombinator.com/item?id=46843805

reply
quibono
3 hours ago
[-]
CLRF vs LF strikes again. Partly at least.

I wonder why even have a max line length limit in the first place? I.e. is this for a technical reason or just display related?

reply
brk
25 minutes ago
[-]
Wait, now we have to deal with Carriage Line Return Feeds too?

I wonder if the person who had the idea of virtualizing the typewriter carriage knew how much trouble they would cause over time.

reply
keybored
19 seconds ago
[-]
Yeah, and using two bytes for a single line termination (or separation or whatever)? Why make things more complicated and take more space at the same time?
reply
OJFord
3 hours ago
[-]
I haven't seen them other than in the submission - but if the length matches up it may be that they were processed from raw email, the RFC defines a length to wrap at.

Edit: yes I think that's most likely what it is (and it's SHOULD 78ch; MUST 998ch) - I was forgetting that it also specifies the CRLF usage, it's not (necessarily) related to Windows at all here as described in TFA.

Here it is in my 'notmuch-more' email lib: https://github.com/OJFord/amail/blob/8904c91de6dfb5cba2b279f...
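For readers who want to check a raw message against those two limits, a minimal Python sketch follows. It is not taken from the linked library; the function and constant names are made up for the example, and it assumes the message uses CRLF line endings as the RFC specifies.

```python
SHOULD_LIMIT = 78   # recommended maximum line length, excluding CRLF
MUST_LIMIT = 998    # hard maximum line length, excluding CRLF


def check_line_lengths(raw: bytes):
    """Return (line_number, length, severity) for each over-long line."""
    problems = []
    for number, line in enumerate(raw.split(b"\r\n"), start=1):
        if len(line) > MUST_LIMIT:
            problems.append((number, len(line), "violates MUST"))
        elif len(line) > SHOULD_LIMIT:
            problems.append((number, len(line), "exceeds SHOULD"))
    return problems


sample = b"Subject: hi\r\n\r\n" + b"a" * 80 + b"\r\n" + b"b" * 1000 + b"\r\n"
print(check_line_lengths(sample))
# → [(3, 80, 'exceeds SHOULD'), (4, 1000, 'violates MUST')]
```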

reply
FabHK
2 hours ago
[-]
> it's not (necessarily) related to Windows at all here as described in TFA.

The article doesn't claim that it's Windows related. The article is very clear in explaining that the spec requires =CRLF (3 characters), then mentions (in passing) that CRLF is the typical line ending on Windows, then speculates that someone replaced the two characters CRLF with a one character new line, as on Unix or other OSs.

reply
OJFord
2 hours ago
[-]
Ok yeah I may have misinterpreted that bit in the article. It would be a totally reasonable assumption if you didn't happen to know that about email though, it wasn't a judgement regardless.
reply
dgan
3 hours ago
[-]
I am just wondering how it is a good idea for a server to insert some characters into the user's input. If a colleague were to propose this, I'd laugh in his face.

It's just so hacky I can't believe it's a real-life solution.

reply
jagged-chisel
2 hours ago
[-]
“Insert characters”?

Consider converting the original text (maintaining the author’s original line wrapping and indentation) to base64. Has anything been “inserted” into the text? I would suggest not. It has been encoded.

Now consider an encoding that leaves most of the text readable, translates some things based on a line length limit, and some other things based on transport limitations (e.g. passing through 7-bit systems.) As long as one follows the correct decoding rules, the original will remain intact - nothing “inserted.” The problem is someone just knowledgeable enough to be aware that email is human readable but not aware of the proper decoding has attempted to “clean up” the email for sharing.
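The "encoded, not inserted" argument can be checked with a round trip. The sample text below is invented; the sketch only shows that both encodings give back the original bytes, line wrapping and indentation included, as long as the matching decoder is used.

```python
import base64
import quopri

original = "Dear Bj\u00f8rn,\n    indented line, kept as typed.\n".encode("utf-8")

as_base64 = base64.b64encode(original)
as_qp = quopri.encodestring(original)

print(as_base64)  # unreadable, but nothing has been lost
print(as_qp)      # mostly readable, with =XX escapes for non-ASCII bytes

# Proper decoding returns exactly the bytes that went in.
print(base64.b64decode(as_base64) == original)
print(quopri.decodestring(as_qp) == original)
```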

reply
dgan
2 hours ago
[-]
Okay, it does sound better from this POV. Still weird, as it's a client/UI concern, not something a server is supposed to do; what's next, adding "bold" tags on the title? Lol
reply
brookst
31 minutes ago
[-]
SMTP is a line-oriented protocol. The server processes one line at a time, and needs to understand headers.

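Infinite line length = infinite buffer. Even worse, QP is 7-bit (because SMTP started out ASCII-only), so each byte above 127 gets encoded as three bytes (an equals sign, then two hex digits): 500 non-ASCII bytes become 1500 on the wire, and since every non-ASCII character takes at least two bytes in UTF-8, a 500-character non-ASCII UTF-8 line is at least 3000 bytes.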
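The expansion can be measured directly. In this Python sketch the sample line and the `qp_payload_size` helper are made up for illustration; the helper ignores the soft line breaks that the encoder adds for wrapping. Note that the tripling applies per byte, so a character that takes two bytes in UTF-8 takes six bytes once encoded.

```python
import quopri

line = "\u00f8" * 500                 # 500 non-ASCII characters ("ø")
latin1 = line.encode("latin-1")       # 1 byte per character
utf8 = line.encode("utf-8")           # 2 bytes per character


def qp_payload_size(data: bytes) -> int:
    """Encoded size, not counting the soft line breaks added for wrapping."""
    encoded = quopri.encodestring(data)
    return len(encoded.replace(b"=\n", b""))


print(qp_payload_size(latin1))  # → 1500
print(qp_payload_size(utf8))    # → 3000
```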

It all made sense at the time. Not so much these days when 7-bit pipes only exist because they always have.

reply
direwolf20
1 hour ago
[-]
It's called escaping, and almost every protocol has it. HN must convert the & symbol to &amp; for displaying in HTML. Many wire protocols like SATA or Ethernet must insert a 1 after a certain number of consecutive 0s to maintain electrical balance. Don't remember which ones — don't quote me that it's SATA and Ethernet.
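The HTML case is the easiest one to try. A small Python sketch using the standard `html` module (the sample string is invented, and this is not a claim about how HN itself is implemented):

```python
import html

comment = 'Tom & Jerry said "1 < 2" </textarea>'

escaped = html.escape(comment)
print(escaped)
# → Tom &amp; Jerry said &quot;1 &lt; 2&quot; &lt;/textarea&gt;

# Unescaping gives back exactly what the user typed.
print(html.unescape(escaped) == comment)
```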
reply
flexagoon
2 hours ago
[-]
When you post a comment on HN, the server inserts HTML tags into your input. Isn't that essentially the same thing?
reply
dgan
2 hours ago
[-]
No, because there is a clear separation between the content and the envelope. You wouldn't expect the post office to open your physical letters and write routing instructions to the postmen for delivery.

But I agree with the sibling comment: it makes more sense when it's called "encoding" instead of "inserting chars into the original stream".

reply
layer8
1 hour ago
[-]
Just wait until you learn what mess UTF-8 will turn your characters into. ;)
reply
jojomodding
3 hours ago
[-]
reply
voxelghost
1 hour ago
[-]
My main takeaway from this article, is that I want to know what happened to the modified pigs with non-cloven hoofs
reply
lordnacho
3 hours ago
[-]
I love how HN always floats up the answers to questions that were in my mind, without occupying my mind.

I, too, was reading about the new Epstein files, wondering what text artifact was causing things to look like that.

reply
AlphaAndOmega0
3 hours ago
[-]
Same here. I did notice what I think was an actual error on someone's part, there was a chart in the files comparing black to white IQ distributions, and well, just look at it:

https://nitter.net/AFpost/status/2017415163763429779?s=201

Something clearly went wrong in the process.

reply
fredley
2 hours ago
[-]
Me too. I first assumed it was an OCR error, then remembered they were emails and wouldn't need to go through OCR. Then I thought that the US Government is exactly the kind of place to print out millions of emails only to scan them back in again.

I'm glad to know the real reason!

reply
lucb1e
1 hour ago
[-]

    cat title | sed 's/anyway/in email/'
would save a click for those already familiar with =20 etc.
reply
noduerme
2 hours ago
[-]
Great. Can't wait for equal signs to be the next (((whatever this is))). Maybe it's a secret code. j/k

On a side note: There are actually products marketed as kosher bacon (it's usually beef or turkey). And secular Jews frequently make jokes like this about our kosher bros who aren't allowed to eat the real stuff for some dumb reason like it has too many toes.

reply
seydor
3 hours ago
[-]
TLDR "=\r\n" was converted to "=\n"
reply
netsharc
3 hours ago
[-]
Author seems to think Unix uses a character called "NL" instead of "LF"...
reply
debugnik
2 hours ago
[-]
Unicode labels U+000A as all of "LINE FEED (LF)", "new line (NL)" and "end of line (EOL)". I'm guessing different names were imported from slightly different character sets, although I understand the all-uppercase name to be the main/official one.

https://www.unicode.org/charts/PDF/U0000.pdf

reply
matsemann
2 hours ago
[-]
NL, or New Line, is a character in some character sets, like old mainframe computers. No need to be snarky just because he mistyped or uses a different name for something.
reply
db_admin
2 hours ago
[-]
I am more surprised by the description of “rock döts”. A Norwegian certainly knows that ASCII is not enough for all our alphabetical needs.
reply
topaz0
20 minutes ago
[-]
https://en.wikipedia.org/wiki/Metal_umlaut

The writer presumably knows that umlauts and other non-ascii characters are functional in many languages. "rock döts" is poking fun at the trend in a certain tranche of anglophone rock/metal to use them in a purely aesthetic way in band names etc.

reply
thaumasiotes
2 hours ago
[-]
No, the article is quite explicit that that isn't what happened.
reply
brador
2 hours ago
[-]
Could be worsened by inaccurate optical character recognition in some cases.

Back in those days optical scanners were still used.

reply
zabzonk
1 hour ago
[-]
People posting Excel formulae?
reply
ccppurcell
2 hours ago
[-]
Rock dots? You mean diacritics? Yeah someone invented them: the ancient Greeks, idiöt.
reply
RHSeeger
2 hours ago
[-]
It's not the character, it's the way / context in which it's used

https://en.wikipedia.org/wiki/Metal_umlaut

reply
ccppurcell
1 hour ago
[-]
I know what he was referring to. But the use case is obviously languages other than English, not the Motörhead fan club newsletter.
reply
topaz0
17 minutes ago
[-]
Some combination of people misunderstood some other people's joke, not totally clear which and which.
reply
chr
2 hours ago
[-]
Yeah, that dude oughta read books and learn about computers, too.
reply
gerikson
26 minutes ago
[-]
And live in a country where they use these in their alphabets.
reply