What would be the benefit to updating legacy protocols to just use NL? You save a handful of bits at the expense of a lot of potential bugs. HTTP/1(.1) is mostly replaced by HTTP/2 and later by now anyway.
Sure, it makes sense not to require CRLF with any new protocols, but it doesn't seem worth updating legacy things.
> Even if an established protocol (HTTP, SMTP, CSV, FTP) technically requires CRLF as a line ending, do not comply.
I'm hoping this is satire. Why intentionally introduce potential bugs for the sake of making a point?
HTTP/1.1 was regrettably but irreversibly designed with security-critical parser alignment requirements. If two implementations disagree on whether `A:B\nC:D` contains a value for C, you can build a request smuggling gadget, leading to significant attacks. We live in a post-Postel world: only ever generate and accept CRLF in protocols that specify it, however legacy and nonsensical that might be.
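To make the alignment hazard concrete, here's a toy sketch (hypothetical parsers, not any real server's code) of how a strict and a lenient header parser can disagree about `A:B\nC:D`:

```python
def parse_strict(raw: bytes) -> dict:
    """Toy parser: only CRLF terminates a header line (per RFC 9112)."""
    headers = {}
    for line in raw.split(b"\r\n"):
        if b":" in line:
            name, _, value = line.partition(b":")
            headers[name.strip().decode()] = value.strip().decode()
    return headers

def parse_lenient(raw: bytes) -> dict:
    """Toy parser: a bare LF also terminates a line (Postel-style)."""
    headers = {}
    for line in raw.replace(b"\r\n", b"\n").split(b"\n"):
        if b":" in line:
            name, _, value = line.partition(b":")
            headers[name.strip().decode()] = value.strip().decode()
    return headers

raw = b"A: B\nC: D\r\n"
# The lenient parser sees a header named C; the strict parser folds
# "C: D" into the value of A. If one hop routes on C and the next
# hop doesn't see it, that disagreement is a smuggling gadget.
```

If a front proxy uses the lenient reading and the origin uses the strict one (or vice versa), each believes it forwarded a well-formed request while the two disagree on its contents.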
(I am a massive, massive SQLite fan, but this is giving me pause about using other software by the same author, at least when networks are involved.)
The situation is different with SMTP, see https://www.postfix.org/smtp-smuggling.html
Myself, I've written an HTTP server that is strict enough to only recognize CRLF, because recognizing bare CR or LF would require more code†, but it doesn't reject requests that contain invalid characters. It wouldn't open a request-header-smuggling hole in my case because it doesn't have any proxy functionality.
One server is a small sample size, and I don't remember what the other HTTP servers I've written do in this case.
______
† http://canonical.org/~kragen/sw/dev3/httpdito-readme http://canonical.org/~kragen/sw/dev3/server.s
https://br.pinterest.com/ https://www.pinterest.co.uk/
https://apps.apple.com/ https://support.apple.com/ https://podcasts.apple.com/ https://music.apple.com/ https://geo.itunes.apple.com/
https://ncbi.nlm.nih.gov/ https://www.salesforce.com/ https://www.purdue.edu/ https://www.playstation.com/
https://llvm.org/ https://www.iana.org/ https://www.gnu.org/ https://epa.gov/ https://justice.gov/
https://www.brendangregg.com/ http://heise.de/ https://www.post.ch/ http://hhs.gov/ https://oreilly.com/
https://www.thinkgeek.com/ https://www.constantcontact.com/ https://sciencemag.org/ https://nps.gov/
https://www.cs.mun.ca/ https://www.wipo.int/ https://www.unicode.org/ https://economictimes.indiatimes.com/
https://science.org/ https://icann.org/ https://caniuse.com/ https://w3techs.com/ https://chrisharrison.net/
https://www.universal-music.co.jp/ https://digiland.libero.it/ https://webaim.org/ https://webmd.com/
This URL responds with HTTP 505 on an 0A request: https://ed.ted.com/
These URLs don't respond on an 0A request: https://quora.com/
https://www.nist.gov/
Most of these seem pretty major to me. There are other sites that are public but responded with an HTTP 403, probably because they didn't like the VPN or HTTP client I used for this test. (Also, www.apple.com is tolerant of 0A line endings, even though its other subdomains aren't, which is weird.)

For what it's worth: I'm testing by piping the bytes for a bare-newline HTTP request directly into netcat.
$ printf 'GET / HTTP/1.1\r\nHost: hhs.gov\r\n\r\n' | nc hhs.gov 80
HTTP/1.1 302 Found
Date: Mon, 14 Oct 2024 01:38:29 GMT
Server: Apache
Location: http://www.hhs.gov/web/508//
Content-Length: 212
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://www.hhs.gov/web/508//">here</a>.</p>
</body></html>
^C
$ printf 'GET / HTTP/1.1\nHost: hhs.gov\n\n' | nc hhs.gov 80
HTTP/1.1 400 Bad Request
Date: Mon, 14 Oct 2024 01:38:40 GMT
Server: Apache
Content-Length: 226
Connection: close
Content-Type: text/html; charset=iso-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
</body></html>
How much do we expect the domain owners to invest in changing an implementation that already works? Hint: it's a number smaller than epsilon.
Google might, but their volume is so high they care about the cost of individual bytes on the wire.
On the other hand, as a client, it's OK to send malformed requests, as long as you're prepared for them to fail. But it's a weird flex: legacy protocols have many warts, so why die on this particular hill?
As a web server, you may not know which intermediate proxies a request traversed before arriving at your port. Given that request smuggling is a thing, failing fast, with no further parsing, on any protocol deviation seems to be the most secure option.
Fast-abort on bare-0ah will still be compatible with all browsers and major http clients, thus providing extra mitigations practically for free.
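A fail-fast check along those lines can be quite small. Here's an illustrative sketch (not taken from any production server) that rejects a header block containing any bare CR or bare LF:

```python
def read_header_block(data: bytes) -> bytes:
    """Return the header block only if every CR is part of a CRLF pair
    and no LF appears without a preceding CR.

    Raises ValueError on any deviation: fail fast, no further parsing.
    """
    end = data.find(b"\r\n\r\n")
    if end == -1:
        raise ValueError("no CRLF CRLF end-of-headers")
    block = data[: end + 4]
    i = 0
    while i < len(block):
        if block[i] == 0x0D:                     # CR
            if i + 1 >= len(block) or block[i + 1] != 0x0A:
                raise ValueError("bare CR")
            i += 2                               # consume the CRLF pair
        elif block[i] == 0x0A:                   # LF not preceded by CR
            raise ValueError("bare LF")
        else:
            i += 1
    return block
```

Anything a mainstream browser or HTTP client sends passes this check, so the strictness costs essentially nothing in compatibility.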
This is unrealistic, though:
> I don't believe in Postel's Law
All the systems around us that work properly do believe in it, and they will continue to do so. No-one who writes MTAs or reverse proxies &c is gonna listen to the wolves howling at the moon for change when there's no better plan than "ram it through unilaterally". Irrespective of what any individual may believe, Postel's Law remains axiomatic in protocol design & implementation.
More constructively, it may be that line-oriented protocols will only move towards change when they can explicitly negotiate line termination preferences during the opening handshake/banner/key exchange etc, which inevitably means a protocol revision in every case and very careful consideration of when CRLF is passed through anyway (e.g. email body).
No it isn't, at least not critical to all those parsers. My HTTP server couldn't care less whether the middleboxes people go through are more or less strict in their HTTP parsing. This only becomes a concern when you operate something like a reverse proxy AND implement security-relevant policies in that proxy.
Maybe this was so widespread that ~everything already handles it because non-malicious stuff breaks if you don’t. In that case, my bad, but I still would like to make a general plea as an implementer for sticking strictly to specified behavior in this sort of protocols.
The only situation where you don't need to know two policies match is when one of the policies rejects one of the combinations outright. Probably. Maybe.
EDIT: maybe it's better phrased as "all parts need to be bare-0ah-strict". But then it's fine if it's bare-0ah-reject; they just need to all be strict, one way or the other.
I just don’t see why you’d not want to do that as the implementer. If there’s some way to exploit that behavior I can’t see it.
This attack is even worse when applied to SMTP because the attacker can forge emails that pass SPF checking, by inserting the end of one message and start of another. This can also be done in HTTP if your reverse proxy uses a single multiplexed connection to your origin server, and the attacker can make their response go to the next user and desync all responses after that.
The problem here is not to use one or the other, but to use a mix of both.
Hipp is probably one of the better engineering leaders out there. His point of view carries weight because of who he is, but should be evaluated on its merits. If Microsoft got rid of this crap 30 years ago, when it was equally obsolete, we wouldn’t be having this conversation; if nobody does, our grandchildren will.
I understand that it is tempting to blame Microsoft for \r\n proliferation, but that does not seem to be the case: \r\n comes from the era of teletypes and physical VT terminals. You can still see the original "NL" in action (move down only, do not go back to start of line) on any Unix system by typing "(stty raw; ls)" in a throw-away terminal.
“Today, CR is represented by U+000d and both LF and NL are represented by U+000a. Almost all modern machines use U+000a to mean NL exclusively. That meaning is embedded in most programming languages as the backslash escape \n. Nevertheless, a minority of machines still insist on sending a CR together with their NLs”
Who is the “minority”?
He also takes the position that the legacy behavior is fine for a tty, as it’s emulating a legacy terminal.
CRLF was the standardized way to implement "go down one line and return to column zero", and they're the only ones who implemented newlines correctly at the outset.
Blaming Microsoft now, because they like backwards compatibility above almost everything else, is misplaced and myopic.
That is never the right approach. You intentionally introduce a problem you expect others to fix. All because he doesn't like 0x0d. The protocol is what it is. If you want to make more sane decisions when designing a new protocol (or an explicitly newer version of some existing one) then by all means, go for it. But intentionally breaking existing ones is not the way to go.
And given your application might assume your middleware does some form of access control (for example, `X-ActualUserForReal` being treated as an internal-only header), you could get around some access control stuff.
Not a bytes-alignment thing but a "header values disagreement" thing.
This is an issue if one part of your stack parses headers differently than another in general though, not limited to newlines.
Even if I wanted to contribute code to SQLite, I can't. I acknowledge the fact God doesn't exist, so he doesn't want my contributions :P
I think that the proper spirit of the thing is that if you have patches to sqlite, you just maintain them yourself. If you are especially benevolent, you will put the patches in the public domain as well. And if they are any good, perhaps the original author will want them.
In fact, the public domain is so weird that some countries have no legal concept of it. Originally, the concept was just the stance of the US federal government that, because the works of the government were for the people, these works were not protected by copyright and could be thought of as collectively owned by the people, or in the public domain. Some countries don't recognize this; everything has to be owned by someone. And sqlite was legally unable to be distributed in those countries: it would default to copyright with no license.
Go read the article again. I think you'll be pleasantly surprised.
It’s worse than satire. Postel’s Law is definitively wrong, at least in the context of network protocols, and delimiters, especially, MUST be precise. See, for example:
https://www.postfix.org/smtp-smuggling.html
Send exactly what the spec requires, and parse exactly as the spec requires. Do not accept garbage. And LF, where CRLF is specified, is garbage.
They can be: c.f. legally-enforced safety-regulations.
There is a method to this madness, and that's revising the standards.
Big ones being:
* The standards are often not detailed enough, or contain enough loose verbiage, that there are many ways to understand how to implement some part, yet those ways are not interoperable.
* Many protocols allow vendor specifications in such a way that 2 implementations that are 100% compliant won't interoperate.
* Many protocol implementations interoperate quite well by converging on behavior that isn't specified in any standard (often to the surprise of people who haven't read the relevant standards)
At least this is my experience for ietf rfc standards.
Usually when there's a high disparity between the "de jure" and the "de facto", it's due to a discrepancy in the interests and the leverage, resulting in a breakdown in communication and cooperation. Laying into either then is a bandaid attempt, not a solution. It's how either standard sprawl starts, or how standards bodies lose relevance.
> They are not the only tool, and they don't carry any moral force.
Indeed there are countless other standards bodies in the world also producing normative definitions for many things, so I'm definitely a bit confused why the focus on IETF specifically.
To be even more exact, I do not know of any standards body that publishes what it and the world consider standards which are entirely, or even primarily, informational rather than normative in nature. Like, do I understand the word "standard" incorrectly? What even is the point of a standard, if it doesn't aim to control?
For SMTP (which this subthread started with):
> In addition, the appearance of "bare" "CR" or "LF" characters in text (i.e., either without the other) has a long history of causing problems in mail implementations and applications that use the mail system as a tool. SMTP client implementations MUST NOT transmit these characters except when they are intended as line terminators and then MUST, as indicated above, transmit them only as a <CRLF> sequence.
https://datatracker.ietf.org/doc/html/rfc5321#section-2.3.8 https://datatracker.ietf.org/doc/html/rfc2616#section-19.3 https://datatracker.ietf.org/doc/html/rfc9112#section-2.2
As for HTTP or any other protocols' definitions go, I'd rather not join in on that back and forth. I'd imagine it's well defined what's expected. Skim reading RFC-2616 now certainly suggests so.
That's the context in which Postel's law makes absolute sense. Not that you should forgo any sanity checking, or attempt to interpret garbage or make up frame boundaries; but when there is a potential ambiguity, and you can safely tolerate it, then it's really helpful for you to do so.
Me too. It's one thing to accept single LFs in protocols that expect CRLF, but sending single LFs is a bridge too far in my opinion. I'm really surprised most of the other replies to your comment currently seem to unironically support not complying with well-established protocol specifications under the misguided notion that it will somehow make things "simpler" or "easier" for developers.
I work on Kestrel which is an HTTP server for ASP.NET Core. Kestrel didn't support LF without a CR in HTTP/1.1 request headers until .NET 7 [1]. Thankfully, I'm unaware of any widely used HTTP client that even supports sending HTTP/1.1 requests without CRLF header endings, but we did eventually get reports of custom clients that used only LFs to terminate headers.
I admit that we should have recognized a single LF as a line terminator instead of just CRLF from the beginning like the spec suggests, but people using just LF instead of CRLF in their custom clients certainly did not make things any simpler or easier for me as an HTTP server developer. Initially, we wanted to be as strict as possible when parsing request headers to avoid possible HTTP request smuggling attacks. I don't think allowing LF termination really allows for smuggling, but it is something we had to consider.
I do not support even adding the option to terminate HTTP/1.1 request/response headers with single LFs in HttpClient/Kestrel. That's just asking for problems because it's so uncommon. There are clients and servers out there that will reject headers with single LFs while they all support CRLF. And if HTTP/1.1 is still being used in 2050 (which seems like a safe bet), I guarantee most clients and servers will still use CRLF header endings. Having multiple ways to represent the exact same thing does not make a protocol simpler or easier.
In its original terms for printing terminals, carriage return might be ambiguous. It could mean either "just send the print head to column zero" or "print head to column zero and advance the line by one". The latter is what typewriters do for the Return key.
But LF always meant Line Feed, moving the paper but not the print head.
These are of course wildly out of date concepts. But it still strikes me as odd to see a Line Feed as a context reset.
Minor correction: mechanical typewriters do not have a Return key, but they have both operations (line feed, as well as carriage return).
The carriage return lever is typically rigged to also do line feed at the same time, by a preset amount of lines (which can be set to 0), or you can push the carriage without engaging line feed.
Technically, the lever would do LF, and pushing on it further would do CR (tensioning the carriage spring).
It is, however, true that most of the time, the users would simply push the lever until it stops without thinking about it, producing CRLF operation, and that CR without LF was comparatively rare.
From a pure protocol UX perspective, it would make sense IMO to have a single command for (CR + LF) too, just like the typewriter effectively does it (push the lever here to do both at once).
It seems weird that the protocol is more limited than the mechanical device that it drives, but then again, designers probably weren't involved in deciding on terminal protocol specs.
>the lever would do LF, and pushing on it further would do CR (tensioning the carriage spring).
In any case, carriage return is just as important a function of the lever as line feed:
- you can also directly do line feed by turning the roller
- line feed, by itself, doesn't need a large lever
- carriage return, by itself, doesn't need a large lever either - you can simply push the carriage
- however, having a large lever is an ergonomic feature which allows you to:
1) return the carriage without moving your hands too far from the keyboard
2) do CRLF in one motion without it feeling like two things
3) If needs be, do a line feed by itself, since the force required for that is much smaller compared to the one to move the carriage (lever advantage!).
The long lever makes it so that line feed happens before carriage return. If the lever were short, you'd be able to move the carriage until it stops, and only then would the paper move up.
So I wondered why the control codes are doing the operations in the opposite order from the typewriter.
Turns out, the reasons are mechanical[1]:
>The separation of newline into two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in time to print the next character. Any character printed after a CR would often print as a smudge in the middle of the page while the print head was still moving the carriage back to the first position. The solution was to make the newline two characters: CR to move the carriage to column one, and LF to move the paper up.
Aha! Makes sense.
In a way, this was creating a de-facto protocol by usage, in a similar spirit to the one the author is suggesting to get rid of it.
As in: the existing standard wasn't really supported, but letting the commands go through nevertheless and allowing things to break incentivized people to collectively stick to the way of doing things that didn't result in misprints.
I also strongly disagree with the author that LF is useless.
So many times in code I need to type:
Function blah(parameter1 = default1,
              parameter2, ...)
It would be super nice to move down from the beginning of the word "parameter1" to the next line, even when it's empty, to start typing at that position. Sure, there is auto-format. But not in this comment box.
And what I'm talking about is exactly what LF was meant to do!
I want all my text boxes to support that, and to have a special key on my keyboard to do it.
Eh.
1) "Typewriters" in parent's comment didn't refer to mechanical typewriters, but
2) Line feed/carriage return semantics, as well as the UX of combining them into one action to proceed to the next line of text, predate electric typewriters and were effectively the same on mechanical ones.
As I wrote in the other comment, the subtle difference in semantics comes from teletypes, which couldn't advance the paper feed and return the carriage fast enough to print the next character in the timespan of one command.
Not that it applied to all teletypes, but it was the case for a very popular unit.
The makers of that machine deliberately didn't include a single command that would do CR/LF so that there'd be no way for the users to notice that.
The ordering, CR then LF, differs from the one on mechanical typewriters, where LF always precedes CR when you use the big lever, allowing one to use the same lever to produce blank lines without moving the carriage (in effect, doing LF LF LF ... LF CR).
On the teletypes though, CR LF ordering was, in any case, a lie, since in actuality, LF was happening somewhere in the middle of the carriage return, which took the time span of two commands.
The CR command had to precede LF on the teletype because it took longer, but since the mechanisms were independent, they could be executed at the same time.
This is the difference from mechanical typewriters.
The typing mechanism was also independent of CR and LF, and running CR + [type character] at the same time was bad. But having a fixed time per command simplified everything, so instead of waiting (which means buffering, with potential overflow issues, or a protocol to tell the sending party to wait, which is a lot more complex), hacks like this were put in place.
My IBM selectric is not functional (got it as a repair project, didn't get to do it yet), so I can't verify, but I'd guess it doesn't need to do CR then LF, since it can simply not process input while the carriage is returning. It's OK for it to do CR and LF in any order, or simultaneously.
If the operator presses and releases a button during this time, the machine can simply do nothing; the operator will re-type the character the next instant, using the buffer in their head where the text ultimately comes from.
The teletypes didn't have that luxury, as the operator on the other end could be a computer, which was told it could send output at a certain rate, and by golly it did. Not processing a command would mean dropping data.
All that is to say that CR and LF are present on both typewriters and teletypes, with the following differences:
* mechanical typewriters always do LFCR due to the mechanics of the carriage return lever, which was designed for a human operator;
* teletypes do CRLF because that's how they cope with the typist being a machine that can't be told to wait a bit until the carriage returns;
* and electric typewriters are somewhere in between and could do whatever, because the CR lever was replaced by the motor (like on a teletype), but the operator was still a human who could wait half a second without forgetting what it is that they wanted to type.
IMO, it's worth keeping CRLF around simply because it's a part of computer and technology history that spans nearly two centuries, from typewriters to Google docs.
Changing the line endings can invalidate signatures over plaintext content. So an email MTA, for example, could never do so. Nor most proxy implementations. Then there's the high latent potential for request smuggling, command injection, and privilege escalation, via careful crafting of ambiguous header lines or protocol commands that target less robust implementations. With some protocols, it may cause declared content sizes to be incorrect, leading to bizarre hangs, which is to say, another attack surface.
In practice, retiring CRLF can't be safely performed unilaterally or by fiat; we'd need to devise a whole new handshake to affirm that both ends are on the same page re. newline semantics.
It seems spiteful, but it strikes me as an interesting illustration of how the robustness principle could be hacked to force change. It’s a descriptivist versus prescriptivist view of standards, which is not how we typically view standards.
I've had to write decoders for things like HTTP, SMTP, SIP (VoIP), and there's so many edge cases and undocumented behavior from different implementations that you have to still support.
I find that it affects text-based protocols a lot more than binary protocols. TLS or RTP, to stick with the examples above, have much less divergence and are much less forgiving of broken (according to spec) implementations.
sendmail is now stricter in following the RFCs and rejects some invalid input with respect to line endings and pipelining:

...snip...

- Accept only CRLF . CRLF as the end of an SMTP message, as required by the RFCs; this can be disabled by the new srv_features option 'O'.
- Do not accept a CR or LF except in the combination CRLF (as required by the RFCs). These checks can be disabled by the new srv_features options 'U' and 'G', respectively. In this case it is suggested to use 'u2' and 'g2' instead, so the server replaces offending bare CR or bare LF with a space. It is recommended to only turn these protections off for trusted networks, due to the potential for abuse.
It is interesting that you ignore the benefits the OP describes and instead present a vague and fearful characterization of the costs. Your reaction lies at the heart of cargo-culting, the maintenance of previous decisions out of sheer dread. One can do a cost-benefit analysis and decide what to do, or you can let your emotions decide. I suggest that the world is better off with the former approach. To wit, the OP notes as benefits: "The extra CR serves no useful purpose. It is just a needless complication, a vexation to programmers, and a waste of bandwidth." And as a mitigation of the costs: "You need to search really, really hard to find a device or application that actually interprets U+000a as a true linefeed." You ignore both the benefits assertion and the cost-mitigating assertion entirely, which is strong evidence for your emotionality.
My intuition (not emotion) agrees with the parent that investing in changing legacy code that works, and doesn't see a lot of churn, is likely a lot more expensive than leaving it be and focusing on new protocols that over time end up replacing the old protocols anyways.
OP does not really talk about the benefit, he just opines. How many programmers are vexed when implementing "HTTP, SMTP, CSV, FTP"? I'd argue not many programmers work on implementations of these protocols today. How much traffic is wasted by a few extra characters in these protocols? I'd argue almost nothing. Most of the bits are (binary, compressed) payload anyways. There is no analysis by OP of the cost of not complying with the standard which potentially results in breakage and the difficulty of being able to accurately estimate the breakage/blast radius of that lack of compliance. That just makes software less reliable and less predictable.
Funnily enough, the author doesn't actually describe any tangible benefits. It's all just (in my reading, semi-sarcastic) platonics:
- peace
- simplicity
- the flourishing of humanity
... so instead of "vague and fearful", the author comes on with a "vague and cheerful". Yay? The whole shtick about saving bandwidth, lessening complications, and reducing programmer vexations are only ever implied by the author, and were explicitly considered by the person you were replying to:
> You save a handful of bits at the expense of a lot of potential bugs.
... they just happened to be not super convinced.
Is this the kind of HackerNews comment I'm supposed to feel impressed by? That demonstrates this forum being so much better than others?
It's not satire and it's not just trying to make a point. It's trying to make things simpler. As he says, a lot of software will accept input without the CR already, even if it's supposed to be there. But we should change the standard over time so people in 2050 can stop writing code that's more complicated (by needing to eat CR) or inserts extra characters. And never mind the 2050 part, just do it today.
Let's absolutely fix new protocols (or new versions of existing protocols). But intentionally breaking existing protocols doesn't simplify anything.
Obviously IPv6 shows you need to be patient. Your great grandkids may see a useless carriage return!
Windows doesn't help here.
Easy - being able to use a plain text protocol as a human being without having to worry whether my terminal sends the right end-of-line terminator. Using netcat to debug SMTP issues is actually something I do often enough.
But IMO the right resolution is to update the spec so that (1) readers MUST accept any of (CR, LF, CRLF), (2) writers MUST use one of (CR, LF, CRLF), and (3) writers SHOULD use LF. Removing compatibility from existing applications to break legacy code would be asinine.
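The reader side of that proposed rule (to be clear, this is the commenter's suggested spec revision, not current HTTP) fits in a few lines; a sketch:

```python
import re

# Reader: MUST accept any of CR, LF, or CRLF as a line break.
# CRLF must come first in the alternation so it isn't split
# into a CR break followed by an LF break (an empty extra line).
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")

def read_lines(data: bytes) -> list:
    """Split on CR, LF, or CRLF, treating each as one line break."""
    return _LINE_BREAK.split(data)

# Writer: SHOULD emit bare LF.
def write_lines(lines: list) -> bytes:
    return b"\n".join(lines)
```

Under this scheme, readers interoperate with all three conventions while writers converge on LF over time, which is the usual shape of a backwards-compatible migration.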
Just think about text protocols like HTTP, how much easier something like cookies would be to parse if you had CR as terminating character. And then each record separated by LF.
My title was imprecise and unclear. I didn't mean that you should raise errors if CRLF is used as a line terminator in (for example) HTTP, only that a bare NL should be allowed as an acceptable line terminator. RFC2616 recommends as much (section 19.3 paragraph 3) but doesn't require it. The text of my proposal does say that CRLF should continue to be accepted, for backwards compatibility, just not required and not generated by default. I failed to make that point clear.
My initial experiments suggested that this idea would work fine and that few people would even notice. Initially, it appeared that when systems only generate NL instead of CRLF, everything would just keep working seamlessly and without problems. But, alas, there are more systems in circulation that are unable to deal with bare NLs than I knew. And I didn't sell my idea very well. So there was breakage and push-back.
I have revised the document accordingly and reverted the various systems that I control to generate CRLFs again. The revolution is over. Our grandchildren will have to continue dealing with CRLFs, it seems. Bummer.
Thanks to everyone who participated in my experiment. I'm sorry it didn't work out.
I really appreciate this attitude. As programmers, we love to complain and grumble to each other about how the state of things suck, or that things are over complicated, but then too often the response is the software engineering equivalent of “I paid my student loans, so you should have to, too”. A new person joins the project, and WTFs at something, and the traumatized veterans say, “haha oh boy welcome, yeah everything sucks! You’ll get used to it soon.”
I hate that attitude.
We are at the very, very beginning of software protocols that could potentially last for millennia. From that perspective, you would look back at this situation and think of Richard’s blog post as super obvious, the clear voice of reason, and the reaction of everyone here as myopic.
Even if our software protocols for whatever reason don’t last that long, we need to be working on reducing global system complexity. Beauty and elegance aside, there is such a thing as complexity budget which is limited by the laws of information theory, the computer science equivalent of the laws of physics. People like Richard understand this intuitively, and actively work towards reconstructing our world to regain complexity currency so that it can be spent on more productive things.
I would have backed you 100%.
Specifically, I'm referring to your new guy example here. The new guy usually very correctly identifies that things suck, what he lacks is perspective. This means that both his priorities will be off, as well as his approaches. Trust the gripe, not the advice.
This is also I think what people in this thread are/were generally about here. Not because Richard would be some new unknown kid on the block mind you, but because our grandchildren having to deal with CRLF is approximately as harrowing as the eventual heat death of the universe, and because instead of standards revisions, he was calling for standards violations.
That said, I do agree we should abolish CRLF. And replace it with LF.
=> https://sqlite.org/althttpd/info/8d917cb10df3ad28 Send bare \n instead of \r\n for all HTTP reply headers.
While browsers aren't affected, this broke compatibility with at least Zig's HTTP client.
=> https://github.com/ziglang/zig/issues/21674 zig fetch does not work with sqlite.org
The struggle is real, the problem is real. Parents, teach your kids to use .gitattributes files[1]. While you're at it, teach them to hate byte order marks[2].
1: https://stackoverflow.com/questions/73086622/is-a-gitattribu...
2: https://blog.djhaskin.com/blog/byte-order-marks-must-diemd/
The correct solution is to use .gitattributes.
How would you handle two different tools only supporting disjoint line endings?
Having Git normalize line endings for the relevant file types upon check-in is by far the simplest solution to this problem, in particular since having .gitattributes in the repository means that all clients automatically and consistently perform the same normalization.
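A minimal .gitattributes along those lines might look like this (the file patterns are illustrative, not prescriptive — adjust to the file types your project actually has):

```
# Normalize source files to LF in the repository on check-in;
# every client with this file performs the same normalization.
*.c    text eol=lf
*.h    text eol=lf
*.sh   text eol=lf
# Windows batch files genuinely need CRLF to run reliably.
*.bat  text eol=crlf
```

The `text` attribute enables normalization on check-in; `eol=` pins the working-tree line ending so contributors' editor settings stop mattering.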
Don't use `auto`, full marks, but the gitattributes file is indispensable as a safety net when explicit entries are used in it.
I mean, the whole point of the file is not everyone who is working on the project has their editors set to lf. Furthermore, not every tool is okay with line endings that are not CRLF.
When used properly (sure, ideally without auto), the gitattributes file is a lifesaver.
Can we ask for the typical *nix text editors to disobey the POSIX standard of a text file next, so that I don't need to use hex editing to get trailing newlines off the end of files?
The balance here, of course, being backwards compatibility. I'd sooner kill EBCDIC, bad ASCII and Code Pages than worry about CRLF if we didn't have to care about ancient systems.
Programming languages still retain C's operator precedence hierarchy even though it was itself meant to be a backwards compatible compromise and leads to errors around logical operator expressions.
Anyways, this article is about actively breaking systems like some kind of protocol terrorist in order to achieve an outcome at any cost, if it was merely along the lines of "CRLF considered harmful in new protocols" I'd have nothing to say.
You didn't limit your general admiration of standards to CRLF, so no, not only that.
> about actively breaking systems like some kind of protocol terrorist in order to achieve an outcome at any cost,
That's simply false, he isn't
> Almost all implementations of these protocols will accept a bare NL as an end-of-line mark, even if it is technically incorrect.
So your position, then, is that all standards include "needless complexity?" What argument are you actually trying to make here?
> That's simply false, he isn't
Yea.. that's why the word "like" is present, it implies a near association, not a direct accusation.
> Almost all implementations of these protocols will accept a bare NL as an end-of-line mark, even if it is technically incorrect.
So, right back to my original point, then, standards prevent people from having to debug dumb issues that could have been avoided. This advice is basically "go ahead, create dumb issues, see if I care."
I may have flippantly labeled that as "protocol terrorism" but I don't think it's pure hyperbole either.
That you're mistaken in your one-sided generalization of the benefits of standards.
> So your position, then, is that all standards include "needless complexity?"
No, that's just another extreme you've made up.
> Yea.. that's why the word "like" is present, it implies a near association, not a direct accusation.
Your mistake is before "like", you can't be "about actively breaking systems" when you explicitly say that no systems will be broken
> "see if I care."
That this is false is also easy to see - the author reverted a change after he realized it breaks something ancient, so clearly he does care.
> standards prevent people from having to debug dumb issues that could have been avoided.
To circle the conversation back to my original response to your point: why do you think "Almost all implementations" break the standard and "accept a bare NL"? Could it be that such unintuitive limitations don't prevent anything, and people still have to debug "dumb issues" because common expectations are more powerful?
See https://news.ycombinator.com/item?id=41832555 as far as HTTP/1.1 goes, it's definitely common but far from universal. The big problem with "it's 100% safe to make this change, since it doesn't break anything I know about" is that there are always a lot of things you don't know about, not all of which can be boiled down to being negligible weirdos.
Leaders choose the standards, especially as they approach monopoly.
Worse still: people will come out of the woodwork to actively defend the monopolist de facto standard producer.
All Unix text processing tools assume that every line in a text file ends in a newline. Otherwise, it's not a text file.
There's no such thing as a "trailing newline," there is only a line-terminating newline.
I've yet to hear a convincing argument why the last line should be an exception to that extremely long-standing and well understood convention.
Is "line-terminating newline" a controlled / established term I'm unfamiliar with or am I right to hold deep contempt against you?
Because "trailing newline", contrary to what you claim, is 100% established terminology (in programming anyways), so I'd most definitely consider it "existing", and I find it actively puzzling that someone wouldn't.
It hadn't even occurred to me until today that anything else could be meant :o
I know it's just me but my worldview is that the world would be better if all editors had "insert final newline" behavior
I expect my editor to do what I say, not secretly(!) guess what I might have wanted, or will potentially want sometime in the future. Having to insert a newline while concatenating files is a chore, but a predictable annoyance. Having to hunt for mystery bytes, maybe less so.
What Unix program "throws a fit" when encountering a perfectly normal newline in the last line in a file?
What I ran into issues with was contemporary software that's shipped to Linux, such as Neo4j, which expects its license files to have no newline at the end of the file, and will actively refuse to start otherwise.
I have a feeling I'll now experience the "well that's that software's problem then" part of this debate. Just like how software not being able to handle CRLF / CR-only / LF-only, is always the problem - instead of text files being a joke, and platforms assuming things about them being the problem.
Please do consider that many software products will not change and they will still be actively used on production environments that you will never have interest about.
And it was pretty clear from the context of norir's comment that they were not talking about legacy software, they were talking about writing new projects/file formats that used newlines as a separator. Just because you want to shoehorn your legacy projects into this discussion doesn't mean that they fit.
I don't see value in picking on Google or HTTP here, even if it is fashionable to do so.
10 LF (Line Feed). A format effector that advances the active position to the same character position on the next line. (Also applicable to display devices.) Where appropriate, this character may have the meaning “New Line” (NL), a format effector that advances the active position to the first character position on the next line. Use of the NL convention requires agreement between sender and recipient of data.
ASCII 1968 - https://www.rfc-editor.org/info/rfc20
ASCII 1977 - https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-197...
The second sentence is the UNIX interpretation of LF doing the equivalent of CRLF. But calling it a standard line ending when it's an alternative meaning defined in the standard as "requires agreement between sender and recipient of data" is a bit of a stretch. It's permissible by the standard, but it's not the default as per the standard
Personally speaking, I've always written my parsers to be permissive and accept either CR¹, LF, or CRLF as line endings. And it always meant keeping a little extra boolean for "previous byte was CR" to ignore the LF to not turn CRLF into 2 line endings.
¹ CR-only was used on some ancient (m68k era?) Macintosh computers I believe.
P.S.: LFCR is 2 line endings in my parsers :D
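The scheme above can be sketched in a few lines of Python (function name is mine; this is a sketch of the described approach, not anyone's actual parser). The one extra boolean is the "previous byte was CR" flag, which makes CRLF a single line ending while CR-only and LF-only each still count — and, as advertised, LFCR counts as two:

```python
def split_lines(data: bytes) -> list[bytes]:
    """Permissively split on CR, LF, or CRLF line endings."""
    lines = []
    current = bytearray()
    prev_cr = False  # previous byte was CR
    for byte in data:
        if byte == 0x0A:  # LF
            if prev_cr:
                # LF completing a CRLF pair: the CR already ended the line
                prev_cr = False
                continue
            lines.append(bytes(current))
            current.clear()
        elif byte == 0x0D:  # CR ends a line on its own
            lines.append(bytes(current))
            current.clear()
            prev_cr = True
        else:
            current.append(byte)
            prev_cr = False
    if current:  # trailing partial line without a terminator
        lines.append(bytes(current))
    return lines

split_lines(b"a\r\nb\nc\rd")  # [b"a", b"b", b"c", b"d"]
split_lines(b"x\n\ry")        # [b"x", b"", b"y"]  - LFCR is 2 line endings
```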
(And these distinctions predate UNIX — if I were confronted with an inconsistent mess I'd go for simplicity too, and a 2-byte newline is definitely not simple just by merit of being 2 bytes. I personally wouldn't have cared whether it was CR or LF, but would have cared to make it a single byte.)
OP clearly says that most things in fact don't break if you just don't comply with the CRLF requirement in the standard and send only LF. (He calls LF "newline". OK, fine, his reasoning seems legit.) He is not advocating changing the language of the standard.
To all those people complaining that this is a minor matter and the wrong hill to die on, I say this: most programmers today are blindly depending on third-party libraries that are full of these kinds of workarounds for ancient, weird vestigial crud, so they might think this is an inconsequential thing. But if you're from the school of pure, simple code like the SQLite/Fossil/TCL developers, then you're writing the whole stack from scratch, and these things become very, very important.
Let me ask you instead: why do you care if somebody doesn't comply with the standard? The author's suggestion doesn't affect you in any way, since you'll just be using some third-party library and won't even know that anything is different.
Oh bUT thE sTandArDs.
The Unicode standard does call it NL along with LF.
000A <control>
= LINE FEED (LF)
= new line (NL)
= end of line (EOL)
Source: https://www.unicode.org/charts/PDF/U0000.pdf

If every server updated to a line ending of LF, thereby supporting both types, this vuln wouldn’t happen?
Of course if there’s is a mixed bag then I guess this is still possible, if your server only supports CRLF. At least in that scenario you have some control over the issue though.
Unfortunately, asking more people to ignore the currently estabilished standards makes the problem worse, not better.
More specifically the Unicode control character U+000a is, in the Unicode standard, named both LF and NL (and that comes from ASCII but in ASCII I think 0x0a was only called LF).
It literally has both names in Unicode: but LINEFEED is written in uppercase while newline is written in lowercase (not kidding you). You can all see for yourself that U+000a has both names (and eol too):
https://www.unicode.org/charts/PDF/U0000.pdf
> and the article does not at all address that fact that the "ENTER" key on every keyboard sends a CR and not a LF.
what a key on a keyboard sends doesn't matter though. What matters is what gets written to files / what is sent over the wire.
... $ cat > /tmp/anonymousiam<ENTER>
<ENTER>
<CTRL-C>
... $ hexdump /tmp/anonymousiam
00000000 000a
When I hit ENTER at my Linux terminal above, it's LINEFEED that gets written to the file. Under Windows I take it the same still gets CRLF written to the file as in the Microsoft OSes of yore (?).

> Things work fine the way they are.
I agree
(stty raw)
Note that your job control characters will no longer function, so you will need to kill the cat command from a different terminal, then type: stty sane (or stty cooked) to restore your terminal to "normal" operation.
You will then see the 0d hex carriage return characters in the /tmp/anonymousiam file, and no 0a hex linefeed characters present.
There is, copying from a helpful comment above:
> The Unicode standard does call it NL along with LF.
000A <control>
= LINE FEED (LF)
= new line (NL)
= end of line (EOL)
Source: https://www.unicode.org/charts/PDF/U0000.pdf

And things don't work fine, there are many issues with this historical baggage
The fact that both CRLF and LF use the same control character is in my eyes a huge bonus for this type of action to actually work. Simply make everything cross platform and start ignoring CR completely. I’m surprised this isn’t mentioned explicitly as a course of action in the article; instead it focuses on making people change their understanding of LF into NL, which is an unnecessary complication that will cause inevitable bikeshedding around this idea.
Not really. In order to ignore CR you need to treat LF as NL.
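The caveat can be seen in a tiny sketch (function name is mine, purely illustrative): once you strip CR, LF alone has to carry the full "new line" meaning, and CR-only files (classic Mac style) lose their line breaks entirely:

```python
def strip_cr(text: str) -> str:
    # Ignore CR completely, so a bare LF marks every line end.
    # Caveat: a CR-only file loses all its line breaks, which is why
    # "ignore CR" only works once LF is treated as NL.
    return text.replace("\r", "")

strip_cr("a\r\nb")  # "a\nb" - CRLF collapses to a single LF
strip_cr("a\rb")    # "ab"   - CR-only line break is lost
```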
Insane. At first I thought it was an April 1st joke, but it's not.
Let's break everything because YES.
I'm kind of confused by this whole post.
I do understand the desire for simplification (let's ignore the argument of whether this is one), but...
Is this true?
It was used for "graphics" on character-only terminals.
I'm pretty sure drh is making a case only against the use of CRLF in protocols, not trying to redesign terminal in the process. If you're emulating a machine which understands LF then you're kinda stuck with line feed semantics, for better and for worse.
If a modern machine interprets LF as a newline, and the cursor is moved to the left of the current row before the newline is issued, wouldn't that add a newline _before_ the current line, i.e. a newline before the left most character of the current line? Obviously this isn't how it works but I don't understand why not.
We could certainly try to write no new software that uses them.
But last I checked, there are terabytes and terabytes of stored data in various formats (to say nothing of living protocols already deployed) and they aren't gonna stop using CRLF any time soon.
According to ChatGPT, the original proposal had:
Number of sentences: 60
Number of diphthongs: 128 (pairs of vowels in the same syllable like "ai", "ea", etc.)
Number of digraphs: 225 (pairs of letters representing a single sound, like "th", "ch", etc.)
Number of trigraphs: 1 (three-letter combinations representing a single sound, like "sch")
Number of silent letters: 15 (common silent letter patterns like "kn", "mb", etc.)
For all intents and purposes, CRLF is just another digraph.
> Selectric-based mechanisms were also widely used as terminals for computers, replacing both Teletypes and older typebar-based output devices. One popular example was the IBM 2741 terminal
> Teletypes (technically "teleprinters" - "teletype" was just the most popular brand name)
Like hey - why don't we start using the field separator and record separator characters when exporting/importing data.
But then you end up realizing that even when you are right, the energy it would take to push a change like that is astounding.
Those who successfully create an RFC and find a way push it through all the way to it becoming a standard are admirable people.
stop reinventing terms. it's literally standardized with the name "LF" / "line feed" in Unicode.
CR + LF was meant as an instruction for teletype printers, so it is outdated, and looks like he withdrew the proposal (which couldn’t have ever been serious) after some feedback.
Fossil SCM, btw, was written by the creator of SQLite, so his opinion shouldn’t be discounted as some random nobody.
But given the very first sentence:
> CR and NL are both useful control characters.
I'm willing to conclude that he doesn't intend A Blaste Against The Useless Appendage of Carriage Return Upon a New Line, or Line Feed As Some Style It, to apply to emulators of the old devices which make actual use of the distinction.
> I actually never worked on a hardware terminal in my entire career
I used to look books up at the library using a VT220. In the late 1990s they replaced that with an ASPX web browser endpoint running on PC hardware, and it was terrible. But I'm also not quite old enough to have used them for programming.
You're completely correct that it's no longer emulation of hardware terminals, there are dozens of input and output sequences which no hardware terminal ever used or understood. In many ways it's now emulation of XTerm, but even that era is slowly being left behind.
https://github.com/jftuga/chars
Stand-alone binaries are provided for all major platforms.
Yes CRLF is dumb. No, replacing it is not realistic.
Now just go pound sand. Seriously. And you owe me 5 minutes of my life wasted on reading the whole thing.
My god, I would have thought all those “simplification” ideas die off once you have 3 years of experience or more. Some people won’t learn.
P. S. Guess even the most brilliant people tend to have dumb ideas sometimes.
I mean it is all cool to have this idea, but the real-world implications, where half the stuff dangles on a text file, appear not to be considered here.
For clarity's sake, I am not saying don't do it. I am saying: how will that work?
edit: spaces, tabs and one crlf
That will make things better.
The reality is that existing protocols CANNOT be changed. Only new versions are released and the old ones (which might rely on CRLF) will never die.
Unicode have already done so - (NEL) https://www.compart.com/en/unicode/U+0085
In short - shut up and deal with it. Is it a mild, barely inconvenient nuisance to deal with different or mixed line endings? Yes. Is this actually a hard or difficult problem? No.
Stop trying to force everyone to break their backs so your life is inconsequentially easier. Deal with it and move on.
Allowing CRLF-less operation intentionally, especially in new implementations, and abusing protocol tolerance (just a bit) to switch current ones over. Should allow relatively gradual progress towards Less Legacy:tm: with basically no cost.
Not every change is "breaking your back" especially if you should be updating your systems anyways to implement other, larger and more important changes.
There will always be tech debt. Always and forever. Burn cycles on one that matters.
Regarding this issue…I don’t think the author is advocating for patching standards. Just consider CR as deprecated and use it only for backward compatibility.
I do it similarly. I don’t convert line endings but any new project uses LF irrespective of the OS and configured as such in the editor.