"For example, my compiler interprets "\n" (a sequence of backslash and character "n") in a string literal as "\n" (a newline character in this case). If you think about this, you would find this a little bit weird, because it does not have information as to the actual ASCII character code for "\n". The information about the character code is not present in the source code but passed on from a compiler compiling the compiler. Newline characters of my compiler can be traced back to GCC which compiled mine."
[1] https://github.com/gcc-mirror/gcc/blob/8a4a967a77cb937a2df45...
I think the author may be thinking of Ken Thompson's Turing Award lecture "Reflections on Trusting Trust".
All theories being bandied about should account for the fact that early C compilers appeared on non-ASCII systems that did not map \n "line feed" to decimal 10.
https://en.wikipedia.org/wiki/EBCDIC
As an added wrinkle, EBCDIC had both an explicit NextLine and an explicit LineFeed character.
For added fun:
The gaps between letters made simple code that worked in ASCII fail on EBCDIC. For example, for (c = 'A'; c <= 'Z'; ++c) putchar(c); would print the alphabet from A to Z if ASCII is used, but print 41 characters (including a number of unassigned ones) in EBCDIC.
Sorting EBCDIC put lowercase letters before uppercase letters and letters before numbers, exactly the opposite of ASCII.
The only guarantee in the C standard re: character encoding was that the digits '0'-'9' mapped in contiguous ascending order. In theory* simple C programs (that printed 10 lines of "Hello World") should have the same source that compiled on either ASCII or EBCDIC systems and produced the same output.
* many pitfalls aside
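A minimal sketch of both pitfalls, assuming only what the standard actually guarantees (digit contiguity); the alphabet loop is the part that breaks on EBCDIC:

#include <stdio.h>

int main(void) {
    /* Non-portable: assumes 'A'..'Z' are contiguous codes (true in ASCII,
       false in EBCDIC, where the gaps let unassigned codes leak in). */
    for (int c = 'A'; c <= 'Z'; ++c)
        putchar(c);
    putchar('\n');

    /* Portable: walk the letters themselves, whatever their codes are. */
    for (const char *p = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; *p; ++p)
        putchar(*p);
    putchar('\n');

    /* Portable: '0'..'9' are guaranteed contiguous and ascending,
       so digit-string conversion works on ASCII and EBCDIC alike. */
    int n = 0;
    for (const char *p = "1984"; *p; ++p)
        n = n * 10 + (*p - '0');
    printf("%d\n", n);
    return 0;
}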
Despite EBCDIC having a newline/next line character (NEL), it is rarely encountered on many EBCDIC systems. Early on, most EBCDIC systems (e.g. MVS, VM/CMS, OS/400, DOS/VSE) did not store text as byte stream files, but instead as record-oriented files, storing lines as fixed-length or variable-length records. With fixed-length records, you'd declare a record length when creating the file (80 or 132 were the most common choices); every line in the file had to be of that length: shorter lines would be padded (normally with the EBCDIC space character, which is 0x40 not 0x20), and longer lines would either be truncated or a continuation character would be used. With variable-length records, each record was prefixed with a record descriptor word (RDW) which gave its length (and a couple of spare bytes that theoretically could be used for additional metadata). However, in practice the use of variable-length records for text files (including program source code) was rather rare; fixed-length records were the norm.
So even though NEL exists, it wasn't normally used in files on disk. Essentially, newline characters such as NEL are "in-band signalling" for line/record boundaries, but record-oriented filesystems used "out-of-band signalling" instead. I'm not sure exactly how stdio was implemented in the runtime libraries of EBCDIC C compilers – I assume \n did map to NEL internally, but then the stdio layer treated it as a record separator, and then wrote each record using a separate system call, padding as necessary.
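Purely as a guess at what that might look like (not actual IBM runtime code; write_record, RECLEN and the truncation policy are all invented here for illustration), the stdio layer's job for fixed-length records would be roughly:

/* Hypothetical sketch only: one '\n'-terminated line becomes one fixed-length
   record.  write_record() stands in for whatever record-level primitive the
   real runtime uses. */
#include <string.h>

#define RECLEN       80     /* declared when the file was created */
#define EBCDIC_SPACE 0x40   /* EBCDIC space, not ASCII 0x20 */

extern void write_record(const unsigned char *rec, size_t len);  /* assumed OS call */

static void put_line(const unsigned char *line, size_t len)
{
    unsigned char rec[RECLEN];
    if (len > RECLEN)
        len = RECLEN;                               /* or emit a continuation record */
    memcpy(rec, line, len);
    memset(rec + len, EBCDIC_SPACE, RECLEN - len);  /* pad shorter lines */
    write_record(rec, RECLEN);                      /* the newline itself is never stored */
}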
Later on, most of these operating systems gained POSIX compatibility subsystems, at which point they gained byte stream files as exist on mainstream systems. IBM systems generally support tagging files with a code page, so the files can be a mix of EBCDIC and ASCII, and the OS will perform translation between them in the IO layer (so an application which uses EBCDIC at runtime can read an ASCII file as EBCDIC, without having to manually call any character encoding conversion APIs, or be explicitly told whether the file to be read is EBCDIC or ASCII). Newer applications make increasing use of the POSIX-based filesystems, but older applications still mostly store data (even text files and program source code) in the classic record-oriented file systems.
From what I understand, the most common place EBCDIC NEL would be encountered in the wild was EBCDIC line mode terminal connections (hard copy terminals such as IBM 2741 and IBM 3767).
But the way in which this phenomenon is explained is via actual code. The code itself is beside the point of course; it's not like anyone will ever run or compile this specific code, but it's put there for humans to follow the discussion.
/// ProcessCharEscape - Parse a standard C escape sequence, which can occur in
/// either a character or a string literal.
static unsigned ProcessCharEscape(const char *ThisTokBegin,
                                  const char *&ThisTokBuf,
                                  const char *ThisTokEnd, bool &HadError,
                                  FullSourceLoc Loc, unsigned CharWidth,
                                  DiagnosticsEngine *Diags,
                                  const LangOptions &Features) {
  const char *EscapeBegin = ThisTokBuf;
  // Skip the '\' char.
  ++ThisTokBuf;

  // We know that this character can't be off the end of the buffer, because
  // that would have been \", which would not have been the end of string.
  unsigned ResultChar = *ThisTokBuf++;
  switch (ResultChar) {
  ...
  case 'n':
    ResultChar = 10;
    break;
  ...
/* Convert an escape sequence (pointed to by FROM) to its value on
   the target, and to the execution character set.  Do not scan past
   LIMIT.  Write the converted value into TBUF, if TBUF is non-NULL.
   Returns an advanced pointer.  Handles all relevant diagnostics.
   If LOC_READER is non-NULL, then RANGES must be non-NULL: location
   information is read from *LOC_READER, and *RANGES is updated
   accordingly.  */
static const uchar *
convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
                struct _cpp_strbuf *tbuf, struct cset_converter cvt,
                cpp_string_location_reader *loc_reader,
                cpp_substring_ranges *ranges, bool uneval)
{
  /* Values of \a \b \e \f \n \r \t \v respectively.  */
#if HOST_CHARSET == HOST_CHARSET_ASCII
  static const uchar charconsts[] = {  7,  8, 27, 12, 10, 13,  9, 11 };
#elif HOST_CHARSET == HOST_CHARSET_EBCDIC
  static const uchar charconsts[] = { 47, 22, 39, 12, 21, 13,  5, 11 };
#else
#error "unknown host character set"
#endif

  uchar c;

  /* Record the location of the backslash.  */
  source_range char_range;
  if (loc_reader)
    char_range = loc_reader->get_next ();

  c = *from;
  switch (c)
    {
    ...
    case 'n': c = charconsts[4]; break;
    ...
The ultimate point of this exercise is to alter your perception of what a compiler is (in the same way as the famous Reflections On Trusting Trust presentation).
Which is to say: your compiler is not something that outputs your program; your compiler is also input to your program. And as a program itself, your compiler's compiler was an input to your compiler, which makes it transitively an input to your program, and the same is true of your compiler's compiler's compiler, and your compiler's compiler's compiler's compiler, and your compiler's compiler's compiler's compiler's compiler, and...
It's an interesting real-world example of the Ken Thompson hack.
As a human I can just Google "C string escape codes", but that table is nowhere to be found inside the compiler. If C 2025 is going to define Start of Heading as \h, is `'h' => cooked.push('\h')` going to magically start working? How could it possibly know?
Clearly at some point someone must've manually programmed a `'n' => 10` mapping, but where is it!?
You can look at the ASCII table then.
10 is the line feed. '\n' is there to help humans because it makes more sense than '10'.
Nobody is asking why 'a' is equal to 0x61.
From the codebase, you know that '\n' is a char. A char is a value between 0 and 255; if you explicitly convert '\n' to int then you happen to find the ASCII value and you are good to go, and there is no need to pretend there is any poetry in this.
Requoting you:
> "if you see 'string of newline character', output 'newline character'"
It simply becomes "if you see 'the arbitrary symbol for the new line', output 'the corresponding ASCII value'".
I read the quote as "if you see 'a', output 'a' in ASCII code", which is not mysterious in any way.
IMO, the article does not make sense at all, because it pretends to wonder where a hexadecimal value is coming from; but with any other kind of symbol it would be the exact same article: you enter 'a', find some weird hexadecimal value, and you won't be able to correctly trace it from the source code.
It would only make sense if 'a' displayed as 'a' but '\n' displayed as some hexadecimal garbage in a text editor.
But how does the computer know which int to output when you "explicitly convert '\n' to int"? As humans, we can clearly just consult the ASCII table and/or the relevant language standard, but the computer doesn't have a brain and a pair of eyes; instead, it must store that association somewhere. The purpose of this article is to locate where the association was originally entered into the source code by some human.
The question is less interesting for ordinary characters like 'a', since the codes for those are presumably baked into the keyboard hardware and the font files, and no further translation is needed.
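One way to see where the association does not live: by the time the program runs, '\n' is already just a number, fixed during translation. A tiny check, assuming an ASCII/UTF-8 execution character set (an EBCDIC-targeting GCC would substitute 21 instead, per the charconsts table quoted elsewhere in this thread):

#include <stdio.h>

/* By run time there is no table left to consult: the compiler substituted
   the value during translation.  (Values assume an ASCII-based execution
   character set.) */
_Static_assert('\n' == 10, "resolved before the program ever runs");

int main(void) {
    printf("%d %d %d\n", '\t', '\n', '\v');   /* prints: 9 10 11 */
    return 0;
}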
It's true that the question is less interesting for regular characters, but your explanation why is way off base.
Consider a computer whose only I/O is a serial console. It is concerned with neither a keyboard nor a font file.
That is, suppose I design a font file so that character 0x61 has glyph 'b' and 0x62 has glyph 'a', and I accordingly swap the key caps for 'A' and 'B' on my keyboard. If I write a document with this font file and print it off, then no one looking at it could tell that my characters had the wrong codes. Only the spell-checker on my computer would complain, since it's still following its designers' ideas of what the character codes 0x61 and 0x62 are supposed to mean within a word.
But physical computers knew what to insert immediately, because there was a 0x0a somewhere in the binary every time.
> I read the quote as "if you see 'a', output 'a' in ascii code." which is not mysterious in any kind of way.
Only, it's not like that.
It's like:
> If you see a backslash followed by n, output a newline.
There's no 'newline character' in the input we are parsing here.
I suggest reading the article, to find out just how badly you’re missing the point.
Check the first Unicode codepoints; 10 is defined there:
000A is LINE FEED (LF) = new line (NL), end of line (EOL)
It was already 10 in ASCII too (the first 128 codepoints of Unicode are the same as ASCII).
So to answer your question: it's neither 9 nor 11 because '\n' stands for "new line" and not for "character tabulation" nor "line tabulation" (which is what 9 and 11 respectively stand for).
> Clearly at some point someone must've manually programmed a `'n' => 10` mapping
I don't disagree with that.
The point is how the compiler knows that. Read the article.
So I find this quite confusing; maybe OCaml does not have octal escapes but decimal ones, and \09 is the Tab character. I haven't checked.
Instead, consider the convention of control characters, such as ^C (interrupt), ^G (bell), or ^M (carriage return). Those are characters in the C0 control set, where ^C is 0x03, ^G is 0x07, and ^M is 0x0D. You're seeing a bit of cleverness that goes back to pre-Unix days: to represent the invisible C0 characters in ASCII, a terminal prepends the "^" character and prints the character XORed with 0x40, shifting it into a visible range.
You may want to pull up an ASCII table such as https://www.asciitable.com to follow along. Each control character (first column) is mapped to the ^character two columns over, on that table.
That's why \0 is represented with the odd choice of ^@, the escape key becomes ^[, and other hard-to-remember equivalents. These weren't choices made by Unix authors, they're artifacts of ASCII numbering.
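A little illustration of the trick, assuming an ASCII-compatible environment: the printable name of a C0 control character is just the same code with bit 0x40 flipped.

#include <stdio.h>

/* Caret notation: a C0 control code prints as '^' plus the same code with
   bit 0x40 flipped, e.g. 0x03 -> "^C", 0x00 -> "^@", 0x1B -> "^[". */
static void print_caret(unsigned char c) {
    if (c < 0x20)
        printf("^%c", c ^ 0x40);
    else
        putchar(c);
}

int main(void) {
    print_caret(0x00);   /* ^@ */
    print_caret(0x07);   /* ^G */
    print_caret(0x0D);   /* ^M */
    print_caret(0x1B);   /* ^[ */
    putchar('\n');
    return 0;
}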
.title {
  font-variant: small-caps;
}
Whence '\n'?
Many systems use \N in CSVs or similar as NULL, to distinguish from an empty string.
I figured this is what the article was about?
'\N{PILE OF POO}'
is the Unicode string containing a single USV, the pile of poop emoji. Much more self-documenting than doing it with a hex sequence with \u or \U.
Running the "Reflections on Trusting Trust" Compiler - https://news.ycombinator.com/item?id=38020792 - Oct 2023 (67 comments)
How do you make the carriage (a literal thing on a chain being driven back and forth) "return" to the start of the line without a "carriage return" code?
How would you make the paper feed go up one line without a "line feed" code?
Same for ringing the bell, tabs, backspace etc.
A "new line" on a teletype was actually two characters, a CR and a LF.
Unfortunately, using one of the other codes (e.g. RS, "Record Separator") as the line terminator would have saved billions of CPU cycles, and countless misinterpreted text files, spent dealing with CRLF, CR, LF, and LFCR all meaning "new line".
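For a taste of the busywork those cycles go into, here is a rough sketch of a normalizer that has to treat CR, LF, and CRLF as the same thing (with a single dedicated record separator this loop would not need to exist):

#include <stdio.h>

/* Copy stdin to stdout, treating CR, LF, and CRLF as the same "new line". */
int main(void) {
    int c, prev = 0;
    while ((c = getchar()) != EOF) {
        if (c == '\r')
            putchar('\n');
        else if (c == '\n') {
            if (prev != '\r')      /* the LF of a CRLF was already emitted */
                putchar('\n');
        } else
            putchar(c);
        prev = c;
    }
    return 0;
}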
ASCII is the layer that doesn't have escape codes (although it does have a single code for ESC); ASCII is just a set of mappings from 7-bit numbers to/from mostly-printable characters.
Terminal control is a fairly easy answer: there would be some other API to control cursor position, so the code would need to call some function to move the cursor to the next line.
For files, it would depend on what the format is. So we might be writing just `<p>hello world</p>` instead of `hello world\n`. In fact I find it a bit weird that we are using a teletype (and telegraph etc.) control protocol (which is what ASCII mostly is) as our "general purpose text" format; it doesn't make much sense to me.
This is what `chr(39)` is for in the following Python quine:
a = 'a = {}{}{}; print(a.format(chr(39), a, chr(39)))'; print(a.format(chr(39), a, chr(39)))
[1]: https://en.wikipedia.org/wiki/Quine_(computing)
a = '''a = {}
print(a.format(repr(a)))'''
print(a.format(repr(a)))
(OK, it's not a quine yet, because the output has slightly different formatting. But the formatting very quickly reaches a fixed point, if you keep feeding it into a Python interpreter.) The actual quine is the slightly less readable:
a = 'a = {}\nprint(a.format(repr(a)))'
print(a.format(repr(a)))
https://github.com/matthiasgoergens/Quine/blob/c53f187a4b403... has a C quine that doesn't rely on ASCII values either.
Perhaps like [0] (Unicode notation).
First time is the wrong way up
Second time is also the wrong way up
Third time works
You never really know it's right until you take it out and test the friction against the other orientation.
This explains the Apple connector branding...
USB-A connection is not a classic system; you have to collapse the wavefunction before the connectors can match.
The proper way is to always look first.
A randomly occurring old-school full-size type B may be encountered during any cable search, approximately 1% of the time, usually at the same moment your printer jams.
What I really don't understand, however, is why I keep finding DB13W3s in my closet
... Isn't progress wonderful?
(Jk, it's so much better than USB-A. But the wtf moments are real)
I’m sure it’s an improvement.
//const int ZERO = '0';
const int ZERO = 0x30;

int convert(const char *s){
    int ret = 0;
    for (; *s; s++){
        ret *= 10;
        ret += *s - ZERO;
    }
    return ret;
}
After that you just dump the value to the output verbatim. Any high level language can handle this; it's just a change of representation like any other. There's no need to do any further delegation.
Also, the original Rust developers were OCaml devs: Rust borrowed design features (like matching and additive types) from OCaml, the syntax for lifetimes in Rust is the syntax for polymorphism in OCaml, and they borrowed semantics such as latent typing.
While your rationale was used to argue for its inclusion in ASCII, as an origin story it is very unlikely, since (according to the wiki again): "The earliest known reference found to date is a 1937 maintenance manual from the Teletype Corporation with a photograph showing the keyboard of its Kleinschmidt keyboard perforator WPE-3 using the Wheatstone system."
The Kleinschmidt keyboard perforator was used for sending telegraphs, and is not well equipped with mathematical symbols, or indeed any symbols at all besides forward slash, backslash, question mark, and equals sign. Not even period!
Were A...X already taken?