FilterHN

HocusLocus

1 month ago

[-]

I have lived my whole professional life with this being 'beyond obvious'... It's hard to imagine a generation where it's not. But then again, I did work with EBCDIC for a while and we were reading and translating ASCII log tapes (ITT/Alcatel 1210 switch, phone calls, memory dumps).

I once got drunk with my elderly unix supernerd friend and he was talking about TTYs and how his passwords contained embedded ^S and ^Q characters, and how he traced the login process to learn they were just stalling the tty, not actually being used to construct the hash. No one else at the bar got the drift. He patched his system to use 'raw' instead of 'cooked' mode for login passwords. He also used backspaces (^? and ^H) as part of his passwords. He was a real security tiger. I miss him.

Eduard

1 month ago

[-]

Regarding ^?: shouldn't that be ^_ instead?

dcminter

1 month ago

[-]

It doesn't seem to have been mentioned in the comments so far, but as a floppy-disk era developer I remember my mind was blown by the discovery that DEL was all-bits-set because this allowed a character on paper tape and punched card to be deleted by punching any un-punched holes!
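A minimal Python sketch of that overpunching idea, treating a tape frame as a 7-bit integer where a set bit is a punched hole. The helper name `overpunch` is made up for illustration.

```python
DEL = 0x7F  # all seven holes punched

def overpunch(code: int) -> int:
    """Punch every hole that is still unpunched in a 7-bit frame (hypothetical helper)."""
    return code | DEL

# Whatever character was on the tape before, the result is always DEL.
assert all(overpunch(c) == DEL for c in range(128))
```

Because holes can only be added, never removed, OR is the only operation the punch can perform, and DEL is the one code every other code can be turned into.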

axblount

1 month ago

[-]

Bit-level skeuomorphism! And since NUL is zero, does that mean the program ends wherever you stop punching? I've never used punch cards so I don't know how things were organized.

fix4fun

1 month ago

[-]

For me it was interesting that all digits in ASCII start with 0x3, e.g. 0x30 - 0, 0x31 - 1, ..., 0x39 - 9. I thought it was accidental, but in fact it was intended. This made it possible to build simple counting/accounting machines with minimal circuit logic using BCD (Binary Coded Decimal). That was a wow for me ;)
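In code, the digit layout boils down to a mask on the low nibble. This is just a sketch of the arithmetic; the function names are illustrative, not from any library.

```python
def digit_value(ch: str) -> int:
    """The low nibble of an ASCII digit is its BCD value."""
    return ord(ch) & 0x0F

def digit_char(value: int) -> str:
    """Put the 0x3 high nibble back on top of a BCD digit 0-9."""
    return chr(0x30 | value)

print([digit_value(c) for c in "0129"])  # [0, 1, 2, 9]
print(digit_char(7))                     # 7
```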

satiated_grue

1 month ago

[-]

ASCII was started in 1960. A terminal then would have been a mostly-mechanical teletype (keyboard and printer, possibly with paper tape reader/punch), without much by way of "circuit logic". Think of it more as a bit caused a physical shift of a linkage to do something like hit the upper or lower part of a hammer, or a separate set of hammers for the same remaining bits.

Look at the Teletype ASR-33, introduced in 1963.

fix4fun

1 month ago

[-]

Yes, that's true, the ASR-33 was the first application, but IBM had an impact on the ANSI/ASA committee and ASCII standardisation. In 1963 the IBM System/360 was using BCD with quick digit "parsing", and in its peripherals too. I remember it from some interview with an old IBM tech employee ;)

zahlman

1 month ago

[-]

And this is exactly why I find the usual 16x8 at least as insightful as this proposed 32x4 (well, 4x32, but that's just a rotation).

kibwen

1 month ago

[-]

I still wonder if it wouldn't have been better to let each digit be represented by its exact value, and then use the high end of the scale rather than the low end for the control characters. I suppose by 1970 they were already dealing with the legacy of backwards-compatibility, and people were already accustomed to 0x0 meaning something akin to null?

mmilunic

1 month ago

[-]

Either way you would still need some check to ensure your digits are digits and not some other type of character. Having zeroed out memory read as a bunch of NUL characters instead of like “00000000” would probably be useful, as “000000” is sometimes a legitimate user input

gpvos

1 month ago

[-]

NUL was often sent as padding to slow (printing) terminals. Although that was just before my time.

1 month ago

[-]

This is by design, so that case conversion and folding is just a bit operation.

The idea that SOH/1 is "Ctrl-A" or ESC/27 is "Ctrl-[" is not part of ASCII; that idea comes from the way terminals provided access to the control characters, by a Ctrl key that just masked out a few bits.
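As a rough sketch of that masking (assuming the simple "keep the low five bits" behaviour described here; real terminals special-case a few keys), with `ctrl` being a made-up helper name:

```python
def ctrl(ch: str) -> int:
    """Code sent for Ctrl+<ch> under the simple masking model: keep the low five bits."""
    return ord(ch) & 0x1F

for key in "A[CGM":
    print(f"Ctrl-{key} -> {ctrl(key)}")

# Shift makes no difference, because the bit it changes is masked away.
assert ctrl("x") == ctrl("X")
```

Under this model Ctrl-A gives 1 (SOH), Ctrl-[ gives 27 (ESC), Ctrl-C gives 3, Ctrl-G gives 7 (BEL) and Ctrl-M gives 13 (CR).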

muyuu

1 month ago

[-]

I guess it's an age thing, but I thought this was really basic CS knowledge. But I can see why this may be much less relevant nowadays.

teddyh

1 month ago

[-]

It’s on the list: <https://web.archive.org/web/20251103035213/https://www.catb....>

muyuu

1 month ago

[-]

thanks for that, it's brilliant and very true

Cthulhu_

1 month ago

[-]

I've been in IT for decades but never knew that ctrl was (as easy as) masking some bits.

muyuu

1 month ago

[-]

You can go back maybe 2 decades without this being very relevant, but not 3 given the low level scope that was expected in CS and EE back then.

1 month ago

[-]

I learned about it from 6502 machine language programming, from some example that did a simple bit manipulation to switch lower case to upper case. From that it became obvious that ASCII is divided into four banks of 32.
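Here is a small Python sketch of the "four banks of 32" view: the top two bits of the 7-bit code pick the bank and the low five bits pick the row. The function names are only illustrative.

```python
def bank_and_row(ch: str) -> tuple:
    """Split a 7-bit code into (bank 0-3, row 0-31)."""
    code = ord(ch)
    return code >> 5, code & 0x1F

def four_column_row(row: int) -> list:
    """The four characters that share the same low five bits."""
    return [chr((bank << 5) | row) for bank in range(4)]

print(bank_and_row("A"), bank_and_row("a"))  # (2, 1) (3, 1)
print(four_column_row(1)[1:])                # ['!', 'A', 'a']
```

The first entry of each row is the control character (bank 0), which is why it is sliced off before printing.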

aa-jv

1 month ago

[-]

Been an ASCII-naut since the 80's, so .. it's always amusing to see people type 'man ascii' for the first time, gaze upon its beauty, and wonder at its relevance, even still today ...

nine_k

1 month ago

[-]

Yes, the diagram just shows the ASCII table for the old teletype 6-bit code (and 5-bit code before), with the two most significant bits spread over 4 columns to show the extension that happened while going 5→6→7 bits. It makes obvious what was very simple bit operations on very limited hardware 70–100 years ago.

(I assume everybody knows that on mechanical typewriters and teletypes the "shift" key physically shifted the carriage position upwards, so that a different glyph would be printed when hit by a typebar.)

taejavu

1 month ago

[-]

For whatever reason, there are extraordinarily few references that I come back to over and over, across the years and decades. This is one of them.

taejavu

1 month ago

[-]

Tangentially related, there is much insight about Unix idioms to be gained from understanding the key layout of the terminal Bill Joy used to create vi: https://news.ycombinator.com/item?id=21586980

aa-jv

1 month ago

[-]

Not 'man ascii'?

1 month ago

[-]

If Unicode had used a full 32 bits from the start, it could have usefully reserved a few bits as flags that would divide it into subspaces, and could be easily tested.

Imagine a Unicode like this:

8:8:16

- 8 bits of flags.
- 8 bit script family code: 0 for BMP.
- 16 bit plane for every script code and flag combination.

The flags could do useful things like indicate character display width, case, and other attributes (specific to a script code).

Unicode peaked too early and applied an economy of encoding which rings false now in an age in which consumer devices have two digit gigabyte memories, multi terabyte of storage, and high definition video is streamed over the internet.
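To make the thought experiment concrete, here is a Python sketch of such an 8:8:16 layout. Everything in it is hypothetical: the field order, the `WIDE` flag bit and the function names are invented for illustration and correspond to nothing in real Unicode.

```python
WIDE = 0x01  # made-up flag bit meaning "double-width glyph"

def pack(flags: int, script: int, code: int) -> int:
    """Pack 8 bits of flags, an 8-bit script family and a 16-bit code into 32 bits."""
    if not (0 <= flags < 256 and 0 <= script < 256 and 0 <= code < 65536):
        raise ValueError("field out of range")
    return (flags << 24) | (script << 16) | code

def unpack(value: int) -> tuple:
    """Inverse of pack(): returns (flags, script, code)."""
    return (value >> 24) & 0xFF, (value >> 16) & 0xFF, value & 0xFFFF

def is_wide(value: int) -> bool:
    """An attribute test becomes a single mask instead of a table lookup."""
    return bool((value >> 24) & WIDE)

cp = pack(WIDE, 0, 0x0041)
print(hex(cp), unpack(cp), is_wide(cp))
```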

california-og

1 month ago

[-]

I made an interactive viewer some time ago (scroll down a bit):

https://blog.glyphdrawing.club/the-origins-of-del-0x7f-and-i...

It really helps understand the logic of ASCII.

ripe

1 month ago

[-]

Wow, that's quite a monograph you have there, complete with ascii art examples, history, and extensive footnotes! Fantastic work.

california-og

1 month ago

[-]

Thank you! :)

mbreese

1 month ago

[-]

I came across this a week ago when I was looking at some LLM generated code for a ToUpper() function. At some point I “knew” this relationship, but I didn’t really “grok” it until I read a function that converted lowercase ascii to uppercase by using a bitwise XOR with 0x20.

It makes sense, but it didn’t really hit me until recently. Now, I’m wondering what other hidden cleverness is there that used to be common knowledge, but is now lost in the abstractions.

Findecanor

1 month ago

[-]

A similar bit-flipping trick was used to swap between numeric row + symbol keys on the keyboard, and the shifted symbols on the same keys. These bit-flips made it easier to construct the circuits for keyboards that output ASCII.

I believe the layout of the shifted symbols on the numeric row were based on an early IBM Selectric typewriter for the US market. Then IBM went and changed it, and the latter is the origin of the ANSI keyboard layout we have now.

auselen

1 month ago

[-]

xor should toggle?

munk-a

1 month ago

[-]

That's correct. A toUpper would just clear the bit with AND (OR with 0x20 gives you toLower).

mbreese

1 month ago

[-]

I left out that the line before there was a check to make sure the input byte was between ‘a’ and ‘z’. This ensures that if the char is already upper case, you don’t do an extraneous operation. And at this point, XOR, AND with ~0x20, or even a subtract 0x20 would work. For some reason the LLM thought the XOR was faster.

I honestly wouldn’t have thought anything of it if I hadn’t seen it written as `b ^ 0x20`.
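For reference, a Python sketch of the shape of function being described: a range check followed by `b ^ 0x20`. This is a reconstruction under that description, not the generated code itself. Once the check has passed, bit 0x20 is known to be set, so XOR, AND with the inverted mask, and subtracting 0x20 all give the same byte; OR with 0x20 goes the other way, towards lowercase.

```python
def to_upper(b: int) -> int:
    """Uppercase one ASCII byte; anything outside a-z passes through unchanged."""
    if ord("a") <= b <= ord("z"):
        return b ^ 0x20  # same result here as b & ~0x20 or b - 0x20
    return b

print(bytes(to_upper(b) for b in b"Hello, world!"))  # b'HELLO, WORLD!'

# Inside the a-z range the three variants agree.
assert all((b ^ 0x20) == (b & ~0x20) == (b - 0x20) for b in range(ord("a"), ord("z") + 1))
```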

pixelbeat__

1 month ago

[-]

Some of this elegance discussed from a programmatic point of view: https://www.pixelbeat.org/docs/utf8_programming.html

rbanffy

1 month ago

[-]

This is also why the Teletype layout has parentheses on 8 and 9 unlike modern keyboards that have them on 9 and 0 (a layout popularised by the IBM Selectric). The original Apple IIs had this same layout, with a “bell” on top of the G.
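A quick Python sketch of why the parentheses land on 8 and 9, assuming the usual bit-paired rule that Shift flips bit 0x10 on the digit row (the helper name is invented):

```python
def bit_paired_shift(digit: str) -> str:
    """Shifted symbol on a bit-paired keyboard: flip bit 0x10 of the digit's code."""
    return chr(ord(digit) ^ 0x10)

print("".join(bit_paired_shift(d) for d in "123456789"))  # !"#$%&'()
```

By the same rule '0' (0x30) would map to 0x20, the space character, which is why the 0 key has no natural shifted symbol in this scheme.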

spragl

1 month ago

[-]

Modern keyboards = some keyboards. In the Nordic Countries modern keyboards have parentheses on 8 and 9.

debugnik

1 month ago

[-]

According to the layouts on this site, there're more European layouts with parentheses on 8, 9 than on 9, 0. (I had to zoom out to see the right-side of the comparisons.)

https://www.farah.cl/Keyboardery/A-Visual-Comparison-of-Diff...

Terretta

1 month ago

[-]

What happened to this block and the keyboard key arrangement?

  ESC  [  {  11011
  FS   \  |  11100
  GS   ]  }  11101

Also curious why the keys open and close braces, but ... the single and double curly quotes don't open and close, but are stacked. Seems nuts every time I type Option-{ and Option-Shift-{ …

1 month ago

[-]

You're no longer talking about ASCII. ASCII has only a double quote, apostrophe (which doubles as a single quote) and backtick/backquote.

Note on your Mac that the Option-{ and Option-}, with and without Shift, produce quotes which are all distinct from the characters produced by your '/" key! They are Unicode characters not in ASCII.

In the ASCII standard (1977 version here: https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub1-2-197...) the example table shows a glyph for the double quote which is vertical: it is neither an opening nor closing quote.

The apostrophe is shown as a closing quote, by slanting to the right; approximately a mirror image of the backtick. So it looks as though those two are intended to form an opening and closing pair. Except, in many terminal fonts, the apostrophe is just a vertical tick, like half of a double quote.

The ' being vertical helps programming language '...' literals not look weird.

jolmg

1 month ago

[-]

> What happened to this block and the keyboard key arrangement?

There's also these:

  | ASCII      | US keyboard |
  |------------+-------------|
  | 041/0x21 ! | 1 !         |
  | 042/0x22 " | 2 @         |
  | 043/0x23 # | 3 #         |
  | 044/0x24 $ | 4 $         |
  | 045/0x25 % | 5 %         |
  |            | 6 ^         |
  | 046/0x26 & | 7 &         |

kps

1 month ago

[-]

https://en.wikipedia.org/wiki/Bit-paired_keyboard

dveeden2

1 month ago

[-]

Also easy to see why Ctrl-D works for exiting sessions.

jez

1 month ago

[-]

I have a command called `ascii-4col.txt` in my personal `bin/` folder that prints this out:

https://github.com/jez/bin/blob/master/ascii-4col.txt

It's neat because it's the only command I have that uses `tail` for the shebang line.

dang

1 month ago

[-]

Related. Others?

Four Column ASCII (2017) - https://news.ycombinator.com/item?id=21073463 - Sept 2019 (40 comments)

Four Column ASCII - https://news.ycombinator.com/item?id=13539552 - Feb 2017 (68 comments)

unnah

1 month ago

[-]

If Ctrl sets bit 6 to 0, and Shift sets bit 5 to 1, the logical extension is to use Ctrl and Shift together to set the top bits to 01. Surely there must be a system somewhere that maps Ctrl-Shift-A to !, Ctrl-Shift-B to " etc.

maybewhenthesun

1 month ago

[-]

It's more that shift flips that bit. Also I'd call them bit 0 and 1 and not 5 and 6 as 'normally' you count bits from the right (least significant to most significant). But there are lots of differences for 'normal' of course ('middle endian' :-P )

Leszek

1 month ago

[-]

I guess in this system, you'd also type lowercase letters by holding shift?

seyz

1 month ago

[-]

This is why Ctrl+C is 0x03 and Ctrl+G is the bell. The columns aren't arbitrary. They're the control codes with bit 6 flipped. Once you see it, you can't unsee it. Best ASCII explainer I've read.

gpvos

1 month ago

[-]

Back in early times, I used to type ctrl-M in some situations because it could be easier to reach than the return key, depending on what I was typing.

renox

1 month ago

[-]

I still find it weird that they didn't put A, B... right after the digits; that would make binary to hexadecimal conversion more efficient.

iguessthislldo

1 month ago

[-]

Going off the timelines on Wikipedia, the first version of ASCII was published (1963) before the 0-9,A-F hex notation became widely used (>=1966):

- https://en.wikipedia.org/wiki/ASCII#History
- https://en.wikipedia.org/wiki/Hexadecimal#Cultural_history

jolmg

1 month ago

[-]

The alphanumeric codepoints are well placed hexadecimally-speaking though. I don't imagine that was just an accident. For example, they could've put '0' at 050/0x28, but they put it at 060/0x30. That seems to me that they did have hexadecimal in consideration.

kubanczyk

1 month ago

[-]

It's a binary consideration if you think of it rather than hexadecimal.

If you have to prominently represent 10 things in binary, then it's neat to allocate slot of size 16 and pad the remaining 6 items. Which is to say it's neat to proceed from all zeroes:

    x x x x 0 0 0 0
    x x x x 0 0 0 1
    x x x x 0 0 1 0
    ....
    x x x x 1 1 1 1

It's more of a cause for hexadecimal notation than an effect of it.

jolmg

1 month ago

[-]

Currently 'A' is 0x41 and 0101, 'a' is 0x61 and 0141, and '0' is 0x30 and 060. These are fairly simple to remember for converting between alphanumerics and their codepoint. Seems more advantageous, especially if you might be reasonably looking at punchcards.

tgv

1 month ago

[-]

[0-9A-Z] doesn't fit in 5 bits, which impedes shift/ctrl bits.

vanderZwan

1 month ago

[-]

I'm not sure if our convention for hexadecimal notation is old enough to have been a consideration.

EDIT: it would need to predate the 6-bit teletype codes that preceded ASCII.

kps

1 month ago

[-]

They put : ; immediately after the digits because they were considered the least used of the major punctuation, so that they could be replaced by ‘digits’ 10 and 11 where desired.

(I'm almost reluctant to spoil the fun for the kids these days, but https://en.wikipedia.org/wiki/%C2%A3sd )

gravifer

1 month ago

[-]

It really deserves some public documenting. The people who were designing a charset for ANSI must have tried to think everything through; even more so than for an 8-bit ISA, because the charset was going to be inter-typewriter.

ezekiel68

1 month ago

[-]

I love this stuff. It's the kind of lore that keeps getting forgotten and re-discovered by swathes of curious computer scientists over the years. So easy to assume many of the old artifacts (such as the ASCII table) had no rhyme or reason to them.

msarnoff

1 month ago

[-]

On early bit-paired keyboards with parallel 7-bit outputs, possibly going back to mechanical teletypes, I think holding Control literally tied the upper two bits to zero. (citation needed)

Also explains why there is no difference between Ctrl-x and Ctrl-Shift-x.

mac3n

1 month ago

[-]

credit to William Crosby, "Note on an ASCII-Octal Code Table", CACM 8.10, Oct 1965: https://dl.acm.org/doi/epdf/10.1145/365628.365652

also defined 6-bit ASCII subset

meken

1 month ago

[-]

Very cool.

Though the 01 column is a bit unsatisfying because it doesn’t seem to have any connection to its siblings.

y42

1 month ago

[-]

At first I was like "What, but why? You don't save any space, so what's that exercise about?" Then I read it again and it blew my mind. I thought I knew everything about ASCII. What a fool I am. Socrates was right. Always.

mac3n

1 month ago

[-]

anyone remember 005 ENQ (also called WRU who are you) and its effect on a teletype?

joshcorbin

1 month ago

[-]

Just wait until someone finally gets why CSI ( aka the “other escape” from the 8-bit ansi realm, which is now eternalized in unicode C1 block ) is written ESC [ in 7-bit systems, such as the equally now eternal utf-8 encoding
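A short Python sketch of the relationship being hinted at, going by the ECMA-48 convention that an 8-bit C1 control (0x80-0x9F) can be written in 7 bits as ESC followed by the byte 0x40 lower. The helper name is made up.

```python
ESC = 0x1B

def c1_to_7bit(c1: int) -> bytes:
    """Rewrite an 8-bit C1 control code as its two-byte ESC form."""
    if not 0x80 <= c1 <= 0x9F:
        raise ValueError("not a C1 control code")
    return bytes([ESC, c1 - 0x40])

print(c1_to_7bit(0x9B))         # b'\x1b['  i.e. ESC [
print("\x9b".encode("utf-8"))   # b'\xc2\x9b', so the raw 0x9B byte never appears alone in UTF-8
```

So CSI at 0x9B sits in the same row as '[' at 0x5B, two banks up, which is the same column trick again.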

SUDEEPSD25

1 month ago

[-]

Love this!

1 month ago

[-]

where does this character set come from? It looks different on xterm.

for x in range(0x0,0x20): print(chr(x),end=" ")

voxelghost

1 month ago

[-]

What are you trying to achieve? None of those characters are printable, and they're definitely not going to show up on the web.

    for x in range(0x0,0x20): print(f'({x}|{chr(x)})', end =' ')
    (0|) (1|) (2|) (3|) (4|) (5|) (6|) (7|) (8) (9| ) (10|
    ) (11|
          ) (12|
    ) (14|) (15|) (16|) (17|) (18|) (19|) (20|) (21|) (22|) (23|) (24|) (25|)    (26|␦) (27|8|) (29|) (30|) (31|)

1 month ago

[-]

Just asking why they have different icons in different environments? Maybe it is UTF-8 vs ISO-8859?

gschizas

1 month ago

[-]

UTF-8 is not technically a character set (because it has way more than 256 characters). Characters 32-127 in UTF8 are the same as ASCII, which is the same as the OEM/CP437 and the ANSI/ISO-8859/CP1252.

The characters in CP437 (and other OEM codepages) actually come from the ROM of the VGA (and EGA/CGA/MCGA/Hercules before them).

What you are referring to is those (visually), right? I'm missing some characters in the first line, because HN drops them.

    0123456789abcdef
   0...♥♦♣♠•◘○◙..♪♫.
   1►◄↕‼¶§▬↨↑↓→←∟↔▲▼

As far as I know, the equivalent control characters (characters 0-31) don't have any representation in CP1252, but that's also dependent on the font (since rendering of CP1252 is always done by Windows)

As to their origin, originally the full CP437 character set was taken from Wang word processors. I don't know where Wang took it from, but they probably invented it themselves.

EDIT: There's a more complete history here: https://www.os2museum.com/wp/weird-tales/

EDIT 2: The CP437 character set didn't seem to come directly from Wang; it's just that they took some (a lot) of characters from Wang word processors character sets. The positions of those "graphic" characters was decided by Microsoft when they made MS-DOS (at least according to Bill Gates).

1 month ago

[-]

In my screen there is indeed about thirty icons. When I executed the program on xterm, they were different and when I pasted them on LibreOffice they were again different. And now it seems this shit is also different in every country.

The world is broken.

rbanffy

1 month ago

[-]

They shouldn't show as visual representations, but some "ASCII" charts show the IBM PC character set instead of the ASCII set. IIRC, up to 0xFF UTF-8 and 8859 are very close with the exceptions being the UTF-8 escapes for the longer characters.

gschizas

1 month ago

[-]

There's no 0x80-0xFF in the UTF-8 encoding. Only up to 0x7F (127) it's the same.
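A quick way to see the difference in Python, using the standard `latin-1` (ISO-8859-1) and `utf-8` codecs. Bytes in the 0x80-0xFF range do occur in UTF-8 output, but only as parts of multi-byte sequences, never as a character on their own.

```python
text = "caf\u00e9"  # "café"; U+00E9 is above 0x7F

print(text.encode("latin-1"))  # b'caf\xe9'      one byte per character
print(text.encode("utf-8"))    # b'caf\xc3\xa9'  the last character takes two bytes

# Up to 0x7F the two encodings produce identical bytes.
ascii_text = "".join(chr(n) for n in range(0x80))
assert ascii_text.encode("latin-1") == ascii_text.encode("utf-8")
```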

1 month ago

[-]

Opera AI solved the problem:

If you want to use symbols for Mars and Venus for example, they are not in range(0,0x20). They are in the Miscellaneous Symbols block.

1 month ago

[-]

Ok this set does not even show on Android, just some boxes. Very strange.

Aardwolf

1 month ago

[-]

Imho ascii wasted over 20 of its precious 128 values on control characters nobody ever needs (except perhaps the first few years of its lifetime) and could easily have had degree symbol, pilcrow sign, paragraph symbol, forward tick and other useful symbols instead :)

ogurechny

1 month ago

[-]

Smaller, 6-bit code pages existed before and after that. They did not even have space for upper and lower case letters, but had control characters. Those codes directly moved the paper, switched to next punch card or cut the punched tape on the receiving end, so you would want them if you ever had to send more than a single line of text (or a block of data), which most users did.

Even smaller 5-bit Baudot code had already had special characters to shift between two sets and discard the previous character. Murray code, used for typewriter-based devices, introduced CR and LF, so they were quite frequently needed in way more than few years.

mmooss

1 month ago

[-]

It is interesting that, as a guess, we waste an average of ~5% of storage capacity for text (12.5% of Unicode's first byte, but many languages regularly use higher bytes of course).

I don't fault the creators of ASCII - those control characters were probably needed at the time. The fault is ours for not moving on from the legacy technology. I think some non-ASCII/Unicode encodings did reuse the control character bytes. Why didn't Unicode implement that? I assume they were trying to be compatible with some existing encodings, but couldn't they have chosen the encodings that made use of the control character code points?

If Unicode were to change it now (probably not happening, but imagine ...), what would they do with those 32 code points? We couldn't move other common characters over to them - those already have well-known, heavily used code points in Unicode and also iirc Unicode promises backward compatibility with prior versions.

There still are scripts and glyphs not in Unicode, but those are mostly quite rare and effectively would continue to waste the space. Is there some set of characters that would be used and be a good fit? Duplicate the most commonly used codepoints above 8 bits, as a form of compression? Duplicate combining characters? Have a contest? Make it a private area - I imagine we could do that anyway, because I doubt most systems interpret those bytes now.

Also, how much old data, which legitimately uses the ASCII control characters, would become unreadable?

bee_rider

1 month ago

[-]

On top of the control symbols being useful, providing those symbols would have reduced the motivation for Unicode, right?

ASCII did us all the favor of hitting a good stopping point and leaving the “infinity” solution to the future.

gpvos

1 month ago

[-]

Maybe 32 was a bit much, but even fitting a useful set of control characters into, say, 16, would be tricky for me. For example, ^S and ^Q are still useful when text is scrolling by too fast.

zygentoma

1 month ago

[-]

I started using the separator symbols (file, group, record, unit separator, ascii 60-63 ... though mostly the last two) for CSV like data to store in a database. Not looking back!

mmooss

1 month ago

[-]

I've wanted to do that but don't you have compatibility problems? What can read/import files with those delimiters? Don't people you are working with have problems?

gschizas

1 month ago

[-]

ASCII 60-63 is just <=>?

You probably mean 28-31 (∟↔▲▼, or ␜␝␞␟)

Unless this is octal notation? But 0o60-0o63 in octal is 0123
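For anyone wanting to check the numbers, a small Python sketch: the four separators are codes 28-31, and a bare-bones "CSV-like" round trip using the last two could look like this. The `dump`/`load` names are invented, and the sketch assumes the field text itself never contains the separator characters.

```python
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"  # file, group, record, unit separator

def dump(rows):
    """Join fields with US and records with RS; no quoting or escaping needed."""
    return RS.join(US.join(fields) for fields in rows)

def load(blob):
    """Inverse of dump()."""
    return [record.split(US) for record in blob.split(RS)]

rows = [["id", "note"], ["1", 'commas, "quotes" and\nnewlines are fine']]
assert load(dump(rows)) == rows

print([ord(c) for c in (FS, GS, RS, US)])           # [28, 29, 30, 31]
print("".join(chr(n) for n in range(60, 64)))       # <=>?
print("".join(chr(n) for n in range(0o60, 0o64)))   # 0123
```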