Plain text has been around for decades and it’s here to stay
114 points | 10 hours ago | 12 comments | unsung.aresluna.org
TazeTSchnitzel
37 minutes ago
[-]
> Fun to see a contemporary take on something that peaked between 1970s–1980s

Maybe that was the peak, but there were some very good TUIs for DOS apps in the early 1990s, when Windows hadn't quite taken over yet. By then you very likely had a VGA-compatible graphics card and monitor, meaning you had a good, high-resolution, crisp text mode with configurable fonts, and you likely had a mouse as well. This is the stuff I grew up with: QBASIC and EDIT.COM, for example. Bisqwit has a cool video about how some apps from that era could even have a proper mouse cursor: https://www.youtube.com/watch?v=7nlNQcKsj74

reply
ssivark
7 hours ago
[-]
Couldn't help riffing on a tangent from the title (since the article is about diagramming tools)...

Dylan Beattie has a thought-provoking presentation for anyone who believes that "plain text" is a simple / solid substrate for computing: "There's no such thing as plain text" https://www.slideshare.net/slideshow/theres-no-such-thing-as... (you'll find many videos from different conferences)

reply
rmunn
3 hours ago
[-]
Haven't watched the videos yet, but from the slides, it looks like part of the issue he was talking about was encodings (there's a slide illustrating UTF-16LE vs UTF-16BE, for example). Thankfully, with UTF-8 becoming the default everywhere (so that you need a really good reason not to use it for any given document), we're back at "yes, there is such a thing as plain text" again. It has a much larger set of valid characters, but if you receive a text file without knowing its encoding, you can just assume it's UTF-8 and have a 99.7% chance of being right.
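
That "just assume UTF-8" strategy is easy to sketch (a toy helper, not any particular library's API; the Latin-1 fallback is my own arbitrary choice here):

```python
def read_text(path):
    """Read a file as UTF-8; fall back to Latin-1 only if UTF-8 decoding fails.

    Latin-1 is a hypothetical fallback: it never raises, but a wrong
    guess just yields mojibake instead of an error.
    """
    with open(path, "rb") as f:
        data = f.read()
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")
```

The nice property is that the fallback path is rarely taken: most Latin-1 or code-page text that isn't pure ASCII fails UTF-8 validation, so the guess is self-correcting in practice.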

FINALLY.

reply
thaumasiotes
49 minutes ago
[-]
> Thankfully, with UTF-8 becoming the default everywhere (so that you need a really good reason not to use it for any given document), we're back at "yes, there is such a thing as plain text" again.

Whenever I hear this, I hear "all text files should be 50% larger for no reason".

UTF-8 is pretty similar to the old code page system.

reply
mort96
45 minutes ago
[-]
Hm? UTF-8 encodes all of ASCII with one byte per character, and is pretty efficient for everything else. I think the only advantage UTF-16 has over UTF-8 is that some ranges (such as Han characters, I believe?) are often 3 bytes in UTF-8 while they're 2 bytes in UTF-16. Is that your use case? It seems weird to describe that as "all text files", though.
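
For the curious, those per-character byte counts are easy to check (a quick sketch; the comments note the expected sizes):

```python
# Bytes per character in UTF-8 vs UTF-16 for a few representative scripts.
samples = {
    "ASCII 'a'": "a",      # 1 byte UTF-8, 2 bytes UTF-16
    "Cyrillic 'д'": "д",   # 2 bytes UTF-8, 2 bytes UTF-16
    "Han '漢'": "漢",      # 3 bytes UTF-8, 2 bytes UTF-16
    "Emoji '😀'": "😀",    # 4 bytes UTF-8, 4 bytes UTF-16 (surrogate pair)
}
for name, ch in samples.items():
    u8 = len(ch.encode("utf-8"))
    u16 = len(ch.encode("utf-16-le"))  # LE variant, so the BOM isn't counted
    print(f"{name}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

So the UTF-16 win is confined to the 3-vs-2 range (mostly BMP characters above U+07FF), and it loses by 2x on every ASCII character.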
reply
thaumasiotes
44 minutes ago
[-]
UTF-8 encodes European glyphs in two bytes and oriental glyphs in three bytes. This is due to the assumption that you're not going to be using oriental glyphs. If you are going to use them, UTF-8 is a very poor choice.
reply
mort96
40 minutes ago
[-]
UTF-8 does not encode "European glyphs" in two bytes, no. Most European languages use variations of the Latin alphabet, meaning most glyphs in European languages use the 1-byte ASCII subset of UTF-8. The occasional non-ASCII glyph becomes two bytes, that's correct, but that's a much smaller bloat than what you imply.

Anyway, what are you comparing it to, what is your preferred alternative? Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information and you can't mix languages in a text file? Or do you prefer using UTF-16, where all of ASCII is 2 bytes per character but you get a marginal benefit for Han texts?

reply
thaumasiotes
37 minutes ago
[-]
> Do you prefer using code pages so that the bytes in a file have no meaning unless you also supply code page information?

Yes. Note that this is already how Unicode is supposed to work. See e.g. https://en.wikipedia.org/wiki/Byte_order_mark .

A file isn't meaningful unless you know how to interpret it; that will always be true. Assuming that all files must be in a preexisting format defeats the purpose of having file formats.

> Most European languages use variations of the Latin alphabet

If you want to interpret "variations of Latin" really, really loosely, that's true.

Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters. This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.

reply
harmonics
15 minutes ago
[-]
As someone who has been using Cyrillic writing all my life, I've never noticed this bloat you're speaking of, honestly...

Maybe if you're one of those AI behemoths who work with exabytes of training data, it would make some sense to compress it down by less than 50% (since we're using lots of Latin terms and acronyms and punctuation marks, which all fit in one byte in UTF-8).

On the web and in other kinds of daily text processing, one poorly compressed image or one JavaScript-heavy webshite obliterates all "savings" you would have had in that week by encoding text in something more efficient.

It's the same with databases. I've never seen anyone pick anything other than UTF-8 in the last 10 years at least, even though 99% of what we store there is in Cyrillic. I sometimes run into old databases, which are usually Oracle, that were set up in the 90s and never really upgraded. The data is in some weird encoding that you haven't heard of for decades, and it's always a pain to integrate with them.

I remember the days of codepages. Seeing broken text was the norm. Technically advanced users would quickly learn to guess the correct text encoding by the shapes of glyphs we would see when opening a file. Do not want.

reply
mort96
34 minutes ago
[-]
UTF-8 does not require a byte order mark. The byte order mark is a technical necessity born from UTF-16 and a desire to store UTF-16 in a machine's native endianness.

The byte order mark has no relation to code pages.
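
A quick sketch of why: the BOM is just the character U+FEFF, and only in UTF-16 do its encoded bytes depend on byte order:

```python
# U+FEFF is the byte order mark. Its on-disk bytes show why UTF-16
# needs it and UTF-8 does not:
bom = "\ufeff"
print(bom.encode("utf-16-be").hex())  # feff   -> reader infers big-endian
print(bom.encode("utf-16-le").hex())  # fffe   -> reader infers little-endian
print(bom.encode("utf-8").hex())      # efbbbf -> always the same; no byte order to signal
```

In UTF-8 the three-byte sequence EF BB BF is identical everywhere, which is why a BOM there is merely a signature, not a necessity.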

I don't think you know what you're talking about and I do not think further engagement with you is fruitful. Bye.

EDIT: Okay, since you edited your comment to add the part about Greek and Cyrillic after I responded, I'll respond to that too. Notice how I did not say "all European languages". Norwegian, Swedish, French, Danish, Spanish, German, English, Polish, Italian, and many other European languages have writing systems where typical texts are "mostly ASCII with a few special symbols and diacritics here and there". Yes, Greek and Cyrillic are exceptions. That does not invalidate my point.

reply
lelanthran
1 hour ago
[-]
I can't tell what the argument is just from the slideshow. The main point appears to be that code pages, UTF-16, etc. are all "plain text", but not really.

If that really was the argument, then it is, in 2026, obsolete; UTF-8 is everywhere.

reply
benj111
1 hour ago
[-]
He has a YouTube channel, there's a talk on there.

He also discusses code pages etc.

I don't think the thesis is wrong. E.g. when I think plain text, I think ASCII, so we're already disagreeing about what 'plain text' is. His point isn't that we don't have a standard; it's that we've had multiple standards for what we think is the most basic of formats, with lots of hidden complications.

reply
2b3a51
1 hour ago
[-]
Tangent to the article: text-character-based charts for statistics. Decades ago I had an education version of MINITAB that ran under DOS and did scatter diagrams, dotplots, and box-and-whisker plots from text characters (you could use pure text, proper ASCII I think, or set an option to use those DOS line-drawing characters). The idea was to encourage initial data exploration before launching into formal statistical tests.
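
Something in that spirit is trivial to sketch yourself (a toy dotplot, not MINITAB's actual output; swap "•" for "." if you want pure ASCII):

```python
from collections import Counter

def dotplot(values):
    """Return a text dotplot: one mark per observation, stacked beside its value."""
    counts = Counter(values)
    return [f"{v:>6} | " + "•" * counts[v] for v in sorted(counts)]

for line in dotplot([1, 2, 2, 3, 3, 3, 3, 4, 4, 5]):
    print(line)
```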

Anyone know of a terminal program that can do proper dotplots?

reply
suprjami
6 hours ago
[-]
The list at the top could be longer:

- https://asciiflow.com/

- https://asciidraw.github.io/

Anybody know more?

reply
snackbroken
4 hours ago
[-]
D2 https://d2lang.com/ added beta support for ASCII & Unicode output last year.
reply
suprjami
4 hours ago
[-]
That would be interesting. I like D2, though the lack of control over the layout is a bit frustrating sometimes.
reply
smusamashah
1 hour ago
[-]
https://xosh.org/text-to-diagram a list of lots of tools
reply
electroglyph
54 minutes ago
[-]
reply
1024kb
2 hours ago
[-]
I have a few more on my site under the bookmarks page. Link in bio.
reply
4k93n2
3 hours ago
[-]
reply
dwb
2 hours ago
[-]
Plain text is great as far as it goes, but when it comes to structure you start from zero for every file. There’s always someone getting wistful about ad-hoc combinations of venerable Unix tools to process “plain text”, and that’s fine when you’re in an ad-hoc situation, but it’s no substitute for a well-specified format.
reply
adityaathalye
2 hours ago
[-]
XML, JSON, YAML, RDF, EDN, LaTeX, OrgMode, Markdown... Plenty of plaintext, but structured information formats that are "yes, and". Yes, I can process them as lines of plain text, and I can do structured data transformations on them too, and there are clients (or readers) that know how to render them in WYSIWYG style.
reply
dwb
2 hours ago
[-]
If that’s our definition of “plain text”, sure. I would still rather our tools were more advanced, such that printable and non-printable formats were on a more equal footing, though. I always process structured formats through something that understands the structure, if I can, so I feel that the only benefit I regularly get out of formats being printable is that I have to use tools that only cope with printable formats. The argument starts getting a bit circular for me.
reply
draven
3 hours ago
[-]
Also: M-x artist-mode in emacs.
reply
dlcarrier
6 hours ago
[-]
From the title, I was not expecting a bunch of extended ASCII characters.
reply
Freak_NL
4 hours ago
[-]
The article mentions that 'ASCII', in the context of those tools, should not be taken to mean the limited ASCII character set. Personally, I would avoid mentioning ASCII at all.

The title just talks of plain text though, and plain text usually means UTF-8 encoded text these days. Plain, as in conventional, standardised, portable, and editable with any text editor. I would be surprised if someone talked about plain text as being limited to just ASCII.

reply
benj111
57 minutes ago
[-]
I would?

Would an emoji count as plain text?

What about right to left text? I have no idea how many editors handle that.

reply
keyle
3 hours ago
[-]
I'm all for it, but it's dangerously mixing ASCII with the meaning of plain-text...
reply
OuterVale
7 hours ago
[-]
Unsung is one of the best little blogs around. Well worth checking out the rest of the posts.
reply
Joel11
2 hours ago
[-]
It's good to see plain text getting attention; people have wanted this for a while.

So many users want special fonts, but here the simple is what's special to the eye and mind.

As a developer I agree. Sometimes simplicity is more powerful than complex formats.

reply
hsbauauvhabzb
2 hours ago
[-]
* L a u g h s i n u t f 1 6 *
reply
randomeel
24 minutes ago
[-]
That's cool! How did you do that?
reply
nullhole
6 hours ago
[-]
I have a mixed opinion of unicode, but it's hard not to love the box-drawing / block-element chars.
reply
shevy-java
3 hours ago
[-]
Text and text files are simple. I think this is their #1 advantage.

There are limitations though. Compare a database of .yml files to a database in a DBMS. I wrote a custom forum via Ruby + YAML files. It works, but it cannot compete anywhere with e.g. rails/activerecord and so forth. Its sole advantage is simplicity. Everywhere else it loses without even a fight.

reply