Awk Technical Notes (2023) (maximullaris.com)
150 points | 12 days ago | 11 comments | HN
dietrichepp
1 day ago
[-]
Awk is still one of my favorite tools because its power is underestimated by nearly everyone I see using it.

    ls -l | awk '{print $3}'
That’s typical usage of Awk, where you use it in place of cut because you can’t be bothered to remember the right flags for cut.

But… Awk, by itself, can often replace entire pipelines. Reduce your pipeline to a single Awk invocation! The only drawback is that very few people know Awk well enough to do this, and this means that if you write non-trivial Awk code, nobody on your team will be able to read it.

Every once in a while, I write some tool in Awk or figure out how to rewrite some pipeline as Awk. It’s an enrichment activity for me, like those toys they put in animal habitats at the zoo.
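For instance, a grep | cut | sort | uniq -c style pipeline can collapse into one awk invocation. A minimal sketch on made-up input (the field names are hypothetical):

```shell
# Count occurrences of field 2 on lines matching /err/, replacing
#   grep err | cut -d' ' -f2 | sort | uniq -c
# with a single awk program. The trailing sort only makes the
# for-in iteration order (which awk leaves unspecified) deterministic.
printf 'err db\nok web\nerr db\nerr web\n' |
  awk '/err/ { n[$2]++ } END { for (k in n) print n[k], k }' |
  sort
```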

reply
PopAlongKid
1 day ago
[-]
>To Perl connoisseurs, this feature may be known as Autovivification. In general, AWK is quite unequivocally a prototype of Perl. You can even say that Perl is a kind of AWK overgrowth on steroids…

Before I learned Perl, I used to write non-trivial awk programs. Associative arrays, and other features are indeed very powerful. I'm no longer fluent, but I think I could still read a sophisticated awk script.

Even sed can be used for some fancy processing (i.e., scripts), if one knows regex well.

reply
nerdponx
1 day ago
[-]
> this means that if you write non-trivial Awk code, nobody on your team will be able to read it.

Sort of! A lot of AWK is easy to read even if you don't remember how to write it. There are a few quirks like how gsub modifies its target in-place (and how its default target is $0), and of course understanding the overall pattern-action layout. But I think most reasonable (not too clever, not too complicated) AWK scripts would also be readable to a typical programmer even if they don't know AWK specifically.
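A two-line illustration of that gsub quirk:

```shell
# gsub edits its target in place and returns the replacement count;
# with no target given, it rewrites $0 itself.
echo 'a-b-c' | awk '{ n = gsub(/-/, ":"); print n, $0 }'
# -> 2 a:b:c
```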

reply
Brian_K_White
1 day ago
[-]
I wrote a BASIC renumberer and compactor in bash, using every bashism I could so that it called no externals and didn't even use backticks to call child bashes, just pure bash itself (but late version and use every available feature for convenience and compactness).

I then re-wrote it in awk out of curiosity and it looked almost the same.

Crazy bash expansion syntax and commandline parser abuse was replaced by actual proper functions, but the whole thing when done was almost a line by line in-place replacement, so almost the same loc and structure.

Both versions share most of the same advantages over something like Python. Both are single-binary interpreters that are always already installed. Both versions will run on basically any system, any platform, any version (going forward at least) without needing to install anything, let alone anything as gobsmackingly ridiculous as pip or venv.(1)

But the awk version is actually readable.

And unlike bash, awk already pretty much stopped changing very much decades ago, so not only is it forward compatible, it's pretty backwards compatible too.

Not that that is generally a thing you have to worry about. We don't make new machines that are older than some code we wrote 5 years ago. Old bash or awk code always works on the next new machine, and that's all you ever need(2).

There is gnu vs bsd vs posix vs mawk/nawk but that's not much of a problem and it's not a constantly breaking new-version problem but the same gnu vs posix differences for the last 30 years. You have to knowingly go out of your way to use mawk etc.

(1) With bash you still have, for example, the fact that everything is on bash 5, or at worst 4, except that a brand new Mac today still ships with bash 3, so you can actually run into backwards compatibility issues in bash.

(2) and bash does actually have plugins & extensions and they do vary from system to system so you do have things you either need to avoid using or run into exactly the same breakage as python or ruby or whatever.

For writing a program vs gluing other programs together, really awk should be the goat.

reply
fuzztester
17 hours ago
[-]
>and so you can actually run into backwards compatibility in bash.

let's have a bash and bash that backwards compatibility in bash.

reply
benjaminogles
1 day ago
[-]
I feel the same about using Awk; it is just fun to use. I like that variables have defined initial values, so they don't need to be declared. And the most common bits of control flow needed to process an input file are implicit. Some fun things I've written with awk:

Plain text accounting program in awk https://github.com/benjaminogles/ledger.bash

Literate programming/static site generator in awk https://github.com/benjaminogles/lit

Although the latter just uses awk as a weird shell and maintains a couple child processes for converting md to html and executing code blocks with output piped into the document
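The defined-initial-values point is easy to show: counters and associative arrays spring into existence at 0 / "" with no declarations.

```shell
# No declarations anywhere: count[] and its entries default to 0.
printf 'apple\nbanana\napple\n' |
  awk '{ count[$1]++ } END { print count["apple"], count["banana"] }'
# -> 2 1
```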

reply
packetlost
1 day ago
[-]
AWK, rc, and mk are the 3 big tools in my shell toolkit. It's great
reply
nmz
1 day ago
[-]
Why mk instead of any of the other builders?
reply
packetlost
6 hours ago
[-]
I already get it with plan9port and it addresses 100% of my issues with make. It integrates nicely with rc so there's really not a lot of additional syntax to remember.
reply
sudahtigabulan
22 hours ago
[-]
> That’s typical usage of Awk, where you use it in place of cut because you can’t be bothered to remember the right flags for cut.

Even if you remember the flags, cut(1) will not be able to handle ls -l, or any other command that uses spaces to align text into fixed-width columns.

Unlike awk(1), cut(1) only works with single-character delimiters, meaning a run of spaces will be treated as several empty fields. And, depending on factors you don't control, every line will have a different number of fields, and the data you need to extract will be in a different field.

You can either switch to awk(1), because its default field separator treats runs of spaces as one, or squeeze them with tr(1) first:

  ls -l | tr -s ' ' | cut -d' ' -f3
reply
lelanthran
10 hours ago
[-]
Cut has flags to extract byte or character ranges.

You don't have to use fields.

reply
sudahtigabulan
9 hours ago
[-]
Can these flags be used to extract the N-th column (say, the size) of every line from ls -l output?
reply
lelanthran
8 hours ago
[-]
Yes.

    $ ls -l | cut -c 35-41

        22 
      4096 
      4096 
      4096 
      4096 
      4096 
      4096 
        68 
       456 
       690 
      7926 
      8503 
     19914
reply
sudahtigabulan
7 hours ago
[-]
This is what I get:

  ls -l | cut -c 35-41
  
  6 Nov 1
  6 Nov
  6 Nov 1
  6 Nov 1
reply
lelanthran
7 hours ago
[-]
Well, sure. I said it did character ranges so you don't have to use fields.

What were you expecting? That your character ranges in ls would match mine?

reply
sudahtigabulan
7 hours ago
[-]
> What were you expecting? That your character ranges in ls would match mine?

I would expect the command to work in any directory. Try a few different directories on your computer and you'll see that it won't work in some of them.

reply
lelanthran
6 hours ago
[-]
> I would expect the command to work in any directory.

But ... why expect that? That's not what "character ranges" mean.

I mean, I was only trying to clarify that `cut` is not limited to fields only.

reply
abhgh
1 day ago
[-]
Love awk. In the early days of my career, I used to write ETL pipelines and awk helped me condense a lot of stuff into a small number of LOC. I particularly prided myself in writing terse one-liners (some probably undecipherable, ha!); but did occasionally write scripts. Now I mostly reach for Python.
reply
tetris11
1 day ago
[-]
one of the best word-wrapping implementations I've seen (handles color codes and emojis just fine!) is written in pure mawk

very fast, highly underrated language

I'm not sure how good it would be for pipelines, if a step should fail, or if a step should need to resume, etc.

reply
meken
1 day ago
[-]
This sounds interesting. Could you give an example where you rewrote a pipeline in awk?
reply
ketanmaheshwari
1 day ago
[-]
Not the op but here is an example: TOKEN=$(kubectl describe secret -n kube-system $(kubectl get secrets -n kube-system | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t' | tr -d " ")

This pipeline may be significantly reduced by replacing cut's with awk, accommodating grep within awk and using awk's gsub in place of tr.
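On stand-in input (a hypothetical token line, not real kubectl output), the grep + cut + tr tail of that pipeline folds into one awk program:

```shell
# Stand-in for: grep -E '^token' | cut -f2 -d':' | tr -d '\t' | tr -d ' '
# One awk program does the matching, field extraction, and whitespace stripping.
printf 'name: foo\ntoken:  abc123\t\n' |
  awk -F: '/^token/ { gsub(/[ \t]/, "", $2); print $2 }'
# -> abc123
```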

reply
rbonvall
1 day ago
[-]
Example of replacing grep+cut with a single awk invocation:

    $ echo token:abc:def | grep -E ^token | cut -d: -f2
    abc
    
    $ echo token:abc:def | awk -F: '/^token/ { print $2 }'
    abc
Conditions don't have to be regular expressions. For example:

    $ echo "$CSV"
    foo:24
    bar:15
    baz:49
    
    $ echo "$CSV" | awk -F: '$2 > 20 { print $1 }'
    foo
    baz
reply
dietrichepp
1 day ago
[-]
Somebody wanted to set breakpoints in their C code by marking them with a comment (note “d” for “debugger”):

  //d
You can get a list of them with a single Awk line.

  awk -F'//d[[:space:]]*' 'NF > 1 {print FILENAME ":" FNR " " $2}' source/*.c
You can even create a GDB script, pretty easily.

(IMO, easier still to configure your editor to support breakpoints, but I’m not the one who chose to do it this way.)
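A sketch of the GDB-script idea, using a hypothetical temp file for illustration: the same field-separator trick, but printing GDB break commands instead.

```shell
# Create a demo source file with one //d marker (hypothetical path).
printf 'int x; //d watch here\n' > /tmp/awk_d_demo.c

# Split each line on the //d marker; NF > 1 means the marker was present.
awk -F'//d[[:space:]]*' 'NF > 1 { print "break " FILENAME ":" FNR }' /tmp/awk_d_demo.c
# -> break /tmp/awk_d_demo.c:1
```

Redirect the output to a file and load it with `gdb -x`.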

reply
kazinator
1 day ago
[-]
Why are you using the locale-specific [:space:] on source code? In your C source code, are you using spaces other than ASCII 0x20?

Would you have //d<0xA0>rest of comment?

Or some fancy Unicode space made using several UTF-8 bytes?

reply
dietrichepp
6 hours ago
[-]
> Why are you using the locale-specific [:space:] on source code?

Because it’s the one I remembered first, it worked, and I didn’t think that it needed any improvement. In fact, I still don’t think it needs any improvement.

reply
wtallis
1 day ago
[-]
Tab characters can also be found in source code.
reply
kazinator
1 day ago
[-]
Since you control the //d format, why would you allow/support anything but a space as a separator? That's just there to distinguish it from a comment like "//delete empty nodes" that is not the //d debug notation.

If tabs are supported,

  [ \t]
is still shorter than

  [[:space:]]
and if we include all the "isspace" characters from ASCII (vertical tab, form feed, embedded carriage return) except for the line feed that would never occur due to separating lines, we just break even on pure character count:

  [ \t\v\f\r]
TVFR all fall under the left hand, backslash under the right, and nothing requires Shift.

The resulting character class does exactly the same thing under any locale.

reply
nerdponx
21 hours ago
[-]
There's also [:blank:], which is just space and tab. Both I think are perfectly readable and reasonable options that communicate intent nicely.
reply
kazinator
20 hours ago
[-]
ISO C99 says, of the isblank function (to which [:blank:] is related):

The isblank function tests for any character that is a standard blank character or is one of a locale-specific set of characters for which isspace is true and that is used to separate words within a line of text. The standard blank characters are the following: space (’ ’), and horizontal tab (’\t’). In the "C" locale, isblank returns true only for the standard blank characters.

[:blank:] is only the same thing as [\t ] (tab space) if you run your scripts and Awk and everything in the "C" locale.

reply
nerdponx
4 hours ago
[-]
Interesting, the GNU Grep manual describes both character classes as behaving as if you are in the C locale. I shouldn't have assumed it was the same as in the C standard!
reply
nmz
1 day ago
[-]
awk is so much better than sed to learn, given its abilities. The only Unix tools it doesn't replace are tr and tail; other than that, you can use it instead of grep, cut, sed, and head.
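A rough sketch of those replacements (assuming default whitespace field splitting; these are approximations, not exact drop-ins):

```shell
# awk standing in for grep, head, cut, and sed on three sample lines:
printf 'a 1\nb 2\nc 3\n' | awk '/b/'                      # grep b    -> b 2
printf 'a 1\nb 2\nc 3\n' | awk 'NR <= 2'                  # head -n 2 -> a 1 / b 2
printf 'a 1\nb 2\nc 3\n' | awk '{ print $2 }'             # field 2   -> 1 / 2 / 3
printf 'a 1\nb 2\nc 3\n' | awk '{ sub(/a/, "A"); print }' # sed s/a/A/
```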
reply
stevekemp
12 hours ago
[-]
I think you could replace tail with awk, if you absolutely needed to. This is a naive attempt:

   cat /etc/passwd | \
   awk -v n=10 '{ lines[NR] = $0 }
            END{
                for (i = NR - n + 1; i <= NR; i++)
                    if (i > 0) print lines[i]
            }'
reply
nmz
4 hours ago
[-]
You can, sure, but tail seeks to EOF and then scans backward until it finds "\n"; awk cannot seek, so you must do what you did there, which means the bigger the file, the longer the run.

And there's also tail -f. How would you go about doing that? A while loop that sleeps and reopens the file? Yuck.
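For the memory side at least, a modulo ring buffer keeps only the last n lines resident (still O(file) time, since awk must read everything, but no longer O(file) space):

```shell
# Keep only the last n lines in memory via buf[NR % n].
printf '1\n2\n3\n4\n5\n' | awk -v n=3 '
  { buf[NR % n] = $0 }
  END { for (i = NR - n + 1; i <= NR; i++) if (i > 0) print buf[i % n] }'
# -> 3 / 4 / 5
```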

reply
RGBCube
1 day ago
[-]
Stop using awk, use a real programming language+shell instead, with structured data instead of bytestream wrangling:

  > ls -l | get user

  ┌────┬──────┐
  │  0 │ cube │
  │  1 │ cube │
  │  2 │ cube │
  │  3 │ cube │
  │  4 │ cube │
  │  5 │ cube │
  │  6 │ cube │
  │  7 │ cube │
  │  8 │ cube │
  │  9 │ cube │
  │ 10 │ cube │
  │ 11 │ cube │
  │ 12 │ cube │
  │ 13 │ cube │
  │ 14 │ cube │
  │ 15 │ cube │
  └────┴──────┘
You don't need to memorize bad tools' quirks. You can just use good tools.

https://nushell.sh - try Nushell now! It's like PowerShell, if it was good.

reply
electricEmu
1 day ago
[-]
PowerShell is open source and available on Linux today for those who enjoy an OO terminal.

MIT licensed.

https://learn.microsoft.com/en-us/powershell/scripting/insta...

reply
ryapric
23 hours ago
[-]
While your recommendation is sound: this is not only a rudely-worded take, but also missing the point of the parent comment.
reply
esafak
21 hours ago
[-]
Also, the nushell code is self-explanatory. Who knows what $3 refers to?
reply
anthk
1 day ago
[-]
Once you get TSV- and CSV-related tools, nushell and psh are like toys.
reply
esafak
21 hours ago
[-]
https://www.nushell.sh/commands/docs/from_csv.html

For TSV, use the --separator flag.

reply
anthk
15 hours ago
[-]
Current AWK (One True AWK, in the OpenBSD base system) has CSV support; you can read the man page for it.
reply
simoncion
21 hours ago
[-]
> try Nushell now!

So, I'm curious. What's the Nushell reimplementation of the 'crash-dump.awk' script at the end of the "Awk in 20 Minutes" article on ferd.ca ? Do note that "I simply won't deal with weirdly-structured data." isn't an option.

reply
knlb
1 day ago
[-]
I used to be scared of Awk, and then I read through the appendix / chapters of "More Programming Pearls" (https://www.amazon.com/More-Programming-Pearls-Confessions-C...) and it became a much easier language to reason about.

The structure can be a bit confusing if you've only seen one liners because it has a lot of defaults that kick in when not specified.

The pleasant surprise from learning to use awk was that bpftrace suddenly became much more understandable and easier to write as well, because it's partially inspired by awk.

reply
svat
1 day ago
[-]
I learned the basics of AWK in a few minutes from here: https://learnxinyminutes.com/awk/ — and I agree with you, it was worth it!
reply
cholantesh
1 day ago
[-]
This was a great read, as was the previous post in the series. I see a lot of very convincing arguments here (https://maximullaris.com/awk.html#why), but for me one of the biggest points in favour of Python (and I say this as someone who, for learning, will always just reach for C++ out of muscle memory) is its eminent readability. If I'm writing a script, quite a lot of the time it's meant not just for myself but for my peers to use with some degree of regularity. I feel pretty confident that there would be much more operational overhead and a lot of time spent explaining internals with awk than with Python.
reply
xphos
1 day ago
[-]
The portability hit me. I was working a closed corp net that at the time didn't have python and shell was so inconsistent. But awk just worked. Sed was also a really strong tool.
reply
layer8
9 hours ago
[-]
Not just portability, but also stability. You can be confident your scripts will still work just the same in ten years.
reply
kevg123
1 day ago
[-]
Nice link to the canonical book on Awk within the first linked page in the article: https://ia903404.us.archive.org/0/items/pdfy-MgN0H1joIoDVoIC...
reply
jjice
1 day ago
[-]
Everyone should read "The AWK Programming Language". It's so short, with both the first and second editions floating around 200 pages in an A5 (could be off on that page size) form factor.

Aside from AWK being a handy language to know, understanding the ideas behind it from a language design and use case perspective can help open your eyes to new constructs and ideas.

reply
layer8
1 day ago
[-]
There’s a second edition: https://www.awk.dev/
reply
nmz
1 day ago
[-]
Good for extensibility is a claim I've never heard before; I've always found awk's everything-is-global-scope a huge limitation. But if it's for scripting, then I suppose you could just isolate each script's namespace and take the global namespace as the exported namespace, and since everything is static, that really simplifies things further. Lua is still better, of course, but if you don't need that much power I suppose awk would be even smaller.
reply
renjieliu
23 hours ago
[-]
Fun read. I always thought quick calculation like echo $((1+100)) was just a shell feature. Perhaps it is rooted in awk as well.
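Whatever the lineage, awk evaluates the same kind of expression in a BEGIN block, without reading any input:

```shell
# Arithmetic with no input file: the BEGIN block runs before any records.
awk 'BEGIN { print 1 + 100 }'
# -> 101
```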
reply
1vuio0pswjnm7
14 hours ago
[-]
AWK is slower than sed
reply
1vuio0pswjnm7
6 hours ago
[-]
sed may never die

Certainly sed will outlive me

sed is a time-saver, enabling computer users to make the most of the time they have left

reply
Towaway69
7 hours ago
[-]
we all die sooner or later.
reply
anthk
1 day ago
[-]
FreeCell written in AWK:

https://git.luxferre.top/nnfc/

AWK goodies (git clone --recursive) :

https://git.luxferre.top/awk-gold-collection

reply
1vuio0pswjnm7
1 day ago
[-]

   stat -c %U *
reply
gist
1 day ago
[-]
I am not a programmer, but I have used awk since the 1980s, and normally I would read this type of info, or really most things about typical Unix tools. I've done a small number of helpful things with awk (again dating to the 1980s); I wrote an estimating system using awk and flat txt files, as an example.

However given what I've been able to acomplish with Claude Code, I no longer find it necessary to know any details, tips, or tricks, or to really learn anything more (at least for the types of projects I am involved in for my own benefit).

Update: Would love to know why this was downvoted...

reply
HeinzStuckeIt
22 hours ago
[-]
HN is all about content that gratifies one’s intellectual curiosity, so if you are admitting you have lost the desire to learn, then that could be triggering the backlash.
reply
Towaway69
7 hours ago
[-]
Ironically at the same time, a good percentage of HN readers are probably shareholders in one or multiple AI companies.

Making a buck off the disinterested is ok, being disinterested yourself isn't.

reply
Brian_K_White
23 hours ago
[-]
Obviously, you won't understand or agree with the reason once explained, so really what's the point?

The reason is (yes I will be so bold as to speak for all on this one) both using ai to do your thinking for you, and essentially advocating to any readers to do the same simply by writing how well it works for you. Some people find this actively bad, of negative value, and some find it merely utterly uninteresting, of no value, and both responses produce downvotes.

But it's automatic that you can not see this. If you recognized any problem, you would not be doing it, or at the very least would not describe it as anything but an embarrassing admission, like talking about a guilty pleasure vs a wholesome good thing.

So don't bother asking "What's wrong with using this tool that works vs any other tool that works?" If you have to ask... There are several things wrong, not just one.

Or for some it could just be that "I used to use awk but now I just use ___" just doesn't add anything to a discussion about awk. "I used to use awk a lot but now I just use ruby". Ok? So what? Some people go as far as to downvote for that.

Also, now that you've whined about downvotes, I wouldn't be surprised if that's the cause of some itself, because it absolutely does deserve it.

There might possibly also be at least some just from "I'm not a programmer but here's my thoughts on this programming topic", though that isn't very wrong in my own opinion. You even say you've actually used awk a lot, so as far as I'm concerned you can absolutely talk about awk and probably don't need to be so humble as to deny yourself as a programmer. It's admirable to avoid making claims about yourself, but I bet a bystander would call you at least a programmer, even if we'll leave the actual level of sophistication unspecified.

Since I wrote this comment, I did not up or downvote myself. But for the record, I would have downvoted for the ai.

reply
layer8
9 hours ago
[-]
I'm upvoting GP so that more people read your reply.
reply