I've been a pipeline junkie for a long time, but I've only recently started to get into awk. The thing I can do with awk but not with other tools is write stateful filters, which accumulate information in associative arrays as they go.
For example, if you want to do uniq without sorting the input, that's:
awk '{ if (!($0 in seen)) print $0; seen[$0] = 1; }'
This works best if the number of unique lines is small, either because the input is small, or because it is highly repetitive. Made-up example, finding all the file extensions used in a directory tree:
find /usr/lib -type f | sed -rn 's/^.*\.([^/]*)$/\1/p' | awk '{ if (!($0 in seen)) print $0; seen[$0] = 1; }'
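If you also want counts per extension, the same associative-array trick extends naturally -- a sketch:

find /usr/lib -type f | sed -rn 's/^.*\.([^/]*)$/\1/p' | awk '{ count[$0]++ } END { for (e in count) print count[e], e }'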
That uniq filter is easily tweaked, e.g. to uniquify by a part of the string. Say you have a log file formatted like this:
2019-03-03T12:38:16Z hob: turned to 75%
2019-03-03T12:38:17Z frying_pan: moved to hob
2019-03-03T12:38:19Z frying_pan: added butter
2019-03-03T12:38:22Z batter: mixed
2019-03-03T12:38:27Z batter: poured in pan
2019-03-03T12:38:28Z frying_pan: tilted around
2019-03-03T12:39:09Z frying_pan: FLIPPED
2019-03-03T12:39:41Z frying_pan: FLIPPED
2019-03-03T12:39:46Z frying_pan: pancake removed
If you want to see the first entry for each subsystem:
awk '{ if (!($2 in seen)) print $0; seen[$2] = 1; }'
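On the log above, that should print just the first line for each of hob, frying_pan, and batter:

2019-03-03T12:38:16Z hob: turned to 75%
2019-03-03T12:38:17Z frying_pan: moved to hob
2019-03-03T12:38:22Z batter: mixed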
Or the last (although this won't preserve input order):
awk '{ seen[$2] = $0; } END { for (k in seen) print seen[k]; }'
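If you do want to keep input order for the last-entry case, one way (a sketch) is to number the keys as they first appear:

awk '{ if (!($2 in idx)) { idx[$2] = ++n; key[n] = $2 } last[$2] = $0 } END { for (i = 1; i <= n; i++) print last[key[i]] }'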
I don't think there's another simple tool in the unix toolkit that lets you do things like this. You could probably do it with sed, but it would involve some nightmarish abuse of the hold space as a database.
awk '{ if (!($2 in seen)) print $0; seen[$2] = 1; }'
You can even shorten this a bit! "awk '!seen[$2]++'" does the same thing -- a pattern with no action makes awk print the whole line whenever the pattern evaluates to true. It's definitely more code-golfy than being explicit about what's actually going on, though.
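Spelled out, that one-liner is roughly:

awk '{ if (!seen[$2]) print $0; seen[$2]++ }'

The increment happens after the test, so the line only prints the first time its key turns up.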
I did a lightning talk on awk last year and found this great article series from 2000 on all the powers of awk (including network access, but not yet email :) ).
I admire your work. Clever usage of Unix tools is very handy. But for parsing text, do you really see awk and Unix tools as a better solution than a simple Python script?
Although I admit that the key argument for Unix tools is that they don't get updated. That sounds awful, but think about it: once it works, it works everywhere, no matter the OS type, version, or packages installed. That is something experienced programmers always want from their solutions.
Python is fantastic for little (or large!) bits of logic, but its handling of input is clunky enough to put me off for tiny things. AFAIK the boilerplate you need before you can work on the fields of each line is:
import sys
for line in sys.stdin:
    fields = line.split()
    # now you can do your logic
If you want to use regular expressions, that's another import.
Python also doesn't play well with others in a pipeline. You can use python -c, but you can't use newlines inside the argument (AFAICT), so you're very limited in what you can do.
Parsing text is what a lot of these scripts/mini-pipelines do.
The key argument for *nix tools is that they do one thing and only one thing extremely well. At a meta level these tools are units of functionality, and you're actually doing functional programming, on the command line, without realizing it.
^ agree -- I've seen lots of folks [newer users mostly] turn to grep when really what they wanted was sed. It's just a matter of learning which screwdriver is for which type of screw.
"I don't think there's another simple tool in the unix toolkit that lets you do things like this."
Perl can, since it borrowed a fair amount from awk. It's also almost as commonly already installed. The one-liner equivalents to what you showed are pretty similar, for example: https://news.ycombinator.com/item?id=19294575
Though, I concede it falls outside the realm of "simple tool".
There's a wonderful quote about things like this in the Unix Hater's Handbook:
> However, since joining this discussion, a lot of Unix supporters have sent me examples of stuff to “prove” how powerful Unix is. These examples have certainly been enough to refresh my memory: they all do something trivial or useless, and they all do so in a very arcane manner.
So I assume you would use some sort of 'grep XX | sort | uniq' (I still do) to get unique lines as output. Is this awk line now your default, or do you find yourself using both for convenience?
Do you alias these awk commands on all the machines you work on? Or, to put it another way: I haven't found a nice way to keep my custom aliases in sync across different machines -- perhaps you have a recommendation or workflow that is really sweet?
I still default to sort -u (or sort | uniq -c if I need counts), partly from habit, but partly because it's often useful to have the output sorted anyway.
I have a script on my path called huniq ('hash uniq') that contains that awk program. I prefer scripts to aliases because they play better with xargs and so on.
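Assuming huniq is just a thin wrapper around the one-liner above, it might look something like this:

#!/bin/sh
# huniq: print each distinct line the first time it appears, preserving input order
exec awk '{ if (!($0 in seen)) print $0; seen[$0] = 1; }' "$@"

Passing "$@" through to awk means it works on filename arguments as well as stdin.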
I have a Mercurial repository full of little scripts like this, and other handy things, which lives on Bitbucket, and which I clone on machines I do a lot of work on. In principle, whenever I make changes to it I should commit and push them, then pull them down on other machines, but I'm pretty slack about it. It still helps more than not having anything, though.