Ever wondered how the Unix word count utility wc works? It’s likely written in C, but here is a simple awk version.
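{ nc += length($0) + 1   # characters on this line, plus 1 for the newline
  nw += NF               # words (fields) on this line
}
END { print NR, nw, nc } # lines, words, characters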
Save the code in a file, say wcawk, and run it like this:
awk -f wcawk textfile
The first line of code adds up the characters: the length of the whole input line ($0), plus 1 for the \n character that awk strips from each line before processing it. The second line adds up the words by accumulating NF, the built-in variable holding the number of fields on the current line. The fourth line, the END block, runs once all input has been read and prints the number of lines (NR), the word count and the character count. Here’s a sample:
I've been waiting for you Obi-Wan. We meet again, at last. The circle is now complete; when I left you, I was but the learner, now I am the master. Only a master of evil, Darth.
And here’s what happens when awk is run:
3 36 181
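As a quick sanity check, the awk counts can be compared with wc itself. The sketch below assumes a POSIX shell with mktemp available, and it uses a small made-up two-line input rather than the sample above. The variable names are purely illustrative.

```shell
# Hypothetical two-line sample, created here so the example is self-contained.
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

# The awk word counter from above, passed inline instead of via -f wcawk.
awk_counts=$(awk '{ nc += length($0) + 1; nw += NF } END { print NR, nw, nc }' "$sample")

# wc pads its columns, so re-print them with single spaces before comparing.
wc_counts=$(wc < "$sample" | awk '{ print $1, $2, $3 }')

echo "awk: $awk_counts"
echo "wc:  $wc_counts"
rm -f "$sample"
```

Both lines should read 2 3 14. The two tools agree here because the input is plain ASCII and ends with a newline. wc counts bytes and counts lines by their newline characters, so multibyte text or a missing final newline can make the numbers differ.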
What about word frequencies? Could you do that in awk? Sure can. Here’s the code:
awk ' { for (i = 1; i <= NF; i++) freq[$i]++ }     # count each word on the line
END { for (word in freq) print word, freq[word] }  # print every word with its count
' "$@"
In this case the awk call is wrapped in a shell script, here called wfreq, and the file is simply made executable. The first for loop looks at each word in the input line, incrementing the element of the array freq subscripted by that word. After the file has been read, the second for loop prints the words and their counts. When run, it prints them as one long list in arbitrary order, because awk does not define the order in which for (word in freq) visits the array. To make the output nicer, it can first be piped into sort, then into column. Below is the output from the sample text:
% wfreq vader.txt | sort | column
Darth. 1     a 1        complete; 1   master 1    was 1
I 3          again, 1   evil, 1       master. 1   when 1
I've 1       am 1       for 1         meet 1      you 1
Obi-Wan. 1   at 1       is 1          now 2       you, 1
Only 1       been 1     last. 1       of 1
The 1        but 1      learner, 1    the 2
We 1         circle 1   left 1        waiting 1
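Notice that master and master. (and you and you,) are counted separately, because awk splits fields on whitespace only. If merging them is wanted, one possible variation is sketched below. It is not part of the original script. It assumes that lowercasing each word and trimming leading and trailing punctuation is acceptable, uses a made-up two-line input, and sorts by count rather than alphabetically.

```shell
# Hypothetical sample input, created here so the example is self-contained.
demo=$(mktemp)
printf 'The cat saw the dog.\nThe dog ran, the cat sat.\n' > "$demo"

result=$(awk '{
    for (i = 1; i <= NF; i++) {
        word = tolower($i)                            # fold case: The and the match
        gsub(/^[^a-z0-9]+|[^a-z0-9]+$/, "", word)     # trim punctuation at both ends
        if (word != "") freq[word]++
    }
}
END { for (word in freq) print word, freq[word] }' "$demo" |
    sort -k2,2nr -k1,1)   # highest count first, ties broken alphabetically

echo "$result"
rm -f "$demo"
```

Punctuation inside a word is kept, so I've and Obi-Wan would survive as i've and obi-wan. The numeric sort on the second field puts the most frequent words at the top, which is usually more useful than the plain alphabetical order.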