Using awk: Counting words?

Ever wondered how the Unix word count utility wc works? It’s likely written in C, but here is a simple awk version.

{ nc += length($0) + 1
  nw += NF
}
END { print NR, nw, nc }

Save the code in a file, say wcawk, and run it like this:

awk -f wcawk textfile

The first line of code adds the length of the current line ($0) to the character count, plus 1 for the \n character that awk strips from each line. The second line adds NF, the built-in variable holding the number of fields in the current line, to the word count. The END rule runs once, after all the input has been read, and prints the number of lines (NR), the word count and the character count. Here’s a sample:

I've been waiting for you Obi-Wan. We meet again, at last.
The circle is now complete; when I left you, I was but the
learner, now I am the master. Only a master of evil, Darth.

And here’s what happens when awk is run: 3 36 181
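One way to check the awk version is to run it next to wc on the same input. The sketch below is only an illustration: wc_awk and the two-line sample are made-up names and data, and the `+ 0` in the END rule is an addition so that empty input prints zeros. It assumes plain ASCII text whose last line ends in a newline. On other input the counts can differ, since wc counts bytes while length() counts characters in some awk implementations.

```shell
# wc_awk: the awk word counter wrapped in a shell function (illustrative name).
# The "+ 0" is an addition to the original: it forces numeric output on empty input.
wc_awk() {
    awk '{ nc += length($0) + 1; nw += NF }
         END { print NR, nw + 0, nc + 0 }' "$@"
}

# Hypothetical two-line sample; the closing quote sits on its own line
# so that the second line also ends in a newline.
sample='to be or not
to be
'

# Run the awk version on standard input.
printf '%s' "$sample" | wc_awk

# wc pads its columns, so re-print its three fields separated by single spaces.
printf '%s' "$sample" | LC_ALL=C wc | awk '{ print $1, $2, $3 }'
```

Both commands should print 2 6 19 for this input: two lines, six words, and 13 + 6 characters including the newlines.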

What about word frequencies? Could you do that in awk? Sure can. Here’s the code:

awk '   { for (i=1; i<=NF; i=i+1) freq[$i]++ }
END     { for (word in freq) print word, freq[word] }
' "$@"

In this case the awk program is embedded in a shell script, here called wfreq, and the file has simply been made executable. The first for loop visits each word on the input line, incrementing the element of the array freq subscripted by that word. After all the input has been read, the second for loop prints the words and their counts. The list is long and in arbitrary order, because awk does not define the order in which for (word in freq) visits the array. To make the output nicer, pipe it first into sort, then into column. Below is the output from the sample text:

% wfreq vader.txt | sort | column
Darth. 1	a 1		complete; 1	master 1	was 1
I 3		again, 1	evil, 1		master. 1	when 1
I've 1		am 1		for 1		meet 1		you 1
Obi-Wan. 1	at 1		is 1		now 2		you, 1
Only 1		been 1		last. 1		of 1
The 1		but 1		learner, 1	the 2
We 1		circle 1	left 1		waiting 1
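
Because the fields are split on whitespace only, case and punctuation produce separate entries, such as master and master., or The and the. A possible refinement is sketched below. It is not part of the original script: wfreq_norm is a hypothetical name, and it assumes that lowercasing every word and trimming a small set of leading and trailing punctuation marks is acceptable. It also sorts by count, highest first.

```shell
# wfreq_norm: hypothetical variant of the word-frequency script.
# Assumption: words are lowercased and leading/trailing . , ; : ! ? are trimmed,
# so apostrophes and hyphens inside a word are kept.
wfreq_norm() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            word = tolower($i)
            gsub(/^[.,;:!?]+|[.,;:!?]+$/, "", word)
            if (word != "") freq[word]++    # skip fields that were only punctuation
        }
    }
    END { for (word in freq) print word, freq[word] }
    ' "$@" | LC_ALL=C sort -k2,2nr -k1,1    # by count descending, then by word
}

# Made-up one-line input to show the folding of case and punctuation.
printf 'The cat. the dog, THE end\n' | wfreq_norm
```

For this input the function prints the 3, followed by cat 1, dog 1 and end 1, each on its own line. Run on the sample text, it would fold master. into master and The into the.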

