Awk Tutorial and Introduction

Posted on April 07, 2019 in Linux

Awk

Awk is a full fledged, turing complete programming language on its own but it was designed specifically to deal with text data.

Say, you have the following kind of data:

PRN   Az     Ele     Lat     Lon     Stec    Vtec    S4
1   155.41  5.82    15.74   90.22   29.20   10.78   -99.000
3   155.25  5.95    15.84   90.21   29.38   10.86   -99.000
2   155.09  6.08    15.94   90.21   29.26   10.84   -99.000
1   154.94  6.21    16.03   90.21   29.22   10.84   -99.000
3   154.78  6.33    16.13   90.21   29.16   10.84   -99.000
3   154.62  6.46    16.23   90.21   29.01   10.81   -99.000
2   154.46  6.59    16.32   90.21   28.98   10.82   -99.000

Say you are interested in only some of the columns. And you want to seperate out the data by PRN value which runs from \(1\) to \(3\). You might also want to delete those rows which have a negative value i.e. -99.00 which you know implies an error or missing data. You could do all these things and more using awk.

Basic syntax

One thing you should note before we begin is that awk is a line editor. It processes one line at a time. So, no matter how many lines of input you have, awk runs the same code on all of it. It's basic structure is the following:

BEGIN{
    ...
}
{
    ...
}
END {
    ...
}

It has three blocks as you can see. Sometimes you need to do something before you even begin editing lines, for eg. write the header, or initialize some variables etc. That is what goes inside the BEGIN{} block.

Similary, you have END{} block which runs after the lines are finished processing. This may be used to write footer, for example.

The code that runs for each line of input is on the middle, also called the main block.

Running the script

Let me show you how to run the awk script. First of all, create a script file using your text editor, eg. gedit, vim, etc.

I recommend creating a new directory to save the file so that it is cleaner. Put the following into the file and name it script.awk:

BEGIN{
    print "###First Line###"
}
{
    print
}
END{
    print "###Last Line###"
}

Also create another file which will be the file that we want to edit/manipulate using awk. So, create a file named list.txt with the following content:

Apples
Potato
Onion
Garlic

Then open the terminal and cd into the directory where you saved your files. Then, enter the following command to run the script. The syntax is awk -f <script-file> <input file>.

$ awk -f script.awk list.txt
###First Line###
Apples
Potato
Onion
Garlic
###Last Line###

If you get the output that looks like above, everything is working correctly. Now let's talk about what the program did.

BEGIN block has a line that says print "###First Line###". Since this block runs before dealing with any line in the file list.txt itself, we see the line in the first line of the output. You also see that print is the command to write something to the screen/file.

Similarly, the END block writes to the last line i.e. after every line in the given file is evaluated. The line in the main block is a little different. It just says print. When the argument to print is missing, it just prints the line in the input file as it is as we see above.

It's only interesting when you start to manipulate the lines given. But before we do, let's see how to save the output to a new file.

Saving output to a file

Unix/Linux has this concept of pipe/redirection, which basically is a way to link output of one program to the input of some other program or to a file, etc.

Pipe symbols | are used to pipe data between two programs and <, > for redirection to/from a file. As such:

$ awk -f script.awk list.txt > output.txt
$ ls
list.txt  output.txt  script.awk

You see a new file output.txt has been created. You can cat output.txt to check that the output has indeed been written to that file.

Selecting columns

A column in awk is represented by $n where n is \(1\) for first column, and \(8\) for eighth column. So, $3 would be the third column.

For example purpose, copy the following into the list.txt file we created earlier.

Apples  1Kg
Potato  5Kg
Onion   1Kg
Garlic  .5Kg

And create a script itemOnly.awk with the following code:

BEGIN{
}
{
    print $1
}
END {
}

And save and run the script against the file list.txt like this:

$ awk -f itemOnly.awk list.txt
Apples
Potato
Onion
Garlic

Note that we didn't write anything in BEGIN and END block. That's allowed. If you wanted to write a header for example that says "Items" at the top, you could do print "Item" in your BEGIN block.

You can also very simply change the order of the columns:

BEGIN{
    print "Qty." "\t" "Items"
}
{
    print $2 "\t" $1
}
END {
}

Save and run just like above, you'll get something like this:

Qty.    Items
1Kg     Apples
5Kg     Potato
1Kg     Onion
.5Kg    Garlic

So, you see the order has been reversed and the heading has been added.

And, you can change the order of columns into anything you like just by specifying them in the main block as we did above.

Also note that the \t is a tab character which puts a tabspace in between two columns. You could also put space or comma ,(for csv files for example) or semicolon ; or any other column seperator you like.

Variables

Variables are easy to declare and use. If you are familiar with C, the syntax is similar. Let's add a new column S.N for serial number into our list above. Since we want awk to put the value \(1,2,3\) and so on automatically, we need a variable that counts the line and writes the value of S.N for each row.

Let me show you what I mean:

BEGIN{
    print "S.N." "\t" "Qty." "\t" "Items"
    i = 1
}
{
    print i "\t" $2 "\t" $1
    i = i + 1
}
END {
}

Now if you save and run the program, you'll get the following:

S.N.    Qty.    Items
1       1Kg     Apples
2       5Kg     Potato
3       1Kg     Onion
4       .5Kg    Garlic

So, we made a variable called 'i' which has a value \(1\) to start. Note that since we only make a variable once, we do this in BEGIN block. Then, we print the value of i in the main block, which is executed once for every line of input, so it writes Serial Number for each line in the file given.

Also note that we increase the value of i by 1 in the main block which means when evaluating each line of the input file i increases by \(1\) so that next line is given a new serial number.

Conditionals

Suppose I don't have much money and I only intend to buy first two items from our shopping list. We want to reduce the list to just two items. In other words we want only those columns which have S.N \(1\) and \(2\). Another way to say the same thing is serial number less than \(3\).

This is where we need conditionals. So, we update the main block of our last script putting in the conditional as follows:

{
    if (i < 3){
        print i "\t" $2 "\t" $1
    }

    i = i + 1
}

We added an if conditional in the main block. Therefore, the print line only gets evaluated if the conditional is true namely if the value of i is less than \(3\).

Save and run the script with list.txt as input and verity that it runs as expected.

All of the following relational operators can be used in a conditional:

OperatorMeaning
==Is equal
!=Is not equal to
>Is greater than
>=Is greater than or equal to
<Is less than
<=Is less than or equal to
Regular expression

If you're familiar with regular expressions, you can use ~ to mean matches a certain pattern and !~ to mean doesn't match. If you're unfamiliar with RegEx, comment below and I will write another short intro about them.

Besides, you also have boolean operators 'and' && and 'or' || if you want to combine two or more conditions.

Loops

To be continued...