December 19, 2019

Variables that control awk behavior

  • FS: Field Separator. Regex that separates fields
  • RS: Record Separator. Regex that separates records
  • OFS: Output Field Separator. Text to separate printed fields
  • ORS: Output Record Separator. Text to separate printed records
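A small sketch of how these variables change the result. The input is made up for illustration; the three variables are set in a BEGIN rule so they apply before the first record is read.

```shell
# Made-up input: two records, fields separated by ":"
printf 'a:b\nc:d\n' |
awk 'BEGIN { FS = ":"; OFS = "-"; ORS = ";\n" }   # split on ":", join with "-", end records with ";"
     { print $1, $2 }'
# prints:
# a-b;
# c-d;
```

The comma in print is what inserts OFS between the fields.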

FS is super useful

Changing FS is so useful that there is a shortcut for it

awk has the -F option for it. Upper case F

awk -F "," '$2>100e6 {print $1}' population_total.csv
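To see that -F is only a shortcut, the sketch below runs the same rule twice: once with -F and once setting FS in a BEGIN rule. The two-line sample is made up and only imitates a comma-separated file with a name and a number.

```shell
# Made-up sample data (name,number); not the real population file
printf 'Aland,150000000\nBorduria,5000000\n' | awk -F ',' '$2 > 100e6 { print $1 }'
printf 'Aland,150000000\nBorduria,5000000\n' | awk 'BEGIN { FS = "," } $2 > 100e6 { print $1 }'
# both commands print: Aland
```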

Several Rules

awk reads the input files one line at a time

  • more precisely, one record at a time (by default, a record is a line)

For each record, awk tries the patterns of each rule

If several patterns match, then several actions execute in the order in which they appear in the awk program

If no patterns match, then no actions run.

Several Rules

After processing all the rules that match the line, awk reads the next line.

This continues until awk reaches the end of the input.

For example, the following awk program contains two rules:

/12/  { print $0 }
/21/  { print $0 }
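A sketch of these two rules on four made-up input lines. A label is added to each print (an addition for this illustration) so it is visible which rule produced each output line.

```shell
# Made-up input: "121" contains both 12 and 21, "33" contains neither
printf '12\n21\n121\n33\n' | awk '
/12/  { print "rule 1:", $0 }
/21/  { print "rule 2:", $0 }
'
# prints:
# rule 1: 12
# rule 2: 21
# rule 1: 121
# rule 2: 121
```

The line 121 matches both patterns, so both actions run, in program order. The line 33 matches nothing and produces no output.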

Longer programs

Sooner or later you will have too many rules to fit on the command line

And it becomes tedious to type them all again and again

In this case we can write them all in a .awk file

We use a text editor (like nano or vim) to edit it

Example

Let’s write the file gdp.awk with this content

BEGIN { FS="\t" }

NF>0 {gdp = $3*$4; total+=gdp; print $1,gdp }

END {print "Total", total}

Running the example

This time we take the commands from a file

To tell awk to read commands from a file, we use the option -f (lower case f)

awk -f gdp.awk world2017.txt

Be careful. Do not confuse -f and -F
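A way to try the whole cycle without touching real data. This sketch writes gdp.awk into a scratch directory and runs it on a tiny made-up file. The numbers are invented, and what columns 3 and 4 of world2017.txt really contain is not shown here.

```shell
dir=$(mktemp -d)                       # scratch directory for this demo

# The script from the example slide
cat > "$dir/gdp.awk" <<'EOF'
BEGIN { FS="\t" }
NF>0 {gdp = $3*$4; total+=gdp; print $1,gdp }
END {print "Total", total}
EOF

# Made-up tab-separated sample with one empty line in the middle
printf 'Aland\tAL\t2\t10\n\nBorduria\tBO\t3\t20\n' > "$dir/sample.txt"

awk -f "$dir/gdp.awk" "$dir/sample.txt"
# prints:
# Aland 20
# Borduria 60
# Total 80

rm -r "$dir"                           # clean up
```

The empty line has NF equal to 0, so the middle rule skips it.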

Functions in awk

Numeric Functions

AWK has the following built-in arithmetic functions:

int(expr) Truncate to integer.
rand() Return a random number N, between 0 and 1, such that 0 ≤ N < 1.
srand([expr]) Use expr as the new seed for the random number generator. If no expr is provided, use the time of day. Return the previous seed for the random number generator.

Numeric Functions

atan2(y, x) Return the arctangent of y/x in radians.
cos(expr) Return the cosine of expr, which is in radians.
sin(expr) Return the sine of expr, which is in radians.
exp(expr) The exponential function.
log(expr) The natural logarithm function.
sqrt(expr) Return the square root of expr.
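A quick sketch to try some of these functions. Everything runs in a BEGIN rule, so no input file is needed. The argument values are arbitrary choices for illustration.

```shell
awk 'BEGIN {
    # int() cuts off the decimals, also for negative numbers
    print int(3.9), int(-3.9)
    print sqrt(16), exp(0), log(1)
    # atan2(0, -1) is pi; print shows 6 significant digits by default
    print atan2(0, -1)
}'
# prints:
# 3 -3
# 4 1 0
# 3.14159
```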

Examples

Print seven random numbers from 0 to 99, inclusive:

awk 'BEGIN { for (i = 1; i <= 7; i++)
                 print int(100 * rand()) }'

rand() is a real number between 0 and 1.

It can be 0, but cannot be 1

i.e. if r = rand(), then 0 <= r && r < 1
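A sketch of what the seed does: setting the same seed with srand() makes rand() repeat the same numbers. The seed 42 is an arbitrary choice. Whether two separate runs without srand() give the same numbers depends on the awk implementation, so the sketch does not rely on that.

```shell
awk 'BEGIN {
    srand(42); a = int(100 * rand())   # first number after seeding with 42
    srand(42); b = int(100 * rand())   # seed again with 42, draw again
    if (a == b) print "same"; else print "different"
}'
# prints: same
```

Calling srand() with no argument uses the time of day instead, which is the usual way to get different numbers on each run.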

String Functions

tolower(str)
Return a copy of the string str, with all the uppercase characters in str translated to their corresponding lowercase counterparts.
Non-alphabetic characters are left unchanged.
toupper(str)
Return a copy of the string str, with all the lowercase characters in str translated to their corresponding uppercase counterparts.

String Functions

length([s])
Return the length of the string s, or the length of $0 if s is not supplied.
substr(s, i [, n])
Return the substring of s that starts at position i and is at most n characters long. Positions count from 1.
If n is omitted, use the rest of s.
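A short sketch of the string functions on a fixed word. The word is an arbitrary choice for illustration.

```shell
awk 'BEGIN {
    s = "bioinformatics"
    print length(s)          # number of characters
    print substr(s, 4, 4)    # 4 characters starting at position 4
    print substr(s, 10)      # from position 10 to the end
    print toupper(s)
    print tolower("AWK")
}'
# prints:
# 14
# info
# atics
# BIOINFORMATICS
# awk
```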

There are several more functions

Homework

Write an awk program that changes the first word to Title Case

Comments

In an AWK script you can write comments to help you understand what is happening

This is very practical, since other people (or you yourself) can understand the program later

Comments start with # and continue to the end of line
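A small sketch of a commented program. The input names and numbers are made up. Comments can take a whole line or follow a command on the same line.

```shell
printf 'ana 3\nluis 5\n' | awk '
# add up the second column
{ total += $2 }        # runs once per record
END { print total }    # runs after the last record
'
# prints: 8
```

When the program is written between single quotes on the command line, a comment cannot contain an apostrophe. In a .awk file this restriction does not apply.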

Loops

Population data

The file /home/andres/population_total.csv has data for all years and all countries

Take a look doing this:

head /home/andres/population_total.csv

Pivot table

We want to change the shape of this table

The output should be in three columns

  • country
  • year
  • population

Fields are separated by comma

We need to use the -F option. Something like

awk -F ',' '{print $1, 1800, $2; 
             print $1, 1801, $3; 
             print $1, 1802, $4;
         }' /home/andres/population_total.csv

with one print command for every field

Can we do it smarter?

for loops

Like many other computer languages, awk can repeat the same commands several times

awk -F ',' '{for(i=2; i<=NF; i++) {
                print $1, 1798+i, $i
            }
        }' /home/andres/population_total.csv
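To check the loop on something small, this sketch feeds it one made-up line with a country name and three values. It assumes, as in the commands above, that field 2 corresponds to the year 1800.

```shell
# Made-up line in the same shape: country, then one value per year
printf 'Aland,100,110,120\n' |
awk -F ',' '{ for (i = 2; i <= NF; i++) {
                  print $1, 1798 + i, $i    # field 2 is year 1800, field 3 is 1801, ...
              }
            }'
# prints:
# Aland 1800 100
# Aland 1801 110
# Aland 1802 120
```

If the real file begins with a header row, that row is reshaped as well. A pattern such as NR > 1 in front of the action is one possible way to skip it.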

for loops have four parts

The general form of a for loop looks like this:

for(A;B;C){D}

  • A, B, and C are separated by ; (semicolon)
  • D is wrapped in {}

The four parts of for

for(A;B;C){D}

  • A is the initialization
  • B is the while condition, tested before each repetition
  • C is the update
  • D is one or more commands to be executed

The while condition

A, C and D are normal awk commands or assignments

B is a TRUE/FALSE condition

The D part is repeated while B is true

B must become FALSE at some point, otherwise the loop never finishes

Big Picture

  • A
  • B TRUE
  • D
  • C
  • B TRUE
  • D
  • C
  • B TRUE
  • D
  • C
  • B FALSE
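The same order can be written by hand with a while loop. In this sketch the for loop and the while loop do the same work. The variable names i and j and the limit 3 are arbitrary choices for illustration.

```shell
awk 'BEGIN {
    # for(A;B;C){D}
    for (i = 1; i <= 3; i++) { print "for", i }

    # the same thing: A, then while(B){D; C}
    j = 1
    while (j <= 3) { print "while", j; j++ }

    # both counters stop at the first value that makes B false
    print "after the loops:", i, j
}'
# prints:
# for 1
# for 2
# for 3
# while 1
# while 2
# while 3
# after the loops: 4 4
```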