November 22, 2018

## Regular Expression Flavors

The idea of regular expressions comes from Linguistics and Mathematics

Different people made slightly different computational versions

There are small differences between different programs. We say that Regular Expressions come in different “flavors”
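A small sketch of what "flavors" means in practice. It assumes a POSIX-style grep, where `grep -E` uses extended regex (like egrep) and `grep -F` uses fixed strings (like fgrep). The four sample lines are made up for illustration.

```shell
# Four sample lines; the last one is the pattern text itself.
lines=$(printf '%s\n' 'a' 'xyz' 'a|c' '[ab]|[cd]')

# Extended regex: | means "or", so any line containing a, b, c, or d matches.
extended=$(printf '%s\n' "$lines" | grep -E '[ab]|[cd]')

# Basic regex: | is an ordinary character, so the line must contain
# a or b, then a literal |, then c or d.
basic=$(printf '%s\n' "$lines" | grep '[ab]|[cd]')

# Fixed string: nothing is special, the whole text must appear literally.
fixed=$(printf '%s\n' "$lines" | grep -F '[ab]|[cd]')

printf 'extended:\n%s\nbasic:\n%s\nfixed:\n%s\n' "$extended" "$basic" "$fixed"
```

The same pattern selects three lines, one line, and a different single line, depending only on the flavor.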

In both basic and extended regex, the characters .*[]^$\ are special

In extended regex the symbols ?{}()+| are also special

You need to use \ to make them “normal”

In basic regex the symbols ?{}()+| are “normal”

But (usually) you can use \ to make them “special”

Extended regex (ab)+ can be written in basic regex as \(ab\)\+

## grep, egrep, fgrep

grep searches for basic regular expressions

egrep searches for extended regular expressions

fgrep searches for fixed strings

Example: searching for [ab]|[cd] has a different meaning in each case

## grep patterns must be written in quotes

grep and egrep patterns are regular expressions

Regular expressions can have ? and *

The shell can think that ? and * are wildcards

Wildcards are expanded by the shell before being passed as arguments to the program

That is bad for the regular expression

Always wrap the grep pattern in single quotes (')

## Revisiting the midterm exam

## Show the life expectancy and population of Turkey

world_2007.txt is a TAB-separated file. The columns represent “country”, “population”, “continent”, “life expectancy”, and “GDP per capita”.

grep 'Turkey' world_2007.txt | cut -f 4,2
71158647 71.777

But that is the wrong order! cut always prints the fields in file order

## Which is the African country with 135031164 inhabitants?

grep '135031164.*Africa' world_2007.txt
Nigeria 135031164 Africa 46.859 2013.977305

Which world countries have bigger populations than Nigeria?

Which countries have a little more than 9 million people?

## What is the total GDP of India?

Total GDP = GDP per capita * population

This is a question that we cannot solve with grep and cut

For these advanced questions we have another tool, called awk

## AWK

Created by Aho, Weinberger, and Kernighan, AWK is a programming language that permits

• easy manipulation of structured data
• generation of formatted reports

## What is the total GDP of India?

awk '$1=="India" {print $2 * $5}' world_2007.txt
2.72293e+12
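awk can also answer the earlier Turkey question in the requested order, which cut could not do. The sketch below is one possible approach. It pipes three rows copied from world_2007.txt instead of reading the file, so that it runs on its own.

```shell
# Three rows copied from world_2007.txt (TAB-separated):
# country, population, continent, life expectancy, GDP per capita.
rows=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  India 1110396331 Asia 64.698 2452.210407 \
  Nigeria 135031164 Africa 46.859 2013.977305 \
  Turkey 71158647 Europe 71.777 8458.276384)

# cut emits fields in file order, whatever order -f lists them in.
with_cut=$(printf '%s\n' "$rows" | grep 'Turkey' | cut -f 4,2)

# awk prints the fields in the order we write them.
with_awk=$(printf '%s\n' "$rows" | awk '$1=="Turkey" {print $4, $2}')

echo "cut: $with_cut"
echo "awk: $with_awk"
```

With the real file, the awk part would simply be `awk '$1=="Turkey" {print $4, $2}' world_2007.txt`.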

In awk, each statement has two parts:

• a condition
• a block of commands to run when the condition is true
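The two parts can be seen side by side in a minimal sketch. The three rows are copied from world_2007.txt. The choice of condition and printed fields is only an illustration.

```shell
# Three rows copied from world_2007.txt (TAB-separated).
rows=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  Nigeria 135031164 Africa 46.859 2013.977305 \
  Somalia 9118773 Africa 48.159 926.1410683 \
  Sweden 9031088 Europe 80.884 33859.74835)

# condition: $3=="Africa"      block: {print $1, $4}
# The block runs only for the rows where the condition is true.
african=$(printf '%s\n' "$rows" | awk '$3=="Africa" {print $1, $4}')

echo "$african"
```

Sweden is skipped because its third field is not "Africa".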

## AWK works row by row

We write

awk 'statements' file1 file2 ...

awk processes each line of each file, one by one

It can process billions of lines, one by one

Each line is split on whitespace into fields

The columns of the file are called fields, the lines are called records
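Splitting on whitespace means that a space inside a column would also start a new field, which is presumably why the country names in world_2007.txt use underscores. The sketch below uses a made-up variant of the Sri_Lanka row. It also uses the standard `-F` option of awk, which is not covered in these slides, to split on TABs only.

```shell
# One record with a space inside the first TAB-separated column
# (hypothetical: the real file has Sri_Lanka).
record=$(printf 'Sri Lanka\t20378239\tAsia')

# Default splitting: any run of spaces or TABs separates fields,
# so "Lanka" becomes field 2.
default_second=$(printf '%s\n' "$record" | awk '{print $2}')

# With -F '\t' only TABs separate fields, so field 2 is the population.
tab_second=$(printf '%s\n' "$record" | awk -F '\t' '{print $2}')

echo "default: $default_second"
echo "tab only: $tab_second"
```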

## AWK commands

There are a few commands

You can always do man awk

For today we will use only print

print can take one or more arguments, separated by commas

The values of the arguments are printed to the standard output, separated by spaces, on a line of their own
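A short sketch of how print treats its arguments, using the Turkey row from world_2007.txt as input. The variable names are chosen only for this example.

```shell
# The Turkey row from world_2007.txt (TAB-separated).
row=$(printf 'Turkey\t71158647\tEurope\t71.777\t8458.276384')

# Arguments separated by commas are printed with one space between them.
with_comma=$(printf '%s\n' "$row" | awk '{print $1, $3}')

# Without the comma the two values are joined into a single string.
without_comma=$(printf '%s\n' "$row" | awk '{print $1 $3}')

# Each print starts a new output line, so two prints give two lines.
two_prints=$(printf '%s\n' "$row" | awk '{print $1; print $3}')

echo "$with_comma"
echo "$without_comma"
echo "$two_prints"
```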

## AWK automatic variables

There are many automatic variables in awk

$1, $2, and so on are the fields of each row

awk variables do not need $ to be read

## AWK conditions

The basic conditions are comparisons

awk '$1=="Turkey"' world_2007.txt
Turkey  71158647    Europe  71.777  8458.276384

awk '$2 > 135031164' world_2007.txt
Bangladesh 150448339 Asia 64.062 1391.253792
Brazil 190010647 Americas 72.39 9065.800825
China 1318683096 Asia 72.961 4959.114854
India 1110396331 Asia 64.698 2452.210407
Indonesia 223547000 Asia 70.65 3540.651564
Pakistan 169270617 Asia 65.483 2605.94758
United_States 301139947 Americas 78.242 42951.65309

## Awk statements have 2 optional parts

Each awk statement is like this

condition {command}

We can omit the command. Then awk automatically prints the whole line

We can omit the condition. Then the command runs for every line

## We can omit {print $0}

awk '$1=="Turkey" {print $0}' world_2007.txt
Turkey  71158647    Europe  71.777  8458.276384

can be shortened as

awk '$1=="Turkey"' world_2007.txt
Turkey 71158647 Europe 71.777 8458.276384

## Complex conditions

We can combine small conditions to make longer conditions

Population over 9 million AND (&&) population less than 10 million

awk '$2 > 9000000 && $2 < 10000000' world_2007.txt
Bolivia 9119152 Americas 65.554 3822.137084
Dominican_Republic 9319622 Americas 72.235 6025.374752
Guinea 9947814 Africa 56.007 942.6542111
Hungary 9956108 Europe 73.338 18008.94444
Somalia 9118773 Africa 48.159 926.1410683
Sweden 9031088 Europe 80.884 33859.74835

## Complex condition: Eurasian countries

Continent is “Europe” OR (||) “Asia”

awk '$3=="Europe" || $3=="Asia"' world_2007.txt
Afghanistan 31889923 Asia 43.828 974.5803384
Albania 3600523 Europe 76.423 5937.029526
Austria 8199783 Europe 79.829 36126.4927
Bahrain 708573 Asia 75.635 29796.04834
Bangladesh 150448339 Asia 64.062 1391.253792
Belgium 10392226 Europe 79.441 33692.60508
Bosnia_and_Herzegovina 4552198 Europe 74.852 7446.298803
Bulgaria 7322858 Europe 73.005 10680.79282
Cambodia 14131858 Asia 59.723 1713.778686
China 1318683096 Asia 72.961 4959.114854
Croatia 4493312 Europe 75.748 14619.22272
Czech_Republic 10228744 Europe 76.486 22833.30851
Denmark 5468120 Europe 78.332 35278.41874
Finland 5238460 Europe 79.313 33207.0844
France 61083916 Europe 80.657 30470.0167
Germany 82400996 Europe 79.406 32170.37442
Greece 10706290 Europe 79.483 27538.41188
Hong_Kong,_China 6980412 Asia 82.208 39724.97867
Hungary 9956108 Europe 73.338 18008.94444
Iceland 301931 Europe 81.757 36180.78919
India 1110396331 Asia 64.698 2452.210407
Indonesia 223547000 Asia 70.65 3540.651564
Iran 69453570 Asia 70.964 11605.71449
Iraq 27499638 Asia 59.545 4471.061906
Ireland 4109086 Europe 78.885 40675.99635
Israel 6426679 Asia 80.745 25523.2771
Italy 58147733 Europe 80.546 28569.7197
Japan 127467972 Asia 82.603 31656.06806
Jordan 6053193 Asia 72.535 4519.461171
Korea,_Dem._Rep. 23301725 Asia 67.297 1593.06548
Korea,_Rep. 49044790 Asia 78.623 23348.13973
Kuwait 2505559 Asia 77.588 47306.98978
Lebanon 3921278 Asia 71.993 10461.05868
Malaysia 24821286 Asia 74.241 12451.6558
Mongolia 2874127 Asia 66.803 3095.772271
Montenegro 684736 Europe 74.543 9253.896111
Myanmar 47761980 Asia 62.069 944
Nepal 28901790 Asia 63.785 1091.359778
Netherlands 16570613 Europe 79.762 36797.93332
Norway 4627926 Europe 80.196 49357.19017
Oman 3204897 Asia 75.64 22316.19287
Pakistan 169270617 Asia 65.483 2605.94758
Philippines 91077287 Asia 71.688 3190.481016
Poland 38518241 Europe 75.563 15389.92468
Portugal 10642836 Europe 78.098 20509.64777
Romania 22276056 Europe 72.476 10808.47561
Saudi_Arabia 27601038 Asia 72.777 21654.83194
Serbia 10150265 Europe 74.002 9786.534714
Singapore 4553009 Asia 79.972 47143.17964
Slovak_Republic 5447502 Europe 74.663 18678.31435
Slovenia 2009245 Europe 77.926 25768.25759
Spain 40448191 Europe 80.941 28821.0637
Sri_Lanka 20378239 Asia 72.396 3970.095407
Sweden 9031088 Europe 80.884 33859.74835
Switzerland 7554661 Europe 81.701 37506.41907
Syria 19314747 Asia 74.143 4184.548089
Taiwan 23174294 Asia 78.4 28718.27684
Thailand 65068149 Asia 70.616 7458.396327
Turkey 71158647 Europe 71.777 8458.276384
United_Kingdom 60776238 Europe 79.425 33203.26128
Vietnam 85262356 Asia 74.249 2441.576404
West_Bank_and_Gaza 4018332 Asia 73.422 3025.349798
Yemen,_Rep. 22211743 Asia 62.698 2280.769906

## Advanced conditions

We can use regular expressions as conditions

awk '/Turkey/' world_2007.txt
Turkey 71158647 Europe 71.777 8458.276384

is the same as

grep 'Turkey' world_2007.txt
Turkey 71158647 Europe 71.777 8458.276384

## Print the record number

## Avoiding head

We can decide to print or not based on the row number

awk 'NR <= 10 {print NR, $0}' science.txt
1 The Electronic Telegraph  Thursday 28 September 1995  Science
2
3 This summer the Royal Observatory at Herstmonceux
4 found new life as a science centre. Andro Linklater
5 celebrates a partial victory for the heritage
6
7 THE SIGHT of a child's top spinning unsupported in mid-air should have been
8 surprising. Rotating there in space, it not only defied the rules of gravity,
9 it defied common sense, and at least three Fellows of the Royal Society gazed
10 at it in something close to wonder.
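The comparison with head can be checked directly. In this sketch, five made-up lines stand in for science.txt. NR is the number of the current record, as in the command above.

```shell
# Five made-up lines standing in for science.txt.
text=$(printf '%s\n' 'one' 'two' 'three' 'four' 'five')

# NR <= 3 is true for the first three records; with no command,
# awk prints the whole line.
with_awk=$(printf '%s\n' "$text" | awk 'NR <= 3')
with_head=$(printf '%s\n' "$text" | head -n 3)

# Unlike head, the same idea can select any range of lines.
middle=$(printf '%s\n' "$text" | awk 'NR >= 2 && NR <= 4 {print NR, $0}')

echo "$with_awk"
echo "$middle"
```

Here `with_awk` and `with_head` hold the same three lines, and `middle` holds lines 2 to 4 with their numbers in front.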

## Counting words

Each word is a field

awk 'NR <= 10 {print NR, NF, $0}' science.txt
1 8 The Electronic Telegraph Thursday 28 September 1995 Science
2 0
3 7 This summer the Royal Observatory at Herstmonceux
4 9 found new life as a science centre. Andro Linklater
5 7 celebrates a partial victory for the heritage
6 0
7 13 THE SIGHT of a child's top spinning unsupported in mid-air should have been
8 13 surprising. Rotating there in space, it not only defied the rules of gravity,
9 14 it defied common sense, and at least three Fellows of the Royal Society gazed
10 7 at it in something close to wonder.

## Print non-empty lines

## Print the last word of every line

## Command without condition

If there is no condition, the command is run for every line

awk '{print $0, $2 * $5}' world_2007.txt
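A minimal sketch of a command without a condition, run on two rows copied from world_2007.txt instead of the whole file. Every input line comes back with its total GDP appended. The only value checked against the slides is the one for India.

```shell
# Two rows copied from world_2007.txt (TAB-separated).
rows=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  India 1110396331 Asia 64.698 2452.210407 \
  Nigeria 135031164 Africa 46.859 2013.977305)

# No condition: the block runs for every record.
# $0 is the whole line; $2 * $5 is population times GDP per capita.
with_total=$(printf '%s\n' "$rows" | awk '{print $0, $2 * $5}')

echo "$with_total"
```

The India line ends in 2.72293e+12, the same total GDP computed earlier with the condition `$1=="India"`.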
