December 12, 2018

This course is easy

if you do the correct things

  • First, you need to understand the question in your own language
    • Make a drawing
  • Decompose the problem in smaller parts
    • You can eat an elephant, piece by piece
  • Write your answer in your own words
  • Translate them to the computer language
    • In this case, awk

Finishing Quiz 3

What is the total sum of the last field of each line of population_total.csv?

7595618719

What is the total sum of the last field of each line of population_total.csv?

awk -F, '{x+=$NF} END{print x}' population_total.csv
7595618719

What is the average of the last field of each line of population_total.csv?

3.87532e+07

What is the average of the last field of each line of population_total.csv?

awk -F, '{x+=$NF} END {print x/NR}' population_total.csv
3.87532e+07

What is the standard deviation of the last field of each line of population_total.csv?

Standard deviation is the square root of the average of the squares minus the square of the average \(\sqrt{\sum x_i^2/n -(\sum x_i/n)^2}\)

1.44157e+08

What is the standard deviation of the last field of each line of population_total.csv?

Standard deviation is the square root of the average of the squares minus the square of the average \(\sqrt{\sum x_i^2/n -(\sum x_i/n)^2}\)

awk -F, '{x+=$NF; xx+=$NF*$NF} END {print sqrt(xx/NR-(x/NR)^2)}' population_total.csv
1.44157e+08

What is the sum of all fields in each line of population_total.csv?

Print the first field and the sum from the second until the last field

geo 418071
Afghanistan 1821590000
Albania 279593000
Algeria 2249290000
Andorra 3693700
Angola 1257910000
Antigua and Barbuda 10420200
Argentina 2764950000
Armenia 297897000
Australia 1503897000

What is the sum of all fields in each line of population_total.csv?

Print the first field and the sum from the second until the last field

awk -F, '{x=0; for(i=2;i<=NF;i++) x+=$i; print $1,x}' population_total.csv
geo 418071
Afghanistan 1821590000
Albania 279593000
Algeria 2249290000
Andorra 3693700
Angola 1257910000
Antigua and Barbuda 10420200
Argentina 2764950000
Armenia 297897000
Australia 1503897000
Austria 1279710000
Azerbaijan 685730000
Bahamas 22319800
Bahrain 48095700
Bangladesh 10626000000
Barbados 39419300
Belarus 1274430000
Belgium 1531950000
Belize 17909900
Benin 532698000
Bhutan 49752900
Bolivia 707200000
Bosnia and Herzegovina 479510000
Botswana 106085000
Brazil 11557500000
Brunei 17604630
Bulgaria 1098060000
Burkina Faso 1018390000
Burundi 572102000
Cambodia 1041530000
Cameroon 1128110000
Canada 2503430000
Cape Verde 38099400
Central African Republic 290070000
Chad 665920000
Chile 1204090000
China 132726000000
Colombia 2689440000
Comoros 41432600
"Congo 3604470000
"Congo 241146000
Costa Rica 232489500
Cote d'Ivoire 1013800000
Croatia 625620000
Cuba 944775000
Cyprus 89447000
Czech Republic 1724170000
Denmark 679970000
Djibouti 36380500
Dominica 9028500
Dominican Republic 523119200
Ecuador 831010000
Egypt 4968770000
El Salvador 426370000
Equatorial Guinea 49993600
Eritrea 263677000
Estonia 183792000
Ethiopia 4504540000
Fiji 66474000
Finland 664189000
France 9359300000
Gabon 106823000
Gambia 86213000
Georgia 601360000
Germany 12151400000
Ghana 1302850000
Greece 1300570000
Grenada 13633400
Guatemala 788227000
Guinea 701925000
Guinea-Bissau 119402000
Guyana 86093000
Haiti 671129000
Holy See 190984
Honduras 399364000
Hungary 1636660000
Iceland 30312400
India 89583000000
Indonesia 16110000000
Iran 4650640000
Iraq 1518760000
Ireland 842400000
Israel 358215000
Italy 8293800000
Jamaica 250004000
Japan 14312300000
Jordan 292151000
Kazakhstan 1511940000
Kenya 1954660000
Kiribati 8283000
Kuwait 122806500
Kyrgyz Republic 407540000
Lao 412947000
Latvia 317393000
Lebanon 286781000
Lesotho 165359000
Liberia 233824000
Libya 370381000
Liechtenstein 3134060
Lithuania 430861000
Luxembourg 55615000
"Macedonia 230064000
Madagascar 1186320000
Malawi 729538000
Malaysia 1440308000
Maldives 22903300
Mali 1003400000
Malta 54703000
Marshall Islands 3710240
Mauritania 212018000
Mauritius 108111100
Mexico 6982060000
"Micronesia 8758600
Moldova 451656000
Monaco 3792770
Mongolia 229458000
Montenegro 72169000
Morocco 2176320000
Mozambique 1511210000
Myanmar 3617140000
Namibia 127313000
Nauru 878140
Nepal 1994400000
Netherlands 1670160000
New Zealand 319949500
Nicaragua 328053000
Niger 849830000
Nigeria 8877100000
North Korea 2062780000
Norway 567112000
Oman 173507000
Pakistan 9611700000
Palau 1788940
Palestine 197401000
Panama 196821800
Papua New Guinea 451632000
Paraguay 336570000
Peru 1816200000
Philippines 4808490000
Poland 5206860000
Portugal 1394980000
Qatar 43897100
Romania 2914400000
Russia 17712300000
Rwanda 628604000
Samoa 17175700
San Marino 2801270
Sao Tome and Principe 13086100
Saudi Arabia 1346490000
Senegal 738390000
Serbia 1158230000
Seychelles 7323900
Sierra Leone 463018000
Singapore 240669500
Slovak Republic 733840000
Slovenia 249881000
Solomon Islands 30615900
Somalia 688761000
South Africa 2979100000
South Korea 4260860000
South Sudan 640330000
Spain 5216400000
Sri Lanka 1546170000
St. Kitts and Nevis 7374600
St. Lucia 14684400
St. Vincent and the Grenadines 12393900
Sudan 2032470000
Suriname 46721600
Swaziland 66899900
Sweden 1231880000
Switzerland 889790000
Syria 1043870000
Tajikistan 446131000
Tanzania 2326450000
Thailand 4488820000
Timor-Leste 93099000
Togo 368450000
Tonga 9328700
Trinidad and Tobago 107014300
Tunisia 734533000
Turkey 5287380000
Turkmenistan 326048000
Tuvalu 1099590
Uganda 1631780000
Ukraine 6255800000
United Arab Emirates 181973800
United Kingdom 8517400000
United States 25488660000
Uruguay 307526500
Uzbekistan 1766260000
Vanuatu 14921800
Venezuela 1545365000
Vietnam 6029180000
Yemen 1319800000
Zambia 666097000
Zimbabwe 853410000

What is the average of all fields in each line of population_total.csv?

Modify the previous command to print average instead of sum. Skip the first line

Afghanistan 8.31776e+06
Albania 1.27668e+06
Algeria 1.02707e+07
Andorra 16866.2
Angola 5.74388e+06
Antigua and Barbuda 47580.8
Argentina 1.26253e+07
Armenia 1.36026e+06
Australia 6.86711e+06
Austria 5.84342e+06

What is the average of all fields in each line of population_total.csv?

Modify the previous command to print average instead of sum. Skip the first line

awk -F, 'NR>1 {x=0; for(i=2;i<=NF;i++) x+=$i; print $1,x/(NF-1)}' population_total.csv
Afghanistan 8.31776e+06
Albania 1.27668e+06
Algeria 1.02707e+07
Andorra 16866.2
Angola 5.74388e+06
Antigua and Barbuda 47580.8
Argentina 1.26253e+07
Armenia 1.36026e+06
Australia 6.86711e+06
Austria 5.84342e+06
Azerbaijan 3.13119e+06
Bahamas 101917
Bahrain 219615
Bangladesh 4.85205e+07
Barbados 179997
Belarus 5.81932e+06
Belgium 6.99521e+06
Belize 81780.4
Benin 2.43241e+06
Bhutan 227182
Bolivia 3.22922e+06
Bosnia and Herzegovina 2.18954e+06
Botswana 484406
Brazil 5.2774e+07
Brunei 80386.4
Bulgaria 5.01397e+06
Burkina Faso 4.65018e+06
Burundi 2.61234e+06
Cambodia 4.75584e+06
Cameroon 5.15119e+06
Canada 1.14312e+07
Cape Verde 173970
Central African Republic 1.32452e+06
Chad 3.04073e+06
Chile 5.49813e+06
China 6.06055e+08
Colombia 1.22805e+07
Comoros 189190
"Congo 1.6384e+07
"Congo 1.09612e+06
Costa Rica 1.0616e+06
Cote d'Ivoire 4.62922e+06
Croatia 2.85671e+06
Cuba 4.31404e+06
Cyprus 408434
Czech Republic 7.87292e+06
Denmark 3.10489e+06
Djibouti 166121
Dominica 41226
Dominican Republic 2.38867e+06
Ecuador 3.79457e+06
Egypt 2.26884e+07
El Salvador 1.94689e+06
Equatorial Guinea 228281
Eritrea 1.204e+06
Estonia 839233
Ethiopia 2.05687e+07
Fiji 303534
Finland 3.03283e+06
France 4.27365e+07
Gabon 487776
Gambia 393667
Georgia 2.74594e+06
Germany 5.54858e+07
Ghana 5.94909e+06
Greece 5.93868e+06
Grenada 62253
Guatemala 3.59921e+06
Guinea 3.20514e+06
Guinea-Bissau 545215
Guyana 393119
Haiti 3.06452e+06
Holy See 872.073
Honduras 1.82358e+06
Hungary 7.47333e+06
Iceland 138413
India 4.09055e+08
Indonesia 7.35616e+07
Iran 2.12358e+07
Iraq 6.93498e+06
Ireland 3.84658e+06
Israel 1.63568e+06
Italy 3.78712e+07
Jamaica 1.14157e+06
Japan 6.5353e+07
Jordan 1.33402e+06
Kazakhstan 6.90384e+06
Kenya 8.92539e+06
Kiribati 37821.9
Kuwait 560760
Kyrgyz Republic 1.86091e+06
Lao 1.8856e+06
Latvia 1.44928e+06
Lebanon 1.3095e+06
Lesotho 755064
Liberia 1.06769e+06
Libya 1.69124e+06
Liechtenstein 14310.8
Lithuania 1.9674e+06
Luxembourg 253950
"Macedonia 1.04575e+06
Madagascar 5.41699e+06
Malawi 3.33122e+06
Malaysia 6.57675e+06
Maldives 104581
Mali 4.58174e+06
Malta 249785
Marshall Islands 16941.7
Mauritania 968119
Mauritius 493658
Mexico 3.18816e+07
"Micronesia 39811.8
Moldova 2.06236e+06
Monaco 17318.6
Mongolia 1.04775e+06
Montenegro 329539
Morocco 9.93753e+06
Mozambique 6.9005e+06
Myanmar 1.65166e+07
Namibia 581338
Nauru 4009.77
Nepal 9.10685e+06
Netherlands 7.6263e+06
New Zealand 1.46096e+06
Nicaragua 1.49796e+06
Niger 3.8805e+06
Nigeria 4.05347e+07
North Korea 9.41909e+06
Norway 2.58955e+06
Oman 792269
Pakistan 4.3889e+07
Palau 8168.68
Palestine 901374
Panama 898730
Papua New Guinea 2.06225e+06
Paraguay 1.53685e+06
Peru 8.29315e+06
Philippines 2.19566e+07
Poland 2.37756e+07
Portugal 6.36977e+06
Qatar 200443
Romania 1.33078e+07
Russia 8.08781e+07
Rwanda 2.87034e+06
Samoa 78427.9
San Marino 12791.2
Sao Tome and Principe 59753.9
Saudi Arabia 6.14836e+06
Senegal 3.37164e+06
Serbia 5.28872e+06
Seychelles 33442.5
Sierra Leone 2.11424e+06
Singapore 1.09895e+06
Slovak Republic 3.35087e+06
Slovenia 1.14101e+06
Solomon Islands 139799
Somalia 3.14503e+06
South Africa 1.36032e+07
South Korea 1.9456e+07
South Sudan 2.92388e+06
Spain 2.38192e+07
Sri Lanka 7.06014e+06
St. Kitts and Nevis 33674
St. Lucia 67052.1
St. Vincent and the Grenadines 56593.2
Sudan 9.28068e+06
Suriname 213341
Swaziland 305479
Sweden 5.62502e+06
Switzerland 4.06297e+06
Syria 4.76653e+06
Tajikistan 2.03713e+06
Tanzania 1.06231e+07
Thailand 2.04969e+07
Timor-Leste 425110
Togo 1.68242e+06
Tonga 42596.8
Trinidad and Tobago 488650
Tunisia 3.35403e+06
Turkey 2.41433e+07
Turkmenistan 1.4888e+06
Tuvalu 5020.96
Uganda 7.45105e+06
Ukraine 2.85653e+07
United Arab Emirates 830931
United Kingdom 3.88922e+07
United States 1.16387e+08
Uruguay 1.40423e+06
Uzbekistan 8.06511e+06
Vanuatu 68136.1
Venezuela 7.05646e+06
Vietnam 2.75305e+07
Yemen 6.02648e+06
Zambia 3.04154e+06
Zimbabwe 3.89685e+06

Pivot table: Transform the table from one row to multiple rows

Show only the data for Turkey

Turkey 9770000
Turkey 9790000
Turkey 9800000
Turkey 9820000
Turkey 9830000
Turkey 9850000
Turkey 9860000
Turkey 9880000
Turkey 9890000
Turkey 9910000

Pivot table: Transform the table from one row to multiple rows

Show only the data for Turkey

awk -F, '/Turkey/ {for(i=2;i<=NF;i++) print $1,$i}' population_total.csv
Turkey 9770000
Turkey 9790000
Turkey 9800000
Turkey 9820000
Turkey 9830000
Turkey 9850000
Turkey 9860000
Turkey 9880000
Turkey 9890000
Turkey 9910000
Turkey 9920000
Turkey 9940000
Turkey 9960000
Turkey 9970000
Turkey 9990000
Turkey 10000000
Turkey 10000000
Turkey 10100000
Turkey 10100000
Turkey 10100000
Turkey 10100000
Turkey 10100000
Turkey 10200000
Turkey 10200000
Turkey 10200000
Turkey 10200000
Turkey 10300000
Turkey 10300000
Turkey 10300000
Turkey 10400000
Turkey 10400000
Turkey 10400000
Turkey 10500000
Turkey 10500000
Turkey 10500000
Turkey 10600000
Turkey 10600000
Turkey 10600000
Turkey 10700000
Turkey 10700000
Turkey 10700000
Turkey 10800000
Turkey 10800000
Turkey 10800000
Turkey 10900000
Turkey 10900000
Turkey 10900000
Turkey 11000000
Turkey 11000000
Turkey 11000000
Turkey 11100000
Turkey 11100000
Turkey 11100000
Turkey 11200000
Turkey 11200000
Turkey 11300000
Turkey 11300000
Turkey 11300000
Turkey 11400000
Turkey 11400000
Turkey 11400000
Turkey 11500000
Turkey 11500000
Turkey 11600000
Turkey 11600000
Turkey 11700000
Turkey 11700000
Turkey 11700000
Turkey 11800000
Turkey 11800000
Turkey 11900000
Turkey 12000000
Turkey 12000000
Turkey 12100000
Turkey 12100000
Turkey 12200000
Turkey 12200000
Turkey 12300000
Turkey 12400000
Turkey 12400000
Turkey 12500000
Turkey 12600000
Turkey 12700000
Turkey 12700000
Turkey 12800000
Turkey 12900000
Turkey 13000000
Turkey 13000000
Turkey 13100000
Turkey 13200000
Turkey 13300000
Turkey 13300000
Turkey 13400000
Turkey 13500000
Turkey 13600000
Turkey 13600000
Turkey 13700000
Turkey 13800000
Turkey 13900000
Turkey 14000000
Turkey 14000000
Turkey 14100000
Turkey 14200000
Turkey 14300000
Turkey 14400000
Turkey 14400000
Turkey 14500000
Turkey 14600000
Turkey 14700000
Turkey 14800000
Turkey 14900000
Turkey 14900000
Turkey 15000000
Turkey 15000000
Turkey 15000000
Turkey 14900000
Turkey 14800000
Turkey 14700000
Turkey 14500000
Turkey 14400000
Turkey 14300000
Turkey 14200000
Turkey 14100000
Turkey 14100000
Turkey 14100000
Turkey 14200000
Turkey 14300000
Turkey 14400000
Turkey 14600000
Turkey 14800000
Turkey 15100000
Turkey 15300000
Turkey 15500000
Turkey 15800000
Turkey 16000000
Turkey 16300000
Turkey 16600000
Turkey 16900000
Turkey 17200000
Turkey 17600000
Turkey 17900000
Turkey 18100000
Turkey 18300000
Turkey 18500000
Turkey 18700000
Turkey 19000000
Turkey 19400000
Turkey 19800000
Turkey 20300000
Turkey 20800000
Turkey 21400000
Turkey 22000000
Turkey 22500000
Turkey 23100000
Turkey 23700000
Turkey 24300000
Turkey 24900000
Turkey 25500000
Turkey 26200000
Turkey 26800000
Turkey 27500000
Turkey 28100000
Turkey 28800000
Turkey 29500000
Turkey 30200000
Turkey 31000000
Turkey 31700000
Turkey 32500000
Turkey 33300000
Turkey 34100000
Turkey 34900000
Turkey 35700000
Turkey 36600000
Turkey 37500000
Turkey 38400000
Turkey 39300000
Turkey 40200000
Turkey 41100000
Turkey 42000000
Turkey 43000000
Turkey 44000000
Turkey 45000000
Turkey 46000000
Turkey 47100000
Turkey 48100000
Turkey 49100000
Turkey 50100000
Turkey 51100000
Turkey 52100000
Turkey 53000000
Turkey 53900000
Turkey 54800000
Turkey 55700000
Turkey 56700000
Turkey 57600000
Turkey 58500000
Turkey 59400000
Turkey 60400000
Turkey 61300000
Turkey 62300000
Turkey 63200000
Turkey 64200000
Turkey 65100000
Turkey 66100000
Turkey 67000000
Turkey 67900000
Turkey 68800000
Turkey 69600000
Turkey 70400000
Turkey 71300000
Turkey 72300000
Turkey 73400000
Turkey 74600000
Turkey 75800000
Turkey 77000000
Turkey 78300000
Turkey 79500000
Turkey 80700000
Turkey 81900000

Modify the previous command to print only the last 5 values

Show only the data for Turkey

Turkey 77000000
Turkey 78300000
Turkey 79500000
Turkey 80700000
Turkey 81900000

Modify the previous command to print only the last 5 values

Show only the data for Turkey

awk -F, '/Turkey/ {for(i=NF-4; i<=NF; i++) print $1,$i}' population_total.csv
Turkey 77000000
Turkey 78300000
Turkey 79500000
Turkey 80700000
Turkey 81900000

Summary

This course is easy

if you do the correct things

  • First, you need to understand the question in your own language
    • Make a drawing
  • Decompose the problem in smaller parts
    • You can eat an elephant, piece by piece
  • Write your answer in your own words
  • Translate them to the computer language
    • In this case, awk

Arrays

How many countries on each continent?

awk '$3=="America" {n_America++}
$3=="Africa" {n_Africa++}
$3=="Asia" {n_Asia++}
$3=="Europe" {n_Europe++}
$3=="Oceania" {n_Oceania++}
END {print "America", n_America;
     print "Africa", n_Africa;
     print "Asia", n_Asia;
     print "Europe", n_Europe;
     print "Oceania", n_Oceania;
}' world_2007.txt

Arrays make this easier

awk '$3=="America" {n["America"]++}
$3=="Africa" {n["Africa"]++}
$3=="Asia" {n["Asia"]++}
$3=="Europe" {n["Europe"]++}
$3=="Oceania" {n["Oceania"]++}
END {print "America", n["America"];
     print "Africa", n["Africa"];
     print "Asia", n["Asia"];
     print "Europe", n["Europe"];
     print "Oceania", n["Oceania"];
}' world_2007.txt

$3 is the continent

awk '{n[$3]++}
END {print "America", n["America"];
     print "Africa", n["Africa"];
     print "Asia", n["Asia"];
     print "Europe", n["Europe"];
     print "Oceania", n["Oceania"];
}' world_2007.txt

Repeat commands using for

awk '{n[$3]++}
END {for(continent in n) {
    print continent, n[continent]
    }
}' world_2007.txt

Notice that the output may not be in order

Parts of an array

One array contains several elements

They are pairs of key, and values

We can access the values using []

we write the key inside []

we can read or write the value

array[key] =  value

Exercise: Frequency table

histogram

Using the int() function we can round the life expectancy of each country

What is the absolute frequency of each life expectancy?

39 1
42 5
43 2
44 1
45 1
46 4
48 3
49 2
50 3
51 2
52 4
54 3
55 1
56 4
58 3
59 4
60 2
62 2
63 2
64 3
65 4
66 1
67 1
69 1
70 5
71 8
72 12
73 6
74 8
75 6
76 5
77 2
78 10
79 8
80 8
81 3
82 2

split a text

split(s, a [, r])
Split the string s into the array a on the regular expression r, and return the number of fields.
If r is omitted, FS is used instead.
The array a is cleared (deleted) first.

Example: DNA sequencing cost

Take a look at /home/andres/sequencingcostdata.txt

Date     Cost per Mb     Cost per Genome 
Sep-01  5292.39 95263072
Mar-02  3898.64 70175437
Sep-02  3413.80 61448422

How can we transform Date into number of months since January 2000?

First step

  • File is separated by tab
  • First line can be ignored
  • Date is on the first field
    • something like: Sep-02
    • We have to split it in month and year
  • We need to convert month into a number
  • We need a formula for number of months

Skipping lines

There are several ways to ignore the first line

One useful way is the command next

awk 'NR==1 {next} {print $0}' sequencingcostdata.txt

next means skip the rest of the commands and process the next line

Splitting

awk 'NR==1 {next} {split($1, date, "-");
    print $1, date[1], date[2]}' sequencingcostdata.txt

split creates arrays with numeric indices