Class 16: Telling stories

Computing in Molecular Biology and Genetics 1

Andrés Aravena, PhD

November 23, 2019

We have our own data

library(readr)
students <- read_tsv("students2018-2020.tsv")
students
# A tibble: 117 x 10
   answer_date id    english_level sex   birthdate  birthplace
   <date>      <chr> <chr>         <chr> <date>     <chr>     
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 turkey    
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…
 3 2018-09-17  39df… I can read a… Fema… 1998-01-18 Batman, T…
 4 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya,T…
 5 2018-09-17  f22b… I can read a… Fema… 1998-05-03 izmir     
 6 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Türkiye /…
 7 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman,…
 8 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa     
 9 2018-09-17  2972… I can read a… Fema… 1999-01-02 İstanbul/…
10 2018-09-17  72c0… I can read a… Fema… 1998-10-02 İstanbul,…
# … with 107 more rows, and 4 more variables: height_cm <dbl>,
#   weight_kg <dbl>, handness <chr>, hand_span <dbl>

We can take a vector from our data

Today we will leave out the missing (NA) values

weight <- students$weight_kg[!is.na(students$weight_kg)]
weight
  [1]  67.0  55.0  74.0  68.0  58.0  72.0  68.0  58.0  55.0
 [10]  81.0  42.5  69.0  58.0  47.0  78.0  57.0  55.0  55.0
 [19]  65.0  60.0  50.0  52.0  54.0  75.0 105.0  56.0  50.0
 [28]  67.0  59.0  75.0  60.0  60.0 106.0  94.0  63.0  54.0
 [37]  53.0  75.0  70.0  65.0  65.0  55.0  68.0  55.0  80.0
 [46]  77.0  85.0  65.0  64.0  64.0  60.0  76.0  56.0  78.0
 [55]  77.0  72.0  58.0  66.0  52.0  73.0  82.0  55.0  86.0
 [64]  63.0  85.0  58.0  65.0  65.0  70.0  47.0  82.0  70.0
 [73]  75.0  47.0  72.0  61.0  79.0  55.0  74.0  47.0  54.0
 [82]  60.0  74.0  56.0  65.0  49.0  63.0  65.0  47.0  90.0
 [91]  90.0  76.0  88.0  80.0  72.0  47.0  61.0  95.0  67.0
[100]  80.0
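A minimal sketch of two ways to leave out NA values. The vector w is made up for illustration; it is not the class data. Many R functions, such as mean(), accept na.rm = TRUE as an alternative to filtering first.

```r
# Hypothetical toy vector with two missing values (not the class data)
w <- c(67, NA, 74, 58, NA, 81)

# Option 1: keep only the values that are not NA, as done above
w_clean <- w[!is.na(w)]
length(w_clean)            # 4 values remain

# Option 2: keep the vector as it is and ask the function to skip NA
mean(w)                    # NA, because some values are missing
mean(w, na.rm = TRUE)      # 70, same as mean(w_clean)
```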

So what?

Descriptive Statistics

We have data

We want to tell something about it

What can we tell about this set of numbers?

How can we make a summary of all the values using only a few numbers?

Standard Data Descriptors

  • Number of elements
    • How many?
  • Location
    • Where?
  • Dispersion
    • Are they homogeneous?
    • Are they similar to each other?

How many in total

  • For data frames and tibbles we use nrow()
nrow(students)
[1] 117
  • dim() gives us rows and columns
dim(students)
[1] 117  10
  • For vectors we use length()
length(weight)
[1] 100

Counting how many of each

table() counts how many times each value appears; a better name for it would be “count”

table(students$handness)

 Left Right 
   12   105 
table(students$sex)

Female   Male 
    77     39 

We can count combinations

This looks more like a table

table(students$handness, students$sex)
       
        Female Male
  Left       9    3
  Right     68   36

table() is not good with numeric values

table(weight)
weight
42.5   47   49   50   52   53   54   55   56   57   58   59 
   1    6    1    2    2    1    3    8    3    1    5    1 
  60   61   63   64   65   66   67   68   69   70   72   73 
   5    2    3    2    8    1    3    3    1    3    4    1 
  74   75   76   77   78   79   80   81   82   85   86   88 
   3    4    2    2    2    1    3    1    2    2    1    1 
  90   94   95  105  106 
   2    1    1    1    1 

It is not a good summary
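One possible way to still count numeric values is to group them into intervals first. This is a sketch with made-up weights and an arbitrary interval width of 20 kg; cut() is a base R function, and the choice of breaks is an assumption for illustration.

```r
# Hypothetical toy weights in kg (not the class data)
w <- c(47, 52, 55, 58, 63, 65, 68, 72, 75, 81, 94, 106)

# Group the values into 20 kg intervals.
# right = FALSE makes each interval include its left limit: [40,60), [60,80), ...
groups <- cut(w, breaks = seq(from = 40, to = 120, by = 20), right = FALSE)

# Now table() counts how many values fall in each interval
table(groups)
```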

What can we say?

Location

Where are the values?

Minimum and Maximum

The easiest are minimum and maximum

min(weight)
[1] 42.5
max(weight)
[1] 106

Range: min and max together

Sometimes it is useful to have both together

range(weight)
[1]  42.5 106.0

(This is like dim(): it combines two functions into one)
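As a small check on a made-up vector (not the class data), range() returns the same two numbers as min() and max() put together:

```r
# Hypothetical toy vector
w <- c(67, 55, 74, 42.5, 106)

range(w)                              # smallest and largest value
c(min(w), max(w))                     # the same two numbers
identical(range(w), c(min(w), max(w)))
```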

Average

If you have to describe the vector v with a single number x, which would it be?

If we have to replace every v[i] with one single number, which number is “the best”?

There are several possible answers to that question.

There are several possible averages

Arithmetic Mean

If \(\mathbf v=(v_1,\ldots,v_n)\) is a vector, then the arithmetic mean of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\]

This value is usually called just the mean of \(\mathbf v\)

In R we write mean(v)
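The formula can be written directly in R as a sum divided by the number of elements. A short sketch with a made-up vector:

```r
# Hypothetical toy vector
v <- c(2, 4, 9)

sum(v) / length(v)    # 5, the formula written by hand
mean(v)               # 5, the built-in function
```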

Other means

Besides arithmetic mean, we have

  • Geometric mean
  • Harmonic mean
  • Quadratic mean
  • Cubic mean

and many others

We use them only in a few specific places
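For reference, these means can be sketched with the usual textbook formulas. The function names below are hypothetical helpers written for this example; they are not part of base R.

```r
# Hypothetical helper functions, using the standard textbook definitions
geometric_mean <- function(v) exp(mean(log(v)))   # needs all values > 0
harmonic_mean  <- function(v) 1 / mean(1 / v)     # needs all values != 0
quadratic_mean <- function(v) sqrt(mean(v^2))
cubic_mean     <- function(v) mean(v^3)^(1/3)     # for non-negative values

# Hypothetical toy vector
v <- c(1, 2, 4)

mean(v)             # arithmetic mean
geometric_mean(v)   # 2, because 1 * 2 * 4 = 8 and the cube root of 8 is 2
harmonic_mean(v)
quadratic_mean(v)
cubic_mean(v)
```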

Median

If x is the median of v, then

  • half of the values in v are smaller than x
  • half of the values in v are bigger than x

The median is often used as “average”

As in “the average person thinks he or she is smarter than the average person”

Median and mean are usually different
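A small sketch of how the median relates to the sorted values, using made-up vectors. With an odd number of values the median is the middle one; with an even number, R takes the mean of the two middle values.

```r
# Hypothetical toy vector with an odd number of values
v <- c(7, 1, 5, 3, 100)
sort(v)        # 1 3 5 7 100
median(v)      # 5, the value in the middle
mean(v)        # 23.2, pulled up by the large value 100

# Hypothetical toy vector with an even number of values
u <- c(7, 1, 5, 3)
sort(u)        # 1 3 5 7
median(u)      # 4, the mean of the two middle values 3 and 5
```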

In our case

These are our values

mean(weight)
[1] 66.485
median(weight)
[1] 65

Median is robust

The problem with the mean is that it is very sensitive to extreme values

Imagine that one day an elephant comes to our class

What happens to “the average weight”?

An elephant joins us

mean( c(weight, 6000) )
[1] 125.2327
median( c(weight, 6000) )
[1] 65

Which one represents us better?

In summary

What are these values?

summary(weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.50   56.00   65.00   66.48   75.00  106.00 

What are 1st Qu. and 3rd Qu.?

First and Third Quartiles

“Quart” comes from Latin and means one fourth.

If we split the set of values into four subsets of the same size, what are the limits of these sets? \[Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know \(Q_0, Q_2\) and \(Q_4\)

Quartiles

\(Q_0\): Zero elements are smaller than this one
\(Q_1\): One quarter of the elements are smaller
\(Q_2\): Two quarters (half) of the elements are smaller
\(Q_3\): Three quarters of the elements are smaller
\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the minimum, \(Q_2\) is the median, and \(Q_4\) is the maximum
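This can be checked in R with quantile(), which returns the five values \(Q_0,\ldots,Q_4\) when called with no extra arguments. The vector below is made up for illustration.

```r
# Hypothetical toy vector
v <- c(3, 8, 1, 9, 5)

q <- quantile(v)   # five values, named "0%", "25%", "50%", "75%", "100%"
q

q[["0%"]]   == min(v)      # Q0 is the minimum
q[["50%"]]  == median(v)   # Q2 is the median
q[["100%"]] == max(v)      # Q4 is the maximum
```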

Quartiles and Quantiles

Generalizing, we can ask, for each percentage \(p\),

which is the value in the vector v that is greater than \(p\)% of all the values?

These values are called quantiles. When \(p\) is given as a percentage they are also called percentiles

Quartiles and Quantiles

The function in R for quantiles is called quantile()

By default it gives us the quartiles

quantile(weight)
   0%   25%   50%   75%  100% 
 42.5  56.0  65.0  75.0 106.0 

quantile() gives quartiles

unless we ask for something else

quantile(weight, seq(from=0, to=1, by=0.1))
   0%   10%   20%   30%   40%   50%   60%   70%   80%   90% 
 42.5  51.8  55.0  58.0  61.0  65.0  68.0  73.3  77.0  82.3 
 100% 
106.0 
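To see what a quantile means, we can ask for a single one and then count which fraction of the values is not larger than it. This is a sketch with a made-up vector, using the default method of quantile().

```r
# Hypothetical toy vector: the numbers 1, 2, ..., 100
v <- 1:100

# The quantile for p = 0.9 (the 90th percentile)
q90 <- quantile(v, 0.9)
q90                 # 90.1

# Fraction of the values that are smaller than or equal to q90
frac <- mean(v <= q90)
frac                # 0.9
```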

In summary()

summary(weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.50   56.00   65.00   66.48   75.00  106.00
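So the six numbers of summary() are the quartiles plus the mean. A short check on a made-up vector, building the same six numbers by hand:

```r
# Hypothetical toy vector
v <- c(3, 8, 1, 9, 5)

# Minimum, first quartile, median, mean, third quartile, maximum
by_hand <- c(min(v), quantile(v, 0.25), median(v),
             mean(v), quantile(v, 0.75), max(v))

summary(v)
all.equal(as.numeric(summary(v)), unname(by_hand))   # TRUE
```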