Class 6

Welcome back

to “Computing for Molecular Biology 1”

Data structures in R

Lists

Like vectors, but they can mix different kinds of elements

people <- list(
    c(60,72,57,90,95, 72),
    c(1.75,1.80,1.65,1.90,1.74, 1.91),
    c("Peter", "John", "Frank",
      "Huey", "Dewey", "Louie"),
    TRUE,
    factor(rep("M",6),
          levels=c("M","F")))

Notice that elements can have different lengths

Result

people

[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

[[3]]
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

[[4]]
[1] TRUE

[[5]]
[1] M M M M M M
Levels: M F

Indexing Lists

Can be indexed the same way as vectors
Returns a sub-list

people[1:2]

[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Elements of Lists

people[1]

[[1]]
[1] 60 72 57 90 95 72

It is a sublist

people[[1]]

[1] 60 72 57 90 95 72

It is an element

Lists with Names

people <- list(
    weight=c(60,72,57,90,95, 72),
    height=c(1.75,1.80,1.65,1.90,1.74, 1.91),
    names=c("Peter", "John", "Frank",
            "Huey", "Dewey", "Louie"),
    valid=TRUE,
    gender=factor(rep("M",6),
           levels=c("M","F")))

(How else can we assign names?)
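One possible answer, as a minimal sketch: names can also be attached after the list exists. The lists `people2` and `people3` below are shortened, hypothetical versions of `people`, used only for illustration.

```r
# A shortened list created without names (values invented for illustration)
people2 <- list(c(60, 72, 57), c(1.75, 1.80, 1.65), TRUE)

# Option 1: assign all the names at once with names()
names(people2) <- c("weight", "height", "valid")

# Option 2: setNames() returns a named copy in a single step
people3 <- setNames(list(c(60, 72, 57), TRUE), c("weight", "valid"))

# Option 3: assigning to a name that does not exist yet adds a named element
people3$height <- c(1.75, 1.80, 1.65)

names(people2)
names(people3)
```

Note that `people3` ends with `height` as its last element, because new elements are appended at the end of the list.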

Lists with Names

people

$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

$valid
[1] TRUE

$gender
[1] M M M M M M
Levels: M F

Indexing Lists with Names

Can be indexed the same way as vectors
Returns a sub-list

people[1:2]

$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Elements of Lists with Names

This is a sublist:

people[1]

$weight
[1] 60 72 57 90 95 72

This is a single element:

people[[1]]

[1] 60 72 57 90 95 72

Equivalent to people[["weight"]]
Also equivalent to people$weight

Indexing Lists

List elements are indexed by [[]]
sublists are indexed by []

Try these

people[[2]]
people[2]
people[[2]][3]
people[2][3]
people[[1:3]]
people[1:3]
people[["weight"]]
people$weight
people["weight"]

Result

people[[2]]

[1] 1.75 1.80 1.65 1.90 1.74 1.91

people[2]

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

people[[2]][3]

[1] 1.65

people[2][3]

$<NA>
NULL

Result

people[[1:3]]

Error in people[[1:3]]: recursive indexing failed at level 2

people[1:3]

$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

Result

people[["weight"]]

[1] 60 72 57 90 95 72

people$weight

[1] 60 72 57 90 95 72

people["weight"]

$weight
[1] 60 72 57 90 95 72

Quiz

If key <- "names",

What is the difference between the following?

people[[key]]
people[[names]]
people$key
people$names

Explain

Changing parts of a List

Indices can also be used to change specific parts of a list.

Try each of the following and explain the result:

people$names <- toupper(people$names)
people$BMI <- people$weight/people$height^2
people$valid <- NULL

Data Frames

Two-dimensional, similar to matrices
Each column can be of a different type

ppl <- data.frame(
    weight=c(60, 72, 57, 90, 95, 72),
    height=c(1.75, 1.80, 1.65, 1.90,
             1.74, 1.91),
    names=c("Peter", "John", "Frank",
            "Huey", "Dewey", "Louie"),
    gender=factor(rep("M",6),
             levels=c("F","M")))

Data Frame

ppl

  weight height names gender
1     60   1.75 Peter      M
2     72   1.80  John      M
3     57   1.65 Frank      M
4     90   1.90  Huey      M
5     95   1.74 Dewey      M
6     72   1.91 Louie      M
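Because a data frame is stored as a list of columns, the list indexing seen earlier also works on it, and matrix-style `[row, column]` indexing works as well. The following is a small sketch on a shortened copy of `ppl` (three rows only, to keep the output small).

```r
# Shortened version of ppl, for illustration
ppl <- data.frame(
    weight = c(60, 72, 57),
    height = c(1.75, 1.80, 1.65),
    names  = c("Peter", "John", "Frank"))

is.list(ppl)        # TRUE: a data frame is a list of columns
ppl$weight          # one column, as a vector
ppl[["height"]]     # the same idea, using [[ ]]
ppl[1:2]            # a smaller data frame with the first two columns
ppl[2, ]            # row 2, all columns (matrix-style indexing)
ppl[2, "names"]     # a single cell
```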

Connecting with the real world

Data frames are the natural way to read data from files
- and to write data to files
Look up the documentation of read.table()
Read the file birth.txt into the data frame birth
Run summary(birth)
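The layout of birth.txt is not shown here, so the sketch below uses a small invented table instead: it writes a data frame to a temporary file and reads it back. The column names, the values and the choice of `header = TRUE` are assumptions made for the example; the real file may need different arguments.

```r
# Hypothetical data: column names and values are invented for this example
example <- data.frame(
    sex  = c("F", "M", "F"),
    head = c(34.5, 35.0, 33.8))

# Write the data frame to a temporary text file, with a header line
path <- tempfile(fileext = ".txt")
write.table(example, file = path, row.names = FALSE, quote = FALSE)

# Read it back; header = TRUE says the first line holds the column names
example2 <- read.table(path, header = TRUE)
summary(example2)
```

If a file has no header line, or uses a separator other than white space, the `header` and `sep` arguments of read.table() are the first things to check.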

What is this?

Telling stories

Descriptive Statistics

We have data and we want to say something about them

How can we summarize all the values in a few numbers?

Let’s use the vector birth$head.

To make it easier, let’s call it v

v <- birth$head

Standard Data Descriptors

Number of elements
Location
Dispersion

Counting

length(v)
nrow(birth)
dim(birth)
table(birth$sex)

Location

If you have to describe the vector v with a single number X, which would it be?

If we have to replace every v[i] with a single number, which number is “the best”?

Location

Better to choose the one that is the “least wrong”

How can X be wrong?

Measuring error

Many alternatives

Number of times x!=v[i]
Absolute error sum(abs(v-x))
Quadratic error sum((v-x)^2)
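The three alternatives can be compared on a toy vector. This is only a sketch: the values of `v` and the candidate `x` are invented and have nothing to do with the birth data.

```r
v <- c(2, 3, 3, 7, 10)   # invented values, for illustration only
x <- 3                   # candidate number standing in for every v[i]

sum(x != v)       # number of mismatches: 3
sum(abs(v - x))   # absolute error:       12
sum((v - x)^2)    # quadratic error:      66
```

The quadratic error punishes the large deviation (10 versus 3) much more than the absolute error does.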

Absolute error

sum(abs(v-x))

Which x minimizes absolute error?

Median

If $x$ is the median of v, then

half of the values in v are smaller than x
half of the values in v are bigger than x
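A quick numerical check of the claim that the median minimizes the absolute error, as a sketch on invented values: try many candidates for x and keep the one with the smallest error. The helper name `abs_err` and the grid of candidates are choices made for this example.

```r
v <- c(2, 3, 3, 7, 10)                     # invented values
abs_err <- function(x) sum(abs(v - x))     # absolute error of a candidate x

candidates <- seq(0, 12, by = 0.5)         # grid of candidate values for x
errors <- sapply(candidates, abs_err)      # error of each candidate
best <- candidates[which.min(errors)]      # candidate with the smallest error

best
median(v)
```

Here v has an odd number of elements, so the best candidate is unique. With an even number of elements, every value between the two middle ones gives the same minimal error.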

Quadratic error

sum((v-x)^2)

Which x minimizes squared error?

Arithmetic Mean

The mean value of v is \[\mathrm{mean}(v) = \frac{1}{n}\sum_{i=1}^n v_i\] where $n$ is the length of v.

Sometimes it is written as $\bar{v}$

This value is usually called the average
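Why the mean minimizes the quadratic error can be seen with a short derivation, sketched here assuming only basic calculus. Taking the derivative of the quadratic error with respect to $x$:

\[\frac{d}{dx}\sum_{i=1}^n (v_i-x)^2 = -2\sum_{i=1}^n (v_i-x) = -2\left(\sum_{i=1}^n v_i - n x\right)\]

This is zero exactly when $x = \frac{1}{n}\sum_{i=1}^n v_i$, and the second derivative is $2n > 0$, so the mean is where the quadratic error is smallest.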

Location indices in R

Mode (no function for that in R)
Median: median(v)
Arithmetic Mean: mean(v)
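The mode can still be computed with a few lines of code. The function below is a hypothetical helper written for this example (the name `stat_mode` is invented, it is not part of R); it returns every value that reaches the highest count.

```r
# stat_mode is a hypothetical helper, not a base R function
stat_mode <- function(v) {
    values <- unique(v)                    # the distinct values in v
    counts <- tabulate(match(v, values))   # how many times each one occurs
    values[counts == max(counts)]          # all values with the highest count
}

stat_mode(c(2, 3, 3, 7, 10))        # 3
stat_mode(c("a", "b", "b", "a"))    # "a" "b": a tie, so both are returned
mode(c(2, 3, 3))                    # "numeric": base R's mode() is something else
```

Be careful with the name: base R does have a function called mode(), but it reports how the values are stored, not the most frequent value.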

Dispersion indices

If we approximate v by x, how good is this approximation?

Number of mismatches
Mean of absolute error
\[\mathrm{abs.err}(v,x) = \frac{1}{n}\sum_{i=1}^n \vert v_i-x\vert\]
In R code we write
```
sum(abs(v-x))/length(v)
```
It is minimized when x==median(v).

Mean of quadratic error

If $n$ is the length of $v$, then
\[\mathrm{quad.err}(v,x) = \frac{1}{n}\sum_{i=1}^n (v_i-x)^2\]
In R code we write

sum((v-x)^2)/length(v)

It is minimized when x==mean(v)

In that case this number is called the variance of the sample.

Variance and Standard Deviation

The variance of the sample v is

var(v) = sum((v-mean(v))^2)/length(v)

which is a number in squared units, so it is hard to compare with the mean value

The standard deviation of the sample is the square root of the variance

sd(v) = sqrt(sum((v-mean(v))^2)/length(v))

\[\mathrm{sd}(v) = \sqrt{\frac{1}{n}\sum_{i=1}^n (v_i-\bar{v})^2}\]

Variance and Standard Deviation

In many cases, including in R, people use a slightly different formula \[\mathrm{sd}(v) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (v_i-\bar{v})^2}\] Explaining the reason is a topic for a later course.
(Dividing by $n-1$ corrects the bias of the variance computed from a sample when it is used to estimate the variance of the population)

This value is used as an estimate of the standard deviation of the population

The difference is small, especially when $n$ is big
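The two formulas can be compared directly, as a sketch on invented values. R's built-in var() and sd() divide by n - 1; the names `var_n` and `var_n1` are chosen for this example.

```r
v <- c(2, 3, 3, 7, 10)     # invented values, for illustration only
n <- length(v)

var_n  <- sum((v - mean(v))^2) / n          # dividing by n
var_n1 <- sum((v - mean(v))^2) / (n - 1)    # dividing by n - 1

var_n                         # 9.2
var_n1                        # 11.5
var(v)                        # the built-in var() agrees with var_n1
sd(v)                         # the built-in sd() is sqrt(var_n1)
sqrt(var_n) / sqrt(var_n1)    # equals sqrt((n - 1) / n), close to 1 when n is big
```

With only five values the two results differ noticeably; with hundreds of values the ratio is very close to 1.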

Quartiles

Quart means one fourth.

If we split v into four sets of the same size

What are the limits of these sets?

\[ Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know $Q_0, Q_2$ and $Q_4$
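A small sketch of the easy cases, on invented values: the function quantile() returns all five limits with its default settings, and three of them coincide with numbers we already know how to compute.

```r
v <- c(2, 3, 3, 7, 10)     # invented values, for illustration only

q <- quantile(v)           # Q0 ... Q4, labelled 0%, 25%, 50%, 75%, 100%
q

min(v)       # same as Q0, the 0% entry
median(v)    # same as Q2, the 50% entry
max(v)       # same as Q4, the 100% entry
```

The values of $Q_1$ and $Q_3$ depend on the interpolation rule that quantile() uses, which is why they are less obvious.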