Welcome back

to “Computing for Molecular Biology 1”

Data structures in R

Lists

  • Like vectors, but they can mix different kinds of elements
people <- list(
    c(60,72,57,90,95, 72),
    c(1.75,1.80,1.65,1.90,1.74, 1.91),
    c("Peter", "John", "Frank",
      "Huey", "Dewey", "Louie"),
    TRUE,
    factor(rep("M",6),
          levels=c("M","F")))
  • Notice that elements can have different lengths

Result

people
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

[[3]]
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

[[4]]
[1] TRUE

[[5]]
[1] M M M M M M
Levels: M F

Indexing Lists

  • Can be indexed same as vectors
  • Returns a sub-list
people[1:2]
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Elements of Lists

people[1]
[[1]]
[1] 60 72 57 90 95 72
  • It is a sublist
people[[1]]
[1] 60 72 57 90 95 72
  • It is an element
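
One way to see the difference is to ask R what kind of object each form returns. The sketch below uses a small made-up list called p (an assumption, so the example runs on its own) instead of people.

```r
# A small illustrative list (hypothetical data)
p <- list(c(60, 72, 57), c("Peter", "John", "Frank"))

sub  <- p[1]     # single brackets: a list of length 1
elem <- p[[1]]   # double brackets: the numeric vector itself

class(sub)       # "list"
class(elem)      # "numeric"
length(sub)      # 1
length(elem)     # 3
```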

Lists with Names

people <- list(
    weight=c(60,72,57,90,95, 72),
    height=c(1.75,1.80,1.65,1.90,1.74, 1.91),
    names=c("Peter", "John", "Frank",
            "Huey", "Dewey", "Louie"),
    valid=TRUE,
    gender=factor(rep("M",6),
           levels=c("M","F")))

(How else can we assign names?)
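
One possible answer, as a minimal sketch: names can also be assigned after the list exists, using names(). The list p and the name "ok" are made up for this illustration.

```r
# Start from an unnamed list (shortened, illustrative data)
p <- list(c(60, 72, 57), c(1.75, 1.80, 1.65), TRUE)

# Assign all names at once
names(p) <- c("weight", "height", "valid")

# Or rename a single element by position
names(p)[3] <- "ok"

names(p)   # "weight" "height" "ok"
```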

Lists with Names

people
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

$valid
[1] TRUE

$gender
[1] M M M M M M
Levels: M F

Indexing Lists with Names

  • Can be indexed same as vectors
  • Returns a sub-list
people[1:2]
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Elements of Lists with Names

This is a sublist:

people[1]
$weight
[1] 60 72 57 90 95 72

This is a single element:

people[[1]]
[1] 60 72 57 90 95 72
  • Equivalent to people[["weight"]]
  • Also equivalent to people$weight

Indexing Lists

  • List elements are indexed by [[]]
  • Sublists are indexed by []

Try these

people[[2]]
people[2]
people[[2]][3]
people[2][3]
people[[1:3]]
people[1:3]
people[["weight"]]
people$weight
people["weight"]

Result

people[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91
people[2]
$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91
people[[2]][3]
[1] 1.65
people[2][3]
$<NA>
NULL

Result

people[[1:3]]
Error in people[[1:3]]: recursive indexing failed at level 2
people[1:3]
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

Result

people[["weight"]]
[1] 60 72 57 90 95 72
people$weight
[1] 60 72 57 90 95 72
people["weight"]
$weight
[1] 60 72 57 90 95 72

Quiz

If key <- "names",

What is the difference between the following?

  • people[[key]]
  • people[[names]]
  • people$key
  • people$names

Explain

Changing parts of a List

Indices can also be used to change specific parts of a list.

Try each of the following and explain the result:

people$names <- toupper(people$names)
people$BMI <- people$weight/people$height^2
people$valid <- NULL
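
To check your explanation, it may help to look at names() after the three assignments. This sketch runs them on a shortened, made-up copy of the list called p.

```r
p <- list(weight = c(60, 72, 57),
          height = c(1.75, 1.80, 1.65),
          names  = c("Peter", "John", "Frank"),
          valid  = TRUE)

p$names <- toupper(p$names)         # replaces an existing element
p$BMI   <- p$weight / p$height^2    # a new name appends a new element
p$valid <- NULL                     # assigning NULL removes the element

names(p)   # "weight" "height" "names" "BMI"
```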

Data Frames

Data Frames

  • Two-dimensional, similar to matrices
  • Each column can be of a different type
ppl <- data.frame(
    weight=c(60, 72, 57, 90, 95, 72),
    height=c(1.75, 1.80, 1.65, 1.90,
             1.74, 1.91),
    names=c("Peter", "John", "Frank",
            "Huey", "Dewey", "Louie"),
    gender=factor(rep("M",6),
             levels=c("F","M")))

Data Frame

ppl
  weight height names gender
1     60   1.75 Peter      M
2     72   1.80  John      M
3     57   1.65 Frank      M
4     90   1.90  Huey      M
5     95   1.74 Dewey      M
6     72   1.91 Louie      M
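
A data frame can be indexed both like a named list (by column) and like a matrix (by row and column). The sketch below shows this on a shortened, made-up data frame called d.

```r
d <- data.frame(weight = c(60, 72, 57),
                height = c(1.75, 1.80, 1.65),
                names  = c("Peter", "John", "Frank"))

dim(d)             # 3 rows and 3 columns
d$weight           # one column, as with a named list
d[2, "height"]     # row 2 of column "height", as with a matrix
sapply(d, class)   # each column keeps its own type
```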

Connecting with the real world

  • Data frames are the natural way to read data from files
    • and to write data to files
  • Look up the documentation of read.table()

  • Read the file birth.txt into the data.frame birth

  • Do summary(birth)
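
The layout of birth.txt is not shown here, so the following is only a sketch of the write and read round trip. It uses a temporary file and a made-up data frame toy; the column names and values are assumptions, and the real file may need other arguments (for example a different sep).

```r
# Hypothetical data: the real birth.txt may have other columns and separators
toy <- data.frame(head = c(34, 35, 33, 36),
                  sex  = c("F", "M", "F", "M"))

path <- tempfile(fileext = ".txt")
write.table(toy, path, row.names = FALSE, quote = FALSE)   # write to a file

# header = TRUE tells read.table() that the first line holds the column names
back <- read.table(path, header = TRUE)

summary(back)
```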

What is this?

Telling stories

Descriptive Statistics

We have data, and we want to say something about it

How can we summarize all the values in a few numbers?

Let’s use the vector birth$head.

To make it easier, let’s copy it to a variable called v

v <- birth$head

Standard Data Descriptors

  • Number of elements
  • Location
  • Dispersion

Counting

  • length(v)

  • nrow(birth)

  • dim(birth)

  • table(birth$sex)
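
As an illustration of what each counting function returns, here is a sketch on a small made-up data frame birth_toy (an assumption; the real birth data will give other numbers).

```r
birth_toy <- data.frame(head = c(34, 35, 33, 36, 35),
                        sex  = c("F", "M", "F", "M", "F"))

length(birth_toy$head)   # number of elements in a vector: 5
nrow(birth_toy)          # number of rows: 5
dim(birth_toy)           # rows and columns: 5 2
table(birth_toy$sex)     # how often each value appears: F 3, M 2
```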

Location

If you have to describe the vector v with a single number X, which would it be?

If we have to replace each one of v[i] with a single number, which number is “the best”?

Location

Better to choose the one that is the “least wrong”

How can X be wrong?

Measuring error

Many alternatives

  • Number of times x!=v[i]
  • Absolute error sum(abs(v-x))
  • Quadratic error sum((v-x)^2)
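
The three error measures can be compared on a small example. The values of v and x below are made up for illustration.

```r
v <- c(2, 3, 3, 7, 10)   # made-up values
x <- 3                   # candidate summary number

sum(v != x)       # number of mismatches: 3
sum(abs(v - x))   # absolute error: 1 + 0 + 0 + 4 + 7 = 12
sum((v - x)^2)    # quadratic error: 1 + 0 + 0 + 16 + 49 = 66
```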

Absolute error

sum(abs(v-x))

Which x minimizes absolute error?

Median

If \(x\) is the median of v, then

  • at most half of the values in v are smaller than x
  • at most half of the values in v are bigger than x
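
A numerical check, not a proof: the sketch below tries many candidate values of x on a made-up vector and keeps the one with the smallest absolute error. The helper abs_err and the grid of candidates are assumptions for this illustration.

```r
v <- c(2, 3, 3, 7, 10)                    # made-up values

abs_err <- function(x) sum(abs(v - x))    # absolute error of one candidate

candidates <- seq(0, 12, by = 0.5)        # grid of candidate values
errors <- sapply(candidates, abs_err)
best <- candidates[which.min(errors)]

best        # 3
median(v)   # 3
```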

Quadratic error

sum((v-x)^2)

Which x minimizes squared error?

Arithmetic Mean

The mean value of v is \[\mathrm{mean}(v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of v.

Sometimes it is written as \(\bar{v}\)

This value is usually called the average
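
The same kind of numerical check works for the quadratic error. On the made-up vector below, the candidate with the smallest quadratic error coincides with mean(v). The helper quad_err and the grid of candidates are assumptions for this illustration.

```r
v <- c(2, 3, 3, 7, 10)                    # made-up values

quad_err <- function(x) sum((v - x)^2)    # quadratic error of one candidate

candidates <- seq(0, 12, by = 0.5)        # grid of candidate values
best <- candidates[which.min(sapply(candidates, quad_err))]

best      # 5
mean(v)   # 5
```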

Location indices in R

  • Mode (no built-in function for the statistical mode in R; mode() reports the storage type instead)
  • Median: median(v)
  • Arithmetic Mean: mean(v)
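
Since base R has no function for the statistical mode, one possible sketch builds it from table(). The name stat_mode is hypothetical, and the function returns the most frequent values as text, because table() stores them as names.

```r
# stat_mode is a hypothetical helper name, not a base R function
stat_mode <- function(v) {
    counts <- table(v)                       # how often each value appears
    names(counts)[counts == max(counts)]     # all most frequent values
}

stat_mode(c(2, 3, 3, 7, 10))   # "3"
```

If several values are tied for the highest count, this version returns all of them.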

Dispersion indices

If we approximate v by x, how good is this approximation?

  • Number of mismatches
  • Mean of absolute error \[\mathrm{abs.err}(v,x) = \frac{1}{n}\sum_{i=1}^n \vert v_i-x\vert\] In R code we write

    sum(abs(v-x))/length(v)

    It is minimized when x==median(v).

Mean of quadratic error

If \(n\) is the length of \(v\), then \[\mathrm{quad.err}(v,x) = \frac{1}{n}\sum_{i=1}^n (v_i-x)^2\] In R code we write

sum((v-x)^2)/length(v)

It is minimized when x==mean(v)

In that case this number is called the variance of the sample.

Variance and Standard Deviation

The variance of the sample v is

var(v) = sum((v-mean(v))^2)/length(v)

which is a number in squared units, so it is hard to compare with the mean value

The standard deviation of the sample is the square root of it

sd(v) = sqrt(sum((v-mean(v))^2)/length(v))

\[\mathrm{sd}(v) = \sqrt{\frac{1}{n}\sum_{i=1}^n (v_i-\bar{v})^2}\]

Variance and Standard Deviation

In many cases, including in R, people use a slightly different formula \[\mathrm{sd}(v) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (v_i-\bar{v})^2}\] Explaining the reason is left for a later course.
(In short: dividing by \(n\) gives a biased estimate of the variance of the population.)

This value estimates the standard deviation of the population

The difference is small, especially when \(n\) is big
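
The two formulas can be compared directly. This sketch uses a made-up vector; var_n and var_n1 are names chosen for this illustration. R's built-in var() and sd() use the \(n-1\) version.

```r
v <- c(2, 3, 3, 7, 10)   # made-up values
n <- length(v)

var_n  <- sum((v - mean(v))^2) / n         # divides by n:     46 / 5 = 9.2
var_n1 <- sum((v - mean(v))^2) / (n - 1)   # divides by n - 1: 46 / 4 = 11.5

var(v)         # same as var_n1
sd(v)          # same as sqrt(var_n1)
sqrt(var_n)    # the "divide by n" standard deviation, slightly smaller
```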

Quartiles

Quart means one fourth.

If we split the sorted values of v into four sets of the same size, what are the limits of these sets?

\[ Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know \(Q_0, Q_2\) and \(Q_4\)
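
In R the five limits can be obtained with quantile(). The sketch below, on a made-up vector, checks that \(Q_0\), \(Q_2\) and \(Q_4\) are the minimum, the median and the maximum.

```r
v <- c(2, 3, 3, 7, 10)   # made-up values

q <- quantile(v)         # default probabilities: 0, 0.25, 0.5, 0.75, 1
q

q[["0%"]]   == min(v)      # Q0 is the minimum
q[["50%"]]  == median(v)   # Q2 is the median
q[["100%"]] == max(v)      # Q4 is the maximum
```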