- Like vectors, but mixing different kinds of elements

people <- list( c(60,72,57,90,95, 72), c(1.75,1.80,1.65,1.90,1.74, 1.91), c("Peter", "John", "Frank", "Huey", "Dewey", "Louie"), TRUE, factor(rep("M",6), levels=c("M","F")))

- Notice that elements can have different lengths

people

[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

[[3]]
[1] "Peter" "John" "Frank" "Huey" "Dewey" "Louie"

[[4]]
[1] TRUE

[[5]]
[1] M M M M M M
Levels: M F

- Can be indexed same as vectors
- Returns a sub-list

people[1:2]

[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

people[1]

[[1]] [1] 60 72 57 90 95 72

- It is a sublist

people[[1]]

[1] 60 72 57 90 95 72

- It is an element

people <- list( weight=c(60,72,57,90,95, 72), height=c(1.75,1.80,1.65,1.90,1.74, 1.91), names=c("Peter", "John", "Frank", "Huey", "Dewey", "Louie"), valid=TRUE, gender=factor(rep("M",6), levels=c("M","F")))

(How else can we assign names?)
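One possible answer, as a sketch: names can also be set after the list has been created, using `names()`. The list `p` below is a shortened, hypothetical copy of `people`, used only for illustration.

```r
# A short unnamed list (hypothetical data, same shape as `people`)
p <- list(c(60, 72, 57), c(1.75, 1.80, 1.65), c("Peter", "John", "Frank"))

# Assign all the names at once
names(p) <- c("weight", "height", "names")

# Rename a single element by position
names(p)[3] <- "first_name"

names(p)
```

The same `names()` call without an assignment shows the current names, so it is also a quick way to check what a list contains.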

people

$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John" "Frank" "Huey" "Dewey" "Louie"

$valid
[1] TRUE

$gender
[1] M M M M M M
Levels: M F

- Can be indexed same as vectors
- Returns a sub-list

people[1:2]

$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

This is a sublist:

people[1]

$weight [1] 60 72 57 90 95 72

This is a single element:

people[[1]]

[1] 60 72 57 90 95 72

- Equivalent to
`people[["weight"]]`

- Also equivalent to
`people$weight`

- List elements are indexed by `[[ ]]`
- Sublists are indexed by `[ ]`
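The difference between the two kinds of brackets can be checked by looking at the class of each result. This is a minimal sketch on a small hypothetical list `p`, not on `people`.

```r
# Small hypothetical list
p <- list(weight = c(60, 72, 57), valid = TRUE)

sub  <- p[1]     # single brackets: a list of length 1
elem <- p[[1]]   # double brackets: the numeric vector itself

class(sub)       # "list"
class(elem)      # "numeric"

length(sub)      # 1, because the sublist has one element
length(elem)     # 3, because the vector has three values
```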

Try these

- `people[[2]]`
- `people[2]`
- `people[[2]][3]`
- `people[2][3]`
- `people[[1:3]]`
- `people[1:3]`
- `people[["weight"]]`
- `people$weight`
- `people["weight"]`

people[[2]]

[1] 1.75 1.80 1.65 1.90 1.74 1.91

people[2]

$height [1] 1.75 1.80 1.65 1.90 1.74 1.91

people[[2]][3]

[1] 1.65

people[2][3]

$<NA> NULL

people[[1:3]]

Error in people[[1:3]]: recursive indexing failed at level 2

people[1:3]

$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John" "Frank" "Huey" "Dewey" "Louie"

people[["weight"]]

[1] 60 72 57 90 95 72

people$weight

[1] 60 72 57 90 95 72

people["weight"]

$weight [1] 60 72 57 90 95 72

If `key <- "names"`, what is the difference between the following?

`people[[key]]`

`people[[names]]`

`people$key`

`people$names`

Explain

Indices can also be used to change specific parts of a list.

Try each of the following and explain the result:

people$names <- toupper(people$names)
people$BMI <- people$weight/people$height^2
people$valid <- NULL
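As a hedged illustration of the same idea on a smaller case, the sketch below changes, adds, and removes elements of a hypothetical list `p`. The values are made up for the example.

```r
# Small hypothetical list
p <- list(weight = c(60, 72), height = c(1.75, 1.80), valid = TRUE)

p$weight[2] <- 75                # change one value inside an element
p$BMI <- p$weight / p$height^2   # assigning to a new name adds an element
p$valid <- NULL                  # assigning NULL removes the element

names(p)                         # "weight" "height" "BMI"
```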

- Bi-dimensional, similar to matrices
- Each column can be of a different type

ppl <- data.frame( weight=c(60, 72, 57, 90, 95, 72), height=c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91), names=c("Peter", "John", "Frank", "Huey", "Dewey", "Louie"), gender=factor(rep("M",6), levels=c("F","M")))

ppl

  weight height names gender
1     60   1.75 Peter      M
2     72   1.80  John      M
3     57   1.65 Frank      M
4     90   1.90  Huey      M
5     95   1.74 Dewey      M
6     72   1.91 Louie      M

- Data frames are the natural way to read data from files
- and to write data to files
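A minimal round trip can be sketched with `write.table()` and `read.table()`. The data frame `d`, the temporary file, and the tab separator are assumptions made for this example; the real file may use a different layout, so check its first lines before choosing `header` and `sep`.

```r
# Small hypothetical data frame
d <- data.frame(name = c("Peter", "John"), weight = c(60, 72))

# Write it to a temporary tab-separated text file
f <- tempfile(fileext = ".txt")
write.table(d, f, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back; header = TRUE says the first line holds the column names
d2 <- read.table(f, header = TRUE, sep = "\t")
str(d2)
```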

Look for the documentation of `read.table()`

Read the file `birth.txt` into the *data.frame* `birth`

Do `summary(birth)`

What is this?

We have data, we want to tell something about them

How can we summarize all the values in a few numbers?

Let’s use the vector `birth$head`.

To make it easier let’s rename it to `v`

v <- birth$head

- Number of elements
- Location
- Dispersion

`length(v)`

`nrow(birth)`

`dim(birth)`

`table(birth$sex)`

If you have to describe the vector `v` with a single number *X*, which would it be?

If we have to replace each one of `v[i]` by a single number, which number is “the best”?

Better to choose the one that is the “least wrong”

How can *X* be wrong?

Many alternatives

- Number of times
`x!=v[i]`

- Absolute error
`sum(abs(v-x))`

- Quadratic error
`sum((v-x)^2)`

sum(abs(v-x))

Which `x` minimizes absolute error?

If \(x\) is the *median* of `v`, then

- half of the values in `v` are smaller than `x`
- half of the values in `v` are bigger than `x`
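The claim that the median minimizes the absolute error can be checked numerically. The sketch below uses a small hypothetical vector `v` and a grid of candidate values; it is an illustration, not a proof.

```r
# Hypothetical data with one large value
v <- c(2, 3, 5, 8, 30)

# Absolute error of replacing every v[i] by x
abs_err <- function(x) sum(abs(v - x))

# Evaluate the error on a grid of candidate values for x
candidates <- seq(min(v), max(v), by = 0.5)
errors <- sapply(candidates, abs_err)

best <- candidates[which.min(errors)]   # candidate with the smallest error
best
median(v)
```

Here both `best` and `median(v)` are 5, even though the value 30 is far away from the rest.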

sum((v-x)^2)

Which `x` minimizes squared error?

The *mean value* of `v` is \[\mathrm{mean}(v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of `v`.

Sometimes it is written as \(\bar{v}\)

This value is usually called *average*
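In the same spirit, a numerical search can be compared with `mean(v)`. This sketch assumes a small hypothetical vector `v` and uses `optimize()`, which looks for the minimum of a one-variable function on an interval, so the result is only approximate.

```r
# Hypothetical data
v <- c(2, 3, 5, 8, 30)

# Quadratic error of replacing every v[i] by x
sq_err <- function(x) sum((v - x)^2)

# Numerical search for the x with the smallest quadratic error
best <- optimize(sq_err, interval = range(v))

best$minimum   # close to the mean, up to numerical tolerance
mean(v)
```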

- Mode (no function for that in R)
- Median:
`median(v)`

- Arithmetic Mean:
`mean(v)`
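Base R does have a function called `mode()`, but it reports how an object is stored, not the most frequent value. One possible way to get the statistical mode is to count the values with `table()`, as in this sketch on a hypothetical vector. The name `mode_value` is made up for the example.

```r
# Hypothetical data
v <- c(3, 5, 5, 7, 5, 3)

# How many times each distinct value occurs
counts <- table(v)

# Keep the value (or values, if there is a tie) with the highest count
mode_value <- as.numeric(names(counts)[as.vector(counts) == max(counts)])
mode_value
```

If several values share the highest count, `mode_value` contains all of them.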

If we approximate `v` by *X*, how good is this approximation?

- Number of mismatches

Mean of absolute error \[\mathrm{abs.err}(v,x) = \frac{1}{n}\sum_{i=1}^n \vert v_i-x\vert\] In R code we write

sum(abs(v-x))/length(v)

It is minimized when `x==median(v)`.

If \(n\) is the length of \(v\), then \[\mathrm{quad.err}(v,x) = \frac{1}{n}\sum_{i=1}^n (v_i-x)^2\] In R code we write

sum((v-x)^2)/length(v)

It is minimized when `x==mean(v)`

In that case this number is called *variance* of the sample.

The *variance of the sample* `v` is

var(v) = sum((v-mean(v))^2)/length(v)

which is a number in **squared units**, so it is hard to compare with the mean value

The *standard deviation of the sample* is the square root of the variance

sd(v) = sqrt(sum((v-mean(v))^2)/length(v))

\[\mathrm{sd}(v) = \sqrt{\frac{1}{n}\sum_{i=1}^n (v_i-\bar{v})^2}\]

In many cases, including in R, people use a slightly different formula \[\mathrm{sd}(v) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (v_i-\bar{v})^2}\] Explaining the reason is left for a later course.

(It is because dividing by \(n\) gives a biased estimate: on average it falls below the variance of the population)

This value is used as an estimate of the *standard deviation of the population*

The difference is small, especially when \(n\) is big
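The two formulas can be compared directly. This sketch assumes a small hypothetical vector `v`; the names `sd_n` and `sd_n1` are made up for the example.

```r
# Hypothetical data
v <- c(2, 3, 5, 8, 30)
n <- length(v)

sd_n  <- sqrt(sum((v - mean(v))^2) / n)         # divides by n
sd_n1 <- sqrt(sum((v - mean(v))^2) / (n - 1))   # divides by n - 1

all.equal(sd_n1, sd(v))   # TRUE: R's sd() divides by n - 1
sd_n / sd_n1              # equals sqrt((n - 1) / n)
```

The ratio `sqrt((n - 1) / n)` gets closer to 1 as `n` grows, which is why the two values are almost the same for large samples.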

*Quart* means one fourth.

If we split `v` into four sets of the same size, what are the limits of these sets?

\[ Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know \(Q_0, Q_2\) and \(Q_4\)
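As a quick check, `quantile()` with its default arguments returns the five limits at once. The sketch uses a small hypothetical vector `v`; the first, third, and fifth results match `min(v)`, `median(v)`, and `max(v)`.

```r
# Hypothetical data
v <- c(2, 3, 5, 8, 30)

# Default probabilities are 0, 0.25, 0.5, 0.75 and 1
q <- quantile(v)
q

q[[1]] == min(v)      # Q0 is the minimum
q[[3]] == median(v)   # Q2 is the median
q[[5]] == max(v)      # Q4 is the maximum
```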