## Lists

• Like vectors, but they can mix different kinds of elements
people <- list(
c(60,72,57,90,95, 72),
c(1.75,1.80,1.65,1.90,1.74, 1.91),
c("Peter", "John", "Frank",
"Huey", "Dewey", "Louie"),
TRUE,
factor(rep("M",6),
levels=c("M","F")))
• Notice that elements can have different lengths
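
As a quick check of the two points above, the sketch below rebuilds the same list and asks R for the class and length of each element. The variable names `kinds` and `sizes` are only illustrative.

```r
# Same unnamed list as in the slide above
people <- list(
  c(60, 72, 57, 90, 95, 72),
  c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
  c("Peter", "John", "Frank", "Huey", "Dewey", "Louie"),
  TRUE,
  factor(rep("M", 6), levels = c("M", "F")))

# Class and length of every element of the list
kinds <- sapply(people, class)
sizes <- sapply(people, length)
print(kinds)
print(sizes)
```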

## Result

people
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

[[3]]
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

[[4]]
[1] TRUE

[[5]]
[1] M M M M M M
Levels: M F

## Indexing Lists

• Can be indexed the same way as vectors
• Returns a sub-list
people[1:2]
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

## Elements of Lists

people[1]
[[1]]
[1] 60 72 57 90 95 72
• It is a sublist
people[[1]]
[1] 60 72 57 90 95 72
• It is an element
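
The difference between a sublist and an element matters as soon as we compute with the result. A minimal sketch, using a shortened two-element version of the list (the names `sub` and `elem` are only illustrative):

```r
people <- list(
  c(60, 72, 57, 90, 95, 72),
  c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91))

sub  <- people[1]     # single brackets: still a list, of length 1
elem <- people[[1]]   # double brackets: the numeric vector itself

print(class(sub))
print(class(elem))
print(length(sub))
print(length(elem))

# Arithmetic works on the element, not on the sublist
print(mean(elem))
# mean(sub) would give a warning and return NA, because sub is a list
```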

## Lists with Names

people <- list(
weight=c(60,72,57,90,95, 72),
height=c(1.75,1.80,1.65,1.90,1.74, 1.91),
names=c("Peter", "John", "Frank",
"Huey", "Dewey", "Louie"),
valid=TRUE,
gender=factor(rep("M",6),
levels=c("M","F")))

(How else can we assign names?)
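
One possible answer, as a sketch: names can also be assigned after the list has been created, either all at once with `names()` or one element at a time with `$`. The shortened two-element list is used only to keep the example small.

```r
people <- list(
  c(60, 72, 57, 90, 95, 72),
  c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91))

print(names(people))                     # NULL: no names yet
names(people) <- c("weight", "height")   # name all elements at once
people$valid <- TRUE                     # add a new, named element
print(names(people))
```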

## Lists with Names

people
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

$valid
[1] TRUE

$gender
[1] M M M M M M
Levels: M F

## Indexing Lists with Names

• Can be indexed the same way as vectors
• Returns a sub-list
people[1:2]
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

## Elements of Lists with Names

This is a sublist:

people[1]
$weight
[1] 60 72 57 90 95 72

This is a single element:

people[[1]]
[1] 60 72 57 90 95 72
• Equivalent to people[["weight"]]
• Also equivalent to people$weight

## Indexing Lists

• List elements are indexed by [[]]
• Sublists are indexed by []

Try these

people[[2]]
people[2]
people[[2]][3]
people[2][3]
people[[1:3]]
people[1:3]
people[["weight"]]
people$weight
people["weight"]

## Result

people[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91
people[2]
$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

people[[2]][3]
[1] 1.65
people[2][3]
$<NA>
NULL

## Result

people[[1:3]]
Error in people[[1:3]]: recursive indexing failed at level 2
people[1:3]
$weight
[1] 60 72 57 90 95 72

$height
[1] 1.75 1.80 1.65 1.90 1.74 1.91

$names
[1] "Peter" "John"  "Frank" "Huey"  "Dewey" "Louie"

## Result

people[["weight"]]
[1] 60 72 57 90 95 72
people$weight
[1] 60 72 57 90 95 72
people["weight"]
$weight
[1] 60 72 57 90 95 72
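
The error from `people[[1:3]]` comes from the way `[[ ]]` treats a vector of indices: it applies them one after another, going one level deeper each time. The sketch below illustrates this on a three-element version of the list; `res` simply records whether the call failed.

```r
people <- list(
  weight = c(60, 72, 57, 90, 95, 72),
  height = c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
  names  = c("Peter", "John", "Frank", "Huey", "Dewey", "Louie"))

# Two indices: second element of the list, then third value inside it
a <- people[[c(2, 3)]]
b <- people[[2]][[3]]
print(a)
print(b)

# Three indices ask for one level too many, so R stops with an error
res <- tryCatch(people[[1:3]], error = function(e) "error")
print(res)
```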
people$valid <- NULL

## Data Frames

• Two-dimensional, similar to matrices
• Each column can be of a different type
ppl <- data.frame(
weight=c(60, 72, 57, 90, 95, 72),
height=c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
names=c("Peter", "John", "Frank",
"Huey", "Dewey", "Louie"),
gender=factor(rep("M",6),
levels=c("F","M")))

## Data Frame

ppl
  weight height names gender
1     60   1.75 Peter      M
2     72   1.80  John      M
3     57   1.65 Frank      M
4     90   1.90  Huey      M
5     95   1.74 Dewey      M
6     72   1.91 Louie      M

## Connecting with the real world

• Data frames are the natural way to read data from files
• and to write data to files
• Look for the documentation of read.table()
• Read the file birth.txt into the data.frame birth
• Do summary(birth)

What is this?

## Telling stories

## Descriptive Statistics

We have data, and we want to say something about them

How can we summarize all the values in a few numbers?

Let’s use the vector birth$head.

To make it easier let’s rename it to v
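
The file birth.txt is not reproduced here, so the sketch below uses a small stand-in data frame with made-up values to show the round trip between a data frame and a file. The name `birth_demo`, the `weight` column and all the numbers are assumptions for illustration; only the column name `head` comes from the text.

```r
# Stand-in for birth.txt: made-up values, hypothetical columns
birth_demo <- data.frame(
  head   = c(33.5, 34.0, 35.2, 34.0, 36.1),
  weight = c(3100, 3250, 3400, 2950, 3600))

path <- tempfile(fileext = ".txt")
write.table(birth_demo, path, row.names = FALSE)   # data frame -> file
birth <- read.table(path, header = TRUE)           # file -> data frame

print(summary(birth))

# Shorter name for the column we want to describe
v <- birth$head
print(v)
```

With the real file, the `read.table()` call would probably look similar, but the right arguments (for example `header` or `sep`) depend on how birth.txt is laid out.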

## Location

If you have to describe the vector v with a single number X, which would it be?

If we had to replace every v[i] with a single number, which number would be “the best”?

## Location

Better to choose the one that is the “least wrong”

How can X be wrong?

## Measuring error

Many alternatives

• Number of times x!=v[i]
• Absolute error sum(abs(v-x))
• Quadratic error sum((v-x)^2)
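
The three measures can be computed directly for any candidate x. As a sketch, the weights from the list examples are used here as a stand-in for v, and x = 72 is an arbitrary candidate.

```r
v <- c(60, 72, 57, 90, 95, 72)   # stand-in data: the weights used earlier
x <- 72                          # arbitrary candidate value

mismatches <- sum(v != x)        # number of times x != v[i]
abs_err    <- sum(abs(v - x))    # absolute error
quad_err   <- sum((v - x)^2)     # quadratic error

print(c(mismatches = mismatches, abs_err = abs_err, quad_err = quad_err))
```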

## Absolute error

sum(abs(v-x))

Which x minimizes absolute error?

## Median

If $$x$$ is the median of v, then

• half of the values in v are smaller than x
• half of the values in v are bigger than x
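
A numerical check, not a proof: the sketch below evaluates the absolute error on a grid of candidate values and compares the best candidate with `median(v)`. The data are the weights from the list examples, used as a stand-in for v, and the grid step of 0.5 is an arbitrary choice.

```r
v <- c(60, 72, 57, 90, 95, 72)   # stand-in data

abs_err <- function(x) sum(abs(v - x))

# Try many candidate values between the smallest and largest element
grid <- seq(min(v), max(v), by = 0.5)
errs <- sapply(grid, abs_err)
best <- grid[which.min(errs)]

print(best)
print(median(v))
print(abs_err(median(v)))
```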

## Quadratic error

sum((v-x)^2)

Which x minimizes squared error?

## Arithmetic Mean

The mean value of v is $\mathrm{mean}(v) = \frac{1}{n}\sum_{i=1}^n v_i$ where $$n$$ is the length of v.

Sometimes it is written as $$\bar{v}$$

This value is usually called the average
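
The formula translates directly into R, and we can check numerically that the mean is where the squared error is smallest. This is a sketch with the weights as stand-in data; `optimize()` is a general one-dimensional minimizer from base R, used here only as an independent check.

```r
v <- c(60, 72, 57, 90, 95, 72)   # stand-in data
n <- length(v)

by_hand <- sum(v) / n            # the formula above
print(by_hand)
print(mean(v))

# Numerical search for the x with the smallest squared error
quad_err <- function(x) sum((v - x)^2)
opt <- optimize(quad_err, interval = range(v))
print(opt$minimum)
```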

## Location indices in R

• Mode (no built-in function for that in R; mode() returns the storage type instead)
• Median: median(v)
• Arithmetic Mean: mean(v)
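
Since base R has no function for the statistical mode, one way to get it is to count the values with `table()`. The helper `stat_mode` below is a hypothetical name, not part of R, and it assumes a numeric vector; when several values are tied it returns all of them.

```r
# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)                              # frequency of each value
  top <- as.vector(counts) == max(counts)         # which values are most frequent
  as.numeric(names(counts)[top])
}

v <- c(60, 72, 57, 90, 95, 72)   # stand-in data
print(stat_mode(v))
print(median(v))
print(mean(v))
```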

## Dispersion indices

If we approximate v by x, how good is this approximation?

• Number of mismatches
• Mean of absolute error

$\mathrm{abs.err}(v,x) = \frac{1}{n}\sum_{i=1}^n \vert v_i-x\vert$

In R code we write

sum(abs(v-x))/length(v)

It is minimized when x==median(v).

• Mean of quadratic error: if $$n$$ is the length of $$v$$, then

$\mathrm{quad.err}(v,x) = \frac{1}{n}\sum_{i=1}^n (v_i-x)^2$

In R code we write

sum((v-x)^2)/length(v)

It is minimized when x==mean(v)

In that case this number is called the variance of the sample.

## Variance and Standard Deviation

The variance of the sample v is

var(v) = sum((v-mean(v))^2)/length(v)

which is a number in squared units, so it is hard to compare with the mean value

The standard deviation of the sample is its square root

sd(v) = sqrt(sum((v-mean(v))^2)/length(v))

$\mathrm{sd}(v) = \sqrt{\frac{1}{n}\sum_{i=1}^n (v_i-\bar{v})^2}$

## Variance and Standard Deviation

In many cases, including in R, people use a slightly different formula

$\mathrm{sd}(v) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (v_i-\bar{v})^2}$

Explaining the reason is for a later course.
(Dividing by $$n$$ gives a biased estimate: on average it underestimates the variance of the population)

This value is an estimate of the standard deviation of the population, and it is what sd() returns in R

The difference is small, especially when $$n$$ is big
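
The two formulas can be compared directly. The sketch below, again with the weights as stand-in data, computes both versions by hand and checks which one R's `var()` and `sd()` use. The names `var_n` and `var_n1` are only illustrative.

```r
v <- c(60, 72, 57, 90, 95, 72)   # stand-in data
n <- length(v)

var_n  <- sum((v - mean(v))^2) / n         # dividing by n
var_n1 <- sum((v - mean(v))^2) / (n - 1)   # dividing by n - 1

print(c(var_n = var_n, var_n1 = var_n1, var_R = var(v)))
print(c(sd_n = sqrt(var_n), sd_n1 = sqrt(var_n1), sd_R = sd(v)))

# Converting R's result to the divide-by-n version
print(var(v) * (n - 1) / n)
```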

## Quartiles

Quart means one fourth.

If we split the sorted values of v into four sets of the same size

What are the limits of these sets?

$Q_0, Q_1, Q_2, Q_3, Q_4$

It is easy to know $$Q_0, Q_2$$ and $$Q_4$$
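
In R, `quantile(v)` returns these five limits at once. As a sketch with the weights as stand-in data, we can check that the first, middle and last of them match `min(v)`, `median(v)` and `max(v)`. Note that there are several conventions for computing the inner quartiles; `quantile()` chooses among them with its `type` argument, and the default is used here.

```r
v <- c(60, 72, 57, 90, 95, 72)   # stand-in data

q <- quantile(v)   # named 0%, 25%, 50%, 75%, 100%
print(q)

# The three "easy" ones
print(c(min = min(v), median = median(v), max = max(v)))
```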