Let’s recapitulate. We read data with

birth <- read.table("birth.txt", header=T)

which results in a data frame like this:

head(birth)

id birth apgar5 sex weight head age parity weeks 1 4347 1 8 F 1610 41.0 28.5 1 31 2 4346 1 9 F 3580 51.0 35.0 1 39 3 4300 1 9 F 3350 52.0 37.0 1 40 4 4345 1 9 F 3230 50.5 35.0 1 38 5 4349 1 8 F 3650 52.0 36.5 1 40 6 4315 2 8 F 3900 51.0 35.0 1 38

summary(birth)

id birth apgar5 sex weight Min. : 4199 Min. :1.000 Min. :1.000 F:299 Min. :1100 1st Qu.: 6112 1st Qu.:1.000 1st Qu.:8.000 M:395 1st Qu.:2972 Median : 7920 Median :1.000 Median :9.000 Median :3250 Mean : 7877 Mean :1.677 Mean :8.281 Mean :3244 3rd Qu.: 9606 3rd Qu.:2.000 3rd Qu.:9.000 3rd Qu.:3580 Max. :11475 Max. :3.000 Max. :9.000 Max. :5000 head age parity weeks Min. :34.0 Min. :22.00 Min. :1.000 Min. :26.00 1st Qu.:48.0 1st Qu.:33.50 1st Qu.:1.000 1st Qu.:38.00 Median :50.0 Median :34.50 Median :2.000 Median :39.00 Mean :49.3 Mean :34.42 Mean :2.611 Mean :38.75 3rd Qu.:51.0 3rd Qu.:35.50 3rd Qu.:4.000 3rd Qu.:40.00 Max. :55.0 Max. :42.00 Max. :9.000 Max. :42.00

What are these values?

The easiest to understand are *minimum* and *maximum*

min(birth$weight)

[1] 1100

max(birth$weight)

[1] 5000

Which sometimes can be useful together

range(birth$weight)

[1] 1100 5000

If `m`

is the *median* of a vector `v`

, then

- half of the values in
`v`

are smaller than`m`

- half of the values in
`v`

are bigger than`m`

median(birth$weight)

[1] 3250

The median is the value \(m\) minimizes the absolute error \(\sum_{i=1}^n \vert v_i-m\vert\).

What if the number of elements is even?

*Quart* means *one fourth* in latin.

If we split the set of values in four subsets of the same size

Which are the limits of these sets?

\(Q_0\): Zero elements are smaller than this one

\(Q_1\): One quarter of the elements are smaller

\(Q_2\): Two quarters (half) of the elements are smaller

\(Q_3\): Three quarters of the elements are smaller

\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the *minimum*, \(Q_2\) is the median, and \(Q_4\) is the maximum

Generalizing, we can ask, for each percentage `p`

, which is the value on the vector `v`

which is greather than `p`

% of the rest of the values.

The function in *R* for that is called `quantile()`

By default it gives us the *quartiles*

quantile(birth$weight)

0% 25% 50% 75% 100% 1100.0 2972.5 3250.0 3580.0 5000.0

quantile(birth$weight, seq(0, 1, by=0.1))

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 1100 2630 2906 3020 3140 3250 3388 3530 3660 3850 5000

The *mean value* of the vector `v`

is \[\mathrm{mean}(v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of `v`

. Sometimes it is written as \(\bar{v}\)

mean(birth$weight)

[1] 3243.905

This value is usually called *average*, but the correct name is *mean*

This value minimizes the quadratic error \(\sum_{i=1}^n (v_i-\bar{v})^2\)

if \(n\) is the length of \(v\) and \(\bar{v}\) is its mean, then the mean of the quadratic error is \[\frac{1}{n}\sum_{i=1}^n (v_i-\bar{v})^2\] This number is called *variance of the sample *

This is a number in *squared units*, so it is hard to compare with the mean value

mean(birth$weight)

[1] 3243.905

var(birth$weight)

[1] 273438

To understand easily the variability of the data, we take the square root of the variance

The *standar deviation of the sample* is the square root of the variance \[\sqrt{\frac{1}{n}\sum_{i=1}^n (v_i-\bar{v})^2}\]

In many cases, including in R, people uses a slightly different formula \[\mathrm{sd}(v) = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (v_i-\bar{v})^2}\] Explaining the reason is for a next course.

(It is because of the bias of the expected value of the expected value)

This value is called *standard deviation of the population*

The difference is small, especially when \(n\) is big

v <- birth$weight sd(v)

[1] 522.913

sqrt(sum((v-mean(v))^2)/length(v))

[1] 522.5361

sqrt(sum((v-mean(v))^2)/(length(v)-1))

[1] 522.913

Sometimes the best way to tell the story of the data is with a graphic

plot(birth$weight)

each individual has a position in the *x* axis

plot(birth$head)

The previous graphics used *numeric* data. What about factors?

plot(birth$sex)

par(mfrow = c(1,2)) plot(birth$apgar5) plot(as.factor(birth$apgar5))

par(mfrow = c(1,2)) plot(birth$head) hist(birth$head)

Numeric data is grouped in *N* **classes** or **bins**

par(mfrow = c(1,2)) hist(birth$head, col="grey") hist(birth$head, col="grey", nclass = 30)

par(mfrow = c(1,2)) plot(birth$head) plot(birth$head, col="red")

par(mfrow = c(1,2)) plot(birth$head, cex=2) plot(birth$head, cex=0.5)

par(mfrow = c(1,2)) plot(birth$head, pch=16) plot(birth$head, pch=".")

par(mfrow = c(1,2)) plot(birth$head, type = "l") plot(birth$head, type = "o")

par(mfrow = c(1,2)) plot(birth$head, type = "l", xlim=c(1,100)) plot(birth$head, type = "o", xlim=c(1,100))

par(mfrow = c(1,2)) plot(birth$head, type = "b", xlim=c(1,100)) plot(birth$head, type = "n", xlim=c(1,100))

plot(birth$weight, main = "Weight at Birth", sub = "694 samples", ylab="weight [gr]")

plot(birth$head) points(birth$age, pch=2)

The first one defines the scale

plot(birth$head, type="l", ylim = c(22,55)) lines(birth$age, col="red")

plot(birth$weight[birth$sex=="F"], ylim=range(birth$weight), ylab = "weight [gr]") points(birth$weight[birth$sex=="M"], col="blue")