survey <- read.table("survey1-tidy.txt")

We can even take a vector from our data

height <- survey$height

So what?

October 18, 2018

We have data, and we want to say something about it

What can we tell about this set of numbers?

How can we make a summary of all the values in a few numbers?

- Number of elements (*How many?*)
- Location (*Where?*)
- Dispersion (*Are they homogeneous? Are they similar to each other?*)

- For vectors we use `length()`
- For matrices and data frames we use `nrow()`

`dim()` gives us *rows* and *columns*

length(height)

[1] 51

nrow(survey)

[1] 51

dim(survey)

[1] 51 8
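As a small illustration of how these functions differ, here is a sketch on a made-up data frame (`df` and its values are hypothetical and are not part of the survey). One detail worth noticing is that `length()` applied to a data frame counts columns, not rows.

```r
# Made-up data frame with 3 rows and 2 columns, for illustration only
df <- data.frame(height = c(160, 172, 181),
                 gender = c("Female", "Male", "Female"))

length(df$height)  # number of elements of one column (a vector)
nrow(df)           # number of rows
ncol(df)           # number of columns
dim(df)            # rows and columns together
length(df)         # for a data frame this is the number of columns
```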

`table()` should be called **count**. It is good for *factors*

table(survey$handness)

 Left Right
    4    47

table(survey$Gender)

Female   Male
    30     21

table(survey$handness, survey$Gender)

        Female Male
  Left       3    1
  Right     27   20

This looks more like a table
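A two-way table can also be extended with totals or turned into proportions. The sketch below rebuilds the counts shown above from two stand-in factors (`hand` and `gender` are hypothetical names replacing the survey columns) and then uses `addmargins()` and `prop.table()`.

```r
# Stand-in factors that reproduce the counts of the table above:
# Left: 3 Female, 1 Male; Right: 27 Female, 20 Male
hand   <- factor(rep(c("Left", "Right"), times = c(4, 47)))
gender <- factor(c(rep(c("Female", "Male"), times = c(3, 1)),
                   rep(c("Female", "Male"), times = c(27, 20))))

tab <- table(hand, gender)
tab

# Add row and column totals
addmargins(tab)

# Proportions within each column (each column sums to 1)
prop.table(tab, margin = 2)
```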

`table()` is not good with numeric data

table(height)

height
155 157 158 159 160 162 163 164 165 166 167 168 170
  1   1   2   1   3   3   3   1   3   2   2   1   2
171 172 173 174 175 176 177 178 179 180 181 182 183
  1   1   3   3   4   1   1   2   1   2   1   1   1
184 185 188 195
  1   1   1   1

It is not a *good summary*

What can we say?

Counting `TRUE` values

How many people are taller than 165 cm?

sum(height > 165)

[1] 33
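Why this works: the comparison `height > 165` produces a logical vector, and `sum()` treats each `TRUE` as 1 and each `FALSE` as 0. The sketch below rebuilds `height` from the frequency table shown earlier (the original order of the survey is lost, which does not matter here); the names `values`, `counts` and `taller` are only illustrative.

```r
# Heights rebuilt from the frequency table of table(height)
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

taller <- height > 165   # logical vector: TRUE where the person is taller
head(taller)

sum(taller)    # number of TRUE values: 33
mean(taller)   # proportion of TRUE values: 33 out of 51
```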

If you have to describe the vector `v` with a single number `x`, which would it be?

If we have to replace every `v[i]` with a single number, which number is “the best”?

Better to choose the one that is “least wrong”

How can `x` be wrong?

There are many ways to measure the error

- Number of times that `x != v[i]`
- Sum of the absolute values of the errors
- Sum of the squares of the errors
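The three error measures listed above can be written as short R functions. This is only a sketch: the function names and the vector `v` are made up for illustration.

```r
# Made-up vector, used only for illustration
v <- c(2, 3, 3, 7, 10)

# Hypothetical helpers: total error of using the single number x for v
count_error    <- function(x, v) sum(x != v)       # how many v[i] differ from x
absolute_error <- function(x, v) sum(abs(v - x))   # sum of |v[i] - x|
squared_error  <- function(x, v) sum((v - x)^2)    # sum of (v[i] - x)^2

count_error(3, v)      # 3
absolute_error(3, v)   # 12
squared_error(3, v)    # 66
```

Each measure can prefer a different `x`: for this `v`, the absolute error at 3 (12) is smaller than at 5 (14), while the squared error at 5 (46) is smaller than at 3 (66).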

Absolute error when \(x\) represents \(\mathbf v\) \[\mathrm{AE}(x, \mathbf{v})=\sum_i |v_i-x|\]

Which \(x\) minimizes absolute error?

We get the minimum absolute error when \(x=171\)

If `x` is the *median* of `v`, then

- half of the values in `v` are smaller than `x`
- half of the values in `v` are bigger than `x`

The *median* minimizes the absolute error
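One way to check this claim numerically is a simple grid search: compute the absolute error for many candidate values of `x` and keep the one with the smallest error. The sketch below assumes `height` rebuilt from the frequency table shown earlier; the names `absolute_error`, `candidates` and `best` are illustrative.

```r
# Heights rebuilt from the frequency table of table(height)
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

# Absolute error of representing all heights by the single number x
absolute_error <- function(x) sum(abs(height - x))

# Try every candidate from 155 to 195 in steps of 0.5
candidates <- seq(155, 195, by = 0.5)
errors <- sapply(candidates, absolute_error)

best <- candidates[which.min(errors)]
best             # 171
median(height)   # 171
```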

The squared error when \(x\) represents \(\mathbf v\) is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\] Which \(x\) minimizes the squared error?

We get the minimum squared error when \(x=170.6862745\)

The error is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\]

To find the minimum we can take the derivative of \(\mathrm{SE}\) with respect to \(x\)

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i (v_i - x)= 2nx - 2\sum_i v_i\]

**The minimum of a smooth function is located where the derivative is zero**

Now we find the value of \(x\) that makes the derivative equal to zero.

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= 2nx - 2\sum_i v_i\]

Setting this last formula equal to zero and solving for \(x\), we find that the best value is

\[x = \frac{1}{n} \sum_i v_i\]

The *mean value* of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of the vector \(\mathbf v\).

Sometimes it is written as \(\bar{\mathbf v}\)

This value is called *mean*

In R we write `mean(v)`
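The result can be checked numerically. The sketch below minimizes the squared error with the base R function `optimize()` and compares the answer with `mean()`. It assumes `height` rebuilt from the frequency table shown earlier; `squared_error` and `best` are illustrative names.

```r
# Heights rebuilt from the frequency table of table(height)
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

# Squared error of representing all heights by the single number x
squared_error <- function(x) sum((height - x)^2)

# Numerical search for the minimum between the smallest and largest height
best <- optimize(squared_error, interval = range(height))
best$minimum     # approximately the mean, up to the tolerance of optimize()
mean(height)

# At the mean the derivative is zero, so the deviations add up to (almost) zero
sum(height - mean(height))
```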

summary(height)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  155.0   163.0   171.0   170.7   176.5   195.0

What are these values?

The easiest to understand are *minimum* and *maximum*

min(height)

[1] 155

max(height)

[1] 195

Sometimes it is useful to get both together

range(height)

[1] 155 195

*Quart* comes from the Latin *quartus*, which means *fourth*.

If we split the set of values into four subsets of the same size, what are the limits of these subsets? \[Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know \(Q_0, Q_2\) and \(Q_4\)

\(Q_0\): Zero elements are smaller than this one

\(Q_1\): One quarter of the elements are smaller

\(Q_2\): Two quarters (half) of the elements are smaller

\(Q_3\): Three quarters of the elements are smaller

\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the *minimum*, \(Q_2\) is the median, and \(Q_4\) is the maximum

Generalizing, we can ask, for each percentage \(p\), which value in the vector `v` is greater than \(p\)% of the other values.

The function in *R* for that is called `quantile()`

By default it gives us the *quartiles*

quantile(height)

   0%   25%   50%   75%  100%
155.0 163.0 171.0 176.5 195.0
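The link between the quartiles and the minimum, median and maximum can be checked directly. This sketch assumes `height` rebuilt from the frequency table shown earlier and uses the default settings of `quantile()`; the name `q` is illustrative.

```r
# Heights rebuilt from the frequency table of table(height)
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

q <- quantile(height)   # named vector: "0%", "25%", "50%", "75%", "100%"

q[["0%"]]   == min(height)      # Q0 is the minimum
q[["50%"]]  == median(height)   # Q2 is the median
q[["100%"]] == max(height)      # Q4 is the maximum

# A single quantile can be requested with the probs argument
quantile(height, probs = 0.25)  # Q1
```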

`quantile()` gives quartiles

unless we ask for something else

quantile(height, seq(0, 1, by=0.1))

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
 155  160  162  165  167  171  174  175  178  182  195

`summary()`

summary(height)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  155.0   163.0   171.0   170.7   176.5   195.0

The command `cut()` splits the vector into groups and returns a **factor** with one level for each group. This is a factor:

cut(height, 4)

 [1] (175,185] (165,175] (165,175] (175,185] (155,165]
 [6] (165,175] (165,175] (155,165] (165,175] (185,195]
[11] (175,185] (175,185] (155,165] (155,165] (175,185]
[16] (155,165] (165,175] (155,165] (155,165] (155,165]
[21] (165,175] (155,165] (165,175] (155,165] (155,165]
[26] (155,165] (165,175] (165,175] (175,185] (175,185]
[31] (175,185] (165,175] (165,175] (165,175] (165,175]
[36] (165,175] (165,175] (155,165] (165,175] (155,165]
[41] (175,185] (155,165] (175,185] (165,175] (165,175]
[46] (155,165] (155,165] (185,195] (155,165] (175,185]
[51] (175,185]
Levels: (155,165] (165,175] (175,185] (185,195]

Used this way, the range is split into intervals of the same width, not into groups with the same number of people

table(cut(height, 4))

(155,165] (165,175] (175,185] (185,195]
       18        19        12         2

These are not the quartiles
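Instead of a number of groups, `cut()` also accepts a vector of explicit cut points in its `breaks` argument. The sketch below uses cut points every 10 cm, chosen here only as an example, on `height` rebuilt from the frequency table shown earlier. With explicit breaks the intervals are open on the left, so `include.lowest = TRUE` is needed to keep the smallest value.

```r
# Heights rebuilt from the frequency table of table(height)
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

# Explicit cut points: 155, 165, 175, 185, 195
groups <- cut(height, breaks = seq(155, 195, by = 10), include.lowest = TRUE)

# Same group sizes as cut(height, 4): 18, 19, 12, 2
table(groups)
```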

We can specify the cut points using `quantile()`

table(cut(height, quantile(height), include.lowest = TRUE))

[155,163] (163,171] (171,176] (176,195]
       14        12        12        13

Now every group has (almost) the same size
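The argument `include.lowest = TRUE` matters here. The intervals built by `cut()` are open on the left, so without it the smallest height falls outside the first interval and becomes `NA`. The sketch below shows the difference, assuming `height` rebuilt from the frequency table shown earlier; `breaks`, `groups_open` and `groups` are illustrative names.

```r
# Heights rebuilt from the frequency table of table(height)
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

breaks <- quantile(height)   # 155.0 163.0 171.0 176.5 195.0

# Without include.lowest the minimum (155) is not in (155,163] and becomes NA
groups_open <- cut(height, breaks)
sum(is.na(groups_open))      # 1

# With include.lowest the first interval is closed on both sides
groups <- cut(height, breaks, include.lowest = TRUE)
sum(is.na(groups))           # 0
table(groups)                # 14, 12, 12, 13
```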