October 18, 2018

## We have our own data

survey <- read.table("survey1-tidy.txt")

We can even take a vector from our data

height <- survey$height

So what?

## Descriptive Statistics

We have data, we want to tell something about them

What can we tell about this set of numbers?

How can we make a summary of all the values in a few numbers?

## Standard Data Descriptors

• Number of elements (How many?)
• Location (Where?)
• Dispersion (Are they homogeneous? Are they similar to each other?)

## How many in total

• For vectors we use length()
• For matrices and data frames we use nrow()
• dim() gives us rows and columns

## Counting how many in total

length(height)
[1] 51
nrow(survey)
[1] 51
dim(survey)
[1] 51  8

## Counting how many of each

table() should be called count. It is good for factors

table(survey$handness)
 Left Right
    4    47
table(survey$Gender)
Female   Male
    30     21

## We can count combinations

table(survey$handness, survey$Gender)

        Female Male
  Left       3    1
  Right     27   20

This looks more like a table
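A two-way table can also carry its row and column totals. The sketch below is only an illustration: since the survey file is not available here, it rebuilds two vectors from the four counts shown above (the names `handness`, `gender` and `counts` are assumptions made for this example), then uses `addmargins()` and `prop.table()` on the result.

```r
# Rebuild the two columns from the four counts in the table above.
# The row order of the real survey is unknown, so this order is arbitrary.
handness <- rep(c("Left", "Left", "Right", "Right"), times = c(3, 1, 27, 20))
gender   <- rep(c("Female", "Male", "Female", "Male"), times = c(3, 1, 27, 20))

counts <- table(handness, gender)

# Add a "Sum" row and a "Sum" column with the totals
addmargins(counts)

# Proportions within each row (each row adds up to 1)
prop.table(counts, margin = 1)
```

The totals in the margins should agree with the one-way tables: 4 left-handed and 47 right-handed people, 30 women and 21 men.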

## table() is not good with numeric

table(height)
height
155 157 158 159 160 162 163 164 165 166 167 168 170
1   1   2   1   3   3   3   1   3   2   2   1   2
171 172 173 174 175 176 177 178 179 180 181 182 183
1   1   3   3   4   1   1   2   1   2   1   1   1
184 185 188 195
1   1   1   1 

It is not a good summary

What can we say?

## Counting TRUE values

How many people are taller than 165 cm?

sum(height > 165)
[1] 33
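Why does `sum()` count here? A comparison returns a logical vector, and `sum()` treats TRUE as 1 and FALSE as 0. The sketch below rebuilds a vector with the same values as the `table(height)` output above (the names `values`, `counts` and `taller` are assumptions for this example, and the original row order is not recoverable) and shows that `mean()` gives the proportion in the same way.

```r
# Same values as in the table(height) output, in sorted order
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

taller <- height > 165   # logical vector, one TRUE/FALSE per person
sum(taller)              # number of TRUE values
mean(taller)             # proportion of TRUE values
```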

## Location

If you have to describe the vector v with a single number x, which would it be?

If we have to replace every v[i] by a single number, which number is “the best”?

Better to choose the one that is the “least wrong”

How can x be wrong?

## How can x be wrong?

Many alternatives to measure the error

• Number of times that x!=v[i]
• Sum of absolute value of error
• Sum of the square of error
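The three measures in the list can be written as small R functions. This is a minimal sketch: the function names and the toy vector `v` are hypothetical, chosen only for illustration.

```r
# Number of times that x differs from an element of v
mismatches <- function(x, v) sum(x != v)

# Sum of the absolute values of the errors
abs_error <- function(x, v) sum(abs(v - x))

# Sum of the squares of the errors
sq_error <- function(x, v) sum((v - x)^2)

v <- c(2, 3, 3, 7, 10)   # toy data, not from the survey
x <- 3                   # candidate value used to represent v

mismatches(x, v)
abs_error(x, v)
sq_error(x, v)
```

Each function gives a different answer to "how wrong is x?", so each one can prefer a different x.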

## Absolute error

Absolute error when $$x$$ represents $$\mathbf v$$ $\mathrm{AE}(x, \mathbf{v})=\sum_i |v_i-x|$

Which $$x$$ minimizes absolute error?

## Median: minimum Absolute Error

We get the minimum absolute error when $$x=171$$
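One way to check this value is a brute-force search: compute the absolute error for many candidate values of x and keep the smallest. The sketch below assumes a vector rebuilt from the `table(height)` output above; the names `values`, `counts`, `abs_error`, `candidates` and `best` are made up for this example.

```r
# Same values as in the table(height) output, in sorted order
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

abs_error <- function(x, v) sum(abs(v - x))

# Try every half centimetre between the minimum and the maximum
candidates <- seq(min(height), max(height), by = 0.5)
errors <- sapply(candidates, abs_error, v = height)

best <- candidates[which.min(errors)]
best
median(height)
```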

## Median

If x is the median of v, then

• half of the values in v are smaller than x
• half of the values in v are bigger than x

The median minimizes the absolute error

## Squared error

The squared error when $$x$$ represents $$\mathbf v$$ is $\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2$

Which $$x$$ minimizes the squared error?

## Mean: minimum Squared error

We get the minimum squared error when $$x=170.6862745$$
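This value can also be found numerically. The sketch below uses `optimize()`, the one-dimensional minimizer that comes with R, on a vector rebuilt from the `table(height)` output above. The names `values`, `counts`, `sq_error` and `fit` are assumptions for this example, and the numerical answer is only approximate.

```r
# Same values as in the table(height) output, in sorted order
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

sq_error <- function(x, v) sum((v - x)^2)

# Search for the x between min and max that gives the smallest squared error
fit <- optimize(sq_error, interval = range(height), v = height)

fit$minimum     # numerical minimizer, approximate
mean(height)    # the arithmetic mean
```

The two numbers should agree up to the tolerance of the search.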

## Minimizing SE using math

The error is $\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2$

To find the minimal value we can take the derivative of $$SE$$ with respect to $$x$$

$\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i (v_i - x)= -2\sum_i v_i + 2nx$

A minimum of a differentiable function can only occur where the derivative is zero

## Minimizing SE using math

Now we find the value of $$x$$ that makes the derivative equal to zero.

$\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i v_i + 2nx$

Setting this last formula equal to zero and solving for $$x$$, we find that the best value is

$x = \frac{1}{n} \sum_i v_i$

## Arithmetic Mean

The mean value of $$\mathbf v$$ is $\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i$ where $$n$$ is the length of the vector $$\mathbf v$$.

Sometimes it is written as $$\bar{\mathbf v}$$

This value is called the arithmetic mean, or simply the mean

In R we write mean(v)
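The formula and the function can be compared directly. This is a minimal sketch with a hypothetical toy vector `v`; the name `by_hand` is also made up for the example.

```r
v <- c(2, 3, 3, 7, 10)   # toy data, not from the survey

# The formula: sum of the values divided by how many there are
by_hand <- sum(v) / length(v)

by_hand
mean(v)

# The errors around the mean cancel out, so the derivative of SE is zero there
sum(v - mean(v))
```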

## In summary

summary(height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
155.0   163.0   171.0   170.7   176.5   195.0 

What are these values?

## Minimum, Maximum and Range

The easiest to understand are minimum and maximum

min(height)
[1] 155
max(height)
[1] 195

## Range: min and max together

Sometimes it is useful to have both together

range(height)
[1] 155 195

## Quartiles

Quart means one fourth in Latin.

If we split the set of values into four subsets of the same size, what are the limits of these subsets? $Q_0, Q_1, Q_2, Q_3, Q_4$

It is easy to know $$Q_0, Q_2$$ and $$Q_4$$

## Quartiles

• $$Q_0$$: Zero elements are smaller than this one
• $$Q_1$$: One quarter of the elements are smaller
• $$Q_2$$: Two quarters (half) of the elements are smaller
• $$Q_3$$: Three quarters of the elements are smaller
• $$Q_4$$: Four quarters (all) of the elements are smaller

It is easy to see that $$Q_0$$ is the minimum, $$Q_2$$ is the median, and $$Q_4$$ is the maximum
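These three identities are easy to check in R. The sketch below uses a hypothetical toy vector `v` and asks `quantile()` for the probabilities 0, 0.5 and 1; the name `q` is an assumption for this example.

```r
v <- c(2, 3, 3, 7, 10)   # toy data, not from the survey

# Q0, Q2 and Q4 correspond to the probabilities 0, 0.5 and 1
q <- quantile(v, probs = c(0, 0.5, 1))
q

# They match the minimum, the median and the maximum
c(min(v), median(v), max(v))
```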

## Quartiles and Quantiles

Generalizing, we can ask, for each percentage $$p$$, which value is greater than $$p$$% of the values in the vector v.

The function in R for that is called quantile()

By default it gives us the quartiles

quantile(height)
   0%   25%   50%   75%  100%
155.0 163.0 171.0 176.5 195.0 

## quantile() gives quartiles

quantile(height)
   0%   25%   50%   75%  100%
155.0 163.0 171.0 176.5 195.0 

unless we ask for something else

quantile(height, seq(0, 1, by=0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
155  160  162  165  167  171  174  175  178  182  195 
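To see what one of these numbers means, we can ask for a single quantile and count how many values are at or below it. The sketch assumes a vector rebuilt from the `table(height)` output above; the names `values`, `counts` and `q90` are made up for this example.

```r
# Same values as in the table(height) output, in sorted order
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

# Only the 90% quantile
q90 <- quantile(height, 0.9)
q90

# Proportion of people whose height is at or below that value:
# close to 0.9, not exactly, because there are only 51 values
mean(height <= q90)
```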

## Summary

summary(height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
155.0   163.0   171.0   170.7   176.5   195.0 

## Dividing the vector in groups

The function cut() splits the range of the vector into intervals and returns a factor that tells which interval each value belongs to. This is a factor:

cut(height, 4)
 [1] (175,185] (165,175] (165,175] (175,185] (155,165]
[6] (165,175] (165,175] (155,165] (165,175] (185,195]
[11] (175,185] (175,185] (155,165] (155,165] (175,185]
[16] (155,165] (165,175] (155,165] (155,165] (155,165]
[21] (165,175] (155,165] (165,175] (155,165] (155,165]
[26] (155,165] (165,175] (165,175] (175,185] (175,185]
[31] (175,185] (165,175] (165,175] (165,175] (165,175]
[36] (165,175] (165,175] (155,165] (165,175] (155,165]
[41] (175,185] (155,165] (175,185] (165,175] (165,175]
[46] (155,165] (155,165] (185,195] (155,165] (175,185]
[51] (175,185]
Levels: (155,165] (165,175] (175,185] (185,195]

## Are these the quartiles?

Used this way, the range is split into intervals of the same width, not into groups with the same number of people

table(cut(height, 4))
(155,165] (165,175] (175,185] (185,195]
18        19        12         2 

These are not the quartiles

## Using the real quartiles

We can specify the cut points using quantile()

table(cut(height, quantile(height),
include.lowest = TRUE))
[155,163] (163,171] (171,176] (176,195]
14        12        12        13 

Now every group has (almost) the same size
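The two ways of cutting can be compared side by side. This sketch assumes a vector rebuilt from the `table(height)` output above, writes the equal-width breaks explicitly instead of using `cut(height, 4)`, and gives the quartile groups hypothetical labels through the `labels` argument of `cut()`. The names `values`, `counts`, `equal_width` and `by_quartile` are made up for this example.

```r
# Same values as in the table(height) output, in sorted order
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, times = counts)

# Four intervals of the same width (10 cm each)
equal_width <- cut(height, breaks = seq(155, 195, by = 10),
                   include.lowest = TRUE)
table(equal_width)

# Four groups limited by the quartiles, with names chosen for this example
by_quartile <- cut(height, breaks = quantile(height), include.lowest = TRUE,
                   labels = c("lowest", "second", "third", "highest"))
table(by_quartile)
```

The first table is unbalanced because few people are very tall; the second one has groups of 12 to 14 people.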