Class 9: Essential descriptive statistics

November 8th, 2016

Location

If you have to describe the vector v with a single number x, which would it be?

If we have to replace every v[i] with a single number, which number is “the best”?

We should choose the one that is the “least wrong”

How can x be wrong?

Measuring error

Many alternatives

Number of errors sum(v != x)
Absolute error sum(abs(v-x))
Squared error sum((v-x)^2)
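The three measures can be compared side by side for one candidate value. This is a minimal sketch: it assumes v is the built-in rivers data set (it matches the numbers quoted later in these slides), and the candidate x = 500 is an arbitrary choice for illustration.

```r
# Assumption: v is the built-in rivers data set; x = 500 is an arbitrary candidate
v <- rivers
x <- 500

n_errors  <- sum(v != x)       # how many elements differ from x
abs_error <- sum(abs(v - x))   # total distance between the elements and x
sq_error  <- sum((v - x)^2)    # total squared distance between the elements and x

c(count = n_errors, absolute = abs_error, squared = sq_error)
```

The count only says whether each element is hit or missed, while the other two also measure how far off x is.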

Absolute error

Absolute error when \(x\) represents \(\mathbf v\) \[\mathrm{AE}(x, \mathbf{v})=\sum_i |v_i-x|\] or, in R code

sum(abs(v-x))

Which \(x\) minimizes absolute error?

Absolute error

We get the minimum absolute error when \(x=425\)
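One way to find this value numerically is a brute-force search over candidate values. The sketch below assumes v is the rivers data set; the helper name abs_error and the grid of whole numbers are choices made here for illustration, not part of the original slides.

```r
# Assumption: v is the built-in rivers data set
v <- rivers

# Absolute error of a single candidate x (hypothetical helper name)
abs_error <- function(x, v) sum(abs(v - x))

# Try every whole number between the smallest and the largest element
candidates <- seq(min(v), max(v), by = 1)
errors <- sapply(candidates, abs_error, v = v)

# Candidate with the smallest absolute error
best_x <- candidates[which.min(errors)]
best_x
```

Plotting errors against candidates (for example with plot(candidates, errors, type = "l")) shows a V-like curve with its lowest point at best_x.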

Median

If x is the median of v, then

half of the values in v are smaller than x
half of the values in v are bigger than x

The median minimizes the absolute error
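These properties can be checked directly in R. The sketch assumes v is the rivers data set, which has an odd number of elements, so the median is simply the middle element of the sorted vector.

```r
# Assumption: v is the built-in rivers data set (odd number of elements)
v <- rivers
x <- median(v)

mean(v < x)   # fraction of values strictly smaller than x, at most 0.5
mean(v > x)   # fraction of values strictly bigger than x, at most 0.5

# With an odd length n, the median is the element in the middle after sorting
n <- length(v)
sort(v)[(n + 1) / 2] == x
```

When several elements are equal to the median, the two fractions are slightly below one half, which is why "half" is best read as "at most half".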

Squared error

The squared error when \(x\) represents \(\mathbf v\) is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\] or, in R code

sum((v-x)^2)

Which \(x\) minimizes the squared error?

Squared error

We get the minimum squared error when \(x=591.1843972\)
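A value like this can be found numerically with R's one-dimensional optimizer optimize(). This is a sketch assuming v is the rivers data set; the helper name sq_error is hypothetical, and the result is only accurate up to the optimizer's tolerance.

```r
# Assumption: v is the built-in rivers data set
v <- rivers

# Squared error of a single candidate x (hypothetical helper name)
sq_error <- function(x, v) sum((v - x)^2)

# Search for the minimum between the smallest and the largest element
opt <- optimize(sq_error, interval = range(v), v = v)

opt$minimum   # value of x with the smallest squared error found
mean(v)       # for comparison
```

The two printed numbers should agree to several decimal places.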

Minimizing SE using math

The error is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\]

To find the minimal value we can take the derivative of \(\mathrm{SE}\) with respect to \(x\)

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i (v_i - x)= 2nx - 2\sum_i v_i\]

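The minimum of a differentiable function is located at a point where its derivative is zero. Since \(\mathrm{SE}\) is an upward-opening parabola in \(x\), it has exactly one such point, and that point is the minimum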

Minimizing SE using math

Now we find the value of \(x\) that makes the derivative equal to zero.

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= 2nx - 2\sum_i v_i\]

Setting this last formula equal to zero and solving for \(x\), we find that the best value is

\[x = \frac{1}{n} \sum_i v_i\]
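The result can be checked numerically by evaluating the derivative at this value and on either side of it. This sketch assumes v is the rivers data set; d_se and x_star are names introduced here for illustration.

```r
# Assumption: v is the built-in rivers data set
v <- rivers

# Derivative of the squared error with respect to x (hypothetical helper name)
d_se <- function(x, v) -2 * sum(v - x)

# Candidate from the formula: sum of the elements divided by their number
x_star <- sum(v) / length(v)

d_se(x_star, v)       # approximately zero, up to rounding error
d_se(x_star - 1, v)   # negative: the error is still decreasing
d_se(x_star + 1, v)   # positive: the error is increasing again
```

The sign change from negative to positive confirms that x_star is a minimum and not a maximum.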

Arithmetic Mean

The mean value of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of the vector \(\mathbf v\).

Sometimes it is written as \(\bar{\mathbf v}\)

This value is called the mean

In R we write mean(v)

In summary

summary(rivers)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  135.0   310.0   425.0   591.2   680.0  3710.0

What are these values?

Minimum, Maximum and Range

The easiest to understand are minimum and maximum

min(rivers)

[1] 135

max(rivers)

[1] 3710

Sometimes it is useful to get both values together

range(rivers)

[1]  135 3710
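As a small illustration, range() returns the minimum and maximum in a single vector, so the spread between them can be obtained with diff(). The variable name r is chosen here for illustration.

```r
# range() returns c(minimum, maximum)
r <- range(rivers)
all.equal(r, c(min(rivers), max(rivers)))

# Distance between the largest and the smallest value: 3710 - 135 = 3575
diff(r)
```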

Quartiles

Quartile comes from the Latin quartus, meaning “fourth”.

If we sort the values and split them into four subsets of the same size, what are the limits of these subsets?

\(Q_0\): Zero elements are smaller than this one
\(Q_1\): One quarter of the elements are smaller
\(Q_2\): Two quarters (half) of the elements are smaller
\(Q_3\): Three quarters of the elements are smaller
\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the minimum, \(Q_2\) is the median, and \(Q_4\) is the maximum
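This correspondence can be verified in R by asking quantile() for the 0%, 50% and 100% points. A minimal sketch on the rivers data set; the variable name q is chosen here for illustration.

```r
# Q0, Q2 and Q4 correspond to the probabilities 0, 0.5 and 1
q <- quantile(rivers, c(0, 0.5, 1))
q

# Compare with minimum, median and maximum (names are dropped before comparing)
all.equal(unname(q), c(min(rivers), median(rivers), max(rivers)))
```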

Quartiles and Quantiles

Generalizing, we can ask, for each percentage \(p\), what is the value that is greater than \(p\)% of the values in the vector v?

The function in R for that is called quantile()

By default it gives us the quartiles

quantile(rivers)

  0%  25%  50%  75% 100% 
 135  310  425  680 3710

quantile(rivers, seq(0, 1, by=0.1))

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 135  255  291  330  375  425  505  610  735 1054 3710
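To see what a quantile means in practice, we can take one of them and count how many values fall at or below it. This sketch uses the 90% point of rivers; the names p and q are chosen here for illustration, and quantile() is used with its default interpolation method.

```r
# The 90% quantile of the rivers data
p <- 0.9
q <- quantile(rivers, p)
q

# Fraction of values at or below q: about 90% of the data
mean(rivers <= q)

# Fraction of values above q: the remaining 10% or so
mean(rivers > q)
```

Because the data set is finite, the fractions are close to 0.9 and 0.1 but not exactly equal to them.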