If you have to describe the vector `v` with a single number `x`, which would it be?

If we have to replace each one of `v[i]` with a single number, which number is “the best”?

Better to choose the one that is the “least wrong”

How can `x` be wrong?

November 8th, 2016

Many alternatives

- Number of errors: `sum(v != x)`

- Absolute error: `sum(abs(v-x))`

- Squared error: `sum((v-x)^2)`
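As a rough sketch, the three measures can be compared for one candidate. The built-in `rivers` dataset standing in for `v`, and the candidate `x <- 500`, are assumptions chosen only for illustration.

```r
# Assumption: v is the built-in `rivers` dataset; x is an arbitrary candidate
v <- rivers
x <- 500

n_errors  <- sum(v != x)       # how many values differ from x
abs_error <- sum(abs(v - x))   # total absolute distance to x
sq_error  <- sum((v - x)^2)    # total squared distance to x

c(n_errors = n_errors, abs_error = abs_error, sq_error = sq_error)
```

The number of errors barely changes from one candidate to another, which is why the other two measures are more useful for choosing `x`.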

The absolute error when \(x\) represents \(\mathbf v\) is \[\mathrm{AE}(x, \mathbf{v})=\sum_i |v_i-x|\] or, in R code, `sum(abs(v-x))`

Which \(x\) minimizes absolute error?

We get the minimum absolute error when \(x=425\)
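One way to look for this minimum is to try many candidates and keep the best one. The sketch below assumes that `v` is the built-in `rivers` dataset and that integer candidates are enough; `abs_error`, `candidates` and `best_x` are names made up for this example.

```r
# Assumption: v is the built-in `rivers` dataset
v <- rivers

# Absolute error of a single candidate x
abs_error <- function(x) sum(abs(v - x))

# Try every integer between the smallest and the largest value
candidates <- seq(min(v), max(v), by = 1)
errors <- sapply(candidates, abs_error)

# Candidate with the smallest absolute error
best_x <- candidates[which.min(errors)]
best_x
```

A plot such as `plot(candidates, errors, type = "l")` shows the shape of the error around the minimum.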

If `x` is the *median* of `v`, then

- half of the values in `v` are smaller than `x`

- half of the values in `v` are bigger than `x`

The *median* minimizes the absolute error
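A quick check of the "half and half" property, again assuming that `v` is the built-in `rivers` dataset. When the length of `v` is odd, or when some values are equal to the median, both proportions come out slightly below one half.

```r
# Assumption: v is the built-in `rivers` dataset
v <- rivers
x <- median(v)

prop_smaller <- mean(v < x)   # proportion of values strictly smaller than x
prop_bigger  <- mean(v > x)   # proportion of values strictly bigger than x

c(smaller = prop_smaller, bigger = prop_bigger)
```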

The squared error when \(x\) represents \(\mathbf v\) is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\] or, in R code, `sum((v-x)^2)`

Which \(x\) minimizes the squared error?

We get the minimum squared error when \(x=591.1843972\)
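The minimum can also be located numerically. This is only a sketch: it assumes `v` is the built-in `rivers` dataset and uses the base R function `optimize()` to search between the smallest and the largest value; `sq_error` and `opt` are names made up for this example.

```r
# Assumption: v is the built-in `rivers` dataset
v <- rivers

# Squared error of a single candidate x
sq_error <- function(x) sum((v - x)^2)

# One-dimensional numerical search inside the range of the data
opt <- optimize(sq_error, interval = range(v))

opt$minimum   # the x found by the search
mean(v)       # for comparison
```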

The error is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\]

To find the minimum we can take the derivative of \(\mathrm{SE}\) with respect to \(x\)

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i (v_i - x)= 2nx - 2\sum_i v_i\]

**The minimum of a differentiable function is located where its derivative is zero**

Now we find the value of \(x\) that makes the derivative equal to zero.

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= 2nx - 2\sum_i v_i\]

Setting this last formula equal to zero and solving for \(x\), we find that the best one is

\[x = \frac{1}{n} \sum_i v_i\]
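The result can be checked numerically without using the formula for the derivative. The sketch below uses a small toy vector (an assumption, chosen so the arithmetic is easy to follow) and approximates the slope of the squared error with a central difference; `slope` and `x_best` are names made up for this example.

```r
# Hypothetical toy vector, chosen so the arithmetic is easy to follow
v <- c(2, 4, 9)
n <- length(v)

# Squared error of a single candidate x
sq_error <- function(x) sum((v - x)^2)

# Numerical approximation of the derivative (central difference)
slope <- function(x, h = 1e-4) (sq_error(x + h) - sq_error(x - h)) / (2 * h)

x_best <- sum(v) / n   # the candidate given by the formula

slope(x_best - 1)      # negative: the error is still going down
slope(x_best)          # practically zero
slope(x_best + 1)      # positive: the error is going up again
```

The slope changes from negative to positive at `x_best`, so this point is a minimum and not a maximum.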

The *mean value* of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of the vector \(\mathbf v\).

Sometimes it is written as \(\bar{\mathbf v}\)

This value is called the *mean*

In R we write `mean(v)`
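The definition can be followed step by step and compared with the built-in function. The use of the `rivers` dataset as `v` is an assumption for this example.

```r
# Assumption: v is the built-in `rivers` dataset
v <- rivers
n <- length(v)

by_hand <- sum(v) / n   # the formula: sum of all values divided by n
by_hand
mean(v)                 # the built-in function
```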

summary(rivers)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  135.0   310.0   425.0   591.2   680.0  3710.0

What are these values?

The easiest to understand are *minimum* and *maximum*

min(rivers)

[1] 135

max(rivers)

[1] 3710

Sometimes it is useful to get both together

range(rivers)

[1] 135 3710

*Quart* means *one fourth* in Latin.

If we sort the values and split them into four subsets of the same size, what are the limits of these subsets?

\(Q_0\): Zero elements are smaller than this one

\(Q_1\): One quarter of the elements are smaller

\(Q_2\): Two quarters (half) of the elements are smaller

\(Q_3\): Three quarters of the elements are smaller

\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the *minimum*, \(Q_2\) is the *median*, and \(Q_4\) is the *maximum*
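The limits can be read directly from the sorted values. This sketch assumes that `v` is the built-in `rivers` dataset and that the cut positions are whole numbers, which happens when the length of `v` minus one is divisible by four; `s` and `pos` are names made up for this example.

```r
# Assumption: v is the built-in `rivers` dataset
v <- rivers
s <- sort(v)
n <- length(s)

# Positions that cut the sorted values at 0, 1/4, 2/4, 3/4 and 4/4
pos <- 1 + (n - 1) * c(0, 0.25, 0.5, 0.75, 1)

limits <- s[pos]
limits
```

When the positions are not whole numbers, a rule is needed to choose a value between two neighbors, so this shortcut no longer applies directly.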

Generalizing, we can ask, for each percentage \(p\), which value in the vector `v` is greater than \(p\)% of the values.

The function in *R* for that is called `quantile()`

By default it gives us the *quartiles*

quantile(rivers)

  0%  25%  50%  75% 100%
 135  310  425  680 3710

quantile(rivers, seq(0, 1, by=0.1))

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
 135  255  291  330  375  425  505  610  735 1054 3710
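To see what one of these values means, we can count how many elements fall below it. The sketch takes the 90% value as an example; the name `q90` is made up here.

```r
v <- rivers

# Value that separates the lowest 90% of the elements from the rest
q90 <- quantile(v, 0.9)

below    <- mean(v < q90)    # proportion strictly smaller than q90
at_most  <- mean(v <= q90)   # proportion smaller than or equal to q90

c(below = below, at_most = at_most)
```

With a finite number of elements the two proportions sit just below and just above 0.9 instead of matching it exactly.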