October 18, 2018

We have our own data

survey <- read.table("survey1-tidy.txt")

We can even take a vector from our data

height <- survey$height

So what?

Descriptive Statistics

We have data, and we want to say something about it

What can we say about this set of numbers?

How can we make a summary of all the values in a few numbers?

Standard Data Descriptors

  • Number of elements (How many?)
  • Location (Where?)
  • Dispersion (Are they homogeneous? Are they similar to each other?)

How many in total

  • For vectors we use length()
  • For matrices and data frames we use nrow()
  • dim() gives us rows and columns

Counting how many in total

length(height)
[1] 51
nrow(survey)
[1] 51
dim(survey)
[1] 51  8
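A minimal sketch of the same three functions on a tiny data frame. The data frame `toy` and its values are made up for illustration; they are not the survey data.

```r
# A tiny made-up data frame, only for illustration (not the survey data)
toy <- data.frame(
  height = c(160, 172, 181),
  gender = c("Female", "Male", "Male")
)

length(toy$height)  # number of elements in one column (a vector)
nrow(toy)           # number of rows
ncol(toy)           # number of columns
dim(toy)            # rows and columns together
length(toy)         # careful: for a data frame this is the number of columns
```

Note that length() on the whole data frame counts columns, so nrow() is the safer choice when we want to know how many people there are.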

Counting how many of each

table() should really be called “count”: it counts how many times each value appears. It works well for factors

table(survey$handness)
 Left Right 
    4    47 
table(survey$Gender)
Female   Male 
    30     21 

We can count combinations

table(survey$handness, survey$Gender)
       
        Female Male
  Left       3    1
  Right     27   20

This looks more like a table

table() is not useful for numeric vectors

table(height)
height
155 157 158 159 160 162 163 164 165 166 167 168 170 
  1   1   2   1   3   3   3   1   3   2   2   1   2 
171 172 173 174 175 176 177 178 179 180 181 182 183 
  1   1   3   3   4   1   1   2   1   2   1   1   1 
184 185 188 195 
  1   1   1   1 

It is not a good summary

What can we say?

Counting TRUE values

How many people are taller than 165 cm?

sum(height > 165)
[1] 33
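Why this works, as a small sketch: the comparison produces a logical vector, and sum() treats TRUE as 1 and FALSE as 0. The vector `h` below is made up for illustration.

```r
# Made-up heights in cm, only for illustration
h <- c(158, 163, 170, 175, 182)

taller <- h > 165   # logical vector: FALSE FALSE TRUE TRUE TRUE
sum(taller)         # TRUE counts as 1 and FALSE as 0, so this counts the TRUE values
mean(taller)        # the same idea gives the proportion of TRUE values
```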

Location

If you have to describe the vector v with a single number x, which would it be?

If we have to replace every v[i] with a single number, which number is “the best”?

It is better to choose the one that is the “least wrong”

How can x be wrong?

There are many ways to measure the error

  • Number of times that x!=v[i]
  • Sum of absolute value of error
  • Sum of the square of error
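The three measures in this list can be sketched in R as follows. The vector `v` and the candidate `x` are made up for illustration.

```r
# Made-up vector and candidate value, only for illustration
v <- c(2, 3, 3, 7, 10)
x <- 3

mismatches <- sum(x != v)      # number of times that x differs from v[i]
abs_error  <- sum(abs(v - x))  # sum of the absolute value of the error
sq_error   <- sum((v - x)^2)   # sum of the square of the error

mismatches
abs_error
sq_error
```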

Absolute error

The absolute error when \(x\) represents \(\mathbf v\) is \[\mathrm{AE}(x, \mathbf{v})=\sum_i |v_i-x|\]

Which \(x\) minimizes absolute error?

Median: minimum Absolute Error

We get the minimum absolute error when \(x=171\)
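One way to check this numerically, as a sketch: try every whole number of centimetres as a candidate and keep the one with the smallest absolute error. Here `height` is rebuilt from the table(height) counts shown earlier, so it has the same values as the survey column but in sorted order. The names `abs_error`, `candidates` and `best` are chosen for this example.

```r
# Heights rebuilt from the table(height) counts shown earlier
# (same values as the survey column, but in sorted order)
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, counts)

# Absolute error when the single number x represents the vector v
abs_error <- function(x, v) sum(abs(v - x))

# Evaluate the error for every candidate and keep the smallest
candidates <- 155:195
errors <- sapply(candidates, abs_error, v = height)
best <- candidates[which.min(errors)]

best
median(height)
```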

Median

If x is the median of v, then

  • about half of the values in v are smaller than x
  • about half of the values in v are bigger than x

The median minimizes the absolute error

Squared error

The squared error when \(x\) represents \(\mathbf v\) is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\] Which \(x\) minimizes the squared error?

Mean: minimum Squared error

We get the minimum squared error when \(x=170.6862745\)
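A numerical check, as a sketch: R's optimize() searches an interval for the minimum of a function of one variable. As before, `height` is rebuilt from the table(height) counts shown earlier. The names `sq_error` and `fit` are chosen for this example.

```r
# Heights rebuilt from the table(height) counts shown earlier
# (same values as the survey column, but in sorted order)
values <- c(155, 157, 158, 159, 160, 162, 163, 164, 165, 166, 167, 168, 170,
            171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
            184, 185, 188, 195)
counts <- c(1, 1, 2, 1, 3, 3, 3, 1, 3, 2, 2, 1, 2,
            1, 1, 3, 3, 4, 1, 1, 2, 1, 2, 1, 1, 1,
            1, 1, 1, 1)
height <- rep(values, counts)

# Squared error when the single number x represents the vector v
sq_error <- function(x, v) sum((v - x)^2)

# Numerical search for the x that minimizes the squared error
fit <- optimize(sq_error, interval = range(height), v = height)

fit$minimum   # numerical minimizer, close to the mean
mean(height)
```

The numerical answer agrees with mean(height) only up to the tolerance of the search.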

Median and mean are different

usually

Minimizing SE using math

The error is \[\mathrm{SE}(x, \mathbf{v})=\sum_i (v_i-x)^2\]

To find the minimum we take the derivative of \(\mathrm{SE}\) with respect to \(x\)

\[\frac{d}{dx} \mathrm{SE}(x, \mathbf{v})= -2\sum_i (v_i - x)= 2nx - 2\sum_i v_i\]

The minimum of a smooth function is located where its derivative is zero, so now we find the value of \(x\) that makes the derivative equal to zero.

Setting this formula equal to zero and solving for \(x\), we find that the best value is

\[x = \frac{1}{n} \sum_i v_i\]

Arithmetic Mean

The mean value of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\] where \(n\) is the length of the vector \(\mathbf v\).

Sometimes it is written as \(\bar{\mathbf v}\)

This value is called the arithmetic mean

In R we write mean(v)
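A quick check of the formula, as a sketch on a made-up vector: mean(v) gives the same number as the sum divided by the length, and the deviations from the mean add up to zero.

```r
# Made-up vector, only for illustration
v <- c(2, 3, 3, 7, 10)
n <- length(v)

by_hand <- sum(v) / n   # the formula: (1/n) * sum of v[i]
by_hand
mean(v)                 # the built-in function gives the same value

sum(v - mean(v))        # deviations from the mean cancel out
```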

In summary

summary(height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  155.0   163.0   171.0   170.7   176.5   195.0 

What are these values?

Minimum, Maximum and Range

The easiest to understand are minimum and maximum

min(height)
[1] 155
max(height)
[1] 195

Range: min and max together

Sometimes it is useful to get both at once

range(height)
[1] 155 195

Quartiles

“Quart” means one fourth in Latin.

If we sort the values and split them into four subsets of the same size, what are the limits of these subsets? \[Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know \(Q_0, Q_2\) and \(Q_4\)

Quartiles

\(Q_0\): Zero elements are smaller than this one
\(Q_1\): One quarter of the elements are smaller
\(Q_2\): Two quarters (half) of the elements are smaller
\(Q_3\): Three quarters of the elements are smaller
\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the minimum, \(Q_2\) is the median, and \(Q_4\) is the maximum
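This can be checked with quantile(), the R function that computes these limits. A minimal sketch on a made-up vector:

```r
# Made-up vector, only for illustration
v <- c(2, 3, 3, 7, 10)

q <- quantile(v)   # named vector with entries 0%, 25%, 50%, 75%, 100%
q

unname(q["0%"])   == min(v)      # Q0 is the minimum
unname(q["50%"])  == median(v)   # Q2 is the median
unname(q["100%"]) == max(v)      # Q4 is the maximum
```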

Quartiles and Quantiles

Generalizing, for each percentage \(p\) we can ask which value in the vector v is greater than \(p\)% of the values.

The function in R for that is called quantile()

By default it gives us the quartiles

quantile(height)
   0%   25%   50%   75%  100% 
155.0 163.0 171.0 176.5 195.0 

unless we ask for something else

quantile(height, seq(0, 1, by=0.1))
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 155  160  162  165  167  171  174  175  178  182  195 
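The second argument of quantile() is called probs and takes proportions between 0 and 1. A small sketch on a made-up vector where the answer is easy to predict:

```r
# Made-up vector: the whole numbers from 0 to 100
v <- 0:100

# Ask only for the 10%, 50% and 90% quantiles
q <- quantile(v, probs = c(0.1, 0.5, 0.9))
q
```

For this vector the p% quantile is simply p, which makes it easy to see what the function returns.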

In summary()

Summary

summary(height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  155.0   163.0   171.0   170.7   176.5   195.0 

Which is my quartile?

Dividing the vector into groups

The function cut() splits the range of the vector into intervals and returns a factor that says which interval each value belongs to:

cut(height, 4)
 [1] (175,185] (165,175] (165,175] (175,185] (155,165]
 [6] (165,175] (165,175] (155,165] (165,175] (185,195]
[11] (175,185] (175,185] (155,165] (155,165] (175,185]
[16] (155,165] (165,175] (155,165] (155,165] (155,165]
[21] (165,175] (155,165] (165,175] (155,165] (155,165]
[26] (155,165] (165,175] (165,175] (175,185] (175,185]
[31] (175,185] (165,175] (165,175] (165,175] (165,175]
[36] (165,175] (165,175] (155,165] (165,175] (155,165]
[41] (175,185] (155,165] (175,185] (165,175] (165,175]
[46] (155,165] (155,165] (185,195] (155,165] (175,185]
[51] (175,185]
Levels: (155,165] (165,175] (175,185] (185,195]

Are these the quartiles?

Used this way, the range is split into intervals of the same width, not into groups with the same number of people

table(cut(height, 4))
(155,165] (165,175] (175,185] (185,195] 
       18        19        12         2 

These are not the quartiles

Using the real quartiles

We can specify the cut points using quantile()

table(cut(height, quantile(height),
          include.lowest = TRUE))
[155,163] (163,171] (171,176] (176,195] 
       14        12        12        13 

Now every group has (almost) the same size
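The difference between the two ways of cutting is easier to see with an extreme value. This is a sketch on a made-up vector; the group names passed to the labels argument of cut() are chosen for this example.

```r
# Made-up values with one extreme element, only for illustration
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 40)

# Four intervals of the same width: most values fall in the first one
equal_width <- table(cut(v, 4))
equal_width

# Four intervals with the quartiles as limits: same number of values in each
by_quartile <- table(cut(v, quantile(v), include.lowest = TRUE))
by_quartile

# The labels argument gives a name to each group, so we can ask
# which quartile group a particular value belongs to
group <- cut(v, quantile(v), include.lowest = TRUE,
             labels = c("first", "second", "third", "fourth"))
group[v == 10]
```

When several people have exactly the same value at a cut point, the groups cannot all have the same size, which is why the survey groups are only almost equal.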

Next Monday: Quiz for Rehearsal