```
# A tibble: 117 x 10
answer_date id english_level sex birthdate birthplace
<date> <chr> <chr> <chr> <date> <chr>
1 2018-09-17 3e50… I can speak … Male 1993-02-01 turkey
2 2018-09-17 479d… I can unders… Fema… 1998-05-21 Kahramanm…
3 2018-09-17 39df… I can read a… Fema… 1998-01-18 Batman, T…
4 2018-09-17 d2b0… I can read a… Male 1998-08-29 Antalya,T…
5 2018-09-17 f22b… I can read a… Fema… 1998-05-03 izmir
6 2018-09-17 849c… İngilizce bi… Fema… 1995-10-09 Türkiye /…
7 2018-09-17 8381… I can speak … Fema… 1997-09-19 Adıyaman,…
8 2018-09-17 b0dd… I can read a… Male 1997-11-27 Bursa
9 2018-09-17 2972… I can read a… Fema… 1999-01-02 İstanbul/…
10 2018-09-17 72c0… I can read a… Fema… 1998-10-02 İstanbul,…
# … with 107 more rows, and 4 more variables: height_cm <dbl>,
# weight_kg <dbl>, handness <chr>, hand_span <dbl>
```

Today we will not use `NA` values

```
[1] 67.0 55.0 74.0 68.0 58.0 72.0 68.0 58.0 55.0
[10] 81.0 42.5 69.0 58.0 47.0 78.0 57.0 55.0 55.0
[19] 65.0 60.0 50.0 52.0 54.0 75.0 105.0 56.0 50.0
[28] 67.0 59.0 75.0 60.0 60.0 106.0 94.0 63.0 54.0
[37] 53.0 75.0 70.0 65.0 65.0 55.0 68.0 55.0 80.0
[46] 77.0 85.0 65.0 64.0 64.0 60.0 76.0 56.0 78.0
[55] 77.0 72.0 58.0 66.0 52.0 73.0 82.0 55.0 86.0
[64] 63.0 85.0 58.0 65.0 65.0 70.0 47.0 82.0 70.0
[73] 75.0 47.0 72.0 61.0 79.0 55.0 74.0 47.0 54.0
[82] 60.0 74.0 56.0 65.0 49.0 63.0 65.0 47.0 90.0
[91] 90.0 76.0 88.0 80.0 72.0 47.0 61.0 95.0 67.0
[100] 80.0
```

We have data, and we want to say something about it

What can we tell about this set of numbers?

How can we make a summary of all the values using only a few numbers?

- Number of elements
*How many?*

- Location
*Where?*

- Dispersion
*Are they homogeneous?* *Are they similar to each other?*

- For data frames and tibbles we use `nrow()`

`[1] 117`

`dim()` gives us *rows* and *columns*

`[1] 117 10`

- For vectors we use `length()`

`[1] 100`
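As a minimal sketch of these three functions, the snippet below uses a small made-up data frame. The name `survey` and its values are hypothetical; only the column names imitate the survey.

```r
# Hypothetical toy data: the values are made up for illustration
survey <- data.frame(
  sex       = c("Female", "Male", "Female"),
  weight_kg = c(67, 55, 74)
)

nrow(survey)              # number of rows (observations)
ncol(survey)              # number of columns (variables)
dim(survey)               # rows and columns together
length(survey$weight_kg)  # number of elements in one column (a vector)
```

Note that `length()` applied to the whole data frame would count columns, so it is safer to apply it to a single column.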

`table()` should be called **count**: it counts how many times each value appears

```
Left Right
12 105
```

```
Female Male
77 39
```

This looks more like a table

```
        Female Male
  Left       9    3
  Right     68   36
```
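The outputs above look like one-way and two-way calls to `table()`. Here is a small sketch with made-up answers; the vectors `handness` and `sex` below are hypothetical and are not the real survey columns.

```r
# Hypothetical toy answers, not the real survey data
handness <- c("Right", "Left", "Right", "Right", "Left", "Right")
sex      <- c("Female", "Male", "Male", "Female", "Female", "Female")

counts <- table(handness)       # one vector: how many of each value
cross  <- table(handness, sex)  # two vectors: counts for each combination

counts
cross
```

With two vectors, rows come from the first argument and columns from the second.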

`table()` is not good with numeric values

```
weight
 42.5    47    49    50    52    53    54    55    56    57    58    59
    1     6     1     2     2     1     3     8     3     1     5     1
   60    61    63    64    65    66    67    68    69    70    72    73
    5     2     3     2     8     1     3     3     1     3     4     1
   74    75    76    77    78    79    80    81    82    85    86    88
    3     4     2     2     2     1     3     1     2     2     1     1
   90    94    95   105   106
    2     1     1     1     1
```

It is not a *good summary*

What can we say?

Where are the values?

The easiest are *minimum* and *maximum*

`[1] 42.5`

`[1] 106`

Sometimes it is useful to get both together

`[1] 42.5 106.0`

(This is like `dim()`: it combines two functions into one)
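The slides do not show the calls behind these outputs. A plausible sketch, assuming they come from `min()`, `max()` and `range()`, using the first ten weights printed earlier:

```r
# First ten weights from the printed vector
v <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

min(v)    # smallest value
max(v)    # largest value
range(v)  # both at once: c(min(v), max(v))
```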

If you had to describe the vector `v` with a single number `x`, which would it be?

If we had to replace every `v[i]` with one single number, which number would be “the best”?

There are several possible answers to that question.

There are several possible *averages*

If \(\mathbf v=(v_1,\ldots,v_n)\) is a vector, then the *mean value* of \(\mathbf v\) is \[\text{mean}(\mathbf v) = \frac{1}{n}\sum_{i=1}^n v_i\]

This value is called the *mean* of \(\mathbf v\)

In R we write `mean(v)`
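To connect the formula with the function, this short sketch computes the mean by hand and compares it with `mean()`. It uses the first ten weights printed earlier; `by_hand` is just an illustrative name.

```r
# First ten weights from the printed vector
v <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

n <- length(v)          # n in the formula
by_hand <- sum(v) / n   # (1/n) * sum of all v[i]

by_hand
mean(v)                 # same result
```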

Besides the arithmetic mean, we have

- Geometric mean
- Harmonic mean
- Quadratic mean
- Cubic mean

and many others

We use them only in a few specific places
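Base R has no dedicated functions for these other means, but each can be written in one line. The following is a sketch on a made-up vector of positive numbers; the variable names are only illustrative.

```r
# Made-up positive values (geometric and harmonic means need positive numbers)
v <- c(1, 2, 4)

arithmetic <- mean(v)
geometric  <- exp(mean(log(v)))   # n-th root of the product
harmonic   <- 1 / mean(1 / v)     # inverse of the mean of the inverses
quadratic  <- sqrt(mean(v^2))     # root of the mean of the squares
cubic      <- mean(v^3)^(1/3)     # cube root of the mean of the cubes

c(harmonic, geometric, arithmetic, quadratic, cubic)
```

For positive values these five means always come out in this increasing order.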

If `x` is the *median* of `v`, then

- half of the values in `v` are smaller than `x`
- half of the values in `v` are bigger than `x`
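One way to see this is to sort the values and take the one in the middle. The sketch below uses the first nine weights printed earlier, so the middle position is well defined; `sorted` and `middle` are illustrative names.

```r
# First nine weights from the printed vector (odd length)
v <- c(67, 55, 74, 68, 58, 72, 68, 58, 55)

sorted <- sort(v)                      # 55 55 58 58 67 68 68 72 74
middle <- sorted[(length(v) + 1) / 2]  # the 5th of the 9 sorted values

middle
median(v)            # same value
sum(v < median(v))   # how many values are smaller
sum(v > median(v))   # how many values are bigger
```

When the length is even there is no single middle element, and `median()` returns the mean of the two middle values.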

The *median* is often used as “average”

As in *“the average person thinks he/she is smarter than the average person”*

These are the mean and the median of our weights

`[1] 66.485`

`[1] 65`

The problem with the mean is that it is very sensitive to extreme values

Imagine that one day an elephant comes to our class

What happens with “the average weight”?

`[1] 125.2327`

`[1] 65`

Which one represents us better?
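The elephant experiment can be reproduced with a short sketch. The 100 weights are rebuilt from the `table()` output shown earlier, so they come out sorted. The elephant's weight of 6000 kg is an assumption: it is the value that matches the printed mean, and the slides do not state it.

```r
# Rebuild the 100 weights from the table() output (values and their counts)
values <- c(42.5, 47, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59,
            60, 61, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73,
            74, 75, 76, 77, 78, 79, 80, 81, 82, 85, 86, 88,
            90, 94, 95, 105, 106)
counts <- c(1, 6, 1, 2, 2, 1, 3, 8, 3, 1, 5, 1,
            5, 2, 3, 2, 8, 1, 3, 3, 1, 3, 4, 1,
            3, 4, 2, 2, 2, 1, 3, 1, 2, 2, 1, 1,
            2, 1, 1, 1, 1)
weight <- rep(values, times = counts)

# Assumed elephant weight in kg (not given in the slides)
with_elephant <- c(weight, 6000)

mean(weight)            # before the elephant
median(weight)
mean(with_elephant)     # the mean jumps
median(with_elephant)   # the median does not move
```

One extreme value almost doubles the mean, while the median stays where most of the class is.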

What are these values?

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
42.50 56.00 65.00 66.48 75.00 106.00
```

What are `1st Qu.` and `3rd Qu.`?

*Quart* means *one fourth* in Latin.

If we split the set of values into four subsets of the same size, what are the limits of these sets? \[Q_0, Q_1, Q_2, Q_3, Q_4\]

It is easy to know \(Q_0, Q_2\) and \(Q_4\)

\(Q_0\): Zero elements are smaller than this one

\(Q_1\): One quarter of the elements are smaller

\(Q_2\): Two quarters (half) of the elements are smaller

\(Q_3\): Three quarters of the elements are smaller

\(Q_4\): Four quarters (all) of the elements are smaller

It is easy to see that \(Q_0\) is the *minimum*, \(Q_2\) is the median, and \(Q_4\) is the maximum

Generalizing, we can ask, for each percentage \(p\): which value in the vector `v` is greater than \(p\)% of all the values?

These values are called *Quantiles*, or sometimes *Percentiles*

The function in *R* for quantiles is called `quantile()`

By default it gives us the *quartiles*

```
0% 25% 50% 75% 100%
42.5 56.0 65.0 75.0 106.0
```
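As a quick check that \(Q_0\), \(Q_2\) and \(Q_4\) are the minimum, median and maximum, here is a small sketch on the first ten weights printed earlier. The default call to `quantile()` uses the probabilities 0, 0.25, 0.5, 0.75 and 1.

```r
# First ten weights from the printed vector
v <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

q <- quantile(v)   # quartiles by default
q

# Compare the 0%, 50% and 100% entries with min, median and max
c(min(v), median(v), max(v))
```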

`quantile()` gives quartiles unless we ask for something else. For example, we can ask for every 10%:

```
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
42.5 51.8 55.0 58.0 61.0 65.0 68.0 73.3 77.0 82.3
100%
106.0
```
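The call that produced this output is not shown. A likely candidate, offered as an assumption, is `quantile()` with its `probs` argument set to steps of 0.1. The sketch rebuilds the 100 weights from the `table()` output shown earlier.

```r
# Rebuild the 100 weights from the table() output (values and their counts)
values <- c(42.5, 47, 49, 50, 52, 53, 54, 55, 56, 57, 58, 59,
            60, 61, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73,
            74, 75, 76, 77, 78, 79, 80, 81, 82, 85, 86, 88,
            90, 94, 95, 105, 106)
counts <- c(1, 6, 1, 2, 2, 1, 3, 8, 3, 1, 5, 1,
            5, 2, 3, 2, 8, 1, 3, 3, 1, 3, 4, 1,
            3, 4, 2, 2, 2, 1, 3, 1, 2, 2, 1, 1,
            2, 1, 1, 1, 1)
weight <- rep(values, times = counts)

# probs takes proportions between 0 and 1, not percentages
deciles <- quantile(weight, probs = seq(0, 1, by = 0.1))
deciles
```

Any other set of proportions works the same way, for example `probs = c(0.05, 0.95)`.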

`summary()`

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
42.50 56.00 65.00 66.48 75.00 106.00
```
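So `summary()` puts the location numbers seen so far into one call. A sketch that rebuilds the same six numbers by hand, using the first ten weights printed earlier (`by_hand` is an illustrative name):

```r
# First ten weights from the printed vector
v <- c(67, 55, 74, 68, 58, 72, 68, 58, 55, 81)

summary(v)

# The same six numbers, one function at a time
by_hand <- c(
  min(v),
  quantile(v, 0.25),   # 1st Qu.
  median(v),
  mean(v),
  quantile(v, 0.75),   # 3rd Qu.
  max(v)
)
by_hand
```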