November 12th, 2018

We have our own data

survey <- read.table("survey1-tidy.txt")
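
A minimal sketch of the reading step, using a tiny made-up table written to a temporary file (the file, `demo` and `survey_demo` are hypothetical, not the real survey). Since R 4.0.0 text columns are read as character by default, so `stringsAsFactors = TRUE` is assumed here to get factors, which the later slides rely on.

```r
# Hypothetical data, invented only for illustration
demo <- data.frame(handness  = c("Right", "Left", "Right"),
                   weight_kg = c(64, 55, 74.5))

# Write it to a temporary text file, then read it back
tmp <- tempfile(fileext = ".txt")
write.table(demo, tmp)

# stringsAsFactors = TRUE turns text columns into factors
survey_demo <- read.table(tmp, stringsAsFactors = TRUE)

str(survey_demo)                  # structure of the data frame
is.factor(survey_demo$handness)   # TRUE
```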

We want to tell something about them

- Counting them
- Locating them
- Describing them
- Telling a story

There are many ways to do this. You have to explore and learn

So far we have seen:

- `length()`
- `min()`, `max()`, `range()`
- `head()`, `tail()`
- `summary()`
- `table()`
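
As a quick reminder, here is what these functions do on small vectors. The values in `weight` and `hand` are made up for illustration; they are not the survey data.

```r
# Hypothetical values, invented only for illustration
weight <- c(55, 64, 70, 55, 81, 60)
hand   <- factor(c("Right", "Left", "Right", "Right", "Right", "Right"))

length(weight)    # number of elements: 6
min(weight)       # smallest value: 55
max(weight)       # largest value: 81
range(weight)     # both at once: 55 81
head(weight, 3)   # first three elements: 55 64 70
tail(weight, 2)   # last two elements: 81 60
summary(weight)   # quartiles and mean
table(hand)       # how many times each level appears
```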

length(survey$handness)

[1] 51

summary(survey$handness)

 Left Right 
    4    47 

table(survey$handness)

 Left Right 
    4    47 

length(survey$weight_kg)

[1] 51

summary(survey$weight_kg)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.50   55.00   64.00   65.56   74.50  106.00 

table(survey$weight_kg)

42.5   47   50   52   53   54   55   56   57   58   59 
   1    1    2    1    1    2    6    2    1    3    1 
  60   63   64   65   67   68   69   70   72   74   75 
   3    1    1    3    2    3    1    1    1    1    3 
  76   77   78   80   81   85   94  105  106 
   1    2    1    1    1    1    1    1    1 

Sometimes the best way to tell the story of the data is with a graphic

plot(survey$weight_kg)

- Each value has a different position on the horizontal axis, given by its index
- The vector’s index is a number from 1 to `length(vector)`
- The vertical axis represents the value of the element
- So if `vector[3]` contains the value `170`, we will have a point at the coordinates (3, 170)
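
To see this in a small case, the sketch below plots a made-up vector `v` (hypothetical heights) twice: once letting R use the index, once writing the index explicitly with `seq_along()`. Both calls should give the same picture.

```r
# Hypothetical heights in cm, invented only for illustration
v <- c(165, 180, 170, 158)

plot(v)                  # x is the index 1, 2, 3, 4; y is the value

# The same plot with the index written explicitly
plot(seq_along(v), v)

# Mark the third element, at coordinates (3, 170)
points(3, v[3], col = "red", pch = 16)
```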

plot(survey$height_cm)

plot(survey$height_cm, col="red")

There are several ways to specify the color

The easiest one is to use a number

Each point can have a different color. You use a vector of the same length as the data

Something like this

plot(1:8, col=1:8)
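
The color vector can also be computed from the data. A small sketch, assuming made-up heights and an arbitrary threshold of 170 cm (both are illustrations, not part of the survey):

```r
# Hypothetical heights in cm, invented only for illustration
height <- c(165, 180, 170, 158, 175)

# One color per point: red if taller than 170 cm, black otherwise
point_col <- ifelse(height > 170, "red", "black")

plot(height, col = point_col, pch = 16)
```

Colors given as numbers are looked up in the current `palette()`, while names like `"red"` are used directly.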

plot(survey$height_cm, cex=2)

plot(survey$height_cm, cex=0.5)

The parameter `cex` means *character expansion*

Each point can have a different size

You use a vector of the same length as the data

plot(1:8, cex=1:8)

plot(survey$height_cm, pch=16)

plot(survey$height_cm, pch=".")

The parameter `pch` means *plot character*

Each point can have a different symbol. You use a vector of the same length as the data

The plot character can be chosen by a number

plot(1:25, pch=1:25)

The plot character can also be chosen by a letter

plot(1:7, pch=c("A", "T", "a", "t", ".", "0", "1"))

Notice that:

- If the number of points is big, using `pch="."` is faster and easier to read
- `pch=1` is different from `pch="1"`
- We can use the vectors `LETTERS` and `letters` to transform numbers into letters
- A plot with too many symbols is hard to understand
- It is better to simplify the message
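
A short sketch of these two points. The vector `group` is made up; it stands for any vector of small positive integers.

```r
# Hypothetical group numbers, invented only for illustration
group <- c(1, 3, 2, 1, 3)

# Indexing LETTERS with numbers gives letters
symbols <- LETTERS[group]    # "A" "C" "B" "A" "C"
letters[group]               # "a" "c" "b" "a" "c"

# Use the letters as plot characters
plot(seq_along(group), pch = symbols)

# pch=1 draws an open circle, pch="1" draws the digit 1
plot(1:2, pch = 1)
plot(1:2, pch = "1")
```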

Plots should help you to tell a story.

Ask yourself:

“Is this telling the story I want to tell?”

plot(survey$height_cm, type = "l")

plot(survey$height_cm, type = "b")

plot(survey$height_cm, type = "o")

plot(survey$height_cm, type = "p")

The type depends on the story you want to tell

- Lines are mostly used to tell a story of change through time
- Using **b**oth or **o**ver is better to see the individual points in the line
- If you do not specify, the default is `type="p"`
- When there are many values, it is better to use points
  - The screen has approx 2000 points horizontally
  - The projector has 1000 points
  - If `length(vector)>300`, better use `type="p"`
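
The rule of thumb above can be written as a small function. This is only a sketch: `choose_type()` is a hypothetical helper, not part of R, and the two vectors are random numbers generated for the example.

```r
# Hypothetical helper: points for long vectors, points plus lines otherwise
choose_type <- function(v) {
  if (length(v) > 300) "p" else "b"
}

set.seed(1)                        # make the random example repeatable
short_vec <- cumsum(rnorm(50))     # 50 made-up values
long_vec  <- cumsum(rnorm(1000))   # 1000 made-up values

plot(short_vec, type = choose_type(short_vec))            # uses "b"
plot(long_vec,  type = choose_type(long_vec), pch = ".")  # uses "p"
```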

plot(survey$height_cm, pch=16)

plot(survey$height_cm, pch=16, xlim=c(1,20))

We can include a *main title*, a *subtitle*, and x and y *axis labels*

plot(survey$height_cm, main="Length of survey$height_cm", sub = "51 samples", xlab="Person", ylab="Height [cm]")

plot(survey$height_cm, ylim=c(0,200))
points(survey$weight_kg, pch=2)

The first plot defines the scale. `points()` works on a pre-existing plot
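
Because the first plot fixes the scale, the limits have to cover every vector we want to add. One possible way, sketched with made-up `height` and `weight` values, is to compute `ylim` from both vectors with `range()` instead of typing the numbers by hand.

```r
# Hypothetical values, invented only for illustration
height <- c(165, 180, 170, 158)
weight <- c(55, 81, 64, 50)

# Limits that cover both vectors
both <- range(c(height, weight))   # 50 180

plot(height, ylim = both)   # the first plot defines the scale
points(weight, pch = 2)     # added on the existing plot
```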

plot(survey$height_cm, type="l", ylim=c(0,200))
lines(survey$weight_kg, col="red")

`lines()` is like `points()` but with `type="l"` by default

`lines()` also accepts other values of `type`

plot(survey$height_cm, type="o", ylim=c(0,200))
lines(survey$weight_kg, col="red", type="b")

The previous graphics used *numeric* data. What about factors?

plot(survey$handness)

Plotting a vector of type factor produces a *barplot*

The height of each bar corresponds to

- the frequency of each factor level
- that is, how many times each level appears
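
In other words, plotting a factor shows the same counts that `table()` gives. A small sketch with a made-up factor `hand`:

```r
# Hypothetical factor, invented only for illustration
hand <- factor(c("Right", "Left", "Right", "Right"))

counts <- table(hand)   # Left 1, Right 3
counts

plot(hand)              # one bar per level
barplot(counts)         # the same bars, drawn from the counts
```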

plot(survey$weight_kg)

barplot(survey$weight_kg)

- Numeric vectors are shown element by element
  - bars start at 0
  - hard to see when the vector length is large

- Factor vectors are shown as a “table”
  - i.e. the frequency of each value

- Can we do the same for a *numeric* vector?
  - most values are different
  - we have to group them in “similar” sets

Remember that we can use `cut()` to make a factor vector from numeric values. We need to say how many groups we want

cut(survey$weight_kg, 10)

 [1] (61.5,67.9] (55.2,61.5] (55.2,61.5] (93.3,99.7]
 [5] (55.2,61.5] (74.2,80.6] (55.2,61.5] (74.2,80.6]
 [9] (74.2,80.6] (99.7,106]  (55.2,61.5] (67.9,74.2]
[13] (55.2,61.5] (48.9,55.2] (74.2,80.6] (48.9,55.2]
[17] (99.7,106]  (67.9,74.2] (67.9,74.2] (61.5,67.9]
[21] (74.2,80.6] (42.4,48.9] (48.9,55.2] (67.9,74.2]
[25] (55.2,61.5] (55.2,61.5] (48.9,55.2] (42.4,48.9]
[29] (61.5,67.9] (61.5,67.9] (67.9,74.2] (67.9,74.2]
[33] (48.9,55.2] (48.9,55.2] (55.2,61.5] (48.9,55.2]
[37] (48.9,55.2] (55.2,61.5] (74.2,80.6] (48.9,55.2]
[41] (80.6,86.9] (48.9,55.2] (48.9,55.2] (67.9,74.2]
[45] (61.5,67.9] (61.5,67.9] (48.9,55.2] (80.6,86.9]
[49] (61.5,67.9] (74.2,80.6] (74.2,80.6]
10 Levels: (42.4,48.9] (48.9,55.2] ... (99.7,106]

Now we have a *factor* that we can plot

plot(cut(survey$weight_kg, 10))
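
The same steps on a small made-up vector, so the result is easy to check by eye. Here `weight` is hypothetical and 3 groups is an arbitrary choice.

```r
# Hypothetical weights in kg, invented only for illustration
weight <- c(42.5, 55, 64, 74.5, 106, 60, 58)

groups <- cut(weight, 3)   # a factor with 3 levels (intervals)
nlevels(groups)            # 3
table(groups)              # how many values fall in each interval

plot(groups)               # a factor, so we get a barplot
```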

plot(survey$weight_kg)

hist(survey$weight_kg)

Numeric data can be grouped into *classes*

The default number of classes is automatic, but you can change it

*Frequency* means “how many times”

Histogram bars are not separated

- This is because numerical values are *continuous*, and there is no “space” between them

hist(survey$weight_kg, col="grey")

hist(survey$weight_kg, col="grey", nclass = 20)
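
`hist()` also returns the classes it used, which helps to check what was drawn. A sketch with made-up weights; the limits 40 to 110 and the step of 10 are assumptions chosen to cover these values. A single number of classes is only a suggestion to R, while a vector of `breaks` gives the exact limits.

```r
# Hypothetical weights in kg, invented only for illustration
weight <- c(42.5, 55, 64, 74.5, 106, 60, 58)

# Compute the classes without drawing
h <- hist(weight, plot = FALSE)
h$breaks   # limits of the classes chosen by R
h$counts   # frequency in each class

# Exact class limits: 40, 50, ..., 110 (7 classes)
h10 <- hist(weight, breaks = seq(40, 110, by = 10), col = "grey")
h10$counts
```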