Data Visualization

November 12th, 2018

Telling stories

Descriptive Statistics

We have our own data

survey <- read.table("survey1-tidy.txt", stringsAsFactors = TRUE)

We want to say something about them

Count them
Locate them
Describe them
Tell a story

Functions used to describe vectors

There are many. You have to explore and learn

So far we have seen:

length()
min(), max(), range()
head(), tail()
summary()
table()
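
As a quick reminder of what these functions return, here is a minimal sketch on a small made-up vector. The name `weights` and its values are assumptions for illustration, not the survey data.

```r
# Hypothetical data, invented for illustration only
weights <- c(55, 64, 70, 55, 81)

length(weights)   # number of elements
min(weights)      # smallest value
max(weights)      # largest value
range(weights)    # smallest and largest value together
head(weights, 2)  # first two elements
tail(weights, 2)  # last two elements
summary(weights)  # minimum, quartiles, median, mean, maximum
table(weights)    # how many times each value appears
```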

Describing factor vectors

length(survey$handness)

[1] 51

summary(survey$handness)

 Left Right 
    4    47

table(survey$handness)

 Left Right 
    4    47

Describing numeric vectors

length(survey$weight_kg)

[1] 51

summary(survey$weight_kg)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.50   55.00   64.00   65.56   74.50  106.00

table(survey$weight_kg)

42.5   47   50   52   53   54   55   56   57   58   59 
   1    1    2    1    1    2    6    2    1    3    1 
  60   63   64   65   67   68   69   70   72   74   75 
   3    1    1    3    2    3    1    1    1    1    3 
  76   77   78   80   81   85   94  105  106 
   1    2    1    1    1    1    1    1    1

Data Visualization

“A picture is worth a thousand words”

Graphics

Sometimes the best way to tell the story of the data is with a graphic

plot(survey$weight_kg)

How to read it

Each element has its own position on the horizontal axis, given by its index
The vector’s index is a number from 1 to length(vector)
The vertical axis represents the value of the element
So if vector[3] contains the value 170, we will have a point at the coordinates (3, 170)
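
The reading rule above can be checked with a minimal sketch. The vector `heights` and its values are made up for illustration; only plot() and seq_along() from base R are used.

```r
# Hypothetical heights in cm, invented for illustration only
heights <- c(160, 172, 170, 165)

# With a single vector, plot() puts the index on the horizontal axis
plot(heights)

# The same points, with the index written explicitly
index <- seq_along(heights)   # 1, 2, 3, 4
plot(index, heights)

# The third element is 170, so one point sits at (3, 170)
heights[3]
```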

Another example

plot(survey$height_cm)

Making it beautiful

You can change the symbol’s color

plot(survey$height_cm)

plot(survey$height_cm,
     col="red")

Color can be a vector

There are several ways to specify the color

The easiest one is to use a number

Each point can have a different color. Use a vector of the same length as the data

Something like this

plot(1:8, col=1:8)
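
One possible use of a color vector is to show which group each point belongs to. This is a minimal sketch on made-up data: the names `height`, `hand` and `point_colors` are hypothetical. as.integer() turns each factor level into a number, and that number is then used as the color.

```r
# Hypothetical data, invented for illustration only
height <- c(160, 172, 170, 165, 181)
hand <- factor(c("Right", "Left", "Right", "Right", "Left"))

# Levels are sorted alphabetically: "Left" is 1, "Right" is 2
point_colors <- as.integer(hand)

# One color per point, in the same order as the data
plot(height, col = point_colors)
```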

You can change the symbol’s size

plot(survey$height_cm,
     cex=2)

plot(survey$height_cm,
     cex=0.5)

Size can be a vector

The parameter cex means character expansion

Each point can have a different size

Use a vector of the same length as the data

plot(1:8, cex=1:8)

Choosing the symbol

plot(survey$height_cm,
     pch=16)

plot(survey$height_cm,
     pch=".")

Plot Character can be a vector

The parameter pch means plot character

Each point can have a different symbol. Use a vector of the same length as the data

Plot char can be chosen by a number

plot(1:25, pch=1:25)

Plot char can also be chosen by a letter

plot(1:7, pch=c("A", "T", "a", "t", ".", "0", "1"))

Plot characters

Notice that:

If the number of points is large, pch="." draws faster and is easier to read
pch=1 is different from pch="1"
We can use the vectors LETTERS and letters to transform numbers into letters
A plot with too many symbols is hard to understand
- It is better to simplify the message
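
Two of the notes above can be tried with a minimal sketch. The vector `group` and its values are made up for illustration; LETTERS is the built-in vector of upper-case letters.

```r
# Hypothetical group numbers, invented for illustration only
group <- c(1, 20, 1, 7, 3)

# Indexing LETTERS with a number returns the letter at that position
symbols <- LETTERS[group]   # "A" "T" "A" "G" "C"

# Each point is drawn with its own letter
plot(seq_along(group), pch = symbols)

# pch = 1 draws an open circle; pch = "1" draws the character 1
plot(1:2, pch = 1)
plot(1:2, pch = "1")
```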

Remember to tell a story

Plots should help you to tell a story.

Ask yourself:

“Is this telling the story I want to tell?”

Plot Type line and both

plot(survey$height_cm,
     type = "l")

plot(survey$height_cm,
     type = "b")

Plot Type over and points

plot(survey$height_cm,
     type = "o")

plot(survey$height_cm,
     type = "p")

Plot Type

The type depends on the story you want to tell

Lines are mostly used to tell a story of change through time
Using both or over makes the individual points on the line easier to see
If you do not specify, the default is type="p"
When there are many values, it is better to use points
- A screen is about 2000 pixels wide
- A projector is about 1000 pixels wide
- If length(vector)>300, better use type="p"

Zooming—Choosing the range

plot(survey$height_cm,
     pch=16)

plot(survey$height_cm,
    pch=16, xlim=c(1,20))

Full annotation

Including main title, subtitle, x and y axis label

plot(survey$height_cm, main="Length of survey$height_cm",
 sub = "51 samples", xlab="Person", ylab="Height [cm]")

Two or more variables

Two plots in parallel

plot(survey$height_cm, ylim=c(0,200))
points(survey$weight_kg, pch=2)

The first plot defines the scale. points() works on a pre-existing plot
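
Because the first plot fixes the scale, the limits have to fit both vectors. Instead of typing the limits by hand, one option is to compute them with range(). This is a minimal sketch on made-up data; the names `height`, `weight` and `limits` are hypothetical.

```r
# Hypothetical data, invented for illustration only
height <- c(160, 172, 170, 165, 181)   # cm
weight <- c(55, 64, 70, 55, 81)        # kg

# range() of both vectors together gives limits that fit every point
limits <- range(c(height, weight))     # 55 181

plot(height, ylim = limits)
points(weight, pch = 2)
```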

Two lines in parallel

plot(survey$height_cm, type="l", ylim=c(0,200))
lines(survey$weight_kg, col="red")

lines() is like points() but with type="l" by default

Combining different `types`

plot(survey$height_cm, type="o", ylim=c(0,200))
lines(survey$weight_kg, col="red", type="b")

Plotting Factors

The previous graphics used numeric data. What about factors?

plot(survey$handness)

Factor vectors are shown as a “table”

Plotting a vector of type factor produces a barplot

The height of each bar corresponds to

the frequency of each factor level
that is, how many times each level appears
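
The link between the bars and table() can be seen in a minimal sketch. The factor `hand` is made up for illustration; it is not the survey column.

```r
# Hypothetical factor, invented for illustration only
hand <- factor(c("Right", "Left", "Right", "Right", "Left"))

counts <- table(hand)   # Left: 2, Right: 3
counts

# Both calls draw one bar per level, with the count as the bar height
plot(hand)
barplot(counts)
```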

You can make Barplots of numeric vectors

plot(survey$weight_kg)

barplot(survey$weight_kg)

Barplots

Numeric vectors are shown element by element
- bars start at 0
- hard to see when the vector length is large
Factor vectors are shown as a “table”
- i.e. the frequency of each value
Can we do the same for a numeric vector?
- almost all values are different
- we have to group them in “similar” sets

Grouping and counting

Remember that we can use cut() to make a factor vector from numeric values. We need to say how many groups we want

cut(survey$weight_kg, 10)

 [1] (61.5,67.9] (55.2,61.5] (55.2,61.5] (93.3,99.7]
 [5] (55.2,61.5] (74.2,80.6] (55.2,61.5] (74.2,80.6]
 [9] (74.2,80.6] (99.7,106]  (55.2,61.5] (67.9,74.2]
[13] (55.2,61.5] (48.9,55.2] (74.2,80.6] (48.9,55.2]
[17] (99.7,106]  (67.9,74.2] (67.9,74.2] (61.5,67.9]
[21] (74.2,80.6] (42.4,48.9] (48.9,55.2] (67.9,74.2]
[25] (55.2,61.5] (55.2,61.5] (48.9,55.2] (42.4,48.9]
[29] (61.5,67.9] (61.5,67.9] (67.9,74.2] (67.9,74.2]
[33] (48.9,55.2] (48.9,55.2] (55.2,61.5] (48.9,55.2]
[37] (48.9,55.2] (55.2,61.5] (74.2,80.6] (48.9,55.2]
[41] (80.6,86.9] (48.9,55.2] (48.9,55.2] (67.9,74.2]
[45] (61.5,67.9] (61.5,67.9] (48.9,55.2] (80.6,86.9]
[49] (61.5,67.9] (74.2,80.6] (74.2,80.6]
10 Levels: (42.4,48.9] (48.9,55.2] ... (99.7,106]

Grouping and counting

Now we have a factor that we can plot

plot(cut(survey$weight_kg, 10))
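
The two steps, grouping and counting, can also be done separately. This is a minimal sketch on made-up data; the names `weight`, `groups` and `counts` are hypothetical.

```r
# Hypothetical weights in kg, invented for illustration only
weight <- c(55, 64, 70, 55, 81, 58, 75, 60)

groups <- cut(weight, 3)   # a factor with 3 levels (intervals)
counts <- table(groups)    # how many values fall in each interval
counts

# One bar per interval, like plot(groups)
barplot(counts)
```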

Histograms

Histograms group and count in one step

plot(survey$weight_kg)

hist(survey$weight_kg)

Histograms

Numeric data can be grouped into classes

The default number of classes is automatic, but you can change it
Frequency means “how many times”

Histogram bars are not separated

This is because numerical values are continuous, and there is no “space” between them

Numeric data is grouped in N classes

hist(survey$weight_kg,
 col="grey")

hist(survey$weight_kg,
 col="grey", nclass = 20)
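
To see which classes hist() actually used, one option is to keep its result in a variable. This is a minimal sketch on made-up data; the names `weight` and `h` are hypothetical. For a single number, the `breaks` argument plays the same role as `nclass`.

```r
# Hypothetical weights in kg, invented for illustration only
weight <- c(55, 64, 70, 55, 81, 58, 75, 60)

# plot = FALSE computes the classes without drawing anything
h <- hist(weight, plot = FALSE)

h$breaks   # the class boundaries chosen by hist()
h$counts   # how many values fall in each class

# The requested number of classes is only a suggestion to hist()
hist(weight, col = "grey", breaks = 4)
```

The number of bars drawn can differ from the number requested, because hist() rounds the class boundaries to convenient values.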
