November 12th, 2018

Telling stories

Descriptive Statistics

We have our own data

survey <- read.table("survey1-tidy.txt")

We want to say something about the data

  • Counting the values
  • Locating the values
  • Describing the values
  • Telling a story

Functions used to describe vectors

There are many. You have to explore and learn

So far we have seen:

  • length()
  • min(), max(), range()
  • head(), tail()
  • summary()
  • table()
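
As a quick reminder of what each function returns, here is a minimal sketch on a small made-up vector. The name w and its values are hypothetical; they are not taken from the survey.

```r
# Hypothetical weights in kg, invented for illustration only
w <- c(55, 64, 42.5, 75, 55, 106)

length(w)    # number of elements
min(w)       # smallest value
max(w)       # largest value
range(w)     # smallest and largest together
head(w, 3)   # first three elements
tail(w, 3)   # last three elements
summary(w)   # minimum, quartiles, median, mean, maximum
table(w)     # how many times each distinct value appears
```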

Describing factor vectors

length(survey$handness)
[1] 51
summary(survey$handness)
 Left Right 
    4    47 
table(survey$handness)
 Left Right 
    4    47 

Describing numeric vectors

length(survey$weight_kg)
[1] 51
summary(survey$weight_kg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  42.50   55.00   64.00   65.56   74.50  106.00 
table(survey$weight_kg)
42.5   47   50   52   53   54   55   56   57   58   59 
   1    1    2    1    1    2    6    2    1    3    1 
  60   63   64   65   67   68   69   70   72   74   75 
   3    1    1    3    2    3    1    1    1    1    3 
  76   77   78   80   81   85   94  105  106 
   1    2    1    1    1    1    1    1    1 

Data Visualization

“A picture is worth a thousand words”

Graphics

Sometimes the best way to tell the story of the data is with a graphic

plot(survey$weight_kg)

How to read it

  • Each value has its own position on the horizontal axis, given by its index
  • The vector’s index is a number from 1 to length(vector)
  • The vertical axis represents the value of the element
  • So if vector[3] contains the value 170, we will have a point at the coordinates (3,170)
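
The reading rule above can be checked with a small sketch. The vector v is hypothetical, chosen so that its third element is 170 as in the example.

```r
# Hypothetical vector, invented for illustration only
v <- c(160, 165, 170, 158)

# With a single vector, plot() puts the index on the horizontal axis
plot(v)

# The same points, with the index written explicitly
index <- seq_along(v)   # 1, 2, 3, 4
plot(index, v)

# The third point sits at the coordinates (3, 170)
c(index[3], v[3])
```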

Another example

plot(survey$height_cm)

Making it beautiful

You can change the symbol’s color

plot(survey$height_cm)

plot(survey$height_cm,
     col="red")

Color can be a vector

There are several ways to specify the color

The easiest one is to use a number

Each point can have a different color. You use a vector of the same length as the data

Something like this

plot(1:8, col=1:8)
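
A color vector is most useful when the colors carry information. The sketch below is one possible way to do it, assuming we want one color per factor level; the vectors height and hand are hypothetical stand-ins for survey columns.

```r
# Hypothetical data, invented for illustration only
height <- c(160, 172, 168, 181, 175)
hand   <- factor(c("Right", "Left", "Right", "Right", "Left"))

# as.integer() turns factor levels into 1, 2, ... in level order,
# so "Left" becomes 1 and "Right" becomes 2
point_col <- as.integer(hand)

# One color number per point, same length as the data
plot(height, col = point_col)
```

Points for "Left" are drawn with color 1 and points for "Right" with color 2 of the current palette.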

You can change the symbol’s size

plot(survey$height_cm,
     cex=2)

plot(survey$height_cm,
     cex=0.5)

Size can be a vector

The parameter cex means character expansion

Each point can have a different size

You use a vector of the same length as the data

plot(1:8, cex=1:8)

Choosing the symbol

plot(survey$height_cm,
     pch=16)

plot(survey$height_cm,
     pch=".")

Plot Character can be a vector

The parameter pch means plot character

Each point can have a different symbol. You use a vector of the same length as the data

Plot char can be chosen by a number

plot(1:25, pch=1:25)

Plot char can also be chosen by a letter

plot(1:7, pch=c("A", "T", "a", "t", ".", "0", "1"))

Plot characters

Notice that:

  • If the number of points is large, using pch="." draws faster and is easier to read
  • pch=1 is different from pch="1"
  • We can use the vectors LETTERS and letters to transform numbers into letters
  • A plot with too many symbols is hard to understand
    • It is better to simplify the message
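
The notes about LETTERS and about pch=1 versus pch="1" can be sketched as follows. The vector idx is hypothetical.

```r
# LETTERS and letters are built-in vectors holding the 26 letters
idx <- c(1, 20, 7, 3)     # hypothetical numbers to convert
LETTERS[idx]              # "A" "T" "G" "C"
letters[idx]              # "a" "t" "g" "c"

# Use the converted letters as plot characters
plot(1:4, pch = LETTERS[idx])

# pch = 1 draws an open circle, pch = "1" draws the digit 1
plot(1:2, pch = 1)
plot(1:2, pch = "1")
```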

Remember to tell a story

Plots should help you to tell a story.

Ask yourself:

“Is this telling the story I want to tell?”

Plot Type line and both

plot(survey$height_cm,
     type = "l")

plot(survey$height_cm,
     type = "b")

Plot Type over and points

plot(survey$height_cm,
     type = "o")

plot(survey$height_cm,
     type = "p")

Plot Type

The type depends on the story you want to tell

  • Lines are mostly used to tell a story of change through time
  • Using both ("b") or overplotted ("o") makes it easier to see the individual points on the line
  • If you do not specify, the default is type="p"
  • When there are many values, it is better to use points
    • The screen has approximately 2000 pixels horizontally
    • The projector has 1000 pixels
    • If length(vector)>300, it is better to use type="p"

Zooming—Choosing the range

plot(survey$height_cm,
     pch=16)

plot(survey$height_cm,
    pch=16, xlim=c(1,20))

Full annotation

Including main title, subtitle, x and y axis label

plot(survey$height_cm, main="Length of survey$height_cm",
 sub = "51 samples", xlab="Person", ylab="Height [cm]")

Two or more variables

Two plots in parallel

plot(survey$height_cm, ylim=c(0,200))
points(survey$weight_kg, pch=2)

The first plot defines the scale. points() works on a pre-existing plot
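
Because the first plot fixes the scale, its limits must fit both vectors. Instead of choosing them by eye, one option is to compute them with range(). This is only a sketch; height and weight are hypothetical stand-ins for the survey columns.

```r
# Hypothetical data, invented for illustration only
height <- c(160, 172, 168, 181, 175)
weight <- c(55, 70, 64, 85, 72)

# range() over both vectors gives limits that fit every point
both_range <- range(height, weight)

# The first plot sets the scale, points() adds to it
plot(height, ylim = both_range)
points(weight, pch = 2)
```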

Two lines in parallel

plot(survey$height_cm, type="l", ylim=c(0,200))
lines(survey$weight_kg, col="red")

lines() is like points() but with type="l" by default

Combining different types

plot(survey$height_cm, type="o", ylim=c(0,200))
lines(survey$weight_kg, col="red", type="b")

Plotting Factors

The previous graphics used numeric data. What about factors?

plot(survey$handness)

Factor vectors are shown as a “table”

Plotting a vector of type factor produces a barplot

Each bar size corresponds to

  • the frequency of each factor level
  • that is, how many times each level appears
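
To see where the bar heights come from, the counting can be done explicitly with table() and the result passed to barplot(). This is a minimal sketch on a hypothetical factor named hand.

```r
# Hypothetical factor, invented for illustration only
hand <- factor(c("Right", "Left", "Right", "Right", "Left"))

# Count how many times each level appears
counts <- table(hand)
counts

# Both calls draw one bar per level, with height equal to the count
plot(hand)
barplot(counts)
```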

You can make Barplots of numeric vectors

plot(survey$weight_kg)

barplot(survey$weight_kg)

Barplots

  • Numeric vectors are shown element by element
    • bars start at 0
    • hard to see when the vector length is large
  • Factor vectors are shown as a “table”
    • i.e. the frequency of each value
  • Can we do the same for a numeric vector?
    • almost all values are different, so most counts would be 1
    • we have to group them in “similar” sets

Grouping and counting

Remember that we can use cut() to make a factor vector from numeric values. We need to say how many groups we want

cut(survey$weight_kg, 10)
 [1] (61.5,67.9] (55.2,61.5] (55.2,61.5] (93.3,99.7]
 [5] (55.2,61.5] (74.2,80.6] (55.2,61.5] (74.2,80.6]
 [9] (74.2,80.6] (99.7,106]  (55.2,61.5] (67.9,74.2]
[13] (55.2,61.5] (48.9,55.2] (74.2,80.6] (48.9,55.2]
[17] (99.7,106]  (67.9,74.2] (67.9,74.2] (61.5,67.9]
[21] (74.2,80.6] (42.4,48.9] (48.9,55.2] (67.9,74.2]
[25] (55.2,61.5] (55.2,61.5] (48.9,55.2] (42.4,48.9]
[29] (61.5,67.9] (61.5,67.9] (67.9,74.2] (67.9,74.2]
[33] (48.9,55.2] (48.9,55.2] (55.2,61.5] (48.9,55.2]
[37] (48.9,55.2] (55.2,61.5] (74.2,80.6] (48.9,55.2]
[41] (80.6,86.9] (48.9,55.2] (48.9,55.2] (67.9,74.2]
[45] (61.5,67.9] (61.5,67.9] (48.9,55.2] (80.6,86.9]
[49] (61.5,67.9] (74.2,80.6] (74.2,80.6]
10 Levels: (42.4,48.9] (48.9,55.2] ... (99.7,106]

Grouping and counting

Now we have a factor that we can plot

plot(cut(survey$weight_kg, 10))

Histograms

Histograms group and count in one step

plot(survey$weight_kg)

hist(survey$weight_kg)

Histograms

Numeric data can be grouped into classes

  • The default number of classes is automatic, but you can change it

  • Frequency means “how many times”

Histogram bars are not separated

  • This is because numerical values are continuous, and there is no “space” between them
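
The link between hist() and the cut() approach can be checked by hand. In the sketch below, the vector weight is hypothetical, and plot = FALSE asks hist() to compute the classes without drawing them.

```r
# Hypothetical weights in kg, invented for illustration only
weight <- c(55, 71, 64, 85, 72, 58, 61, 77, 49, 66)

# Compute the histogram without drawing it
h <- hist(weight, plot = FALSE)
h$breaks   # class limits chosen by hist()
h$counts   # how many values fall in each class

# Same counts by hand: cut at the same limits, then count
groups  <- cut(weight, breaks = h$breaks, include.lowest = TRUE)
by_hand <- as.vector(table(groups))
by_hand
```

Both ways of counting use the same class limits, so by_hand and h$counts hold the same numbers.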

Numeric data is grouped in N classes

hist(survey$weight_kg,
 col="grey")

hist(survey$weight_kg,
 col="grey", nclass = 20)