November 12th, 2018

## Descriptive Statistics

We have our own data

survey <- read.table("survey1-tidy.txt")

We want to say something about them

• Counting them
• Locating them
• Describing them
• Telling a story

## Functions used to describe vectors

There are many. You have to explore and learn

So far we have seen:

• length()
• min(), max(), range()
• head(), tail()
• summary()
• table()
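
As a quick check of the functions listed above, the sketch below applies them to a small vector. The vector `weights` and its values are made up for illustration; they are not taken from the survey.

```r
# Hypothetical weights in kg (illustrative values, not survey data)
weights <- c(55, 64, 55, 72, 80, 64, 55)

length(weights)           # number of elements: 7
min(weights)              # smallest value: 55
max(weights)              # largest value: 80
range(weights)            # smallest and largest together: 55 80
head(weights, 3)          # first three elements: 55 64 55
tail(weights, 2)          # last two elements: 64 55
summary(weights)          # minimum, quartiles, median, mean, maximum
counts <- table(weights)  # how many times each distinct value appears
counts
```

Each function answers a different question, so it is usually worth trying several of them on the same vector.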

## Describing factor vectors

length(survey$handness)
51
summary(survey$handness)
 Left Right
    4    47
table(survey$handness)
 Left Right
    4    47

## Describing numeric vectors

length(survey$weight_kg)
51
summary(survey$weight_kg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  42.50   55.00   64.00   65.56   74.50  106.00
table(survey$weight_kg)
42.5   47   50   52   53   54   55   56   57   58   59
1    1    2    1    1    2    6    2    1    3    1
60   63   64   65   67   68   69   70   72   74   75
3    1    1    3    2    3    1    1    1    1    3
76   77   78   80   81   85   94  105  106
1    2    1    1    1    1    1    1    1 
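
The same functions behave differently on factors and on numbers. The following sketch shows the contrast on two short vectors; `hand` and `weight` are hypothetical names with made-up values, not columns of the survey.

```r
# Hypothetical answers (illustrative values, not survey data)
hand   <- factor(c("Right", "Left", "Right", "Right", "Right"))
weight <- c(55, 64, 55, 72, 80)

summary(hand)     # for a factor: one count per level (Left 1, Right 4)
table(hand)       # the same counts, returned as a table
summary(weight)   # for numbers: Min., quartiles, Median, Mean, Max.
table(weight)     # counts per distinct value; long when most values differ
```

For numeric vectors, `summary()` is usually the more compact description, because `table()` produces one column per distinct value.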

## Graphics

Sometimes the best way to tell the story of the data is with a graphic

plot(survey$weight_kg)

## How to read it

• Each value has a different position on the horizontal axis
• The horizontal position is the element’s index, a number from 1 to length(vector)
• The vertical axis represents the value of the element
• So if the third element of the vector is 170, there is a point at the coordinates (3, 170)

## Another example

plot(survey$height_cm)

## You can change the symbol’s color

plot(survey$height_cm)
plot(survey$height_cm, col="red")

## Color can be a vector

There are several ways to specify the color

The easiest one is to use a number

Each point can have a different color. You use a vector of the same length as the data

Something like this

plot(1:8, col=1:8)

## You can change the symbol’s size

plot(survey$height_cm, cex=2)
plot(survey$height_cm, cex=0.5)

## Size can be a vector

The parameter cex means character expansion

Each point can have a different size

You use a vector of the same length as the data

plot(1:8, cex=1:8)

## Choosing the symbol

plot(survey$height_cm, pch=16)
plot(survey$height_cm, pch=".")

## Plot Character can be a vector

The parameter pch means plot character

Each point can have a different symbol. You use a vector of the same length as the data

Plot characters can be chosen by a number

plot(1:25, pch=1:25)

## Plot Character can be a vector

Plot characters can also be chosen by a letter

plot(1:7, pch=c("A", "T", "a", "t", ".", "0", "1"))

## Plot characters

Notice that:

• If the number of points is large, pch="." draws faster and is easier to read
• pch=1 is different from pch="1"
• We can use the vectors LETTERS and letters to transform numbers into letters
• A plot with too many symbols is hard to understand
• It is better to simplify the message
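
The notes about pch=1 versus pch="1" and about LETTERS can be tried directly. This is a minimal sketch; the index vector `idx` is a hypothetical example.

```r
# Numbers select built-in symbols; strings draw the character itself
plot(1:3, pch = 1)      # open circles
plot(1:3, pch = "1")    # the digit 1 drawn at each point

# LETTERS and letters turn indices into upper- and lower-case letters
idx <- c(1, 20, 3)      # hypothetical indices
LETTERS[idx]            # "A" "T" "C"
letters[idx]            # "a" "t" "c"
plot(1:3, pch = LETTERS[idx])
```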

## Remember to tell a story

“Is this telling the story I want to tell?”

## Plot Type line and both

plot(survey$height_cm, type = "l")
plot(survey$height_cm, type = "b")

## Plot Type over and points

plot(survey$height_cm, type = "o")
plot(survey$height_cm, type = "p")

## Plot Type

The type depends on the story you want to tell

• Lines are mostly used to tell a story of change through time
• Using both or over makes it easier to see the individual points on the line
• If you do not specify, the default is type="p"
• When there are many values, it is better to use points
• The screen has approximately 2000 pixels horizontally
• The projector has about 1000 pixels
• If length(vector)>300, it is better to use type="p"
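
The rule of thumb about vector length can be written as a small function. This is only a sketch: `choose_type` is a hypothetical helper, not part of R, and the two vectors are synthetic.

```r
# Hypothetical helper: pick a plot type from the vector length,
# following the rule of thumb of 300 values given above
choose_type <- function(x, limit = 300) {
  if (length(x) > limit) "p" else "o"
}

short <- sin(seq(0, 2 * pi, length.out = 50))    # 50 synthetic values
long  <- sin(seq(0, 2 * pi, length.out = 1000))  # 1000 synthetic values

plot(short, type = choose_type(short))  # lines with the points drawn over them
plot(long,  type = choose_type(long))   # points only
```

The limit of 300 is a guideline, so it is kept as an argument that can be changed.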

## Zooming—Choosing the range

plot(survey$height_cm, pch=16)
plot(survey$height_cm, pch=16, xlim=c(1,20))

## Full annotation

Including main title, subtitle, and x and y axis labels

plot(survey$height_cm, main="Length of survey$height_cm",
     sub = "51 samples", xlab="Person", ylab="Height [cm]")

## Two plots in parallel

plot(survey$height_cm, ylim=c(0,200))
points(survey$weight_kg, pch=2)

The first plot defines the scale. points() works on a pre-existing plot
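
Because the first plot defines the scale, its limits must cover every vector that will be added later. One possible way, sketched here with made-up values, is to compute the limits with range() instead of typing them by hand. The vectors `height` and `weight` are hypothetical.

```r
# Hypothetical measurements (illustrative values, not survey data)
height <- c(170, 165, 180, 158, 175)
weight <- c(65, 55, 80, 50, 72)

# The first call fixes the axes, so give it a range that covers both vectors
both <- range(c(height, weight))    # 50 180
plot(height, ylim = both)
points(weight, pch = 2)             # added to the existing plot as triangles
```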

## Two lines in parallel

plot(survey$height_cm, type="l", ylim=c(0,200))
lines(survey$weight_kg, col="red")

lines() is like points() but with type="l" by default

## Combining different types

plot(survey$height_cm, type="o", ylim=c(0,200))
lines(survey$weight_kg, col="red", type="b")

## Plotting Factors

The previous graphics used numeric data. What about factors?

plot(survey$handness)

## Factor vectors are shown as a “table”

Plotting a vector of type factor produces a barplot

Each bar size corresponds to

• the frequency of each factor level
• that is, how many times each level appears

## You can make Barplots of numeric vectors

plot(survey$weight_kg)
barplot(survey$weight_kg)

## Barplots

• Numeric vectors are shown element by element
• bars start at 0
• hard to see when the vector length is large
• Factor vectors are shown as a “table”
• i.e. the frequency of each value
• Can we do the same for a numeric vector?
• all values are different
• we have to group them in “similar” sets

## Grouping and counting

Remember that we can use cut() to make a factor vector from numeric values. We need to say how many groups we want

cut(survey$weight_kg, 10)
  (61.5,67.9] (55.2,61.5] (55.2,61.5] (93.3,99.7]
 (55.2,61.5] (74.2,80.6] (55.2,61.5] (74.2,80.6]
 (74.2,80.6] (99.7,106]  (55.2,61.5] (67.9,74.2]
 (55.2,61.5] (48.9,55.2] (74.2,80.6] (48.9,55.2]
 (99.7,106]  (67.9,74.2] (67.9,74.2] (61.5,67.9]
 (74.2,80.6] (42.4,48.9] (48.9,55.2] (67.9,74.2]
 (55.2,61.5] (55.2,61.5] (48.9,55.2] (42.4,48.9]
 (61.5,67.9] (61.5,67.9] (67.9,74.2] (67.9,74.2]
 (48.9,55.2] (48.9,55.2] (55.2,61.5] (48.9,55.2]
 (48.9,55.2] (55.2,61.5] (74.2,80.6] (48.9,55.2]
 (80.6,86.9] (48.9,55.2] (48.9,55.2] (67.9,74.2]
 (61.5,67.9] (61.5,67.9] (48.9,55.2] (80.6,86.9]
 (61.5,67.9] (74.2,80.6] (74.2,80.6]
10 Levels: (42.4,48.9] (48.9,55.2] ... (99.7,106]
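
The grouping is easier to follow on a short vector. In this sketch the values in `weights` are made up, and the explicit limits passed to `breaks` are an assumption chosen only to give round intervals.

```r
# Hypothetical weights in kg (illustrative values, not survey data)
weights <- c(45, 52, 58, 61, 67, 75, 79, 88, 95)

# Let cut() choose 3 intervals of equal width
groups <- cut(weights, 3)
table(groups)

# Or give the interval limits explicitly
groups2 <- cut(weights, breaks = c(40, 60, 80, 100))
levels(groups2)    # "(40,60]"  "(60,80]"  "(80,100]"
table(groups2)     # 3 values in (40,60], 4 in (60,80], 2 in (80,100]
```

Each interval is open on the left and closed on the right, so a value of exactly 60 would fall in (40,60].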

## Grouping and counting

Now we have a factor that we can plot

plot(cut(survey$weight_kg, 10))

## Histograms

## Histograms group and count in one step

plot(survey$weight_kg)
hist(survey$weight_kg)

## Histograms

Numeric data can be grouped into classes

• The default number of classes is automatic, but you can change it
• Frequency means “how many times”

Histogram bars are not separated

• This is because numerical values are continuous, and there is no “space” between them

## Numeric data is grouped in Nclasses

hist(survey$weight_kg, col="grey")
hist(survey$weight_kg, col="grey", nclass = 20)
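
To see the grouping that hist() performs, the result can be stored instead of drawn. This is a minimal sketch with made-up values; `weights` is a hypothetical vector, and `breaks` is used here in place of `nclass`, which hist() treats as equivalent for a single number.

```r
# Hypothetical weights in kg (illustrative values, not survey data)
weights <- c(45, 52, 58, 61, 67, 75, 79, 88, 95)

# plot = FALSE returns the grouping instead of drawing it
h <- hist(weights, plot = FALSE)
h$breaks    # limits of the classes that hist() chose
h$counts    # how many values fall in each class

# A single number asks for roughly that many classes;
# hist() treats it as a suggestion and may pick a nearby round number
hist(weights, col = "grey", breaks = 20)
```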