November 19, 2019

## Descriptive Statistics

We have our own data

survey <- read.table("survey1-tidy.txt")

We want to say something about it

• Count the values
• Locate them
• Describe them
• Tell a story

## Functions used to describe vectors

There are many, and you learn them by exploring

So far we have seen:

• length()
• min(), max(), range()
• head(), tail()
• summary()
• table()
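As a quick reminder of what each function returns, here is a minimal sketch on a small vector. The `weight` values are hypothetical and are not taken from the survey file.

```r
# Hypothetical weights in kg; not taken from the survey file.
weight <- c(55, 64, 42.5, 75, 55, 106, 68, 55)

length(weight)    # number of elements: 8
min(weight)       # smallest value: 42.5
max(weight)       # largest value: 106
range(weight)     # smallest and largest together: 42.5 106
head(weight, 3)   # first three elements: 55 64 42.5
tail(weight, 3)   # last three elements: 106 68 55
summary(weight)   # minimum, quartiles, median, mean, maximum
table(weight)     # how many times each distinct value appears
```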

## Describing factor vectors

length(survey$handness)
[1] 51

summary(survey$handness)
 Left Right
    4    47

table(survey$handness)

 Left Right
    4    47

## Describing numeric vectors

length(survey$weight_kg)
[1] 51

summary(survey$weight_kg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  42.50   55.00   64.00   65.56   74.50  106.00

table(survey$weight_kg)

42.5   47   50   52   53   54   55   56   57   58   59
   1    1    2    1    1    2    6    2    1    3    1
  60   63   64   65   67   68   69   70   72   74   75
   3    1    1    3    2    3    1    1    1    1    3
  76   77   78   80   81   85   94  105  106
   1    2    1    1    1    1    1    1    1

## Graphics

Sometimes the best way to tell the story of the data is with a graphic

## You can change the symbol’s color

plot(survey$height_cm)

plot(survey$height_cm, col="red")

## Color can be a vector

There are several ways to specify the color

The easiest one is to use a number

Each point can have a different color. You use a vector of the same length as the data

For example:

plot(1:8, col=1:8)
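A color vector becomes more useful when it comes from the data. The sketch below assumes a small hypothetical `height` vector and a matching `hand` factor (neither is read from the survey file), and colors each point by the level number of the factor.

```r
# Hypothetical data: six heights and the handedness of each person.
height <- c(160, 172, 181, 158, 169, 175)
hand   <- factor(c("Right", "Left", "Right", "Right", "Left", "Right"))

# as.integer() on a factor gives the level number: Left = 1, Right = 2.
point_col <- as.integer(hand)

# One color per point, taken from colors 1 and 2 of the current palette.
plot(height, col = point_col)
```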

## You can change the symbol’s size

plot(survey$height_cm, cex=2)

plot(survey$height_cm, cex=0.5)

## Size can be a vector

The parameter cex means character expansion

Each point can have a different size

You use a vector of the same length as the data

plot(1:8, cex=1:8)
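A size vector can be used to highlight a single point. This is a minimal sketch with hypothetical values: every point keeps the default size except the largest one.

```r
height <- c(160, 172, 181, 158, 169, 175)  # hypothetical values

# Start with the default size for every point, then enlarge the tallest one.
point_size <- rep(1, length(height))
point_size[which.max(height)] <- 3

plot(height, cex = point_size)
```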

## Choosing the symbol

plot(survey$height_cm, pch=16)

plot(survey$height_cm, pch=".")

## Plot Character can be a vector

The parameter pch means plot character

Each point can have a different symbol. You use a vector of the same length as the data

The plot character can be chosen by a number

plot(1:25, pch=1:25)

## Plot Character can be a vector

The plot character can also be chosen by a letter. As before, you use a vector of the same length as the data

plot(1:7, pch=c("A", "T", "a", "t", ".", "0", "1"))

## Plot characters

Notice that:

• When there are many points, pch="." draws faster and is easier to read
• pch=1 is different from pch="1"
• We can index the built-in vectors LETTERS and letters to turn numbers into letters
• A plot with too many symbols is hard to understand
• It is better to simplify the message
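The points about `LETTERS`, `letters`, and the difference between `pch=1` and `pch="1"` can be sketched as follows. The `group` vector is hypothetical and only stands in for any vector of small whole numbers.

```r
LETTERS[1:3]          # "A" "B" "C"
letters[c(20, 1, 7)]  # "t" "a" "g"

group <- c(1, 2, 3, 1, 2, 3)       # hypothetical group numbers
plot(1:6, pch = LETTERS[group])    # points drawn as A, B, C, A, B, C

# pch = 1 is the open-circle symbol; pch = "1" draws the character 1.
plot(1:6, pch = 1)
plot(1:6, pch = "1")
```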

## Remember to tell a story

“Is this telling the story I want to tell?”

## Plot Type line and both

plot(survey$height_cm, type = "l")

plot(survey$height_cm, type = "b")

## Plot Type over and points

plot(survey$height_cm, type = "o")

plot(survey$height_cm, type = "p")

## Plot Type

The type depends on the story you want to tell

• Lines are mostly used to tell a story of change through time
• Using "b" (both) or "o" (over) makes the individual points on the line easier to see
• If you do not specify a type, the default is type="p"
• When there are many values, it is better to use points
• The screen has approximately 2000 points horizontally
• The projector has about 1000 points
• If length(vector)>300, it is better to use type="p"
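The rule of thumb in the last bullet can be written as a small function. `choose_type()` is a hypothetical helper invented for this illustration, not a base R function, and the two series are simulated.

```r
# choose_type() is a hypothetical helper, not part of base R.
# It returns "p" for long vectors and "b" (lines and points) otherwise.
choose_type <- function(x, limit = 300) {
  if (length(x) > limit) "p" else "b"
}

short_series <- cumsum(rnorm(50))     # 50 simulated values
long_series  <- cumsum(rnorm(1000))   # 1000 simulated values

plot(short_series, type = choose_type(short_series))  # lines and points
plot(long_series,  type = choose_type(long_series))   # points only
```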

## Zooming—Choosing the range

plot(survey$height_cm, pch=16)

plot(survey$height_cm, pch=16, xlim=c(1,20))
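The same idea works on the vertical axis with `ylim`. A minimal sketch with hypothetical values:

```r
height <- c(160, 172, 181, 158, 169, 175, 190, 165)  # hypothetical values

# ylim sets the range of the vertical axis.
plot(height, pch = 16, ylim = c(150, 200))

# xlim sets the range of the horizontal axis; here only positions 1 to 4 are shown.
plot(height, pch = 16, xlim = c(1, 4))
```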

## Full annotation

Including main title, subtitle, x and y axis label

plot(survey$height_cm, main="Length of survey$height_cm",
sub = "51 samples", xlab="Person", ylab="Height [cm]")

## Two plots in parallel

plot(survey$height_cm, ylim=c(0,200))
points(survey$weight_kg, pch=2)

The first plot defines the scale. points() works on a pre-existing plot
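Because the first plot fixes the scale, its limits must cover both vectors. One way to do this, sketched here with hypothetical `height` and `weight` values, is to compute the limits with `range()` instead of typing them by hand.

```r
# Hypothetical heights (cm) and weights (kg) for the same six people.
height <- c(160, 172, 181, 158, 169, 175)
weight <- c(55, 64, 80, 50, 68, 75)

# range() over both vectors gives limits that contain every value.
both_range <- range(c(height, weight))   # 50 181

plot(height, ylim = both_range)   # the first plot fixes the scale
points(weight, pch = 2)           # triangles added to the same plot
```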

## Two lines in parallel

plot(survey$height_cm, type="l", ylim=c(0,200))
lines(survey$weight_kg, col="red")

lines() is like points() but with type="l" by default

## Combining different types

plot(survey$height_cm, type="o", ylim=c(0,200))
lines(survey$weight_kg, col="red", type="b")

## Plotting Factors

The previous graphics used numeric data. What about factors?

barplot(survey$weight_kg)

## Barplots

• Numeric vectors are shown element by element
• bars start at 0
• hard to see when the vector length is large
• Factor vectors are shown as a “table”
• i.e. the frequency of each value
• Can we do the same for a numeric vector?
• all values are different
• we have to group them in “similar” sets

## Grouping and counting

Remember that we can use cut() to make a factor vector from numeric values. We need to say how many groups we want

cut(survey$weight_kg, 10)
 [1] (61.5,67.9] (55.2,61.5] (55.2,61.5] (93.3,99.7]
[5] (55.2,61.5] (74.2,80.6] (55.2,61.5] (74.2,80.6]
[9] (74.2,80.6] (99.7,106]  (55.2,61.5] (67.9,74.2]
[13] (55.2,61.5] (48.9,55.2] (74.2,80.6] (48.9,55.2]
[17] (99.7,106]  (67.9,74.2] (67.9,74.2] (61.5,67.9]
[21] (74.2,80.6] (42.4,48.9] (48.9,55.2] (67.9,74.2]
[25] (55.2,61.5] (55.2,61.5] (48.9,55.2] (42.4,48.9]
[29] (61.5,67.9] (61.5,67.9] (67.9,74.2] (67.9,74.2]
[33] (48.9,55.2] (48.9,55.2] (55.2,61.5] (48.9,55.2]
[37] (48.9,55.2] (55.2,61.5] (74.2,80.6] (48.9,55.2]
[41] (80.6,86.9] (48.9,55.2] (48.9,55.2] (67.9,74.2]
[45] (61.5,67.9] (61.5,67.9] (48.9,55.2] (80.6,86.9]
[49] (61.5,67.9] (74.2,80.6] (74.2,80.6]
10 Levels: (42.4,48.9] (48.9,55.2] ... (99.7,106]
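Once the values are grouped, table() counts the members of each group and barplot() can draw those counts. This is a minimal sketch on hypothetical weights, using 3 groups to keep the output short.

```r
# Hypothetical weights in kg; not taken from the survey file.
weight <- c(42.5, 55, 55, 64, 68, 75, 81, 106)

# cut() splits the range into 3 intervals of equal width and returns a factor.
weight_group <- cut(weight, 3)

# table() counts how many values fall in each interval.
group_count <- table(weight_group)

# barplot() draws one bar per interval, with height equal to the count.
barplot(group_count)
```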

## Grouping and counting

Now we have a factor that we can plot

hist(survey$weight_kg)

## Histograms

Numeric data can be grouped into classes

• The default number of classes is automatic, but you can change it
• Frequency means “how many times”

Histogram bars are not separated

• This is because numerical values are continuous, and there is no “space” between them

## Numeric data is grouped in N classes

hist(survey$weight_kg, col="grey")

hist(survey$weight_kg, col="grey", nclass = 20)
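To see the grouping behind a histogram, hist() can return it without drawing anything. The sketch below uses hypothetical weights. Note that the requested number of classes is only a suggestion, since hist() rounds the class boundaries to convenient values.

```r
weight <- c(42.5, 55, 55, 64, 68, 75, 81, 106)  # hypothetical values

# plot = FALSE returns the grouping without drawing it.
h <- hist(weight, plot = FALSE)
h$breaks   # class boundaries chosen by hist()
h$counts   # number of values in each class

# Ask for about 4 classes; hist() may use a slightly different number.
hist(weight, col = "grey", nclass = 4)
```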