November 22, 2018

## Data tidying

Data tidying is one of the most important cleaning steps in the practice of data science.

Tidy data sets have a consistent structure, which makes them easy to work with.

They are easy to manipulate, model, and visualize.

Tidy data sets are arranged so that each variable is a column and each observation (or case) is a row.

## Characteristics

• Each variable you measure should be in one column.
• Each different observation of that variable should be in a different row.
• There should be one table for each “kind” of variable.
• If you have multiple tables, they should include a column in the table that allows them to be linked.

Sources: Jeff Leek, “The Elements of Data Analytic Style” (2015); Wikipedia
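
The last characteristic, a shared column that links two tables, can be sketched in R with merge(). This is only an illustration: the tables `students` and `measurements`, the key column `student_id`, and all the values are made up for this example and do not come from the course data.

```r
# Hypothetical example data: two tidy tables that share the key column "student_id"
students <- data.frame(
  student_id = c(1, 2, 3),
  gender     = c("Female", "Male", "Female")
)
measurements <- data.frame(
  student_id = c(1, 1, 2, 3),
  n_marbles  = c(0, 1, 0, 0),
  length_cm  = c(8.0, 8.5, 8.3, 8.4)
)

# merge() joins the rows whose student_id values match
linked <- merge(measurements, students, by = "student_id")
print(linked)
```

Each table keeps one kind of information, and the key column is enough to bring them together when needed.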

## What people say

“Tidy datasets are all alike but every messy dataset is messy in its own way.”

Hadley Wickham, “Tidy Data” (2014)

## Sample of answers

n_marbles     length      repetition
0                 10           1
1                 10.8         1
2                 12.2         1
3                 14           1
4                 15.7         1
5                 18.1         1
6                 20.5         1
0                 10           2
1                 10.8         2
2                 12.2         2
3                 14           2
4                 15.7         2
5                 18.1         2
6                 20.5         2

## Sample of answers

n_marbels length  repetition
0         8.0       1
0         8.0       2
0         8.0       3
1         8.5       1
1         8.8       2
1         8.8       3
2         8.9       1
2         8.9       2
2         9.0       3
3         9.4       1
3         9.2       2
3         9.1       3

## Sample of answers

n_coins length  repetition
0       7.5cm    1
5       8  cm    1
10      9  cm    1
15      11 cm    1
0       8  cm    2
5       9  cm    2
10      10 cm    2
15      11 cm    2
0       8  cm    3
5       9  cm    3
10      10 cm    3
15      11 cm    3

## Sample of answers

n_marble         length              repetition
0                5cm                 1
1                5.5cm               1
2                6cm                 1
3                6.5cm               1
4                7cm                 1
5                7.5cm               1
6                8.1cm               1
7                8.6cm               1
0                5cm                 1
1                5.4cm               1
2                6cm                 1
3                6.4cm               1
4                6.9cm               1
5                7.5cm               1
6                8.2cm               1
7                8.7cm               1

## Sample of answers

n_coins     length_cm     repetition
0           7,5           1
5           8             1
10          9             1
15          11            1
0           8             2
5           9             2
10          10            2
15          11            2
0           8             3
5           9             3
10          10            3
15          11            3

## Sample of answers

n_marbles     length      repetition
0                 10           2
1                 10.8         2
2               12.2           2
3                 14             2
4                 15.7         2
5                 18.1         2
6                 20.5         2

## Sample of answers

Empty   1_Marble        2_Marbles       3_Marbles
Exp1    50.5    65.5    81.5    96.0
Exp2    50.5    67.0    82.5    98.0
Exp3    51.5    67.5    83.0    98.0


What are the units here?

## Sample of answers

            0 marble    1 marbles    2 marbles    3 marbles
Repetition1 8.4         9.5          10           10.8
Repetition2 8.3         9            10.1         10.8



## Tidy data

• Data that is easy to model, visualize and aggregate

• Only one kind of object in a data frame (e.g. experiment)

• Variables in columns, observations in rows

• Only one unit of measurement in each column

• Do not mix numbers and text

• Units can be in the column name
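
As a rough sketch of how these rules can be applied in R, the code below tidies a wide table like the last sample answer and cleans lengths that were typed as text. The column names `n0` to `n3` and the vector `raw_length` are assumptions made for this example. The sample does not state its unit, so the tidy column is simply called `length`.

```r
# Wide table in the style of one of the sample answers (values copied from it).
# The column names n0..n3 are made up for this sketch.
wide <- data.frame(
  repetition = c(1, 2),
  n0 = c(8.4, 8.3),
  n1 = c(9.5, 9.0),
  n2 = c(10.0, 10.1),
  n3 = c(10.8, 10.8)
)

# Build the tidy version: one row per measurement.
# as.vector() reads the matrix column by column, so rep() follows that order.
values <- as.matrix(wide[, c("n0", "n1", "n2", "n3")])
tidy <- data.frame(
  n_marbles  = rep(0:3, each = nrow(wide)),
  length     = as.vector(values),
  repetition = rep(wide$repetition, times = 4)
)
print(tidy)

# Lengths typed as text, in the style of some of the sample answers
raw_length <- c("7.5cm", "8  cm", "7,5", "11 cm")
clean <- gsub("cm", "", raw_length, fixed = TRUE)  # drop the unit
clean <- gsub(",", ".", clean, fixed = TRUE)       # decimal comma to decimal point
length_cm <- as.numeric(trimws(clean))             # the unit now lives in the name
print(length_cm)
```

After this step every column holds only numbers, and the unit is recorded once in the column name.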

## Tendency line

We draw the figure using a formula

plot(weight_kg ~ height_cm, data=survey)

Using the same formula we can get a linear model

lm(weight_kg ~ height_cm, data=survey)
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
-81.5045       0.8616  

## Linear models

In science we work by creating models of how nature works.

There are several kinds of models.

Linear models are among the simplest and most commonly used.

We approximate all our data by a straight line that shows the relationship between two variables, with a formula like $y=a + b\cdot x$
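
To make the formula more concrete, here is a small sketch that computes $a$ and $b$ by hand with the usual least-squares expressions and compares them with the result of lm(). The vectors `x` and `y` are made-up numbers for illustration only, not the survey data.

```r
# Made-up data for illustration only (not the survey from the slides)
x <- c(150, 160, 170, 180, 190)
y <- c(48, 57, 65, 72, 83)

# Least-squares slope and intercept computed by hand
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)

# lm() finds the same two numbers
model <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(model))
```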

## Drawing the line

We draw the tendency line using abline() with the coefficients from lm()

plot(weight_kg ~ height_cm, data=survey)
abline(a=-81.5045, b=0.8616)

## Making the computer work for us

### We must avoid copying data manually

model <- lm(weight_kg ~ height_cm, data=survey)
model
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
-81.5045       0.8616  

## The useful part is the coefficients

coef(model)
(Intercept)   height_cm
-81.5044904   0.8616001 

• The formula is $y=a + b\cdot x$
• $a$ is coef(model)[1]
• $b$ is coef(model)[2]

## Drawing the tendency line

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(a=coef(model)[1], b=coef(model)[2])

## Easily drawing the tendency line

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(model)

## Predicting with the model

Beyond describing the data, models are often used to predict the output of the system for new data.

For this we need to provide a data.frame with at least one column. The column name must be the same as the one used to create the model. For example:

data.frame(height_cm=155:205)

## Predicting with the model

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
guess <- predict(model, newdata = data.frame(height_cm=155:205))
points(155:205, guess, col="red", pch=19)
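
The following sketch checks that predict() gives the same numbers as applying $y=a + b\cdot x$ by hand. The small `survey` data frame here is a made-up stand-in so the example can run on its own. It is not the course data, and the three new heights are arbitrary.

```r
# Made-up stand-in for the survey data frame used in the slides
survey <- data.frame(
  height_cm = c(150, 160, 170, 180, 190),
  weight_kg = c(48, 57, 65, 72, 83)
)
model <- lm(weight_kg ~ height_cm, data = survey)

# New data: the column name must match the one in the formula
new_heights <- data.frame(height_cm = c(155, 175, 205))
guess <- predict(model, newdata = new_heights)

# The same numbers computed from the formula y = a + b*x
by_hand <- coef(model)[1] + coef(model)[2] * new_heights$height_cm
print(cbind(new_heights, guess, by_hand))
```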

## Homework

1. Build a linear model for the rubber.txt data you created last class
2. Print the coefficients
3. Predict the rubber length when n_marbles is 0, 5, 10, 20, and 50
4. Finish the flag of Turkey

## Plotting two vectors

plot(survey$height_cm, survey$weight_kg)

plot(survey$weight_kg ~ survey$height_cm)

## Formulas have context

### The data= option gives the context for the formula

plot(survey$weight_kg ~ survey$height_cm)

plot(weight_kg ~ height_cm,
data = survey)

## Drawing only girls

You can do it like this

plot(survey$height_cm[ survey$Gender=="Female"],
survey$weight_kg[ survey$Gender=="Female"])

or like this

plot(survey[survey$Gender=="Female", "height_cm"],
survey[survey$Gender=="Female", "weight_kg"])

## Preprocessing data

plot(survey[survey$Gender=="Female", "height_cm"],
survey[survey$Gender=="Female", "weight_kg"])

We can do this

grl <- survey[survey$Gender=="Female", ]
plot(grl$height_cm, grl$weight_kg)

## Selecting rows using subset()

We can select using indices …

grl <- survey[survey$Gender=="Female", ]
plot(grl$height_cm, grl$weight_kg)

… or with subset()

grl <- subset(survey,
Gender=="Female")
plot(grl$height_cm, grl$weight_kg)

## Using subset() with formula

You don’t need to use $

girls <- subset(survey, Gender=="Female")
plot(girls$weight_kg ~ girls$height_cm)

data= is the formula’s context

girls <- subset(survey,
Gender=="Female")
plot(weight_kg ~ height_cm,
data = girls)

## The subset= option

Instead of using subset()

plot(weight_kg ~ height_cm,
data = subset(survey,
Gender=="Female"))

you can use the subset= option

plot(weight_kg ~ height_cm,
data = survey,
subset = Gender=="Female")
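
The same subset= option is also accepted by lm(), so a model can be fitted to part of the data without creating a new data frame. The sketch below compares both ways. The `survey` data frame is a made-up stand-in with the same column names as in the slides.

```r
# Made-up stand-in for the survey data frame used in the slides
survey <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male", "Female", "Male"),
  height_cm = c(158, 175, 163, 181, 170, 178),
  weight_kg = c(52, 70, 57, 80, 63, 74)
)

# Option 1: build the subset first
girls <- subset(survey, Gender == "Female")
model_a <- lm(weight_kg ~ height_cm, data = girls)

# Option 2: let lm() do the selection with subset=
model_b <- lm(weight_kg ~ height_cm, data = survey,
              subset = Gender == "Female")

print(coef(model_a))
print(coef(model_b))
```

Both models use only the rows where Gender is "Female", so the coefficients agree.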

## Drawing two plots at the same time

par(mfrow=c(1,2))
plot(weight_kg ~ height_cm, data=survey,
subset=Gender=="Female", main="Girls")
plot(weight_kg ~ height_cm, data=survey,
subset=Gender=="Male", main="Boys")

## Drawing two plots at the same time

There are many graphical parameters that can be changed with the function par().

It is a good idea to read the manual page help(par).

Here we use the parameter mfrow: a vector c(num_rows, num_columns).

After doing par(mfrow=c(num_rows, num_columns)), all subsequent figures are drawn in a grid with num_rows rows and num_columns columns.
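
Since the new layout stays active for later figures, it can be useful to save the old setting and restore it afterwards. A minimal sketch, again using a made-up stand-in for `survey`; the name `old_par` is chosen only for this example.

```r
# Made-up stand-in for the survey data frame used in the slides
survey <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male", "Female", "Male"),
  height_cm = c(158, 175, 163, 181, 170, 178),
  weight_kg = c(52, 70, 57, 80, 63, 74)
)

# par() returns the old values of the parameters it changes
old_par <- par(mfrow = c(1, 2))
plot(weight_kg ~ height_cm, data = survey,
     subset = Gender == "Female", main = "Girls")
plot(weight_kg ~ height_cm, data = survey,
     subset = Gender == "Male", main = "Boys")

# Go back to the previous layout (one figure per page by default)
par(old_par)
```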