November 22, 2018

Tidying data is one of the important cleaning processes in the practice of data science

*Tidy data sets* have structure and working with them is easy

They are easy to manipulate, model and visualize

Tidy data sets are arranged such that each variable is a column and each observation (or case) is a row

- Each variable you measure should be in one column.
- Each different observation of that variable should be in a different row.
- There should be one table for each “kind” of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.

- Jeff Leek, “The Elements of Data Analytic Style” (2015)
- Wikipedia

“Tidy datasets are all alike but every messy dataset is messy in its own way.”

– Hadley Wickham

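| n_marbles | length | repetition |
|---|---|---|
| 0 | 10 | 1 |
| 1 | 10.8 | 1 |
| 2 | 12.2 | 1 |
| 3 | 14 | 1 |
| 4 | 15.7 | 1 |
| 5 | 18.1 | 1 |
| 6 | 20.5 | 1 |
| 0 | 10 | 2 |
| 1 | 10.8 | 2 |
| 2 | 12.2 | 2 |
| 3 | 14 | 2 |
| 4 | 15.7 | 2 |
| 5 | 18.1 | 2 |
| 6 | 20.5 | 2 |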

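| n_marbels | length | repetition |
|---|---|---|
| 0 | 8.0 | 1 |
| 0 | 8.0 | 2 |
| 0 | 8.0 | 3 |
| 1 | 8.5 | 1 |
| 1 | 8.8 | 2 |
| 1 | 8.8 | 3 |
| 2 | 8.9 | 1 |
| 2 | 8.9 | 2 |
| 2 | 9.0 | 3 |
| 3 | 9.4 | 1 |
| 3 | 9.2 | 2 |
| 3 | 9.1 | 3 |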

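| n_coins | length | repetition |
|---|---|---|
| 0 | 7.5cm | 1 |
| 5 | 8 cm | 1 |
| 10 | 9 cm | 1 |
| 15 | 11 cm | 1 |
| 0 | 8 cm | 2 |
| 5 | 9 cm | 2 |
| 10 | 10 cm | 2 |
| 15 | 11 cm | 2 |
| 0 | 8 cm | 3 |
| 5 | 9 cm | 3 |
| 10 | 10 cm | 3 |
| 15 | 11 cm | 3 |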

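| n_marble | length | repetition |
|---|---|---|
| 0 | 5cm | 1 |
| 1 | 5.5cm | 1 |
| 2 | 6cm | 1 |
| 3 | 6.5cm | 1 |
| 4 | 7cm | 1 |
| 5 | 7.5cm | 1 |
| 6 | 8.1cm | 1 |
| 7 | 8.6cm | 1 |
| 0 | 5cm | 1 |
| 1 | 5.4cm | 1 |
| 2 | 6cm | 1 |
| 3 | 6.4cm | 1 |
| 4 | 6.9cm | 1 |
| 5 | 7.5cm | 1 |
| 6 | 8.2cm | 1 |
| 7 | 8.7cm | 1 |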

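| n_coins | length_cm | repetition |
|---|---|---|
| 0 | 7,5 | 1 |
| 5 | 8 | 1 |
| 10 | 9 | 1 |
| 15 | 11 | 1 |
| 0 | 8 | 2 |
| 5 | 9 | 2 |
| 10 | 10 | 2 |
| 15 | 11 | 2 |
| 0 | 8 | 3 |
| 5 | 9 | 3 |
| 10 | 10 | 3 |
| 15 | 11 | 3 |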

| n_marbles | length | repetition |
|---|---|---|
| 0 | 10 | 2 |
| 1 | 10.8 | 2 |
| 2 | 12.2 | 2 |
| 3 | 14 | 2 |
| 4 | 15.7 | 2 |
| 5 | 18.1 | 2 |
| 6 | 20.5 | 2 |

| | Empty | 1_Marble | 2_Marbles | 3_Marbles |
|---|---|---|---|---|
| Exp1 | 50.5 | 65.5 | 81.5 | 96.0 |
| Exp2 | 50.5 | 67.0 | 82.5 | 98.0 |
| Exp3 | 51.5 | 67.5 | 83.0 | 98.0 |

What are the units here?

| | 0 marble | 1 marbles | 2 marbles | 3 marbles |
|---|---|---|---|---|
| Repetition1 | 8.4 | 9.5 | 10 | 10.8 |
| Repetition2 | 8.3 | 9 | 10.1 | 10.8 |
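A wide table like the last one, with one column per number of marbles, can be rearranged into the tidy layout with base R. The sketch below is one possible approach, not the only one; the column names `m0` to `m3` and the object names `wide`, `long` and `tidy` are made up for the illustration, and the measurement unit is left out because the table does not state it.

```r
# Wide layout: one row per repetition, one column per number of marbles
# (values copied from the table; column names m0..m3 are hypothetical)
wide <- data.frame(
  repetition = c(1, 2),
  m0 = c(8.4, 8.3),
  m1 = c(9.5, 9.0),
  m2 = c(10.0, 10.1),
  m3 = c(10.8, 10.8)
)

# stack() collects all measurements in one column ("values")
# and records the source column name in another ("ind")
long <- stack(wide[, c("m0", "m1", "m2", "m3")])

# Tidy layout: each variable is a column, each observation is a row
tidy <- data.frame(
  n_marbles  = as.integer(sub("m", "", as.character(long$ind))),
  length     = long$values,
  repetition = rep(wide$repetition, times = 4)
)
tidy
```

The result has eight rows, one per measurement, so it can be passed directly to formula-based functions that expect `n_marbles` and `length` as columns.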

Data that is easy to model, visualize and aggregate

Only one kind of object in a data frame (e.g. experiment)

Variables in columns, observations in rows

Only one unit of measurement in each column

Do not mix numbers and text

Units can be in the column name
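When a column already mixes numbers and text, as in the tables with values such as `7.5cm`, `8 cm` or `7,5`, it has to be cleaned before R can treat it as numeric. The following is a minimal sketch of one way to do it, assuming every value is a non-negative length in centimeters; the vector names `raw`, `clean` and `length_cm` are chosen for the illustration.

```r
# Values as they might be typed by hand (examples taken from the messy tables)
raw <- c("7.5cm", "8 cm", "7,5", "11")

# Replace a decimal comma with a decimal point
clean <- sub(",", ".", raw, fixed = TRUE)

# Drop everything that is not a digit or a decimal point (unit text, spaces).
# Note: this would also drop a minus sign, so it only suits non-negative values.
clean <- gsub("[^0-9.]", "", clean)

# Now the column holds numbers only, and the unit goes in the name
length_cm <- as.numeric(clean)
length_cm
```

After this step the unit lives in the column name `length_cm` and the values can be plotted or modeled directly.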

We draw the figure using a formula

plot(weight_kg ~ height_cm, data=survey)

Using the same formula we can get a *linear model*

lm(weight_kg ~ height_cm, data=survey)

Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
   -81.5045       0.8616

In science we work by creating models of how nature works

There are several kinds of models

Among the easiest and most commonly used are *linear models*

We approximate all our data by a straight line that shows the relationship between some variables, with a formula like \[y=a + b\cdot x\]
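As a small worked example of this formula, the coefficients printed by `lm()` earlier can be plugged in by hand. The height of 170 cm is an arbitrary value picked for the illustration, and the variable names are hypothetical.

```r
# Coefficients as printed by lm(weight_kg ~ height_cm, data = survey)
a <- -81.5045   # intercept
b <- 0.8616     # slope: kg of weight per cm of height

# Arbitrary example height (an assumption, not a value from the survey)
height_cm <- 170

# y = a + b * x
predicted_weight_kg <- a + b * height_cm
predicted_weight_kg
```

With these numbers the line gives about 65 kg for a person 170 cm tall.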

We draw the trend line using `abline()` with the coefficients from `lm()`

plot(weight_kg ~ height_cm, data=survey)
abline(a=-81.5045, b=0.8616)

model <- lm(weight_kg ~ height_cm, data=survey)
model

Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
   -81.5045       0.8616

coef(model)

(Intercept)   height_cm
-81.5044904   0.8616001

- The formula is \(y=a + b\cdot x\)
- \(a\) is `coef(model)[1]`
- \(b\) is `coef(model)[2]`

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(a=coef(model)[1], b=coef(model)[2])

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(model)

Beyond giving a *description* of the data, models are often used to get a *prediction* of what would be the *output* of the system when we have *new data*

In this case we need to provide a `data.frame` with at least one column. The column name **must be** the same as the one used to create the model. For example

data.frame(height_cm=155:205)

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
guess <- predict(model, newdata = data.frame(height_cm=155:205))
points(155:205, guess, col="red", pch=19)
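To see what `predict()` does, it can help to compare its result with the formula \(y=a + b\cdot x\) computed by hand. The sketch below does not use the real `survey` data: it builds a small simulated data frame with the same column names, so the numbers are invented and only the mechanics matter.

```r
# Simulated stand-in for the survey data (values are made up)
set.seed(1)
survey <- data.frame(height_cm = seq(150, 200, by = 5))
survey$weight_kg <- -80 + 0.85 * survey$height_cm +
  rnorm(nrow(survey), sd = 3)

model <- lm(weight_kg ~ height_cm, data = survey)

# New data: the column name must match the one used in the model formula
new_data <- data.frame(height_cm = c(160, 175, 190))
guess <- predict(model, newdata = new_data)

# The same prediction computed by hand from the coefficients
by_hand <- coef(model)[1] + coef(model)[2] * new_data$height_cm

# Both approaches agree (names are dropped before comparing)
all.equal(unname(guess), unname(by_hand))
```

For a linear model with one variable, `predict()` is therefore a convenient way of evaluating the fitted line at the new values.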

- Build a linear model for the **rubber.txt** data you created last class
- Print the coefficients
- Predict the rubber length when `n_marbles` is 0, 5, 10, 20, and 50
- Finish the flag of Turkey

plot(survey$height_cm, survey$weight_kg)

plot(survey$weight_kg ~ survey$height_cm)

The `data=` option gives the context for the formula

plot(survey$weight_kg ~ survey$height_cm)

plot(weight_kg ~ height_cm, data = survey)

You can do it like this

plot(survey$height_cm[ survey$Gender=="Female"], survey$weight_kg[ survey$Gender=="Female"])

or like this

plot(survey[survey$Gender=="Female", "height_cm"], survey[survey$Gender=="Female", "weight_kg"])

Instead of this

plot(survey[survey$Gender=="Female", "height_cm"], survey[survey$Gender=="Female", "weight_kg"])

We can do this

grl <- survey[ survey$Gender=="Female", ]
plot(grl$height_cm, grl$weight_kg)

We can select using indices …

grl <- survey[ survey$Gender=="Female", ]
plot(grl$height_cm, grl$weight_kg)

… or with `subset()`

grl <- subset(survey, Gender=="Female")
plot(grl$height_cm, grl$weight_kg)
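The two ways of selecting rows give the same result when the condition is never missing, but they differ when `Gender` contains `NA`. The sketch below uses a small invented data frame with the same column names as `survey` to show the difference.

```r
# Invented data with one missing Gender value
survey <- data.frame(
  Gender    = c("Female", "Male", NA, "Female"),
  height_cm = c(160, 180, 170, 165),
  weight_kg = c(55, 80, 70, 60)
)

# Logical indexing: an NA in the condition produces a row full of NA
by_index <- survey[survey$Gender == "Female", ]

# subset(): rows where the condition is NA are dropped
by_subset <- subset(survey, Gender == "Female")

nrow(by_index)
nrow(by_subset)
```

Here `by_index` has three rows (two real ones and one row of `NA`), while `by_subset` has only the two rows where `Gender` is known to be `"Female"`.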

`subset()` with formula

You don’t need to use `$`

girls <- subset(survey, Gender=="Female")
plot(girls$weight_kg ~ girls$height_cm)

`data=` is the formula’s context

girls <- subset(survey, Gender=="Female")
plot(weight_kg ~ height_cm, data = girls)

`subset=` option

Instead of using `subset()` …

plot(weight_kg ~ height_cm, data = subset(survey, Gender=="Female"))

you can use the `subset=` option

plot(weight_kg ~ height_cm, data = survey, subset = Gender=="Female")
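The `subset=` argument is not limited to `plot()`: `lm()` accepts it too, so the same formula can be fitted on part of the data. The sketch below uses a small invented data frame in place of the real `survey`, and the names `model_f` and `model_m` are chosen for the illustration.

```r
# Invented stand-in for the survey data
survey <- data.frame(
  Gender    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  height_cm = c(158, 165, 172, 170, 178, 186),
  weight_kg = c(52, 58, 65, 68, 77, 85)
)

# One linear model per group, using the subset= argument of lm()
model_f <- lm(weight_kg ~ height_cm, data = survey,
              subset = Gender == "Female")
model_m <- lm(weight_kg ~ height_cm, data = survey,
              subset = Gender == "Male")

coef(model_f)
coef(model_m)
```

Each model only sees the rows of its group, which is equivalent to calling `subset()` first and passing the result as `data=`.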

par(mfrow=c(1,2))
plot(weight_kg ~ height_cm, data=survey, subset=Gender=="Female", main="Girls")
plot(weight_kg ~ height_cm, data=survey, subset=Gender=="Male", main="Boys")

There are many graphical parameters that can be changed with the function `par()`

It is a good idea to read the manual page `help(par)`

Here we use the parameter `mfrow`: a vector `c(num_rows, num_columns)`

After doing `par(mfrow=c(num_rows, num_columns))` all figures will be drawn in a `num_rows`-by-`num_columns` grid
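Because `par()` changes stay in effect for later plots, a common habit is to save the old settings and restore them afterwards. `par()` returns the previous values of the parameters it changes, which makes this easy. The sketch below uses a small invented data frame in place of the real `survey`.

```r
# Invented stand-in for the survey data
survey <- data.frame(
  Gender    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  height_cm = c(158, 165, 172, 170, 178, 186),
  weight_kg = c(52, 58, 65, 68, 77, 85)
)

# Set a 1-by-2 layout and keep the previous settings
old_par <- par(mfrow = c(1, 2))

plot(weight_kg ~ height_cm, data = survey,
     subset = Gender == "Female", main = "Girls")
plot(weight_kg ~ height_cm, data = survey,
     subset = Gender == "Male", main = "Boys")

# Restore the previous layout for later figures
par(old_par)
```

After `par(old_par)` the next call to `plot()` uses the layout that was active before the change.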