December 3, 2019

About last week’s experiment data

Sample of answers (185a34)

id rep N x1 x2 y1 y2
1 a 1 40 125 143 225
2 a 1 50 155 175 280
3 a 1 40 195 215 375
4 a 1 10 192 212 400
5 a 1 110 258 278 435
1 b 1 40 125 143 225
2 b 1 50 156 174 280
3 b 1 40 156 175 375
4 b 1 10 238 258 400
5 b 1 110 261 281 435
1 c 1 40 125 143 225
2 c 1 50 155 174 280
3 c 1 40 196 214 375
4 c 1 10 193 212 400
5 c 1 110 260 279 435

Sample of answers (1e4a6e)

id rep N x1 x2 y1 y2
1 a 1 171 180 182 188
2 a 1 184 176 181 208
3 a 1 188 195 191 212
5 a 1 179 180 177 205
1 b 2 191 195 202 199
2 b 2 190 192 189 205
3 b 2 191 194 207 214
4 b 2 210 208 210 222
5 b 2 220 217 205 206
1 c 2 191 195 202 211
2 c 2 190 192 189 202
3 c 2 191 194 207 205
4 c 2 210 208 210 212
5 c 2 220 217 205 207

Sample of answers (3b2b4b)

id rep N x1   x2   x3 
1 a 2 40 15.7 13.0 11.3
2 a 2 50 21.3 15.0 13.7
3 a 2 60 25.0 18.8 16.2
4 a 2 70 28.8 21.8 19.4
5 a 2 80 32.2 25.2 22.8
1 b 2 40 15.7 12.9 11.4
2 b 2 50 20.7 15.7 13.6
3 b 2 60 24.8 18.5 16.6
4 b 2 70 28.9 21.7 19.2
5 b 2 80 32.1 25.3 22.6
1 c 2 40 15.8 12.9 11.3
2 c 2 50 20.5 15.7 13.8
3 c 2 60 24.3 19.3 16.4
4 c 2 70 28.7 21.9 19.4
5 c 2 80 32.3 25.3 22.4

Sample of answers (6ed952)

id rep x1    x2   y1   y2
1 a 1   150  255  275  370
2 a 1   100  200  220  300
3 a 1   200  290  310  396
4 a 1   210  305  325  410
5 a 1   150  274  293  400
1 b 1   150  256  274  370
2 b 1   100  192  210  300
3 b 1   200  290  309  396
4 b 1   210  303  321  410
5 b 1   150  275  293  400
1 c 1   150  256  274  370
2 c 1   100  192  210  300
3 c 1   200  290  309  396
4 c 1   210  303  321  410
5 c 1   150  275  293  400

Sample of answers (7183bd)

id rep N x1 x2  x2  y1  y2
1 a 1 250 346 370 475
2 a 1 250 365 385 495
3 a 1 250 378 398 515
4 a 1 250 365 384 495
5 a 1 250 359 378 475
1 b 1 250 349 369 475
2 b 1 250 350 370 495
3 b 1 250 365 385 515
4 b 1 250 357 397 495
5 b 1 250 349 369 475
1 c 1 250 348 368 475
2 c 1 250 356 376 495
3 c 1 250 364 384 515
4 c 1 250 356 376 495
7 c 1 250 350 370 475

Sample of answers (e3459b)

id   rep x1  x2  y1  y2  n  
1 a 1    355 356 372 416
2 a 1    384 450 380 382
3 a 1    420 446 740 775
4 a 1    434 442 775 670
5 a 1    425 460 755 759
1 b 1    256 290 632 705
2 b 1      295 306 630 650
3 b 1    285 296 630 660
4 b 1     260 280 650 680 
5 b 1    280 350 700 720
1 c 1     300 290 725 733
2 c    
3 c
4 c
5 c

Data tidying

One of the important cleaning processes in the practice of data science

Tidy data sets have structure and working with them is easy

They are easy to manipulate, model and visualize

Tidy data sets are arranged such that each variable is a column and each observation (or case) is a row

Characteristics

  • Each variable you measure should be in one column.
  • Each different observation of that variable should be in a different row.
  • There should be one table for each “kind” of variable.
  • If you have multiple tables, they should include a column in the table that allows them to be linked.
  • Sources: Jeff Leek, “The Elements of Data Analytic Style” (2015); Wikipedia

What people say

“Tidy datasets are all alike but every messy dataset is messy in its own way.”

– Hadley Wickham

Tidy data

  • Data that is easy to model, visualize and aggregate

  • Only one kind of object in a data frame (e.g. experiment)

  • Variables in columns, observations in rows

  • Only one measuring unit on each column

  • Do not mix numbers and text

  • Units can be in the column name
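As a rough illustration of these rules, the first rows of sample 185a34 above can be read into a data frame and checked. This is only a sketch: the names raw and answers are made up for the example, and it assumes the columns are separated by spaces.

```r
# First rows of sample 185a34, copied from the slide above
raw <- "id rep N x1 x2 y1 y2
1 a 1 40 125 143 225
2 a 1 50 155 175 280
3 a 1 40 195 215 375"

# read.table() can read from a text string instead of a file
answers <- read.table(text = raw, header = TRUE, stringsAsFactors = FALSE)

# One variable per column, one observation per row
str(answers)

# Measurement columns should be numeric (no text mixed with numbers)
sapply(answers[, c("x1", "x2", "y1", "y2")], is.numeric)
```

If a column that should hold numbers comes back as character, some cell in it probably contains text or a stray symbol.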

Linear models

Tendency line

We draw the figure using a formula

plot(weight_kg ~ height_cm, data=survey)

Using the same formula we can get a linear model

lm(weight_kg ~ height_cm, data=survey)
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm  
   -77.1697       0.8382  

Linear models

In science we work by creating models of how nature works

There are several kinds of models

Linear models are among the simplest and most commonly used

We approximate all our data by a straight line that shows the relationship between some variables, with a formula like \[y=a + b\cdot x\]
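For example, plugging the rounded coefficients printed earlier (a = -77.1697, b = 0.8382) into the formula gives the weight that the line assigns to a given height. The height of 170 cm is an arbitrary value chosen for illustration.

```r
a <- -77.1697   # intercept, as printed by lm() above
b <- 0.8382     # slope for height_cm, as printed by lm() above

x <- 170        # an arbitrary height in cm
y <- a + b * x  # weight in kg on the straight line
y
```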

Drawing the line

We draw the tendency line using abline() with the coefficients from lm()

plot(weight_kg ~ height_cm, data=survey)
abline(a=-77.169742, b=0.838182)

Making the computer work for us

We must avoid copying data manually

model <- lm(weight_kg ~ height_cm, data=survey)
model
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm  
   -77.1697       0.8382  

The useful part is the coefficients

coef(model)
(Intercept)   height_cm 
 -77.169742    0.838182 
  • The formula is \(y=a + b\cdot x\)
  • \(a\) is coef(model)[1]
  • \(b\) is coef(model)[2]
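The coefficients can also be taken by name instead of by position. The sketch below uses a small invented data frame as a stand-in for survey (the numbers are not real survey data), so the fitted values will differ from the ones shown above.

```r
# Hypothetical stand-in for the survey data (values invented for illustration)
survey <- data.frame(
  height_cm = c(150, 160, 170, 180, 190),
  weight_kg = c(50, 58, 65, 74, 80)
)

model <- lm(weight_kg ~ height_cm, data = survey)

# Coefficients taken by position ...
a <- coef(model)[1]
b <- coef(model)[2]

# ... or by name, which makes the code easier to read
a_named <- coef(model)[["(Intercept)"]]
b_named <- coef(model)[["height_cm"]]
```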

Drawing the tendency line

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(a=coef(model)[1], b=coef(model)[2])

Easily drawing the tendency line

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
abline(model)

Predicting with the model

Beyond describing the data, models are often used to predict what the output of the system would be for new data

In this case we need to provide a data.frame with at least one column, whose name must be the same as the variable used to create the model. For example:

data.frame(height_cm=155:205)

Predicting with the model

plot(weight_kg ~ height_cm, data=survey)
model <- lm(weight_kg ~ height_cm, data=survey)
guess <- predict(model, newdata=data.frame(height_cm=155:205))
points(155:205, guess, col="red", pch=19)
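To see what predict() is doing, its result can be compared with the formula y = a + b·x computed by hand. This is a minimal sketch on an invented stand-in for survey; the name new_people and the three heights are made up for the example.

```r
# Hypothetical stand-in for the survey data (values invented for illustration)
survey <- data.frame(
  height_cm = c(150, 160, 170, 180, 190),
  weight_kg = c(50, 58, 65, 74, 80)
)
model <- lm(weight_kg ~ height_cm, data = survey)

# The column name must match the variable used in the formula
new_people <- data.frame(height_cm = c(155, 175, 195))
guess <- predict(model, newdata = new_people)

# The same values computed by hand from y = a + b * x
a <- coef(model)[["(Intercept)"]]
b <- coef(model)[["height_cm"]]
by_hand <- a + b * new_people$height_cm

# Side by side: the two columns should agree
cbind(new_people, guess, by_hand)
```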

Homework

  1. Clean the data in the file you created last class
  2. Calculate dx=x2-x1 and dy=y2-y1
  3. Create a model of dy ~ dx when N==1
  4. Print the coefficients
  5. Predict dy when dx is 100, 200, and 500

Subsets

Plotting two vectors

plot(survey$height_cm,
     survey$weight_kg)

plot(survey$weight_kg ~ 
    survey$height_cm)

Formulas have context

The data= option gives the context for the formula

plot(survey$weight_kg ~ 
    survey$height_cm)

plot(weight_kg ~ height_cm, 
    data = survey)

Drawing only girls

You can do it like this

plot(survey$height_cm[
      survey$Gender=="Female"],
     survey$weight_kg[
      survey$Gender=="Female"])

or like this

plot(survey[survey$Gender=="Female",
        "height_cm"],
     survey[survey$Gender=="Female",
        "weight_kg"])

Preprocessing data

Instead of this

plot(survey[survey$Gender=="Female",
        "height_cm"],
     survey[survey$Gender=="Female",
        "weight_kg"])

We can do this

grl <- survey[
    survey$Gender=="Female", ]
plot(grl$height_cm,
     grl$weight_kg)

Selecting rows using subset()

We can select using indices …

grl <- survey[
    survey$Gender=="Female", ]
plot(grl$height_cm,
     grl$weight_kg)

… or with subset()

grl <- subset(survey,
         Gender=="Female")
plot(grl$height_cm,
     grl$weight_kg)
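The two ways of selecting rows give the same result as long as the condition is never NA. The sketch below uses a small invented data frame (not the real survey) with one missing Gender to show the difference: indexing keeps a row full of NA, while subset() drops it.

```r
# Hypothetical stand-in for the survey data, with one missing Gender
survey <- data.frame(
  Gender    = c("Female", "Male", NA, "Female"),
  height_cm = c(160, 175, 168, 158),
  weight_kg = c(55, 72, 60, 52)
)

# Selecting with indices: the NA in the condition produces a row of NAs
by_index <- survey[survey$Gender == "Female", ]

# Selecting with subset(): rows where the condition is NA are dropped
by_subset <- subset(survey, Gender == "Female")

nrow(by_index)
nrow(by_subset)
```

This is one reason subset() is often the more convenient choice when the data has missing values.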

Using subset() with formula

You don’t need to use $

girls <- subset(survey,
      Gender=="Female")
plot(girls$weight_kg ~
      girls$height_cm)

data= is the formula’s context

girls <- subset(survey,
     Gender=="Female")
plot(weight_kg ~ height_cm,
     data = girls)

The subset= option

Instead of using subset()

plot(weight_kg ~ height_cm, 
  data = subset(survey,
       Gender=="Female"))

you can use the subset= option

plot(weight_kg ~ height_cm, 
   data = survey,
   subset = Gender=="Female")
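The subset= option is not specific to plot(); lm() accepts it as well, so a model can be fitted on part of the data in the same way. The sketch below uses an invented stand-in for survey and checks that both approaches give the same coefficients.

```r
# Hypothetical stand-in for the survey data (values invented for illustration)
survey <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male", "Female", "Male"),
  height_cm = c(158, 175, 163, 181, 170, 178),
  weight_kg = c(52, 70, 57, 80, 63, 74)
)

# Fit using the subset= option ...
m_option <- lm(weight_kg ~ height_cm, data = survey,
               subset = Gender == "Female")

# ... and using the subset() function
m_function <- lm(weight_kg ~ height_cm,
                 data = subset(survey, Gender == "Female"))

# The two sets of coefficients should agree
all.equal(coef(m_option), coef(m_function))
```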

Drawing two plots at the same time

par(mfrow=c(1,2), mar=c(5,4,2,0))
plot(weight_kg ~ height_cm, survey, subset=Gender=="Female", main="Girls")
plot(weight_kg ~ height_cm, survey, subset=Gender=="Male", main="Boys")

Drawing two plots at the same time

There are many graphical parameters that can be changed with the function par()

It is a good idea to read the manual page help(par)

Here we use the parameter mfrow: a vector c(num_rows, num_columns)

After doing par(mfrow=c(num_rows, num_columns)) all figures will be drawn in a num_rows-by-num_columns grid
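The new layout stays active for every later figure, so it can be useful to save the old settings and restore them afterwards. par() returns the previous values of the parameters it changes, which makes this easy. A minimal sketch, using an invented stand-in for survey:

```r
# Hypothetical stand-in for the survey data (values invented for illustration)
survey <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male", "Female", "Male"),
  height_cm = c(158, 175, 163, 181, 170, 178),
  weight_kg = c(52, 70, 57, 80, 63, 74)
)

# par() returns the old values of the parameters it changes
old <- par(mfrow = c(1, 2))

plot(weight_kg ~ height_cm, data = survey,
     subset = Gender == "Female", main = "Girls")
plot(weight_kg ~ height_cm, data = survey,
     subset = Gender == "Male", main = "Boys")

# Restore the previous layout (one figure per page by default)
par(old)
```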