Class 22: Linear models

Computing in Molecular Biology and Genetics 1

Andrés Aravena, PhD

28 December 2020

Why?

Why are we here?

To become Scientists

Someone who does Science

Science is not only making experiments

Science is the process of creating knowledge

We search for truth with the Scientific Method

Scientific Method

  • We observe the nature and find patterns
  • We create models that can explain the patterns
  • We make experiments to test if the models are valid

This way we avoid fooling ourselves

Scientist look for the truth

  • Does smoking causes cancer?
  • Does eating sugar makes you fat?
  • Does your cellphone produce brain cancer?
  • Will an expensive medicine cure your sickness?

The society expect from us, the scientists, to answer these questions with the truth

Even in everyday jobs

Tomorrow you may work in a blood bank.
Is the blood safe?

Or in a food factory.
Is the food safe? Is it GMO?

Or in a University.
Is this pharmaceutical company telling the truth?

Or you do a paternity test.
Is this person the real father?

Truth is essential for scientists

  • Sometimes your experiments fail
  • Sometimes you get a wrong result
  • Sometimes your model is incomplete
    • Most models are incomplete, and we are always updating them

You can be wrong, but you cannot lie

The Scientific Method

Some key parts of the Scientific Method

  • Models are tested with experiments

  • To be valid, experiments must be replicable

    • That is, other people doing the same experiment must get the same result
  • There may be some variation between experiments

    • You must declare what is your margin of error
    • Every measurement has a margin of error

Doing Science with our data

We load the data

library(readr)
students <- read_tsv("students2018-2020-tidy.tsv",
    col_types = cols(
        sex = col_factor(levels = c("Male", "Female")),
        handedness = col_factor(levels = c("Left", "Right"))
    )
)
students
# A tibble: 117 x 10
   answer_date id    english_level sex   birthdate  birthplace height_cm
   <date>      <chr> <chr>         <fct> <date>     <chr>          <dbl>
 1 2018-09-17  3e50… I can speak … Male  1993-02-01 -/Turkey         179
 2 2018-09-17  479d… I can unders… Fema… 1998-05-21 Kahramanm…       168
 3 2018-09-17  39df… I can read a… Fema… 1998-01-18 Batman/Tu…        NA
 4 2018-09-17  d2b0… I can read a… Male  1998-08-29 Antalya/T…       170
 5 2018-09-17  f22b… I can read a… Fema… 1998-05-03 Izmir/Tur…       162
 6 2018-09-17  849c… İngilizce bi… Fema… 1995-10-09 Yalova/Tu…       167
 7 2018-09-17  8381… I can speak … Fema… 1997-09-19 Adıyaman/…       174
 8 2018-09-17  b0dd… I can read a… Male  1997-11-27 Bursa/Tur…       180
 9 2018-09-17  2972… I can read a… Fema… 1999-01-02 Istanbul/…       162
10 2018-09-17  72c0… I can read a… Fema… 1998-10-02 Istanbul/…       172
# … with 107 more rows, and 3 more variables: weight_kg <dbl>,
#   handedness <fct>, hand_span <dbl>

Plotting a formula

plot(weight_kg ~ height_cm, data = students)

Reminder of plot() function

  • plot(y ~ x) looks like plot(x, y)
  • Formulas are nice:
    • plot(y~x, data=dframe)
  • In general the defaults are good
    • axis labels are the plotted variable’s names
    • ranges are automatic
  • You can choose the horizontal and vertical ranges

Linear models

In science we work by creating models of how nature works

There are several kinds of models

One of the easiest and more commonly used are the linear models

We approximate all our data by a straight line that shows the relationship between some variables, with a formula like \[y=a + b\cdot x\]

A simple model

We want to describe the relationship between weight and height

A basic rule of thumb is that weight in kg is usually equal to height in cm minus 100 \[weight = height - 100\]

That is, the weight is the number of centimeters over one meter

Adding a straight line to a plot

plot(weight_kg ~ height_cm, data = students)
abline(a = -100, b = 1)

A-B-line

This command adds a straight line in a specific position

  • abline(h=1) adds a horizontal line in 1

  • abline(v=2) adds a vertical line in 2

  • abline(a=3, b=4) adds an \(a +b\cdot x\) line

Linear models

Deciding the model formula

The hard part is to decide

  • What is the dependent variable
    • The values that we want to predict
    • The vertical axis
  • What are the independent variables
    • The values that we can control
    • The horizontal axis
  • What is the formula connecting them

Our model

Here we are using a linear model \[y=a + b\cdot x\] more precisely \[weight = a + b\cdot height\]

Which are the best \(a\) and \(b\) values?

Finding the best line

We draw the figure using a formula

plot(weight_kg ~ height_cm, data=students)

Using the same formula we can get a linear model

lm(weight_kg ~ height_cm, data=students)

Call:
lm(formula = weight_kg ~ height_cm, data = students)

Coefficients:
(Intercept)    height_cm  
   -84.0079       0.8822  

Drawing the tendency line

We draw using abline() with the coefficients of lm()

plot(weight_kg ~ height_cm, data=students)
abline(a = -84.0079, b = 0.8822)

Making the computer work for us

We must avoid copying data manually or using fixed values

model <- lm(weight_kg ~ height_cm, data=students)
model

Call:
lm(formula = weight_kg ~ height_cm, data = students)

Coefficients:
(Intercept)    height_cm  
   -84.0079       0.8822  

The useful part are the coefficients

coef(model)
(Intercept)   height_cm 
 -84.007888    0.882157 

The formula is \(y=a + b\cdot x\)

  • \(a\) is coef(model)[1]

  • \(b\) is coef(model)[2]

Drawing the tendency line

plot(weight_kg ~ height_cm, data=students)
model <- lm(weight_kg ~ height_cm, data=students)
abline(a=coef(model)[1], b=coef(model)[2])

Easily drawing the tendency line

plot(weight_kg ~ height_cm, data=students)
model <- lm(weight_kg ~ height_cm, data=students)
abline(model)

Predicting with the model

Beyond giving a description of the data, models are often used to get a prediction of what would be the output of the system when we have new data

In this case we need to provide a data.frame with at least one column. The column name must be the same as the one used to create the model. For example

data.frame(height_cm=155:205)

Predicting with the model

model <- lm(weight_kg ~ height_cm, data=students)
guess <- predict(model, newdata=data.frame(height_cm=155:205))

An experiment

Coils and Rubber bands

  • Coils and rubber bands have a natural size
  • If you apply a force to them, they expand
  • What is the relation between the expansion and the force?

Robert Hooke said it first

Robert Hooke (1635–1703) was an English natural philosopher, architect and polymath.

In 1660, Hooke discovered the law of elasticity which describes the linear variation of tension with extension

“The extension is proportional to the force”

Robert Hooke

Natural philosophy was the study of nature and the physical universe that was dominant before the development of modern science

Polymath (from Greek “having learned much”) is a person whose expertise spans a significant number of different subject areas

Biologist. Hooke used the microscope and was the fists to use the term cell for describing biological organisms.

How do we model a coil?

The essence of the coil is:

  • It has a natural length \(L\)
  • If we change the length by \(x\), it pulls with a force \[\mathrm{force}(x)= K \cdot (L-x)\]

Doing the experiment

Results

n_balls length repetition
0 78.00 1
1 82.61 1
2 85.85 1
3 90.26 1
4 95.05 1
0 79.21 2
2 85.55 2
3 90.06 2
4 94.35 2

(some data omitted from the table. Get all data at
http://dry-lab.org/static/2017/rubber1.txt)

Graphic Results

plot(length~n_balls, data=rubber)

Best-fit line

When data seems to be in a straight line, we can find that line

The best-fitting line is found using a linear model

model <- lm(length ~ n_balls, data=rubber)
model

Call:
lm(formula = length ~ n_balls, data = rubber)

Coefficients:
(Intercept)      n_balls  
      76.48         5.06  

What are the coefficients?

Remember that straight lines can be represented by the formula \[\text{n_marbles}=A+B\cdot \text{length}\] The coefficient \(A\) is the value where the line intercepts the vertical axis

The coefficient \(B\) is how much length goes up when n_marbles increases. This is called slope

Physical interpretation of the linear model

The formula from Hooke’s Law is \[\text{force}=K\cdot(L-\text{height})\] Since force is the weight of the balls, we can write \[-m g\cdot\text{n_balls}=K\cdot(L-\text{lenght})\] which can be re-written as \[\text{lenght}=\underbrace{L}_{A}+\underbrace{\frac{m g}{K}}_{B}\cdot\text{n_balls}\]

Physical interpretation of coef(model)

When there are no balls, the length of the coil is \(L\), in this case

coef(model)[1]
(Intercept) 
   76.48188 

If the mass of each ball is 20gr, we can find \(K\) as

coef(model)[2]/(0.020 * 9.8)
 n_balls 
25.81566 

Marbles weight \(\approx 21\pm 0.5\) grams