December 17, 2019

`plot(sra)`

:- plots all the data frame

`plot(day~log(bases), data = sra)`

:- sideways

`plot(log(bases)~log(day), data=sra)`

:- log-log, not semi-log

`plot(bases~day, data=sra, log="y")`

- Publication Quality

`plot(log(bases) ~ day, data = sra)`

- Good for analysis

sra <- read.table("sra_bases.txt", header=TRUE) model <- lm(log(bases) ~ day, data=sra) model

Call: lm(formula = log(bases) ~ day, data = sra) Coefficients: (Intercept) day 28.34146 0.00248

In a semi-log model, we have \[\log(\text{bases}) = \underbrace{\log(\text{initial})}_{\texttt{coef(model)[1]}} + \underbrace{\log(\text{rate})}_{\texttt{coef(model)[2]}} \cdot \text{days}\]

Undoing `log()`

, we have \[\text{bases} = \text{initial}\cdot\text{rate}^\text{days}\]

Thus, \(\text{rate}=\)`exp(coef(model)[2])`

\(=1.0025\)

If \(\text{rate}=1.0025\) it means that the database grows \(0.25\%\) every day

- That is \(147.5\%\) every year
- That is \(295\%\) in two years

To make predictions using the model, we use`predict(model, newdata)`

We need a *data frame* with one column named `day`

, because we used `day`

in the formula

The last day in `sra`

is `max(sra$day)`

Two years after the last will be `max(sra$day) + 2 * 365`

so `newdata=data.frame(day = max(sra$day) + 2 * 365)`

height | speed | time |
---|---|---|

2.0000 | -0.20 | 0.0001 |

1.9778 | -0.69 | 0.0496 |

1.9310 | -1.18 | 0.0999 |

1.8598 | -1.67 | 0.1492 |

1.7640 | -2.16 | 0.2004 |

1.6438 | -2.65 | 0.2512 |

1.4990 | -3.14 | 0.3002 |

1.3297 | -3.63 | 0.3500 |

1.1360 | -4.12 | 0.4000 |

0.9177 | -4.61 | 0.4490 |

0.6750 | -5.10 | 0.5008 |

We can use a logarithmic scale to see if we get a straight line

- If we get straight line in
*semi-log*then the formula is \[y=A\cdot B^x\] - If we get straight line in
*log-log*then the formula is \[y=A\cdot x^B\]

There are many possible functions that connect \(x\) and \(y\)

We will try a *polynomial* formula

A polynomial is a formula like \[y=A + B\cdot x + C\cdot x^2 +\cdots+ D\cdot x^n\]

The highest exponent is the **degree** of the polynomial

We will try a 2-degree formula \[\text{height}=A+B\cdot\text{time}+C\cdot\text{time}^2\]

We need to add an *extra column*, with the square of the time

ball$time_sq <- ball$time^2 head(ball)

height speed time time_sq 1 2.000 -0.20 0.00010 1.000e-08 2 1.978 -0.69 0.04961 2.461e-03 3 1.931 -1.18 0.09986 9.972e-03 4 1.860 -1.67 0.14919 2.226e-02 5 1.764 -2.16 0.20037 4.015e-02 6 1.644 -2.65 0.25118 6.309e-02

You may be tempted to use

model <- lm(height ~ time^2, data=ball)

But that does not work.

Exponents and multiplications in formulas have a different meaning

(that we will see later)

Now we use **two** independent variables in the model. This is different from the `plot()`

function

model <- lm(height ~ time + time_sq, data=ball) model

Call: lm(formula = height ~ time + time_sq, data = ball) Coefficients: (Intercept) time time_sq 2.000 -0.195 -4.912

We used the formula \[\text{height}=A+B\cdot\text{time}+C\cdot\text{time}^2\] The linear model gave us \[\begin{aligned} A&=2\\ B&=-0.195\\ C&=-4.9 \end{aligned}\]

plot(height ~ time, data=ball) lines(predict(model) ~ time, data=ball, col="red")

fall <- read.table("free-fall.txt", header=TRUE) fall

msec position replica 1 2355 10 A 2 2457 20 A 3 2533 30 A 4 2558 40 A 5 2639 50 A 6 2690 60 A 7 2742 70 A 8 2742 80 A 9 2797 90 A 10 1837 10 B 11 1970 20 B 12 2022 30 B 13 2104 40 B 14 2129 50 B 15 2210 60 B 16 2237 70 B 17 2262 80 B 18 2313 90 B

plot(position~msec, data=fall)

- Initial time is not zero
*position*is upside down- We can change these values for the analysis

- We started the timer before dropping the ball
- The first data is when the ball crosses
**10** - We can choose this as our time reference
- The ball is already moving
- Initial speed is not zero

- The column name is
`msec`

. It means**milliseconds**

Let’s add a new column, called `time`

, with the correct value of *seconds*

fall$time <- fall$msec/1000

Now, for each replica, we change `time`

to start at 0

old <- fall$time[fall$replica=="A"] fall$time[fall$replica=="A"] <- old - min(old) old <- fall$time[fall$replica=="B"] fall$time[fall$replica=="B"] <- old - min(old)

We should also transform position to meters. The first data should be at 0

Each black/white strip was a little more than 19 centimeters

fall$height <- fall$position/10 * 0.192 old <- fall$height[fall$replica=="A"] fall$height[fall$replica=="A"] <- min(old) - old old <- fall$height[fall$replica=="B"] fall$height[fall$replica=="B"] <- min(old) - old

fall[fall$replica=="A",]

msec position replica time height 1 2355 10 A 0.000 0.000 2 2457 20 A 0.102 -0.192 3 2533 30 A 0.178 -0.384 4 2558 40 A 0.203 -0.576 5 2639 50 A 0.284 -0.768 6 2690 60 A 0.335 -0.960 7 2742 70 A 0.387 -1.152 8 2742 80 A 0.387 -1.344 9 2797 90 A 0.442 -1.536

fall[fall$replica=="B",]

msec position replica time height 10 1837 10 B 0.000 0.000 11 1970 20 B 0.133 -0.192 12 2022 30 B 0.185 -0.384 13 2104 40 B 0.267 -0.576 14 2129 50 B 0.292 -0.768 15 2210 60 B 0.373 -0.960 16 2237 70 B 0.400 -1.152 17 2262 80 B 0.425 -1.344 18 2313 90 B 0.476 -1.536

plot(height~time, data=fall, pch=as.character(replica))

This is a parabola, which can be modeled with a second degree polynomial \[y=A + B\cdot x + C\cdot x^2\] When we use a polynomial model, we need to add new columns to our data frame

fall$time_sq <- fall$time^2

This is used **only** for the model

model_A <- lm(height ~ time + time_sq, data=fall, subset=replica=="A") model_A

Call: lm(formula = height ~ time + time_sq, data = fall, subset = replica == "A") Coefficients: (Intercept) time time_sq -0.00131 -1.55751 -4.26664

model_B <- lm(height ~ time + time_sq, data=fall, subset=replica=="B") model_B

Call: lm(formula = height ~ time + time_sq, data = fall, subset = replica == "B") Coefficients: (Intercept) time time_sq 0.00377 -1.08373 -4.57079

Since we choose that at `time=0`

we have `height=0`

, we **know** that `(Intercept)`

is 0

In other words, our model here is \[y= B\cdot x + C\cdot x^2\] The old \(A\) has a fixed value of 0

How do we do that in R?

new_model_A <- lm(height ~ time + time_sq + 0, data=fall, subset=replica=="A") new_model_A

Call: lm(formula = height ~ time + time_sq + 0, data = fall, subset = replica == "A") Coefficients: time time_sq -1.57 -4.25

new_model_B <- lm(height ~ time + time_sq + 0, data=fall, subset=replica=="B") new_model_B

Call: lm(formula = height ~ time + time_sq + 0, data = fall, subset = replica == "B") Coefficients: time time_sq -1.06 -4.61

If you remember your physics lessons, you can recognize that in the formula \[y= B\cdot x + C\cdot x^2\] the value \(B\) is the *initial speed*, and \(C=-g/2\)

Thus, we can find the *gravitational acceleration* in the second coefficient

The *gravitational acceleration* is constant and independent of the replica

But the initial speed **depends** on the replica

We want to combine both replicas to get

- initial speed of each replica
- gravitational constant

combi <- lm(height ~ time:replica + time_sq + 0, data=fall) combi

Call: lm(formula = height ~ time:replica + time_sq + 0, data = fall) Coefficients: time_sq time:replicaA time:replicaB -4.46 -1.49 -1.12

The symbol `:`

means *interaction*

- Linear models are useful to extract scientific information from experimental data
- Can handle linear, exponential, power law and polynomial models
- Using
`+0`

in the formula forces the intercept to 0 - Using
`:`

allows us to combine two variables and measure their interaction

`survey`

datasurvey <- read.table("survey1-tidy.txt") colnames(survey)

[1] "Gender" "birth_day" [3] "birth_month" "birth_year" [5] "height_cm" "weight_kg" [7] "handness" "hand_span_cm"

plot(weight_kg~Gender, data=survey)

lm(weight_kg~Gender, data=survey)

Call: lm(formula = weight_kg ~ Gender, data = survey) Coefficients: (Intercept) GenderMale 57.8 18.7

model <- lm(weight_kg ~ Gender, data = survey)

Call: lm(formula = weight_kg ~ Gender, data = survey) Coefficients: (Intercept) GenderMale 57.8 18.7

Here `(Intercept)`

is “normal weight”, and `GenderMale`

is the extra weight when `Gender=='Male'`

(on average)

\[\text{weight_kg} = coef(model)[1] + coef(model)[2]\cdot\text{[Gender is 'Male']}\]

lm(weight_kg ~ Gender + 0, data = survey)

Call: lm(formula = weight_kg ~ Gender + 0, data = survey) Coefficients: GenderFemale GenderMale 57.8 76.6

Now the formula includes `+0`

. This means “do not use intercept”. Now

`GenderMale`

is the average weight when`Gender=='Male'`

`GenderFemale`

is the average weight when`Gender=='Female'`

Let’s start with the simplest model

plot(weight_kg ~ height_cm, data=survey, col=Gender) model1 <- lm(weight_kg ~ height_cm, data=survey) points(predict(model1) ~ height_cm, data=survey, col="green3", pch=16)

model1

Call: lm(formula = weight_kg ~ height_cm, data = survey) Coefficients: (Intercept) height_cm -81.504 0.862

Obviously the weight also depends on the gender

plot(weight_kg ~ height_cm, data=survey, col=Gender) model2 <- lm(weight_kg ~ height_cm + Gender+0, data = survey) points(predict(model2) ~ height_cm, data=survey, col="green3", pch=16)

model2

Call: lm(formula = weight_kg ~ height_cm + Gender + 0, data = survey) Coefficients: height_cm GenderFemale GenderMale 0.279 11.688 26.896

There are two lines with the same slope: is the coefficient for `height_cm`

The “intercept” of each line are the coefficients `GenderFemale`

and `GenderMale`

- Factors can be used in formulas
- The
*intercept*can be taken out using`+0`

in the formula - Variables can have interactions (synergy)
- Use
`:`

to indicate interaction

- Use
**Homework:**Learn about`*`

in formulas