December 17, 2019

## Database growth

• plot(sra):
• plots all the data frame
• plot(day~log(bases), data = sra):
• sideways
• plot(log(bases)~log(day), data=sra):
• log-log, not semi-log

## Good plots

• plot(bases~day, data=sra, log="y")
• Publication Quality
• plot(log(bases) ~ day, data = sra)
• Good for analysis

## Meaning

sra <- read.table("sra_bases.txt", header=TRUE)
model <- lm(log(bases) ~ day, data=sra)
model
Call:
lm(formula = log(bases) ~ day, data = sra)

Coefficients:
(Intercept)          day
28.34146      0.00248  

## Coefficients are log(bases)

In a semi-log model, we have $\log(\text{bases}) = \underbrace{\log(\text{initial})}_{\texttt{coef(model)}} + \underbrace{\log(\text{rate})}_{\texttt{coef(model)}} \cdot \text{days}$

Undoing log(), we have $\text{bases} = \text{initial}\cdot\text{rate}^\text{days}$

Thus, $$\text{rate}=$$exp(coef(model))$$=1.0025$$

## Meaning of rate

If $$\text{rate}=1.0025$$ it means that the database grows $$0.25\%$$ every day

• That is $$147.5\%$$ every year
• That is $$295\%$$ in two years

## Predicting the future

To make predictions using the model, we use
predict(model, newdata)

We need a data frame with one column named day, because we used day in the formula

The last day in sra is max(sra$day) Two years after the last will be max(sra$day) + 2 * 365

so newdata=data.frame(day = max(sra$day) + 2 * 365) ## “Free fall” experiment ## An ideal falling ball height speed time 2.0000 -0.20 0.0001 1.9778 -0.69 0.0496 1.9310 -1.18 0.0999 1.8598 -1.67 0.1492 1.7640 -2.16 0.2004 1.6438 -2.65 0.2512 1.4990 -3.14 0.3002 1.3297 -3.63 0.3500 1.1360 -4.12 0.4000 0.9177 -4.61 0.4490 0.6750 -5.10 0.5008 ## What is the good formula this time? We can use a logarithmic scale to see if we get a straight line • If we get straight line in semi-log then the formula is $y=A\cdot B^x$ • If we get straight line in log-log then the formula is $y=A\cdot x^B$ ## Logarithmic scale  ## This is not any exponential There are many possible functions that connect $$x$$ and $$y$$ We will try a polynomial formula A polynomial is a formula like $y=A + B\cdot x + C\cdot x^2 +\cdots+ D\cdot x^n$ The highest exponent is the degree of the polynomial We will try a 2-degree formula $\text{height}=A+B\cdot\text{time}+C\cdot\text{time}^2$ ## A new column for polynomial model We need to add an extra column, with the square of the time ball$time_sq <- balltime^2 head(ball)  height speed time time_sq 1 2.000 -0.20 0.00010 1.000e-08 2 1.978 -0.69 0.04961 2.461e-03 3 1.931 -1.18 0.09986 9.972e-03 4 1.860 -1.67 0.14919 2.226e-02 5 1.764 -2.16 0.20037 4.015e-02 6 1.644 -2.65 0.25118 6.309e-02 ## We really have to add a new column You may be tempted to use model <- lm(height ~ time^2, data=ball) But that does not work. Exponents and multiplications in formulas have a different meaning (that we will see later) ## Linear model for polynomial Now we use two independent variables in the model. This is different from the plot() function model <- lm(height ~ time + time_sq, data=ball) model Call: lm(formula = height ~ time + time_sq, data = ball) Coefficients: (Intercept) time time_sq 2.000 -0.195 -4.912  ## Interpretation We used the formula $\text{height}=A+B\cdot\text{time}+C\cdot\text{time}^2$ The linear model gave us \begin{aligned} A&=2\\ B&=-0.195\\ C&=-4.9 \end{aligned} ## Predicted v/s original plot(height ~ time, data=ball) lines(predict(model) ~ time, data=ball, col="red") ## It works! ## Our experiment ## The data fall <- read.table("free-fall.txt", header=TRUE) fall  msec position replica 1 2355 10 A 2 2457 20 A 3 2533 30 A 4 2558 40 A 5 2639 50 A 6 2690 60 A 7 2742 70 A 8 2742 80 A 9 2797 90 A 10 1837 10 B 11 1970 20 B 12 2022 30 B 13 2104 40 B 14 2129 50 B 15 2210 60 B 16 2237 70 B 17 2262 80 B 18 2313 90 B ## First view plot(position~msec, data=fall) • Initial time is not zero • position is upside down • We can change these values for the analysis ## Time should start at zero • We started the timer before dropping the ball • The first data is when the ball crosses 10 • We can choose this as our time reference • The ball is already moving • Initial speed is not zero • The column name is msec. It means milliseconds ## Normalizing time data Let’s add a new column, called time, with the correct value of seconds falltime <- fall$msec/1000 Now, for each replica, we change time to start at 0 old <- fall$time[fall$replica=="A"] fall$time[fall$replica=="A"] <- old - min(old) old <- fall$time[fall$replica=="B"] fall$time[fall$replica=="B"] <- old - min(old) ## Normalizing position data We should also transform position to meters. The first data should be at 0 Each black/white strip was a little more than 19 centimeters fall$height <- fall$position/10 * 0.192 old <- fall$height[fall$replica=="A"] fall$height[fall$replica=="A"] <- min(old) - old old <- fall$height[fall$replica=="B"] fall$height[fall$replica=="B"] <- min(old) - old ## The tidy data fall[fall$replica=="A",]
  msec position replica  time height
1 2355       10       A 0.000  0.000
2 2457       20       A 0.102 -0.192
3 2533       30       A 0.178 -0.384
4 2558       40       A 0.203 -0.576
5 2639       50       A 0.284 -0.768
6 2690       60       A 0.335 -0.960
7 2742       70       A 0.387 -1.152
8 2742       80       A 0.387 -1.344
9 2797       90       A 0.442 -1.536

fall[fall$replica=="B",]  msec position replica time height 10 1837 10 B 0.000 0.000 11 1970 20 B 0.133 -0.192 12 2022 30 B 0.185 -0.384 13 2104 40 B 0.267 -0.576 14 2129 50 B 0.292 -0.768 15 2210 60 B 0.373 -0.960 16 2237 70 B 0.400 -1.152 17 2262 80 B 0.425 -1.344 18 2313 90 B 0.476 -1.536 ## Normalized data plot(height~time, data=fall, pch=as.character(replica)) ## We will use a polinomial model This is a parabola, which can be modeled with a second degree polynomial $y=A + B\cdot x + C\cdot x^2$ When we use a polynomial model, we need to add new columns to our data frame fall$time_sq <- fall\$time^2

This is used only for the model

## Fitting the model (replica “A”)

model_A <- lm(height ~ time + time_sq,
data=fall, subset=replica=="A")
model_A
Call:
lm(formula = height ~ time + time_sq, data = fall, subset = replica ==
"A")

Coefficients:
(Intercept)         time      time_sq
-0.00131     -1.55751     -4.26664  

## Fitting the model (replica “B”)

model_B <- lm(height ~ time + time_sq,
data=fall, subset=replica=="B")
model_B
Call:
lm(formula = height ~ time + time_sq, data = fall, subset = replica ==
"B")

Coefficients:
(Intercept)         time      time_sq
0.00377     -1.08373     -4.57079  

## (Intercept) should be 0

Since we choose that at time=0 we have height=0, we know that (Intercept) is 0

In other words, our model here is $y= B\cdot x + C\cdot x^2$ The old $$A$$ has a fixed value of 0

How do we do that in R?

## Fixing the intercept to 0

new_model_A <- lm(height ~ time + time_sq + 0,
data=fall, subset=replica=="A")
new_model_A
Call:
lm(formula = height ~ time + time_sq + 0, data = fall, subset = replica ==
"A")

Coefficients:
time  time_sq
-1.57    -4.25  

## Fixing the intercept to 0 (B)

new_model_B <- lm(height ~ time + time_sq + 0,
data=fall, subset=replica=="B")
new_model_B
Call:
lm(formula = height ~ time + time_sq + 0, data = fall, subset = replica ==
"B")

Coefficients:
time  time_sq
-1.06    -4.61  

## Meaning of the coefficients

If you remember your physics lessons, you can recognize that in the formula $y= B\cdot x + C\cdot x^2$ the value $$B$$ is the initial speed, and $$C=-g/2$$

Thus, we can find the gravitational acceleration in the second coefficient

## Combining both replicas

The gravitational acceleration is constant and independent of the replica

But the initial speed depends on the replica

We want to combine both replicas to get

• initial speed of each replica
• gravitational constant

## Models with factors

combi <- lm(height ~ time:replica + time_sq + 0,
data=fall)
combi
Call:
lm(formula = height ~ time:replica + time_sq + 0, data = fall)

Coefficients:
time_sq  time:replicaA  time:replicaB
-4.46          -1.49          -1.12  

The symbol : means interaction

## Summary

• Linear models are useful to extract scientific information from experimental data
• Can handle linear, exponential, power law and polynomial models
• Using +0 in the formula forces the intercept to 0
• Using : allows us to combine two variables and measure their interaction

## With survey data

survey <- read.table("survey1-tidy.txt")
colnames(survey)
 "Gender"       "birth_day"
 "birth_month"  "birth_year"
 "height_cm"    "weight_kg"
 "handness"     "hand_span_cm"

## Boys weight_kg more than girls

plot(weight_kg~Gender,
data=survey) lm(weight_kg~Gender,
data=survey)
Call:
lm(formula = weight_kg ~ Gender, data = survey)

Coefficients:
(Intercept)   GenderMale
57.8         18.7  

## Models with factors

model <- lm(weight_kg ~ Gender, data = survey)
Call:
lm(formula = weight_kg ~ Gender, data = survey)

Coefficients:
(Intercept)   GenderMale
57.8         18.7  

Here (Intercept) is “normal weight”, and GenderMale is the extra weight when Gender=='Male' (on average)

$\text{weight_kg} = coef(model) + coef(model)\cdot\text{[Gender is 'Male']}$

## Model with factor, without intercept

lm(weight_kg ~ Gender + 0, data = survey)
Call:
lm(formula = weight_kg ~ Gender + 0, data = survey)

Coefficients:
GenderFemale    GenderMale
57.8          76.6  

Now the formula includes +0. This means “do not use intercept”. Now

• GenderMale is the average weight when Gender=='Male'
• GenderFemale is the average weight when Gender=='Female'

## Weight v/s Height

plot(weight_kg ~ height_cm, data=survey, col=Gender)
model1 <- lm(weight_kg ~ height_cm, data=survey)
points(predict(model1) ~ height_cm, data=survey, col="green3", pch=16) ## Interpretation

model1
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm
-81.504        0.862  

## Including the Gender factor

Obviously the weight also depends on the gender

plot(weight_kg ~ height_cm, data=survey, col=Gender)
model2 <- lm(weight_kg ~ height_cm + Gender+0, data = survey)
points(predict(model2) ~ height_cm, data=survey, col="green3", pch=16) ## Interpretation

model2
Call:
lm(formula = weight_kg ~ height_cm + Gender + 0, data = survey)

Coefficients:
height_cm  GenderFemale    GenderMale
0.279        11.688        26.896  

There are two lines with the same slope: is the coefficient for height_cm

The “intercept” of each line are the coefficients GenderFemale and GenderMale

## Summary

• Factors can be used in formulas
• The intercept can be taken out using +0 in the formula
• Variables can have interactions (synergy)
• Use : to indicate interaction
• Homework: Learn about * in formulas