December 17, 2019

Database growth

Plots: bad ideas

  • plot(sra):
    • plots all the data frame
  • plot(day~log(bases), data = sra):
    • sideways
  • plot(log(bases)~log(day), data=sra):
    • log-log, not semi-log

Good plots

  • plot(bases~day, data=sra, log="y")
    • Publication Quality
  • plot(log(bases) ~ day, data = sra)
    • Good for analysis

Meaning

sra <- read.table("sra_bases.txt", header=TRUE)
model <- lm(log(bases) ~ day, data=sra)
model
Call:
lm(formula = log(bases) ~ day, data = sra)

Coefficients:
(Intercept)          day  
   28.34146      0.00248  

Coefficients are log(bases)

In a semi-log model, we have \[\log(\text{bases}) = \underbrace{\log(\text{initial})}_{\texttt{coef(model)[1]}} + \underbrace{\log(\text{rate})}_{\texttt{coef(model)[2]}} \cdot \text{days}\]

Undoing log(), we have \[\text{bases} = \text{initial}\cdot\text{rate}^\text{days}\]

Thus, \(\text{rate}=\)exp(coef(model)[2])\(=1.0025\)

Meaning of rate

If \(\text{rate}=1.0025\) it means that the database grows \(0.25\%\) every day

  • That is \(147.5\%\) every year
  • That is \(295\%\) in two years

Predicting the future

To make predictions using the model, we use
predict(model, newdata)

We need a data frame with one column named day, because we used day in the formula

The last day in sra is max(sra$day)

Two years after the last will be max(sra$day) + 2 * 365

so newdata=data.frame(day = max(sra$day) + 2 * 365)

“Free fall” experiment

An ideal falling ball

height speed time
2.0000 -0.20 0.0001
1.9778 -0.69 0.0496
1.9310 -1.18 0.0999
1.8598 -1.67 0.1492
1.7640 -2.16 0.2004
1.6438 -2.65 0.2512
1.4990 -3.14 0.3002
1.3297 -3.63 0.3500
1.1360 -4.12 0.4000
0.9177 -4.61 0.4490
0.6750 -5.10 0.5008

 

What is the good formula this time?

We can use a logarithmic scale to see if we get a straight line

  • If we get straight line in semi-log then the formula is \[y=A\cdot B^x\]
  • If we get straight line in log-log then the formula is \[y=A\cdot x^B\]

Logarithmic scale

This is not any exponential

There are many possible functions that connect \(x\) and \(y\)

We will try a polynomial formula

A polynomial is a formula like \[y=A + B\cdot x + C\cdot x^2 +\cdots+ D\cdot x^n\]

The highest exponent is the degree of the polynomial

We will try a 2-degree formula \[\text{height}=A+B\cdot\text{time}+C\cdot\text{time}^2\]

A new column for polynomial model

We need to add an extra column, with the square of the time

ball$time_sq <- ball$time^2
head(ball)
  height speed    time   time_sq
1  2.000 -0.20 0.00010 1.000e-08
2  1.978 -0.69 0.04961 2.461e-03
3  1.931 -1.18 0.09986 9.972e-03
4  1.860 -1.67 0.14919 2.226e-02
5  1.764 -2.16 0.20037 4.015e-02
6  1.644 -2.65 0.25118 6.309e-02

We really have to add a new column

You may be tempted to use

model <- lm(height ~ time^2, data=ball)

But that does not work.

Exponents and multiplications in formulas have a different meaning

(that we will see later)

Linear model for polynomial

Now we use two independent variables in the model. This is different from the plot() function

model <- lm(height ~ time + time_sq, data=ball)
model
Call:
lm(formula = height ~ time + time_sq, data = ball)

Coefficients:
(Intercept)         time      time_sq  
      2.000       -0.195       -4.912  

Interpretation

We used the formula \[\text{height}=A+B\cdot\text{time}+C\cdot\text{time}^2\] The linear model gave us \[\begin{aligned} A&=2\\ B&=-0.195\\ C&=-4.9 \end{aligned}\]

Predicted v/s original

plot(height ~ time, data=ball)
lines(predict(model) ~ time, data=ball, col="red")

It works!

Our experiment

The data

fall <- read.table("free-fall.txt", header=TRUE)
fall
   msec position replica
1  2355       10       A
2  2457       20       A
3  2533       30       A
4  2558       40       A
5  2639       50       A
6  2690       60       A
7  2742       70       A
8  2742       80       A
9  2797       90       A
10 1837       10       B
11 1970       20       B
12 2022       30       B
13 2104       40       B
14 2129       50       B
15 2210       60       B
16 2237       70       B
17 2262       80       B
18 2313       90       B

First view

plot(position~msec,
     data=fall)

  • Initial time is not zero
  • position is upside down
  • We can change these values for the analysis

Time should start at zero

  • We started the timer before dropping the ball
  • The first data is when the ball crosses 10
  • We can choose this as our time reference
  • The ball is already moving
    • Initial speed is not zero
  • The column name is msec. It means milliseconds

Normalizing time data

Let’s add a new column, called time, with the correct value of seconds

fall$time <- fall$msec/1000

Now, for each replica, we change time to start at 0

old <- fall$time[fall$replica=="A"]
fall$time[fall$replica=="A"] <-  old - min(old)

old <- fall$time[fall$replica=="B"]
fall$time[fall$replica=="B"] <-  old - min(old)

Normalizing position data

We should also transform position to meters. The first data should be at 0

Each black/white strip was a little more than 19 centimeters

fall$height <- fall$position/10 * 0.192
old <- fall$height[fall$replica=="A"]
fall$height[fall$replica=="A"] <-  min(old) - old

old <- fall$height[fall$replica=="B"]
fall$height[fall$replica=="B"] <-  min(old) - old

The tidy data

fall[fall$replica=="A",]
  msec position replica  time height
1 2355       10       A 0.000  0.000
2 2457       20       A 0.102 -0.192
3 2533       30       A 0.178 -0.384
4 2558       40       A 0.203 -0.576
5 2639       50       A 0.284 -0.768
6 2690       60       A 0.335 -0.960
7 2742       70       A 0.387 -1.152
8 2742       80       A 0.387 -1.344
9 2797       90       A 0.442 -1.536

The tidy data

fall[fall$replica=="B",]
   msec position replica  time height
10 1837       10       B 0.000  0.000
11 1970       20       B 0.133 -0.192
12 2022       30       B 0.185 -0.384
13 2104       40       B 0.267 -0.576
14 2129       50       B 0.292 -0.768
15 2210       60       B 0.373 -0.960
16 2237       70       B 0.400 -1.152
17 2262       80       B 0.425 -1.344
18 2313       90       B 0.476 -1.536

Normalized data

plot(height~time, data=fall, pch=as.character(replica))

We will use a polinomial model

This is a parabola, which can be modeled with a second degree polynomial \[y=A + B\cdot x + C\cdot x^2\] When we use a polynomial model, we need to add new columns to our data frame

fall$time_sq <- fall$time^2

This is used only for the model

Fitting the model (replica “A”)

model_A <- lm(height ~ time + time_sq, 
            data=fall, subset=replica=="A")
model_A
Call:
lm(formula = height ~ time + time_sq, data = fall, subset = replica == 
    "A")

Coefficients:
(Intercept)         time      time_sq  
   -0.00131     -1.55751     -4.26664  

Fitting the model (replica “B”)

model_B <- lm(height ~ time + time_sq, 
            data=fall, subset=replica=="B")
model_B
Call:
lm(formula = height ~ time + time_sq, data = fall, subset = replica == 
    "B")

Coefficients:
(Intercept)         time      time_sq  
    0.00377     -1.08373     -4.57079  

(Intercept) should be 0

Since we choose that at time=0 we have height=0, we know that (Intercept) is 0

In other words, our model here is \[y= B\cdot x + C\cdot x^2\] The old \(A\) has a fixed value of 0

How do we do that in R?

Fixing the intercept to 0

new_model_A <- lm(height ~ time + time_sq + 0, 
            data=fall, subset=replica=="A")
new_model_A
Call:
lm(formula = height ~ time + time_sq + 0, data = fall, subset = replica == 
    "A")

Coefficients:
   time  time_sq  
  -1.57    -4.25  

Fixing the intercept to 0 (B)

new_model_B <- lm(height ~ time + time_sq + 0, 
            data=fall, subset=replica=="B")
new_model_B
Call:
lm(formula = height ~ time + time_sq + 0, data = fall, subset = replica == 
    "B")

Coefficients:
   time  time_sq  
  -1.06    -4.61  

Meaning of the coefficients

If you remember your physics lessons, you can recognize that in the formula \[y= B\cdot x + C\cdot x^2\] the value \(B\) is the initial speed, and \(C=-g/2\)

Thus, we can find the gravitational acceleration in the second coefficient

Combining both replicas

The gravitational acceleration is constant and independent of the replica

But the initial speed depends on the replica

We want to combine both replicas to get

  • initial speed of each replica
  • gravitational constant

Models with factors

combi <- lm(height ~ time:replica + time_sq + 0, 
            data=fall)
combi
Call:
lm(formula = height ~ time:replica + time_sq + 0, data = fall)

Coefficients:
      time_sq  time:replicaA  time:replicaB  
        -4.46          -1.49          -1.12  

The symbol : means interaction

Summary

  • Linear models are useful to extract scientific information from experimental data
  • Can handle linear, exponential, power law and polynomial models
  • Using +0 in the formula forces the intercept to 0
  • Using : allows us to combine two variables and measure their interaction

Application

With survey data

survey <- read.table("survey1-tidy.txt")
colnames(survey)
[1] "Gender"       "birth_day"   
[3] "birth_month"  "birth_year"  
[5] "height_cm"    "weight_kg"   
[7] "handness"     "hand_span_cm"

Boys weight_kg more than girls

plot(weight_kg~Gender,
    data=survey)

lm(weight_kg~Gender,
    data=survey)
Call:
lm(formula = weight_kg ~ Gender, data = survey)

Coefficients:
(Intercept)   GenderMale  
       57.8         18.7  

Models with factors

model <- lm(weight_kg ~ Gender, data = survey)
Call:
lm(formula = weight_kg ~ Gender, data = survey)

Coefficients:
(Intercept)   GenderMale  
       57.8         18.7  

Here (Intercept) is “normal weight”, and GenderMale is the extra weight when Gender=='Male' (on average)

\[\text{weight_kg} = coef(model)[1] + coef(model)[2]\cdot\text{[Gender is 'Male']}\]

Model with factor, without intercept

lm(weight_kg ~ Gender + 0, data = survey)
Call:
lm(formula = weight_kg ~ Gender + 0, data = survey)

Coefficients:
GenderFemale    GenderMale  
        57.8          76.6  

Now the formula includes +0. This means “do not use intercept”. Now

  • GenderMale is the average weight when Gender=='Male'
  • GenderFemale is the average weight when Gender=='Female'

Weight v/s Height

Let’s start with the simplest model

plot(weight_kg ~ height_cm, data=survey, col=Gender)
model1 <- lm(weight_kg ~ height_cm, data=survey)
points(predict(model1) ~ height_cm, data=survey, col="green3", pch=16)

Interpretation

model1
Call:
lm(formula = weight_kg ~ height_cm, data = survey)

Coefficients:
(Intercept)    height_cm  
    -81.504        0.862  

Including the Gender factor

Obviously the weight also depends on the gender

plot(weight_kg ~ height_cm, data=survey, col=Gender)
model2 <- lm(weight_kg ~ height_cm + Gender+0, data = survey)
points(predict(model2) ~ height_cm, data=survey, col="green3", pch=16)

Interpretation

model2
Call:
lm(formula = weight_kg ~ height_cm + Gender + 0, data = survey)

Coefficients:
   height_cm  GenderFemale    GenderMale  
       0.279        11.688        26.896  

There are two lines with the same slope: is the coefficient for height_cm

The “intercept” of each line are the coefficients GenderFemale and GenderMale

Summary

  • Factors can be used in formulas
  • The intercept can be taken out using +0 in the formula
  • Variables can have interactions (synergy)
    • Use : to indicate interaction
  • Homework: Learn about * in formulas