Class 4: Practical examples

Systems Biology

Andrés Aravena, PhD

October 12, 2021

Early response to shock

The data for today is from the paper

“Putative MET30 and WD40/YVTN Proteins Regulate NaCl and High Temperature Stress in Legumes”

by Haluk Çelik, Andres Aravena and Neslihan Turgut Kara

submitted this year to Frontiers in Plant Science, section Plant Abiotic Stress

Goal/Question

To identify genes involved in early response to Salt and/or High Temperature stress in legumes

In particular, we selected 3 legume species and identified in them two clusters of ortholog of genes containing both an F-Box domain and a WD40 domain

How to answer the question

We tested if these genes have an early response to two separate stresses

Details

Three legumes:

  • Cicer arietinum a.k.a. chickpea
  • Medicago truncatula a.k.a. barrelclover, strong-spined medick, barrel medic
  • Phaseolus vulgaris a.k.a. common bean

Two Genes, named according to A. thaliana ortholog:

  • MET30
  • WD40/YT

Technical comment

Today we will see an application using R

Next class we will do the same in Excel

I’m using R version 4.1

It has a new way of writing complex formulas

Instead of writing this

summary(lm(y~x))

we can (if we want) write this

lm(y~x) |> summary()

We read “do the linear model AND THEN do summary”

Formally

primers represent all systematic technical distortions

After Normalization

Same written as a Formula

logFC(organism, time, tissue, stress, gene, replica)

The formula corresponds to the data table

Organism Time Tissue Stress Gene Replica logFC
P. vulgaris Control Root Heat MET30 Rep1 7.588
P. vulgaris Control Root Heat MET30 Rep2 7.044
P. vulgaris Control Root Heat MET30 Rep3 7.282
P. vulgaris Control Stem Heat MET30 Rep1 5.801
P. vulgaris Control Stem Heat MET30 Rep2 5.172
P. vulgaris Control Stem Heat MET30 Rep3 5.348

(we show only the first lines)

Descriptive plot

Typical summary plot

It should be like this

Or better, like this

Simple case

We compare one gene’s expression in Root for two conditions: Control and 1 Hour. First, we isolate the data

hs_1 <- subset(one_gene, Tissue=="Root" & Time!="2 hours")
hs_1
      Organism    Time Tissue Stress  Gene Replica logFC
1  P. vulgaris Control   Root   Heat MET30    Rep1 7.588
2  P. vulgaris Control   Root   Heat MET30    Rep2 7.044
3  P. vulgaris Control   Root   Heat MET30    Rep3 7.282
10 P. vulgaris  1 hour   Root   Heat MET30    Rep1 4.156
11 P. vulgaris  1 hour   Root   Heat MET30    Rep2 3.612
12 P. vulgaris  1 hour   Root   Heat MET30    Rep3 3.804

T test

t.test(hs_1$logFC[hs_1$Time=="Control"],
       hs_1$logFC[hs_1$Time=="1 hour"])

    Welch Two Sample t-test

data:  hs_1$logFC[hs_1$Time == "Control"] and hs_1$logFC[hs_1$Time == "1 hour"]
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.825462 4.069205
sample estimates:
mean of x mean of y 
 7.304667  3.857333 

Using a formula

t.test(logFC ~ Time, data=hs_1)

    Welch Two Sample t-test

data:  logFC by Time
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means between group Control and group 1 hour is not equal to 0
95 percent confidence interval:
 2.825462 4.069205
sample estimates:
mean in group Control  mean in group 1 hour 
             7.304667              3.857333 

We do not need extra data frames

t.test(logFC ~ Time, data=one_gene, subset=Tissue=="Root" & Time!="2 hours")

    Welch Two Sample t-test

data:  logFC by Time
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means between group Control and group 1 hour is not equal to 0
95 percent confidence interval:
 2.825462 4.069205
sample estimates:
mean in group Control  mean in group 1 hour 
             7.304667              3.857333 

With a linear model

lm(logFC ~ Time, data=one_gene, subset=Tissue=="Root" & Time!="2 hours") |>
 summary()

Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue == 
    "Root" & Time != "2 hours")

Residuals:
       1        2        3       10       11       12 
 0.28333 -0.26067 -0.02267  0.29867 -0.24533 -0.05333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.3047     0.1584   46.12 1.32e-06 ***
Time1 hour   -3.4473     0.2240  -15.39 0.000104 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2743 on 4 degrees of freedom
Multiple R-squared:  0.9834,    Adjusted R-squared:  0.9792 
F-statistic: 236.9 on 1 and 4 DF,  p-value: 0.000104

The p-values are the same

Doing Control versus 2 hours

t.test(logFC ~ Time, data=one_gene, subset=Tissue=="Root" & Time!="1 hour")

    Welch Two Sample t-test

data:  logFC by Time
t = 15.576, df = 2.6234, p-value = 0.00117
alternative hypothesis: true difference in means between group Control and group 2 hours is not equal to 0
95 percent confidence interval:
 5.140307 8.073026
sample estimates:
mean in group Control mean in group 2 hours 
             7.304667              0.698000 

With a linear model

lm(logFC ~ Time, data=one_gene, subset=Tissue=="Root" & Time!="1 hour") |>
 summary()

Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue == 
    "Root" & Time != "1 hour")

Residuals:
       1        2        3       19       20       21 
 0.28333 -0.26067 -0.02267  0.74400 -0.59600 -0.14800 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.3047     0.2999   24.36 1.69e-05 ***
Time2 hours  -6.6067     0.4241  -15.58 9.92e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5195 on 4 degrees of freedom
Multiple R-squared:  0.9838,    Adjusted R-squared:  0.9797 
F-statistic: 242.6 on 1 and 4 DF,  p-value: 9.918e-05

This time the p values are different. Why?

Linear model assumes equal variance

We do not know the population variance

We estimate it using our data

If we assume that all values come from the same population, then we can use this extra information to get a better standard error

(we are dividing by a larger \(n\))

T test with equal variance

t.test(logFC ~ Time, data=one_gene, var.equal=TRUE,
       subset=Tissue=="Root" & Time!="1 hour")

    Two Sample t-test

data:  logFC by Time
t = 15.576, df = 4, p-value = 9.918e-05
alternative hypothesis: true difference in means between group Control and group 2 hours is not equal to 0
95 percent confidence interval:
 5.429051 7.784282
sample estimates:
mean in group Control mean in group 2 hours 
             7.304667              0.698000 

Now the p-values are the same as in the linear model

Using all data to estimate variance

If we can assume that all results come from the same distribution, then we can use a linear model and calculate all at the same time

Using all data to estimate variance

lm(logFC ~ Time, data=one_gene, subset=Tissue=="Root") |> summary()

Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue == 
    "Root")

Residuals:
     Min       1Q   Median       3Q      Max 
-0.59600 -0.24533 -0.05333  0.28333  0.74400 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.3047     0.2616  27.925 1.40e-07 ***
Time1 hour   -3.4473     0.3699  -9.319 8.65e-05 ***
Time2 hours  -6.6067     0.3699 -17.859 1.98e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4531 on 6 degrees of freedom
Multiple R-squared:  0.9815,    Adjusted R-squared:  0.9754 
F-statistic: 159.6 on 2 and 6 DF,  p-value: 6.283e-06

ANOVA

Under the same hypotheses we can do a one-way ANOVA

aov(logFC ~ Time, data=one_gene, subset=Tissue=="Root") |> summary()
            Df Sum Sq Mean Sq F value   Pr(>F)    
Time         2  65.51   32.76   159.6 6.28e-06 ***
Residuals    6   1.23    0.21                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the F value and p value are the same as in the last line of the linear model (in previous slide)

Linear model does all

The linear model summary gives us

  • confidence interval for every coefficient
  • p-value for every coefficient
  • ANOVA

Why and why not ANOVA

Each p-value evaluates different null hypotheses

  • t test evaluates if each \(\beta_i=0\)
  • ANOVA evaluates if all \(\beta_i=0\) \[\beta_1=\beta_2=\cdots=\beta_n=0\] These are different questions

Input data

  • Long format, not wide
  • One row for each observation
  • One column for each variable
  • One column for each unit
  • Single values on each cell, either factors or numbers