# Systems Biology

## Early response to shock

The data for today is from the paper

“Putative MET30 and WD40/YVTN Proteins Regulate NaCl and High Temperature Stress in Legumes”

by Haluk Çelik, Andres Aravena and Neslihan Turgut Kara

submitted this year to Frontiers in Plant Science, section Plant Abiotic Stress

## Goal/Question

To identify genes involved in early response to Salt and/or High Temperature stress in legumes

In particular, we selected 3 legume species and identified in them two clusters of ortholog of genes containing both an F-Box domain and a WD40 domain

## How to answer the question

We tested if these genes have an early response to two separate stresses

## Details

Three legumes:

• Cicer arietinum a.k.a. chickpea
• Medicago truncatula a.k.a. barrelclover, strong-spined medick, barrel medic
• Phaseolus vulgaris a.k.a. common bean

Two Genes, named according to A. thaliana ortholog:

• MET30
• WD40/YT

## Technical comment

Today we will see an application using R

Next class we will do the same in Excel

## I’m using R version 4.1

It has a new way of writing complex formulas

summary(lm(y~x))

we can (if we want) write this

lm(y~x) |> summary()

We read “do the linear model AND THEN do summary”

## Formally

primers represent all systematic technical distortions

## Same written as a Formula

logFC(organism, time, tissue, stress, gene, replica)

The formula corresponds to the data table

Organism Time Tissue Stress Gene Replica logFC
P. vulgaris Control Root Heat MET30 Rep1 7.588
P. vulgaris Control Root Heat MET30 Rep2 7.044
P. vulgaris Control Root Heat MET30 Rep3 7.282
P. vulgaris Control Stem Heat MET30 Rep1 5.801
P. vulgaris Control Stem Heat MET30 Rep2 5.172
P. vulgaris Control Stem Heat MET30 Rep3 5.348

(we show only the first lines)

## Simple case

We compare one gene’s expression in Root for two conditions: Control and 1 Hour. First, we isolate the data

hs_1 <- subset(one_gene, Tissue=="Root" & Time!="2 hours")
hs_1
Organism    Time Tissue Stress  Gene Replica logFC
1  P. vulgaris Control   Root   Heat MET30    Rep1 7.588
2  P. vulgaris Control   Root   Heat MET30    Rep2 7.044
3  P. vulgaris Control   Root   Heat MET30    Rep3 7.282
10 P. vulgaris  1 hour   Root   Heat MET30    Rep1 4.156
11 P. vulgaris  1 hour   Root   Heat MET30    Rep2 3.612
12 P. vulgaris  1 hour   Root   Heat MET30    Rep3 3.804

## T test

t.test(hs_1$logFC[hs_1$Time=="Control"],
hs_1$logFC[hs_1$Time=="1 hour"])

Welch Two Sample t-test

data:  hs_1$logFC[hs_1$Time == "Control"] and hs_1$logFC[hs_1$Time == "1 hour"]
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean of x mean of y
7.304667  3.857333

## Using a formula

t.test(logFC ~ Time, data=hs_1)

Welch Two Sample t-test

data:  logFC by Time
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means between group Control and group 1 hour is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean in group Control  mean in group 1 hour
7.304667              3.857333

## We do not need extra data frames

t.test(logFC ~ Time, data=one_gene, subset=Tissue=="Root" & Time!="2 hours")

Welch Two Sample t-test

data:  logFC by Time
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means between group Control and group 1 hour is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean in group Control  mean in group 1 hour
7.304667              3.857333

## With a linear model

lm(logFC ~ Time, data=one_gene, subset=Tissue=="Root" & Time!="2 hours") |>
summary()

Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root" & Time != "2 hours")

Residuals:
1        2        3       10       11       12
0.28333 -0.26067 -0.02267  0.29867 -0.24533 -0.05333

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.3047     0.1584   46.12 1.32e-06 ***
Time1 hour   -3.4473     0.2240  -15.39 0.000104 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2743 on 4 degrees of freedom
Multiple R-squared:  0.9834,    Adjusted R-squared:  0.9792
F-statistic: 236.9 on 1 and 4 DF,  p-value: 0.000104

The p-values are the same

## Doing Control versus 2 hours

t.test(logFC ~ Time, data=one_gene, subset=Tissue=="Root" & Time!="1 hour")

Welch Two Sample t-test

data:  logFC by Time
t = 15.576, df = 2.6234, p-value = 0.00117
alternative hypothesis: true difference in means between group Control and group 2 hours is not equal to 0
95 percent confidence interval:
5.140307 8.073026
sample estimates:
mean in group Control mean in group 2 hours
7.304667              0.698000

## With a linear model

lm(logFC ~ Time, data=one_gene, subset=Tissue=="Root" & Time!="1 hour") |>
summary()

Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root" & Time != "1 hour")

Residuals:
1        2        3       19       20       21
0.28333 -0.26067 -0.02267  0.74400 -0.59600 -0.14800

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.3047     0.2999   24.36 1.69e-05 ***
Time2 hours  -6.6067     0.4241  -15.58 9.92e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5195 on 4 degrees of freedom
Multiple R-squared:  0.9838,    Adjusted R-squared:  0.9797
F-statistic: 242.6 on 1 and 4 DF,  p-value: 9.918e-05

This time the p values are different. Why?

## Linear model assumes equal variance

We do not know the population variance

We estimate it using our data

If we assume that all values come from the same population, then we can use this extra information to get a better standard error

(we are dividing by a larger $$n$$)

## T test with equal variance

t.test(logFC ~ Time, data=one_gene, var.equal=TRUE,
subset=Tissue=="Root" & Time!="1 hour")

Two Sample t-test

data:  logFC by Time
t = 15.576, df = 4, p-value = 9.918e-05
alternative hypothesis: true difference in means between group Control and group 2 hours is not equal to 0
95 percent confidence interval:
5.429051 7.784282
sample estimates:
mean in group Control mean in group 2 hours
7.304667              0.698000

Now the p-values are the same as in the linear model

## Using all data to estimate variance

If we can assume that all results come from the same distribution, then we can use a linear model and calculate all at the same time

## Using all data to estimate variance

lm(logFC ~ Time, data=one_gene, subset=Tissue=="Root") |> summary()

Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root")

Residuals:
Min       1Q   Median       3Q      Max
-0.59600 -0.24533 -0.05333  0.28333  0.74400

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.3047     0.2616  27.925 1.40e-07 ***
Time1 hour   -3.4473     0.3699  -9.319 8.65e-05 ***
Time2 hours  -6.6067     0.3699 -17.859 1.98e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4531 on 6 degrees of freedom
Multiple R-squared:  0.9815,    Adjusted R-squared:  0.9754
F-statistic: 159.6 on 2 and 6 DF,  p-value: 6.283e-06

## ANOVA

Under the same hypotheses we can do a one-way ANOVA

aov(logFC ~ Time, data=one_gene, subset=Tissue=="Root") |> summary()
Df Sum Sq Mean Sq F value   Pr(>F)
Time         2  65.51   32.76   159.6 6.28e-06 ***
Residuals    6   1.23    0.21
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the F value and p value are the same as in the last line of the linear model (in previous slide)

## Linear model does all

The linear model summary gives us

• confidence interval for every coefficient
• p-value for every coefficient
• ANOVA

## Why and why not ANOVA

Each p-value evaluates different null hypotheses

• t test evaluates if each $$\beta_i=0$$
• ANOVA evaluates if all $$\beta_i=0$$ $\beta_1=\beta_2=\cdots=\beta_n=0$ These are different questions

## Input data

• Long format, not wide
• One row for each observation
• One column for each variable
• One column for each unit
• Single values on each cell, either factors or numbers