The data for today is from the paper

“Putative MET30 and WD40/YVTN Proteins Regulate NaCl and High Temperature Stress in Legumes”

by Haluk Çelik, Andres Aravena and Neslihan Turgut Kara

submitted this year to Frontiers in Plant Science, section Plant Abiotic Stress

To identify genes involved in early response to Salt and/or High Temperature stress in legumes

In particular, we selected 3 legume species and identified in them two clusters of ortholog of genes containing both an F-Box domain and a WD40 domain

We tested if these genes have an early response to two separate stresses

Three legumes:

*Cicer arietinum*a.k.a. chickpea*Medicago truncatula*a.k.a. barrelclover, strong-spined medick, barrel medic*Phaseolus vulgaris*a.k.a. common bean

Two Genes, named according to *A. thaliana* ortholog:

- MET30
- WD40/YT

Today we will see an application using R

Next class we will do the same in Excel

It has a new way of writing complex formulas

Instead of writing this

we can (if we want) write this

We read “do the linear model AND THEN do summary”

*primers* represent all systematic technical distortions

\[ logFC(organism, time, tissue, stress, gene, replica) \]

The formula corresponds to the data table

Organism | Time | Tissue | Stress | Gene | Replica | logFC |
---|---|---|---|---|---|---|

P. vulgaris | Control | Root | Heat | MET30 | Rep1 | 7.588 |

P. vulgaris | Control | Root | Heat | MET30 | Rep2 | 7.044 |

P. vulgaris | Control | Root | Heat | MET30 | Rep3 | 7.282 |

P. vulgaris | Control | Stem | Heat | MET30 | Rep1 | 5.801 |

P. vulgaris | Control | Stem | Heat | MET30 | Rep2 | 5.172 |

P. vulgaris | Control | Stem | Heat | MET30 | Rep3 | 5.348 |

(we show only the first lines)

We compare one gene’s expression in Root for two conditions:
*Control* and *1 Hour*. First, we isolate the data

```
Organism Time Tissue Stress Gene Replica logFC
1 P. vulgaris Control Root Heat MET30 Rep1 7.588
2 P. vulgaris Control Root Heat MET30 Rep2 7.044
3 P. vulgaris Control Root Heat MET30 Rep3 7.282
10 P. vulgaris 1 hour Root Heat MET30 Rep1 4.156
11 P. vulgaris 1 hour Root Heat MET30 Rep2 3.612
12 P. vulgaris 1 hour Root Heat MET30 Rep3 3.804
```

```
Welch Two Sample t-test
data: hs_1$logFC[hs_1$Time == "Control"] and hs_1$logFC[hs_1$Time == "1 hour"]
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean of x mean of y
7.304667 3.857333
```

```
Welch Two Sample t-test
data: logFC by Time
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means between group Control and group 1 hour is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean in group Control mean in group 1 hour
7.304667 3.857333
```

```
Welch Two Sample t-test
data: logFC by Time
t = 15.392, df = 3.9995, p-value = 0.000104
alternative hypothesis: true difference in means between group Control and group 1 hour is not equal to 0
95 percent confidence interval:
2.825462 4.069205
sample estimates:
mean in group Control mean in group 1 hour
7.304667 3.857333
```

```
Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root" & Time != "2 hours")
Residuals:
1 2 3 10 11 12
0.28333 -0.26067 -0.02267 0.29867 -0.24533 -0.05333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3047 0.1584 46.12 1.32e-06 ***
Time1 hour -3.4473 0.2240 -15.39 0.000104 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2743 on 4 degrees of freedom
Multiple R-squared: 0.9834, Adjusted R-squared: 0.9792
F-statistic: 236.9 on 1 and 4 DF, p-value: 0.000104
```

The p-values are the same

```
Welch Two Sample t-test
data: logFC by Time
t = 15.576, df = 2.6234, p-value = 0.00117
alternative hypothesis: true difference in means between group Control and group 2 hours is not equal to 0
95 percent confidence interval:
5.140307 8.073026
sample estimates:
mean in group Control mean in group 2 hours
7.304667 0.698000
```

```
Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root" & Time != "1 hour")
Residuals:
1 2 3 19 20 21
0.28333 -0.26067 -0.02267 0.74400 -0.59600 -0.14800
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3047 0.2999 24.36 1.69e-05 ***
Time2 hours -6.6067 0.4241 -15.58 9.92e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5195 on 4 degrees of freedom
Multiple R-squared: 0.9838, Adjusted R-squared: 0.9797
F-statistic: 242.6 on 1 and 4 DF, p-value: 9.918e-05
```

This time the p values are different. Why?

We do not know the population variance

We estimate it using our data

If we *assume* that all values come from the same population,
then we can use this extra information to get a better standard
error

(we are dividing by a larger \(n\))

```
Two Sample t-test
data: logFC by Time
t = 15.576, df = 4, p-value = 9.918e-05
alternative hypothesis: true difference in means between group Control and group 2 hours is not equal to 0
95 percent confidence interval:
5.429051 7.784282
sample estimates:
mean in group Control mean in group 2 hours
7.304667 0.698000
```

Now the p-values are the same as in the linear model

If we can assume that all results come from the same distribution, then we can use a linear model and calculate all at the same time

```
Call:
lm(formula = logFC ~ Time, data = one_gene, subset = Tissue ==
"Root")
Residuals:
Min 1Q Median 3Q Max
-0.59600 -0.24533 -0.05333 0.28333 0.74400
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3047 0.2616 27.925 1.40e-07 ***
Time1 hour -3.4473 0.3699 -9.319 8.65e-05 ***
Time2 hours -6.6067 0.3699 -17.859 1.98e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4531 on 6 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9754
F-statistic: 159.6 on 2 and 6 DF, p-value: 6.283e-06
```

Under the same hypotheses we can do a one-way ANOVA

```
Df Sum Sq Mean Sq F value Pr(>F)
Time 2 65.51 32.76 159.6 6.28e-06 ***
Residuals 6 1.23 0.21
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Notice that the F value and p value are the same as in the last line of the linear model (in previous slide)

The linear model summary gives us

- confidence interval for every coefficient
- p-value for every coefficient
- ANOVA

Each p-value evaluates different *null* hypotheses

- t test evaluates if
*each*\(\beta_i=0\) - ANOVA evaluates if
*all*\(\beta_i=0\) \[\beta_1=\beta_2=\cdots=\beta_n=0\] These are different questions

- Long format, not wide
- One row for each observation
- One column for each variable
- One column for each unit
- Single values on each cell, either factors or numbers