*Please download the file homework11.R and write your
results there. Send the your answers to my
mailbox.*

# 1. GC content

Now that we have more powerful tools, we will explore again the GC
content of *Carsonella rudii*. First, we will see that GC content
can be seen as the average of an event.

## 1.1 Is this nucleotide G or C?

Please write a function called `is_G_or_C`

that determines
if a nucleotide is either *“G”* or *“C”*. The function
takes a *character vector* (called `dna`

) and returns
a *logic vector* called `ans`

. If `dna[i]`

is *“G”, “g”, “C”* or *“c”*, then `ans[i]`

is
`TRUE`

, otherwise it is `FALSE`

.

```
<- function(dna) {
is_G_or_C # write here
}
```

Test your function with the following code:

`is_G_or_C(c("a","c","t","g"))`

`[1] FALSE TRUE FALSE TRUE`

`is_G_or_C(c("A","C","T","G"))`

`[1] FALSE TRUE FALSE TRUE`

`mean(is_G_or_C(c("A","C","T","G")))`

`[1] 0.5`

As you can see, the GC content can be seen as the *mean* of
the event `is_G_or_C()`

. It is natural to think that we can
also calculate the variance and the standard deviation. How can we do
that?

## 1.2 Simulating some genes

We will use this function over simulated data. We need 100 genes of
different lengths. Please create a *vector*, called
`sim_gene_length`

, with 100 elements taken randomly (with
replacement) from the values 100, 103, 106, …, 1201. Then create a
*list*, called `sim_genes`

, with “random DNA
sequences”Notice that `sim_gene_length`

is a
*vector*, since all elements are numbers. On the other side
`sim_genes`

has to be a *list*, since every element is
itself a vector, and each one can have different length. We cannot
handle this kind of data in a vector, it has to be a list.

. Each element `sim_genes[[i]]`

must be a
random character vector with length `sim_gene_length[i]`

.
Only the letters “A”,“C”,“G”, and “T” are valid, and all have the same
probability.

After you have created `sim_genes`

, calculate the
*mean* of `is_G_or_C()`

for each gene and store them
in the vector `sim_mean_GC`

. You also need to calculate the
*variance* of `is_G_or_C()`

for each gene and store
the result in `sim_var_GC`

.

Finally, calculate the *standard error of the mean* by taking
the square root of `sim_var_GC`

divided by
`sim_gene_length`

, and store it in
`sim_std_err_mean`

.

`# write here`

Test your results with this code:

```
par(mfrow=c(2,2))
hist(sim_mean_GC, col="grey")
hist(sim_var_GC, col="grey")
hist(sim_gene_length, col="grey")
hist(sim_std_err_mean, col="grey")
```

## 1.3 Genes versus genome

Since all genes are taking from a random population, and all
nucleotides have the same probability, then **in this
case** we know that the *genome GC content*do you agree?

is 0.5.

But in many cases we do not have the full genome, just some genes.
**We want use the information of each gene to build a confidence
interval for the genome GC content.**

Fortunately we know that, if the genes are random DNA sequences, the
GC content of the genes is not too different from the GC content of the
genomeWe know this because GC content is an average so they
obey the Law of Large Numbers.

. Moreover, we know that the GC content of the genes is a
random variable following the Normal distributionThis is true because of the Central Limit
Theorem.

.

Unfortunately, we do not know the standard deviation of the
population (we are using each gene as a sample). Instead, we can
approximate it with the standard error of each geneStandard error of the sample is the standard deviation
of the sample divided by the square root of the sample size \[\sqrt{\frac{\text{var}(x)}{\text{length}(x)}}\]

, but we will have to use the *Student’s t*
distribution.

We start by choosing the confidence level \(1-\alpha=95\%\), therefore
`alpha`

is 0.05. You must use the function `qt()`

to find *the number of standard errors* `k`

that will
give you the confidence interval. Remember that the number of
*degrees of freedom* is `sim_gene_length-1`

.

Then you must calculate the vectors `sim_lower`

, with the
lower limits of each confidence interval, and `sim_upper`

,
with the upper limits.

```
<- 0.05
alpha # write here
```

Now you can count how many of the confidence intervals really contained the real value. In other words, you can see how good was the prediction.

`mean(sim_lower <= 0.5 & 0.5 <= sim_upper)`

`[1] 0.96`

You can see your results with this code:

```
<- 1:length(sim_genes)
gene_order plot(gene_order, sim_mean_GC, pch=16, ylim=c(0.3, 0.7))
arrows(gene_order, sim_upper, gene_order, sim_lower,
angle=90, code=3, length = 0.02)
abline(h=0.5, col="red")
```

The previous plot is hard to see, because there is a lot of
variability. It can be a little easier if we use the function
`order()`

to sort the values, as in the following code:

```
<- order(sim_mean_GC)
sorted plot(gene_order, sim_mean_GC[sorted], pch=16, ylim=c(0.3, 0.7))
arrows(gene_order, sim_upper[sorted], gene_order, sim_lower[sorted],
angle=90, code=3, length = 0.02)
abline(h=0.5, col="red")
```

# 2. Apply to real data

Now we will use this function over the genes of *Carsonella
rudii*.

## 2.1 Read the genes from a file

Please download the file `c_rudii.fna.txt`

and save it in your computer. It has to be stored in your computer to be
fast and efficient.

Using the library `seqinr`

, please write the code to

- read the FASTA file and load all genes in a list called
`genes`

, - calculate the mean of
`is_G_or_C()`

for each gene, and store each value in a vector called`mean_is_G_or_C`

. The length of this vector must be the number of genes - calculate the variance of
`is_G_or_C()`

for each gene, and store each value in a vector called`var_is_G_or_C`

. - calculate the length of each gene, and store each value in a vector
called
`gene_length`

. - calculate the
*standard error of the mean*for each gene. That is, the square root of the variance divided by the gene length.

`# write here`

Test your results with this code:

`length(genes)`

`[1] 182`

```
par(mfrow=c(2,2))
hist(mean_is_G_or_C, col="grey")
hist(var_is_G_or_C, col="grey")
hist(gene_length, col="grey")
hist(std_err_mean, col="grey")
```

## 2.2 Using genes to estimate genome GC content

We want use the information of the real genes to build a confidence
interval for the *genome GC content*.

Since the values have much more variability than in the simulation,
we choose a confidence level of \(1-\alpha=99.9\%\), therefore
`alpha`

is 0.001. You must use the function `qt()`

to find *the number of standard errors* `k`

that will
give you the confidence interval. Remember that the number of
*degrees of freedom* is `gene_length-1`

.

Then you must calculate the vectors `lower`

, with the
lower limits of each confidence interval, and `upper`

, with
the upper limits.

```
<- 0.001
alpha # write here
```

Test your results with this code:

```
<- 1:length(genes)
gene_order plot(gene_order, mean_is_G_or_C, pch=16, ylim=c(0, 0.4))
arrows(gene_order, upper, gene_order, lower,
angle=90, code=3, length = 0.02)
```

The previous plot is hard to see, because there is a lot of
variability. It can be a little easier if we use the function
`order()`

to sort the values, as in the following code:

```
<- order(mean_is_G_or_C)
sorted plot(gene_order, mean_is_G_or_C[sorted], pch=16, ylim=c(0, 0.4))
arrows(gene_order, upper[sorted], gene_order, lower[sorted],
angle=90, code=3, length = 0.02)
```