The
midterm exam has three mandatory questions and one optional. All
questions point to evaluate *Computational Thinking* skills:
decomposition, pattern recognition, abstraction and algorithm design

# 1. Turtle draws Snow

This question asks us to make a **recursive** function.
Have we seen them before?

Yes! We saw examples of recursive functions, such as the
*factorial*, the *Fibonacci* and the one drawing trees. We
made tree drawings on Quiz 2 and also in classes.

We have seen that *recursive* functions are functions that
call themselves. The factorial of `N`

is `N`

times
the factorial of `N-1`

. A tree (`N`

levels) has
trunk and branches, and each branch is a smaller tree (with
`N-1`

levels). We also saw that all recursive functions
always have an `if()`

to handle the *exit condition*,
typically when `N==1`

.

So we can start with a copy of the `tree()`

function. Here
it is:

```
<- function(n, length, angle) {
tree turtle_lwd(n)
turtle_forward(length)
if (n > 1) {
turtle_left(angle)
tree(n-1,length*0.8, angle)
turtle_right(2*angle)
tree(n-1, length*0.8, angle)
turtle_left(angle)
tree(n-1, length*0.8, angle)
}turtle_backward(length)
}
```

We have to change it to draw *snow* instead of *trees*.
First we change the name from `tree`

to `snow`

. We
also change the inputs. Instead of `n`

we use `N`

and instead of `length`

we use `L`

. There are no
more inputs to `snow()`

.

```
<- function(N, L) {
snow turtle_lwd(N)
turtle_forward(L)
if (N > 1) {
turtle_left(angle)
snow(N-1,L*0.8, angle)
turtle_right(2*angle)
snow(N-1, L*0.8, angle)
turtle_left(angle)
snow(N-1, L*0.8, angle)
}turtle_backward(L)
}
```

This version does not work, because it uses `angle`

which
is not defined. Also, the question says that, instead of
`L*0.8`

, each part must be of length `L/3`

. The
important parts of the question are:

The turtle draws the first part, turns left 60 degrees, draws the second part, turns 120 degrees to the right, draws another part, turns 60 degrees left, and draws the last part.

[…]

In generalsnow levelof length`N`

`L`

is made of four parts ofsnow levelof length`N-1`

`L/3`

.

So we need to insert one *snow level N-1* and we
know the angles:

```
<- function(N, L) {
snow turtle_lwd(N)
turtle_forward(L)
if (N > 1) {
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
turtle_right(120)
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
}turtle_backward(L)
}
```

We can test this function and it works 😀 but the drawing is weird
🤪. The lines have different width, and the snow is outside the box. We
prefer all lines of the same length, so we have to delete the
`turtle_lwd(N)`

command.

The command `snow(1, 80)`

is ok, but the others are
outside. So we need to do `turtle_forward(L)`

**only** when `N==1`

. The new version is

```
<- function(N, L) {
snow if(N==1) {
turtle_forward(L)
}if (N > 1) {
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
turtle_right(120)
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
}turtle_backward(L)
}
```

This one is also weird, but at least we get most of the snow inside the box. It looks like we are drawing and moving outside. What can be wrong?

The first `if()`

is ok, because the question says
*“Snow level 1 is just a straight line of length
L”*. The second

`if()`

is also right. It
does exactly what the question asks.What about the `turtle_backward(L)`

? The question does not
say anything about “going backwards” or “return to the initial
condition”. In fact each *snow* has to start at the end of the
previous snow. Let’s delete `turtle_backward(L)`

and see what
happens.

```
<- function(N, L) {
snow if(N==1) {
turtle_forward(L)
}if (N > 1) {
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
turtle_right(120)
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
} }
```

Now it works. We can simplify it a last time using
`if() {} else {}`

.

```
<- function(N, L) {
snow if(N==1) {
turtle_forward(L)
else {
} snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
turtle_right(120)
snow(N-1, L/3)
turtle_left(60)
snow(N-1, L/3)
} }
```

Ready! Let’s go to the next question.

# 2. Algorithm design

We have to write a function, and we got this starting point:

```
<- function(x) {
locate_half # write your code here
}
```

Have we seen this before? Maybe is like finding *minimum* or
*maximum*. We have never programed our own version of
`min()`

, but it should not be so hard, isn’t?

Wait. There is `min()`

and `which.min()`

. This
question ask for an *index*, so it is more like
`which.min()`

. It is not about the *value* of the half
value, it is about the *location* of the half value.

The *half value* is the average of the minimum and the
maximum. Let’s re-write this phrase in R language using the starting
point:

```
<- function(x) {
locate_half <- (min(x) + max(x))/2
half_value return(half_value)
}
```

Is this enough? Let’s test:

```
<- 1:9
x locate_half(x)
```

`## [1] 5`

Good!

`locate_half(x+20)`

`## [1] 25`

Mmm…🤔 It does not look right. We got the *half value*, not
the * location of the half value*. Is like doing

`min()`

instead of `which.min()`

^{1}.

Let’s read the question again. The question says:

The

location of the half valueof the vector`x`

is theindex of the first valuethat is equal or bigger than thehalf valueof`x`

.

Therefore we will need an index. Let’s say that the index is
`i`

. And we need to check each one of the components of
`x`

from the first to the last. So we need a
`for(){}`

loop using `i`

as a counter.

```
<- function(x) {
locate_half <- (min(x) + max(x))/2
half_value for(i in 1:length(x)) {
...return(...)
} }
```

For each element of `x`

pointed by `i`

we need
to make a decision. We want to know if `x[i]`

is bigger or
equal to `half_value`

. Every time we make a decision, there
must be an `if(){}`

. Let’s write that.

```
<- function(x) {
locate_half <- (min(x) + max(x))/2
half_value for(i in 1:length(x)) {
if(x[i] >= half_value) {
return(...)
}
} }
```

Now, what shall we do if we find that
`x[i] >= half_value`

? In that case we have found the
*location of the half value*. It is the *index* of
`x[i]`

, in other words, it is `i`

. We have to
return that number.

```
<- function(x) {
locate_half <- (min(x) + max(x))/2
half_value for(i in 1:length(x)) {
if(x[i] >= half_value) {
return(i)
}
} }
```

And we test again

```
<- 1:9
x locate_half(x)
```

`## [1] 5`

`locate_half(x + 20)`

`## [1] 5`

`locate_half(x * x)`

`## [1] 7`

`locate_half(sqrt(x))`

`## [1] 4`

Good! we got the expected result. Can we make it better?

Well, the question says that `x`

is growing, so
`x[1]=min(x)`

and `x[length(x)]=max(x)`

. Using
indices is faster than using functions^{2} so
we can rewrite our function as:

```
<- function(x) {
locate_half <- (x[1] + x[length(x)])/2
half_value for(i in 1:length(x)) {
if(x[i] >= half_value) {
return(i)
}
} }
```

Let’s move to the next question.

# 3. Systems: Polymerase Chain Reaction

The PCR system is shown in the question as this:

Have we seen a similar system before?

## Simulate several PCR cycles

Yes! It is the same plot we got for cell growth in Class 11 and in
the blog document. **This is pattern recognition
again**. A system depends only on how many circles connect to how
many boxes with how many arrows. It does

*not*depend on the names in the circles/boxes, or the way we draw them.

**This is**.

*abstraction*againSo we can start with the code for `cell_culture()`

from
the class.

```
<- function(N, replication_rate, cells_init, food_ini) {
cell_culture # create empty vectors
<- d_cells <- rep(NA, N)
cells <- d_food <- rep(NA, N)
food # initialize values
1] <- cells_init
cells[1] <- food_ini
food[1] <- d_food[1] <- 0
d_cells[for(i in 2:N) {
# update `d_cells` and `d_food`
<- +replication_rate * cells[i-1] * food[i-1]
d_cells[i] <- -replication_rate * cells[i-1] * food[i-1]
d_food[i] # update `cells` and `food` as a cumulative sum
<- cells[i-1] + d_cells[i]
cells[i] <- food[i-1] + d_food[i]
food[i]
}return(data.frame(cells, food))
}
```

We have to change the name of the function. We also change the names
of the variables, from `cells`

to `DNA`

, from
`food`

to `primers`

and from `eating`

to `cycle`

. We change also the names for the variables with
names like `d_cells`

. This is easy to do with the *Search
& Replace* option in the editor. Also,
`replication_rate`

becomes `cycle_rate`

, or just
`rate`

, as the question indicates. We got this:

```
<- function(N, rate, dna_ini, primers_ini) {
pcr # create empty vectors
<- d_dna <- rep(NA, N)
dna <- d_primers <- rep(NA, N)
primers # initialize values
1] <- dna_ini
dna[1] <- primers_ini
primers[1] <- d_primers[1] <- 0
d_dna[for(i in 2:N) {
# update `d_dna` and `d_primers`
<- +rate * dna[i-1] * primers[i-1]
d_dna[i] <- -rate * dna[i-1] * primers[i-1]
d_primers[i] # update `dna` and `primers` as a cumulative sum
<- dna[i-1] + d_dna[i]
dna[i] <- primers[i-1] + d_primers[i]
primers[i]
}return(data.frame(dna, primers))
}
```

Now we need to include the *default values* for
`rate`

and `primer_ini`

.

```
<- function(N, dna_ini, primers_ini=1e8, rate=1e-8) {
pcr # create empty vectors
<- d_dna <- rep(NA, N)
dna <- d_primers <- rep(NA, N)
primers # initialize values
1] <- dna_ini
dna[1] <- primers_ini
primers[1] <- d_primers[1] <- 0
d_dna[for(i in 2:N) {
# update `d_dna` and `d_primers`
<- +rate * dna[i-1] * primers[i-1]
d_dna[i] <- -rate * dna[i-1] * primers[i-1]
d_primers[i] # update `dna` and `primers` as a cumulative sum
<- dna[i-1] + d_dna[i]
dna[i] <- primers[i-1] + d_primers[i]
primers[i]
}return(data.frame(dna, primers))
}
```

Now we test and see if we did it right.

`pcr(N=6, dna_ini=1e6)`

```
## dna primers
## 1 1000000 100000000
## 2 2000000 99000000
## 3 3980000 97020000
## 4 7841396 93158604
## 5 15146331 85853669
## 6 28150012 72849988
```

Yup! It’s done. And we only used *Search & Replace* and
pattern recognition.

## PCR depends on initial concentration

This question asks us to write code directly, not any function. It gives us some initial data:

```
<- 10**(0:6)
initial_dna initial_dna
```

`## [1] 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06`

For each of these values we need to simulate 30 PCR cycles and get
the `dna`

value. Let’s see if we can do just one first.

`pcr(N=30, dna_ini=initial_dna[1])`

```
## dna primers
## 1 1.000000e+00 100000000.0
## 2 2.000000e+00 99999999.0
## 3 4.000000e+00 99999997.0
## 4 8.000000e+00 99999993.0
## 5 1.600000e+01 99999985.0
## 6 3.200000e+01 99999969.0
## 7 6.399998e+01 99999937.0
## 8 1.279999e+02 99999873.0
## 9 2.559997e+02 99999745.0
## 10 5.119987e+02 99999489.0
## 11 1.023995e+03 99998977.0
## 12 2.047979e+03 99997953.0
## 13 4.095916e+03 99995905.1
## 14 8.191665e+03 99991809.3
## 15 1.638266e+04 99983618.3
## 16 3.276263e+04 99967238.4
## 17 6.551454e+04 99934486.5
## 18 1.309861e+05 99869014.9
## 19 2.618007e+05 99738200.3
## 20 5.229161e+05 99477084.9
## 21 1.043098e+06 98956903.3
## 22 2.075315e+06 97924686.1
## 23 4.107561e+06 95892440.5
## 24 8.046401e+06 91953600.4
## 25 1.544536e+07 84554645.4
## 26 2.850512e+07 71494879.8
## 27 4.888482e+07 51115177.6
## 28 7.387239e+07 26127613.3
## 29 9.317348e+07 6826521.5
## 30 9.953399e+07 466013.9
```

Ok, we got something. Maybe too much, since we got `dna`

and `primers`

. We only want `dna`

. Since the
result is a data frame, we can use any kind of index to get the first
column, such as `$dna`

or `[[1]]`

or
`[,"dna"]`

. Data frames can be used in many different ways,
as we learned last semester^{3}.

Let’s try again

`pcr(N=30, dna_ini=initial_dna[1])$dna`

```
## [1] 1.000000e+00 2.000000e+00 4.000000e+00 8.000000e+00 1.600000e+01
## [6] 3.200000e+01 6.399998e+01 1.279999e+02 2.559997e+02 5.119987e+02
## [11] 1.023995e+03 2.047979e+03 4.095916e+03 8.191665e+03 1.638266e+04
## [16] 3.276263e+04 6.551454e+04 1.309861e+05 2.618007e+05 5.229161e+05
## [21] 1.043098e+06 2.075315e+06 4.107561e+06 8.046401e+06 1.544536e+07
## [26] 2.850512e+07 4.888482e+07 7.387239e+07 9.317348e+07 9.953399e+07
```

Now we got a vector. We need to do it seven times. We can do it one by one.

```
<- data.frame(V1=pcr(N=30, dna_ini=initial_dna[1])$dna,
conc V2=pcr(N=30, dna_ini=initial_dna[2])$dna,
V3=pcr(N=30, dna_ini=initial_dna[3])$dna,
V4=pcr(N=30, dna_ini=initial_dna[4])$dna,
V5=pcr(N=30, dna_ini=initial_dna[5])$dna,
V6=pcr(N=30, dna_ini=initial_dna[6])$dna,
V7=pcr(N=30, dna_ini=initial_dna[7])$dna)
```

It is not very elegant, but it works. Let’s test it.

```
plot(x=c(1,30), y=c(0, max(conc)), type="n", xlab="PCR cycle",
ylab="DNA Concentration", main="Effect of initial DNA on PCR")
for(i in 1:ncol(conc)) {
points(conc[[i]], pch=i, type="b")
}
```

We are ready. We can move forward.

But we can do better. If we still have time, we can improve this code to be more generic. We would like to have a program that works for any number of initial conditions. We have go step by step.

First we need an “empty” data frame. We cannot make a completely empty data frame, but we can start with a single column with an empty vector. Each simulation has 30 steps, so we need a vector of size 30.

`<- data.frame(V1=rep(NA, 30)) conc `

Now we need a loop for each element of `initial_dna`

:

```
for(i in 1:length(initial_dna)) {
... }
```

Inside the loop we have to fill each of the columns of
`conc`

. Can we do that, even if we have only one column? Yes,
since data frames can grow. Last semester we saw that it was quite easy
to add new columns to a data frame. We just use an index and assign a
vector. If the indexed column does not exist on the data frame, it is
created *automagically*. Let’s do it.

```
for(i in 1:length(initial_dna)) {
<- pcr(N=30, dna_ini=initial_dna[i])$dna
conc[[i]] }
```

Let’s see if this works:

```
plot(x=c(1,30), y=c(0, max(conc)), type="n", xlab="PCR cycle",
ylab="DNA Concentration", main="Effect of initial DNA on PCR")
for(i in 1:ncol(conc)) {
points(conc[[i]], pch=i, type="b")
}
```

This curves can be measured in real PCR experiments. In some cases we
can measure the *optical density (OD)* through the PCR cycle. In
other cases we can use fluorescence. Both methods give us a signal that
is proportional to DNA concentration. If the real curves match our
model, then our model is useful.

Now we can go home.

# 4. Bonus: Quantitative PCR

Bonus! This is optional, but it gives more points and more knowledge. Also, it makes us prouder of our accomplishments. So we do not go home yet.

Here again we have to write code, not a function. We have make a
vector called `CT`

, which contains the *half value* of
each column of the data frame `conc`

. We find the *half
value* with function `locate_half()`

which we already
created. And we have to do it for each column of `conc`

. Like
in the previous question we can do it one by one:

```
<- c(locate_half(conc[[1]]),
CT locate_half(conc[[2]]),
locate_half(conc[[3]]),
locate_half(conc[[4]]),
locate_half(conc[[5]]),
locate_half(conc[[6]]),
locate_half(conc[[7]]))
```

Hey, doing it one by one is **boring** 😒 and it makes
me feel like I’m working for the computer 😖, like in a dystopian movie.
Let’s make the computer work for us. If we have to do the same thing
more than two times, we write a `for(){}`

loop to do it for
us. Life is short, let’s leave the boring stuff for the computers.

We use the same pattern we have done many times. We create an empty
vector of the correct size^{4} and then we use a
loop.

```
<- rep(NA, ncol(conc))
CT for(i in 1:ncol(conc)) {
<- locate_half(conc[[i]])
CT[i] }
```

Then we draw the plot and we see the points in a “straight” line.

`plot(initial_dna ~ CT, log="y")`

This result is important because it means that there is a
relationship between the initial DNA concentration and the number of
cycles required to cross a fixed threshold. The name `CT`

means *cycle threshold*.

If we know this relationship, we can determine with high precision
what was the initial DNA concentration. That is the idea of
*Quantitative PCR*.

Let’s find that relationship We
remember that straight lines can be modeled with *linear
models*^{5}. The name says it all. The plot uses
*log* in the vertical coordinate, so the linear model also needs
logarithmic scale. We can use any logarithm, since all do the trick. In
this case we can use `log10`

since it is easier to
understand.

```
<- lm(log10(initial_dna) ~ CT)
model model
```

```
##
## Call:
## lm(formula = log10(initial_dna) ~ CT)
##
## Coefficients:
## (Intercept) CT
## 8.3241 -0.3006
```

Since *Intercept* is 8.3241 then the concentration
corresponding to CT=0 is \(10^{8.3241}\)=2.1091137 × 10^{8}.
With so much initial DNA the threshold is crossed before we start the
experiment.

The slope constant for CT is -0.3, meaning that when CT increases by
one, the initial DNA concentration decreases \(10^{-0.3}\)=0.5011872 times. That is like
half. In other words, double DNA means one less cycle. This means that
the PCR reaction *duplicates* the number of DNA molecules on each
cycle. That happens only when we have 100% efficiency.

Let’s make a table:

```
<- data.frame(CT = seq(from=0, to=40, by=2))
ans $initial_DNA <- 10^predict(model, newdata = ans)
ans::kable(ans, format.args=list(digits=2)) knitr
```

CT | initial_DNA |
---|---|

0 | 2.1e+08 |

2 | 5.3e+07 |

4 | 1.3e+07 |

6 | 3.3e+06 |

8 | 8.3e+05 |

10 | 2.1e+05 |

12 | 5.2e+04 |

14 | 1.3e+04 |

16 | 3.3e+03 |

18 | 8.2e+02 |

20 | 2.1e+02 |

22 | 5.2e+01 |

24 | 1.3e+01 |

26 | 3.2e+00 |

28 | 8.1e-01 |

30 | 2.0e-01 |

32 | 5.1e-02 |

34 | 1.3e-02 |

36 | 3.2e-03 |

38 | 8.0e-04 |

40 | 2.0e-04 |

This means that, if we can do 40 PCR cycles, we would be able to detect DNA in low concentrations, as low as 0.0002, which is 2 molecules in 10 liters. Amazing.

Now, what is the *margin of error* for this measurement? That
is matter for another day.

It cannot be that easy. The teacher does not asks

*easy*questions. And we are not*lazy*students.↩︎Speed is not very important now, but if one day we do

*big data*then speed is essential. The program has to finish before we get old.↩︎Did we learn that? At least it was in the lectures.↩︎

We can even be wrong about the size of the vector. If we make a small vector it will grow as needed later. But is more efficient to do it right from the beginning.↩︎

So

*linear models*are part of*molecular biology*? Yes, very much. Good that we learned them last semester.↩︎