November 21st, 2016

Interacting with the real world

Data comes from other programs

  • Data enters the computer from instruments
  • Most modern instruments have digital output
  • In some cases it has to be entered manually
  • This is risky: humans make many mistakes

For us, data always come from another program

Reading text files

The function used to read text files is

read.table(file, header = FALSE, sep = "", quote = "\"'",
           row.names, col.names, na.strings = "NA",
           stringsAsFactors = default.stringsAsFactors(),
           dec = ".", comment.char = "#", ...)

Please take a look at the help page of read.table().

Reading text files

The output of this function is a data.frame. The only mandatory argument is:

file
the name of the file to read. It can also be a URL
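
As a minimal sketch of the simplest call, the example below writes a tiny table to a temporary file and reads it back. The file contents and the names `path` and `small` are made up for illustration; they are not part of the course data.

```r
# Write a tiny whitespace-separated table to a temporary file (made-up values)
path <- tempfile(fileext = ".txt")
writeLines(c("id weight sex",
             "1 3580 F",
             "2 3350 M"), path)

# 'file' is the only argument we must give;
# header = TRUE tells R that the first line holds the column names
small <- read.table(path, header = TRUE)

class(small)   # the result is a data.frame
str(small)     # one row per data line, one column per field
```

The same call works when the first argument is a URL instead of a local path.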

Important options of read.table()

Other important options

header
if TRUE then the first line has the names of the columns
col.names
a character vector that gives names to the columns
row.names
  • the number of the column that has the row names
  • or a character vector that gives names to the rows

Sometimes R detects row and column names automatically (when?)
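
One case documented in the help page of read.table() can be sketched as follows: when header is not given and the first line has one field fewer than the other lines, R treats that line as the header and uses the first column as row names. The data and the names `txt`, `auto` and `explicit` are made up; the `text` argument lets us read from a character vector instead of a file.

```r
# Made-up data: the first line has one field fewer than the data lines
txt <- c("weight head",
         "a 3580 51.0",
         "b 3350 52.0")

# header is not given: R notices the shorter first line, uses it as the
# header, and takes the first column as row names
auto <- read.table(text = txt)
rownames(auto)
colnames(auto)

# The same table, with the names stated explicitly
explicit <- read.table(text = txt[-1],
                       col.names = c("name", "weight", "head"),
                       row.names = 1)
```

Stating header, col.names and row.names explicitly is usually the safer choice, because it does not depend on the layout of the first line.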

Important options of read.table()

sep
which character separates the columns, for example a space, a tab or a comma
default: white space (one or more spaces or tabs)
quote
which characters are used to wrap text
default: " and '
dec
character used as the decimal point. In the US it is ., in much of Europe it is ,
default: .
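
A minimal sketch of these three options together, assuming a file written in the European style: columns separated by semicolons, decimal commas, and text wrapped in double quotes. The values and the names `txt` and `eu` are made up for illustration.

```r
# Made-up data: ';' between columns, ',' as decimal mark, quoted text
txt <- c('name;weight;head',
         '"Baby A";3,58;51,0',
         '"Baby B";3,35;52,5')

# quote keeps its default, so the double quotes are removed from the names
eu <- read.table(text = txt, header = TRUE, sep = ";", dec = ",")

eu$weight             # read as numbers, not as text
is.numeric(eu$head)
```

Without dec = "," the columns weight and head would be read as text, because 3,58 is not a valid number when the decimal mark is a dot.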

Important options of read.table()

comment.char
everything in the line after this symbol is ignored
default: #
stringsAsFactors
logical value. If TRUE, all character columns in the file are converted to factors
default: TRUE in R versions before 4.0.0, FALSE since R 4.0.0

There are more options that are sometimes necessary. We have shown only the most commonly used ones
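
The effect of comment.char and stringsAsFactors can be sketched with a small made-up table. The names `txt`, `chr` and `fac` are only for illustration. Setting stringsAsFactors explicitly gives the same result in every R version.

```r
# Made-up data with a comment line and a comment at the end of a line
txt <- c("# measured at the clinic (made-up values)",
         "id sex weight",
         "1 F 3580   # first baby",
         "2 M 3350")

# Everything after '#' is ignored, so the comments never reach the table
chr <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
fac <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

class(chr$sex)   # text column kept as character
class(fac$sex)   # text column converted to factor
```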

Example data

We read data with

birth <- read.table("http://anaraven.bitbucket.io/static/birth.txt", header=TRUE)

which results in a data frame like this:

head(birth)
    id birth apgar5 sex weight head  age parity weeks
1 4347     1      8   F   1610 41.0 28.5      1    31
2 4346     1      9   F   3580 51.0 35.0      1    39
3 4300     1      9   F   3350 52.0 37.0      1    40
4 4345     1      9   F   3230 50.5 35.0      1    38
5 4349     1      8   F   3650 52.0 36.5      1    40
6 4315     2      8   F   3900 51.0 35.0      1    38

Two or more variables

Two plots in parallel

plot(birth$head)
points(birth$age, pch=2)

The first command, plot(), defines the scale of the axes. points() only adds symbols to the existing plot
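
Because the scale is fixed by the first command, one possible approach is to compute a range that covers both variables and pass it as ylim. This is only a sketch: the vectors `head_cm` and `age_wk` hold the six values of head and age shown by head(birth), and their names and the axis label are assumptions made for this example.

```r
# The six values of head and age shown by head(birth)
head_cm <- c(41.0, 51.0, 52.0, 50.5, 52.0, 51.0)
age_wk  <- c(28.5, 35.0, 37.0, 35.0, 36.5, 35.0)

# A y-range that covers both vectors, so no symbol falls outside the plot
both <- range(c(head_cm, age_wk))

plot(head_cm, ylim = both, ylab = "value")
points(age_wk, pch = 2)
```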

Adding straight lines

plot(birth$head)
points(birth$age, pch=2)
abline(h=mean(birth$head), lwd = 3)
abline(h=mean(birth$age), lwd = 3, col = "blue")

A-B-line

abline() adds a straight line at a given position

  • abline(h=1) adds a horizontal line at \(y = 1\)
  • abline(v=2) adds a vertical line at \(x = 2\)
  • abline(a=3, b=4) adds the line \(y = a + b\cdot x\), here \(y = 3 + 4x\)
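
The three forms can be tried on made-up points that lie exactly on the line with intercept 3 and slope 4. The vectors `x` and `y` are assumptions for this sketch.

```r
# Made-up points on the line y = 3 + 4x
x <- 0:5
y <- 3 + 4 * x

plot(x, y)
abline(a = 3, b = 4)            # intercept 3, slope 4: passes through every point
abline(h = mean(y), lty = 2)    # dashed horizontal line at the mean of y
abline(v = 2, lty = 3)          # dotted vertical line at x = 2
```

Like points(), abline() draws on the plot that already exists, so plot() has to be called first.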

Scatter plots

Comparing two variables

plot(birth$age, birth$apgar5)

Another example

plot(birth$age, birth$head)

Formulas in R

Formulas

Sometimes it is easier to describe the relationship between variables using a formula

Instead of

plot(birth$age, birth$head)

we can write

plot(birth$head ~ birth$age)

or even

plot(head ~ age, data = birth)

Using formulas makes life easier

plot(head ~ age, data = birth)

plot(head ~ age, data = birth, subset = sex=="F")
plot(head ~ age, data = birth, subset = sex=="M")

With a formula it is easier to specify the data.frame and which rows to plot
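
As a sketch that runs without downloading the file, the same idea can be tried on a small made-up data frame. The name `toy` and the values of sex are assumptions; head and age are the six values shown by head(birth). Here the second group is added to the first plot with points(), which also accepts a formula.

```r
# Small made-up data frame with the same column names as birth
toy <- data.frame(
  age  = c(28.5, 35.0, 37.0, 35.0, 36.5, 35.0),
  head = c(41.0, 51.0, 52.0, 50.5, 52.0, 51.0),
  sex  = c("F", "M", "F", "M", "F", "M")
)

# Axis limits computed from all rows, so both groups fit in the plot
xr <- range(toy$age)
yr <- range(toy$head)

plot(head ~ age, data = toy, subset = sex == "F", xlim = xr, ylim = yr)
points(head ~ age, data = toy, subset = sex == "M", pch = 2)
```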

Homework

Try these commands at home.

What is wrong with these graphics?