October 11, 2018

Basic objects in R

  • There are several data types:
    • numeric, character, logical, factor
  • They are stored in one of many data structures
    • vectors
    • lists
  • Each element can be accessed using indices
    • numeric vectors (positive or negative)
    • logical vectors
    • character vectors
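
The three kinds of index can be tried on a small named vector. This is only a sketch: the vector w and its names are made up for illustration.

```r
# Hypothetical named numeric vector (values and names are illustrative)
w <- c(Ali = 60, Deniz = 72, Fatma = 57, Emre = 90)

by_pos  <- w[c(1, 3)]          # positive numbers: keep elements 1 and 3
dropped <- w[-2]               # negative numbers: drop element 2
by_test <- w[w > 65]           # logical vector: keep elements where the test is TRUE
by_name <- w[c("Ali", "Emre")] # character vector: select by name
```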

Example exam questions

Assuming people is a list, what do these commands do?

people[[2]]
people[2]
people[[2]][3]
people[2][3]
people[[1:3]]
people[1:3]
people[["weight"]]
people$weight
people["weight"]
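
One way to explore these commands is to build a small list and try each one. The list below is hypothetical (the real people list may have different elements), so the comments describe what happens with this particular list only.

```r
# Hypothetical list; the exam's `people` may differ
people <- list(names  = c("Ali", "Deniz", "Fatma"),
               weight = c(60, 72, 57),
               height = c(1.75, 1.80, 1.65))

a  <- people[[2]]       # the element itself: a numeric vector
b  <- people[2]         # a list of length 1 that contains that vector
c3 <- people[[2]][3]    # third value of the second element
d  <- people[2][3]      # people[2] has only one element, so this is a list holding NULL
e  <- people[1:3]       # a sub-list with the first three elements
f  <- people["weight"]  # a list of length 1, like people[2]
g  <- people$weight     # same as people[["weight"]]

# people[[1:3]] means people[[1]][[2]][[3]] (recursive indexing).
# With this list it fails, so the error is caught here.
h <- tryCatch(people[[1:3]], error = function(err) "error")
```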

Matrices

Example of a matrix

state.x77
               Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama              3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska                365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona              2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas             2110   3378        1.9    70.66   10.1    39.9    65  51945
California          21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado             2541   4884        0.7    72.06    6.8    63.9   166 103766
Connecticut          3100   5348        1.1    72.48    3.1    56.0   139   4862
Delaware              579   4809        0.9    70.06    6.2    54.6   103   1982
Florida              8277   4815        1.3    70.66   10.7    52.6    11  54090
Georgia              4931   4091        2.0    68.54   13.9    40.6    60  58073
Hawaii                868   4963        1.9    73.60    6.2    61.9     0   6425
Idaho                 813   4119        0.6    71.87    5.3    59.5   126  82677
Illinois            11197   5107        0.9    70.14   10.3    52.6   127  55748
Indiana              5313   4458        0.7    70.88    7.1    52.9   122  36097
Iowa                 2861   4628        0.5    72.56    2.3    59.0   140  55941
Kansas               2280   4669        0.6    72.58    4.5    59.9   114  81787
Kentucky             3387   3712        1.6    70.10   10.6    38.5    95  39650
Louisiana            3806   3545        2.8    68.76   13.2    42.2    12  44930
Maine                1058   3694        0.7    70.39    2.7    54.7   161  30920
Maryland             4122   5299        0.9    70.22    8.5    52.3   101   9891
Massachusetts        5814   4755        1.1    71.83    3.3    58.5   103   7826
Michigan             9111   4751        0.9    70.63   11.1    52.8   125  56817
Minnesota            3921   4675        0.6    72.96    2.3    57.6   160  79289
Mississippi          2341   3098        2.4    68.09   12.5    41.0    50  47296
Missouri             4767   4254        0.8    70.69    9.3    48.8   108  68995
Montana               746   4347        0.6    70.56    5.0    59.2   155 145587
Nebraska             1544   4508        0.6    72.60    2.9    59.3   139  76483
Nevada                590   5149        0.5    69.03   11.5    65.2   188 109889
New Hampshire         812   4281        0.7    71.23    3.3    57.6   174   9027
New Jersey           7333   5237        1.1    70.93    5.2    52.5   115   7521
New Mexico           1144   3601        2.2    70.32    9.7    55.2   120 121412
New York            18076   4903        1.4    70.55   10.9    52.7    82  47831
North Carolina       5441   3875        1.8    69.21   11.1    38.5    80  48798
North Dakota          637   5087        0.8    72.78    1.4    50.3   186  69273
Ohio                10735   4561        0.8    70.82    7.4    53.2   124  40975
Oklahoma             2715   3983        1.1    71.42    6.4    51.6    82  68782
Oregon               2284   4660        0.6    72.13    4.2    60.0    44  96184
Pennsylvania        11860   4449        1.0    70.43    6.1    50.2   126  44966
Rhode Island          931   4558        1.3    71.90    2.4    46.4   127   1049
South Carolina       2816   3635        2.3    67.96   11.6    37.8    65  30225
South Dakota          681   4167        0.5    72.08    1.7    53.3   172  75955
Tennessee            4173   3821        1.7    70.11   11.0    41.8    70  41328
Texas               12237   4188        2.2    70.90   12.2    47.4    35 262134
Utah                 1203   4022        0.6    72.90    4.5    67.3   137  82096
Vermont               472   3907        0.6    71.64    5.5    57.1   168   9267
Virginia             4981   4701        1.4    70.08    9.5    47.8    85  39780
Washington           3559   4864        0.6    71.72    4.3    63.5    32  66570
West Virginia        1799   3617        1.4    69.48    6.7    41.6   100  24070
Wisconsin            4589   4468        0.7    72.48    3.0    54.5   149  54464
Wyoming               376   4566        0.6    70.29    6.9    62.9   173  97203

Matrices

Like vectors but in 2 dimensions

weight
[1] 60 72 57 90 95 72
matrix(weight, nrow=2, ncol=3)
     [,1] [,2] [,3]
[1,]   60   57   95
[2,]   72   90   72

Values are filled in column by column
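
To fill the matrix row by row instead, matrix() has a byrow argument. A short sketch using the same weight vector:

```r
weight <- c(60, 72, 57, 90, 95, 72)

# Default: fill down each column first
by_col <- matrix(weight, nrow = 2, ncol = 3)

# byrow = TRUE: fill across each row first
by_row <- matrix(weight, nrow = 2, ncol = 3, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]   60   72   57
# [2,]   90   95   72
```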

Matrix dimensions

M <- matrix(weight, nrow=2, ncol=3)
dim(M)
[1] 2 3
nrow(M)
[1] 2
ncol(M)
[1] 3

Row and column names

colnames(M) <- c("A", "B", "C")
rownames(M) <- c("x", "y")
M
   A  B  C
x 60 57 95
y 72 90 72

Indexing Matrices

  • Objects of type matrix or array use an index for each dimension
  • If an index is omitted, the whole range of that dimension is returned
M[2,  ]
 A  B  C 
72 90 72 
M[ , 3]
 x  y 
95 72 
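
Each dimension accepts the same kinds of index as a vector. The sketch below rebuilds M so that it runs on its own; the dimnames argument is just a shortcut for the rownames() and colnames() assignments shown earlier.

```r
weight <- c(60, 72, 57, 90, 95, 72)
M <- matrix(weight, nrow = 2, ncol = 3,
            dimnames = list(c("x", "y"), c("A", "B", "C")))

cell_num  <- M[2, 3]          # row 2, column 3
cell_name <- M["x", "B"]      # the same idea, using names
no_first  <- M[-1, ]          # negative index: drop the first row
two_cols  <- M[, c("A", "C")] # omit the row index to keep all rows
```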

Indexing Matrices

Notice that the result is sometimes a matrix and sometimes a vector

M[ , 2:3]
   B  C
x 57 95
y 90 72
M[ , 3]
 x  y 
95 72 
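
When a single row or column is selected, R simplifies the result to a vector. If a matrix is needed, the drop = FALSE argument of the indexing operator prevents this. A minimal sketch:

```r
weight <- c(60, 72, 57, 90, 95, 72)
M <- matrix(weight, nrow = 2, ncol = 3,
            dimnames = list(c("x", "y"), c("A", "B", "C")))

v <- M[, 3]                # one column: simplified to a named vector
m <- M[, 3, drop = FALSE]  # drop = FALSE keeps a matrix with one column

is.matrix(v)  # FALSE
dim(m)        # 2 1
```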

Exercise

Copy the first 6 rows of state.x77 into the matrix mat

mat <- state.x77[1:6, ]
mat
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

Exercises using mat

  • Find the value of the third row, fourth column
  • Get a vector with the fifth column
  • What is the difference between the first and the second row?
  • What is the Illiteracy in Colorado?
    • change it to 0
  • Change the names of the columns to Turkish
  • Change the names of rows to abbreviations
    • Hint: state.abb

Data Frames

  • Two-dimensional structure, like matrices
  • Each column can be of a different type, but all columns have the same length
  • All columns need a name
ppl <- data.frame(weight=c(60, 72, 57, 90, 95, 72),
               height=c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
               names=c("Ali", "Deniz", "Fatma", "Emre",
                       "Volkan", "Onur"),
               gender=factor(c("M","F","F","M","M","M")))

Data Frame

ppl
  weight height  names gender
1     60   1.75    Ali      M
2     72   1.80  Deniz      F
3     57   1.65  Fatma      F
4     90   1.90   Emre      M
5     95   1.74 Volkan      M
6     72   1.91   Onur      M

Each column is a vector

If ppl is a data.frame, then ppl[[1]] is a vector

  • All elements of a column have the same data type
  • Different columns may have different types
    • In a matrix, all columns have the same type
  • All columns have the same length
    • In a list the elements can have any size
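
These differences can be checked directly. The sketch below uses a shortened version of ppl and two made-up objects, m and l, for comparison.

```r
# Shortened version of ppl: one numeric column, one factor column
ppl <- data.frame(weight = c(60, 72, 57),
                  gender = factor(c("M", "F", "F")))
class(ppl$weight)  # "numeric"
class(ppl$gender)  # "factor"

# In a matrix, mixing numbers and text forces one type: everything becomes character
m <- cbind(c(60, 72, 57), c("M", "F", "F"))
typeof(m)          # "character"

# A list accepts elements of different lengths
l <- list(weight = c(60, 72, 57), gender = c("M", "F"))
lengths(l)         # 3 and 2
```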

Accessing single columns

You can index the columns as in a list

ppl[[1]]
[1] 60 72 57 90 95 72
ppl[["weight"]]
[1] 60 72 57 90 95 72
ppl$weight
[1] 60 72 57 90 95 72
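
As with lists, single brackets keep the container: ppl["weight"] is a data frame with one column, not a vector. Data frames also accept the two-index form used for matrices. A small sketch with a shortened ppl:

```r
ppl <- data.frame(weight = c(60, 72, 57),
                  height = c(1.75, 1.80, 1.65))

col_vec <- ppl[["weight"]]      # a numeric vector
col_df  <- ppl["weight"]        # a data frame with one column
cell    <- ppl[2, "height"]     # row 2 of the height column
heavy   <- ppl[ppl$weight > 58, ]  # rows selected with a logical vector
```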

Data frame dimensions

dim(ppl)
[1] 6 4
nrow(ppl)
[1] 6
ncol(ppl)
[1] 4

Row and column names

colnames(ppl)
[1] "weight" "height" "names"  "gender"
rownames(ppl)
[1] "1" "2" "3" "4" "5" "6"
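
Both functions can also be used on the left side of an assignment, as with matrices. The sketch below uses a shortened ppl; the new column name weight_kg is only an example.

```r
ppl <- data.frame(weight = c(60, 72, 57),
                  names  = c("Ali", "Deniz", "Fatma"))

# Use the people's names as row names (as.character() in case names is a factor)
rownames(ppl) <- as.character(ppl$names)

# Rename only the first column
colnames(ppl)[1] <- "weight_kg"

ppl["Deniz", "weight_kg"]  # 72
```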

Interacting with the real world

Data comes from other programs

  • Data enters the computer from instruments
  • Most modern instruments have digital output
  • In some cases it has to be entered manually
  • This is dangerous: humans make many mistakes

For us, data always comes from another program

Reading text files

The function used to read text files is

read.table(file, header = FALSE, sep = "", quote = "\"'",
           row.names, col.names, na.strings = "NA",
           stringsAsFactors = default.stringsAsFactors(),
           dec = ".", comment.char = "#", ...)

Please take a look at the help page of read.table().

Reading text files

The output of this function is a data.frame. The only mandatory argument is:

file
the name of the file to read. It can also be a URL

Another important option

header
if TRUE, the first line contains the names of the columns

Other useful options

sep
the character used to separate columns. Use "\t" for Tab
stringsAsFactors
Logical option. If it is TRUE (the default before R 4.0.0; since then the default is FALSE), text columns are read as factors

Set it to FALSE to read text as character
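
These options can be tried without a file, because read.table() also accepts the data as a string through its text argument. The tab-separated text below is made up for illustration.

```r
# Made-up tab-separated text standing in for a file
txt <- "name\tweight\tgender\nAli\t60\tM\nDeniz\t72\tF\n"

# header = TRUE: the first line gives the column names
d <- read.table(text = txt, header = TRUE, sep = "\t",
                stringsAsFactors = FALSE)
str(d)

# header = FALSE (the default): the first line is read as data,
# and the columns get the automatic names V1, V2, V3
d2 <- read.table(text = txt, sep = "\t", stringsAsFactors = FALSE)
```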

Example data

We read data with

survey <- read.table(
"https://anaraven.bitbucket.io/static/2018/cmb1/survey1.txt",
 header=TRUE, sep="\t")

Result

We get a data frame like this:

head(survey)
  Gender birth_date height_cm weight_kg handness hand_span_cm
1   Male 08/12/2018        NA        NA    Right           20
2   Male 01/02/1993     179.0        67    Right           15
3 Female 09/10/1995     167.0        58    Right           18
4 Female 28/01/1995       1.7        56     Left           18
5   Male 11/08/1992       1.8        94    Right           25
6 Female 01/01/1991     160.0        60    Right           19
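
The output already shows why manually entered data needs checking: some heights look like metres rather than centimetres, and the dates are stored as text. The sketch below retypes a few of the rows shown above so that it runs without a network connection. It assumes the dates are in day/month/year order, and the threshold of 100 is an arbitrary choice for illustration.

```r
# A few rows of the survey, as printed above, typed in by hand
survey_head <- data.frame(
  birth_date = c("08/12/2018", "01/02/1993", "09/10/1995", "28/01/1995"),
  height_cm  = c(NA, 179.0, 167.0, 1.7),
  stringsAsFactors = FALSE)

# Assumption: dates are written as day/month/year
survey_head$birth <- as.Date(survey_head$birth_date, format = "%d/%m/%Y")

# Flag heights below 100 "cm" as suspicious (which() skips the NA)
suspicious <- which(survey_head$height_cm < 100)
```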