Using Data Frames

October 22, 2019

Basic objects in R

There are several data types:
- numeric, character, logical, factor
They are stored in one of several data structures:
- vectors
- lists
- matrices
- data frames
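
As a rough illustration, the four structures can be built like this. All names and values below are made up for the example.

```r
# A vector: all elements have the same type
heights <- c(170, 162, 185)

# A list: elements may have different types
person <- list(name = "Ali", height_cm = 170, left_handed = FALSE)

# A matrix: a two-dimensional arrangement of values of a single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame: a table whose columns are vectors of equal length
df <- data.frame(height_cm = heights,
                 hand = factor(c("Left", "Right", "Right")))

is.numeric(heights)   # TRUE
is.list(person)       # TRUE
is.matrix(m)          # TRUE
is.data.frame(df)     # TRUE
```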

Basic Indices

Each element can be accessed using indices
- numeric vectors (positive or negative)
- logical vectors
- character vectors
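
A small sketch of the three kinds of index, using a made-up named vector (`weights` is an illustrative name, not part of the course data):

```r
weights <- c(st1 = 56, st2 = 75, st3 = 65, st4 = 63)

weights[c(1, 3)]           # positive numbers: keep elements 1 and 3
weights[-2]                # negative numbers: drop element 2
weights[weights > 60]      # logical vector: keep elements where the test is TRUE
weights[c("st2", "st4")]   # character vector: select elements by name
```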

Reading text files

The function used to read text files is

read.table(file, header = FALSE, sep = "", quote = "\"'",
           row.names, col.names, na.strings = "NA",
           stringsAsFactors = TRUE,
           dec = ".", comment.char = "#", ...)

Please take a look at the help page of read.table().

help(read.table)

Reading text files

The output of this function is a data.frame. The only mandatory argument is:

file: the name of the file to read. It can also be a URL

Other useful options

header: if TRUE, the first line contains the names of the columns
sep: the character used to separate columns. Use "\t" for Tab

Other useful options

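stringsAsFactors: logical option. If it is TRUE, text columns are read as factors; set it to FALSE to read text as character. The default was TRUE in R versions before 4.0.0 and is FALSE from R 4.0.0 onward. The outputs in these slides were produced with the value TRUE
dec: the character used in the file for decimal points; use dec="," for numbers in Turkish (European) format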
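
The options above can be tried without any file by passing the content through the `text` argument of `read.table()`. This is only a sketch: the content of `raw` is invented for the example, with semicolons as separators and commas as decimal points.

```r
# Hypothetical file content: header line, ";" between columns, "," for decimals
raw <- "name;height_cm;handness
Ayse;170,5;Left
Mehmet;185,0;Right"

people <- read.table(text = raw, header = TRUE, sep = ";", dec = ",",
                     stringsAsFactors = FALSE)

str(people)   # name and handness are character, height_cm is numeric
```

Without `dec = ","` the column `height_cm` would not be read as numbers, which is a common source of surprises with European-format files.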

Example data

Today we will use data from

https://anaraven.bitbucket.io/static/2018/cmb1/survey1-tidy.txt

Please download it to your computer and save it where R can find it, for example in your working directory

Example data

We read the data with

survey <- read.table("survey1-tidy.txt")

What can we say about this data?
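
One possible way to start answering this is to look at the first rows and at the structure of the data frame. The sketch below uses a small stand-in called `survey_demo` with invented values, so that it runs without the downloaded file. The same calls can be applied to `survey`.

```r
# Stand-in data frame with made-up values (not the real survey)
survey_demo <- data.frame(
  Gender    = c("Female", "Male", "Female"),
  height_cm = c(170, 185, 162),
  handness  = c("Left", "Right", "Right"),
  row.names = c("st1", "st2", "st3"),
  stringsAsFactors = TRUE
)

head(survey_demo)   # first rows of the table
str(survey_demo)    # type of each column and a few example values
```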

Selecting columns

Data frames always have column names

Each column can be accessed by its name

colnames(survey)

[1] "Gender"       "birth_day"    "birth_month" 
[4] "birth_year"   "height_cm"    "weight_kg"   
[7] "handness"     "hand_span_cm"

Selecting columns

Each column is a vector

survey$handness

 [1] Right Right Left  Right Right Right Right Left 
 [9] Right Right Right Right Right Right Right Right
[17] Right Right Right Right Right Right Right Right
[25] Right Right Right Right Left  Right Right Right
[33] Right Right Right Right Right Right Right Right
[41] Right Right Right Right Right Right Right Right
[49] Left  Right Right
Levels: Left Right
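
Because each column is a vector, the usual vector functions work on it directly. A minimal sketch with a made-up data frame (`people` and its values are illustrative, not taken from the survey file):

```r
people <- data.frame(
  height_cm = c(170, 162, 185),
  handness  = factor(c("Left", "Left", "Right"))
)

people$height_cm           # the column as a numeric vector
length(people$height_cm)   # number of elements in the column
mean(people$height_cm)     # average of the column
people[["height_cm"]]      # double brackets give the same vector as $
```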

Choosing some rows

We can always compare a vector to a constant

survey$handness=="Left"

 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [9] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49]  TRUE FALSE FALSE

(notice that we use == for comparisons)
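
The result of a comparison is a logical vector, so it can be counted or turned into positions. A short sketch with an invented vector `hand`:

```r
hand <- c("Right", "Right", "Left", "Right", "Left")

is_left <- hand == "Left"   # one TRUE or FALSE per element
is_left                     # FALSE FALSE TRUE FALSE TRUE
sum(is_left)                # TRUE counts as 1, so this is the number of matches: 2
which(is_left)              # positions of the TRUE values: 3 5
```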

Selecting rows

We can use the result of this comparison as a row index

survey[survey$handness=="Left", ]

     Gender birth_day birth_month birth_year height_cm
st3  Female        28           1       1995       170
st8  Female        14           1       1997       162
st29   Male        28           7       1998       185
st49 Female         2           5       1999       165
     weight_kg handness hand_span_cm
st3         56     Left           18
st8         75     Left           18
st29        65     Left           22
st49        63     Left           17
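
The same pattern works with any logical vector, including conditions combined with `&`. The sketch below assumes a small made-up data frame named `people`; the empty space after the comma means "all columns".

```r
people <- data.frame(
  Gender    = c("Female", "Male", "Female", "Male"),
  height_cm = c(170, 185, 162, 178),
  handness  = c("Left", "Right", "Right", "Left"),
  row.names = c("st1", "st2", "st3", "st4")
)

# Rows where handness is "Left"
left <- people[people$handness == "Left", ]

# Rows where both conditions are TRUE
tall_left <- people[people$handness == "Left" & people$height_cm > 175, ]

left
tall_left
```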

Two ways to say the same thing

survey[survey$handness=="Left", "Gender"]

[1] Female Female Male   Female
Levels: Female Male

survey$Gender[survey$handness=="Left"]

[1] Female Female Male   Female
Levels: Female Male

Same result, different ways
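
To check that two expressions really give the same result, one option is `identical()`. A sketch on a made-up data frame (`people` is illustrative only):

```r
people <- data.frame(
  Gender   = c("Female", "Male", "Female", "Male"),
  handness = c("Left", "Right", "Right", "Left")
)

a <- people[people$handness == "Left", "Gender"]   # choose rows, then one column
b <- people$Gender[people$handness == "Left"]      # take the column, then choose elements

identical(a, b)   # TRUE
```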

Summary statistics

We recommend that you verify the result immediately every time you use read.table()

summary(survey)

    Gender     birth_day      birth_month    
 Female:30   Min.   : 1.00   Min.   : 1.000  
 Male  :21   1st Qu.: 5.00   1st Qu.: 3.500  
             Median :13.00   Median : 6.000  
             Mean   :13.59   Mean   : 6.353  
             3rd Qu.:20.00   3rd Qu.: 9.000  
             Max.   :31.00   Max.   :12.000  
   birth_year     height_cm       weight_kg     
 Min.   :1991   Min.   :155.0   Min.   : 42.50  
 1st Qu.:1997   1st Qu.:163.0   1st Qu.: 55.00  
 Median :1997   Median :171.0   Median : 64.00  
 Mean   :1998   Mean   :170.7   Mean   : 65.56  
 3rd Qu.:1998   3rd Qu.:176.5   3rd Qu.: 74.50  
 Max.   :2018   Max.   :195.0   Max.   :106.00  
  handness   hand_span_cm  
 Left : 4   Min.   : 8.00  
 Right:47   1st Qu.:16.00  
            Median :19.00  
            Mean   :18.98  
            3rd Qu.:21.00  
            Max.   :30.00

Meaning of `summary()`

The result depends on the type of column

For a factor we get

summary(survey$handness)

 Left Right 
    4    47

Other ways of counting

Number of rows

nrow(survey)

[1] 51

Number of rows and columns (dimensions)

dim(survey)

[1] 51  8
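
The two functions are related: `dim()` returns the number of rows followed by the number of columns. A quick sketch with an invented data frame, which also uses `ncol()`, the column counterpart of `nrow()`:

```r
people <- data.frame(
  height_cm = c(170, 162, 185, 165),
  weight_kg = c(56, 75, 65, 63)
)

nrow(people)   # number of rows
ncol(people)   # number of columns
dim(people)    # both at once: rows first, then columns
```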

Counting each case

This command counts how many times each value appears

table(survey$handness)

 Left Right 
    4    47
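
The result of `table()` can be stored and indexed by name like a named vector. A minimal sketch with a made-up vector `hand`:

```r
hand <- c("Right", "Left", "Right", "Right", "Left", "Right")

counts <- table(hand)
counts                   # one count per distinct value
counts["Left"]           # the count for a single value
counts / length(hand)    # proportions instead of counts
```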

Meaning of `summary()`

The result depends on the type of column

For a numeric column we get

summary(survey$hand_span_cm)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.00   16.00   19.00   18.98   21.00   30.00
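
To see how the result changes with the type, the sketch below applies `summary()` to the same kind of information stored in three ways. The vectors are invented for the example.

```r
hand_factor <- factor(c("Left", "Right", "Right"))
hand_text   <- c("Left", "Right", "Right")
span_cm     <- c(8, 16, 19, 21, 30)

summary(hand_factor)   # factor: count of each level
summary(hand_text)     # character: only length, class and mode
summary(span_cm)       # numeric: minimum, quartiles, median, mean, maximum
```

This is one reason why it matters whether text columns were read as factors or as character.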
