- There are several data types:
  - numeric, character, logical, factor
- They are stored in one of many data structures:
  - vectors
  - lists
  - matrices
  - data frames
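As a minimal sketch of the types and structures listed above, the following creates one object of each kind and asks R how it is stored. All names and values here are invented for illustration.

```r
# Invented example values, one vector per data type
height <- c(170, 162, 185)                      # numeric
name   <- c("Ali", "Ayse", "Can")               # character
tall   <- height > 165                          # logical
hand   <- factor(c("Left", "Right", "Right"))   # factor

class(height)   # "numeric"
class(name)     # "character"
class(tall)     # "logical"
class(hand)     # "factor"

# The same kind of values stored in other data structures
m   <- matrix(1:6, nrow = 2)             # matrix: every element has the same type
lst <- list(height, name)                # list: elements may have different types
df  <- data.frame(name, height, hand)    # data frame: columns of equal length

is.matrix(m)        # TRUE
is.list(lst)        # TRUE
is.data.frame(df)   # TRUE
```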
October 22, 2019
The function used to read text files is
read.table(file, header = FALSE, sep = "", quote = "\"'",
row.names, col.names, na.strings = "NA",
stringsAsFactors = TRUE,
dec = ".", comment.char = "#", ...)
Please take a look at the help page of read.table().
help(read.table)
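The output of this function is a data.frame. The only mandatory argument is file, the name of the file to read. Other useful arguments:
- sep: the field separator, for example "\t" for Tab
- stringsAsFactors: if TRUE (the default in the signature above), text is read as factors. Set it to FALSE to read text as character
- dec: the decimal mark, for example dec="," for numbers in Turkish (European) format
Today we will use data from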
https://anaraven.bitbucket.io/static/2018/cmb1/survey1-tidy.txt
Please download it to your computer and save it in a folder that is easy to find, for example your R working directory
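Before reading the real file, the sep, dec and stringsAsFactors arguments from the signature above can be tried on a small table held in memory. This is only a sketch: the text in txt is invented and is not the survey file, and the text= argument of read.table() is used here so that no file is needed.

```r
# Invented tab-separated text with decimal commas (not the survey file)
txt <- "name\theight\nAli\t1,70\nAyse\t1,62"

# text= makes read.table() parse a string instead of a file
toy <- read.table(text = txt, header = TRUE, sep = "\t", dec = ",",
                  stringsAsFactors = FALSE)

class(toy$name)     # "character", because stringsAsFactors = FALSE
class(toy$height)   # "numeric", because dec = "," marks the decimals
```

Setting stringsAsFactors explicitly also makes the result independent of the R version, since from R 4.0.0 onward its default is FALSE.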
We read data with
survey <- read.table("survey1-tidy.txt")
What can we say about this data?
Data frames always have column names
Each column can be accessed by its name
colnames(survey)
[1] "Gender"       "birth_day"    "birth_month"
[4] "birth_year"   "height_cm"    "weight_kg"
[7] "handness"     "hand_span_cm"
Each column is a vector
survey$handness
 [1] Right Right Left  Right Right Right Right Left
 [9] Right Right Right Right Right Right Right Right
[17] Right Right Right Right Right Right Right Right
[25] Right Right Right Right Left  Right Right Right
[33] Right Right Right Right Right Right Right Right
[41] Right Right Right Right Right Right Right Right
[49] Left  Right Right
Levels: Left Right
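A column can be reached by name in more than one way. The sketch below uses a small invented data frame called toy in place of survey, and checks that three common forms return the same vector.

```r
# Invented data frame standing in for survey
toy <- data.frame(Gender   = c("Female", "Male", "Female"),
                  handness = c("Right", "Left", "Right"),
                  stringsAsFactors = TRUE)

by_dollar  <- toy$handness        # name after $
by_double  <- toy[["handness"]]   # name inside [[ ]]
by_bracket <- toy[, "handness"]   # name in the column position of [ , ]

identical(by_dollar, by_double)    # TRUE
identical(by_dollar, by_bracket)   # TRUE
```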
We can always compare a vector to a constant
survey$handness=="Left"
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [9] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49]  TRUE FALSE FALSE
(notice that we use == for comparisons)
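The comparison is done element by element, so the result is a logical vector of the same length as the original. A short sketch with invented values shows this, and also two things that can be done with the result: counting the TRUE values and finding their positions.

```r
# Invented values, not taken from the survey
hand <- c("Right", "Left", "Right", "Left", "Right")

is_left <- hand == "Left"   # == compares; a single = would assign instead
is_left                     # FALSE TRUE FALSE TRUE FALSE

sum(is_left)     # 2, because each TRUE counts as 1
which(is_left)   # 2 4, the positions of the TRUE values
```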
We can use the result of this comparison as a row index
survey[survey$handness=="Left", ]
     Gender birth_day birth_month birth_year height_cm
st3  Female        28           1       1995       170
st8  Female        14           1       1997       162
st29   Male        28           7       1998       185
st49 Female         2           5       1999       165
     weight_kg handness hand_span_cm
st3         56     Left           18
st8         75     Left           18
st29        65     Left           22
st49        63     Left           17
survey[survey$handness=="Left", "Gender"]
[1] Female Female Male   Female
Levels: Female Male
survey$Gender[survey$handness=="Left"]
[1] Female Female Male   Female
Levels: Female Male
Same result, different ways
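The equivalence of the two forms can be checked directly. The following sketch uses an invented data frame toy (its row names and values are made up) and stores the logical index once so that it can be reused.

```r
# Invented data frame standing in for survey
toy <- data.frame(Gender    = c("Female", "Male", "Female", "Male"),
                  handness  = c("Left", "Right", "Right", "Left"),
                  height_cm = c(170, 180, 162, 185),
                  row.names = c("st1", "st2", "st3", "st4"))

left <- toy$handness == "Left"   # logical vector, one value per row

toy[left, ]                      # all columns, only the rows where left is TRUE

way1 <- toy[left, "height_cm"]   # filter rows, then pick the column
way2 <- toy$height_cm[left]      # pick the column, then filter its elements

identical(way1, way2)            # TRUE
```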
We recommend that you verify the result immediately every time you use read.table
summary(survey)
    Gender     birth_day      birth_month
 Female:30   Min.   : 1.00   Min.   : 1.000
 Male  :21   1st Qu.: 5.00   1st Qu.: 3.500
             Median :13.00   Median : 6.000
             Mean   :13.59   Mean   : 6.353
             3rd Qu.:20.00   3rd Qu.: 9.000
             Max.   :31.00   Max.   :12.000
   birth_year     height_cm       weight_kg
 Min.   :1991   Min.   :155.0   Min.   : 42.50
 1st Qu.:1997   1st Qu.:163.0   1st Qu.: 55.00
 Median :1997   Median :171.0   Median : 64.00
 Mean   :1998   Mean   :170.7   Mean   : 65.56
 3rd Qu.:1998   3rd Qu.:176.5   3rd Qu.: 74.50
 Max.   :2018   Max.   :195.0   Max.   :106.00
  handness   hand_span_cm
 Left : 4   Min.   : 8.00
 Right:47   1st Qu.:16.00
            Median :19.00
            Mean   :18.98
            3rd Qu.:21.00
            Max.   :30.00
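summary() is one way to verify what was read. Two other base R functions often used for the same purpose are str(), which shows the type of each column, and head(), which shows the first rows. A minimal sketch on an invented table, read from a string with text= so that no file is needed:

```r
# Invented table: the header has one field fewer than the data lines,
# so the first field of each data line is used as the row name
txt <- "Gender height_cm\nst1 Female 170\nst2 Male 185\nst3 Female 162"

toy <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

str(toy)       # type of every column: Gender is a factor, height_cm is integer
head(toy, 2)   # the first two rows
dim(toy)       # 3 2
```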
summary()
The result depends on the type of column
For a factor we get
summary(survey$handness)
Left Right
4 47
Number of rows
nrow(survey)
[1] 51
Number of rows and columns (dimensions)
dim(survey)
[1] 51 8
This command counts how many of each value
table(survey$handness)
Left Right
4 47
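The size and counting functions above can be cross-checked against each other: the counts returned by table() must add up to the number of rows. A short sketch on an invented data frame:

```r
# Invented data frame, not the survey
toy <- data.frame(handness     = c("Right", "Left", "Right", "Right"),
                  hand_span_cm = c(19, 18, 21, 16),
                  stringsAsFactors = TRUE)

nrow(toy)   # 4
ncol(toy)   # 2
dim(toy)    # 4 2, rows first and then columns

counts <- table(toy$handness)   # how many times each value appears
counts["Left"]                  # 1
counts["Right"]                 # 3

sum(counts) == nrow(toy)        # TRUE
```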
For a numeric column we get
summary(survey$hand_span_cm)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   8.00   16.00   19.00   18.98   21.00   30.00
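To see the dependence on type side by side, the sketch below calls summary() on the same invented values stored as numeric, factor and character. The character case is not shown in the slides and is added here only for comparison.

```r
# Invented values, not taken from the survey
span <- c(18, 22, 17, 19)
hand <- c("Left", "Right", "Right", "Right")

num_summary <- summary(span)           # minimum, quartiles, mean, maximum
fac_summary <- summary(factor(hand))   # one count per level
chr_summary <- summary(hand)           # only length, class and mode

num_summary[["Mean"]]    # 19
fac_summary[["Right"]]   # 3
```

This is one reason to check column types after read.table(): a text column read as character gives a much less informative summary than the same column read as a factor.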