Part 1 - Nuts and Bolts
Contents
1.1 Getting Started
1.1.1 R Console Input
1.1.2 Working Directory and Files
1.1.3 R Objects and Attributes
1.1.4 Sequence of Numbers
1.2 Basic Data Types in R
1.2.1 Vectors & Lists
1.2.2 Missing Values
1.2.3 Subsetting Vectors and Lists
1.2.4 Matrices & Data Frames
1.2.5 Factors
1.3 Reading Data to R
1.3.1 Tabular Data & Textual Data Formats
1.3.2 Connections: Interfaces to the Outside World

1.1 Getting Started
1.1.1 R Console Input
Input an expression and R will print the result immediately.
When an assignment operator ("<-" or "->") is used, R stores the result and does not print it unless you type the variable name or call the print() function. The comment sign is "#".
> x <- 5
> x ## Or print(x)
[1] 5
1.1.2 Working Directory and Files
These functions work much like terminal command-line tools.
Basic commands:
- getwd(): to get the current working directory
- ls(): to list the objects in the current workspace
- list.files(): to list all the files in the current working directory
- args(function_name): to see what parameters a function takes
- setwd("dir"): to set the working directory to the specified directory
More functions about directory and files:
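- dir.create("dir_name", recursive = FALSE): to create a directory
- file.create("file_name"): to create an empty file
- file.exists("f"), file.info("f"), file.rename("f1", "f2"), file.copy("f1", "f2"): to test for, inspect, rename and copy files
- file.path("f1", "f2", "f3"): to build the relative path f1/f2/f3
- unlink("f", recursive = FALSE): to delete files; set recursive = TRUE to delete a directory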
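The file functions above can be tried safely inside a temporary directory. The following is only a sketch; the names demo_dir, a.txt and b.txt are hypothetical and chosen for illustration.

```r
# Work under tempdir() so that no real project files are touched.
base <- tempfile("demo_dir")          # a unique path under tempdir()
dir.create(base)
old_path <- file.path(base, "a.txt")
new_path <- file.path(base, "b.txt")
file.create(old_path)                 # creates an empty file
renamed <- file.rename(old_path, new_path)
contents <- list.files(base)          # "b.txt"
unlink(base, recursive = TRUE)        # recursive = TRUE is needed for a directory
still_there <- dir.exists(base)       # FALSE
```

Building paths with file.path() rather than pasting strings keeps the code portable across operating systems.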
Tab completion works in R as well.
1.1.3 R Objects and Attributes
Atomic classes of objects in R:
- Character
- Numeric (double precision real numbers)
- Integer
- Complex
- Logical
Numbers in R are treated as numeric (double precision) by default. Add the "L" suffix if you explicitly want an integer, e.g. 1L. The special values Inf (infinity) and NaN (not a number) are also defined in R.
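A quick way to check the difference between a numeric and an integer value is to ask for the class. This is a small sketch with made-up variable names.

```r
a <- 1                       # numeric (double) by default
b <- 1L                      # integer, because of the L suffix
class_a <- class(a)          # "numeric"
class_b <- class(b)          # "integer"
pos_inf <- 1 / 0             # Inf
not_num <- 0 / 0             # NaN
```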
R objects can have attributes: names, dimnames, dimensions, class, length and others. They can be accessed by the attributes() function.
1.1.4 Sequence of Numbers
The colon operator ":" is the most common way to create a sequence.
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> pi:10
[1] 3.141593 4.141593 5.141593 6.141593 7.141593 8.141593 9.141593
> 15:1
 [1] 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
The [1] at the start of each output indicates that the result is a vector (which contains elements of the same class) and that the element following it is the first element of the vector. If the output is printed on two lines, each line starts with the index of its first element:
[1] 15 14 13 12 11 10  9  8
[9]  7  6  5  4  3  2  1
The seq() function does similar work. Its advantage is that it can control the increment and the length, e.g.
> seq(1, 5, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> seq(1, 10, length = 7)
[1]  1.0  2.5  4.0  5.5  7.0  8.5 10.0
rep() (replicate) is another function for creating a sequence.
> rep(0, times = 10)
 [1] 0 0 0 0 0 0 0 0 0 0
> rep(c(0, 1, 2), times = 5)
 [1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
> rep(c(0, 1, 2), each = 5)
 [1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2
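As a further sketch, seq() also accepts the full argument name length.out, and rep() accepts a vector for times so that each element is repeated a different number of times. The variable names are illustrative only.

```r
evens <- seq(2, 20, by = 2)                    # 2 4 6 ... 20
grid <- seq(0, 1, length.out = 5)              # 0.00 0.25 0.50 0.75 1.00
pattern <- rep(c("a", "b"), times = c(3, 1))   # "a" "a" "a" "b"
```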
1.2 Basic Data Types in R
1.2.1 Vectors & Lists
The vector is the most common object in R, and it can only contain objects of the same class. A list is similar to a vector but can contain objects of different classes.
The c() function (combine / concatenate) can be used to create vectors.
> x <- c(0.5, 0.6)
> x <- c("a", "b", "c")
> x <- c(1+0i, 3+4i)
The vector() function works as well.
> x <- vector("numeric", length = 10)
The vector x is then initialized with the default value (0 for numeric vectors).
Vectors can be used in arithmetic expressions. Common arithmetic operators and functions include "+", "-", "*", "/", "^" (power), sqrt(), abs(), etc., e.g.
> z <- c(1, 2, 3)
> z + 100
[1] 101 102 103
> sqrt(z - 1)
[1] 0.000000 1.000000 1.414214
Other operations for vectors include max, min, range (returns c(min, max)), length, sum, prod, mean (returns the average), var (returns the variance), sort, etc.
When two vectors of the same length are involved in an arithmetic expression, R performs the operation element by element (vectorized operations).
If they are of different lengths, R recycles the shorter vector (note that a single number can be viewed as a vector of length 1). R gives a warning if the shorter length does not divide the longer length, e.g.
> x <- c(1, 2, 3, 4, 5, 6)
> y <- c(1, 10, 1, 10, 1, 10)
> x + y
[1]  2 12  4 14  6 16
> y <- c(1, 10, 100)
> x + y
[1]   2  12 103   5  15 106
> y <- c(1, 10, 100, 1000)
> x + y
[1]    2   12  103 1004    6   16
Warning message:
In x + y : longer object length is not a multiple of shorter object length
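Recycling can be seen in a small sketch with made-up numbers: a length-1 vector is reused for every element, and a length-2 vector is reused twice against a length-4 vector.

```r
prices <- c(10, 20, 30, 40)
# The single number 2 is recycled across all four elements.
doubled <- prices * 2                # 20 40 60 80
# c(0, 5) is recycled as 0, 5, 0, 5.
with_fee <- prices + c(0, 5)         # 10 25 30 45
```

Because 2 divides 4, no warning is raised in the second case.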
Logical vectors:
> x <- c("a", "b", "c", "c", "d", "a")
> u <- x > "a"
> u
[1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
Comparison and logical operators: >, <, ==, >=, <=, !=, &, |, !, xor()
There are also && and ||, which work on single logical values rather than element by element. Older versions of R used only the first element of each operand; R 4.3.0 and later raise an error if an operand has length greater than 1.
Character vectors can be combined using both the c() and paste() functions.
> my_char <- c("My", "name", "is")
> paste(my_char, collapse = " ")
[1] "My name is"
> c(my_char, "Niwatori")
[1] "My"       "name"     "is"       "Niwatori"
> paste("Hello", "world!", sep = " ")
[1] "Hello world!"
> paste(1:3, c("X", "Y", "Z"), sep = "")
[1] "1X" "2Y" "3Z"
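The difference between the sep and collapse arguments of paste() is easy to mix up. In this sketch (the labels are hypothetical), sep joins the arguments element by element, while collapse joins the elements of the result into a single string.

```r
ids <- 1:3
labels <- paste("sample", ids, sep = "_")      # "sample_1" "sample_2" "sample_3"
one_string <- paste(labels, collapse = ", ")   # "sample_1, sample_2, sample_3"
```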
When you try to mix objects of different classes in a vector, implicit coercion turns them into the same class. R coerces to the most flexible class present, in the order logical < integer < numeric < complex < character.
> c(1.7, "a")
[1] "1.7" "a"
> c(TRUE, 2)
[1] 1 2
Explicit coercion can be done with the as.* functions.
> x <- 0:4
> as.numeric(x)
[1] 0 1 2 3 4
> as.character(x)
[1] "0" "1" "2" "3" "4"
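Explicit coercion does not always succeed. As a sketch, a string that cannot be read as a number becomes NA, and R raises a warning (suppressed here only to keep the example quiet).

```r
ints <- as.integer(c(TRUE, FALSE, TRUE))                 # 1 0 1
flags <- as.logical(c(0, 1, 2))                          # FALSE TRUE TRUE
nums <- suppressWarnings(as.numeric(c("1.5", "abc")))    # 1.5 NA
```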
Lists are similar to vectors except that lists can contain objects of different classes, and each element of a list is kept as its own object, printed under its own [[i]] header.
> list(1, "a", TRUE, 1+4i)
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
1.2.2 Missing Values
Missing values are denoted by NA (Not Available) or NaN (Not a Number) for undefined mathematical operations.
NaN occurs if you try to compute 0 / 0 or Inf - Inf, where Inf stands for infinity. The function is.na() tests whether objects are NA, and is.nan() tests for NaN. A NaN value is also NA, but not vice versa.
> x <- c(1, 2, NA, NaN, 3)
> is.na(x)
[1] FALSE FALSE  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE  TRUE FALSE
Note that the command "x == NA" does NOT behave the same as "is.na(x)". In "x == NA", each element of x is compared with NA; since NA stands for an unknown value, the result of every comparison is itself unknown, so NA is returned for each element, i.e.
> x == NA
[1] NA NA NA NA NA
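Because is.na() returns a logical vector, it can be summed to count missing values. This sketch also uses the na.rm argument of sum(), which drops both NA and NaN before adding.

```r
x <- c(1, 2, NA, NaN, 3)
n_missing <- sum(is.na(x))         # 2: NaN counts as NA too
n_nan <- sum(is.nan(x))            # 1
total <- sum(x, na.rm = TRUE)      # 6
all_unknown <- all(is.na(x == NA)) # TRUE: every comparison with NA is NA
```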
To remove missing values, logical indexing with the is.na() and complete.cases() functions is often used.
> x <- c(1, 2, NA, 4, NA, 6)
> x[!is.na(x)]
[1] 1 2 4 6
> y <- c("a", NA, "c", "d", NA, "f")
> good <- complete.cases(x, y)
> x[good]
[1] 1 4 6
> y[good]
[1] "a" "d" "f"
> myd ## A data frame
  Names First Second Third
1 Alice     1      2     3
2   Bob     2      3     4
3 Carol    NA      4     5
4  Dave     4     NA     6
> good <- complete.cases(myd)
> myd[good, ]
  Names First Second Third
1 Alice     1      2     3
2   Bob     2      3     4
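The construction of myd is not shown above, so the following sketch rebuilds a data frame with the same values (an assumption about how it was created) and filters it with complete.cases().

```r
# Hypothetical reconstruction of the myd data frame printed above.
myd <- data.frame(
  Names  = c("Alice", "Bob", "Carol", "Dave"),
  First  = c(1, 2, NA, 4),
  Second = c(2, 3, 4, NA),
  Third  = c(3, 4, 5, 6)
)
good <- complete.cases(myd)   # TRUE only for rows without any NA
clean <- myd[good, ]
n_clean <- nrow(clean)        # 2
```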
1.2.3 Subsetting Vectors and Lists
For subsetting vectors, the single square bracket operator [] is most commonly used.
> x <- c(1, 2, 3, 4, 5, 5, 5, 5, 5, NA, NA, NA, 6, 7, 8, 9)
> x[2] ## Positive integer index
[1] 2
> x[1:5]
[1] 1 2 3 4 5
> x[c(3, 5, 7, 9, 11)]
[1]  3  5  5  5 NA
> y <- x[!is.na(x)] ## Logical index
> y
 [1] 1 2 3 4 5 5 5 5 5 6 7 8 9
> y[y > 5]
[1] 6 7 8 9
> x[!is.na(x) & x > 5]
[1] 6 7 8 9
The which() function returns the indices of the elements for which the expression is TRUE.
Asking for index 0 returns an empty vector, and asking for an index beyond the end of the vector returns NA, so neither is useful. Be cautious! Negative indices, however, do make sense: they exclude the given elements.
> x <- 1:10
> x[c(-2, -7)] ## Negative integer index
[1]  1  3  4  5  6  8  9 10 ## All numbers except x[2] & x[7]
> x[-c(2, 7)] ## Putting the negative sign in front also works
[1]  1  3  4  5  6  8  9 10
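A short sketch of which() and of the edge cases for index 0 and out-of-range indices, using a made-up vector:

```r
x <- c(3, NA, 8, 1, 10)
idx <- which(x > 5)       # 3 5: positions where the test is TRUE; NA is skipped
picked <- x[idx]          # 8 10
empty <- x[0]             # numeric(0), an empty vector
beyond <- x[10]           # NA, because x has only 5 elements
```

Unlike plain logical indexing, which() drops the NA results of the comparison, so x[which(x > 5)] never contains NA.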
Modifying subsets:
> x <- c(-2:5, rep(NA, 4))
> x
 [1] -2 -1  0  1  2  3  4  5 NA NA NA NA
> x[is.na(x)] <- -1
> x
 [1] -2 -1  0  1  2  3  4  5 -1 -1 -1 -1
> x[x < 0] <- -x[x < 0] ## Same as x <- abs(x)
> x
 [1] 2 1 0 1 2 3 4 5 1 1 1 1
R objects can have names, which helps in writing readable code.
The names of a vector can be accessed and set with the names() function.
> x <- c(foo = 1, bar = 2, norf = 3)
> x
 foo  bar norf
   1    2    3
> names(x)
[1] "foo"  "bar"  "norf"
Equivalently, the names can be assigned afterwards:
> x <- c(1, 2, 3)
> names(x) <- c("foo", "bar", "norf")
> x
 foo  bar norf
   1    2    3
Now we can subset the vector by name.
> x["bar"]
bar
  2
> x[c("foo", "bar")]
foo bar
  1   2
Other operators used for extracting subsets of R objects:
- []: returns an object of the same class, can extract multiple elements
- [[]]: extracts a single element of a list or a data frame, returns an object with a type not necessarily the same as the original
- $: extract a single element of a list or a data frame by name
Examples of [[]] and $ operators for subsetting lists:
> x <- list(1:4, 0.6)
> x
[[1]]
[1] 1 2 3 4
[[2]]
[1] 0.6
> x[1] ## Returns a list containing a numeric vector
[[1]]
[1] 1 2 3 4
> x[[1]] ## Returns simply a numeric vector
[1] 1 2 3 4
> names(x) <- c("foo", "bar")
> x
$foo
[1] 1 2 3 4
$bar
[1] 0.6
> x$foo ## x$foo == x[["foo"]] == x[[1]]
[1] 1 2 3 4
> x[1:2] ## Returns a list
$foo
[1] 1 2 3 4
$bar
[1] 0.6
Differences between [[]] and $ operators when subsetting by names:
- Comparing the commands x$foo and x[["foo"]], we see that the [[]] operator can be used with computed indices while the $ operator can only be used with literal names.
- The $ operator allows partial matching of names, while the [[]] operator does not unless you set exact = FALSE.
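Both differences can be seen in a small sketch. The variable key is hypothetical and stands for any name computed at run time.

```r
x <- list(foo = 1:4, bar = 0.6)
key <- "foo"                               # a computed (non-literal) name
by_computed <- x[[key]]                    # 1 2 3 4
by_dollar <- x$key                         # NULL: $ looks for an element literally named "key"
partial_dollar <- x$f                      # 1 2 3 4: "f" partially matches "foo"
partial_strict <- x[["f"]]                 # NULL: [[ ]] matches names exactly by default
partial_loose <- x[["f", exact = FALSE]]   # 1 2 3 4
```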
The [[]] operator can take an integer sequence to extract a single element from nested lists, which is equivalent to using bracket operators multiple times.
> x <- list(a = list(2, 3, 4), b = c(5, 6))
> x[[1]][1]
[[1]]
[1] 2
> x[[1]][[1]]
[1] 2
> x[[c(1, 1)]]
[1] 2
> x$a[[1]]
[1] 2
1.2.4 Matrices & Data Frames
Matrices are vectors with a dimension attribute, which is an integer vector of length 2 (nrow, ncol). So the first way to create a matrix from a vector is to add a dimension attribute. Note that matrices are constructed column-wise.
> m <- 1:10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
Matrices can also be created using the matrix() function.
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Matrices can be created by column-binding or row-binding with the cbind() or rbind() function.
> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
Matrices can be subsetted with x[i, j] type indices, where i and j can be missing.
When a single element of a matrix is extracted, it is returned as a vector of length 1 rather than a 1 x 1 matrix. This behavior can be turned off by setting drop = FALSE. The same applies when extracting a single row or a single column.
> x <- matrix(1:6, 2, 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[1, 2]
[1] 3
> x[1, ]
[1] 1 3 5
> x[1, , drop = FALSE]
     [,1] [,2] [,3]
[1,]    1    3    5
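The effect of drop = FALSE is easiest to see by checking the dimension attribute of the result, as in this sketch:

```r
x <- matrix(1:6, nrow = 2, ncol = 3)
row_vec <- x[1, ]                  # plain vector: dim(row_vec) is NULL
row_mat <- x[1, , drop = FALSE]    # still a matrix with dim c(1, 3)
col_mat <- x[, 2, drop = FALSE]    # still a matrix with dim c(2, 1)
```

Keeping the matrix shape matters when the result is passed on to code that expects two dimensions, such as matrix multiplication.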
Vectorized operations work for matrices as well. Note that x * y yields a matrix in which each entry of x is multiplied by the corresponding entry of y, while x %*% y is true matrix multiplication.
> x <- matrix(1:4, 2, 2)
> y <- x
> x * y
     [,1] [,2]
[1,]    1    9
[2,]    4   16
> x %*% y
     [,1] [,2]
[1,]    7   15
[2,]   10   22
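With non-square matrices the two operators also differ in which shapes they accept. In this sketch (values chosen for illustration), a %*% b works because the inner dimensions match, whereas a * b would fail because the shapes differ.

```r
a <- matrix(1:6, nrow = 2, ncol = 3)   # 2 x 3
b <- matrix(1:6, nrow = 3, ncol = 2)   # 3 x 2
prod_ab <- a %*% b                     # 2 x 2 matrix product
squares <- a * a                       # 2 x 3, each entry squared
# a * b is not run here: it would raise a "non-conformable arrays" error.
```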
The row and column names of a matrix can be set with dimnames(), which must be given a list containing the names of the rows and the columns.
> x <- matrix(1:6, nrow = 2, ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> dimnames(x) <- list(c("r1", "r2"), c("c1", "c2", "c3"))
> x
   c1 c2 c3
r1  1  3  5
r2  2  4  6
Similar to matrices, data frames are used to store tabular data as well, but data frames can contain objects of different classes while matrices cannot.
Data frames have row names and column names, which are accessed with rownames() and colnames(). The row names are 1, 2, 3, etc. by default.
> my_matrix <- matrix(1:20, nrow = 4, ncol = 5)
> patients <- c("Bill", "Gina", "Kelly", "Sean")
> cbind(patients, my_matrix) ## Wrong! Implicit coercion from numeric to character
[1,] "Bill"  "1" "5" "9"  "13" "17"
[2,] "Gina"  "2" "6" "10" "14" "18"
[3,] "Kelly" "3" "7" "11" "15" "19"
[4,] "Sean"  "4" "8" "12" "16" "20"
> my_data <- data.frame(patients, my_matrix)
> my_data
  patients X1 X2 X3 X4 X5
1     Bill  1  5  9 13 17
2     Gina  2  6 10 14 18
3    Kelly  3  7 11 15 19
4     Sean  4  8 12 16 20
> colnames(my_data) <- c("patient", "age", "weight", "bp", "rating", "test")
> my_data
  patient age weight bp rating test
1    Bill   1      5  9     13   17
2    Gina   2      6 10     14   18
3   Kelly   3      7 11     15   19
4    Sean   4      8 12     16   20
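As a sketch of why a data frame is preferable to cbind() here, each column keeps its own class and can be used directly in a row filter. The ages are made-up values, and stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version.

```r
patients <- c("Bill", "Gina", "Kelly", "Sean")
my_data <- data.frame(patient = patients,
                      age = c(34, 51, 29, 46),   # hypothetical ages
                      stringsAsFactors = FALSE)
col_classes <- sapply(my_data, class)    # patient: "character", age: "numeric"
over_40 <- my_data[my_data$age > 40, ]   # rows for Gina and Sean
```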
1.2.5 Factors
Factors are used to represent categorical data: each value is a label drawn from a fixed set of levels, which is stored in the levels attribute.
> x <- factor(c("y", "y", "n", "y", "n"))
> x
[1] y y n y n
Levels: n y
> table(x) ## Show how many objects of each level
x
n y
2 3
> unclass(x) ## Strip the classes out of objects
[1] 2 2 1 2 1
attr(,"levels")
[1] "n" "y"
The order of the levels can be set using the levels argument of factor(). (Assigning to levels() renames the levels rather than reordering them.) The order can be important because the first level is sometimes used as the baseline level, e.g.
> x <- factor(c("y", "y", "n", "y", "n"), levels = c("y", "n"))
> x
[1] y y n y n
Levels: y n
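The following sketch compares the default level order with an explicit one. It also uses relevel(), a base R function not mentioned above, as one more way to choose the baseline level.

```r
x <- factor(c("y", "y", "n", "y", "n"))
default_levels <- levels(x)               # "n" "y": alphabetical by default
x2 <- factor(x, levels = c("y", "n"))     # same data, explicit level order
codes <- as.integer(x2)                   # 1 1 2 1 2: "y" is now level 1
x3 <- relevel(x, ref = "y")               # moves "y" to the front as the baseline
```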
1.3 Reading Data to R
1.3.1 Tabular Data & Textual Data Formats
The most commonly used functions for reading tabular data are read.table() and read.csv(). The two functions are almost identical, except that the default separator for the former is whitespace while for the latter it is the comma, and read.csv() expects a header row by default.
The read.table() function takes quite a few parameters, many of which have default values. Specifying these options (for example colClasses and nrows) instead of relying on the defaults can make it run faster.
> data <- read.table("foo.txt")
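As a rough sketch of specifying options explicitly, the example writes a tiny whitespace-separated file to a temporary path and reads it back with the column classes and row count given up front. The file contents and column names are made up for illustration.

```r
path <- tempfile(fileext = ".txt")
writeLines(c("id score label", "1 2.5 a", "2 3.5 b"), path)
data <- read.table(path, header = TRUE,
                   colClasses = c("integer", "numeric", "character"),
                   nrows = 2)             # number of data rows to read
unlink(path)                              # remove the temporary file
```

With colClasses given, read.table() does not have to guess the type of each column, which is where most of the saving comes from on large files.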
The dump() and dput() functions write objects in a textual format that preserves the metadata, at some cost in readability and memory. A textual format saves other users from specifying the metadata all over again, and it makes the data potentially recoverable in case of corruption.
> y <- data.frame(a = 1, b = "a")
> dput(y)
structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a", "b"), row.names = c(NA, -1L), class = "data.frame")
> dput(y, file = "test.R")
> newy <- dget("test.R")
> newy
  a b
1 1 a
dput() and dget() are used to write and read a single object in textual format. dump() and source() do the same job, but they can handle multiple objects at once.
> x <- "foo"
> y <- data.frame(a = 1, b = "a")
> dump(c("x", "y"), file = "test.R")
> rm(x, y) ## Remove variables x and y
> source("test.R")
> x
[1] "foo"
> y
  a b
1 1 a
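A round trip through dput() and dget() can be checked programmatically. This sketch writes to a temporary file instead of test.R so that nothing is left in the working directory.

```r
y <- data.frame(a = 1, b = "a")
path <- tempfile(fileext = ".R")
dput(y, file = path)                      # write the textual representation
newy <- dget(path)                        # read it back as an R object
round_trip_ok <- isTRUE(all.equal(y, newy))
unlink(path)
```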
1.3.2 Connections: Interfaces to the Outside World
Data are read in through connection interfaces.
- file: opens a connection to a file
- url: opens a connection to a webpage
- gzfile: opens a connection to a file compressed with gzip
- etc.
The file() function takes a few parameters, of which description (the name of the file) and open (the mode) are the most commonly used. The open modes are "r", "w" and "a" for reading, writing and appending in text mode, and "rb", "wb" and "ab" for the same in binary mode.
Here are two examples:
> con <- file("foo.txt", "r")
> data <- read.csv(con)
> close(con)
is the same as
> data <- read.csv("foo.txt")
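An explicit connection becomes useful when only part of a file is needed. As a sketch, the example writes two lines through a connection opened in "w" mode and then reads back only the first line through one opened in "r" mode. The file is temporary and its contents are made up.

```r
path <- tempfile(fileext = ".txt")
con <- file(path, "w")                    # open for writing
writeLines(c("first line", "second line"), con)
close(con)
con <- file(path, "r")                    # open for reading
first <- readLines(con, n = 1)            # read just one line
close(con)
unlink(path)
```

Closing each connection once it is no longer needed releases the underlying file handle.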
Reading webpages:
> con <- url("http://www.baidu.com/", "r")
> x <- readLines(con)
> head(x)
[1] " " ...
[2] "