Programming in R

Albart Coster

Introduction

Programming: writing commands in a language understandable for a computer. An example:

text <- "Today is"
today <- date()
print(text)
## [1] "Today is"
print(today)
## [1] "Tue May 03 10:19:46 2016"
(Today <- paste(text,today))
## [1] "Today is Tue May 03 10:19:46 2016"

Make use of effort of others: e.g. using the paste function.

Important components of programming in R

Objects

Everything in R is an object. An object is defined by its class:

a <- 1
class(a)
## [1] "numeric"
b <- "R course"
class(b)
## [1] "character"
paste
## function (..., sep = " ", collapse = NULL) 
## .Internal(paste(list(...), sep, collapse))
## <bytecode: 0x000000000f1aad90>
## <environment: namespace:base>
class(paste)
## [1] "function"

Functions

A function:

functionname(argument1 = a1, argument2 = a2, ..., argumentn = an)

Declare and remove objects

a <- 1
a <- 2
print(a)
## [1] 2
rm(a)
exists('a')
## [1] FALSE
a <- "object a"
b <- a
b == a
## [1] TRUE

Note: the ‘==’ operator. Returns TRUE or FALSE

Conditional operators

Suppose that we want to know if a certain condition is true. We use conditional operators:

a <- 1; b <- 2
a < b
## [1] TRUE
a <= b
## [1] TRUE
a > b
## [1] FALSE
a >= b
## [1] FALSE
a == b
## [1] FALSE
a != b
## [1] TRUE

Conditional operations

Suppose that we want to do something depending on a certain outcome:

{
    if(a < b)
        print(" a is smaller than b")
    else if (a > b)
        print(" a is bigger than b")
    else
        print("a and b are equal")
}
## [1] " a is smaller than b"

Logical OR and AND operators:

{
  if(a == 1 & b == 10)
    print("A is equal to 1 AND B is equal to 10")
  else if(a == 1 | b == 10)
    print("A is equal to 1 OR B is equal to 10")
  else
    print("A is not equal to 1 NOR B equal to 10")
}
## [1] "A is equal to 1 OR B is equal to 10"

Vectors

Advantage of factor over character is related to storage.

Basic functions to make vectors

Function c to concatenate, required to join numbers or characters into a vector:

(v1 <- c(1,2))
## [1] 1 2
(v2 <- c("a","b","cc"))
## [1] "a"  "b"  "cc"
(v3 <- c(v1,v1))
## [1] 1 2 1 2
(v4 <- c(v1,v2))
## [1] "1"  "2"  "a"  "b"  "cc"
(v5 <- c(first = 1,second = 2,last = 1000))
##  first second   last 
##      1      2   1000

Function seq to make a sequence:

(seq1 <- seq(from = 0,to = 10,by = 1))
##  [1]  0  1  2  3  4  5  6  7  8  9 10
(seq2 <- seq(from = 1,to = 2,length.out = 3))
## [1] 1.0 1.5 2.0

Indexing a vector

The ‘[’ operator to access specific elements of a vector:

v1 <- seq(0,10)
v1[1]
## [1] 0
v1[10]
## [1] 9
ii <- seq(1,3)
v1[ii]
## [1] 0 1 2

Note: the last option; access various indices 1,2,3 directly. This is called vectorized computing.

Indexing by name

Give name to elements of a vector and use the names to access elements:

v1
##  [1]  0  1  2  3  4  5  6  7  8  9 10
names(v1) <- letters[1:length(v1)]
v1
##  a  b  c  d  e  f  g  h  i  j  k 
##  0  1  2  3  4  5  6  7  8  9 10
v1["b"]
## b 
## 1
v5["first"]
## first 
##     1

Note: the ':' operator is similar to the seq function. For for help: ?':'.

1:3
## [1] 1 2 3

Character vectors

Character vectors are vectors of character strings:

letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
class(letters)
## [1] "character"
(action <- c("programming","in","R"))
## [1] "programming" "in"          "R"
length(action)
## [1] 3
action[1]
## [1] "programming"
action[1:3]
## [1] "programming" "in"          "R"

Function paste

Function paste to combine characters into one character string:

(p1 <- paste(1,letters[1:10],sep = "_"))
##  [1] "1_a" "1_b" "1_c" "1_d" "1_e" "1_f" "1_g" "1_h" "1_i" "1_j"
(p2 <- paste(1:10,letters[1:10],sep = "."))
##  [1] "1.a"  "2.b"  "3.c"  "4.d"  "5.e"  "6.f"  "7.g"  "8.h"  "9.i"  "10.j"
(p3 <- paste(1:3,letters[1:10],sep = "."))
##  [1] "1.a" "2.b" "3.c" "1.d" "2.e" "3.f" "1.g" "2.h" "3.i" "1.j"

Note p3: first argument is recycled.

To concatenate result into one string:

(p4 <- paste(p3,collapse = "/"))
## [1] "1.a/2.b/3.c/1.d/2.e/3.f/1.g/2.h/3.i/1.j"

Function strsplit

Suppose that we have a string and want to split it at certain positions:

(p5 <- strsplit(p4,split = "/"))
## [[1]]
##  [1] "1.a" "2.b" "3.c" "1.d" "2.e" "3.f" "1.g" "2.h" "3.i" "1.j"
p6 <- strsplit(unlist(p5),split = "\\.")
p6[1:2]
## [[1]]
## [1] "1" "a"
## 
## [[2]]
## [1] "2" "b"
(todaysplitted <- unlist(strsplit(today,split = " ")))
## [1] "Tue"      "May"      "03"       "10:19:46" "2016"

Note: strsplit returns a list. Use unlist to transform the list to a character object.

Functions substr and gsub

Suppose that we want a section of the string:

string <- "this is a string"
(sstring <- substr(x = string,start = 1,stop = 4))
## [1] "this"

Suppose that we want to substitute a part of a string:

(nstring <- gsub(pattern = "is",replacement = "was",
                 x = string))
## [1] "thwas was a string"
(nnstring <- gsub(pattern = "[[:space:]]is[[:space:]]"," was ",string))
## [1] "this was a string"

This also introduces the regular expressions. Have a look at ?regexp or google for this topic.

Operator '%in%'

Suppose that we want to know if elements of one vector are present in another vector: operator '%in%':

v1 <- letters[1:10]
v2 <- letters[5:15]
v1%in%v2
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

It does not tells us where each element of v1 appears in v2.

Function match

Suppose that we want to know where each element of v1 appears in v2: function match:

v1
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
v2
##  [1] "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
match(v1,v2)
##  [1] NA NA NA NA  1  2  3  4  5  6
match(v1,c(v2,v2))
##  [1] NA NA NA NA  1  2  3  4  5  6

Note: match only returns the first match.

Function which

Function which to obtain the index where a condition is TRUE:

which(v1 == "a")
## [1] 1
vv <- c(v1,v1,v1)
which(vv == "a")
## [1]  1 11 21

Now, we use which to substitute the elements of vv which are a with A:

vv[which(vv=="a")] <- "A"

Extending the match function:

Suppose that we want all the matches. We need to program:

v22 <- c(v2,v2)
indices <- lapply(v1,function(i)
   which(v22%in%i))

A more elegant way: write a new function called matchall, first check if it is not already defined (help.search('matchall'):

matchall <- function(x,y){
  lapply(x,function(i){
    which(y%in%i)})
}
identical(indices,matchall(v1,v22))
## [1] TRUE

This introduces how to program functions.

Time for exercises

Matrices

Function matrix to create a matrix:

(mat1 <- matrix(1:4,nrow = 2,ncol = 2,byrow = T))
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

Access the elements of the matrix with '[' operator:

mat1[1,1]
## [1] 1
mat1[1,]
## [1] 1 2

Matrices have two dimensions: the ',' separates the dimensions.

Replace elements of the matrix with the [ <- operator:

mat1[,2] <- -10
mat1
##      [,1] [,2]
## [1,]    1  -10
## [2,]    3  -10

Combining matrices

Function rbind to combine two matrices under each other (make sure that the number of columns coincides):

mat2 <- matrix(0,nrow = 2,ncol = ncol(mat1))
mat3 <- rbind(mat1,mat2)
dim(mat3)
## [1] 4 2

Function cbind to combine matrices by column (make sure that the number of rows coincides):

mat4 <- cbind(mat1,mat2)
dim(mat4)
## [1] 2 4

Other functions for matrices

Application of matrix functions

Suppose that we have observations x and dependent observations y and we want to know the linear relationship between x and y:

\(\mathbf{y} = \mathbf{x}b,\)

where b is an unknown regression coefficient. The best, least squares, solution is:

\(\hat{b} = \mathbf{(x'x)^{-1}x'y}.\)

Now: program it in R:

(x <- rep(1:3,length.out = 10))
##  [1] 1 2 3 1 2 3 1 2 3 1
b <- 0.5
(y <- b*x + rnorm(10))
##  [1]  1.33391213  2.53923443  3.19224503  0.02218696  1.14560315
##  [6]  1.21403112 -0.79375053 -0.34268710  0.42364929  1.59210687
(bhat <- solve(t(x)%*%x)%*%t(x)%*%y)
##          [,1]
## [1,] 0.542524

Compare the original b to the estimate of b, bhat.

Note: more efficient way to solve linear models with function lm; see slides on regression.

Objects of class data.frame

The columns of a matrix are of one single class. A data.frame is generalized matrix where each column can be of a different class.

Create an object of class data.frame:

animals <- data.frame(id = letters[1:10],
                      weight = rnorm(mean = 100,10),
                      sex = sample(c("F","M"),10,replace = T))
head(animals)
##   id    weight sex
## 1  a 100.11286   M
## 2  b 100.77140   F
## 3  c 100.61593   M
## 4  d  99.80709   M
## 5  e  99.16148   M
## 6  f 100.62874   F

Manipulating data.frames

We want to extract the weights from animals into a vector weigths:

weights <- animals$weight

We want the subset of the the where weight is more than 100 in a new data.frame animals100p:

animals100p <- animals[animals$weight>100,]

We we want to substitute weight when sex == 'M' by NA:

animals$weight[animals$sex == 'M'] <- NA

We want to add a column to animals with the squared weight:

animals$weight2 <- animals$weight^2

Reading and writing data

Function write.csv write a data.frame as a *.csv file:

write.csv(animals,file = 'animals.csv',row.names= FALSE,quote = FALSE)

Function read.csv to read a csv file into a data.frame:

animals1 <- read.csv(file = 'animals.csv',header = TRUE)

Note: we made a file called animals.csv somewhere on the HD and then read this file back in the memory. Test if a file exists:

if(!file.exists('nonexistentfile.csv'))
    print("File does not exists.")
## [1] "File does not exists."

Time for exercises

Repeating

Suppose that we want to repeat an operation many times:

Sum up random numbers until the sum reaches 10:

s <- 0
n <- 0
while(s<10){
  n <- n + 1
  s <- s + runif(1,0,1)
}
print(s)
## [1] 10.36739
print(n)
## [1] 18

We needed 18 steps to get a sum higher than 10.

Loops: for

Suppose that we want to follow a certain sequence of values and do something at each value, looping:

vec1 <- rnorm(10)
vec2 <- numeric(10)
for(i in 1:length(vec1))
{
    vec2[i] <- abs(vec1[i])
}

Note: try to avoid loops. Make code slow and ugly. Much better and faster solution:

vec2 <- abs(vec1)

Loops: while

Suppose that we want to do something until a certain criterion is met:

i <- 0
while(i<10)
    i <- i + 1

Be careful: do not make endless loops:

i <- 1
while(i>0)
    i <- i + 1

Function apply

Suppose that we have a matrix and want to calculate the sums of each column (columns are the \(2^{\text{nd}}\) dimension of a matrix).

Ugly method:

mat <- matrix(rnorm(20),nrow = 2,ncol = 10)
uglyCS <- numeric(ncol(mat))
for(col in 1:ncol(mat)){
    uglyCS[col] <- sum(mat[,col])
}

Elegant method:

elegantCS <- apply(mat,2,sum)
identical(uglyCS,elegantCS)
## [1] TRUE

Function apply applies a function to a matrix. Here the function sum is applied to each column of mat (\(2^{nd}\) dimension), second argument of apply is 2. Try to calculate the sums of the rows of mat.

Intermezzo: the ChickWeight dataset

The ChickWeight dataset is from an experiment of the effect of diet on the weight of chicken. The variables are:

summary(ChickWeight)
##      weight           Time           Chick     Diet    shape  
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220   0: 96  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120   1: 98  
##  Median :103.0   Median :10.00   20     : 12   3:120   2:102  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118   3: 92  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12           4: 96  
##  Max.   :373.0   Max.   :21.00   19     : 12           5: 94  
##                                  (Other):506

Function tapply

Suppose that we want to obtain mean weight per Diet and per Time period from ChickWeight. We need:

Ugly method:

mns <- list()
for(t in unique(ChickWeight$Time)){
    for(d in unique(ChickWeight$Diet)){
        ii <- with(ChickWeight,which(Diet%in%d&Time%in%t))
        mns[[paste(t,d)]] <-
            mean(ChickWeight$weight[ii])
    }}

Elegant method:

tt <- tapply(ChickWeight$weight,list(ChickWeight$Diet,
                                     ChickWeight$Time),mean)
tt[,1:4]
##      0     2        4        6
## 1 41.4 47.25 56.47368 66.78947
## 2 40.7 49.40 59.80000 75.40000
## 3 40.8 50.40 62.20000 77.90000
## 4 41.0 51.80 64.50000 83.90000

Here we applied the function mean to Diet at each unique time. Again, with less writing using function with:

tt <- with(ChickWeight,tapply(weight,list(Diet,Time),mean))
tt[,1:4]
##      0     2        4        6
## 1 41.4 47.25 56.47368 66.78947
## 2 40.7 49.40 59.80000 75.40000
## 3 40.8 50.40 62.20000 77.90000
## 4 41.0 51.80 64.50000 83.90000

Programming functions

Tools in R are functions. Defining a function is not difficult.

Example: a function to summarize data into a mean and a standard deviation, meansd:

meansd <- function (x, nDec = 2,na.rm = TRUE)
{
  mn <- mean(x, na.rm = na.rm)
  if (!is.na(mn))
    mn <- round(mn, nDec)
  sd <- sd(x, na.rm = na.rm)
  if (!is.na(sd))
    sd <- round(sd, nDec)
  return(paste(mn, " (", sd, ")", sep = ""))
}
meansd(1:10)
## [1] "5.5 (3.03)"

Application of meansd

Suppose that we want to summarize weight for each combination of diet and time into mean and standard deviation:

ttFancy <- tapply(ChickWeight$weight,list(ChickWeight$Diet,ChickWeight$Time),
                  meansd,1)
                  ttFancy[,10:12]
##   18             20             21            
## 1 "158.9 (49.2)" "170.4 (55.4)" "177.8 (58.7)"
## 2 "187.7 (63.3)" "205.6 (70.3)" "214.7 (78.1)"
## 3 "233.1 (57.6)" "258.9 (65.2)" "270.3 (71.6)"
## 4 "202.9 (33.6)" "233.9 (37.6)" "238.6 (43.3)"

Time for exercises