John Bastiaansen
In this session, we will cover the following topics:
We have seen how to create new data
c(), seq(), :, …matrix()data.frame()But: in most of cases we already have the data:
We want to get the external data into R
read.table:args(read.table)## function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
## numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
## col.names, as.is = !stringsAsFactors, na.strings = "NA",
## colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
## fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
## comment.char = "#", allowEscapes = FALSE, flush = FALSE,
## stringsAsFactors = default.stringsAsFactors(), fileEncoding = "",
## encoding = "unknown", text, skipNul = FALSE)
## NULL
df <- read.table("filename.txt", header = TRUE)
head(df)## Va V2 V3 V4 Vee
## 1 1 5 9 13 17
## 2 2 6 10 14 18
## 3 3 7 11 15 19
## 4 4 8 12 16 20
Importing comma separated data (.csv files)
df <- read.csv("filename.csv", header = TRUE)
head(df)## X V1 V2 V3 V4 V5
## 1 1 1 5 9 13 17
## 2 2 2 6 10 14 18
## 3 3 3 7 11 15 19
## 4 4 4 8 12 16 20
Reading data from the internet:
df <- read.table("http://rcursus.dairyconsult.nl/lime.prn",header = TRUE)
head(df)## type rate pH
## 1 AL 0 5.735
## 2 AL 0 5.730
## 3 AL 0 5.750
## 4 AL 1 5.845
## 5 AL 0 5.735
## 6 AL 0 5.770
These functions will result in objects of class data.frame
TRUE and FALSE are logical constants, you can also use abbreviations T and F'' and double-quotes "" symbols are valid as long as the left quote is the same as the right quoteargs(functionName) will show the arguments for functionName with default values, try args(read.table)read.table("filename.txt" , TRUE , "\t" )read.table(file="filename.txt" , header=T , sep="\t" , ...)read.table("filename.txt" , header=T, "\t" , ...)| Bracket | Use |
|---|---|
() |
For function calls (print(l)) and to set priorities (a = 3*(1+2)) |
[] |
For indexing vectors (a[1]), matrices |
(m[1,1]), or lists (l[[1]]) |
|
{} |
Block delimiter for grouping commands: (if(!exists(a)){a = 1; print(a)}) |
ChickWeight dataset from R:
ChickWeight is a data.frame?ChickWeight for a descriptiondata()data(ChickWeight)
names(ChickWeight)## [1] "weight" "Time" "Chick" "Diet"
ChickWeight dataAn overview of the complete dataset:
data.frame :dim(ChickWeight)## [1] 578 4
names(ChickWeight)## [1] "weight" "Time" "Chick" "Diet"
ChickWeight dataA graphical overview of the dataset:
boxplot(weight~Time,data = ChickWeight, xlab="Time", ylab="weight")How to make plots like this will be explained in the course section Graphics
First we use the function summary() to get a quick overview of the data:
summary(ChickWeight)## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
Later on we will use all the data in ChickWeight, but for now we only want the weights of each chick on day 21:
CW21 <- subset(ChickWeight, Time == 21)
summary(CW21)## weight Time Chick Diet
## Min. : 74.0 Min. :21 13 : 1 1:16
## 1st Qu.:167.0 1st Qu.:21 9 : 1 2:10
## Median :205.0 Median :21 20 : 1 3:10
## Mean :218.7 Mean :21 10 : 1 4: 9
## 3rd Qu.:266.0 3rd Qu.:21 17 : 1
## Max. :373.0 Max. :21 19 : 1
## (Other):39
To make CW21 we used conditional subsetting (Time == 21).
Available comparison operators are:
< less than> greater than== equal to<= less than or equal to>= greater than or equal to!= NOT equal to (! symbol indicates negation)is.na(x) tests if x has missing valuesLogical operators to combine expressions are:
& logical AND| logical OR! logical NOT (negation)Often, the use of indexes will be more convenient than using the function subset():
CW21[CW21$weight >= 250, ]## weight Time Chick Diet
## 84 305 21 7 1
## 167 266 21 14 1
## 232 331 21 21 2
## 280 265 21 25 2
## 292 251 21 26 2
## 328 309 21 29 2
## 352 256 21 31 3
## 364 305 21 32 3
## 388 341 21 34 3
## 400 373 21 35 3
## 436 290 21 38 3
## 448 272 21 39 3
## 460 321 21 40 3
## 484 281 21 42 4
## 554 322 21 48 4
## 578 264 21 50 4
To understand how subsetting with indexes works we break it down in two steps
sel <- CW21$Diet == "1"
sel## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE
sel is a logical vector that is TRUE for all the rows of CW21 that match our criteria
CW21.subset1 <- CW21[sel, ]
dim(CW21.subset1)## [1] 16 4
Sometimes, we want to subset based on multiple variables:
CW21.subset2 <- CW21[(CW21$Diet == "1" & CW21$weight >= 250), ]
CW21.subset2## weight Time Chick Diet
## 84 305 21 7 1
## 167 266 21 14 1
is.data.frame(CW21)## [1] TRUE
dim(CW21)## [1] 45 4
head(CW21)## weight Time Chick Diet
## 12 205 21 1 1
## 24 215 21 2 1
## 36 202 21 3 1
## 48 157 21 4 1
## 60 223 21 5 1
## 72 157 21 6 1
names(CW21)## [1] "weight" "Time" "Chick" "Diet"
table(CW21$Diet)##
## 1 2 3 4
## 16 10 10 9
The sample variance is defined as
\(\sigma^2 = \frac{1}{n-1}\sum_{i =1}^n\left(x_i - \frac{1}{n}\sum_{i = 1}^nx_i\right)^2\)
So we can calculate
x <- 1:100
sum((x - mean(x))^2)/(length(x) - 1)## [1] 841.6667
But of course a function exists to calculate the sample variance:
var(x)## [1] 841.6667
The same is true for many other functions: mean, sd, median, quantile, max, min, range, … We will try some of these functions in the exercises.
In R, a missing value is represented as NA
NA’s will be carried through in computationsNA will always return NA as the resultWorking with NA can give unexpected results:
aa <- 1:10
bb <- c(1:10, NA)
mean(aa)## [1] 5.5
mean(bb)## [1] NA
mean(bb, na.rm = TRUE)## [1] 5.5
Many functions have an argument na.rm
R functions are available for many distributions. For most distributions, four functions are available, the density, probability, quantile, and random function. These functions are identified by the starting letter being d, p, q, r:
dnorm(x, mean=0, sd=1, log = FALSE)
pnorm(q, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean=0, sd=1)plot(dnorm(seq(-4,4,by=0.1))~seq(-4,4,by=0.1), ylab = 'density')x1 <- seq(-4, 4, 0.1)
x2 <- c(1e-04, 0.001, 0.01, 0.05, 0.1, 0.3, 0.5,0.7, 0.9, 0.95, 0.99, 0.999, 0.9999)
par(mfrow = c(1, 4), mar = c(2, 2, 2, 2))
plot(x1, dnorm(x1), main = "density")
plot(x1, pnorm(x1), main = "distribution function")
plot(x1, rnorm(x1), main = "random numbers")
plot(x2, qnorm(x2), main = "quantile function")A large number of distributions are available:
| Distribution | Function |
|---|---|
| beta | dbeta(x, shape1, shape2, ncp=0, log=F) |
| binomial | dbinom}(x, size, prob, log=F) |
| chi-squared | dchisq(x, df, ncp, log=F) |
| exponential | dexp(x, rate, log=F) |
| F | df(x, df1, df2, ncp, log=F) |
| gamma | dgamma(x, shape, rate, scale, log=F) |
| log-normal | dlnorm(x, meanlog, sdlog, log=F) |
| logistic | dlogis(x, location, scale, log=F) |
| normal | dnorm(x, mean, sd, log=F) |
| Poisson | dpois(x, lambda, log=F) |
| t | dt(x, df, ncp, log=F) |
| uniform | dunif(x, min, max, log=F) |
| weibull | dweibull(x, shape, scale, log=F) |
Using the distribution functions as statistical tables for:
qnorm(0.05)## [1] -1.644854
qt(0.05, df = 50)## [1] -1.675905
pnorm(-1.645)## [1] 0.04998491
pt(-1.676, df = 50)## [1] 0.04999064
Hypothesis testing of problems with one variable: the t-test:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)Applying a t-test to the weights on days 21:
t.test(CW21$weight)##
## One Sample t-test
##
## data: CW21$weight
## t = 20.515, df = 44, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 197.2048 240.1730
## sample estimates:
## mean of x
## 218.6889
This way we test if weights are different from zero, which they obviously are.
We can also compare the mean of different groups: Are Chicks on Diet 3 heavier than Chicks on Diet 1? For this question we compare 2 samples in the t.test:
t.test(CW21$weight[CW21$Diet == "3"], CW21$weight[CW21$Diet =="1"], var.equal = T)##
## Two Sample t-test
##
## data: CW21$weight[CW21$Diet == "3"] and CW21$weight[CW21$Diet == "1"]
## t = 3.5955, df = 24, p-value = 0.001454
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 39.42419 145.67581
## sample estimates:
## mean of x mean of y
## 270.30 177.75
Use function order()
ord <- order(CW21$weight)
CW21.ordered <- CW21[ord, ]
head(CW21.ordered)## weight Time Chick Diet
## 268 74 21 24 2
## 155 96 21 13 1
## 107 98 21 9 1
## 220 117 21 20 1
## 119 124 21 10 1
## 194 142 21 17 1
First we get the weights at day 10 from ChickWeight and copy them to another data.frame CW10:
CW10 <- ChickWeight[ChickWeight$Time == 10, c(1,3)]
names(CW10) <- c("weight10", "Chick")
head(CW10)## weight10 Chick
## 6 93 1
## 18 103 2
## 30 99 3
## 42 87 4
## 54 106 5
## 66 124 6
Now, we want to merge the two datasets, to have two colums for each Chick with the weights at day 10 and day 21:
names(CW21)## [1] "weight" "Time" "Chick" "Diet"
names(CW10)## [1] "weight10" "Chick"
The only overlapping variable is “Chick”
CW <- merge(CW21, CW10, by = "Chick", all = TRUE)
head(CW)## Chick weight Time Diet weight10
## 1 16 NA NA <NA> 51
## 2 15 NA NA <NA> 68
## 3 13 96 21 1 67
## 4 9 98 21 1 96
## 5 20 117 21 1 73
## 6 10 124 21 1 81
When you have finished the analysis, the complete R workspace can be saved:
save.image("myR.RData")load("myR.RData") to load in a later R sessionOr you may want to export the results in a single object as a table:
write.table(CW, "ChickWeight21and10.txt", sep = "\t")The resulting file can be read into your R session using read.table function, or used outside of R
history() to see what you have enteredrm() to remove objects