Data manipulation in R

John Bastiaansen

Data Manipulation and Basic Statistics in R

In this session, we will cover the following topics:

Import data

We have seen how to create new data

But: in most of cases we already have the data:

Import data

We want to get the external data into R

args(read.table)
## function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", 
##     numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, 
##     col.names, as.is = !stringsAsFactors, na.strings = "NA", 
##     colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, 
##     fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, 
##     comment.char = "#", allowEscapes = FALSE, flush = FALSE, 
##     stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", 
##     encoding = "unknown", text, skipNul = FALSE) 
## NULL
df <- read.table("filename.txt", header = TRUE)
head(df)
##   Va V2 V3 V4 Vee
## 1  1  5  9 13  17
## 2  2  6 10 14  18
## 3  3  7 11 15  19
## 4  4  8 12 16  20

Import data

Importing comma separated data (.csv files)

df <- read.csv("filename.csv", header = TRUE)
head(df)
##   X V1 V2 V3 V4 V5
## 1 1  1  5  9 13 17
## 2 2  2  6 10 14 18
## 3 3  3  7 11 15 19
## 4 4  4  8 12 16 20

Reading data from the internet:

df <- read.table("http://rcursus.dairyconsult.nl/lime.prn",header = TRUE)
head(df)
##   type rate    pH
## 1   AL    0 5.735
## 2   AL    0 5.730
## 3   AL    0 5.750
## 4   AL    1 5.845
## 5   AL    0 5.735
## 6   AL    0 5.770

These functions will result in objects of class data.frame

Syntax notes

Brackets in R

Bracket Use
() For function calls (print(l)) and to set priorities (a = 3*(1+2))
[] For indexing vectors (a[1]), matrices
(m[1,1]), or lists (l[[1]])
{} Block delimiter for grouping commands: (if(!exists(a)){a = 1; print(a)})

Build in data

data()
data(ChickWeight)
names(ChickWeight)
## [1] "weight" "Time"   "Chick"  "Diet"

The ChickWeight data

An overview of the complete dataset:

dim(ChickWeight)
## [1] 578   4
names(ChickWeight)
## [1] "weight" "Time"   "Chick"  "Diet"

The ChickWeight data

A graphical overview of the dataset:

boxplot(weight~Time,data = ChickWeight, xlab="Time", ylab="weight")

Boxplots of the weight per timepoint.

How to make plots like this will be explained in the course section Graphics

Subset data

First we use the function summary() to get a quick overview of the data:

summary(ChickWeight)
##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

Later on we will use all the data in ChickWeight, but for now we only want the weights of each chick on day 21:

CW21 <- subset(ChickWeight, Time == 21)
summary(CW21)
##      weight           Time        Chick    Diet  
##  Min.   : 74.0   Min.   :21   13     : 1   1:16  
##  1st Qu.:167.0   1st Qu.:21   9      : 1   2:10  
##  Median :205.0   Median :21   20     : 1   3:10  
##  Mean   :218.7   Mean   :21   10     : 1   4: 9  
##  3rd Qu.:266.0   3rd Qu.:21   17     : 1         
##  Max.   :373.0   Max.   :21   19     : 1         
##                               (Other):39

Subset data

To make CW21 we used conditional subsetting (Time == 21).

Subset data

Often, the use of indexes will be more convenient than using the function subset():

CW21[CW21$weight >= 250, ]
##     weight Time Chick Diet
## 84     305   21     7    1
## 167    266   21    14    1
## 232    331   21    21    2
## 280    265   21    25    2
## 292    251   21    26    2
## 328    309   21    29    2
## 352    256   21    31    3
## 364    305   21    32    3
## 388    341   21    34    3
## 400    373   21    35    3
## 436    290   21    38    3
## 448    272   21    39    3
## 460    321   21    40    3
## 484    281   21    42    4
## 554    322   21    48    4
## 578    264   21    50    4

Subset data

To understand how subsetting with indexes works we break it down in two steps

sel <- CW21$Diet == "1"
sel
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE

sel is a logical vector that is TRUE for all the rows of CW21 that match our criteria

CW21.subset1 <- CW21[sel, ]
dim(CW21.subset1)
## [1] 16  4

Subsetting on multiple variables

Sometimes, we want to subset based on multiple variables:

CW21.subset2 <- CW21[(CW21$Diet == "1" & CW21$weight >= 250), ]
CW21.subset2
##     weight Time Chick Diet
## 84     305   21     7    1
## 167    266   21    14    1

Explore data

is.data.frame(CW21)
## [1] TRUE
dim(CW21)
## [1] 45  4
head(CW21)
##    weight Time Chick Diet
## 12    205   21     1    1
## 24    215   21     2    1
## 36    202   21     3    1
## 48    157   21     4    1
## 60    223   21     5    1
## 72    157   21     6    1
names(CW21)
## [1] "weight" "Time"   "Chick"  "Diet"
table(CW21$Diet)
## 
##  1  2  3  4 
## 16 10 10  9

Variance and other summary statistics

The sample variance is defined as

\(\sigma^2 = \frac{1}{n-1}\sum_{i =1}^n\left(x_i - \frac{1}{n}\sum_{i = 1}^nx_i\right)^2\)

So we can calculate

x <- 1:100
sum((x - mean(x))^2)/(length(x) - 1)
## [1] 841.6667

But of course a function exists to calculate the sample variance:

var(x)
## [1] 841.6667

The same is true for many other functions: mean, sd, median, quantile, max, min, range, … We will try some of these functions in the exercises.

Missing values

aa <- 1:10
bb <- c(1:10, NA)
mean(aa)
## [1] 5.5
mean(bb)
## [1] NA
mean(bb, na.rm = TRUE)
## [1] 5.5

Many functions have an argument na.rm

Distributions

R functions are available for many distributions. For most distributions, four functions are available, the density, probability, quantile, and random function. These functions are identified by the starting letter being d, p, q, r:

dnorm(x, mean=0, sd=1, log = FALSE)
pnorm(q, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean=0, sd=1)
plot(dnorm(seq(-4,4,by=0.1))~seq(-4,4,by=0.1), ylab = 'density')

The normal distribution

x1 <- seq(-4, 4, 0.1)
x2 <- c(1e-04, 0.001, 0.01, 0.05, 0.1, 0.3, 0.5,0.7, 0.9, 0.95, 0.99, 0.999, 0.9999)
par(mfrow = c(1, 4), mar = c(2, 2, 2, 2))
plot(x1, dnorm(x1), main = "density")
plot(x1, pnorm(x1), main = "distribution function")
plot(x1, rnorm(x1), main = "random numbers")
plot(x2, qnorm(x2), main = "quantile function")

Other distributions

A large number of distributions are available:

Distribution Function
beta dbeta(x, shape1, shape2, ncp=0, log=F)
binomial dbinom}(x, size, prob, log=F)
chi-squared dchisq(x, df, ncp, log=F)
exponential dexp(x, rate, log=F)
F df(x, df1, df2, ncp, log=F)
gamma dgamma(x, shape, rate, scale, log=F)
log-normal dlnorm(x, meanlog, sdlog, log=F)
logistic dlogis(x, location, scale, log=F)
normal dnorm(x, mean, sd, log=F)
Poisson dpois(x, lambda, log=F)
t dt(x, df, ncp, log=F)
uniform dunif(x, min, max, log=F)
weibull dweibull(x, shape, scale, log=F)

Using distributions

Using the distribution functions as statistical tables for:

qnorm(0.05)
## [1] -1.644854
qt(0.05, df = 50)
## [1] -1.675905
pnorm(-1.645)
## [1] 0.04998491
pt(-1.676, df = 50)
## [1] 0.04999064

t-test

Hypothesis testing of problems with one variable: the t-test:

t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

t-test

Applying a t-test to the weights on days 21:

t.test(CW21$weight)
## 
##  One Sample t-test
## 
## data:  CW21$weight
## t = 20.515, df = 44, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  197.2048 240.1730
## sample estimates:
## mean of x 
##  218.6889

This way we test if weights are different from zero, which they obviously are.

t-test

We can also compare the mean of different groups: Are Chicks on Diet 3 heavier than Chicks on Diet 1? For this question we compare 2 samples in the t.test:

t.test(CW21$weight[CW21$Diet == "3"], CW21$weight[CW21$Diet =="1"], var.equal = T)
## 
##  Two Sample t-test
## 
## data:  CW21$weight[CW21$Diet == "3"] and CW21$weight[CW21$Diet == "1"]
## t = 3.5955, df = 24, p-value = 0.001454
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   39.42419 145.67581
## sample estimates:
## mean of x mean of y 
##    270.30    177.75

Order data

Use function order()

ord <- order(CW21$weight)
CW21.ordered <- CW21[ord, ]
head(CW21.ordered)
##     weight Time Chick Diet
## 268     74   21    24    2
## 155     96   21    13    1
## 107     98   21     9    1
## 220    117   21    20    1
## 119    124   21    10    1
## 194    142   21    17    1

Merging data

First we get the weights at day 10 from ChickWeight and copy them to another data.frame CW10:

CW10 <- ChickWeight[ChickWeight$Time == 10, c(1,3)]
names(CW10) <- c("weight10", "Chick")
head(CW10)
##    weight10 Chick
## 6        93     1
## 18      103     2
## 30       99     3
## 42       87     4
## 54      106     5
## 66      124     6

Merging data

Now, we want to merge the two datasets, to have two colums for each Chick with the weights at day 10 and day 21:

names(CW21)
## [1] "weight" "Time"   "Chick"  "Diet"
names(CW10)
## [1] "weight10" "Chick"

The only overlapping variable is “Chick”

CW <- merge(CW21, CW10, by = "Chick", all = TRUE)
head(CW)
##   Chick weight Time Diet weight10
## 1    16     NA   NA <NA>       51
## 2    15     NA   NA <NA>       68
## 3    13     96   21    1       67
## 4     9     98   21    1       96
## 5    20    117   21    1       73
## 6    10    124   21    1       81

Save & export data

When you have finished the analysis, the complete R workspace can be saved:

Or you may want to export the results in a single object as a table:

write.table(CW, "ChickWeight21and10.txt", sep = "\t")

The resulting file can be read into your R session using read.table function, or used outside of R

Additional topics: some tips

Return to main page