Import data

We have seen how to create new data

Create vectors: c(), seq(), :, …
Create matrices: matrix()
Create data.frame: data.frame()
…

But: in most of cases we already have the data:

in a file, .csv or .txt, …
in tables, or other files such as SPSS/SAS results

Import data

We want to get the external data into R

We can use function read.table:

args(read.table)

## function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", 
##     numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, 
##     col.names, as.is = !stringsAsFactors, na.strings = "NA", 
##     colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, 
##     fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, 
##     comment.char = "#", allowEscapes = FALSE, flush = FALSE, 
##     stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", 
##     encoding = "unknown", text, skipNul = FALSE) 
## NULL

You have to set the working directory to where the file (filename.txt) is stored:

df <- read.table("filename.txt", header = TRUE)
head(df)

##   Va V2 V3 V4 Vee
## 1  1  5  9 13  17
## 2  2  6 10 14  18
## 3  3  7 11 15  19
## 4  4  8 12 16  20

Import data

Importing comma separated data (.csv files)

df <- read.csv("filename.csv", header = TRUE)
head(df)

##   X V1 V2 V3 V4 V5
## 1 1  1  5  9 13 17
## 2 2  2  6 10 14 18
## 3 3  3  7 11 15 19
## 4 4  4  8 12 16 20

Reading data from the internet:

df <- read.table("http://rcursus.dairyconsult.nl/lime.prn",header = TRUE)
head(df)

##   type rate    pH
## 1   AL    0 5.735
## 2   AL    0 5.730
## 3   AL    0 5.750
## 4   AL    1 5.845
## 5   AL    0 5.735
## 6   AL    0 5.770

These functions will result in objects of class data.frame

Syntax notes

Spaces in commands do NOT matter (except for readability), but CAPITALIZATION does matter
TRUE and FALSE are logical constants, you can also use abbreviations T and F
Both single- '' and double-quotes "" symbols are valid as long as the left quote is the same as the right quote
args(functionName) will show the arguments for functionName with default values, try args(read.table)
Argument matching
1. match on position: read.table("filename.txt" , TRUE , "\t" )
2. match by name: read.table(file="filename.txt" , header=T , sep="\t" , ...)
3. Both argument matching types can be mixed in the same function read.table("filename.txt" , header=T, "\t" , ...)

Brackets in R

Bracket	Use
`()`	For function calls (`print(l)`) and to set priorities (`a = 3*(1+2)`)
`[]`	For indexing vectors (`a[1]`), matrices
	(`m[1,1]`), or lists (`l[[1]]`)
`{}`	Block delimiter for grouping commands: (`if(!exists(a)){a = 1; print(a)}`)

Build in data

In the following sessions, we will use the ChickWeight dataset from R:
- Data from an experiment on the effect of diet on early growth of chickens
- Object ChickWeight is a data.frame
- Check ?ChickWeight for a description
The following functions will help you to find data that comes with the R package datasets:

data()

data(ChickWeight)
names(ChickWeight)

## [1] "weight" "Time"   "Chick"  "Diet"

The `ChickWeight` data

An overview of the complete dataset:

The dimensions of the data.frame :

dim(ChickWeight)

## [1] 578   4

The names of the columns:

names(ChickWeight)

## [1] "weight" "Time"   "Chick"  "Diet"

The `ChickWeight` data

A graphical overview of the dataset:

boxplot(weight~Time,data = ChickWeight, xlab="Time", ylab="weight")

Boxplots of the weight per timepoint.

How to make plots like this will be explained in the course section Graphics

Subset data

First we use the function summary() to get a quick overview of the data:

summary(ChickWeight)

##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

Later on we will use all the data in ChickWeight, but for now we only want the weights of each chick on day 21:

CW21 <- subset(ChickWeight, Time == 21)
summary(CW21)

##      weight           Time        Chick    Diet  
##  Min.   : 74.0   Min.   :21   13     : 1   1:16  
##  1st Qu.:167.0   1st Qu.:21   9      : 1   2:10  
##  Median :205.0   Median :21   20     : 1   3:10  
##  Mean   :218.7   Mean   :21   10     : 1   4: 9  
##  3rd Qu.:266.0   3rd Qu.:21   17     : 1         
##  Max.   :373.0   Max.   :21   19     : 1         
##                               (Other):39

Subset data

To make CW21 we used conditional subsetting (Time == 21).

Available comparison operators are:
- < less than
- > greater than
- == equal to
- <= less than or equal to
- >= greater than or equal to
- != NOT equal to (! symbol indicates negation)
- is.na(x) tests if x has missing values
Logical operators to combine expressions are:
- & logical AND
- | logical OR
- ! logical NOT (negation)

Subset data

Often, the use of indexes will be more convenient than using the function subset():

CW21[CW21$weight >= 250, ]

##     weight Time Chick Diet
## 84     305   21     7    1
## 167    266   21    14    1
## 232    331   21    21    2
## 280    265   21    25    2
## 292    251   21    26    2
## 328    309   21    29    2
## 352    256   21    31    3
## 364    305   21    32    3
## 388    341   21    34    3
## 400    373   21    35    3
## 436    290   21    38    3
## 448    272   21    39    3
## 460    321   21    40    3
## 484    281   21    42    4
## 554    322   21    48    4
## 578    264   21    50    4

Subset data

To understand how subsetting with indexes works we break it down in two steps

sel <- CW21$Diet == "1"
sel

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE

sel is a logical vector that is TRUE for all the rows of CW21 that match our criteria

CW21.subset1 <- CW21[sel, ]
dim(CW21.subset1)

## [1] 16  4

Subsetting on multiple variables

Sometimes, we want to subset based on multiple variables:

CW21.subset2 <- CW21[(CW21$Diet == "1" & CW21$weight >= 250), ]
CW21.subset2

##     weight Time Chick Diet
## 84     305   21     7    1
## 167    266   21    14    1

Explore data

What type of object are we dealing with

is.data.frame(CW21)

## [1] TRUE

How much data do we have

dim(CW21)

## [1] 45  4

See the first 6 rows of the data

head(CW21)

##    weight Time Chick Diet
## 12    205   21     1    1
## 24    215   21     2    1
## 36    202   21     3    1
## 48    157   21     4    1
## 60    223   21     5    1
## 72    157   21     6    1

Get the names of the columns of our data

names(CW21)

## [1] "weight" "Time"   "Chick"  "Diet"

How is Diet distributed:

table(CW21$Diet)

## 
##  1  2  3  4 
## 16 10 10  9

Variance and other summary statistics

The sample variance is defined as

\(\sigma^2 = \frac{1}{n-1}\sum_{i =1}^n\left(x_i - \frac{1}{n}\sum_{i = 1}^nx_i\right)^2\)

So we can calculate

x <- 1:100
sum((x - mean(x))^2)/(length(x) - 1)

## [1] 841.6667

But of course a function exists to calculate the sample variance:

var(x)

## [1] 841.6667

The same is true for many other functions: mean, sd, median, quantile, max, min, range, … We will try some of these functions in the exercises.

Missing values

In R, a missing value is represented as NA
- NA’s will be carried through in computations
- Operations on NA will always return NA as the result
Working with NA can give unexpected results:

aa <- 1:10
bb <- c(1:10, NA)
mean(aa)

## [1] 5.5

mean(bb)

## [1] NA

mean(bb, na.rm = TRUE)

## [1] 5.5

Many functions have an argument na.rm

Distributions

R functions are available for many distributions. For most distributions, four functions are available, the density, probability, quantile, and random function. These functions are identified by the starting letter being d, p, q, r:

dnorm(x, mean=0, sd=1, log = FALSE)
pnorm(q, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean=0, sd=1)

plot(dnorm(seq(-4,4,by=0.1))~seq(-4,4,by=0.1), ylab = 'density')

The normal distribution

x1 <- seq(-4, 4, 0.1)
x2 <- c(1e-04, 0.001, 0.01, 0.05, 0.1, 0.3, 0.5,0.7, 0.9, 0.95, 0.99, 0.999, 0.9999)
par(mfrow = c(1, 4), mar = c(2, 2, 2, 2))
plot(x1, dnorm(x1), main = "density")
plot(x1, pnorm(x1), main = "distribution function")
plot(x1, rnorm(x1), main = "random numbers")
plot(x2, qnorm(x2), main = "quantile function")

Other distributions

A large number of distributions are available:

Distribution	Function
beta	dbeta(x, shape1, shape2, ncp=0, log=F)
binomial	dbinom}(x, size, prob, log=F)
chi-squared	dchisq(x, df, ncp, log=F)
exponential	dexp(x, rate, log=F)
F	df(x, df1, df2, ncp, log=F)
gamma	dgamma(x, shape, rate, scale, log=F)
log-normal	dlnorm(x, meanlog, sdlog, log=F)
logistic	dlogis(x, location, scale, log=F)
normal	dnorm(x, mean, sd, log=F)
Poisson	dpois(x, lambda, log=F)
t	dt(x, df, ncp, log=F)
uniform	dunif(x, min, max, log=F)
weibull	dweibull(x, shape, scale, log=F)

Using distributions

Using the distribution functions as statistical tables for:

quantile fuction to obtain significance thresholds:

qnorm(0.05)

## [1] -1.644854

qt(0.05, df = 50)

## [1] -1.675905

probability function to perform testing:

pnorm(-1.645)

## [1] 0.04998491

pt(-1.676, df = 50)

## [1] 0.04999064

t-test

Hypothesis testing of problems with one variable: the t-test:

t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

t-test

Applying a t-test to the weights on days 21:

t.test(CW21$weight)

## 
##  One Sample t-test
## 
## data:  CW21$weight
## t = 20.515, df = 44, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  197.2048 240.1730
## sample estimates:
## mean of x 
##  218.6889

This way we test if weights are different from zero, which they obviously are.

t-test

We can also compare the mean of different groups: Are Chicks on Diet 3 heavier than Chicks on Diet 1? For this question we compare 2 samples in the t.test:

t.test(CW21$weight[CW21$Diet == "3"], CW21$weight[CW21$Diet =="1"], var.equal = T)

## 
##  Two Sample t-test
## 
## data:  CW21$weight[CW21$Diet == "3"] and CW21$weight[CW21$Diet == "1"]
## t = 3.5955, df = 24, p-value = 0.001454
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   39.42419 145.67581
## sample estimates:
## mean of x mean of y 
##    270.30    177.75

Order data

Use function order()

ord <- order(CW21$weight)
CW21.ordered <- CW21[ord, ]
head(CW21.ordered)

##     weight Time Chick Diet
## 268     74   21    24    2
## 155     96   21    13    1
## 107     98   21     9    1
## 220    117   21    20    1
## 119    124   21    10    1
## 194    142   21    17    1

Merging data

First we get the weights at day 10 from ChickWeight and copy them to another data.frame CW10:

CW10 <- ChickWeight[ChickWeight$Time == 10, c(1,3)]
names(CW10) <- c("weight10", "Chick")
head(CW10)

##    weight10 Chick
## 6        93     1
## 18      103     2
## 30       99     3
## 42       87     4
## 54      106     5
## 66      124     6

Merging data

Now, we want to merge the two datasets, to have two colums for each Chick with the weights at day 10 and day 21:

names(CW21)

## [1] "weight" "Time"   "Chick"  "Diet"

names(CW10)

## [1] "weight10" "Chick"

The only overlapping variable is “Chick”

CW <- merge(CW21, CW10, by = "Chick", all = TRUE)
head(CW)

##   Chick weight Time Diet weight10
## 1    16     NA   NA <NA>       51
## 2    15     NA   NA <NA>       68
## 3    13     96   21    1       67
## 4     9     98   21    1       96
## 5    20    117   21    1       73
## 6    10    124   21    1       81

Save & export data

When you have finished the analysis, the complete R workspace can be saved:

save.image("myR.RData")
load("myR.RData") to load in a later R session

Or you may want to export the results in a single object as a table:

write.table(CW, "ChickWeight21and10.txt", sep = "\t")

The resulting file can be read into your R session using read.table function, or used outside of R

Data manipulation in R

Data Manipulation and Basic Statistics in R

Import data

Import data

Import data

Syntax notes

Brackets in R

Build in data

The `ChickWeight` data

The `ChickWeight` data

Subset data

Subset data

Subset data

Subset data

Subsetting on multiple variables

Explore data

Variance and other summary statistics

Missing values

Distributions

The normal distribution

Other distributions

Using distributions

t-test

t-test

t-test

Order data

Merging data

Merging data

Save & export data

Additional topics: some tips

Return to main page

Data manipulation in R

Data Manipulation and Basic Statistics in R

Import data

Import data

Import data

Syntax notes

Brackets in R

Build in data

The ChickWeight data

The ChickWeight data

Subset data

Subset data

Subset data

Subset data

Subsetting on multiple variables

Explore data

Variance and other summary statistics

Missing values

Distributions

The normal distribution

Other distributions

Using distributions

t-test

t-test

t-test

Order data

Merging data

Merging data

Save & export data

Additional topics: some tips

Return to main page

The `ChickWeight` data

The `ChickWeight` data