+ - 0:00:00
Notes for current slide
Notes for next slide

Workshop 2: Loading and Manipulating Data

QCBS R Workshop Series

Qu?bec Centre for Biodiversity Science

1 / 106

Learning Objectives

  1. Creating an R project
  2. Writing a script
  3. Loading, exploring and saving data
  4. Learn to manipulate data frames with tidyr, dplyr, maggritr
2 / 106

RStudio projects

  • What is this?

    • Within RStudio, Projects make it easy to separate and keep your work organized.
    • All files, scripts, documentation related to a specific project are bound together
  • Encourages reproducibility and easy sharing.

3 / 106

Create a new project

Use the Create project command (available in the Projects menu and the global toolbar)

4 / 106

Keep your files organized!

One project = one folder :)

5 / 106

Preparing data for R

Dataset should be stored as comma-separated value files (.csv) in the working directory

  • Can be created from almost all applications (Excel, LibreOffice, GoogleDocs)
  • File -> Save as .csv

6 / 106

Choose filenames names wisely

Good!

  • rawDatasetAgo2017.csv
  • co2_concentrations_QB.csv
  • 01_figIntro.R

Bad!

  • final.csv (Uninformative!)
  • safnnejs.csv (Random!)
  • 1-4.csv (Avoid using only numbers!)
  • Dont.separate.names.with.dots.csv (Because it may lead to reading file errors!)
7 / 106

Choose variable names wisely

Good!

  • Measurements
  • SpeciesNames
  • Site

Bad!

  • a
  • 3
  • supercomplicatedverylongname
8 / 106

Things to consider with your data input

  • No text in numeric columns
  • Do not include spaces!
  • NA (not available) is allowed
  • Blank entries will automatically be replaced with NA
  • Name your variables informatively
  • Look for typos
  • Avoid numeric values for data that are not numeric
  • Use CONSISTENT formats for dates, numbers, metrics, etc.
  • No notes in .CSV files
  • No additional headings
  • No merged cells
9 / 106

Bad data examples:

10 / 106

Horrible data examples:

Bad data preservation practices

11 / 106

Preparing data in R

It is possible to do all your data preparation work within R

  • Saves time for large datasets
  • Keeps original data intact
  • Keeps track of the manipulation and transformation you did
  • Can switch between long- and wide-formats easily (more on this in future workshops)

  • Useful tips on data preparation can be found here: https://www.zoology.ubc.ca/~schluter/R/data/

12 / 106

Writing a Script

13 / 106

R Scripts

  • What is this?

    • A text file that contains all of the commands that you will use
  • Once written and saved, your script file allows you to make changes and re-run analyses with minimal effort!

    • Just highlight text and click "Run" or press Command + Enter (Apple) or Ctrl + Enter (Windows and Linux)
14 / 106

Create an R script

15 / 106

Create an R script

16 / 106

Commands and comments

The # symbol in a script tells R to ignore anything remaining on this line of the script when running commands

# This is a comment not a command
17 / 106

Why should I use #?

  • Annotating someone’s script is a good way to learn
  • Remember what you did
  • Tell collaborators what you did
  • Good step towards reproducible science
  • Be as detailed as possible
18 / 106

It is recommended that you start your script with a header using comments:

19 / 106

Section Headings

You can make a section heading in R Studio four # signs

# You can coment using this, but look below how to create section headers:
## Heading name ####

This allows you to move quickly between sections and hide sections

20 / 106

Housekeeping

The first command at the top of all scripts may be: rm(list = ls()). This command:

  • Clears R memory
  • Prevents errors related to the use of older data
21 / 106

Clearing the workspace

# Clear the R workspace
rm(list = ls())
?rm
?ls
22 / 106

Housekeeping

Demo – Try to add some test data to R and then see how rm(list = ls()) removes it

A<-"Test" # Put some data in workspace
A <- "Test" # Note that you can use spaces!
A = "Test" # <- or = can be used equally
#Note that it is best practice to use "<-" for assigment instead of "="
A
# [1] "Test"
rm(list=ls())
A
# Error in eval(expr, envir, enclos): object 'A' not found
23 / 106

Remember

  • R is ready for commands when you see the chevron > in the console. If you don't see it, press ESC
  • R is case sensitive!
24 / 106

Loading, exploring and saving data

25 / 106

Download today's data

You can download the data and the script for this workshop from the wiki:

http://qcbs.ca/wiki/r/workshop2

Save the files in the folder where your created your R project.

26 / 106

Working Directory

Tells R where your scripts and data are. You need to set the right working directory to load a data file. Type getwd() in the console to see your working directory:

getwd()

If this is not the directory you would like to work with, you can set your own using:

setwd()

Within the parenthesis, you will write the extension of the directory you would like to work with. See the example below:

setwd("C:/Users/Luigi/Documents") # We use slashes "/", and not backslashes "\"
27 / 106

Display contents of the directory

You can display contents of the working directory using dir():

dir()
# [1] "co2_broken.csv" "co2_good.csv"
# [3] "images" "qcbsR-fonts.css"
# [5] "qcbsR-header.html" "qcbsR-macros.js"
# [7] "qcbsR.css" "script_workshop02-en.R"
# [9] "workshop02-en.Rmd"

It helps to:

  • Check that the file you plan to open is present in the folder that R is currently working in
  • Check for correct spelling (e.g. "co2_good.csv" instead of "CO2_good.csv")
28 / 106

Importing Data

Import data into R using read.csv:

CO2 <- read.csv("co2_good.csv", header=TRUE)

Note that this will:

  • Create a new object in R called CO2;
  • The filename is written within quotation marks ('file' or "file");
  • If you needed to fetch the filename from another directory that is not yours, you would have to write the full extension (e.g., "C:/Users/Mario/Downloads/co2_good.csv")
  • header=TRUE tells R that the first line of your dataset contains column names
29 / 106

Importing Data

Recall to find out what arguments the function requires, use help “?” I

?read.csv

Note that if your operating system or .CSV editor (e.g. Excel) is in French, you may have to use read.csv2

?read.csv2
30 / 106

Importing Data

Notice that RStudio now provides information on the CO2 data in your workspace.

The workspace refers to all the objects that you create during an R session.

31 / 106

Looking at Data

R Command Action
CO2 look at the whole dataframe
head(CO2) look at the first few rows
tail(CO2) look at the last few rows
names(CO2) names of the columns in the dataframe
attributes(CO2) attributes of the dataframe
dim(CO2) dimensions of the dataframe
ncol(CO2) number of columns
nrow(CO2) number of rows
32 / 106

Inspecting the data

str(CO2)
# 'data.frame': 84 obs. of 5 variables:
# $ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
# $ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ Treatment: Factor w/ 2 levels "chilled","nonchilled": 2 2 2 2 2 2 2 2 2 2 ...
# $ conc : int 95 175 250 350 500 675 1000 95 175 250 ...
# $ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...

This shows the structure of the dataframe. Very useful to check data type (mode) of all columns to make sure R loaded data properly.

Note: the CO2 dataset includes repeated measurements of CO2 uptake from 6 plants from Quebec and 6 plants from Mississippi at several levels of CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.

33 / 106

Inspecting the data

Common problems:

  • Factors loaded as text (character) or vice versa;
  • Factors includes too many (and unexpected) levels because of typos
  • Data (integer or numeric) is loaded as character because of typos (e.g. a space or a "," instead of a "." to separate decimal numbers)
34 / 106

Looking at Data

data()

head()

tail()

str()

names()

attributes()

dim()

ncol()

nrow()

Load the data with:

CO2 <- read.csv("co2_good.csv", header = FALSE)

Check data types with str() again. What is wrong here? Do not forget to re-load data with header = T afterwards

35 / 106

Reminder from Workshop 1: Accessing data

Imagine a data frame called mydata:

mydata[1,] # Extracts the first row
mydata[2,3] # Extracts the content of row 2 / column 3
mydata[,1] # Extracts the first column
mydata[,1][2] # [...] can be also be used recursively
mydata$Variable1 # Also extracts the first column
36 / 106

Renaming variables

Variables names can be changed within R.

# First lets make a copy of the dataset to play with
CO2copy <- CO2
# names() gives you the names of the variables present in the data frame
names(CO2copy)
# [1] "Plant" "Type" "Treatment" "conc" "uptake"
# Changing from English to French names (make sure you have the same levels!)
names(CO2copy) <- c("Plante","Categorie", "Traitement", "conc", "absortion")
names(CO2copy)
# [1] "Plante" "Categorie" "Traitement" "conc" "absortion"
37 / 106

Creating new variables

Variables and strings can be concatenated together. The function paste() is very useful for concatenating. See ?paste and ?paste0.

# Let's create an unique id for our samples:
# Don't forget to use "" for strings
CO2copy$uniqueID <- paste0(CO2copy$Plante,
"_",CO2copy$Categorie,
"_", CO2copy$Traitement)
# observe the results
head(CO2copy$uniqueID)
# [1] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"
# [4] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"
38 / 106

Creating new variables

Creating new variables works for numbers and mathematical operations as well!

# Let's standardize our variable "absortion" to relative values
CO2copy$absortionRel <- CO2copy$absortion/max(CO2copy$absortion)
# Observe the results
head(CO2copy$absortionRel)
39 / 106

Subsetting data

There are many ways to subset a data frame

# Let's keep working with our CO2copy data frame
# Select only "Plante" and "absortionRel" columns. (Don't forget the ","!)
CO2copy[,c("Plante", "absortionRel")]
# Subset data frame from rows from 1 to 50
CO2copy[1:50,]
40 / 106

Subsetting data

# Select observations matching only the nonchilled Traitement.
CO2copy[CO2copy$Traitement == "nonchilled",]
# Select observations with absortion higher or equal to 20
CO2copy[CO2copy$absortion >= 20, ]
# Select observations with absortion higher or equal to 20
CO2copy[CO2copy$Traitement == "nonchilled" & CO2copy$absortion >= 20, ]
# We are done playing with the Dataset copy, lets erase it.
CO2copy <- NULL

Go here to check all the logical operators you can use

41 / 106

Data exploration

A good way to start your data exploration is to look at some basic statistics of your dataset.

Use the summary() function to do that!

summary(CO2)

This is also useful to spot some errors you might have missed!

42 / 106

Data exploration

You can also use other functions to calculate basic statistics on parts of your data frame. Let's try the mean(), sd() and hist() functions:

# Calculate the mean and the standard deviation of the CO2 concentration:
# Assign them to new variables
meanConc <- mean(CO2$conc)
sdConc <- sd(CO2$conc)
# print() prints any given value to the R console
print(paste("the mean of concentration is:", meanConc))
print(paste("the standard deviation of concentration is:", sdConc))
# Let's plot a histogram to explore the distribution of "uptake"
hist(CO2$uptake)
# Increasing the number of bins to observe better the pattern
hist(CO2$uptake, breaks = 40)
43 / 106

Data exploration

Use apply() to calculate the means of the last two columns of the data frame (i.e. the columns that contain continuous data).

?apply # Let's see how apply works!
apply(CO2[,4:5],
MARGIN = 2,
FUN = mean)
# conc uptake
# 435.0000 27.2131
44 / 106

Saving your Workspace

# Saving an R workspace file that stores all your objects
save.image(file="co2_project_Data.RData")
# Clear your memory
rm(list = ls())
# Reload your data
load("co2_project_Data.RData")
head(CO2) # Looking good!
# Plant Type Treatment conc uptake
# 1 Qn1 Quebec nonchilled 95 16.0
# 2 Qn1 Quebec nonchilled 175 30.4
# 3 Qn1 Quebec nonchilled 250 34.8
# 4 Qn1 Quebec nonchilled 350 37.2
# 5 Qn1 Quebec nonchilled 500 35.3
# 6 Qn1 Quebec nonchilled 675 39.2
45 / 106

Exporting data

R disposes of write functions that allow you to write objects directly to files in your computer. Let us use the write.csv function to save our CO2 data into a .CSV file:

write.csv(CO2, file = "co2_new.csv")

Note that our arguments are both:

CO2 Object (name)

"co2_new.csv" File to write (name)

46 / 106

Challenge: Use your data

  • Try to load, explore, plot, and save your own data in R;*
  • If it does not load properly, try to make the appropriate changes;
  • When you are finished, try opening your exported data in Excel, Numbers, Notepad or TextEdit.

If you don’t have your own data, work with your neighbour Remember to clean your workspace

47 / 106

Fixing a broken data frame

Getting your data working properly can be tougher than you think!

Sometimes, one may find compatibility issues.

For example, sharing data from an Apple computer to Windows, or between computers set up in different continents can lead to incompatible files (e.g. different decimal separators).

Let's practice how to solve these common errors.

48 / 106

Harder Challenge

Read the file co2_broken.csv

CO2 <- read.csv("co2_broken.csv")
head(CO2)
# NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2 Plant\tType\tTreatment\tconc\tuptake
# 3 Qn1\tQuebec\tnonchilled\t95\t16
# 4 Qn1\tQuebec\tnonchilled\t175\t30.4
# 5 Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes
# 6 Qn1\tQuebec\tnonchilled\t350\t37.2
# due.to.excessive X X.1 X.2 X.3
# 1 NA NA NA NA NA
# 2 NA NA NA NA NA
# 3 NA NA NA NA NA
# 4 NA NA NA NA NA
# 5 NA NA NA NA NA
# 6 NA NA NA NA NA
49 / 106

Harder Challenge

  • This is probably what the data you are using looks like;
  • You can fix the data frame in R or other software;
  • Please give it a try before looking at the script provided;
  • Work with your neighbours and have FUN!.
CO2[1:4,]
# NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2 Plant\tType\tTreatment\tconc\tuptake
# 3 Qn1\tQuebec\tnonchilled\t95\t16
# 4 Qn1\tQuebec\tnonchilled\t175\t30.4
# due.to.excessive X X.1 X.2 X.3
# 1 NA NA NA NA NA
# 2 NA NA NA NA NA
# 3 NA NA NA NA NA
# 4 NA NA NA NA NA
50 / 106

Fixing co2_broken

Some useful functions:

  • ?read.csv - look at some of the options for how to load a .csv
  • head() - first few rows
  • str() - structure of data
  • class() - class of the object
  • unique() - unique observations
  • levels() - levels of a factor
  • which() - ask a question to your data frame
  • droplevels() - get rid of undesired levels after subsetting factors

HINT There are 4 problems!

51 / 106

Broken data

ERROR 1 The data appears to be lumped into one column

head(CO2)
# NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2 Plant\tType\tTreatment\tconc\tuptake
# 3 Qn1\tQuebec\tnonchilled\t95\t16
# 4 Qn1\tQuebec\tnonchilled\t175\t30.4
# 5 Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes
# 6 Qn1\tQuebec\tnonchilled\t350\t37.2
# due.to.excessive X X.1 X.2 X.3
# 1 NA NA NA NA NA
# 2 NA NA NA NA NA
# 3 NA NA NA NA NA
# 4 NA NA NA NA NA
# 5 NA NA NA NA NA
# 6 NA NA NA NA NA
52 / 106

Broken data

ERROR 1 - Solution

  • Re-import the data, but specify the separation among entries
  • The sep argument tells R what character separates the values on each line of the file
  • Here, "TAB" was used instead of ","
CO2 <- read.csv("co2_broken.csv",sep = "")
53 / 106

Broken data

ERROR 2 The data does not start until the third line of the file, so you end up with notes on the file as the headings.

head(CO2)
# NOTE. It rain a lot in. Quebec
# 1 falling on my notebook numerous values can't
# 2 Plant Type Treatment conc uptake
# 3 Qn1 Quebec nonchilled 95 16
# 4 Qn1 Quebec nonchilled 175 30.4
# 5 Qn1 Quebec nonchilled 250 cannot_read_notes
# 6 Qn1 Quebec nonchilled 350 37.2
# during sampling. due to excessive X....
# 1 be read rain,,,, NA NA NA
# 2 NA NA NA
# 3 NA NA NA
# 4 NA NA NA
# 5 NA NA NA
# 6 NA NA NA
54 / 106

Broken data

ERROR 2 - Solution

Skip two lines when loading the file using the "skip" argument:

CO2 <- read.csv("co2_broken.csv", sep = "", skip = 2)
head(CO2)
# Plant Type Treatment conc uptake
# 1 Qn1 Quebec nonchilled 95 16
# 2 Qn1 Quebec nonchilled 175 30.4
# 3 Qn1 Quebec nonchilled 250 cannot_read_notes
# 4 Qn1 Quebec nonchilled 350 37.2
# 5 Qn1 Quebec nonchilled 500 35.3
# 6 Qn1 Quebec nonchilled cannot_read_notes 39.2
55 / 106

Broken data

ERROR 3

conc and uptake variables are considered factors instead of numbers, because there are comments in the numeric columns

str(CO2)
# 'data.frame': 84 obs. of 5 variables:
# $ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
# $ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
# $ conc : Factor w/ 8 levels "1000","175","250",..: 7 2 3 4 5 8 1 7 2 3 ...
# $ uptake : Factor w/ 77 levels "10.5","10.6",..: 15 39 76 54 50 61 63 9 32 53 ...
  • Due to missing values entered as "cannot_read_notes" and "na"
  • Recall that R only recognizes "NA" (capital)
56 / 106

?read.csv

57 / 106

Broken data

ERROR 3 - Solution

Tell R that all of NA, "na", and "cannot_read_notes" should be considered NA. Then because all other values in those columns are numbers, conc and uptake will be loaded as numeric/integer.

CO2 <- read.csv("co2_broken.csv", sep = "", skip = 2,
na.strings = c("NA","na","cannot_read_notes"))
str(CO2)
# 'data.frame': 84 obs. of 5 variables:
# $ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
# $ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
# $ conc : int 95 175 250 350 500 NA 1000 95 175 250 ...
# $ uptake : num 16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
58 / 106

Broken data

ERROR 4

There are only 2 treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments.

str(CO2)
levels(CO2$Treatment)
# [1] "chiled" "chilled" "nnchilled" "nonchilled"
unique(CO2$Treatment)
# [1] nonchilled nnchilled chilled chiled
# Levels: chiled chilled nnchilled nonchilled
59 / 106

Broken data

ERROR 4 - Solution

# Identify all rows that contain "nnchilled" and replace with "nonchilled"
CO2$Treatment[CO2$Treatment=="nnchilled"] <- "nonchilled"
# Identify all rows that contain "chiled" and replace with "chilled"
CO2$Treatment[CO2$Treatment=="chiled"] <- "chilled"
# Drop unused levels from factor
CO2 <- droplevels(CO2)
str(CO2)
# 'data.frame': 84 obs. of 5 variables:
# $ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
# $ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ Treatment: Factor w/ 2 levels "chilled","nonchilled": 2 2 2 2 2 2 2 2 2 2 ...
# $ conc : int 95 175 250 350 500 NA 1000 95 175 250 ...
# $ uptake : num 16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...

Fixed!

60 / 106

Learn to manipulate data with tidyr, dyplr, maggritr

61 / 106

Using tidyr to reshape data frames

library(tidyr)
62 / 106

Data formats

Wide format

# Species DBH Height
# 1 Oak 12 56
# 2 Elm 20 85
# 3 Ash 13 55

Long format

# Species Measurement Value
# 1 Oak DBH 12
# 2 Elm DBH 20
# 3 Ash DBH 13
# 4 Oak Height 56
# 5 Elm Height 85
# 6 Ash Height 55
63 / 106

long vs wide format

Wide data format has a separate column for each variable or each factor in your study

Long data format has a column for possible variables and a column for the values of those variables


Wide data frame can be used for some basic plotting in ggplot2, but more complex plots require long format (example to come)

dplyr, lm(), glm(), gam() all require long data format

64 / 106

Tidying your data

Tidying allows you to manipulate the structure of your data while preserving all original information

gather() - convert from wide to long format

spread() - convert from long to wide format

scale:90%

65 / 106

tidyr installation

install.packages("tidyr")
library(tidyr)
66 / 106

gather columns into rows

gather(data, key, value, ...)

  • data A data frame (e.g. wide)
  • key name of the new column containing variable names (e.g. Measurement)
  • value name of the new column containing variable values (e.g. Value)
  • ... name or numeric index of the columns we wish to gather (e.g. DBH, Height)
67 / 106

gather columns into rows

wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
DBH = c(12, 20, 13), Height = c(56, 85, 55))
wide
# Species DBH Height
# 1 Oak 12 56
# 2 Elm 20 85
# 3 Ash 13 55
long = gather(wide, Measurement, Value, DBH, Height)
long
# Species Measurement Value
# 1 Oak DBH 12
# 2 Elm DBH 20
# 3 Ash DBH 13
# 4 Oak Height 56
# 5 Elm Height 85
# 6 Ash Height 55
68 / 106

spread rows into columns

spread(data, key, value)

  • data A data frame (e.g. long)
  • key Name of the column containing variable names (e.g. Measurement)
  • value Name of the column containing variable values (e.g. Value)
69 / 106

spread rows into columns

long
# Species Measurement Value
# 1 Oak DBH 12
# 2 Elm DBH 20
# 3 Ash DBH 13
# 4 Oak Height 56
# 5 Elm Height 85
# 6 Ash Height 55
wide2 = spread(long, Measurement, Value)
wide2
# Species DBH Height
# 1 Ash 13 55
# 2 Elm 20 85
# 3 Oak 12 56
70 / 106

separate columns

separate() splits a columns by a character string separator

separate(data, col, into, sep)

  • data A data frame (e.g. long)
  • col Name of the column you wish to separate
  • into Names of new variables to create
  • sep Character which indicates where to separate
71 / 106

Using separate() example

Create a fictional dataset about fish and plankton

set.seed(8)
messy <- data.frame(id = 1:4,
trt = sample(rep(c('control', 'farm'), each = 2)),
zooplankton.T1 = runif(4),
fish.T1 = runif(4),
zooplankton.T2 = runif(4),
fish.T2 = runif(4))
messy
# id trt zooplankton.T1 fish.T1 zooplankton.T2 fish.T2
# 1 1 farm 0.7189275 0.64449114 0.544962116 0.2644589
# 2 2 farm 0.2908734 0.45704489 0.138224346 0.2765322
# 3 3 control 0.9322698 0.08930101 0.927812252 0.5211070
# 4 4 control 0.7691470 0.43239137 0.001301721 0.2236889
72 / 106

Using separate() example

First convert the messy data frame from wide to long format

messy.long <- gather(messy, taxa, count, -id, -trt)
head(messy.long)
# id trt taxa count
# 1 1 farm zooplankton.T1 0.7189275
# 2 2 farm zooplankton.T1 0.2908734
# 3 3 control zooplankton.T1 0.9322698
# 4 4 control zooplankton.T1 0.7691470
# 5 1 farm fish.T1 0.6444911
# 6 2 farm fish.T1 0.4570449
73 / 106

Using separate() example

Then we want to split the 2 sampling time (T1 and T2).

messy.long.sep <- separate(messy.long, taxa,
into = c("species", "time"), sep = "\\.")
head(messy.long.sep)
# id trt species time count
# 1 1 farm zooplankton T1 0.7189275
# 2 2 farm zooplankton T1 0.2908734
# 3 3 control zooplankton T1 0.9322698
# 4 4 control zooplankton T1 0.7691470
# 5 1 farm fish T1 0.6444911
# 6 2 farm fish T1 0.4570449

The argument sep = "\\." tells R to splits the character string around the period (.). We cannot type directly "." because it is a regular expression that matches any single character.

74 / 106

Recap of tidyr

A package that reshapes the layout of data sets.

Converting from wide to long format using gather()

Converting from long format to wide format using spread()

Split and merge columns with unite() and separate()

Data Wrangling with dplyr and tidyr Cheat Sheet

75 / 106

Challenge with tidyr

  1. Using the airquality dataset, gather all the columns (except Month and Day) into rows.

  2. Then spread the resulting data frame to return to the original data format.

?airquality
data(airquality)
76 / 106

Solution

  1. Using the airquality dataset, gather all the columns (except Month and Day) into rows.
air.long <- gather(airquality, variable, value, -Month, -Day)
head(air.long)
# Month Day variable value
# 1 5 1 Ozone 41
# 2 5 2 Ozone 36
# 3 5 3 Ozone 12
# 4 5 4 Ozone 18
# 5 5 5 Ozone NA
# 6 5 6 Ozone 28

Note that the syntax used here indicates that we wish to gather ALL the columns exept Month and Day. It is equivalent to: gather(airquality, value, Ozone, Solar.R, Temp, Wind)

77 / 106

Solution

  1. Then spread the resulting data frame to return to the original data format.
air.wide <- spread(air.long, variable, value)
head(air.wide)
# Month Day Ozone Solar.R Temp Wind
# 1 5 1 41 190 67 7.4
# 2 5 2 36 118 72 8.0
# 3 5 3 12 149 74 12.6
# 4 5 4 18 313 62 11.5
# 5 5 5 NA NA 56 14.3
# 6 5 6 28 NA 66 14.9
78 / 106

Data manipulation with dplyr

79 / 106

Intro to dplyr

  • Package that contains a set of functions (or “verbs”) for data manipulation such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data;
  • Easy and intuitive functions
  • Fast and efficient
  • Can interface with external databases and translate your R code into SQL queries

Some corresponding R base functions: split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate()

80 / 106

Intro to dplyr

install.packages("dplyr")
library(dplyr)
81 / 106

Basic functions in dplyr

These 4 core functions tackle the most common manipulations when working with data frames

  • select(): select columns from a data frame
  • filter(): filter rows according to defined criteria
  • arrange(): re-order data based on criteria (e.g. ascending, descending)
  • mutate(): create or transform values in a column
82 / 106

select columns

select(data, ...)

  • ... Can be column names or positions or complex expressions separated by commas

    select(data, column1, column2) select columns 1 and 2 select(data, c(2:4,6)) select columns 2 to 4 and 6 select(data, -column1) select all columns except column 1 select(data, start_with(x.)) select all columns that start with "x."

83 / 106

select columns

84 / 106

select columns

Example: suppose we are only interested in the variation of Ozone over time within the airquality dataset

ozone <- select(airquality, Ozone, Month, Day)
head(ozone)
# Ozone Month Day
# 1 41 5 1
# 2 36 5 2
# 3 12 5 3
# 4 18 5 4
# 5 NA 5 5
# 6 28 5 6
85 / 106

filter rows

Extract a subset of rows that meet one or more specific conditions

filter(dataframe, logical statement 1, logical statement 2, ...)

86 / 106

filter rows

Example: we are interested in analyses that focus on the month of August during high temperature events

august <- filter(airquality, Month == 8, Temp >= 90)
# same as: filter(airquality, Month == 8 & Temp >= 90)
head(august)
# Ozone Solar.R Wind Temp Month Day
# 1 89 229 10.3 90 8 8
# 2 110 207 8.0 90 8 9
# 3 NA 222 8.6 92 8 10
# 4 76 203 9.7 97 8 28
# 5 118 225 2.3 94 8 29
# 6 84 237 6.3 96 8 30
87 / 106

Sort rows with arrange

Re-order rows by a particular column, by default in ascending order

Use desc() for descending order.

arrange(data, variable1, desc(variable2), ...)

88 / 106

Sort rows with arrange

Example:

  1. Let's use the following code to create a scrambled version of the airquality dataset
air_mess <- sample_frac(airquality, 1)
head(air_mess)
# Ozone Solar.R Wind Temp Month Day
# 1 23 115 7.4 76 8 18
# 2 28 273 11.5 82 8 13
# 3 8 19 20.1 61 5 9
# 4 135 269 4.1 84 7 1
# 5 23 299 8.6 65 5 7
# 6 30 322 11.5 68 5 19
89 / 106

Sort rows with arrange

Example:

  1. Now let's arrange the data frame back into chronological order, sorting by Month then Day
air_chron <- arrange(air_mess, Month, Day)
head(air_chron)
# Ozone Solar.R Wind Temp Month Day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6

Try : arrange(air_mess, Day, Month) and see the difference.

90 / 106

Create new columns using mutate

Compute and add new columns

mutate(data, newVar1 = expression1, newVar2 = expression2, ...)

91 / 106

Create new columns using mutate

Example: we want to convert the temperature variable form degrees Fahrenheit to degrees Celsius

airquality_C <- mutate(airquality, Temp_C = (Temp-32)*(5/9))
head(airquality_C)
# Ozone Solar.R Wind Temp Month Day Temp_C
# 1 41 190 7.4 67 5 1 19.44444
# 2 36 118 8.0 72 5 2 22.22222
# 3 12 149 12.6 74 5 3 23.33333
# 4 18 313 11.5 62 5 4 16.66667
# 5 NA NA 14.3 56 5 5 13.33333
# 6 28 NA 14.9 66 5 6 18.88889
92 / 106

magrittr

Usually data manipulation require multiple steps, the magrittr package offers a pipe operator %>% which allows us to link multiple operations

93 / 106

magrittr

install.packages("magrittr")
require(magrittr)
94 / 106

magrittr

Suppose we want to analyse only the month of June, then convert the temperature variable to degrees Celsius. We can create the required data frame by combining 2 dplyr verbs we learned

june_C <- mutate(filter(airquality, Month == 6),
Temp_C = (Temp-32)*(5/9))

As we add more operations, wrapping functions one inside the other becomes increasingly illegible. But, step by step would be redundant and write a lot of objects to the workspace.

95 / 106

magrittr

Alternatively, we can use maggritr's pipe operator to link these successive operations

june_C <- airquality %>%
filter(Month == 6) %>%
mutate(Temp_C = (Temp-32)*(5/9))

Advantages :

  • less redundant code
  • easy to read and write because functions are executed in order
96 / 106

dplyr::group_by and summarise

The dplyr verbs become especially powerful when they are are combined using the pipe operator %>%. The following dplyr functions allow us to split our data frame into groups on which we can perform operations individually

group_by() : group data frame by a factor for downstream operations (usually summarise)

summarise() : summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min(), max(), mean(), etc…)

97 / 106

dplyr - Split-Apply-Combine

The group_by function is key to the Split-Apply-Combine strategy

98 / 106

dplyr - Split-Apply-Combine

99 / 106

dplyr - Split-Apply-Combine

Example: we are interested in the mean temperature and standard deviation within each month if the airquality dataset

month_sum <- airquality %>%
group_by(Month) %>%
summarise(mean_temp = mean(Temp),
sd_temp = sd(Temp))
month_sum
# # A tibble: 5 x 3
# Month mean_temp sd_temp
# <int> <dbl> <dbl>
# 1 5 65.5 6.85
# 2 6 79.1 6.60
# 3 7 83.9 4.32
# 4 8 84.0 6.59
# 5 9 76.9 8.36
100 / 106

Challenge with dplyrand magrittr

Using the ChickWeight dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study.

Employ dplyr verbs and the %>% operator.

101 / 106

Solution

  1. Use group_by() to divide the dataset by "Chick"
  2. Use summarise() to calculate the weight gain within each group
weight_diff <- ChickWeight %>%
group_by(Chick) %>%
summarise(weight_diff = max(weight) - min(weight))
head(weight_diff)
# # A tibble: 6 x 2
# Chick weight_diff
# <ord> <dbl>
# 1 18 4
# 2 16 16
# 3 15 27
# 4 13 55
# 5 9 58
# 6 20 76
102 / 106

Ninja challenge

Using the ChickWeight dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study.

Employ dplyr verbs and the %>% operator.

(Hint: first() and last() may be useful here.)

103 / 106

Ninja solution

diet_summ <- ChickWeight %>%
group_by(Diet, Chick) %>%
summarise(weight_gain = last(weight) - first(weight)) %>%
group_by(Diet) %>%
summarise(mean_gain = mean(weight_gain))
diet_summ
# # A tibble: 4 x 2
# Diet mean_gain
# <fct> <dbl>
# 1 1 115.
# 2 2 174
# 3 3 230.
# 4 4 188.
104 / 106

More on data manipulation

Learn more on dplyr

dplyr and tidyr cheatsheet

105 / 106

Thank you for attending!

106 / 106

Learning Objectives

  1. Creating an R project
  2. Writing a script
  3. Loading, exploring and saving data
  4. Learn to manipulate data frames with tidyr, dplyr, maggritr
2 / 106
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow