Workshop 2: Loading and Manipulating DataQCBS R Workshop SeriesQu?bec Centre for Biodiversity Science1 / 106

Learning ObjectivesCreating an R project
Writing a script
Loading, exploring and saving data
Learn to manipulate data frames with tidyr, dplyr, maggritr
2 / 106

RStudio projects

What is this?
- Within RStudio, Projects make it easy to separate and keep your work organized.
- All files, scripts, documentation related to a specific project are bound together
Encourages reproducibility and easy sharing.

3 / 106

Create a new project

Use the Create project command (available in the Projects menu and the global toolbar)

4 / 106

Keep your files organized!

One project = one folder :)

5 / 106

Preparing data for R

Dataset should be stored as comma-separated value files (.csv) in the working directory

Can be created from almost all applications (Excel, LibreOffice, GoogleDocs)
File -> Save as .csv…

6 / 106

Choose filenames names wisely

Good!

rawDatasetAgo2017.csv
co2_concentrations_QB.csv
01_figIntro.R

Bad!

final.csv (Uninformative!)
safnnejs.csv (Random!)
1-4.csv (Avoid using only numbers!)
Dont.separate.names.with.dots.csv (Because it may lead to reading file errors!)

7 / 106

Choose variable names wisely

Good!

Measurements
SpeciesNames
Site

Bad!

a
3
supercomplicatedverylongname

8 / 106

Things to consider with your data inputNo text in numeric columns
Do not include spaces!
NA (not available) is allowed
Blank entries will automatically be replaced with NA
Name your variables informatively
Look for typos

Avoid numeric values for data that are not numeric
Use CONSISTENT formats for dates, numbers, metrics, etc.
No notes in .CSV files
No additional headings
No merged cells

9 / 106

Bad data examples:

10 / 106

Horrible data examples:

Bad data preservation practices

11 / 106

Preparing data in R

It is possible to do all your data preparation work within R

Saves time for large datasets
Keeps original data intact
Keeps track of the manipulation and transformation you did
Can switch between long- and wide-formats easily (more on this in future workshops)
Useful tips on data preparation can be found here: https://www.zoology.ubc.ca/~schluter/R/data/

12 / 106

Writing a Script13 / 106

R Scripts

What is this?
- A text file that contains all of the commands that you will use
Once written and saved, your script file allows you to make changes and re-run analyses with minimal effort!
- Just highlight text and click "Run" or press Command + Enter (Apple) or Ctrl + Enter (Windows and Linux)

14 / 106

Create an R script

15 / 106

Create an R script

16 / 106

Commands and comments

The # symbol in a script tells R to ignore anything remaining on this line of the script when running commands

# This is a comment not a command

17 / 106

Why should I use #?Annotating someone’s script is a good way to learn
Remember what you did
Tell collaborators what you did
Good step towards reproducible science
Be as detailed as possible
18 / 106

It is recommended that you start your script with a header using comments:

19 / 106

Section Headings

You can make a section heading in R Studio four # signs

# You can coment using this, but look below how to create section headers:
## Heading name ####

This allows you to move quickly between sections and hide sections

20 / 106

Housekeeping

The first command at the top of all scripts may be: rm(list = ls()). This command:

Clears R memory
Prevents errors related to the use of older data

21 / 106

Clearing the workspace

# Clear the R workspace
rm(list = ls())
?rm
?ls

22 / 106

Housekeeping

Demo – Try to add some test data to R and then see how rm(list = ls()) removes it

A<-"Test" # Put some data in workspace
A <- "Test" # Note that you can use spaces!
A = "Test" # <- or = can be used equally
#Note that it is best practice to use "<-" for assigment instead of "="
A
# [1] "Test"
rm(list=ls())
A
# Error in eval(expr, envir, enclos): object 'A' not found

23 / 106

RememberR is ready for commands when you see the chevron > in the console. If you don't see it, press ESC
R is case sensitive!
24 / 106

Loading, exploring and saving data25 / 106

Download today's data

You can download the data and the script for this workshop from the wiki:

http://qcbs.ca/wiki/r/workshop2

Save the files in the folder where your created your R project.

26 / 106

Working Directory

Tells R where your scripts and data are. You need to set the right working directory to load a data file. Type getwd() in the console to see your working directory:

getwd()

If this is not the directory you would like to work with, you can set your own using:

setwd()

Within the parenthesis, you will write the extension of the directory you would like to work with. See the example below:

setwd("C:/Users/Luigi/Documents") # We use slashes "/", and not backslashes "\"

27 / 106

Display contents of the directory

You can display contents of the working directory using dir():

dir()
# [1] "co2_broken.csv"         "co2_good.csv"          
# [3] "images"                 "qcbsR-fonts.css"       
# [5] "qcbsR-header.html"      "qcbsR-macros.js"       
# [7] "qcbsR.css"              "script_workshop02-en.R"
# [9] "workshop02-en.Rmd"

It helps to:

Check that the file you plan to open is present in the folder that R is currently working in
Check for correct spelling (e.g. "co2_good.csv" instead of "CO2_good.csv")

28 / 106

Importing Data

Import data into R using read.csv:

CO2 <- read.csv("co2_good.csv", header=TRUE)

Note that this will:

Create a new object in R called CO2;
The filename is written within quotation marks ('file' or "file");
If you needed to fetch the filename from another directory that is not yours, you would have to write the full extension (e.g., "C:/Users/Mario/Downloads/co2_good.csv")
header=TRUE tells R that the first line of your dataset contains column names

29 / 106

Importing Data

Recall to find out what arguments the function requires, use help “?” I

?read.csv

Note that if your operating system or .CSV editor (e.g. Excel) is in French, you may have to use read.csv2

?read.csv2

30 / 106

Importing Data

Notice that RStudio now provides information on the CO2 data in your workspace.

The workspace refers to all the objects that you create during an R session.

31 / 106

Looking at Data

R Command
Action


CO2
look at the whole dataframe

head(CO2)
look at the first few rows

tail(CO2)
look at the last few rows

names(CO2)
names of the columns in the dataframe

attributes(CO2)
attributes of the dataframe

dim(CO2)
dimensions of the dataframe

ncol(CO2)
number of columns

nrow(CO2)
number of rows

32 / 106

R Command	Action
`CO2`	look at the whole dataframe
`head(CO2)`	look at the first few rows
`tail(CO2)`	look at the last few rows
`names(CO2)`	names of the columns in the dataframe
`attributes(CO2)`	attributes of the dataframe
`dim(CO2)`	dimensions of the dataframe
`ncol(CO2)`	number of columns
`nrow(CO2)`	number of rows

Inspecting the data

str(CO2)
# 'data.frame':    84 obs. of  5 variables:
#  $ Plant    : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
#  $ Type     : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ Treatment: Factor w/ 2 levels "chilled","nonchilled": 2 2 2 2 2 2 2 2 2 2 ...
#  $ conc     : int  95 175 250 350 500 675 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...

This shows the structure of the dataframe. Very useful to check data type (mode) of all columns to make sure R loaded data properly.

Note: the CO2 dataset includes repeated measurements of CO2 uptake from 6 plants from Quebec and 6 plants from Mississippi at several levels of CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.

33 / 106

Inspecting the data

Common problems:

Factors loaded as text (character) or vice versa;
Factors includes too many (and unexpected) levels because of typos
Data (integer or numeric) is loaded as character because of typos (e.g. a space or a "," instead of a "." to separate decimal numbers)

34 / 106

Looking at Data

data()

head()

tail()

str()

names()

attributes()

dim()

ncol()

nrow()

Load the data with:

CO2 <- read.csv("co2_good.csv", header = FALSE)

Check data types with str() again. What is wrong here? Do not forget to re-load data with header = T afterwards

35 / 106

Reminder from Workshop 1: Accessing data

Imagine a data frame called mydata:

mydata[1,] # Extracts the first row
mydata[2,3] # Extracts the content of row 2 / column 3
mydata[,1] # Extracts the first column
mydata[,1][2] # [...] can be also be used recursively
mydata$Variable1 # Also extracts the first column

36 / 106

Renaming variables

Variables names can be changed within R.

# First lets make a copy of the dataset to play with
CO2copy <- CO2
# names() gives you the names of the variables present in the data frame
names(CO2copy)
# [1] "Plant"     "Type"      "Treatment" "conc"      "uptake"
# Changing from English to French names (make sure you have the same levels!)
names(CO2copy) <- c("Plante","Categorie", "Traitement", "conc", "absortion")
names(CO2copy)
# [1] "Plante"     "Categorie"  "Traitement" "conc"       "absortion"

37 / 106

Creating new variables

Variables and strings can be concatenated together. The function paste() is very useful for concatenating. See ?paste and ?paste0.

# Let's create an unique id for our samples:
# Don't forget to use "" for strings
CO2copy$uniqueID <- paste0(CO2copy$Plante,
                           "_",CO2copy$Categorie,
                           "_", CO2copy$Traitement)
# observe the results
head(CO2copy$uniqueID)
# [1] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"
# [4] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"

38 / 106

Creating new variables

Creating new variables works for numbers and mathematical operations as well!

# Let's standardize our variable "absortion" to relative values
CO2copy$absortionRel <- CO2copy$absortion/max(CO2copy$absortion)  
# Observe the results
head(CO2copy$absortionRel)

39 / 106

Subsetting data

There are many ways to subset a data frame

# Let's keep working with our CO2copy data frame
# Select only "Plante" and "absortionRel" columns. (Don't forget the ","!)
CO2copy[,c("Plante", "absortionRel")]
# Subset data frame from rows from 1 to 50
CO2copy[1:50,]

40 / 106

Subsetting data

# Select observations matching only the nonchilled Traitement.
CO2copy[CO2copy$Traitement == "nonchilled",]
# Select observations with absortion higher or equal to 20
CO2copy[CO2copy$absortion >= 20, ]  
# Select observations with absortion higher or equal to 20
CO2copy[CO2copy$Traitement  == "nonchilled" & CO2copy$absortion >= 20, ]
# We are done playing with the Dataset copy, lets erase it.
CO2copy <- NULL

Go here to check all the logical operators you can use

41 / 106

Data exploration

A good way to start your data exploration is to look at some basic statistics of your dataset.

Use the summary() function to do that!

summary(CO2)

This is also useful to spot some errors you might have missed!

42 / 106

Data exploration

You can also use other functions to calculate basic statistics on parts of your data frame. Let's try the mean(), sd() and hist() functions:

# Calculate the mean and the standard deviation of the CO2 concentration:
# Assign them to new variables
meanConc <- mean(CO2$conc)
sdConc <- sd(CO2$conc)
# print() prints any given value to the R console
print(paste("the mean of concentration is:", meanConc))
print(paste("the standard deviation of concentration is:", sdConc))

# Let's plot a histogram to explore the distribution of "uptake"
hist(CO2$uptake)
# Increasing the number of bins to observe better the pattern
hist(CO2$uptake, breaks = 40)

43 / 106

Data exploration

Use apply() to calculate the means of the last two columns of the data frame (i.e. the columns that contain continuous data).

?apply # Let's see how apply works!

apply(CO2[,4:5],
      MARGIN = 2,
      FUN = mean)
#     conc   uptake 
# 435.0000  27.2131

44 / 106

Saving your Workspace

# Saving an R workspace file that stores all your objects
save.image(file="co2_project_Data.RData")
# Clear your memory
rm(list = ls())
# Reload your data
load("co2_project_Data.RData")
head(CO2) # Looking good!

#   Plant   Type  Treatment conc uptake
# 1   Qn1 Quebec nonchilled   95   16.0
# 2   Qn1 Quebec nonchilled  175   30.4
# 3   Qn1 Quebec nonchilled  250   34.8
# 4   Qn1 Quebec nonchilled  350   37.2
# 5   Qn1 Quebec nonchilled  500   35.3
# 6   Qn1 Quebec nonchilled  675   39.2

45 / 106

Exporting data

R disposes of write functions that allow you to write objects directly to files in your computer. Let us use the write.csv function to save our CO2 data into a .CSV file:

write.csv(CO2, file = "co2_new.csv")

Note that our arguments are both:

CO2 Object (name)

"co2_new.csv" File to write (name)

46 / 106

Challenge: Use your data

Try to load, explore, plot, and save your own data in R;*
If it does not load properly, try to make the appropriate changes;
When you are finished, try opening your exported data in Excel, Numbers, Notepad or TextEdit.

If you don’t have your own data, work with your neighbour Remember to clean your workspace

47 / 106

Fixing a broken data frame

Getting your data working properly can be tougher than you think!

Sometimes, one may find compatibility issues.

For example, sharing data from an Apple computer to Windows, or between computers set up in different continents can lead to incompatible files (e.g. different decimal separators).

Let's practice how to solve these common errors.

48 / 106

Harder Challenge

Read the file co2_broken.csv

CO2 <- read.csv("co2_broken.csv")
head(CO2)
#               NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2                      Plant\tType\tTreatment\tconc\tuptake
# 3                           Qn1\tQuebec\tnonchilled\t95\t16
# 4                        Qn1\tQuebec\tnonchilled\t175\t30.4
# 5           Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes
# 6                        Qn1\tQuebec\tnonchilled\t350\t37.2
#   due.to.excessive  X X.1 X.2 X.3
# 1               NA NA  NA  NA  NA
# 2               NA NA  NA  NA  NA
# 3               NA NA  NA  NA  NA
# 4               NA NA  NA  NA  NA
# 5               NA NA  NA  NA  NA
# 6               NA NA  NA  NA  NA

49 / 106

Harder Challenge

This is probably what the data you are using looks like;
You can fix the data frame in R or other software;
Please give it a try before looking at the script provided;
Work with your neighbours and have FUN!.

CO2[1:4,]
#               NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2                      Plant\tType\tTreatment\tconc\tuptake
# 3                           Qn1\tQuebec\tnonchilled\t95\t16
# 4                        Qn1\tQuebec\tnonchilled\t175\t30.4
#   due.to.excessive  X X.1 X.2 X.3
# 1               NA NA  NA  NA  NA
# 2               NA NA  NA  NA  NA
# 3               NA NA  NA  NA  NA
# 4               NA NA  NA  NA  NA

50 / 106

Fixing co2_broken

Some useful functions:

?read.csv - look at some of the options for how to load a .csv
head() - first few rows
str() - structure of data
class() - class of the object
unique() - unique observations
levels() - levels of a factor
which() - ask a question to your data frame
droplevels() - get rid of undesired levels after subsetting factors

HINT There are 4 problems!

51 / 106

Broken data

ERROR 1 The data appears to be lumped into one column

head(CO2)
#               NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2                      Plant\tType\tTreatment\tconc\tuptake
# 3                           Qn1\tQuebec\tnonchilled\t95\t16
# 4                        Qn1\tQuebec\tnonchilled\t175\t30.4
# 5           Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes
# 6                        Qn1\tQuebec\tnonchilled\t350\t37.2
#   due.to.excessive  X X.1 X.2 X.3
# 1               NA NA  NA  NA  NA
# 2               NA NA  NA  NA  NA
# 3               NA NA  NA  NA  NA
# 4               NA NA  NA  NA  NA
# 5               NA NA  NA  NA  NA
# 6               NA NA  NA  NA  NA

52 / 106

Broken data

ERROR 1 - Solution

Re-import the data, but specify the separation among entries
The sep argument tells R what character separates the values on each line of the file
Here, "TAB" was used instead of ","

CO2 <- read.csv("co2_broken.csv",sep = "")

53 / 106

Broken data

ERROR 2 The data does not start until the third line of the file, so you end up with notes on the file as the headings.

head(CO2)
#     NOTE.     It       rain        a               lot    in. Quebec
# 1 falling     on         my notebook          numerous values  can't
# 2   Plant   Type  Treatment     conc            uptake              
# 3     Qn1 Quebec nonchilled       95                16              
# 4     Qn1 Quebec nonchilled      175              30.4              
# 5     Qn1 Quebec nonchilled      250 cannot_read_notes              
# 6     Qn1 Quebec nonchilled      350              37.2              
#   during sampling.      due to excessive X....
# 1     be      read rain,,,, NA        NA    NA
# 2                           NA        NA    NA
# 3                           NA        NA    NA
# 4                           NA        NA    NA
# 5                           NA        NA    NA
# 6                           NA        NA    NA

54 / 106

Broken data

ERROR 2 - Solution

Skip two lines when loading the file using the "skip" argument:

CO2 <- read.csv("co2_broken.csv", sep = "", skip = 2)
head(CO2)
#   Plant   Type  Treatment              conc            uptake
# 1   Qn1 Quebec nonchilled                95                16
# 2   Qn1 Quebec nonchilled               175              30.4
# 3   Qn1 Quebec nonchilled               250 cannot_read_notes
# 4   Qn1 Quebec nonchilled               350              37.2
# 5   Qn1 Quebec nonchilled               500              35.3
# 6   Qn1 Quebec nonchilled cannot_read_notes              39.2

55 / 106

Broken data

ERROR 3

conc and uptake variables are considered factors instead of numbers, because there are comments in the numeric columns

str(CO2)
# 'data.frame':    84 obs. of  5 variables:
#  $ Plant    : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
#  $ Type     : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
#  $ conc     : Factor w/ 8 levels "1000","175","250",..: 7 2 3 4 5 8 1 7 2 3 ...
#  $ uptake   : Factor w/ 77 levels "10.5","10.6",..: 15 39 76 54 50 61 63 9 32 53 ...

Due to missing values entered as "cannot_read_notes" and "na"
Recall that R only recognizes "NA" (capital)

56 / 106

`?read.csv`

57 / 106

Broken data

ERROR 3 - Solution

Tell R that all of NA, "na", and "cannot_read_notes" should be considered NA. Then because all other values in those columns are numbers, conc and uptake will be loaded as numeric/integer.

CO2 <- read.csv("co2_broken.csv", sep = "", skip = 2,
                na.strings = c("NA","na","cannot_read_notes"))
str(CO2)
# 'data.frame':    84 obs. of  5 variables:
#  $ Plant    : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
#  $ Type     : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
#  $ conc     : int  95 175 250 350 500 NA 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...

58 / 106

Broken data

ERROR 4

There are only 2 treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments.

str(CO2)

levels(CO2$Treatment)
# [1] "chiled"     "chilled"    "nnchilled"  "nonchilled"
unique(CO2$Treatment)
# [1] nonchilled nnchilled  chilled    chiled    
# Levels: chiled chilled nnchilled nonchilled

59 / 106

Broken data

ERROR 4 - Solution

# Identify all rows that contain "nnchilled" and replace with "nonchilled"
CO2$Treatment[CO2$Treatment=="nnchilled"] <- "nonchilled"
# Identify all rows that contain "chiled" and replace with "chilled"
CO2$Treatment[CO2$Treatment=="chiled"] <- "chilled"
# Drop unused levels from factor
CO2 <- droplevels(CO2)
str(CO2)
# 'data.frame':    84 obs. of  5 variables:
#  $ Plant    : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
#  $ Type     : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
#  $ Treatment: Factor w/ 2 levels "chilled","nonchilled": 2 2 2 2 2 2 2 2 2 2 ...
#  $ conc     : int  95 175 250 350 500 NA 1000 95 175 250 ...
#  $ uptake   : num  16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...

Fixed!

60 / 106

Learn to manipulate data with tidyr, dyplr, maggritr61 / 106

Using `tidyr` to reshape data frames

library(tidyr)

62 / 106

Data formats

Wide format

#   Species DBH Height
# 1     Oak  12     56
# 2     Elm  20     85
# 3     Ash  13     55

Long format

#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55

63 / 106

long vs wide format

Wide data format has a separate column for each variable or each factor in your study

Long data format has a column for possible variables and a column for the values of those variables

Wide data frame can be used for some basic plotting in ggplot2, but more complex plots require long format (example to come)

dplyr, lm(), glm(), gam() all require long data format

64 / 106

Tidying your data

Tidying allows you to manipulate the structure of your data while preserving all original information

gather() - convert from wide to long format

spread() - convert from long to wide format

scale:90%

65 / 106

`tidyr` installation

install.packages("tidyr")
library(tidyr)

66 / 106

`gather` columns into rows

gather(data, key, value, ...)

data A data frame (e.g. wide)
key name of the new column containing variable names (e.g. Measurement)
value name of the new column containing variable values (e.g. Value)
... name or numeric index of the columns we wish to gather (e.g. DBH, Height)

67 / 106

`gather` columns into rows

wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
                   DBH = c(12, 20, 13), Height = c(56, 85, 55))
wide
#   Species DBH Height
# 1     Oak  12     56
# 2     Elm  20     85
# 3     Ash  13     55

long = gather(wide, Measurement, Value, DBH, Height)
long
#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55

68 / 106

`spread` rows into columns

spread(data, key, value)

data A data frame (e.g. long)
key Name of the column containing variable names (e.g. Measurement)
value Name of the column containing variable values (e.g. Value)

69 / 106

`spread` rows into columns

long
#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55
wide2 = spread(long, Measurement, Value)
wide2
#   Species DBH Height
# 1     Ash  13     55
# 2     Elm  20     85
# 3     Oak  12     56

70 / 106

`separate` columns

separate() splits a columns by a character string separator

separate(data, col, into, sep)

data A data frame (e.g. long)
col Name of the column you wish to separate
into Names of new variables to create
sep Character which indicates where to separate

71 / 106

Using `separate()` example

Create a fictional dataset about fish and plankton

set.seed(8)
messy <- data.frame(id = 1:4,
                    trt = sample(rep(c('control', 'farm'), each = 2)),
                    zooplankton.T1 = runif(4),
                    fish.T1 = runif(4),
                    zooplankton.T2 = runif(4),
                    fish.T2 = runif(4))
messy
#   id     trt zooplankton.T1    fish.T1 zooplankton.T2   fish.T2
# 1  1    farm      0.7189275 0.64449114    0.544962116 0.2644589
# 2  2    farm      0.2908734 0.45704489    0.138224346 0.2765322
# 3  3 control      0.9322698 0.08930101    0.927812252 0.5211070
# 4  4 control      0.7691470 0.43239137    0.001301721 0.2236889

72 / 106

Using `separate()` example

First convert the messy data frame from wide to long format

messy.long <- gather(messy, taxa, count, -id, -trt)
head(messy.long)
#   id     trt           taxa     count
# 1  1    farm zooplankton.T1 0.7189275
# 2  2    farm zooplankton.T1 0.2908734
# 3  3 control zooplankton.T1 0.9322698
# 4  4 control zooplankton.T1 0.7691470
# 5  1    farm        fish.T1 0.6444911
# 6  2    farm        fish.T1 0.4570449

73 / 106

Using `separate()` example

Then we want to split the 2 sampling time (T1 and T2).

messy.long.sep <- separate(messy.long, taxa,
                           into = c("species", "time"), sep = "\\.")
head(messy.long.sep)
#   id     trt     species time     count
# 1  1    farm zooplankton   T1 0.7189275
# 2  2    farm zooplankton   T1 0.2908734
# 3  3 control zooplankton   T1 0.9322698
# 4  4 control zooplankton   T1 0.7691470
# 5  1    farm        fish   T1 0.6444911
# 6  2    farm        fish   T1 0.4570449

The argument sep = "\\." tells R to splits the character string around the period (.). We cannot type directly "." because it is a regular expression that matches any single character.

74 / 106

Recap of `tidyr`

A package that reshapes the layout of data sets.

Converting from wide to long format using gather()

Converting from long format to wide format using spread()

Split and merge columns with unite() and separate()

Data Wrangling with dplyr and tidyr Cheat Sheet

75 / 106

Challenge with `tidyr`

Using the airquality dataset, gather all the columns (except Month and Day) into rows.
Then spread the resulting data frame to return to the original data format.

?airquality
data(airquality)

76 / 106

Solution

Using the airquality dataset, gather all the columns (except Month and Day) into rows.

air.long <- gather(airquality, variable, value, -Month, -Day)
head(air.long)
#   Month Day variable value
# 1     5   1    Ozone    41
# 2     5   2    Ozone    36
# 3     5   3    Ozone    12
# 4     5   4    Ozone    18
# 5     5   5    Ozone    NA
# 6     5   6    Ozone    28

Note that the syntax used here indicates that we wish to gather ALL the columns exept Month and Day. It is equivalent to: gather(airquality, value, Ozone, Solar.R, Temp, Wind)

77 / 106

Solution

Then spread the resulting data frame to return to the original data format.

air.wide <- spread(air.long, variable, value)
head(air.wide)
#   Month Day Ozone Solar.R Temp Wind
# 1     5   1    41     190   67  7.4
# 2     5   2    36     118   72  8.0
# 3     5   3    12     149   74 12.6
# 4     5   4    18     313   62 11.5
# 5     5   5    NA      NA   56 14.3
# 6     5   6    28      NA   66 14.9

78 / 106

Data manipulation with `dplyr`

79 / 106

Intro to `dplyr`

Package that contains a set of functions (or “verbs”) for data manipulation such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data;
Easy and intuitive functions
Fast and efficient
Can interface with external databases and translate your R code into SQL queries

Some corresponding R base functions: split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate()

80 / 106

Intro to `dplyr`

install.packages("dplyr")
library(dplyr)

81 / 106

Basic functions in `dplyr`

These 4 core functions tackle the most common manipulations when working with data frames

select(): select columns from a data frame
filter(): filter rows according to defined criteria
arrange(): re-order data based on criteria (e.g. ascending, descending)
mutate(): create or transform values in a column

82 / 106

`select` columns

select(data, ...)

... Can be column names or positions or complex expressions separated by commas

select(data, column1, column2) select columns 1 and 2 select(data, c(2:4,6)) select columns 2 to 4 and 6 select(data, -column1) select all columns except column 1 select(data, start_with(x.)) select all columns that start with "x."

83 / 106

`select` columns

84 / 106

`select` columns

Example: suppose we are only interested in the variation of Ozone over time within the airquality dataset

ozone <- select(airquality, Ozone, Month, Day)
head(ozone)
#   Ozone Month Day
# 1    41     5   1
# 2    36     5   2
# 3    12     5   3
# 4    18     5   4
# 5    NA     5   5
# 6    28     5   6

85 / 106

`filter` rows

Extract a subset of rows that meet one or more specific conditions

filter(dataframe, logical statement 1, logical statement 2, ...)

86 / 106

`filter` rows

Example: we are interested in analyses that focus on the month of August during high temperature events

august <- filter(airquality, Month == 8, Temp >= 90)
# same as: filter(airquality, Month == 8 & Temp >= 90)
head(august)
#   Ozone Solar.R Wind Temp Month Day
# 1    89     229 10.3   90     8   8
# 2   110     207  8.0   90     8   9
# 3    NA     222  8.6   92     8  10
# 4    76     203  9.7   97     8  28
# 5   118     225  2.3   94     8  29
# 6    84     237  6.3   96     8  30

87 / 106

Sort rows with `arrange`

Re-order rows by a particular column, by default in ascending order

Use desc() for descending order.

arrange(data, variable1, desc(variable2), ...)

88 / 106

Sort rows with `arrange`

Example:

Let's use the following code to create a scrambled version of the airquality dataset

air_mess <- sample_frac(airquality, 1)
head(air_mess)
#   Ozone Solar.R Wind Temp Month Day
# 1    23     115  7.4   76     8  18
# 2    28     273 11.5   82     8  13
# 3     8      19 20.1   61     5   9
# 4   135     269  4.1   84     7   1
# 5    23     299  8.6   65     5   7
# 6    30     322 11.5   68     5  19

89 / 106

Sort rows with `arrange`

Example:

Now let's arrange the data frame back into chronological order, sorting by Month then Day

air_chron <- arrange(air_mess, Month, Day)
head(air_chron)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6

Try : arrange(air_mess, Day, Month) and see the difference.

90 / 106

Create new columns using `mutate`

Compute and add new columns

mutate(data, newVar1 = expression1, newVar2 = expression2, ...)

91 / 106

Create new columns using `mutate`

Example: we want to convert the temperature variable form degrees Fahrenheit to degrees Celsius

airquality_C <- mutate(airquality, Temp_C = (Temp-32)*(5/9))
head(airquality_C)
#   Ozone Solar.R Wind Temp Month Day   Temp_C
# 1    41     190  7.4   67     5   1 19.44444
# 2    36     118  8.0   72     5   2 22.22222
# 3    12     149 12.6   74     5   3 23.33333
# 4    18     313 11.5   62     5   4 16.66667
# 5    NA      NA 14.3   56     5   5 13.33333
# 6    28      NA 14.9   66     5   6 18.88889

92 / 106

`magrittr`

Usually data manipulation require multiple steps, the magrittr package offers a pipe operator %>% which allows us to link multiple operations

93 / 106

`magrittr`

install.packages("magrittr")
require(magrittr)

94 / 106

`magrittr`

Suppose we want to analyse only the month of June, then convert the temperature variable to degrees Celsius. We can create the required data frame by combining 2 dplyr verbs we learned

june_C <- mutate(filter(airquality, Month == 6),
                 Temp_C = (Temp-32)*(5/9))

As we add more operations, wrapping functions one inside the other becomes increasingly illegible. But, step by step would be redundant and write a lot of objects to the workspace.

95 / 106

`magrittr`

Alternatively, we can use maggritr's pipe operator to link these successive operations

june_C <- airquality %>%
    filter(Month == 6) %>%
    mutate(Temp_C = (Temp-32)*(5/9))

Advantages :

less redundant code
easy to read and write because functions are executed in order

96 / 106

`dplyr::group_by` and `summarise`

The dplyr verbs become especially powerful when they are are combined using the pipe operator %>%. The following dplyr functions allow us to split our data frame into groups on which we can perform operations individually

group_by() : group data frame by a factor for downstream operations (usually summarise)

summarise() : summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min(), max(), mean(), etc…)

97 / 106

`dplyr` - Split-Apply-Combine

The group_by function is key to the Split-Apply-Combine strategy

98 / 106

`dplyr` - Split-Apply-Combine

99 / 106

`dplyr` - Split-Apply-Combine

Example: we are interested in the mean temperature and standard deviation within each month if the airquality dataset

month_sum <- airquality %>%
      group_by(Month) %>%
      summarise(mean_temp = mean(Temp),
                sd_temp = sd(Temp))
month_sum
# # A tibble: 5 x 3
#   Month mean_temp sd_temp
#   <int>     <dbl>   <dbl>
# 1     5      65.5    6.85
# 2     6      79.1    6.60
# 3     7      83.9    4.32
# 4     8      84.0    6.59
# 5     9      76.9    8.36

100 / 106

Challenge with `dplyr`and `magrittr`

Using the ChickWeight dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study.

Employ dplyr verbs and the %>% operator.

101 / 106

Solution

Use group_by() to divide the dataset by "Chick"
Use summarise() to calculate the weight gain within each group

weight_diff <- ChickWeight %>%
      group_by(Chick) %>%
      summarise(weight_diff = max(weight) - min(weight))
head(weight_diff)
# # A tibble: 6 x 2
#   Chick weight_diff
#   <ord>       <dbl>
# 1 18              4
# 2 16             16
# 3 15             27
# 4 13             55
# 5 9              58
# 6 20             76

102 / 106

Ninja challenge

Using the ChickWeight dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study.

Employ dplyr verbs and the %>% operator.

(Hint: first() and last() may be useful here.)

103 / 106

Ninja solution

diet_summ <- ChickWeight %>%
      group_by(Diet, Chick) %>%
      summarise(weight_gain = last(weight) - first(weight)) %>%
      group_by(Diet) %>%
      summarise(mean_gain = mean(weight_gain))
diet_summ
# # A tibble: 4 x 2
#   Diet  mean_gain
#   <fct>     <dbl>
# 1 1          115.
# 2 2          174 
# 3 3          230.
# 4 4          188.

104 / 106

More on data manipulation

Learn more on dplyr

dplyr and tidyr cheatsheet

105 / 106

Thank you for attending!

106 / 106

Help

Keyboard shortcuts

↑, ←, Pg Up, k

Go to previous slide

↓, →, Pg Dn, Space, j

Go to next slide

Home

Go to first slide

End

Go to last slide

Number + Return

Go to specific slide

b / m / f

Toggle blackout / mirrored / fullscreen mode

Clone slideshow

Toggle presenter mode

Restart the presentation timer

?, h

Toggle this help

Workshop 2: Loading and Manipulating Data

QCBS R Workshop Series

Qu?bec Centre for Biodiversity Science

Learning Objectives

RStudio projects

Create a new project

Keep your files organized!

Preparing data for R

Choose filenames names wisely

Choose variable names wisely

Things to consider with your data input

Bad data examples:

Horrible data examples:

Preparing data in R

Writing a Script

R Scripts

Create an R script

Create an R script

Commands and comments

Why should I use #?

Header

Section Headings

Housekeeping

Clearing the workspace

Housekeeping

Remember

Loading, exploring and saving data

Download today's data

Working Directory

Display contents of the directory

Importing Data

Importing Data

Importing Data

Looking at Data

Inspecting the data

Inspecting the data

Looking at Data

Reminder from Workshop 1: Accessing data

Renaming variables

Creating new variables

Creating new variables

Subsetting data

Subsetting data

Data exploration

Data exploration

Data exploration

Saving your Workspace

Exporting data

Challenge: Use your data

Fixing a broken data frame

Harder Challenge

Harder Challenge

Fixing co2_broken

Broken data

Broken data

Broken data

Broken data

Broken data

?read.csv

Broken data

Broken data

Broken data

Learn to manipulate data with tidyr, dyplr, maggritr

Using tidyr to reshape data frames

Data formats

long vs wide format

Tidying your data

tidyr installation

gather columns into rows

gather columns into rows

spread rows into columns

spread rows into columns

separate columns

Using separate() example

Using separate() example

Using separate() example

Recap of tidyr

Challenge with tidyr

Solution

Solution

Why should I use `#`?

`?read.csv`

Using `tidyr` to reshape data frames

`tidyr` installation

`gather` columns into rows

`gather` columns into rows

`spread` rows into columns

`spread` rows into columns

`separate` columns

Using `separate()` example

Using `separate()` example

Using `separate()` example

Recap of `tidyr`

Challenge with `tidyr`

Data manipulation with `dplyr`

Intro to `dplyr`

Intro to `dplyr`

Basic functions in `dplyr`

`select` columns

`select` columns

`select` columns

`filter` rows

`filter` rows

Sort rows with `arrange`

Sort rows with `arrange`

Sort rows with `arrange`

Create new columns using `mutate`

Create new columns using `mutate`

`magrittr`

`magrittr`

`magrittr`

`magrittr`

`dplyr::group_by` and `summarise`

`dplyr` - Split-Apply-Combine

`dplyr` - Split-Apply-Combine

`dplyr` - Split-Apply-Combine

Challenge with `dplyr`and `magrittr`