What is this?
Encourages reproducibility and easy sharing.
Use the Create project command (available in the Projects menu and the global toolbar)
One project = one folder :)
Datasets should be stored as comma-separated values files (.csv) in the working directory
Good!
Bad!
Good!
Bad!
Bad data preservation practices
It is possible to do all your data preparation work within R
Can switch between long- and wide-formats easily (more on this in future workshops)
Useful tips on data preparation can be found here: https://www.zoology.ubc.ca/~schluter/R/data/
What is this?
Once written and saved, your script file allows you to make changes and re-run analyses with minimal effort!
The # symbol in a script tells R to ignore anything remaining on that line of the script when running commands:
# This is a comment not a command
It is recommended that you start your script with a header using comments:
You can make a section heading in RStudio by ending a comment line with four # signs:
# You can comment using this, but look below how to create section headers:
## Heading name ####
This allows you to move quickly between sections and hide sections
The first command at the top of all scripts may be: rm(list = ls()).
This command:
# Clear the R workspace
rm(list = ls())
?rm
?ls
Demo – Try to add some test data to R and then see how rm(list = ls()) removes it
A<-"Test" # Put some data in workspace
A <- "Test" # Note that you can use spaces!
A = "Test" # <- or = can be used equally
# Note that it is best practice to use "<-" for assignment instead of "="
A
# [1] "Test"
rm(list=ls())
A
# Error in eval(expr, envir, enclos): object 'A' not found
R is ready for commands when you see the prompt > in the console. If you don't see it, press ESC.
You can download the data and the script for this workshop from the wiki:
http://qcbs.ca/wiki/r/workshop2
Save the files in the folder where you created your R project.
The working directory tells R where your scripts and data are. You need to set the right working directory to load a data file. Type getwd() in the console to see your working directory:
getwd()
If this is not the directory you would like to work with, you can set your own using:
setwd()
Within the parentheses, write the path of the directory you would like to work with. See the example below:
setwd("C:/Users/Luigi/Documents") # We use slashes "/", and not backslashes "\"
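The round trip between getwd() and setwd() can be sketched as follows. This is only an illustration: the folder name my_project is hypothetical and is created inside R's temporary directory so that the example runs on any computer.

```r
# Hypothetical project folder, created in a temporary location for this sketch
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# setwd() invisibly returns the previous working directory, so we can keep it
old_dir <- setwd(project_dir)
getwd()          # now points to the project folder

# Go back to where we started
setwd(old_dir)
getwd()
```

Note that file.path() builds the path with forward slashes, which R accepts on every operating system.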
You can display contents of the working directory using dir():
dir()
# [1] "co2_broken.csv"         "co2_good.csv"
# [3] "images"                 "qcbsR-fonts.css"
# [5] "qcbsR-header.html"      "qcbsR-macros.js"
# [7] "qcbsR.css"              "script_workshop02-en.R"
# [9] "workshop02-en.Rmd"
It helps to:
Import data into R using read.csv:
CO2 <- read.csv("co2_good.csv", header=TRUE)
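Note that this will:
create a new object in R called CO2;
look for the file in your working directory, unless you give its full path (e.g. "C:/Users/Mario/Downloads/co2_good.csv");
treat the first line of your dataset as column names, because of header=TRUE.
Recall that to find out what arguments a function requires, you can use the help operator "?":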
?read.csv
Note that if your operating system or .CSV editor (e.g. Excel) is in French, you may have to use read.csv2
?read.csv2
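To illustrate the difference, here is a minimal sketch that writes a small file with semicolons between fields and commas as decimal marks, then reads it back. The file content is made up for the example and is not part of the workshop data.

```r
# A made-up file in the style a French-locale spreadsheet might export
tmp <- tempfile(fileext = ".csv")
writeLines(c("Plant;conc;uptake",
             "Qn1;95;16,0",
             "Qn1;175;30,4"), tmp)

# read.csv2() assumes ";" as field separator and "," as decimal mark
right <- read.csv2(tmp)
str(right)

# The same result with read.csv() and explicit arguments
same <- read.csv(tmp, sep = ";", dec = ",")
identical(right, same)
```

If uptake shows up as numeric in str(), the decimal commas were interpreted correctly.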
Notice that RStudio now provides information on the CO2 data in your workspace.
The workspace refers to all the objects that you create during an R session.
R Command | Action |
---|---|
CO2 | look at the whole dataframe |
head(CO2) | look at the first few rows |
tail(CO2) | look at the last few rows |
names(CO2) | names of the columns in the dataframe |
attributes(CO2) | attributes of the dataframe |
dim(CO2) | dimensions of the dataframe |
ncol(CO2) | number of columns |
nrow(CO2) | number of rows |
str(CO2)
# 'data.frame': 84 obs. of 5 variables:
# $ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
# $ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ Treatment: Factor w/ 2 levels "chilled","nonchilled": 2 2 2 2 2 2 2 2 2 2 ...
# $ conc : int 95 175 250 350 500 675 1000 95 175 250 ...
# $ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
This shows the structure of the dataframe. Very useful to check data type (mode) of all columns to make sure R loaded data properly.
Note: the CO2 dataset includes repeated measurements of CO2 uptake from 6 plants from Quebec and 6 plants from Mississippi at several levels of CO2 concentration. Half the plants of each type were chilled overnight before the experiment was conducted.
Common problems:
data()
head()
tail()
str()
names()
attributes()
dim()
ncol()
nrow()
Load the data with:
CO2 <- read.csv("co2_good.csv", header = FALSE)
Check data types with str() again. What is wrong here?
Do not forget to re-load the data with header = TRUE afterwards.
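The effect of the header argument can be reproduced on a tiny file. The sketch below uses three made-up lines rather than co2_good.csv, so it runs without the workshop files.

```r
# A tiny made-up CSV file with one line of column names
tmp <- tempfile(fileext = ".csv")
writeLines(c("Plant,conc,uptake",
             "Qn1,95,16.0",
             "Qn1,175,30.4"), tmp)

# Without a header, the column names are read as data:
# columns are called V1, V2, V3 and none of them is numeric
no_header <- read.csv(tmp, header = FALSE)
str(no_header)

# With a header, conc and uptake are recognised as numbers
with_header <- read.csv(tmp, header = TRUE)
str(with_header)
```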
Imagine a data frame called mydata:
mydata[1,] # Extracts the first row
mydata[2,3] # Extracts the content of row 2 / column 3
mydata[,1] # Extracts the first column
mydata[,1][2] # [...] can also be used recursively
mydata$Variable1 # Also extracts the first column
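To try these commands, one could build a small mydata object first. The values below are invented purely for illustration; only the name Variable1 comes from the commands above.

```r
# A made-up data frame with three rows and three columns
mydata <- data.frame(Variable1 = c(10, 20, 30),
                     Variable2 = c("a", "b", "c"),
                     Variable3 = c(TRUE, FALSE, TRUE))

first_row  <- mydata[1, ]     # a data frame with one row
one_cell   <- mydata[2, 3]    # row 2, column 3
first_col  <- mydata[, 1]     # a plain vector
second_val <- mydata[, 1][2]  # second element of the first column

# Both ways of extracting the first column give the same vector
identical(first_col, mydata$Variable1)
```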
Variable names can be changed within R.
# First let's make a copy of the dataset to play with
CO2copy <- CO2
# names() gives you the names of the variables present in the data frame
names(CO2copy)
# [1] "Plant" "Type" "Treatment" "conc" "uptake"
# Changing from English to French names (make sure you have the same levels!)
names(CO2copy) <- c("Plante","Categorie", "Traitement", "conc", "absortion")
names(CO2copy)
# [1] "Plante" "Categorie" "Traitement" "conc" "absortion"
Variables and strings can be concatenated together.
The function paste() is very useful for concatenating.
See ?paste and ?paste0.
# Let's create a unique id for our samples:
# Don't forget to use "" for strings
CO2copy$uniqueID <- paste0(CO2copy$Plante, "_",CO2copy$Categorie, "_", CO2copy$Traitement)
# observe the results
head(CO2copy$uniqueID)
# [1] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"
# [4] "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled" "Qn1_Quebec_nonchilled"
Creating new variables works for numbers and mathematical operations as well!
# Let's standardize our variable "absortion" to relative values
CO2copy$absortionRel <- CO2copy$absortion/max(CO2copy$absortion)
# Observe the results
head(CO2copy$absortionRel)
There are many ways to subset a data frame
# Let's keep working with our CO2copy data frame
# Select only "Plante" and "absortionRel" columns. (Don't forget the ","!)
CO2copy[,c("Plante", "absortionRel")]
# Subset the data frame to rows 1 to 50
CO2copy[1:50,]
# Select observations matching only the nonchilled Traitement.
CO2copy[CO2copy$Traitement == "nonchilled",]
# Select observations with absortion higher or equal to 20
CO2copy[CO2copy$absortion >= 20, ]
# Select nonchilled observations with absortion higher or equal to 20
CO2copy[CO2copy$Traitement == "nonchilled" & CO2copy$absortion >= 20, ]
# We are done playing with the dataset copy, let's erase it.
CO2copy <- NULL
Go here to check all the logical operators you can use
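As a brief illustration of a few of these operators, here is a sketch that uses the copy of the CO2 dataset that ships with R in the datasets package. We assume it matches the workshop file, so the example runs without loading co2_good.csv.

```r
# Built-in copy of the CO2 data, converted to a plain data frame
co2 <- as.data.frame(datasets::CO2)

# | means OR: chilled plants, or any observation with uptake of at least 40
either <- co2[co2$Treatment == "chilled" | co2$uptake >= 40, ]

# != means "not equal to": everything that is not from Quebec
not_quebec <- co2[co2$Type != "Quebec", ]

# %in% tests membership in a set of values
some_conc <- co2[co2$conc %in% c(95, 1000), ]

nrow(not_quebec)
nrow(some_conc)
```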
A good way to start your data exploration is to look at some basic statistics of your dataset.
Use the summary() function to do that!
summary(CO2)
This is also useful to spot some errors you might have missed!
You can also use other functions to calculate basic statistics on parts of your data frame. Let's try the mean(), sd() and hist() functions:
# Calculate the mean and the standard deviation of the CO2 concentration:
# Assign them to new variables
meanConc <- mean(CO2$conc)
sdConc <- sd(CO2$conc)
# print() prints any given value to the R console
print(paste("the mean of concentration is:", meanConc))
print(paste("the standard deviation of concentration is:", sdConc))
# Let's plot a histogram to explore the distribution of "uptake"
hist(CO2$uptake)
# Increasing the number of bins to observe better the pattern
hist(CO2$uptake, breaks = 40)
Use apply() to calculate the means of the last two columns of the data frame (i.e. the columns that contain continuous data).
?apply # Let's see how apply works!
apply(CO2[,4:5], MARGIN = 2, FUN = mean)
#     conc   uptake
# 435.0000  27.2131
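For column means specifically, base R offers shorter alternatives to apply(). The sketch below compares three of them on the built-in copy of the CO2 dataset from the datasets package (assumed here to match the workshop file).

```r
# Built-in copy of the CO2 data, converted to a plain data frame
co2 <- as.data.frame(datasets::CO2)

# Three ways to get the mean of columns 4 and 5 (conc and uptake)
by_apply    <- apply(co2[, 4:5], MARGIN = 2, FUN = mean)
by_sapply   <- sapply(co2[, 4:5], mean)   # loops over the columns
by_colMeans <- colMeans(co2[, 4:5])       # dedicated function

# All three return the same named numeric vector
all.equal(by_apply, by_sapply)
all.equal(by_apply, by_colMeans)
```

MARGIN = 2 tells apply() to work column by column; MARGIN = 1 would work row by row.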
# Saving an R workspace file that stores all your objects
save.image(file="co2_project_Data.RData")
# Clear your memory
rm(list = ls())
# Reload your data
load("co2_project_Data.RData")
head(CO2) # Looking good!
#   Plant   Type  Treatment conc uptake
# 1   Qn1 Quebec nonchilled   95   16.0
# 2   Qn1 Quebec nonchilled  175   30.4
# 3   Qn1 Quebec nonchilled  250   34.8
# 4   Qn1 Quebec nonchilled  350   37.2
# 5   Qn1 Quebec nonchilled  500   35.3
# 6   Qn1 Quebec nonchilled  675   39.2
R provides write functions that allow you to write objects directly to files on your computer. Let us use the write.csv function to save our CO2 data into a .csv file:
write.csv(CO2, file = "co2_new.csv")
Note that our two arguments are:
CO2 | Object (name) |
"co2_new.csv" | File to write (name) |
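One detail worth knowing: by default write.csv() also writes the row names as an extra first column. The sketch below shows this on a temporary file (so nothing is written to your project folder), using the built-in copy of CO2 from the datasets package as a stand-in for the workshop data.

```r
# Built-in copy of the CO2 data, converted to a plain data frame
co2 <- as.data.frame(datasets::CO2)
tmp <- tempfile(fileext = ".csv")

# Default behaviour: row names are saved and come back as a column called "X"
write.csv(co2, file = tmp)
names_default <- names(read.csv(tmp))
names_default

# row.names = FALSE keeps only the original columns
write.csv(co2, file = tmp, row.names = FALSE)
names_clean <- names(read.csv(tmp))
names_clean
```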
If you don’t have your own data, work with your neighbour.
Remember to clean your workspace.
Getting your data working properly can be tougher than you think!
Sometimes, one may find compatibility issues.
For example, sharing data from an Apple computer to Windows, or between computers set up in different continents can lead to incompatible files (e.g. different decimal separators).
Let's practice how to solve these common errors.
Read the file co2_broken.csv
CO2 <- read.csv("co2_broken.csv")
head(CO2)
# NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2 Plant\tType\tTreatment\tconc\tuptake
# 3 Qn1\tQuebec\tnonchilled\t95\t16
# 4 Qn1\tQuebec\tnonchilled\t175\t30.4
# 5 Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes
# 6 Qn1\tQuebec\tnonchilled\t350\t37.2
# due.to.excessive X X.1 X.2 X.3
# 1 NA NA NA NA NA
# 2 NA NA NA NA NA
# 3 NA NA NA NA NA
# 4 NA NA NA NA NA
# 5 NA NA NA NA NA
# 6 NA NA NA NA NA
CO2[1:4,]
# NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2 Plant\tType\tTreatment\tconc\tuptake
# 3 Qn1\tQuebec\tnonchilled\t95\t16
# 4 Qn1\tQuebec\tnonchilled\t175\t30.4
# due.to.excessive X X.1 X.2 X.3
# 1 NA NA NA NA NA
# 2 NA NA NA NA NA
# 3 NA NA NA NA NA
# 4 NA NA NA NA NA
Some useful functions:
?read.csv - look at some of the options for how to load a .csv
head() - first few rows
str() - structure of data
class() - class of the object
unique() - unique observations
levels() - levels of a factor
which() - ask a question to your data frame
droplevels() - get rid of undesired levels after subsetting factors
HINT: There are 4 problems!
ERROR 1 The data appears to be lumped into one column
head(CO2)
# NOTE..It.rain.a.lot.in.Quebec.during.sampling
# 1 falling on my notebook numerous values can't be read rain
# 2 Plant\tType\tTreatment\tconc\tuptake
# 3 Qn1\tQuebec\tnonchilled\t95\t16
# 4 Qn1\tQuebec\tnonchilled\t175\t30.4
# 5 Qn1\tQuebec\tnonchilled\t250\tcannot_read_notes
# 6 Qn1\tQuebec\tnonchilled\t350\t37.2
# due.to.excessive X X.1 X.2 X.3
# 1 NA NA NA NA NA
# 2 NA NA NA NA NA
# 3 NA NA NA NA NA
# 4 NA NA NA NA NA
# 5 NA NA NA NA NA
# 6 NA NA NA NA NA
ERROR 1 - Solution
CO2 <- read.csv("co2_broken.csv",sep = "")
ERROR 2 The data does not start until the third line of the file, so you end up with notes on the file as the headings.
head(CO2)
# NOTE. It rain a lot in. Quebec
# 1 falling on my notebook numerous values can't
# 2 Plant Type Treatment conc uptake
# 3 Qn1 Quebec nonchilled 95 16
# 4 Qn1 Quebec nonchilled 175 30.4
# 5 Qn1 Quebec nonchilled 250 cannot_read_notes
# 6 Qn1 Quebec nonchilled 350 37.2
# during sampling. due to excessive X....
# 1 be read rain,,,, NA NA NA
# 2 NA NA NA
# 3 NA NA NA
# 4 NA NA NA
# 5 NA NA NA
# 6 NA NA NA
ERROR 2 - Solution
Skip two lines when loading the file using the "skip" argument:
CO2 <- read.csv("co2_broken.csv", sep = "", skip = 2)
head(CO2)
# Plant Type Treatment conc uptake
# 1 Qn1 Quebec nonchilled 95 16
# 2 Qn1 Quebec nonchilled 175 30.4
# 3 Qn1 Quebec nonchilled 250 cannot_read_notes
# 4 Qn1 Quebec nonchilled 350 37.2
# 5 Qn1 Quebec nonchilled 500 35.3
# 6 Qn1 Quebec nonchilled cannot_read_notes 39.2
ERROR 3
conc and uptake variables are considered factors instead of numbers, because there are comments in the numeric columns
str(CO2)
# 'data.frame': 84 obs. of 5 variables:
# $ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
# $ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
# $ conc : Factor w/ 8 levels "1000","175","250",..: 7 2 3 4 5 8 1 7 2 3 ...
# $ uptake : Factor w/ 77 levels "10.5","10.6",..: 15 39 76 54 50 61 63 9 32 53 ...
?read.csv
ERROR 3 - Solution
Tell R that all of NA, "na", and "cannot_read_notes" should be considered NA. Then because all other values in those columns are numbers, conc and uptake will be loaded as numeric/integer.
CO2 <- read.csv("co2_broken.csv", sep = "", skip = 2, na.strings = c("NA","na","cannot_read_notes"))
str(CO2)
# 'data.frame': 84 obs. of 5 variables:
# $ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
# $ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ Treatment: Factor w/ 4 levels "chiled","chilled",..: 4 4 4 4 4 4 4 4 4 3 ...
# $ conc : int 95 175 250 350 500 NA 1000 95 175 250 ...
# $ uptake : num 16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
ERROR 4
There are only 2 treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments.
str(CO2)
levels(CO2$Treatment)
# [1] "chiled" "chilled" "nnchilled" "nonchilled"
unique(CO2$Treatment)
# [1] nonchilled nnchilled chilled chiled
# Levels: chiled chilled nnchilled nonchilled
ERROR 4 - Solution
# Identify all rows that contain "nnchilled" and replace with "nonchilled"
CO2$Treatment[CO2$Treatment=="nnchilled"] <- "nonchilled"
# Identify all rows that contain "chiled" and replace with "chilled"
CO2$Treatment[CO2$Treatment=="chiled"] <- "chilled"
# Drop unused levels from factor
CO2 <- droplevels(CO2)
str(CO2)
# 'data.frame': 84 obs. of 5 variables:
# $ Plant : Factor w/ 12 levels "Mc1","Mc2","Mc3",..: 10 10 10 10 10 10 10 11 11 11 ...
# $ Type : Factor w/ 2 levels "Mississippi",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ Treatment: Factor w/ 2 levels "chilled","nonchilled": 2 2 2 2 2 2 2 2 2 2 ...
# $ conc : int 95 175 250 350 500 NA 1000 95 175 250 ...
# $ uptake : num 16 30.4 NA 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
Fixed!
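The four fixes can be replayed on a few made-up lines that mimic the structure of the broken file (the content below is invented for illustration and is not the real co2_broken.csv). One assumption to flag: since R 4.0.0, read.csv() no longer converts text columns to factors by default, so stringsAsFactors = TRUE is set here to mirror the factor output shown in this workshop.

```r
# Made-up lines with the same four problems: tab-separated values,
# two lines of notes, text in numeric columns and misspelled treatments
raw <- c("NOTE: made-up note, line 1",
         "made-up note, line 2",
         "Plant\tTreatment\tconc\tuptake",
         "Qn1\tnonchilled\t95\t16",
         "Qn1\tnnchilled\t175\tcannot_read_notes",
         "Qc1\tchiled\tna\t30.4",
         "Qc1\tchilled\t250\t34.8")

fixed <- read.csv(text = raw,
                  sep = "",                 # ERROR 1: split on white space
                  skip = 2,                 # ERROR 2: skip the two note lines
                  na.strings = c("NA", "na", "cannot_read_notes"),  # ERROR 3
                  stringsAsFactors = TRUE)  # assumption: mimic R < 4.0.0

# ERROR 4: correct the misspelled levels, then drop the unused ones
fixed$Treatment[fixed$Treatment == "nnchilled"] <- "nonchilled"
fixed$Treatment[fixed$Treatment == "chiled"] <- "chilled"
fixed <- droplevels(fixed)

str(fixed)
```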
Using tidyr to reshape data frames
library(tidyr)
Wide format
#   Species DBH Height
# 1     Oak  12     56
# 2     Elm  20     85
# 3     Ash  13     55
Long format
#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55
Wide data format has a separate column for each variable or each factor in your study
Long data format has a column for possible variables and a column for the values of those variables
Wide data frame can be used for some basic plotting in ggplot2, but more complex plots require long format (example to come)
dplyr, lm(), glm(), gam() all require long data format
Tidying allows you to manipulate the structure of your data while preserving all original information
gather() - convert from wide to long format
spread() - convert from long to wide format
tidyr installation
install.packages("tidyr")
library(tidyr)
gather columns into rows
gather(data, key, value, ...)
data - a data frame (e.g. wide)
key - name of the new column containing variable names (e.g. Measurement)
value - name of the new column containing variable values (e.g. Value)
... - name or numeric index of the columns we wish to gather (e.g. DBH, Height)
wide <- data.frame(Species = c("Oak", "Elm", "Ash"), DBH = c(12, 20, 13), Height = c(56, 85, 55))
wide
#   Species DBH Height
# 1     Oak  12     56
# 2     Elm  20     85
# 3     Ash  13     55
long <- gather(wide, Measurement, Value, DBH, Height)
long
#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55
spread rows into columns
spread(data, key, value)
data - a data frame (e.g. long)
key - name of the column containing variable names (e.g. Measurement)
value - name of the column containing variable values (e.g. Value)
long
#   Species Measurement Value
# 1     Oak         DBH    12
# 2     Elm         DBH    20
# 3     Ash         DBH    13
# 4     Oak      Height    56
# 5     Elm      Height    85
# 6     Ash      Height    55
wide2 <- spread(long, Measurement, Value)
wide2
#   Species DBH Height
# 1     Ash  13     55
# 2     Elm  20     85
# 3     Oak  12     56
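Recent versions of tidyr (1.0.0 and later) also provide pivot_longer() and pivot_wider(), which the tidyr documentation presents as the newer counterparts of gather() and spread(). The sketch below repeats the tree example with them; it assumes such a version of tidyr is installed.

```r
library(tidyr)

wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
                   DBH     = c(12, 20, 13),
                   Height  = c(56, 85, 55))

# Wide to long: names_to and values_to play the roles of key and value
long <- pivot_longer(wide, cols = c(DBH, Height),
                     names_to = "Measurement", values_to = "Value")
long

# Long to wide: names_from and values_from play the roles of key and value
wide_again <- pivot_wider(long, names_from = Measurement, values_from = Value)
wide_again
```

Both functions return a tibble, and pivot_longer() orders the rows by species rather than by measurement, so the printed output looks slightly different from the gather() example.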
separate columns
separate() splits a column by a character string separator
separate(data, col, into, sep)
data - a data frame (e.g. long)
col - name of the column you wish to separate
into - names of new variables to create
sep - character which indicates where to separate
separate() example
Create a fictional dataset about fish and plankton
set.seed(8)
messy <- data.frame(id = 1:4, trt = sample(rep(c('control', 'farm'), each = 2)), zooplankton.T1 = runif(4), fish.T1 = runif(4), zooplankton.T2 = runif(4), fish.T2 = runif(4))
messy
#   id     trt zooplankton.T1    fish.T1 zooplankton.T2   fish.T2
# 1  1    farm      0.7189275 0.64449114    0.544962116 0.2644589
# 2  2    farm      0.2908734 0.45704489    0.138224346 0.2765322
# 3  3 control      0.9322698 0.08930101    0.927812252 0.5211070
# 4  4 control      0.7691470 0.43239137    0.001301721 0.2236889
First convert the messy data frame from wide to long format
messy.long <- gather(messy, taxa, count, -id, -trt)
head(messy.long)
#   id     trt           taxa     count
# 1  1    farm zooplankton.T1 0.7189275
# 2  2    farm zooplankton.T1 0.2908734
# 3  3 control zooplankton.T1 0.9322698
# 4  4 control zooplankton.T1 0.7691470
# 5  1    farm        fish.T1 0.6444911
# 6  2    farm        fish.T1 0.4570449
Then we want to split the 2 sampling times (T1 and T2).
messy.long.sep <- separate(messy.long, taxa, into = c("species", "time"), sep = "\\.")
head(messy.long.sep)
#   id     trt     species time     count
# 1  1    farm zooplankton   T1 0.7189275
# 2  2    farm zooplankton   T1 0.2908734
# 3  3 control zooplankton   T1 0.9322698
# 4  4 control zooplankton   T1 0.7691470
# 5  1    farm        fish   T1 0.6444911
# 6  2    farm        fish   T1 0.4570449
The argument sep = "\\." tells R to split the character string around the period (.). We cannot type "." directly because, as a regular expression, it matches any single character.
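The same behaviour can be observed with the base R function strsplit(), which also treats its separator as a regular expression. This is only a side illustration with base R, not the tidyr code itself.

```r
taxa <- "zooplankton.T1"

# Escaped period: split on the literal "."
escaped <- strsplit(taxa, "\\.")[[1]]
escaped

# fixed = TRUE is another way to ask for a literal match
literal <- strsplit(taxa, ".", fixed = TRUE)[[1]]
literal

# Unescaped period: every character is treated as a separator,
# so only empty strings are left
unescaped <- strsplit(taxa, ".")[[1]]
unescaped
```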
tidyr
A package that reshapes the layout of data sets.
Converting from wide to long format using gather()
Converting from long format to wide format using spread()
Split and merge columns with unite() and separate()
Data Wrangling with dplyr and tidyr Cheat Sheet
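unite() is listed above but not demonstrated, so here is a minimal sketch of it. The data frame and its values are invented for illustration; unite() pastes the named columns together into a single new column, which is the reverse of separate().

```r
library(tidyr)

# Made-up long data with separate species and time columns
sampled <- data.frame(id      = 1:2,
                      species = c("fish", "zooplankton"),
                      time    = c("T1", "T2"),
                      count   = c(0.64, 0.54))

# Merge species and time into one column called taxa, joined by "."
united <- unite(sampled, col = "taxa", species, time, sep = ".")
united
```

By default the original columns are removed after merging; the remove = FALSE argument keeps them.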
tidyr
Using the airquality dataset, gather all the columns (except Month and Day) into rows.
Then spread the resulting data frame to return to the original data format.
?airquality
data(airquality)
gather all the columns (except Month and Day) into rows.
air.long <- gather(airquality, variable, value, -Month, -Day)
head(air.long)
#   Month Day variable value
# 1     5   1    Ozone    41
# 2     5   2    Ozone    36
# 3     5   3    Ozone    12
# 4     5   4    Ozone    18
# 5     5   5    Ozone    NA
# 6     5   6    Ozone    28
Note that the syntax used here indicates that we wish to gather ALL the columns except Month and Day. It is equivalent to: gather(airquality, variable, value, Ozone, Solar.R, Wind, Temp)
spread the resulting data frame to return to the original data format.
air.wide <- spread(air.long, variable, value)
head(air.wide)
#   Month Day Ozone Solar.R Temp Wind
# 1     5   1    41     190   67  7.4
# 2     5   2    36     118   72  8.0
# 3     5   3    12     149   74 12.6
# 4     5   4    18     313   62 11.5
# 5     5   5    NA      NA   56 14.3
# 6     5   6    28      NA   66 14.9
dplyr
Some corresponding R base functions: split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate()
install.packages("dplyr")
library(dplyr)
dplyr
These 4 core functions tackle the most common manipulations when working with data frames
select(): select columns from a data frame
filter(): filter rows according to defined criteria
arrange(): re-order data based on criteria (e.g. ascending, descending)
mutate(): create or transform values in a column
select columns
select(data, ...)
... - can be column names or positions or complex expressions separated by commas
select(data, column1, column2) | select columns 1 and 2 |
select(data, c(2:4,6)) | select columns 2 to 4 and 6 |
select(data, -column1) | select all columns except column 1 |
select(data, starts_with("x.")) | select all columns that start with "x." |
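These forms can be tried on a small data frame. The one below is invented for illustration (airquality has no columns starting with "x."), and the sketch assumes dplyr is installed.

```r
library(dplyr)

# A made-up data frame with two columns whose names start with "x."
df <- data.frame(id  = 1:3,
                 x.a = c(1, 2, 3),
                 x.b = c(4, 5, 6),
                 y   = c("u", "v", "w"))

by_name   <- select(df, id, y)               # columns by name
by_pos    <- select(df, c(2:3))              # columns by position
without   <- select(df, -id)                 # everything except id
by_prefix <- select(df, starts_with("x."))   # names beginning with "x."

names(by_prefix)
```

starts_with() matches the prefix literally, so the period in "x." does not need to be escaped here.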
select columns
Example: suppose we are only interested in the variation of Ozone over time within the airquality dataset
ozone <- select(airquality, Ozone, Month, Day)
head(ozone)
#   Ozone Month Day
# 1    41     5   1
# 2    36     5   2
# 3    12     5   3
# 4    18     5   4
# 5    NA     5   5
# 6    28     5   6
filter rows
Extract a subset of rows that meet one or more specific conditions
filter(dataframe, logical statement 1, logical statement 2, ...)
Example: we are interested in analyses that focus on the month of August during high temperature events
august <- filter(airquality, Month == 8, Temp >= 90)
# same as: filter(airquality, Month == 8 & Temp >= 90)
head(august)
#   Ozone Solar.R Wind Temp Month Day
# 1    89     229 10.3   90     8   8
# 2   110     207  8.0   90     8   9
# 3    NA     222  8.6   92     8  10
# 4    76     203  9.7   97     8  28
# 5   118     225  2.3   94     8  29
# 6    84     237  6.3   96     8  30
arrange
Re-order rows by a particular column, by default in ascending order. Use desc() for descending order.
arrange(data, variable1, desc(variable2), ...)
Example:
# Shuffle the rows of airquality
air_mess <- sample_frac(airquality, 1)
head(air_mess)
#   Ozone Solar.R Wind Temp Month Day
# 1    23     115  7.4   76     8  18
# 2    28     273 11.5   82     8  13
# 3     8      19 20.1   61     5   9
# 4   135     269  4.1   84     7   1
# 5    23     299  8.6   65     5   7
# 6    30     322 11.5   68     5  19
# Put the shuffled rows back in chronological order
air_chron <- arrange(air_mess, Month, Day)
head(air_chron)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6
Try arrange(air_mess, Day, Month) and see the difference.
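desc() is mentioned above but not used in the example. As a small sketch (assuming dplyr is installed), the following sorts airquality by month and, within each month, from the hottest to the coolest day.

```r
library(dplyr)

# Ascending by Month, then descending by Temp within each month
hot_first <- arrange(airquality, Month, desc(Temp))
head(hot_first)

# The first row is the hottest day of the first month (May)
hot_first$Temp[1] == max(airquality$Temp[airquality$Month == 5])
```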
mutate
Compute and add new columns
mutate(data, newVar1 = expression1, newVar2 = expression2, ...)
Example: we want to convert the temperature variable from degrees Fahrenheit to degrees Celsius
airquality_C <- mutate(airquality, Temp_C = (Temp-32)*(5/9))
head(airquality_C)
#   Ozone Solar.R Wind Temp Month Day   Temp_C
# 1    41     190  7.4   67     5   1 19.44444
# 2    36     118  8.0   72     5   2 22.22222
# 3    12     149 12.6   74     5   3 23.33333
# 4    18     313 11.5   62     5   4 16.66667
# 5    NA      NA 14.3   56     5   5 13.33333
# 6    28      NA 14.9   66     5   6 18.88889
magrittr
Usually data manipulation requires multiple steps. The magrittr package offers a pipe operator %>% which allows us to link multiple operations.
install.packages("magrittr")
require(magrittr)
Suppose we want to analyse only the month of June, then convert the temperature variable to degrees Celsius. We can create the required data frame by combining 2 dplyr verbs we learned:
june_C <- mutate(filter(airquality, Month == 6), Temp_C = (Temp-32)*(5/9))
As we add more operations, wrapping functions one inside the other becomes increasingly illegible. Running them step by step instead would be redundant and would write many intermediate objects to the workspace.
Alternatively, we can use magrittr's pipe operator to link these successive operations:
june_C <- airquality %>% filter(Month == 6) %>% mutate(Temp_C = (Temp-32)*(5/9))
Advantages :
dplyr::group_by and summarise
The dplyr verbs become especially powerful when they are combined using the pipe operator %>%.
The following dplyr functions allow us to split our data frame into groups on which we can perform operations individually
group_by(): group data frame by a factor for downstream operations (usually summarise)
summarise(): summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min(), max(), mean(), etc.)
dplyr - Split-Apply-Combine
The group_by function is key to the Split-Apply-Combine strategy
Example: we are interested in the mean temperature and standard deviation within each month in the airquality dataset
month_sum <- airquality %>% group_by(Month) %>% summarise(mean_temp = mean(Temp), sd_temp = sd(Temp))
month_sum
# # A tibble: 5 x 3
#   Month mean_temp sd_temp
#   <int>     <dbl>   <dbl>
# 1     5      65.5    6.85
# 2     6      79.1    6.60
# 3     7      83.9    4.32
# 4     8      84.0    6.59
# 5     9      76.9    8.36
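For comparison, the same split-apply-combine summary can be sketched with two of the base R functions listed earlier, aggregate() and tapply(). This is an alternative illustration that needs no extra package; it is not a replacement for the dplyr code above.

```r
# Mean temperature per month with aggregate() and a formula:
# "Temp ~ Month" reads as "Temp, split by Month"
mean_by_month <- aggregate(Temp ~ Month, data = airquality, FUN = mean)
mean_by_month

# Standard deviation per month with tapply(): values, groups, function
sd_by_month <- tapply(airquality$Temp, airquality$Month, sd)
sd_by_month
```

aggregate() returns a data frame with one row per month, while tapply() returns a named vector.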
dplyr and magrittr
Using the ChickWeight dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study.
Employ dplyr verbs and the %>% operator.
weight_diff <- ChickWeight %>% group_by(Chick) %>% summarise(weight_diff = max(weight) - min(weight))
head(weight_diff)
# # A tibble: 6 x 2
#   Chick weight_diff
#   <ord>       <dbl>
# 1 18              4
# 2 16             16
# 3 15             27
# 4 13             55
# 5 9              58
# 6 20             76
Using the ChickWeight dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study.
Employ dplyr verbs and the %>% operator.
(Hint: first() and last() may be useful here.)
diet_summ <- ChickWeight %>% group_by(Diet, Chick) %>% summarise(weight_gain = last(weight) - first(weight)) %>% group_by(Diet) %>% summarise(mean_gain = mean(weight_gain))
diet_summ
# # A tibble: 4 x 2
#   Diet  mean_gain
#   <fct>     <dbl>
# 1 1          115.
# 2 2          174
# 3 3          230.
# 4 4          188.