Introduction to R and Rstudio
Session - Cleaning data with {dplyr}
data:image/s3,"s3://crabby-images/80aff/80aff8af108028a70dec4d9a5648506005866cbf" alt="Cartoon image with the word dplyr: go wrangling above. There are two fluffy characters with the bigger called dplyr being ridden by a smaller character with a hat like a cowboy. The cowboy is rounding up three others called data that look less friendly and are being caught with the cowboy's whip"
Artwork by @allison_horst
Wrangling
Is the reshaping or transforming of data into a format which is easier to work with
This is often the largest part of many analyses and data science
A note on tidy data
Tidyverse functions work best with tidy data:
- Each variable forms a column.
- Each observation forms a row.
(Broadly, this means long rather than wide tables)
{dplyr} package
- {dplyr} is a language for data manipulation
- Most wrangling puzzles can be solved with knowledge of just a few {dplyr} verbs or functions
- Many of the concepts of these functions exist in SQL but {dplyr} (and other packages) can extend this further
Some functions/verbs to start with
Some key verbs will help us gain a deeper understanding of our data sets.
Note summarise()
can also be spelt summarize()
dplyr::arrange()
dplyr::filter()
dplyr::mutate()
dplyr::summarise()
Building with steps
These verbs aren’t used independently of each other.
Each can be a step in the code, like a recipe but can also be repeated.
A recipe starts with:
potato then
peel then
slice into medium sized pieces then
boil for 25 minutes then
mash
Recipe as code
The potato is the object in R terms and the steps are verbs or functions
Take a potato
then
peel
then
slice
into medium sized pieces then
boil
for 25 minutes then
mash
potato |>
peel() |>
slice(size = "medium") |>
boil(time = 25) |>
mash()
the |>
can be replaced with the word ‘then’ in this recipe scenario
Pipe
Shortcut key Ctrl+Shift+m
You might be familiar with the pipe %>%
from {magrittr} and in {tidyverse} but the new pipe |>
doesn’t require any packages to run
Q1. Which organisation provided the highest number of Mental Health beds?
arrange()
Reorder rows based on selected variable
beds_data |>
arrange(beds_av)
Descending data
We need descending order:
beds_data |>
arrange(desc(beds_av))
desc()
works for text and numeric variables
Q2. Which 2 organisations provided the highest number of MH beds in September 2018?
- We’ll use
arrange()
as before to get the “highest number”
- But we require only observations with the date “September 2018”
filter()
The expression inside brackets should return TRUE or FALSE. We are choosing rows where this expression is TRUE.
beds_data |>
filter(date == "2018-09-01")
A negative test of equality
To exclude and test where the expression is NOT equal !=
beds_data |>
filter(date != "2018-09-01")
Ordered and filtered
filter()
first to reduce the number of rows to apply the next code to
beds_data |>
filter(date == "2018-09-01") |>
arrange(desc(beds_av))
Find the top 2 organisations
This isn’t a key function but useful and there are many other functions for slice...
beds_data |>
arrange(desc(beds_av)) |>
filter(date == "2018-09-01") |>
slice_head(n = 2)
Q3. Which organisations had the highest percentage bed occupancy in September 2018?
- We’ll use
arrange()
as before to find “highest”
- We’ll use
filter()
as before to restrict by date “September 2018”
- But we don’t have a percentage variable in the data
Create new variables
= in this context is an alias not a test of equality
beds_data |>
mutate(perc_occ = occ_av / beds_av) |>
filter(date == "2018-09-01") |>
arrange(desc(perc_occ))
Q4. What was the mean number of beds (for the dataset)?
- Let’s first look at how we’d produce summary statistics like a mean
- And then see how this can be applied to groups of data like organisations
summarise()
Collapses a single summary value
beds_data |>
summarise(mean_beds = mean(beds_av))
Missing values
We’ll need to remove NA (not available) values to get a suitable mean. TRUE
can also be T
beds_data |>
summarise(mean_beds = mean(beds_av,
na.rm = TRUE))
Have a go!
Instead of mean()
use median()
object |>
summarise(new_name = function_name(column_name,
na.rm = ???))
Use a sum()
statistic twice
object |>
summarise(col_1 = function_name(beds_av,
na.rm = ???),
col_2 = function_name(occ_av,
na.rm = ???)
)
Answer for summary statistics
median()
beds_data |>
summarise(median_beds = median(beds_av,
na.rm = TRUE))
# A tibble: 1 × 1
median_beds
<dbl>
1 241
sum()
beds_data |>
summarise(total_beds = sum(beds_av, na.rm = TRUE),
total_occupacy = sum(occ_av, na.rm = TRUE))
# A tibble: 1 × 2
total_beds total_occupacy
<dbl> <dbl>
1 412480 368434
Applying summarise()
to groups
- Now we know how to use
summarise()
- We’ll produce a summary value for each value of date
group_by() - temporary grouping
New for 2023 grouping can be added into the functions directly so is temporary.
Also used in filter()
and slice()
functions.
beds_data |>
summarise(mean_beds = mean(beds_av, na.rm = TRUE),
.by = date)
group_by() - persistent grouping
group_by()
is a function that you may see in other code.
It does nothing to the output alone.
The change occurs behind the scenes.
beds_data |>
group_by(date)
ungroup()
data:image/s3,"s3://crabby-images/03966/039664f66e4eea2b2c58f8038c3843920e11159a" alt="Cartoon of fuzzy creatures created by Allison Horts with party hats on. Two are together and happy but one is behind holding a present and looking sad. The words."
Break?
Option to take this break before an exercise or after
Q5. Which organisations have the highest mean % bed occupancy?
-
summarise()
using sum() for total_beds
and total_occupancy
.
- Grouping in
summarise()
by organisations using .by =
.
-
mutate()
the new 2 column data frame to create a percentage using the totals (occ / beds)
- Order to find highest by using
arrange()
Solutions
beds_data |>
summarise(total_beds = sum(beds_av, na.rm = TRUE),
total_occupancy = sum(occ_av, na.rm = TRUE),
.by = org_name) |>
mutate(perc_occ = total_occupancy / total_beds) |>
arrange(desc(perc_occ))
Answer using the group_by() function
beds_data |>
group_by(org_name) |>
summarise(total_beds = sum(beds_av, na.rm = TRUE),
total_occupancy = sum(occ_av, na.rm = TRUE)) |>
mutate(perc_occ = total_occupancy / total_beds) |>
arrange(desc(perc_occ))