Introduction to R and Rstudio

Session - Cleaning data with {dplyr}

Zoë Turner
Cartoon image with the word dplyr: go wrangling above. There are two fluffy characters with the bigger called dplyr being ridden by a smaller character with a hat like a cowboy. The cowboy is rounding up three others called data that look less friendly and are being caught with the cowboy's whip

Artwork by @allison_horst

Wrangling

Is the reshaping or transforming of data into a format which is easier to work with

This is often the largest part of many analyses and data science

A note on tidy data

Tidyverse functions work best with tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.

(Broadly, this means long rather than wide tables)

{dplyr} package

  • {dplyr} is a language for data manipulation
  • Most wrangling puzzles can be solved with knowledge of just a few {dplyr} verbs or functions
  • Many of the concepts of these functions exist in SQL but {dplyr} (and other packages) can extend this further

Some functions/verbs to start with

Some key verbs will help us gain a deeper understanding of our data sets.

Note summarise() can also be spelt summarize()

dplyr::arrange()
dplyr::filter()
dplyr::mutate()
dplyr::summarise()

Building with steps

These verbs aren’t used independently of each other.

Each can be a step in the code, like a recipe but can also be repeated.

A recipe starts with:

potato then
peel then
slice into medium sized pieces then
boil for 25 minutes then
mash

Recipe as code

The potato is the object in R terms and the steps are verbs or functions

Take a potato then
peel then
slice into medium sized pieces then
boil for 25 minutes then
mash

potato |>
peel() |>
slice(size = "medium") |>
boil(time = 25) |>
mash()

the |> can be replaced with the word ‘then’ in this recipe scenario

Pipe

Shortcut key Ctrl+Shift+m

You might be familiar with the pipe %>% from {magrittr} and in {tidyverse} but the new pipe |> doesn’t require any packages to run

Screenshot of the Tools/Options wizard in the Code tab from the side and Editing at the top. Use native pipe operator option to select is highlighted.

Q1. Which organisation provided the highest number of Mental Health beds?

arrange()

Reorder rows based on selected variable

beds_data |> 
  arrange(beds_av)

Descending data

We need descending order:

beds_data |> 
  arrange(desc(beds_av))

desc() works for text and numeric variables

Q2. Which 2 organisations provided the highest number of MH beds in September 2018?

  • We’ll use arrange() as before to get the “highest number”
  • But we require only observations with the date “September 2018”

filter()

The expression inside brackets should return TRUE or FALSE. We are choosing rows where this expression is TRUE.

beds_data |> 
  filter(date == "2018-09-01") 

A negative test of equality

To exclude and test where the expression is NOT equal !=

beds_data |> 
  filter(date != "2018-09-01") 

Ordered and filtered

filter() first to reduce the number of rows to apply the next code to

beds_data  |> 
  filter(date == "2018-09-01") |> 
  arrange(desc(beds_av)) 

Find the top 2 organisations

This isn’t a key function but useful and there are many other functions for slice...

beds_data  |> 
  arrange(desc(beds_av)) |> 
  filter(date == "2018-09-01") |> 
  slice_head(n = 2)

Q3. Which organisations had the highest percentage bed occupancy in September 2018?

  • We’ll use arrange() as before to find “highest”
  • We’ll use filter() as before to restrict by date “September 2018”
  • But we don’t have a percentage variable in the data

Create new variables

= in this context is an alias not a test of equality

beds_data |> 
  mutate(perc_occ = occ_av / beds_av) |> 
  filter(date == "2018-09-01") |> 
  arrange(desc(perc_occ)) 

Q4. What was the mean number of beds (for the dataset)?

  • Let’s first look at how we’d produce summary statistics like a mean
  • And then see how this can be applied to groups of data like organisations

summarise()

Collapses a single summary value

beds_data |> 
  summarise(mean_beds = mean(beds_av))

Missing values

We’ll need to remove NA (not available) values to get a suitable mean. TRUE can also be T

beds_data |> 
  summarise(mean_beds = mean(beds_av,
                             na.rm = TRUE)) 

Have a go!

Instead of mean() use median()

object |> 
  summarise(new_name = function_name(column_name,
                                     na.rm = ???))

Use a sum() statistic twice

object |> 
  summarise(col_1 = function_name(beds_av,
                                  na.rm = ???),
            col_2 = function_name(occ_av,
                                  na.rm = ???)
)
10:00

Answer for summary statistics

median()

beds_data |> 
  summarise(median_beds = median(beds_av,
                                 na.rm = TRUE))
# A tibble: 1 × 1
  median_beds
        <dbl>
1         241

sum()

beds_data |> 
  summarise(total_beds = sum(beds_av, na.rm = TRUE),
            total_occupacy = sum(occ_av, na.rm = TRUE))
# A tibble: 1 × 2
  total_beds total_occupacy
       <dbl>          <dbl>
1     412480         368434

Applying summarise() to groups

  • Now we know how to use summarise()
  • We’ll produce a summary value for each value of date

group_by() - temporary grouping

New for 2023 grouping can be added into the functions directly so is temporary.

Also used in filter() and slice() functions.

beds_data |> 
  summarise(mean_beds = mean(beds_av, na.rm = TRUE),
            .by = date)

group_by() - persistent grouping

group_by() is a function that you may see in other code.

It does nothing to the output alone.
The change occurs behind the scenes.

beds_data |> 
  group_by(date) 

ungroup()

Cartoon of fuzzy creatures created by Allison Horts with party hats on. Two are together and happy but one is behind holding a present and looking sad. The words.

Break?

Option to take this break before an exercise or after

10:00

Q5. Which organisations have the highest mean % bed occupancy?

  • summarise() using sum() for total_beds and total_occupancy.
  • Grouping in summarise() by organisations using .by =.
  • mutate() the new 2 column data frame to create a percentage using the totals (occ / beds)
  • Order to find highest by using arrange()
10:00

Solutions

beds_data |> 
  summarise(total_beds = sum(beds_av, na.rm = TRUE),
            total_occupancy = sum(occ_av, na.rm = TRUE),
            .by = org_name) |> 
  mutate(perc_occ = total_occupancy / total_beds) |> 
  arrange(desc(perc_occ))

Answer using the group_by() function
beds_data |> 
  group_by(org_name) |> 
  summarise(total_beds = sum(beds_av, na.rm = TRUE),
            total_occupancy = sum(occ_av, na.rm = TRUE)) |> 
  mutate(perc_occ = total_occupancy / total_beds) |> 
  arrange(desc(perc_occ))

End of session