Introduction to R and Rstudio

Session - {janitor} clean data

Zoë Turner

Specific packages to clean data

Packages like {janitor} have functions to do a lot of the cleaning required for data like:

  • Remove blank rows and columns
  • Change Excel serial dates to read dates
  • Standardise column names and remove spaces

Example of cleaning column headers

Getting the data from following slides

by_ethnicity <- read_csv("https://www.ethnicity-facts-figures.service.gov.uk/culture-and-community/digital/internet-use/latest/downloads/by-ethnicity.csv")

Changing the column names

Removes spaces and changes the % to a word

library(janitor)

by_ethnicity |> 
  clean_names()

Example of removing blank rows and columns

# Add in blank row and column

by_ethnicity_blank <- by_ethnicity |> 
  mutate(blank_column = NA) |> # Blank column
  add_row() # Blank row

by_ethnicity_blank |> 
  remove_empty(which = c("rows", "cols"))

Getting duplicates

Often code removes duplicates but sometimes you’ll want to see all the duplicated information:

duplicates <- tibble::tribble(
     ~Ethnicity, ~`%`, ~`estimated.number.(thousands)`,
          "All", 90.8,                           48098,
          "All", 90.8,                           48098,
          "All", 90.8,                           48098,
  "Bangladeshi", 91.9,                             354,
      "Chinese", 98.6,                             265,
       "Indian", 90.4,                            1077,
    "Pakistani", 91.1,                             767,
  "Asian other", 95.6,                             620,
        "Black", 92.8,                            1376,
        "Mixed",   96,                             547,
        "White", 90.5,                           42296,
        "Other", 94.5,                             796
  )
library(janitor)
duplicates |> 
  get_dupes()
# A tibble: 3 × 4
  Ethnicity   `%` `estimated.number.(thousands)` dupe_count
  <chr>     <dbl>                          <dbl>      <int>
1 All        90.8                          48098          3
2 All        90.8                          48098          3
3 All        90.8                          48098          3

End session