Introduction to R and Rstudio

Session - joins

Zoë Turner

Relational data

It’s rare to find all the data you need for an analysis in a single table.

Typically, you’ll have to link two (or more) tables together by matching on common “key” variable(s).

We use joins in SQL or R, (V, H or X) LOOKUPs in Excel

Joins

All the joins in SQL are available in R called mutating joins and a couple more called filtering joins which have the same concept as EXISTS used in the WHERE clause in SQL.

left_join()

Keep structure of table x

…and match to observations in table y

“key” variable (common to both tables)

x |> 
  left_join(y, by = "id")

Moving gif that shows two tables and their merge with the data on the left 1, 2 and 3 being added to from the right where it matches

Import some files!

We’re going to join two tables one with cases of tuberculosis by country, one with population by country.

Import the files:

tb_cases.csv
tb_pop.csv
tb_new_table.csv

05:00

Join 2 tables together

When columns have the same name R resolves this conflict by adding .x and .y to the end of the column name

tb_cases |> 
  left_join(tb_pop, by = "country")
# A tibble: 64 × 5
   country     year.x  cases year.y        pop
   <chr>        <dbl>  <dbl>  <dbl>      <dbl>
 1 Afghanistan   1999    745   1999   19987071
 2 Afghanistan   1999    745   2000   20595360
 3 Afghanistan   1999    745   2001   21347782
 4 Afghanistan   1999    745   2002   22202806
 5 Brazil        1999  37737   1999  172006362
 6 Brazil        1999  37737   2000  174504898
 7 Brazil        1999  37737   2001  176968205
 8 Brazil        1999  37737   2002  179393768
 9 China         1999 212258   1999 1272915272
10 China         1999 212258   2000 1280428583
# ℹ 54 more rows

Join on multiple columns

Explicitly joining on more than one column requires the use of a vector

tb_cases |> 
  left_join(tb_pop, by = c("country", "year"))

and as of 2023 now:

tb_cases |> 
  left_join(tb_pop, join_by(country, year))

Default combine

{dplyr} joins can occur automatically

tb_cases |> 
  left_join(tb_pop)
# A tibble: 16 × 4
   country      year  cases        pop
   <chr>       <dbl>  <dbl>      <dbl>
 1 Afghanistan  1999    745   19987071
 2 Brazil       1999  37737  172006362
 3 China        1999 212258 1272915272
 4 Denmark      1999    170    5319410
 5 Afghanistan  2000   2666   20595360
 6 Brazil       2000  80488  174504898
 7 China        2000 213766 1280428583
 8 Denmark      2000    171    5338283
 9 Afghanistan  2001   4639   21347782
10 Brazil       2001  37491  176968205
11 China        2001 212766 1287890449
12 Denmark      2001    124    5354684
13 Afghanistan  2002   6509   22202806
14 Brazil       2002  40723  179393768
15 China        2002 194972 1295322020
16 Denmark      2002    135    5368994

Joining with different names

If two tables have different names for same variable:

tb_cases |> 
  left_join(tb_new_table, by = c("country" = "Place",
                                 "year" = "Year"))

as of 2023

tb_cases |> 
  left_join(tb_new_table, by = join_by(country == Place,
                                       year == Year))

semi_join()

Semi-joins retain the data on the left if it matches the right, but drops the right (so is not like an inner join that keeps both sides).

All rows from x where there are matching values in y, keeping just columns from x.

Moving gif showing the data from the left table being kept if it matches the right but everything is dropped from the right

Moving gif showing the data from the left table being kept if it matches the right but everything is dropped from the right

Use Case - hospital Covid tests

Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test.

Join to the tb_new_table and bring back only those records where the column FirstLetter is ‘A’ but drop the data from tb_new_table.

# Create a new data frame to join to
lookup_table <- tb_new_table |> 
  filter(FirstLetter == "A")

tb_cases |> 
  semi_join(lookup_table, by = join_by(country == Place,
                                       year == Year))

anti_join()

All rows from x where there are not matching values in y, keeping just columns from x.

Moving gif that only retrains from the left table what doesn't match on the right

Moving gif that only retrains from the left table what doesn’t match on the right

Use Case - text mining stop words

In text mining, exclude ‘stop words’ (but, and, or) which are kept in a separate table for reference.

Join to the tb_new_table and exclude only those records where the column FirstLetter is ‘A’ .

tb_cases |> 
  anti_join(lookup_table, by = join_by(country == Place,
                                       year == Year))

End session