Session - joins
It’s rare to find all the data you need for an analysis in a single table.
Typically, you’ll have to link two (or more) tables together by matching on common “key” variable(s).
We use joins in SQL or R, (V, H or X) LOOKUPs in Excel
All the joins in SQL are available in R called mutating joins
and a couple more called filtering joins
which have the same concept as EXISTS
used in the WHERE
clause in SQL.
We’re going to join two tables one with cases of tuberculosis by country, one with population by country.
Import the files:
tb_cases.csv
tb_pop.csv
tb_new_table.csv
05:00
When columns have the same name R resolves this conflict by adding .x
and .y
to the end of the column name
# A tibble: 64 × 5
country year.x cases year.y pop
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 1999 19987071
2 Afghanistan 1999 745 2000 20595360
3 Afghanistan 1999 745 2001 21347782
4 Afghanistan 1999 745 2002 22202806
5 Brazil 1999 37737 1999 172006362
6 Brazil 1999 37737 2000 174504898
7 Brazil 1999 37737 2001 176968205
8 Brazil 1999 37737 2002 179393768
9 China 1999 212258 1999 1272915272
10 China 1999 212258 2000 1280428583
# ℹ 54 more rows
Explicitly joining on more than one column requires the use of a vector
and as of 2023 now:
{dplyr} joins can occur automatically
# A tibble: 16 × 4
country year cases pop
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Brazil 1999 37737 172006362
3 China 1999 212258 1272915272
4 Denmark 1999 170 5319410
5 Afghanistan 2000 2666 20595360
6 Brazil 2000 80488 174504898
7 China 2000 213766 1280428583
8 Denmark 2000 171 5338283
9 Afghanistan 2001 4639 21347782
10 Brazil 2001 37491 176968205
11 China 2001 212766 1287890449
12 Denmark 2001 124 5354684
13 Afghanistan 2002 6509 22202806
14 Brazil 2002 40723 179393768
15 China 2002 194972 1295322020
16 Denmark 2002 135 5368994
If two tables have different names for same variable:
as of 2023
Semi-joins retain the data on the left if it matches the right, but drops the right (so is not like an inner join that keeps both sides).
All rows from x
where there are matching values in y
, keeping just columns from x
.
Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test.
Join to the tb_new_table
and bring back only those records where the column FirstLetter
is ‘A’ but drop the data from tb_new_table.
In text mining, exclude ‘stop words’ (but, and, or) which are kept in a separate table for reference.
Join to the tb_new_table
and exclude only those records where the column FirstLetter
is ‘A’ .
Thanks for the animated join gifs to Garrick Aden-Buie in his blog on tidyexplain.
A static representation of the joins can be found in the data wrangling cheatsheet shared by Anya Ferguson in the Intermediate to R NHS-R course.