Session - joins
It’s rare to find all the data you need for an analysis in a single table.
Typically, you’ll have to link two (or more) tables together by matching on common “key” variable(s).
We use joins in SQL or R, (V, H or X) LOOKUPs in Excel
All the joins in SQL are available in R called mutating joins and a couple more called filtering joins which have the same concept as EXISTS used in the WHERE clause in SQL.
We’re going to join two tables one with cases of tuberculosis by country, one with population by country.
Import the files:
tb_cases.csvtb_pop.csvtb_new_table.csv
05:00
When columns have the same name R resolves this conflict by adding .x and .y to the end of the column name
# A tibble: 64 × 5
   country     year.x  cases year.y        pop
   <chr>        <dbl>  <dbl>  <dbl>      <dbl>
 1 Afghanistan   1999    745   1999   19987071
 2 Afghanistan   1999    745   2000   20595360
 3 Afghanistan   1999    745   2001   21347782
 4 Afghanistan   1999    745   2002   22202806
 5 Brazil        1999  37737   1999  172006362
 6 Brazil        1999  37737   2000  174504898
 7 Brazil        1999  37737   2001  176968205
 8 Brazil        1999  37737   2002  179393768
 9 China         1999 212258   1999 1272915272
10 China         1999 212258   2000 1280428583
# ℹ 54 more rowsExplicitly joining on more than one column requires the use of a vector
and as of 2023 now:
{dplyr} joins can occur automatically
# A tibble: 16 × 4
   country      year  cases        pop
   <chr>       <dbl>  <dbl>      <dbl>
 1 Afghanistan  1999    745   19987071
 2 Brazil       1999  37737  172006362
 3 China        1999 212258 1272915272
 4 Denmark      1999    170    5319410
 5 Afghanistan  2000   2666   20595360
 6 Brazil       2000  80488  174504898
 7 China        2000 213766 1280428583
 8 Denmark      2000    171    5338283
 9 Afghanistan  2001   4639   21347782
10 Brazil       2001  37491  176968205
11 China        2001 212766 1287890449
12 Denmark      2001    124    5354684
13 Afghanistan  2002   6509   22202806
14 Brazil       2002  40723  179393768
15 China        2002 194972 1295322020
16 Denmark      2002    135    5368994If two tables have different names for same variable:
as of 2023
Semi-joins retain the data on the left if it matches the right, but drops the right (so is not like an inner join that keeps both sides).
All rows from x where there are matching values in y, keeping just columns from x.
Finding hospital patients who have had a Covid-19 test but only bring back the information on the hospital, nothing about the test.
Join to the tb_new_table and bring back only those records where the column FirstLetter is ‘A’ but drop the data from tb_new_table.
In text mining, exclude ‘stop words’ (but, and, or) which are kept in a separate table for reference.
Join to the tb_new_table and exclude only those records where the column FirstLetter is ‘A’ .
Thanks for the animated join gifs to Garrick Aden-Buie in his blog on tidyexplain.
A static representation of the joins can be found in the data wrangling cheatsheet shared by Anya Ferguson in the Intermediate to R NHS-R course.