R in 3 Months: Week 6 (Tidy Data)

Agenda

  1. Housekeeping

  2. Tidy Data

  3. Next Week

Housekeeping

Cheatsheets

Discord

What Does the Rest of R in 3 Months Look Like?

  • Week 7: Advanced Data Wrangling, Part 2 (More Functions for Wrangling Data and Writing Your Own Functions)

  • Week 8: Advanced Data Wrangling, Part 3 (Data Merging and Exporting Data)

  • Week 9: Advanced Data Viz, Part 1 (Highlighting and Decluttering)

  • Week 10: Catch-Up Week

  • Week 11: Advanced Data Viz, Part 2 (Explaining and Making Your Viz Sparkle)

  • Week 12: Advanced Quarto

  • Week 13: Wrap Up

Access to Materials

  • You have access to course materials and coach feedback FOREVER

Tidy Data

What Questions Do You Have About Tidy Data?

Tidy Data Rule #1: Every Column is a Variable

life_expectancy_over_time
# A tibble: 10 × 13
   country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan…   28.8   30.3   32.0   34.0   36.1   38.4   39.9   40.8   41.7   41.8
 2 Albania   55.2   59.3   64.8   66.2   67.7   68.9   70.4   72     71.6   73.0
 3 Algeria   43.1   45.7   48.3   51.4   54.5   58.0   61.4   65.8   67.7   69.2
 4 Angola    30.0   32.0   34     36.0   37.9   39.5   39.9   39.9   40.6   41.0
 5 Argent…   62.5   64.4   65.1   65.6   67.1   68.5   69.9   70.8   71.9   73.3
 6 Austra…   69.1   70.3   70.9   71.1   71.9   73.5   74.7   76.3   77.6   78.8
 7 Austria   66.8   67.5   69.5   70.1   70.6   72.2   73.2   74.9   76.0   77.5
 8 Bahrain   50.9   53.8   56.9   59.9   63.3   65.6   69.1   70.8   72.6   73.9
 9 Bangla…   37.5   39.3   41.2   43.5   45.3   46.9   50.0   52.8   56.0   59.4
10 Belgium   68     69.2   70.2   70.9   71.4   72.8   73.9   75.4   76.5   77.5
# ℹ 2 more variables: `2002` <dbl>, `2007` <dbl>
life_expectancy_over_time |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "life_expectancy"
  )
# A tibble: 120 × 3
   country     year  life_expectancy
   <chr>       <chr>           <dbl>
 1 Afghanistan 1952             28.8
 2 Afghanistan 1957             30.3
 3 Afghanistan 1962             32.0
 4 Afghanistan 1967             34.0
 5 Afghanistan 1972             36.1
 6 Afghanistan 1977             38.4
 7 Afghanistan 1982             39.9
 8 Afghanistan 1987             40.8
 9 Afghanistan 1992             41.7
10 Afghanistan 1997             41.8
# ℹ 110 more rows
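
Note that year comes out as <chr> above because it is built from column names. A minimal sketch of converting it to a number during the pivot, using the names_transform argument of pivot_longer(). The small life_expectancy_small tibble is a hypothetical stand-in that copies only two countries and two years from the output above:

```r
library(tidyr)
library(tibble)

# Hypothetical two-country, two-year stand-in for life_expectancy_over_time
life_expectancy_small <- tribble(
  ~country,      ~`1952`, ~`1957`,
  "Afghanistan", 28.8,    30.3,
  "Albania",     55.2,    59.3
)

life_expectancy_long <- life_expectancy_small |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "life_expectancy",
    # Column names are text, so convert the new year column to integer
    names_transform = list(year = as.integer)
  )

life_expectancy_long
```

With year stored as a number, it can go straight onto a continuous axis in a plot or into a filter such as year >= 1957.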

Tidy Data Rule #3: Every Cell is a Single Value

my_addresses
# A tibble: 3 × 2
  time_period address                                   
  <chr>       <chr>                                     
1 Childhood   537 Westdale Avenue, Swarthmore, PA, 19801
2 Childhood   690 Omar Circle, Yellow Springs, OH, 45387
3 Grad School 3809 Meade Avenue, San Diego, CA, 92116   
my_addresses |>
  separate_wider_delim(
    cols = address,
    delim = ", ",
    names = c("street", "city", "state", "zip_code")
  )
# A tibble: 3 × 5
  time_period street              city           state zip_code
  <chr>       <chr>               <chr>          <chr> <chr>   
1 Childhood   537 Westdale Avenue Swarthmore     PA    19801   
2 Childhood   690 Omar Circle     Yellow Springs OH    45387   
3 Grad School 3809 Meade Avenue   San Diego      CA    92116   

Tidy Data Rule #2: Every Row is an Observation

favorite_sports
# A tibble: 4 × 2
  name   favorite_sport              
  <chr>  <chr>                       
1 David  Soccer, Basketball          
2 Elias  Baseball, Soccer, Skiing    
3 Leila  Aerial Dance, Roller Skating
4 Rachel Soccer, Baseball            
favorite_sports |>
  separate_longer_delim(
    cols = favorite_sport,
    delim = ", "
  )
# A tibble: 9 × 2
  name   favorite_sport
  <chr>  <chr>         
1 David  Soccer        
2 David  Basketball    
3 Elias  Baseball      
4 Elias  Soccer        
5 Elias  Skiing        
6 Leila  Aerial Dance  
7 Leila  Roller Skating
8 Rachel Soccer        
9 Rachel Baseball      
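
Once every row is a single person-sport observation, counting becomes a one-liner. A short sketch that rebuilds favorite_sports from the output above and tallies how many people picked each sport (sport_counts is a name introduced here for illustration):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rebuild the favorite_sports tibble shown above
favorite_sports <- tribble(
  ~name,    ~favorite_sport,
  "David",  "Soccer, Basketball",
  "Elias",  "Baseball, Soccer, Skiing",
  "Leila",  "Aerial Dance, Roller Skating",
  "Rachel", "Soccer, Baseball"
)

sport_counts <- favorite_sports |>
  # One row per person-sport pair
  separate_longer_delim(
    cols = favorite_sport,
    delim = ", "
  ) |>
  # Number of people who listed each sport, most popular first
  count(favorite_sport, sort = TRUE)

sport_counts
```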

Tidy Data Question

When working with “select all that apply” variables in the past in datasets where one row = one individual, I’ve typically dealt with this by converting each response option to its own column where a 1 (“yes”) is present if that response option was selected. This has worked well for my purposes, but I understand now that it violates Tidy Data Rule 1 because a single variable is spread across multiple columns. Am I understanding correctly that while the approach I’ve used in the past is not inherently better or worse than tidy format, the advantage to making it tidy is that it will be easier to analyze using the tidyverse? Is this still true if the unit of analysis I’m interested in is the individual and not the activity (for example)?
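
One way to explore this question in code: a rough sketch, assuming a made-up sports_wide tibble with one row per respondent and one 0/1 column per response option. All names and values here are hypothetical. It shows that the two layouts can be converted back and forth, and that an individual-level summary is still possible from the tidy version:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical "select all that apply" data:
# one row per respondent, one 0/1 column per response option
sports_wide <- tribble(
  ~respondent, ~soccer, ~baseball, ~skiing,
  "A",         1,       0,         0,
  "B",         1,       1,         1,
  "C",         1,       1,         0
)

# Tidy version: one row per respondent-option pair that was selected
sports_long <- sports_wide |>
  pivot_longer(
    cols = -respondent,
    names_to = "sport",
    values_to = "selected"
  ) |>
  filter(selected == 1) |>
  select(-selected)

# Individual-level summary computed from the tidy version
# (respondents who selected nothing would not appear here)
sports_per_person <- sports_long |>
  group_by(respondent) |>
  summarize(n_sports = n())

# Back to one column per option when that layout is more convenient
sports_wide_again <- sports_long |>
  mutate(selected = 1) |>
  pivot_wider(
    names_from = sport,
    values_from = selected,
    values_fill = 0
  )
```

Because pivot_longer() and pivot_wider() undo each other, the choice of layout can follow the task at hand, whether that is grouping and plotting or modeling.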

Tidy Data Live Coding

David

Gracielle

Next Week

  1. Lessons on additional data wrangling functions and on making your own functions

  2. No project assignment (but there will be one in week 8)