R in 3 Months Week 2 (Data Wrangling and Analysis)

What is the most surprising thing about R so far?

Please put your answer in the chat!

Agenda

  1. Housekeeping

  2. Review of dplyr Functions

  3. Common Issues and Your Questions

  4. Weekly Coach Tips

  5. Next Week

Housekeeping

Project Assignments

  • If you submitted a project assignment, you should receive an email notification when Gracielle gives you feedback

  • Gracielle can answer questions relevant to the topic of the week, but can’t answer everything for you

  • Instead, she will share resources with you where applicable

Datasets

Co-Working Session

Cheatsheets

Review of Functions

select()

mutate()

filter()

summarize()

group_by() |> summarize()

arrange()
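
As a quick refresher, here is a minimal sketch of the six verbs above. The `penguins_small` data frame is a made-up miniature dataset used only for illustration; it is not the course data.

```r
library(tidyverse)

# Hypothetical miniature dataset, invented for illustration
penguins_small <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  island = c("Torgersen", "Biscoe", "Biscoe", "Biscoe"),
  body_mass_g = c(3750, 3800, 5000, 5400)
)

# select(): keep only some columns
penguins_small |>
  select(species, body_mass_g)

# mutate(): add a new column computed from an existing one
penguins_small |>
  mutate(body_mass_kg = body_mass_g / 1000)

# filter(): keep only the rows that meet a condition
penguins_small |>
  filter(island == "Biscoe")

# summarize(): collapse all rows into a single summary row
penguins_small |>
  summarize(mean_body_mass = mean(body_mass_g))

# group_by() |> summarize(): one summary row per group
mass_by_species <- penguins_small |>
  group_by(species) |>
  summarize(mean_body_mass = mean(body_mass_g))

# arrange(): sort rows (here, heaviest first)
penguins_small |>
  arrange(desc(body_mass_g))
```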

Common Issues and Your Questions

How to Import Excel Files
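
One common approach, sketched here on the assumption that you use the readxl package (installed alongside the tidyverse, but loaded separately). The example reads a sample workbook that ships with readxl, so no file of your own is needed; for your own data you would pass your file's path instead.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl
path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook
excel_sheets(path)

# Read one sheet by name into a tibble
mtcars_from_excel <- read_excel(path, sheet = "mtcars")
mtcars_from_excel
```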

Packages

Install packages once per computer

install.packages("tidyverse")


Load packages once per session

library(tidyverse)

Packages

Can you set up R to always load certain packages?

Projects vs Scripts

What is the point of creating both an R script file and an R project file? I understand that an R project is a higher-level structure and maybe it can pull from outputs of different R script files… please de-mystify.

Working directories

  • RStudio projects set your working directory to be the root of the project (i.e. where you find the .Rproj file)

  • Using a project, you only need to use relative file paths, not absolute file paths

  • This is easier to type and more reproducible
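
To make the difference between relative and absolute paths concrete, here is a small sketch. It imitates a project with a temporary folder; the folder name `my-project` is hypothetical, and `setwd()` appears only to mimic what opening an RStudio project does for you.

```r
# Stand-in for a project folder (hypothetical, created in a temporary location)
project_root <- file.path(tempdir(), "my-project")
dir.create(file.path(project_root, "data-raw"), recursive = TRUE, showWarnings = FALSE)
write.csv(
  data.frame(species = c("Adelie", "Gentoo")),
  file.path(project_root, "data-raw", "penguins.csv"),
  row.names = FALSE
)

# Opening an RStudio project sets the working directory for you;
# setwd() is used here only to imitate that
old_wd <- setwd(project_root)

# Relative path: short, and works on any computer that has the project folder
penguins_relative <- read.csv("data-raw/penguins.csv")

# Absolute path: spells out the full location, which differs between computers
absolute_path <- file.path(project_root, "data-raw", "penguins.csv")
penguins_absolute <- read.csv(absolute_path)

# Put the working directory back where it was
setwd(old_wd)
```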

Keyboard shortcut to run code

  • command + enter (Mac) or control + enter (Windows)

Difference between native pipe and tidyverse pipe

Why is there a new native pipe in R, and what is the reason we are using the new native pipe vs. the tidyverse pipe in this course? I currently use the tidyverse pipe in my day-to-day work, so am just curious about whether I should be thinking about switching which pipe I use in that context, too.
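
As a small illustration: the native pipe `|>` has been part of R itself since version 4.1.0 and needs no packages, while `%>%` comes from the magrittr package (loaded with the tidyverse). For simple pipelines like this one, the two give the same result.

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9)

# Native pipe: built into R 4.1.0 and later
native_result <- x |> sqrt() |> sum()

# Tidyverse (magrittr) pipe: requires the package to be loaded
magrittr_result <- x %>% sqrt() %>% sum()

identical(native_result, magrittr_result)
```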

What is the ideal maximum pipeline length?

Are there any widely accepted “best practices” around pipeline length? In the penguins example for arrange(), there were ~ 5 functions tied together in the pipeline. I can see the efficiency of this in that you only have to call penguins |> one time to implement all of those functions – however, I personally tend to feel more comfortable implementing code in smaller chunks so I can label exactly what each piece is doing, especially if my code will be shared with others.

What is the ideal maximum pipeline length?

If your pipes are longer than (say) ten steps, create intermediate objects with meaningful names. That makes debugging easier, because you can check the intermediate results, and it makes your code easier to understand, because the variable names help communicate intent.
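
A minimal sketch of splitting a pipeline into named steps. The data and the object names (`penguins_small`, `biscoe_complete`, `mass_by_sex`) are made up for illustration.

```r
library(tidyverse)

# Hypothetical miniature dataset, invented for illustration
penguins_small <- tibble(
  island = c("Biscoe", "Biscoe", "Biscoe", "Torgersen"),
  sex = c("female", "male", NA, "male"),
  body_mass_g = c(3800, 5400, 4000, 3750)
)

# Step 1: keep Biscoe penguins with complete data
biscoe_complete <- penguins_small |>
  filter(island == "Biscoe") |>
  drop_na(body_mass_g, sex)

# Step 2: summarize the cleaned data
mass_by_sex <- biscoe_complete |>
  group_by(sex) |>
  summarize(mean_body_mass = mean(body_mass_g))
```

Each intermediate object can be printed on its own, which makes it easier to see where something went wrong.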

Always save your R script files

Parentheses matter (a lot!)

Why is the “-” (minus) in the second solution below wrapped in its own parentheses when it isn’t in the first solution? That tripped me up.

penguins |>
  select(-species)
penguins |>
  select(-(bill_length_mm:body_mass_g))
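
The short version is operator precedence: R applies a leading minus before the colon. Plain numbers show the same behavior, and the sketch below repeats it with a made-up four-column data frame (`demo` is hypothetical).

```r
library(tidyverse)

# With numbers: the minus is applied to 1 first, then the range is built
minus_first <- -1:3      # -1, 0, 1, 2, 3
range_first <- -(1:3)    # -1, -2, -3

# With columns: a hypothetical data frame for illustration
demo <- tibble(a = 1, b = 2, c = 3, d = 4)

# One column: nothing to group, so no extra parentheses are needed
kept_single <- demo |>
  select(-a) |>
  names()

# A range: parentheses make the minus apply to the whole range b:c
kept_range <- demo |>
  select(-(b:c)) |>
  names()
```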

Parentheses matter (a lot!)

When I run the read_csv() code again to deal with the -999 values, it does not completely work.

Looking at the data using view(penguins_data), I still have some -999 values in case 4, as well as some -999.0 values in case 4.

Parentheses matter (a lot!)

I have adapted the code to read read_csv("penguins_data.csv", na = "-999, -999.0") to deal with the ones with the decimal point and the zero, but even after that there are still these issues, including in sex_v2

I can see that in sex there are some NA values now, which means the code has partially worked, I guess.
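
One likely cause: `na = "-999, -999.0"` is a single piece of text, so `read_csv()` looks for cells containing exactly `-999, -999.0`. To list several missing-value codes, each needs its own quotes inside `c()`. A minimal sketch, using made-up inline data in place of the real file:

```r
library(tidyverse)

# Hypothetical raw data with two spellings of the missing-value code
raw_csv <- "id,body_mass_g,sex
1,3750,male
2,-999,female
3,-999.0,-999
"

penguins_data <- read_csv(
  I(raw_csv),                          # I() tells read_csv this is literal data, not a file path
  na = c("", "NA", "-999", "-999.0"),  # one quoted string per missing-value code
  show_col_types = FALSE
)

penguins_data
```

Including `""` and `"NA"` keeps the default missing-value codes, which would otherwise be replaced by whatever you pass to `na`.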

select() issues

penguins |>
  select(-island:year)
penguins |>
  select(-1, island:year)

Does not remove the “species” variable but this does:

penguins |>
  select(island:year)

How R handles NA values

  • SPSS has named NA values

  • In R, a value is only NA if it shows up in red (in the console) or light gray (in Quarto)

How R handles NA values

You’ll learn later to use functions from the tidyr and dplyr packages to deal with missing values:

  • replace_na() (tidyr) will replace existing NA values with your chosen values

  • na_if() (dplyr) will replace values you specify with NA
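
A minimal sketch of both functions on a made-up vector (the values are invented for illustration):

```r
library(tidyverse)

# Hypothetical values: -999 is a missing-value code, NA is already missing
body_mass <- c(3750, -999, NA, 5400)

# na_if(): turn a value you specify into NA
with_na <- na_if(body_mass, -999)

# replace_na(): turn existing NA values into a value you choose
filled <- replace_na(with_na, 0)
```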

NA values

Is there some way to change the default behavior of summarize so that it ignores NAs without having to specify it specifically? I didn’t know if there was something like a global variable that you can set in the R script file, or something within the RStudio environment or installed package?
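
As far as we know, there is no global setting that changes this default. One workaround is to write a small helper of your own; `mean_no_na()` below is a hypothetical name, not a function from any package.

```r
library(tidyverse)

# Hypothetical helper: a mean that always ignores NA values
mean_no_na <- function(x) {
  mean(x, na.rm = TRUE)
}

# Made-up data for illustration
penguins_small <- tibble(body_mass_g = c(3750, NA, 5400))

result <- penguins_small |>
  summarize(mean_body_mass = mean_no_na(body_mass_g))
```

Typing `na.rm = TRUE` each time is more explicit, though, and makes it obvious to readers of your code that missing values were dropped.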

Quotes

Needed to refer to non-existent things

install.packages("tidyverse")

Not needed to refer to existing things

library(tidyverse)


penguins |>
  select(island)

Needed when you’re referring to text

penguins |>
  filter(island == "Torgersen")

Or the name of a file

penguins <- read_csv(file = "data-raw/penguins.csv")

Rounding

I went a step further by wanting to round my average and drop the decimals (I noticed some decimal places in my answer). I used mutate and round to change it, but is there an easier or simpler way to format it?

penguins |>
  filter(island == "Biscoe") |>
  drop_na(body_mass_g, sex) |>
  group_by(sex) |>
  summarize(mean_body_mass = mean(body_mass_g)) |>
  mutate(mean_body_mass = round(mean_body_mass, digits = 0))
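
One simpler option is to wrap `mean()` in `round()` inside `summarize()`, which removes the extra `mutate()` step. A sketch with made-up data (`penguins_small` is hypothetical):

```r
library(tidyverse)

# Hypothetical miniature dataset, invented for illustration
penguins_small <- tibble(
  sex = c("female", "female", "female", "male", "male", "male"),
  body_mass_g = c(3800, 3700, 3650, 5400, 5500, 5550)
)

rounded <- penguins_small |>
  group_by(sex) |>
  summarize(mean_body_mass = round(mean(body_mass_g)))
```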

How to See All Variables

penguins
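
Printing a data frame can hide columns that do not fit on screen. Two common ways to see every variable are sketched below on a made-up data frame (`penguins_small` is hypothetical); the same calls work on `penguins`.

```r
library(tidyverse)

# Hypothetical miniature dataset, invented for illustration
penguins_small <- tibble(
  species = c("Adelie", "Gentoo"),
  island = c("Torgersen", "Biscoe"),
  body_mass_g = c(3750, 5000)
)

# glimpse(): one line per variable, with its type and first few values
glimpse(penguins_small)

# names(): just the variable names
variable_names <- names(penguins_small)
```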

Typos happen to everone

get_acs(
  year = 2019,
  geography = "county",
  geometry = TRUE,
  state = "OR",
  variables = "B01003_001"
) |>
  clean_names() |>
  mutate(name = str_remove(name, " County")) |>
  rename(
    poulation = estimate,
    county = name
  ) |>
  select(county, population)

Weekly coach tips

  1. Do not use generative AI!

  2. Don’t be afraid to submit your assignments.

Next Week

  1. Course assignment: complete data viz lessons

  2. Project assignment: make three plots from your data