Please put your answer in the chat!
Housekeeping
Review of dplyr
Functions
Common Issues and Your Questions
Weekly Coach Tips
Next Week
If you submitted a project assignment, you should receive an email notification when Gracielle gives you feedback
Gracielle can answer questions relevant to the topic of the week, but can’t answer everything for you
Instead, she will share resources with you where applicable
Check in with Gracielle about a good dataset to use!
select()
mutate()
filter()
summarize()
group_by() |> summarize()
arrange()
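A minimal sketch of how the verbs above fit together in one pipeline. The `mini_penguins` data is made up for illustration and is not the real penguins dataset.

```r
library(dplyr)

# Made-up miniature of the penguins data, for illustration only
mini_penguins <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, 3900, 5000, 5200),
  flipper_length_mm = c(181, 190, 215, 220)
)

species_summary <- mini_penguins |>
  select(species, body_mass_g) |>               # keep only these columns
  mutate(body_mass_kg = body_mass_g / 1000) |>  # add a new column
  filter(body_mass_kg > 3.75) |>                # keep rows that meet a condition
  group_by(species) |>                          # one group per species
  summarize(mean_mass_kg = mean(body_mass_kg)) |>  # one row per group
  arrange(desc(mean_mass_kg))                   # sort, heaviest first

species_summary
```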
Install packages once per computer
Load packages once per session
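A short sketch of the difference, using dplyr as the example package. The install line is commented out so that re-running the script does not reinstall the package every time.

```r
# Once per computer (run this by hand in the console, not in every script):
# install.packages("dplyr")

# Once per session (put this at the top of your script):
library(dplyr)
```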
Can you set up R to always load certain packages?
What is the point of creating both an R script file and an R project file? I understand that an R project is a higher level structure and maybe it can pull from outputs of different R script files… please de-mystify.
RStudio projects set your working directory to be the root of the project (i.e., where you find the .Rproj file)
Using a project, you only need to use relative file paths, not absolute file paths
This is easier to type and more reproducible
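A rough sketch of the idea, using a temporary folder as a stand-in for a project root. The folder and file names are hypothetical, and `setwd()` appears here only to imitate what opening the .Rproj file does for you. Base R's `read.csv()` is used so the example needs no packages.

```r
# Pretend project: a root folder with a data/ subfolder (hypothetical names)
project_root <- tempfile("my_project_")
dir.create(file.path(project_root, "data"), recursive = TRUE)
write.csv(data.frame(x = 1:3),
          file.path(project_root, "data", "example.csv"),
          row.names = FALSE)

# Opening the project in RStudio sets the working directory like this
old_wd <- setwd(project_root)

# Relative path: short, and works on anyone's computer
relative_result <- read.csv("data/example.csv")

# Absolute path: long, and only works on this computer
absolute_result <- read.csv(file.path(project_root, "data", "example.csv"))

setwd(old_wd)  # restore the previous working directory

identical(relative_result, absolute_result)
```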
Why is there a new native pipe in R, and what is the reason we are using the new native pipe vs. the tidyverse pipe in this course? I currently use the tidyverse pipe in my day-to-day work, so am just curious about whether I should be thinking about switching which pipe I use in that context, too.
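For simple pipelines like the ones in this course, the two pipes behave the same. The native pipe `|>` has been part of base R since version 4.1.0, so it works without loading any package. The tidyverse pipe `%>%` becomes available when you load a package such as dplyr. A small sketch:

```r
library(dplyr)  # makes the tidyverse pipe %>% available

# Native pipe: built into R 4.1.0 and later
native_result <- c(1, 4, 9) |> sqrt() |> sum()

# Tidyverse pipe: same steps, same answer
tidyverse_result <- c(1, 4, 9) %>% sqrt() %>% sum()

identical(native_result, tidyverse_result)
```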
Are there any widely accepted “best practices” around pipeline length? In the penguins example for arrange(), there were ~ 5 functions tied together in the pipeline. I can see the efficiency of this in that you only have to call penguins |> one time to implement all of those functions – however, I personally tend to feel more comfortable implementing code in smaller chunks so I can label exactly what each piece is doing, especially if my code will be shared with others.
Your pipes are longer than (say) ten steps. In that case, create intermediate objects with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
Source: R for Data Science, 1st Edition
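A sketch of the intermediate-object style described in the quote, using made-up data. The object names `heavy_penguins` and `mass_by_species` are only examples of names that say what each step produced.

```r
library(dplyr)

# Made-up miniature of the penguins data, for illustration only
mini_penguins <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, 3900, 5000, 5200)
)

# Step 1: keep the heavier penguins, and give the result a meaningful name
heavy_penguins <- mini_penguins |>
  filter(body_mass_g > 3750)

# Step 2: summarize the named intermediate result
mass_by_species <- heavy_penguins |>
  group_by(species) |>
  summarize(mean_mass_g = mean(body_mass_g))

# Each intermediate object can be inspected on its own while debugging
heavy_penguins
mass_by_species
```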
What’s the logic for the “-” (minus) in the second solution below being in its own parenthesis but it’s not in the first solution? That tripped me up.
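The two solutions are not reproduced here, so this is only a guess at the underlying cause: in R, a minus sign in front of a range applies to the first number before the range is built, unless parentheses say otherwise. A base R sketch of that rule:

```r
# Without parentheses: the minus applies to 1 first, then the range is built
without_parens <- -1:3     # -1, 0, 1, 2, 3

# With parentheses: the range 1:3 is built first, then everything is negated
with_parens <- -(1:3)      # -1, -2, -3

without_parens
with_parens
```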
When I run the read_csv() code again to deal with the -999 values, it does not completely work.
Looking at the data using view(penguins_data), I still have some -999 values in case 4, as well as some -999.0 values in case 4.
I have adapted the code to read read_csv("penguins_data.csv", na = "-999, -999.0") to deal with the ones with the decimal point and the zero, but even after that there are still these issues, including in sex_v2.
I can see that in sex there are some NA values now, which means the code has partially worked, I guess.
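A likely cause: `na = "-999, -999.0"` is a single string, so readr looks for a cell containing exactly that text. Passing a character vector with `c()` gives one missing-value code per element. The sketch below uses a small made-up file as a stand-in for penguins_data.csv. The codes `""` and `"NA"` are listed explicitly because setting `na` replaces readr's defaults.

```r
library(readr)

# Made-up stand-in for penguins_data.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,body_mass_g,sex",
    "1,3750,male",
    "2,-999,female",
    "3,-999.0,-999"),
  csv_path
)

# One string: no cell contains the literal text "-999, -999.0"
one_string <- read_csv(csv_path, na = "-999, -999.0",
                       show_col_types = FALSE)

# A character vector: each element is its own missing-value code
two_strings <- read_csv(csv_path, na = c("", "NA", "-999", "-999.0"),
                        show_col_types = FALSE)

sum(is.na(one_string$body_mass_g))   # count of NA values with the single string
sum(is.na(two_strings$body_mass_g))  # count of NA values with the vector
```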
select() issues
Does not remove the “species” variable but this does:
SPSS has named NA values
In R, a value is only NA if it shows up in red (in the console) or light gray (in Quarto)
You’ll learn later to use functions from the tidyr and dplyr packages to deal with missing values:
replace_na() (from tidyr) will replace existing NA values with your chosen values
na_if() (from dplyr) will replace values you specify with NA
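A small sketch of both functions on a made-up vector, assuming -999 is the code used for missing values:

```r
library(dplyr)  # na_if()
library(tidyr)  # replace_na()

# Made-up values: -999 is a missing-value code, NA is already missing
body_mass <- c(3750, -999, NA, 4200)

# Turn the -999 code into a real NA
with_na <- na_if(body_mass, -999)   # 3750, NA, NA, 4200

# Go the other way: replace every NA with a chosen value
filled <- replace_na(with_na, 0)    # 3750, 0, 0, 4200
```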
Is there some way to change the default behavior of summarize so that it ignores NAs without having to specify it specifically? I didn’t know if there was something like a global variable that you can set in the R script file, or something within the RStudio environment or installed package?
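There does not appear to be a global option that changes the `na.rm` default of `mean()` and similar functions. One possible workaround is a small wrapper function of your own. The name `mean_na_rm` below is hypothetical, and the data is made up.

```r
library(dplyr)

# Hypothetical helper: a mean that always drops NA values
mean_na_rm <- function(x) mean(x, na.rm = TRUE)

# Made-up data with one missing value
mini_penguins <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  body_mass_g = c(3700, NA, 5000, 5200)
)

mass_summary <- mini_penguins |>
  group_by(species) |>
  summarize(mean_mass_g = mean_na_rm(body_mass_g))

mass_summary
```

Writing `na.rm = TRUE` explicitly each time is still the more common style, because it makes the handling of missing values visible to whoever reads the code.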
I went a step further by wanting to round my average and drop the decimals (I noticed some decimal places in my answer). I used mutate and round to change it, but is there any easier or simpler way to format it?
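One simpler option is to wrap `round()` around the calculation inside `summarize()`, so no separate `mutate()` step is needed. A sketch with made-up values:

```r
library(dplyr)

# Made-up body masses, including one missing value
mini_penguins <- tibble(body_mass_g = c(3750, 3800, 3251, NA))

rounded_summary <- mini_penguins |>
  summarize(mean_mass_g = round(mean(body_mass_g, na.rm = TRUE)))

rounded_summary
```

`round()` also takes a `digits` argument, for example `round(x, digits = 1)`, if you want to keep one decimal place.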
Do not use generative AI!
Don’t be afraid to submit your assignments.
Course assignment: complete data viz lessons
Project assignment: make three plots from your data