Cleaning SFO Weather Data
Helpful Data Wrangling Notes
- month.abb is a built-in object in R with 3-letter month abbreviations.
- You can create your own data frame with the tibble() function. Look up the documentation for this function by typing ?tibble::tibble in the Console.
- You can create regular sequences in R with :, e.g., 3:5 generates the sequence c(3, 4, 5).
- You can create regular sequences in R with seq(), e.g., seq(from = 3, to = 5, by = 1) generates the sequence c(3, 4, 5). Look up the documentation for this function by typing ?seq in the Console.
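The notes above can be tried together in one short sketch. The object name months_lookup and its column names are made up for illustration; only month.abb, tibble(), : and seq() come from the notes.

```r
library(tibble)

# Built-in character vector of 3-letter month abbreviations
month.abb[1:3]

# Two ways to build the same regular sequence
3:5                            # integer sequence 3, 4, 5
seq(from = 3, to = 5, by = 1)  # numeric sequence 3, 4, 5

# A small data frame pairing month numbers with abbreviations
# (months_lookup, month_num and month_abbr are hypothetical names)
months_lookup <- tibble(
  month_num  = 1:12,
  month_abbr = month.abb
)
months_lookup
```

Note that 3:5 produces integers while c(3, 4, 5) holds doubles, so the two are equal in value but not identical in type.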
Practicing Keyboard Shortcuts
Try out the following as you work on this exercise:
- Tab completion (Try this out when writing your file paths! Typing out a partial path will pull up a mini file-explorer)
- Insert a code chunk
- Run a code chunk
- Navigating around words and lines (selecting and deleting them)
- Run selected lines (not a whole code chunk)
- Insert the assignment operator (<-)
- Insert the pipe operator (|>)
Exercise
Carry out the following steps to clean and save the San Francisco weather data. Make sure to download the data file and add it to your portfolio repository as instructed.
- Read in the weather data in this file with the correct relative file path after you move it to the instructed location.
- There is a variable that has values that don’t make sense in the data context. Figure out which variable this is and clean it up by making those values missing using na_if().
- Create a variable called dateInYear that indicates the day of the year (1-365) for each case (Jan 1 should be 1, and Dec 31 should be 365).
- Create a variable called month_name that shows the 3-letter month abbreviation for each case.
- Save the wrangled data to the data/processed/ folder using write_csv(). Name this file weather_clean.csv. Look up the documentation for this function by typing ?write_csv in the Console. You’ll need to write an appropriate relative path.
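As a rough illustration of the na_if() and write_csv() steps, here is a minimal sketch on toy data. The column reading, the placeholder value -99, and the temporary output path are all assumptions made for the example; they are not taken from the weather file, and the exercise expects a relative path into data/processed/ instead.

```r
library(dplyr)
library(readr)

# Toy data: `reading` and the sentinel value -99 are invented for this sketch
toy <- tibble(
  id      = 1:4,
  reading = c(12, -99, 15, -99)
)

# Turn the placeholder value into a proper missing value
toy_clean <- toy |>
  mutate(reading = na_if(reading, -99))

# Write to a temporary folder so the sketch runs anywhere;
# in the exercise this would be a relative path ending in weather_clean.csv
out_path <- file.path(tempdir(), "toy_clean.csv")
write_csv(toy_clean, out_path)
```

na_if() leaves every other value untouched, so it is a safe way to recode a single nonsensical value without rewriting the whole column.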
# A tibble: 365 × 18
   Month   Day   Low  High NormalLow NormalHigh RecordLow LowYr RecordHigh
   <dbl> <dbl> <dbl> <dbl>     <dbl>      <dbl>     <dbl> <dbl>      <dbl>
 1    11    20    48    55        48         62        35  1964         69
 2     6    16    52    68        53         70        46  1952         90
 3     5     9    47    63        50         66        41  1950         88
 4    10    26    47    69        52         69        39  1954         89
 5     9    27    55    82        55         73        47  1955         96
 6     7     6    52    70        54         71        47  1953         86
 7    11     3    48    60        51         66        40  1971         84
 8     3    26    47    58        47         62        38  1980         79
 9    10     4    57    66        55         72        47  1989         95
10    11    26    49    59        47         60        36  1952         76
# ℹ 355 more rows
# ℹ 9 more variables: HiYear <dbl>, Precip <dbl>, RecordPrecip <dbl>,
#   PrecipYr <dbl>, date <chr>, Record <lgl>, RecordText <chr>, RecordP <lgl>,
#   CulmPrec <dbl>
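The printed rows are not in calendar order, so a plain row count would not give the day of the year. One possible approach, sketched below on a few Month/Day pairs, is to add Day to the number of days that come before each month. The names toy_weather and days_before_month are hypothetical, and the sketch assumes a non-leap year, which matches the 365 rows shown.

```r
library(dplyr)

# Three Month/Day pairs copied from the printed rows, plus Jan 1 and Dec 31
toy_weather <- tibble(
  Month = c(11, 6, 5, 1, 12),
  Day   = c(20, 16, 9, 1, 31)
)

# Days elapsed before each month starts, assuming a non-leap year
days_before_month <- c(0, cumsum(c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30)))

toy_weather <- toy_weather |>
  mutate(
    dateInYear = days_before_month[Month] + Day,  # Jan 1 -> 1, Dec 31 -> 365
    month_name = month.abb[Month]                 # 11 -> "Nov", 6 -> "Jun"
  )
toy_weather
```

Sorting by Month and Day and then numbering the rows with a sequence would be another route, provided every day appears exactly once.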
