---
title: "First steps in R"
pagetitle: "First steps in R"
output:
  html_document:
    code_folding: show # allows toggling of showing and hiding code. Remove if not using code.
    code_download: true # allows the user to download the source .Rmd file. Remove if not using code.
    includes:
      after_body: footer.html # include a custom footer.
    toc: true
    toc_depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

## First Steps in R

**In order to follow along with this session, you'll need to have downloaded some of the files in the OSF repo for this series.  If you haven't done so, go to this page and select the "Download this folder" button on the top right of the screen:** [download OSF files](https://osf.io/rx5a3/files/osfstorage)

If you have any issues, please ask your instructor, or refer to **"Block 1: Introduction to OSF"**.


## Introduction

:::intro
Now that we've familiarized ourselves a bit with R and RStudio, we can start moving forward to working with data in R, which is where the real fun begins!
:::

## R Projects 

### Setting working directory

When we talked about relative file paths in the previous session, the discussion revolved around the idea of *where you are in the file system*.  In a coding environment, this translates to **setting a working directory**, which tells R (or another programming language) where in your computer's file system you want to be while you work.

A common way to set a working directory is to use the `setwd()` function. This function manually sets the working directory during an R session. It tells R where to look for files and where to save outputs just for that session.

This can be done in 2 ways:

* Manually in R, by writing `setwd("path-to-directory/secondary-part/etc...")`
* Setting it by selecting the `Session` tab in the toolbar, and selecting `Set Working Directory`:

![](images/block4-1_set-wd.gif)


But here’s the problem...

Using `setwd()` can break your code when:

 - Someone else tries to run it on their machine.
 - You move your project folder.
 - You're running your code on a server or in a cloud environment like Posit Cloud (formerly RStudio Cloud).

Because the file path is hardcoded and specific to your machine, **the code is not reproducible**.
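
To see why hardcoded paths are fragile, it can help to check where R currently is. The sketch below is only an illustration: it uses the base R functions `getwd()`, `file.path()` and `file.exists()` together with the relative path to the data file we read later in this session. The object names `rel_path` and `full_path` are made up for this example, and what `file.exists()` returns depends on your own machine and working directory.

```{r, eval=FALSE}
# Print the current working directory (the folder R is "in" right now)
getwd()

# A relative path is interpreted starting from the working directory
rel_path <- file.path("data", "block-4_first-steps.csv")

# The full location R would actually look at on this machine
full_path <- file.path(getwd(), rel_path)
full_path

# TRUE only if the file exists at that location on *your* machine
file.exists(rel_path)
```

The relative path stays the same on every computer; only the part returned by `getwd()` changes. That is the part a hardcoded `setwd()` call freezes into your script.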

### Create an R Project

An R Project is a feature of RStudio that provides a self-contained working environment. When you create an R Project, RStudio creates a `.Rproj` file in a folder, and that folder becomes the root directory of your project. Every time you open the project (via the `.Rproj` file), RStudio automatically sets the working directory to that folder. You can then reference files relative to the project root, with no need to hardcode file paths.

This is especially useful when you're working on multiple analyses, sharing code with collaborators, or version-controlling with Git. **It is a good practice for reproducible research.**

To create an R Project, select File > New Project

![](images/day2_CreateProject.gif)


## Your Turn!

:::question
Create your first R Project. Let's figure out what we should call it!
:::


## Packages and Libraries

When you first download R, it comes equipped with a number of pre-installed functions, or capabilities, that you can start using immediately.  This is often called "Base R".  However, for certain tasks and workflows, it can be beneficial to use more specialized tools, or functions, to get work done more efficiently.  This is where packages and libraries come into play.

:::note

* **Packages**: Packages are an extension of the pre-built functions in R, and can be installed to bring in specific functions to accomplish tasks, bring in sample datasets to play with, among other things.  There are **tons** of R packages out there, but here is a list of some of the most common/useful ones: [Quick list of useful R packages](https://support.posit.co/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages)

* **Libraries**: Once you have installed a package, it is stored in a library, which is a folder on your computer where R keeps installed packages.  You only have to install a package once, and any time you want to use it in a new session you can load it with the `library()` function, which is described below.

:::


### Tidyverse

As mentioned in the session **"Reproducible Research: Moving From Excel to Scripting"**, we will be using the **"Tidyverse"**.  The Tidyverse is very commonly used for research and data science activities. Rather than being a single package, it is a collection of packages that are designed to work together and that cover the connected steps of the data science workflow.  The packages share a consistent syntax, which makes learning them easier, and the Tidyverse website is a really good reference point if you're struggling with how to approach a specific task.

![](images/block4_data-science-workflow.jpg)


Let's take a closer look!  [Tidyverse](https://www.tidyverse.org/packages/)

<br>

## Let's get started!

First, let's create a new R script.

To create an R script file, select File > New File > R Script

![](images/block3_create-r-script.gif)


### Install Package
To install a package we use the function `install.packages()`. 

```{r}
#install.packages("tidyverse")
```

### Load Libraries
Installed packages are stored in a library. Once a package is installed, we need to load it in each session with the function `library()`.

```{r}
library(tidyverse)
```

:::flag
Note that the package name needs to be in quotation marks when installing the package, but not when loading it with `library()`.

Because packages only need to be installed once, we can do this in the R console as opposed to in the script.

Because packages need to be loaded in each working session, we load them in the R script so that others can see which packages the script needs.
:::
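
If you're not sure whether a package is already installed, you can ask R before loading it. This is a minimal sketch using the base R functions `.libPaths()` and `requireNamespace()`; the object name `has_tidyverse` is made up for this example.

```{r, eval=FALSE}
# List the library folder(s) where installed packages are stored
.libPaths()

# TRUE if the tidyverse package is installed, FALSE otherwise
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  # Load the package for this session
  library(tidyverse)
} else {
  # Remind ourselves to install it once, from the console
  message("tidyverse is not installed yet: run install.packages(\"tidyverse\") in the console")
}
```

Checking first keeps the one-time installation step out of the script while still giving a helpful message to anyone who hasn't installed the package yet.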


Referring again to the Tidyverse Data Science Workflow, in this session we'll be focusing on the first two steps:

![](images/day2_workflow.png)
<a href="https://telapps.london.edu/analytics_with_R/tidyverse.html">Source</a>

## Reading Data

In order to start working with a dataset in R, we first need to import, or "read in", the data.  As mentioned in the session **"Reproducible Research: Moving From Excel to Scripting"**, we will be working with `.csv` files, but R is capable of reading in other file types as well.


### Read a csv file
To import a csv file we can use the `read_csv()` function and assign the result to a new object that we will call `js_data`. Storing the data in an object lets us refer to it in other functions later on.
```{r}
js_data <- read_csv("data/block-4_first-steps.csv")
```
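
To get a feel for what `read_csv()` does without relying on the workshop files, here is a small sketch that writes a toy file to a temporary location and reads it back. The toy values and the object names `tmp` and `toy` are made up for illustration, and `show_col_types = FALSE` (which assumes readr 2.0 or later) just silences the column-type message.

```{r, eval=FALSE}
library(readr)

# Build a tiny CSV in a temporary file so the example runs anywhere
tmp <- tempfile(fileext = ".csv")
writeLines(c("PUMFID,AGEGR10,SEX",
             "10000,5,1",
             "10001,5,1"), tmp)

# Read the file and store the result in an object
toy <- read_csv(tmp, show_col_types = FALSE)

# Number of rows and columns that were read
dim(toy)

# Column types that readr guessed from the data
spec(toy)
```

The first line of the file becomes the column names, and every following line becomes one row of the dataset.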

### Read Other Formats
In the example we are working with, the data is stored in a csv file. The **readr** package from the Tidyverse can also read other formats, with functions such as `read_tsv()` (tab-separated values), `read_delim()` (files with any delimiter, including CSV and TSV), `read_table()` (whitespace-separated files), and `read_log()` (web log files).

There are other functions and packages that allow us to import different file types. 

| **File Type** | **Function** | **Package** |
|:------|:------|:-------|
| .csv | `read_csv()` | readr |
| .xlsx | `read.xlsx()` | xlsx |
| .sav | `read_sav()` | haven |
| .sas7bdat, .sas7bcat | `read_sas()` | haven |
| .dta | `read_dta()` | haven |
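
As an illustration of reading a file that is not comma-separated, the sketch below uses `read_delim()` on a toy file where the values are separated by semicolons. The file contents and the object names `tmp` and `semi` are invented for this example.

```{r, eval=FALSE}
library(readr)

# Write a small semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("id;age",
             "1;34",
             "2;51"), tmp)

# Tell read_delim() which character separates the values
semi <- read_delim(tmp, delim = ";", show_col_types = FALSE)
semi
```

The only difference from `read_csv()` is the `delim` argument, which names the separator used in the file.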

## Listing Column Names
To ask for a list of all the column names in our dataset we can use the `names()` function.
```{r}
names(js_data)
```

Notice that the column names from the original dataset don't provide a clear description of what the variable is. We will change the column names later to facilitate working with our data in the future.

## Head Function
The `head()` function displays the first rows of the dataset (six by default). The output also shows the data type that was assigned to each column by default. We will cover data types more in the next session.

```{r, data-isolation, results = FALSE}
head(js_data)
```

```{r, echo = FALSE, message = FALSE, warning=FALSE}
library(kableExtra)

# Show the first rows as a scrollable table in the rendered page
head(js_data) |>
  kbl() |>
  kable_paper() |>
  scroll_box(width = "500px", height = "200px")
```
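
There are a few related ways to peek at a dataset. The sketch below uses `mtcars`, a small dataset that ships with R, so that it runs without the workshop files. The object names `first_three` and `last_three` are made up for the example, and `glimpse()` comes from the dplyr package that is loaded with the Tidyverse.

```{r, eval=FALSE}
library(dplyr)

# Ask for a specific number of rows with the n argument
first_three <- head(mtcars, n = 3)
first_three

# tail() works the same way, but from the bottom of the dataset
last_three <- tail(mtcars, n = 3)
last_three

# glimpse() lists every column with its data type and first values
glimpse(mtcars)
```

`glimpse()` is handy for wide datasets like ours, where the columns don't all fit on the screen at once.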

## Viewing Data
To look at the full dataset we use the `View()` function. In RStudio, this opens our dataset in a separate spreadsheet-style tab.
```{r, eval=FALSE}
View(js_data)
```


## Change Column Names

Let's now return to the R syntax diagram that we looked at in Block 3:

![](images/day2_RSyntax.png)

In that session, we discussed **objects** and **functions**, and played with the idea of objects storing information, and functions manipulating the object/data.

Now we're going to start implementing **pipes** as a way to connect objects to **functions** and **arguments**.

### Pipes

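Pipes are used to chain steps of instructions or actions together: the pipe takes the result on its left and passes it as the first argument to the function on its right. A piped chain is often combined with the assignment operator to write over an object and give it a new value. We'll walk through some examples of how this works, and start to see how the full syntax of R comes together.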
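
Here is a small sketch of the idea, using made-up numbers. It compares the same calculation written with nested function calls and with the native pipe `|>` (which assumes R 4.1 or later). The object names `values`, `nested` and `piped` are only for illustration.

```{r, eval=FALSE}
values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# With pipes, the same steps are read from left to right
piped <- values |>
  sqrt() |>
  mean() |>
  round(1)

# Both versions give the same result
identical(nested, piped)
```

Each pipe hands the result of the previous step to the next function as its first argument, so `values |> sqrt()` means the same as `sqrt(values)`.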

<br>

We mentioned earlier that we wanted to work with column names that were more descriptive of the content of each variable. To change column names we can use the function `rename()`.

The function `rename()` comes from **dplyr**, one of the packages that is installed and loaded with the tidyverse.

:::walkthrough
Type the following code to change the column name from "PUMFID" to "id":
```{r}
js_data <- js_data |>
  rename("id" = "PUMFID")
```

**Step-by-step explanation:**

* The command starts with the `js_data` object, which is our dataset.
* The assignment operator `<-` comes next, and overwrites the information stored in `js_data` with the result of everything to the right of the operator.
* We then use the `js_data` object and the pipe `|>` to tell R that we want to take the data that is stored in `js_data`, and then do something to it (which is what comes after the pipe).
* The `rename()` function is used to rename columns, and is always followed by parentheses.  Inside those parentheses, we put the new column name that we want in quotation marks `""`, followed by an equal sign `=`, followed by the existing column name in quotation marks `""`.
* We then run the command and hope it works!
:::
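
If you'd like to experiment with `rename()` without touching `js_data`, the sketch below works on a tiny made-up dataset. The objects `toy` and `toy_renamed` are invented for this example. It also shows that `rename()` accepts column names without quotation marks.

```{r, eval=FALSE}
library(dplyr)

# A tiny dataset with two of the original column names
toy <- tibble(PUMFID = c(10000, 10001),
              SEX = c(1, 2))

# The pattern is always new_name = old_name
toy_renamed <- toy |>
  rename(id = PUMFID,
         sex = SEX)

names(toy_renamed)
```

Because the result is stored in a new object, `toy` still has its original column names, which makes it easy to compare before and after.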

## Your Turn!

:::question
First, use the function `names()` to display the column names.
```{r, eval=FALSE}
names(js_data)
```
:::

:::question
Now, see if you can change the `AGEGR10` column name to `ageGrp`.
```{r, class.source = 'fold-hide'}
js_data <- js_data |>
  rename("ageGrp" = "AGEGR10")
```

:::

:::question
Next, try to change the following 3 column names:

* Change `SEX` to `sex`
* Change `MARSTAT` to `maritalStat`
* Change `PRV` to `province`

**Hint:** It can be tedious to do these changes one by one, but by using commas `,` you can rename column names with a single code chunk.  
```{r, class.source = 'fold-hide'}
js_data <- js_data |>
  rename("sex" = "SEX",
          "maritalStat" = "MARSTAT",
          "province" = "PRV")
```
:::

:::question
Now, to change the rest of the column names, copy the following code. (If you feel you're starting to understand how this works, you can show the code below and `copy + paste` it into your script. If you want some more practice, feel free to write it yourself!)

```{r, class.source = 'fold-hide'}
js_data <- js_data |>
  rename("popCenter" = "LUC_RST",
          "eduLevel" = "EHG_ALL",
          "feelRushed" = "GTU_110",
          "extraTime" = "GTU_130",
          "durSleep" = "DUR01",
          "durMealPrep" = "DUR05",
          "durEating" = "DUR06",
          "durAlone" = "DURS200",
          "durDriving" = "DURL313",
          "durWork" = "DUR08",
          "durSchoolSite" = "DUR13",
          "durSchoolOnline" = "DUR14",
          "durStudy" = "DUR15",
          "mainStudy" = "MRW_20",
          "mainJobHunting" = "MRW_30",
          "mainWork" = "MRW_40",
          "worked12m" = "MRW_D40A",
          "workedWeek" = "MRW_D40B",
          "enrollStat" = "EDM_02",
          "dailyTexts" = "TST_01",
          "timeSlowDown" = "TCS_110",
          "timeWorkaholic" = "TCS_120",
          "timeNotFamFriends" = "TCS_150",
          "timeWantAlone" = "TCS_200")
```

:::
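
When there are many columns to rename, another option is to keep the new and old names together in a named vector and pass it to `rename()` with `all_of()`. This is only a sketch of that alternative on a made-up two-column dataset, not a replacement for the code above. The objects `toy`, `lookup` and `toy_renamed` are invented for illustration.

```{r, eval=FALSE}
library(dplyr)

# A tiny dataset with two of the original column names
toy <- tibble(DUR01 = c(510, 420),
              DUR05 = c(60, 150))

# Named vector: the names are the new columns, the values are the old ones
lookup <- c(durSleep = "DUR01",
            durMealPrep = "DUR05")

# all_of() tells rename() to use every entry in the lookup vector
toy_renamed <- toy |>
  rename(all_of(lookup))

names(toy_renamed)
```

Keeping the lookup in its own object makes it easy to reuse the same renaming on another file with the same columns.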

## Save Your Work


Now that we've created a dataset that has some significant changes, it can be helpful to save this as a version that we can easily return to.  Much like we did with reading `.csv` data into R, there is a similar command to save, or "write" `.csv` data back to your computer called `write_csv()`.  The syntax is as follows:

`write_csv(data-object-name, file="file-path/datafile-name.csv")`

The file name below is only an example; feel free to use a file naming convention that you have developed:

```{r}
write_csv(js_data, file="data/clean-cols.csv")
```
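
To convince yourself that the file was written correctly, you can read it back in and compare. The sketch below does this with a small made-up dataset saved to R's temporary folder, so it doesn't touch your project files. The objects `toy`, `out_file` and `toy_again` are invented for the example.

```{r, eval=FALSE}
library(readr)

# A small made-up dataset
toy <- data.frame(id = c(1, 2),
                  sex = c("F", "M"))

# Save it as a csv file in R's temporary folder
out_file <- file.path(tempdir(), "toy-clean-cols.csv")
write_csv(toy, file = out_file)

# Check that the file now exists
file.exists(out_file)

# Read it back in and look at the column names
toy_again <- read_csv(out_file, show_col_types = FALSE)
names(toy_again)
```

Note that `write_csv()` expects the folder in the file path to already exist, so make sure your project has a `data` folder before saving into it.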

<br>

### .RData Files

An additional way to save data in R is as an `.RData` file.  Unlike a simple `.csv` file, an `.RData` file is designed to store R objects (one, several, or your entire workspace if you use `save.image()`) while maintaining their data types and structures (covered in the next session!).  This allows you to get the saved objects back exactly as you left them when you reload the file, and allows others opening your file to work with the same objects, which is a great way to facilitate reproducibility.

The syntax is similar to `write_csv()`, but uses the function `save()`:

```{r}
save(js_data, file="data/clean-cols.RData")
```

<br>

:::note
When opening an `.RData` file, instead of using `read_csv()`, you use the `load()` command.  The syntax is similar, except that you don't assign the result to an object: `load()` restores the saved objects under their original names.  Reopening the file you just saved would look like this:
```{r, eval = FALSE}
load("data/clean-cols.RData")
```


:::
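
The sketch below shows the full round trip on a small made-up dataset, saved to R's temporary folder so that your project files are left alone. The objects `toy`, `rdata_file` and `loaded_names` are invented for this example. It includes a factor column to illustrate that data types survive the trip.

```{r, eval=FALSE}
# A small made-up dataset with a factor column
toy <- data.frame(id = 1:2,
                  grp = factor(c("a", "b")))

# Save the object to an .RData file in R's temporary folder
rdata_file <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_file)

# Remove the object to mimic starting a fresh session
rm(toy)

# load() brings the object back under its original name
# and returns the names of everything it restored
loaded_names <- load(rdata_file)
loaded_names

# The factor column is still a factor
class(toy$grp)
```

Because `load()` restores objects under their original names, it will overwrite any object in your environment that has the same name, so it's worth checking what a file contains before loading it into a session with unsaved work.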




