First Steps in R
In order to follow along with this session, you’ll need to
have downloaded some of the files in the OSF repo for this series. If
you haven’t done so, go to this page and select the “Download this
folder” button on the top right of the screen: download OSF files
If you have any issues, please ask your instructor, or refer to
“Block 1: Introduction to OSF”.
Introduction
Now that we’ve familiarized ourselves a bit with R and RStudio, we
can start moving forward to working with data in R, which is where the
real fun begins!
R Projects
Setting working directory
When we talked about relative file paths in the previous session, the
key idea was where you are in the file system. In a coding environment,
this translates to setting a working directory, which tells R (or
another coding language) where in your computer’s file system you want
to work.
A common way to set a working directory is to use the
setwd() function. This function manually sets the working
directory during an R session. It tells R where to look for files and
where to save outputs just for that session.
This can be done in 2 ways:
- Manually in R, by writing setwd("path-to-directory/secondary-part/etc...")
- By selecting the Session tab in the toolbar, and then selecting Set Working Directory:

But here’s the problem…
Using setwd() can break your code when:
- Someone else tries to run it on their machine.
- You move your project folder.
- You’re running your code on a server or in cloud environments like
RStudio Cloud.
Because these file paths are hardcoded and depend on your machine,
the code is not reproducible.
Create an R Project
An R Project is a feature of RStudio that provides a self-contained
working environment. When you create an R Project, RStudio creates a
.Rproj file in a folder, and that folder becomes the root directory of
your project. Every time you open the project (via the .Rproj file),
RStudio automatically sets the working directory to that folder. You can
then reference files relative to the project root, with no need to
hardcode file paths.
This is super useful when you’re working on multiple analyses,
sharing code with collaborators, or version-controlling with Git.
It is a good practice for reproducible research.
To create an R Project, select File > New Project

Your Turn!
Create your first RProject. Let’s figure out what we should call
it!
Packages and Libraries
When you first download R, it comes equipped with a number of
pre-installed functions, or capabilities, that you can start using
immediately. This is often called “Base R”. However, for certain tasks
and workflows, it can be beneficial to use more specialized tools, or
functions, to accomplish work and make workflows more efficient.
This is where packages and libraries come into play.
Packages: Packages are an extension of the
pre-built functions in R, and can be installed to bring in specific
functions to accomplish tasks, bring in sample datasets to play with,
among other things. There are tons of R packages out
there, but here is a list of some of the most common/useful ones: Quick
list of useful R packages
Libraries: Once you have installed a package, it is stored in a
library, which is the folder on your computer where R keeps installed
packages. You only have to install a package once, but each time you
start a new R session and want to use it, you load it with the
library() function, which is described below.
Tidyverse
As mentioned in the session “Reproducible Research: Moving
From Excel to Scripting”, we will be using the “Tidyverse”. The
Tidyverse is very commonly used for research and data science
activities. Instead of being a single package, it is a collection of
packages that are designed to work together and that focus on the
connections between activities in the data science workflow. The
packages share a common syntax, which makes learning them easier, and
the Tidyverse website is a really good reference point if you’re
struggling with how to approach a specific task.

Let’s take a closer look! Tidyverse
Let’s get started!
First, let’s create a new R script.
To create an R script file, select File > New File > R
Script

Install Package
To install a package we use the function
install.packages().
#install.packages("tidyverse")
Load Libraries
Packages are stored in libraries. Once a package is installed, we
need to load it in each session with the function library().
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.1
Note that the package name needs to be in quotations when installing
the package, but not when loading the library.
Because packages only need to be installed once, we can do this in
the R console as opposed to in the script.
Because libraries need to be loaded in each working session, we can
do this in the R script so that others can see what libraries we are
using and need to be loaded.
Referring again to the Tidyverse Data Science Workflow, in this
session we’ll be focusing on the first two steps:
Source
Reading Data
In order to start working with a dataset in R, we first need to
import, or “read in”, the data. As mentioned in the session
“Reproducible Research: Moving From Excel to
Scripting”, we will be working with .csv files,
but R is capable of reading in other file types as well.
Read a csv file
To import a csv file we can use the read_csv() function
and assign it to a new object we will call js_data. We create a
new object to be able to call it in different functions later on.
js_data <- read_csv("data/block-4_first-steps.csv")
Listing Column Names
To ask for a list of all the column names in our dataset we can use
the names() function.
names(js_data)
## [1] "PUMFID" "AGEGR10" "SEX" "MARSTAT" "PRV" "LUC_RST" "EHG_ALL" "GTU_110" "GTU_130"
## [10] "DUR01" "DUR05" "DUR06" "DURS200" "DURL313" "DUR08" "DUR13" "DUR14" "DUR15"
## [19] "MRW_20" "MRW_30" "MRW_40" "MRW_D40A" "MRW_D40B" "EDM_02" "TST_01" "TCS_110" "TCS_120"
## [28] "TCS_150" "TCS_200"
Notice that the column names from the original dataset don’t provide
a clear description of what the variable is. We will change the column
names later to facilitate working with our data in the future.
Head Function
The head() function displays the first rows of the dataset (six by
default). When printed in the console, the output also shows the data
type assigned to each column. We will cover data types more in the next
session.
head(js_data)
| PUMFID | AGEGR10 | SEX | MARSTAT | PRV | LUC_RST | EHG_ALL | GTU_110 | GTU_130 | DUR01 | DUR05 | DUR06 | DURS200 | DURL313 | DUR08 | DUR13 | DUR14 | DUR15 | MRW_20 | MRW_30 | MRW_40 | MRW_D40A | MRW_D40B | EDM_02 | TST_01 | TCS_110 | TCS_120 | TCS_150 | TCS_200 |
| 10000 | 5 | 1 | 5 | 46 | 1 | 3 | 1 | 1 | 510 | 60 | 120 | 770 | 90 | 0 | 0 | 0 | 0 | NA | 1 | 1 | 1 | 2 | NA | 8 | 2 | 2 | 2 | 2 |
| 10001 | 5 | 1 | 1 | 59 | 1 | 4 | 3 | 4 | 420 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 2 | 1 | 1 | 2 | NA | 1 | 2 | 2 | 2 | 2 |
| 10002 | 4 | 2 | 1 | 47 | 1 | 5 | 1 | 6 | 570 | 0 | 0 | 630 | 30 | 480 | 0 | 0 | 0 | NA | NA | NA | 1 | 1 | NA | 7 | 2 | 1 | 1 | 1 |
| 10003 | 6 | 2 | 5 | 35 | 1 | 4 | 2 | 4 | 510 | 10 | 45 | 875 | 80 | 20 | 0 | 0 | 0 | NA | NA | NA | 1 | 1 | NA | 1 | 2 | 2 | 2 | 2 |
| 10004 | 2 | 1 | 6 | 35 | 1 | NA | 1 | 3 | 525 | 90 | 40 | 815 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 2 | 2 | NA | 1 | 2 | 2 | 2 | 2 |
| 10005 | 1 | 1 | 6 | 35 | 1 | 1 | 1 | 6 | 435 | 0 | 0 | 430 | 40 | 530 | 0 | 0 | 0 | NA | NA | NA | 1 | 1 | NA | 2 | 2 | 1 | 1 | 2 |
Viewing Data
To visualize the full dataset we use the View()
function. This will open our dataset in a separate window.
View(js_data)
Change Column Names
Let’s now return to the R syntax diagram that we looked at in Block
3:

In that session, we discussed objects and
functions, and played with the idea of objects storing
information, and functions manipulating the object/data.
Now we’re going to start implementing pipes as a way
to connect objects to functions and
arguments.
Pipes
Pipes are used to chain steps of instructions or actions together,
and often involve writing over an object to give it a new value. We’ll
walk through some examples of how this works, and start to see how the
full syntax of R comes together.
We mentioned earlier that we wanted to work with column names that
were more descriptive of the content of each variable. To change column
names we can use the function rename().
The function rename() is part of one of the packages
that was installed with tidyverse.
Type the following code to change the column name from “PUMFID” to
“id”
js_data <- js_data |>
rename("id" = "PUMFID")
Step-by-step explanation:
- This command first starts off with the js_data object, which is our dataset.
- The assignment operator (<-) comes next, and will overwrite the
information stored in js_data with the result of everything to the
right of the operator.
- We then use the js_data object and the pipe |> to tell R that we want
to take the data that is stored in js_data, and then do something to it
(which is what comes after the pipe).
- The rename() function is used to rename columns, and is always
followed by parentheses. Inside those parentheses, we put the new
column name that we want in quotation marks "", followed by an equal
sign =, followed by the existing column name in quotation marks "".
- We then run the command and hope it works!
Your Turn!
First, use the function names() to display the column
names.
names(js_data)
Now, see if you can change the AGEGR10 column name to
ageGrp
js_data <- js_data |>
rename("ageGrp" = "AGEGR10")
Next, try to change the following 3 column names:
- Change SEX to sex
- Change MARSTAT to maritalStat
- Change PRV to province
Hint: It can be tedious to do these changes one by
one, but by using commas , you can rename column names with
a single code chunk.
js_data <- js_data |>
rename("sex" = "SEX",
"maritalStat" = "MARSTAT",
"province" = "PRV")
Now, to change the rest of the column names, copy the following code.
(If you feel you’re starting to understand how this works, you can show
the code below and copy + paste it into your script, or if
you want some more practice, feel free to write it yourself!)
js_data <- js_data |>
rename("popCenter" = "LUC_RST",
"eduLevel" = "EHG_ALL",
"feelRushed" = "GTU_110",
"extraTime" = "GTU_130",
"durSleep" = "DUR01",
"durMealPrep" = "DUR05",
"durEating" = "DUR06",
"durAlone" = "DURS200",
"durDriving" = "DURL313",
"durWork" = "DUR08",
"durShoolSite" = "DUR13",
"durSchoolOnline" = "DUR14",
"durStudy" = "DUR15",
"mainStudy" = "MRW_20",
"mainJobHunting" = "MRW_30",
"mainWork" = "MRW_40",
"worked12m" = "MRW_D40A",
"workedWeek" = "MRW_D40B",
"enrollStat" = "EDM_02",
"dailyTexts" = "TST_01",
"timeSlowDown" = "TCS_110",
"timeWorkaholic" = "TCS_120",
"timeNotFamFriends" = "TCS_150",
"timeWantAlone" = "TCS_200")
Save Your Work
Now that we’ve created a dataset that has some significant changes,
it can be helpful to save this as a version that we can easily return
to. Much like we did with reading .csv data into R, there
is a similar command to save, or “write” .csv data back to
your computer called write_csv(). The syntax is as
follows:
write_csv(data-object-name, file="file-path/datafile-name.csv")
The file name below is for example purposes, and feel free to use a
file naming convention that you developed:
write_csv(js_data, file="data/clean-cols.csv")
.RData Files
An additional way to save data in R is as an
.RData file. Unlike a simple .csv file,
.RData files are designed to store R objects (one, several, or
everything in your R workspace) while maintaining their data types and
structures (covered in the next session!). This allows you to pick up
the saved objects exactly as you left them when you reload the file, and
allows others opening your files to work with the same objects, which is
a great way to facilitate reproducibility.
The syntax is similar to write_csv(), but uses
the function save():
save(js_data, file="data/clean-cols.RData")
When opening an .RData file, instead of using
read_csv(), you use the load() command. The
syntax is the same, and to reopen the file you just saved would look
like this:
load("data/clean-cols.RData")
---
title: "First steps in R"
pagetitle: "First steps in R"
output:
  html_document:
    code_folding: show # allows toggling of showing and hiding code. Remove if not using code.
    code_download: true # allows the user to download the source .Rmd file. Remove if not using code.
    includes:
      after_body: footer.html # include a custom footer.
    toc: true
    toc_depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```

## First Steps in R

**In order to follow along with this session, you'll need to have downloaded some of the files in the OSF repo for this series.  If you haven't done so, go to this page and select the "Download this folder" button on the top right of the screen:** [download OSF files](https://osf.io/rx5a3/files/osfstorage)

If you have any issues, please ask your instructor, or refer to **"Block 1: Introduction to OSF"**.


## Introduction

:::intro
Now that we've familiarized ourselves a bit with R and RStudio, we can start moving forward to working with data in R, which is where the real fun begins!
:::

## R Projects 

### Setting working directory

When we talked about relative file paths in the previous session, the key idea was *where you are in the file system*.  In a coding environment, this translates to **setting a working directory**, which tells R (or another coding language) where in your computer's file system you want to work.

A common way to set a working directory is to use the `setwd()` function. This function manually sets the working directory during an R session. It tells R where to look for files and where to save outputs just for that session.

This can be done in 2 ways:

* Manually in R, by writing `setwd("path-to-directory/secondary-part/etc...")`
* Setting it by selecting the `Session` tab in the toolbar, and selecting `Set Working Directory`:

![](images/block4-1_set-wd.gif)


But here’s the problem...

Using `setwd()` can break your code when:

 - Someone else tries to run it on their machine.
 - You move your project folder.
 - You're running your code on a server or in cloud environments like RStudio Cloud.

Because these file paths are hardcoded and depend on your machine, **the code is not reproducible**.
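
To make the problem concrete, here is a minimal sketch using only base R. The two absolute paths are hypothetical examples, not real folders: they stand for the same project stored on two different computers.

```r
# Ask R where it is currently "standing" in the file system.
current_dir <- getwd()
print(current_dir)

# Hypothetical absolute paths to the same project on two different machines.
# A script that hardcodes one of them in setwd() fails on the other machine.
project_paths <- c(
  "C:/Users/alex/Documents/my-analysis",
  "/home/sam/projects/my-analysis"
)

# Only the last part (the project folder name) is the same on both machines.
print(basename(project_paths))
```

Everything before the project folder name differs from machine to machine, which is why a script that depends on it cannot be shared as-is.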

### Create an R Project

An R Project is a feature of RStudio that provides a self-contained working environment. When you create an R Project, RStudio creates a .Rproj file in a folder, and that folder becomes the root directory of your project. Every time you open the project (via the .Rproj file), RStudio automatically sets the working directory to that folder. You can then reference files relative to the project root, with no need to hardcode file paths.

This is super useful when you're working on multiple analyses, sharing code with collaborators, or version-controlling with Git. **It is a good practice for reproducible research.**
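
As a rough sketch of what "relative to the project root" looks like in practice, the base R snippet below builds a relative path and checks whether the file can be found from the current working directory. It assumes the data file used later in this session sits in a `data` folder inside the project.

```r
# Build a path relative to the project root (no drive letter, no user name).
data_path <- file.path("data", "block-4_first-steps.csv")
print(data_path)

# Check whether the file is reachable from the current working directory.
if (file.exists(data_path)) {
  message("Found ", data_path)
} else {
  message("Could not find ", data_path, " starting from ", getwd())
}
```

If the file is not found, the usual cause is that the script was opened outside of the project, so the working directory is not the project folder.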

To create an R Project, select File > New Project

![](images/day2_CreateProject.gif)


## Your Turn!

:::question
Create your first RProject. Let's figure out what we should call it!
:::


## Packages and Libraries

When you first download R, it comes equipped with a number of pre-installed functions, or capabilities, that you can start using immediately.  This is often called "Base R".  However, for certain tasks and workflows, it can be beneficial to use more specialized tools, or functions, to accomplish work and make workflows more efficient.  This is where packages and libraries come into play.

:::note

* **Packages**: Packages are an extension of the pre-built functions in R, and can be installed to bring in specific functions to accomplish tasks, bring in sample datasets to play with, among other things.  There are **tons** of R packages out there, but here is a list of some of the most common/useful ones: [Quick list of useful R packages](https://support.posit.co/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages)

* **Libraries**: Once you have installed a package, it is stored in a library, which is the folder on your computer where R keeps installed packages.  You only have to install a package once, but each time you start a new R session and want to use it, you load it with the `library()` function, which is described below.

:::


### Tidyverse

As mentioned in the session **"Reproducible Research: Moving From Excel to Scripting"**, we will be using the **"Tidyverse"**.  The Tidyverse is very commonly used for research and data science activities. Instead of being a single package, it is a collection of packages that are designed to work together and that focus on the connections between activities in the data science workflow.  The packages share a common syntax, which makes learning them easier, and the Tidyverse website is a really good reference point if you're struggling with how to approach a specific task.

![](images/block4_data-science-workflow.jpg)


Let's take a closer look!  [Tidyverse](https://www.tidyverse.org/packages/)

<br>

## Let's get started!

First, let's create a new R script.

To create an R script file, select File > New File > R Script

![](images/block3_create-r-script.gif)


### Install Package
To install a package we use the function `install.packages()`. 

```{r}
#install.packages("tidyverse")
```

### Load Libraries
Packages are stored in libraries. Once a package is installed, we need to load it in each session with the function `library()`.

```{r}
library(tidyverse)
```

:::flag
Note that the package name needs to be in quotations when installing the package, but not when loading the library.

Because packages only need to be installed once, we can do this in the R console as opposed to in the script.

Because libraries need to be loaded in each working session, we can do this in the R script so that others can see what libraries we are using and need to be loaded.
:::
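
One possible way to see which packages still need installing, without installing anything, is sketched below with base R's `requireNamespace()`. The vector of package names is only an illustration; swap in whichever packages your script needs.

```r
# Packages this (illustrative) script depends on.
pkgs <- c("stats", "utils", "tidyverse")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report what is missing instead of installing it automatically.
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Run once in the console: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
}
```

Keeping the installation step out of the script and in the console matches the advice above: the script only loads packages, it does not change what is installed on someone else's computer.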


Referring again to the Tidyverse Data Science Workflow, in this session we'll be focusing on the first two steps:

![](images/day2_workflow.png)
<a href="https://telapps.london.edu/analytics_with_R/tidyverse.html">Source</a>

## Reading Data

In order to start working with a dataset in R, we first need to import, or "read in", the data.  As mentioned in the session **"Reproducible Research: Moving From Excel to Scripting"**, we will be working with `.csv` files, but R is capable of reading in other file types as well.


### Read a csv file
To import a csv file we can use the `read_csv()` function and assign it to a new object we will call *js_data*. We create a new object to be able to call it in different functions later on.
```{r}
js_data <- read_csv("data/block-4_first-steps.csv")
```
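
If you want to try `read_csv()` without the course data, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The three column names are borrowed from our dataset, but the file itself is only for illustration.

```r
library(readr)  # readr is the Tidyverse package that provides read_csv()

# Write a tiny illustrative CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("PUMFID,AGEGR10,SEX",
             "10000,5,1",
             "10001,5,1"),
           csv_path)

# Read the file into a new object, just like we did with js_data.
demo_data <- read_csv(csv_path, show_col_types = FALSE)
print(demo_data)

# dim() returns the number of rows and the number of columns.
print(dim(demo_data))
```

The first line of the file becomes the column names, and every following line becomes one row.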

### Read Other Formats
In our example, the data is stored in a csv file. The **readr** package from the Tidyverse can also read other formats, using `read_tsv()` (tab-separated values), `read_delim()` (files with any delimiter, including CSV and TSV), `read_table()` (whitespace-separated files), and `read_log()` (web log files).

There are other functions and packages that allow us to import different file types. 

| **File Type** | **Function** | **Package** |
|:------|:------|:-------|
|.csv | `read_csv()`| readr |
| .xlsx | `read.xlsx()`| xlsx |
| .sav | `read_sav()`| haven |
| .sas7bdat , .sas7bcat | `read_sas()`| haven |
| .dta | `read_dta()`| haven |
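
As a small illustration of how these functions relate, the sketch below reads the same made-up tab-separated file twice: once with `read_tsv()` and once with the more general `read_delim()`, where the delimiter is stated explicitly. The file and its two columns are invented for the example.

```r
library(readr)

# Write a tiny illustrative tab-separated file ("\t" is the tab character).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tprovince",
             "10000\t46",
             "10001\t59"),
           tsv_path)

# read_tsv() assumes tabs; read_delim() needs to be told the delimiter.
from_tsv   <- read_tsv(tsv_path, show_col_types = FALSE)
from_delim <- read_delim(tsv_path, delim = "\t", show_col_types = FALSE)

print(from_tsv)
print(from_delim)
```

Both calls should give the same columns and values; `read_tsv()` is simply a shortcut for a common case.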

## Listing Column Names
To ask for a list of all the column names in our dataset we can use the `names()` function.
```{r}
names(js_data)
```

Notice that the column names from the original dataset don't provide a clear description of what the variable is. We will change the column names later to facilitate working with our data in the future.
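
Because `names()` returns an ordinary character vector, you can ask questions of it. The base R sketch below uses a small made-up data frame (not the course data) to count the columns and to check whether a particular column exists.

```r
# A small illustrative data frame with three of the original column names.
demo_data <- data.frame(PUMFID  = c(10000, 10001),
                        AGEGR10 = c(5, 5),
                        SEX     = c(1, 1))

col_names <- names(demo_data)
print(col_names)

# How many columns are there?
print(length(col_names))

# Is there a column called "SEX", and in which position is it?
print("SEX" %in% col_names)
print(which(col_names == "SEX"))
```

Checks like these are handy before renaming, because renaming a column that does not exist produces an error.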

## Head Function
The `head()` function displays the first rows of the dataset (six by default). When printed in the console, the output also shows the data type assigned to each column. We will cover data types more in the next session.

```{r, data-isolation, results = FALSE}
head(js_data)
```

```{r, echo = FALSE, message = FALSE, warning=FALSE}
library(kableExtra)
head(js_data) |>
  kbl() |>
  # kable_styling(bootstrap_options = "striped")
  kable_paper() |>
  scroll_box(width = "500px", height = "200px")
```
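
The number of rows shown by `head()` can be changed with its `n` argument, and `tail()` does the same from the bottom of the dataset. Here is a minimal base R sketch on a made-up data frame with ten rows (the values are invented for illustration).

```r
# An illustrative data frame with 10 rows.
demo_data <- data.frame(id       = 10000:10009,
                        durSleep = seq(400, 580, by = 20))

# Default: the first 6 rows.
print(head(demo_data))

# Ask for a specific number of rows with n.
print(head(demo_data, n = 3))

# tail() shows rows from the end of the dataset instead.
print(tail(demo_data, n = 2))
```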

## Viewing Data
To visualize the full dataset we use the `View()` function. This will open our dataset in a separate window.
```{r, eval=FALSE}
View(js_data)
```


## Change Column Names

Let's now return to the R syntax diagram that we looked at in Block 3:

![](images/day2_RSyntax.png)

In that session, we discussed **objects** and **functions**, and played with the idea of objects storing information, and functions manipulating the object/data.

Now we're going to start implementing **pipes** as a way to connect objects to **functions** and **arguments**.

#### Pipes

Pipes are used to chain steps of instructions or actions together, and often involve writing over an object to give it a new value. We'll walk through some examples of how this works, and start to see how the full syntax of R comes together.
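
Before applying pipes to our dataset, here is a small sketch with plain numbers showing that a piped command is just another way of writing nested function calls. It uses only base R (the `|>` pipe needs R 4.1 or newer); the numbers are arbitrary.

```r
values <- c(4, 9, 16)

# Nested version: read from the inside out (square roots first, then the sum).
nested <- sum(sqrt(values))

# Piped version: read from left to right, one step at a time.
piped <- values |>
  sqrt() |>
  sum()

print(piped)                    # 9
print(identical(nested, piped)) # TRUE

# Piping alone does not change the original object; only assignment does.
print(values)
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why the two versions give the same result.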

<br>

We mentioned earlier that we wanted to work with column names that were more descriptive of the content of each variable. To change column names we can use the function `rename()`.

The function `rename()` is part of one of the packages that was installed with tidyverse.

:::walkthrough
Type the following code to change the column name from "PUMFID" to "id"
```{r}
js_data <- js_data |>
  rename("id" = "PUMFID")
```

**Step-by-step explanation:**

* This command first starts off with the `js_data` object, which is our dataset.
* The assignment operator comes next, and will re-write the information stored in `js_data` with all the information that is to the right side of the operator.
* We then use the `js_data` object and the pipe `|>` to tell R that we want to take the data that is stored in `js_data`, and then do something to it (which is what comes after the pipe).
* The `rename()` function is used to rename columns, and is always followed by parentheses.  Inside those parentheses, we put the new column name that we want in quotation marks `""`, followed by an equal sign `=`, followed by the existing column name in quotation marks `""`.
* We then run the command and hope it works!
:::
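
To see the effect of the assignment step on its own, the sketch below renames a column in a tiny made-up data frame, once without saving the result and once with it. It also shows that `rename()` accepts column names without quotation marks. The data frame is invented for illustration; `dplyr` is the Tidyverse package that provides `rename()`.

```r
library(dplyr)

# A tiny illustrative data frame with two of the original column names.
demo_data <- data.frame(PUMFID = c(10000, 10001), AGEGR10 = c(5, 5))

# Without assignment, the renamed result is only printed, not saved.
demo_data |>
  rename("id" = "PUMFID") |>
  names() |>
  print()
print(names(demo_data))  # still the original names

# With assignment, the result is stored. Quotation marks are optional here.
renamed <- demo_data |>
  rename(id = PUMFID)
print(names(renamed))
```

In the walkthrough above the result is assigned back to `js_data`, which is what makes the new name stick.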

## Your Turn!

:::question
First, use the function `names()` to display the column names.
```{r, eval=FALSE}
names(js_data)
```
:::

:::question
Now, see if you can change the `AGEGR10` column name to `ageGrp`
```{r, class.source = 'fold-hide'}
js_data <- js_data |>
  rename("ageGrp" = "AGEGR10")
```

:::

:::question
Next, try to change the following 3 column names:

* Change `SEX` to `sex`
* Change `MARSTAT` to `maritalStat`
* Change `PRV` to `province`

**Hint:** It can be tedious to do these changes one by one, but by using commas `,` you can rename column names with a single code chunk.  
```{r, class.source = 'fold-hide'}
js_data <- js_data |>
  rename("sex" = "SEX",
          "maritalStat" = "MARSTAT",
          "province" = "PRV")
```
:::

:::question
Now, to change the rest of the column names, copy the following code. (If you feel you're starting to understand how this works, you can show the code below and `copy + paste` it into your script, or if you want some more practice, feel free to write it yourself!)

```{r, class.source = 'fold-hide'}
js_data <- js_data |>
  rename("popCenter" = "LUC_RST",
          "eduLevel" = "EHG_ALL",
          "feelRushed" = "GTU_110",
          "extraTime" = "GTU_130",
          "durSleep" = "DUR01",
          "durMealPrep" = "DUR05",
          "durEating" = "DUR06",
          "durAlone" = "DURS200",
          "durDriving" = "DURL313",
          "durWork" = "DUR08",
          "durShoolSite" = "DUR13",
          "durSchoolOnline" = "DUR14",
          "durStudy" = "DUR15",
          "mainStudy" = "MRW_20",
          "mainJobHunting" = "MRW_30",
          "mainWork" = "MRW_40",
          "worked12m" = "MRW_D40A",
          "workedWeek" = "MRW_D40B",
          "enrollStat" = "EDM_02",
          "dailyTexts" = "TST_01",
          "timeSlowDown" = "TCS_110",
          "timeWorkaholic" = "TCS_120",
          "timeNotFamFriends" = "TCS_150",
          "timeWantAlone" = "TCS_200")
```

:::
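
When there are many columns to rename, one alternative is to keep the new and old names together in a named vector and hand it to `rename()` through `all_of()`. This is only a sketch on a made-up three-column data frame; the object name `name_lookup` is our own choice, not something defined elsewhere in this series.

```r
library(dplyr)

# A tiny illustrative data frame using three of the original column names.
demo_data <- data.frame(SEX = c(1, 2), MARSTAT = c(5, 1), PRV = c(46, 47))

# Named vector: the name is the NEW column name, the value is the OLD one.
name_lookup <- c(sex         = "SEX",
                 maritalStat = "MARSTAT",
                 province    = "PRV")

renamed <- demo_data |>
  rename(all_of(name_lookup))

print(names(renamed))
```

Keeping the lookup in one object makes it easy to reuse the same renaming on another file with the same original column names.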

## Save Your Work


Now that we've created a dataset that has some significant changes, it can be helpful to save this as a version that we can easily return to.  Much like we did with reading `.csv` data into R, there is a similar command to save, or "write" `.csv` data back to your computer called `write_csv()`.  The syntax is as follows:

`write_csv(data-object-name, file="file-path/datafile-name.csv")`

The file name below is for example purposes, and feel free to use a file naming convention that you developed:

```{r}
write_csv(js_data, file="data/clean-cols.csv")
```
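
A quick way to convince yourself that the file was written correctly is to read it straight back in. The sketch below does this round trip in a temporary folder with a small made-up data frame, so it does not touch your project files; the file name is hypothetical.

```r
library(readr)

# A tiny illustrative data frame.
demo_data <- data.frame(id = c(10000L, 10001L), province = c(46L, 59L))

# Write it to a temporary folder (hypothetical file name).
out_path <- file.path(tempdir(), "clean-cols-demo.csv")
write_csv(demo_data, file = out_path)

# Confirm the file exists, then read it back in.
print(file.exists(out_path))
read_back <- read_csv(out_path, show_col_types = FALSE)
print(read_back)
```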

<br>

### .RData Files

An additional way to save data in R is as an `.RData` file.  Unlike a simple `.csv` file, `.RData` files are designed to store R objects (one, several, or everything in your R workspace) while maintaining their data types and structures (covered in the next session!).  This allows you to pick up the saved objects exactly as you left them when you reload the file, and allows others opening your files to work with the same objects, which is a great way to facilitate reproducibility.

The syntax is similar to `write_csv()`, but uses the function `save()`:

```{r}
save(js_data, file="data/clean-cols.RData")
```

<br>

:::note
When opening an `.RData` file, instead of using `read_csv()`, you use the `load()` command.  The syntax is the same, and to reopen the file you just saved would look like this:
```{r, eval = FALSE}
load("data/clean-cols.RData")
```


:::
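
The sketch below walks through a full `save()` and `load()` round trip in base R, using a temporary file and a small made-up data frame. Note that `load()` restores the object under the name it had when it was saved, so there is no need to assign the result to a new object.

```r
# A tiny illustrative data frame with a factor column.
demo_data <- data.frame(id    = c(10000L, 10001L),
                        group = factor(c("a", "b")))

# Save the object to a temporary .RData file.
rdata_path <- tempfile(fileext = ".RData")
save(demo_data, file = rdata_path)

# Remove the object from the workspace, then bring it back with load().
rm(demo_data)
restored_names <- load(rdata_path)

# load() returns the names of the objects it restored.
print(restored_names)

# The column is still a factor, which a .csv file would not remember.
print(class(demo_data$group))
```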




