Web scraping with R and rvest

The lessons below were designed for those interested in using R to obtain data from websites.

This introduction assumes no programming experience and should be used as a supplement to one of the other Carpentries R lessons, either before taking an R lesson for a quick introduction to R, or after an introduction to R, to examine another use case. The lessons can be taught in about 2 hours.

The first lesson introduces web scraping, then follows an introduction to HTML, scraping data using rvest, plotting the scraped data and a conclusion.

This lesson assumes no prior knowledge of R or RStudio and no programming experience.

Episodes


  1. Before we start
  2. Introduction: What is web scraping?
  3. What is HTML?
  4. Web scraping with rvest
  5. Plotting scraped data
  6. Conclusion

Preparations


Carpentry’s teaching is hands-on, and to follow this lesson learners must have R and RStudio installed on their computers. They also need to be able to install a number of R packages, create directories, and download files.

To avoid troubleshooting during the lesson, learners should follow the instruction below to download and install everything beforehand. If they are using their own computers this should be no problem, but if the computer is managed by their organization’s IT department they might need help from an IT administrator.

Install R and RStudio

R and RStudio are two separate pieces of software:

  • R is a programming language that is especially powerful for data exploration, visualization, and statistical analysis
  • RStudio is an integrated development environment (IDE) that makes using R easier. In this course we use RStudio to interact with R.

If you don’t already have R and RStudio installed, follow the instructions for your operating system below. You have to install R before you install RStudio.

Windows

  • Download R from the CRAN website.
  • Run the .exe file that was just downloaded
  • Go to the RStudio download page
  • Under All Installers, download the RStudio Installer for Windows.
  • Double click the file to install it
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
MacOS
  • Download R from the CRAN website.
  • Select the .pkg file for the latest R version
  • Double click on the downloaded file to install R
  • It is also a good idea to install XQuartz (needed by some packages)
  • Go to the RStudio download page
  • Under All Installers, download the RStudio Installer for MacOS.
  • Double click the file to install RStudio
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
Linux
  • Follow the instructions for your distribution from CRAN, they provide information to get the most recent version of R for common distributions. For most distributions, you could use your package manager (e.g., for Debian/Ubuntu run sudo apt-get install r-base, and for Fedora sudo yum install R), but we don’t recommend this approach as the versions provided by this are usually out of date. In any case, make sure you have at least R 3.3.1.
  • Go to the RStudio download page
  • Under All Installers, select the version that matches your distribution and install it with your preferred method (e.g., with Debian/Ubuntu sudo dpkg -i rstudio-YYYY.MM.X-ZZZ-amd64.deb at the terminal).
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

Update R and RStudio

If you already have R and RStudio installed, first check if your R version is up to date:

  • When you open RStudio your R version will be printed in the console on the bottom left. Alternatively, you can type sessionInfo() into the console. If your R version is 4.0.0 or later, you don’t need to update R for this lesson. If your version of R is older than that, download and install the latest version of R from the R project website for Windows, for MacOS, or for Linux
  • It is not necessary to remove old versions of R from your system, but if you wish to do so you can check How do I uninstall R?
  • Note: The changes introduced by new R versions are usually backwards-compatible. That is, your old code should still work after updating your R version. However, if breaking changes happen, it is useful to know that you can have multiple versions of R installed in parallel and that you can switch between them in RStudio by going to Tools > Global Options > General > Basic.
  • After installing a new version of R, you will have to reinstall all your packages with the new version. For Windows, there is a package called installr that can help you with upgrading your R version and migrate your package library.

To update RStudio to the latest version, open RStudio and click on Help > Check for Updates. If a new version is available follow the instruction on screen. By default, RStudio will also automatically notify you of new versions every once in a while.

Install required R packages

During the course we will need a number of R packages. Packages contain useful R code written by other people. We will use the packages rvest and tidyverse.

To try to install these packages, open RStudio and copy and paste the following command into the console window (look for a blinking cursor on the bottom left), then press the Enter (Windows and Linux) or Return (MacOS) to execute the command.

R

install.packages(c("tidyverse", "rvest", "ggplot2", "httr", "stringr"))

Alternatively, you can install the packages using RStudio’s graphical user interface by going to Tools > Install Packages and typing the names of the packages separated by a comma.

R tries to download and install the packages on your machine. When the installation has finished, you can try to load the packages by pasting the following code into the console:

R

library(tidyverse)
library(rvest)
library(ggplot2)
library(httr)
library(stringr)

If you do not see an error like there is no package called ‘...' you are good to go!

Updating R packages

Generally, it is recommended to keep your R version and all packages up to date, because new versions bring improvements and important bugfixes. To update the packages that you have installed, click Update in the Packages tab in the bottom right panel of RStudio, or go to Tools > Check for Package Updates....

Sometimes, package updates introduce changes that break your old code, which can be very frustrating. To avoid this problem, you can use a package called renv. It locks the package versions you have used for a given project and makes it straightforward to reinstall those exact package version in a new environment, for example after updating your R version or on another computer. However, the details are outside of the scope of this lesson.

Acknowledgments


This lesson draws material from the R Ecology lesson, R Social Science lesson and the Introduction to Web Scraping lesson.

FIXME: Setup instructions live in this document. Please specify the tools and the data sets the Learner needs to have installed.

Data Sets


Download the data zip file and unzip it to your Desktop

Software Setup


Details

Setup for different systems can be presented in dropdown menus via a solution tag. They will join to this discussion block, so you can give a general overview of the software used in this lesson here and fill out the individual operating systems (and potentially add more, e.g. online setup) in the solutions blocks.

Use PuTTY

Use Terminal.app

Use Terminal