Web scraping using R and rvest
Last updated on 2023-07-11
Overview
Questions
- How can scraping a website be automated?
- How can I set up a scraping project using R and rvest?
- How do I tell rvest what elements to scrape from a webpage?
- What can I do with the data extracted with rvest?
Objectives
- Set up an rvest project.
- Understand the various elements of an rvest project.
- Scrape a website and extract specific elements.
- Plot scraped data.
- Store the extracted data.
Introduction
The rvest package allows us to scrape data from a website. We will obtain
the list of Brazilian senators and plot the number of senators each
political party has.
Set up the environment
R
library(rvest)
library(httr)
library(ggplot2)
library(stringr)
The additional libraries are:
- httr for controlling downloads
- ggplot2 for plotting the data
- stringr for string manipulations to process the data
Get the website
R
# Download the page (with a generous timeout) and parse it as HTML
BR_sen_html <- read_html(
  GET("https://www25.senado.leg.br/web/senadores/em-exercicio",
      timeout(600)))
Now, check the type of document and examine the first few lines:
R
class(BR_sen_html)
BR_sen_html
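A download can fail or return an error page, so it can help to check the HTTP status before parsing. The following is a minimal sketch, not part of the original lesson: `fetch_html` and its `timeout_seconds` argument are hypothetical names introduced here, while `GET()`, `timeout()` and `stop_for_status()` come from httr.

```r
library(rvest)
library(httr)

# Hypothetical helper (not part of rvest or httr): download a page,
# stop with an R error on an HTTP 4xx/5xx response, then parse the HTML.
fetch_html <- function(url, timeout_seconds = 60) {
  response <- GET(url, timeout(timeout_seconds))
  stop_for_status(response)  # turns HTTP errors into R errors
  read_html(response)
}

# Usage (requires network access):
# BR_sen_html <- fetch_html("https://www25.senado.leg.br/web/senadores/em-exercicio", 600)
```

Failing early like this avoids silently scraping an error page that happens to parse as valid HTML.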
Extract the data
Examining the HTML source of the page shows that all the names are in a
table with id="senadoresemexercicio-tabela-senadores". Let us extract
this table:
R
BR_sen_df <- BR_sen_html %>%
  html_element("#senadoresemexercicio-tabela-senadores") %>%
  html_table()
Let us view the first few rows:
R
head(BR_sen_df)
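To see how the id selector picks out one table among several, here is a small offline sketch. The HTML snippet, the id `toy-senators` and the names in it are made up for illustration; only `read_html()`, `html_element()` and `html_table()` are real rvest functions.

```r
library(rvest)

# A made-up page with two tables; only the second has the id we want
toy_page <- read_html('
<html><body>
  <table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <table id="toy-senators">
    <tr><th>Nome</th><th>Partido</th><th>UF</th></tr>
    <tr><td>Senator A</td><td>P1</td><td>AC</td></tr>
    <tr><td>Senator B</td><td>P2</td><td>RJ</td></tr>
  </table>
</body></html>')

# "#toy-senators" is a CSS selector meaning: the element with this id
toy_table <- toy_page %>%
  html_element("#toy-senators") %>%
  html_table()

toy_table
```

The `#` prefix is standard CSS syntax for selecting by id, so the first table is ignored even though it appears earlier in the page.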
Clean the data
Each region has three senators, but the table also contains one extra row
per region that only repeats the region's abbreviation and name. Since we
want to analyze the data, we want to remove these rows. We also want to
replace the state name abbreviations with the full names. We first create
a data frame with the abbreviations and names, and then create a named
vector we can use to substitute full names for abbreviations:
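R
# Split entries such as "XX - Full name" into abbreviation (V1) and name (V2)
BR_sen_regions_df <- as.data.frame(str_split_fixed(BR_sen_df$UF, " - ", 2))
# Keep only the region rows: every 4th row of the 108-row table
BR_sen_regions_df <- BR_sen_regions_df[c(seq(1,108,4)),]
# Named vector: names are the abbreviations, values are the full names
abbr_replacements <- as.character(BR_sen_regions_df$V2)
names(abbr_replacements) <- BR_sen_regions_df$V1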
We now remove the extra rows from the data frame:
R
BR_sen_df <- BR_sen_df[!grepl(' - ',BR_sen_df$Nome),]
and then replace the abbreviations with the full names:
R
BR_sen_df$UF <- str_replace_all(BR_sen_df$UF,abbr_replacements)
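The cleaning steps above can be tried offline on a toy table. This is a sketch under stated assumptions: the data frame `toy_df` and its contents are made up, and it uses two variations on the lesson's code that are labeled in the comments: the header rows are detected with `grepl()` instead of a fixed stride of four, and each abbreviation is anchored with `^` and `$` so that only whole-cell matches are replaced.

```r
library(stringr)

# Made-up table: one header row per region, followed by its senators
toy_df <- data.frame(
  Nome    = c("AC - Acre", "Senator A", "Senator B",
              "RJ - Rio de Janeiro", "Senator C", "Senator D"),
  UF      = c("AC - Acre", "AC", "AC",
              "RJ - Rio de Janeiro", "RJ", "RJ"),
  Partido = c("AC - Acre", "P1", "P2",
              "RJ - Rio de Janeiro", "P1", "P1"),
  stringsAsFactors = FALSE
)

# Variation: find header rows by their " - " separator, not by position
is_header <- grepl(" - ", toy_df$UF, fixed = TRUE)

# Split the header rows into abbreviation (column 1) and name (column 2)
parts <- str_split_fixed(toy_df$UF[is_header], " - ", 2)

# Variation: anchor each abbreviation so it must match the whole cell
toy_replacements <- setNames(parts[, 2], paste0("^", parts[, 1], "$"))

# Drop the header rows, then swap abbreviations for full names
toy_clean <- toy_df[!is_header, ]
toy_clean$UF <- str_replace_all(toy_clean$UF, toy_replacements)

toy_clean
```

A named vector passed to `str_replace_all()` is treated as a set of pattern = replacement pairs, which is why the names hold the patterns and the values hold the full names.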
Plot a portion of the data
With the clean data, we can plot the number of senators each party has:
R
g <- ggplot(BR_sen_df,aes(y=Partido))
g + geom_bar()
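A bar chart is often easier to read when the bars are sorted by size. The sketch below shows one possible way to do that on made-up data; `toy_parties`, `party_counts` and the column name `Senators` are hypothetical and not part of the lesson.

```r
library(ggplot2)

# Made-up party memberships
toy_parties <- data.frame(Partido = c("P1", "P2", "P1", "P3", "P1", "P2"),
                          stringsAsFactors = FALSE)

# Count senators per party explicitly instead of letting geom_bar() count
party_counts <- as.data.frame(table(toy_parties$Partido),
                              stringsAsFactors = FALSE)
names(party_counts) <- c("Partido", "Senators")

# reorder() sorts the parties by their counts; geom_col() plots the counts as given
p <- ggplot(party_counts, aes(x = Senators, y = reorder(Partido, Senators))) +
  geom_col() +
  labs(x = "Number of senators", y = "Party")

p
```

Counting first also leaves a small summary table (`party_counts`) that can be inspected or saved alongside the plot.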
Save the data
Finally, we can save the data in a CSV file:
R
write.csv(BR_sen_df,file="BrazilianSenateMembers.csv")
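By default `write.csv()` also writes the row names as an extra first column, which shows up as an unnamed column when the file is read back. The following sketch, using made-up data and a temporary file, shows one way to avoid that and to check the round trip; `toy`, `toy_file` and `toy_back` are hypothetical names.

```r
# Made-up data to save
toy <- data.frame(Nome = c("Senator A", "Senator B"),
                  Partido = c("P1", "P2"),
                  stringsAsFactors = FALSE)

# Write to a temporary file, without the row-name column
toy_file <- tempfile(fileext = ".csv")
write.csv(toy, file = toy_file, row.names = FALSE)

# Read the file back to confirm the contents survived
toy_back <- read.csv(toy_file, stringsAsFactors = FALSE)
toy_back
```

Reading the file back is a cheap sanity check before sharing the data or using it in a later analysis.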