15 Web scraping

🧩 Learning Goals

By the end of this lesson, you should be able to: - Use CSS Selectors and the Selector Gadget tool to locate data of interest within a webpage - Use the html_elements(), html_text(), and html_attr() functions within the rvest packages to scrape data from webpage using CSS selectors.

When to Use

We have talked about how to acquire data from APIs. Whenever an API is available for your project, you should default to getting data from the API. Sometimes an API will not be available, and web scraping is another means of getting data.

Web scraping describes the use of code to extract information displayed on a web page. In R, the rvest package offers tools for scraping. (rvest is meant to sound like “harvest”.)

Ethics

robots.txt is a file that some websites includes to clarify what can and cannot be scraped in addition to other constraints about scraping. When a website includes this file, we need to comply with the information in it for moral and legal reasons.

We will look through the information in ZenRows Tutorial How to Read robots.txt for Web Scraping and apply this to the NIH robots.txt file.

From our investigation of the NIH robots.txt, we learn:

User-agent: *: Anyone is allowed to scrape
No Crawl-delay entry: no restriction on the scraping speed
No Visit-time entry: no restrictions on time of day that scraping is allowed
No Request-rate entry: no restrictions on simultaneous requests
No mention of ?page=, news-events, news-releases, or https://science.education.nih.gov/ in the Disallow sections. (This is what we want to scrape today.)

Further considerations

The Towards Data Science article Ethics in Web Scraping–not available as of 10/24/2025 :( describes some good principles to ensure that we are valuing the labor that website owners invested to provide data and creating good from the information we do scrape.

HTML structure

HTML (hypertext markup language) is the formatting language used to create webpages. Let’s look at the core parts of HTML from the rvest vignette.

Key HTML Concepts:

Elements which consist of a start tag (e.g. <tag>) and end with an end tag (e.g. </tag>)
Optional attributes (id='first') within the start tag
Contents of an element

Key CSS Concepts:

To identify part of the html code, we can use the tag type (e.g. <a>, <h1>, <p>) and class or id attributes.
Each element can have a class attribute and/or an id attribute.
We can also identify parts of the code based on whether they are children or descendants of other elements, in that they are contained within an element. We won’t need this for simple applications.

Finding CSS Selectors

In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the NIH News Releases page, we can see that the data is represented in a consistent pattern of image + title + abstract.

We will identify data in a web page using a pattern matching language called CSS Selectors that can refer to specific patterns in HTML, the language used to write web pages.

For example:

Selecting by element/tag type:
- "a" selects all hyperlinks in a webpage (“a” represents “anchor” links in HTML)
- "p" selects all paragraph elements
Selecting by class and ID:
- ".description" selects all elements with class equal to “description”
  - The . at the beginning is what signifies class selection.
  - This is one of the most common CSS selectors for scraping because in HTML, the class attribute is extremely commonly used to format webpage elements. (Any number of HTML elements can have the same class, which is not true for the id attribute.)
- "#mainTitle" selects the SINGLE element with id equal to “mainTitle”
  - The # at the beginning is what signifies id selection.

<p class="title">Title of resource 1</p>
<p class="description">Description of resource 1</p>

<p class="title">Title of resource 2</p>
<p class="description">Description of resource 2</p>

Warning: Websites change often! So if you are going to scrape a lot of data, it is probably worthwhile to save and date a copy of the website (File > Save As). Otherwise, you may return after some time and your scraping code will include all of the wrong CSS selectors.

Although you can learn how to use CSS Selectors by hand (CSS Selectors), we will use a shortcut by installing the Selector Gadget tool.

There is a version available for Chrome–add it to Chrome via the Chome Web Store.
- Make sure to pin the extension to the menu bar. (Click the 3 dots > Extensions > Manage extensions. Click the “Details” button under SelectorGadget and toggle the “Pin to toolbar” option.)
There is also a version that can be saved as a bookmark in the browser–see here.

Let’s watch the Selector Gadget tutorial video before proceeding.

Example 1: NIH New Releases

Head over to the NIH News Releases page.

Click the Selector Gadget extension icon or bookmark button.

As you mouse over the webpage, different parts will be highlighted in orange.

Click on the title of the first news release.

You’ll notice that the Selector Gadget information in the lower right describes what you clicked on.

Scroll through the page to verify that only the information you intend (the description paragraph) is selected.

The selector panel shows the CSS selector (.thumbnail-teaser__heading) and the number of matches for that CSS selector (10). (You may have to be careful with your clicking–there are two overlapping boxes, and clicking on the link of the title can lead to the CSS selector of “a”.)

Exercise: Repeat the process above to find the correct selectors for the following fields. Make sure that each matches 10 results:

The publication date
The article abstract paragraph (which will also include the publication date) .thumbnail-teaser__description

Retrieving Data Using `rvest` and CSS Selectors

Now that we have identified CSS selectors for the information we need, let’s fetch the data using the rvest package.

Code

nih <- read_html("https://www.nih.gov/news-events/news-releases")

Once the webpage is loaded, we can retrieve data using the CSS selectors we specified earlier. The following code retrieves the article titles:

Code

# Retrieve and inspect article titles
article_titles <- nih |>
    html_elements(".thumbnail-teaser__heading") |>
    html_text()
head(article_titles)

[1] "HHS Doubles AI-Backed Childhood Cancer Research Funding"                                                   
[2] "Secretary Kennedy Swears in Dr. Anthony Letai as Director of the National Cancer Institute"                
[3] "NIH establishes nation's first dedicated organoid development center to reduce reliance on animal modeling"
[4] "NIH launches $50M Autism Data Science Initiative to unlock causes and improve outcomes"                    
[5] "Repeated head impacts cause early neuron loss and inflammation in young athletes"                          
[6] "NIH launches landmark project on whole-person health and function"

Exercise: Our goal is to get article titles, publication dates, and abstract text for news releases across several pages of results.

Before doing any coding, plan your approach.

What functions will you write?
What arguments will they have?
How will you use your functions?

Consult with your peers and compare plans.

Stop to reflect

Write a few observations about your comfort/confidence in this planning exercise. As you proceed through the implementation of this plan in the next steps, make notes about any places you struggled, were uncertain, or benefited from peer input.

Exercise: Carry out your plan to get the article title, publication date, and abstract text for the first 5 pages of news releases in a single data frame. You will need to write at least one function, and you will need iteration–Start with an appropriate map() function from purrr. If you have time, also try to use a for loop instead.

Notes:

Mouse over the page buttons at the very bottom of the news home page to see what the URLs look like.
The abstract should not have the publication date–use stringr and regular expressions to remove the publication date.
If the website has Crawl-delay: x in its robots.txt, include Sys.sleep(x) in your function to respect that.
Recall that bind_rows() from dplyr and list_rbind() from purrr takes a list of data frames and stacks them on top of each other.

Code

get_text_element <- function(page, element) {
  page %>%
    html_elements(element) %>%
    html_text()
}

scrape <- function(url) {
  Sys.sleep(2) 
  page <- read_html(url)
  title <- get_text_element(page, ".thumbnail-teaser__heading") 
  text <- get_text_element(page, ".thumbnail-teaser__description")
  date <- str_extract(text, "^([a-zA-Z0-9, ]+)—")
  date <- str_remove(date, " —")
  text <- str_remove(text, "^([a-zA-Z0-9, ]+)—")

  tibble(
    title = title,
    date = date,
    abstract = text
    )
}

base_url <- "https://www.nih.gov/news-events/news-releases"
urls_all_pages <- c(base_url, str_c(base_url, "?page=", 1:4))

pages <- purrr::map(urls_all_pages, scrape)
df_articles <- bind_rows(pages)
head(df_articles)

# A tibble: 6 × 3
  title                                                           date  abstract
  <chr>                                                           <chr> <chr>   
1 HHS Doubles AI-Backed Childhood Cancer Research Funding         Sept… "   Fol…
2 Secretary Kennedy Swears in Dr. Anthony Letai as Director of t… Sept… "   Dr.…
3 NIH establishes nation's first dedicated organoid development … Sept… "   The…
4 NIH launches $50M Autism Data Science Initiative to unlock cau… Sept… "   Fun…
5 Repeated head impacts cause early neuron loss and inflammation… Sept… "   NIH…
6 NIH launches landmark project on whole-person health and funct… Sept… "   App…

Example 2: NIH STEM Teaching Resources

Let’s look at a more complex example with the NIH STEM Teaching Resources webpage.

Using Selector Gadget to select the resource titles can be tricky if we can only get one resource title at a time.
In Chrome, you can right click part of a web page and click “Inspect”. This opens up Chrome’s Developer Tools. Mousing over the HTML in the top right panel highlights the corresponding part of the web page.
- For non-Chrome browsers, use the Help menu to search for Developer Tools.

The underlying HTML used to create a web page is also called the page source code or page source.

We learn from this that the resource titles are <h4> headings that have the class resource-title.

We can infer from this that .resource-title would be the CSS selector for the resource titles. ## Done!

Check the ICA Instructions for how to (a) push your code to GitHub and (b) update your portfolio website

--- title: "15 Web scraping" --- ## 🧩 Learning Goals By the end of this lesson, you should be able to: - Use CSS Selectors and the Selector Gadget tool to locate data of interest within a webpage - Use the `html_elements()`, `html_text()`, and `html_attr()` functions within the `rvest` packages to scrape data from webpage using CSS selectors. ```{r setup_15, echo=FALSE, message=FALSE} library(tidyverse) library(rvest) library(stringr) ``` ## When to Use We have talked about how to acquire data from APIs. Whenever an API is available for your project, you should default to getting data from the API. Sometimes an API will not be available, and **web scraping** is another means of getting data. Web scraping describes the use of code to extract information displayed on a web page. In R, the `rvest` package offers tools for scraping. (`rvest` is meant to sound like "harvest".) ## Ethics `robots.txt` is a file that some websites includes to clarify what can and cannot be scraped in addition to other constraints about scraping. When a website includes this file, we need to comply with the information in it for moral and legal reasons. We will look through the information in ZenRows Tutorial [How to Read robots.txt for Web Scraping](https://www.zenrows.com/blog/robots-txt-web-scraping) and apply this to the [NIH robots.txt file](https://www.nih.gov/robots.txt). From our investigation of the NIH `robots.txt`, we learn: - `User-agent: *`: Anyone is allowed to scrape - No `Crawl-delay` entry: no restriction on the scraping speed - No `Visit-time` entry: no restrictions on time of day that scraping is allowed - No `Request-rate` entry: no restrictions on simultaneous requests - No mention of `?page=`, `news-events`, `news-releases`, or `https://science.education.nih.gov/` in the `Disallow` sections. (This is what we want to scrape today.) **Further considerations** The Towards Data Science article [Ethics in Web Scraping--not available as of 10/24/2025 :(](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01) describes some good principles to ensure that we are valuing the labor that website owners invested to provide data and creating good from the information we do scrape. ## HTML structure HTML (hypertext markup language) is the formatting language used to create webpages. Let's look at the core parts of HTML from the [rvest vignette](https://cran.r-project.org/web/packages/rvest/vignettes/rvest.html). Key HTML Concepts: - **Elements** which consist of a start tag (e.g. `<tag>`) and end with an end tag (e.g. `</tag>`) - Optional **attributes** (`id='first'`) within the start tag - **Contents** of an element Key CSS Concepts: - To identify part of the html code, we can use the tag type (e.g. `<a>`, `<h1>`, `<p>`) and class or id attributes. - Each element can have a **class** attribute and/or an **id** attribute. - We can also identify parts of the code based on whether they are children or descendants of other elements, in that they are contained within an element. We won't need this for simple applications. ## Finding CSS Selectors In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the [NIH News Releases page](https://www.nih.gov/news-events/news-releases), we can see that the data is represented in a consistent pattern of image + title + abstract. We will identify data in a web page using a pattern matching language called [CSS Selectors](https://css-tricks.com/how-css-selectors-work/) that can refer to specific patterns in HTML, the language used to write web pages. For example: - Selecting by element/tag type: - `"a"` selects all hyperlinks in a webpage ("a" represents "anchor" links in HTML) - `"p"` selects all paragraph elements - Selecting by class and ID: - `".description"` selects all elements with `class` equal to "description" - The `.` at the beginning is what signifies `class` selection. - This is one of the most common CSS selectors for scraping because in HTML, the `class` attribute is extremely commonly used to format webpage elements. (Any number of HTML elements can have the same `class`, which is not true for the `id` attribute.) - `"#mainTitle"` selects the SINGLE element with **id** equal to "mainTitle" - The `#` at the beginning is what signifies `id` selection. ```html <p class="title">Title of resource 1</p> <p class="description">Description of resource 1</p> <p class="title">Title of resource 2</p> <p class="description">Description of resource 2</p> ``` **Warning**: Websites change often! So if you are going to scrape a lot of data, it is probably worthwhile to save and date a copy of the website (File > Save As). Otherwise, you may return after some time and your scraping code will include all of the wrong CSS selectors. Although you can learn how to use CSS Selectors by hand ([CSS Selectors](https://css-tricks.com/how-css-selectors-work/)), we will use a shortcut by installing the [Selector Gadget](http://selectorgadget.com/) tool. - There is a version available for Chrome--add it to Chrome via the [Chome Web Store](https://chromewebstore.google.com/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb). - Make sure to pin the extension to the menu bar. (Click the 3 dots > Extensions > Manage extensions. Click the "Details" button under SelectorGadget and toggle the "Pin to toolbar" option.) - There is also a version that can be saved as a bookmark in the browser--see [here](https://rvest.tidyverse.org/articles/selectorgadget.html). Let's watch the Selector Gadget [tutorial video](http://selectorgadget.com/) before proceeding. ## Example 1: NIH New Releases Head over to the [NIH News Releases page](https://www.nih.gov/news-events/news-releases). Click the Selector Gadget extension icon or bookmark button. As you mouse over the webpage, different parts will be highlighted in orange. Click on the title of the first news release. You'll notice that the Selector Gadget information in the lower right describes what you clicked on. Scroll through the page to verify that only the information you intend (the description paragraph) is selected. - The selector panel shows the CSS selector (`.thumbnail-teaser__heading`) and the number of matches for that CSS selector (10). (You may have to be careful with your clicking--there are two overlapping boxes, and clicking on the link of the title can lead to the CSS selector of "a".) **Exercise:** Repeat the process above to find the correct selectors for the following fields. Make sure that each matches 10 results: - The publication date - The article abstract paragraph (which will also include the publication date) .thumbnail-teaser__description ## Retrieving Data Using `rvest` and CSS Selectors Now that we have identified CSS selectors for the information we need, let's fetch the data using the `rvest` package. ```{r read_html} nih <- read_html("https://www.nih.gov/news-events/news-releases") ``` Once the webpage is loaded, we can retrieve data using the CSS selectors we specified earlier. The following code retrieves the article titles: ```{r html_elements_text} # Retrieve and inspect article titles article_titles <- nih |> html_elements(".thumbnail-teaser__heading") |> html_text() head(article_titles) ``` **Exercise:** Our goal is to get article titles, publication dates, and abstract text for news releases across several pages of results. **Before doing any coding,** plan your approach. - What functions will you write? - What arguments will they have? - How will you use your functions? Consult with your peers and compare plans. ::: {.callout-tip title="Stop to reflect"} Write a few observations about your comfort/confidence in this planning exercise. As you proceed through the implementation of this plan in the next steps, make notes about any places you struggled, were uncertain, or benefited from peer input. ::: **Exercise:** Carry out your plan to get the article title, publication date, and abstract text for the first 5 pages of news releases in a single data frame. You will need to write at least one function, and you will need iteration--Start with an appropriate `map()` function from `purrr`. If you have time, also try to use a `for` loop instead. Notes: - Mouse over the page buttons at the very bottom of the news home page to see what the URLs look like. - The abstract should not have the publication date--use `stringr` and regular expressions to remove the publication date. - If the website has `Crawl-delay: x` in its `robots.txt`, include `Sys.sleep(x)` in your function to respect that. - Recall that `bind_rows()` from `dplyr` and `list_rbind()` from `purrr` takes a list of data frames and stacks them on top of each other. ```{r} get_text_element <- function(page, element) { page %>% html_elements(element) %>% html_text() } scrape <- function(url) { Sys.sleep(2) page <- read_html(url) title <- get_text_element(page, ".thumbnail-teaser__heading") text <- get_text_element(page, ".thumbnail-teaser__description") date <- str_extract(text, "^([a-zA-Z0-9, ]+)—") date <- str_remove(date, " —") text <- str_remove(text, "^([a-zA-Z0-9, ]+)—") tibble( title = title, date = date, abstract = text ) } base_url <- "https://www.nih.gov/news-events/news-releases" urls_all_pages <- c(base_url, str_c(base_url, "?page=", 1:4)) pages <- purrr::map(urls_all_pages, scrape) df_articles <- bind_rows(pages) head(df_articles) ``` ## Example 2: NIH STEM Teaching Resources Let's look at a more complex example with the [NIH STEM Teaching Resources webpage](https://science.education.nih.gov/). - Using Selector Gadget to select the resource titles can be tricky if we can only get one resource title at a time. - In Chrome, you can right click part of a web page and click "Inspect". This opens up Chrome's Developer Tools. Mousing over the HTML in the top right panel highlights the corresponding part of the web page. - For non-Chrome browsers, use the Help menu to search for Developer Tools. The underlying HTML used to create a web page is also called the page **source code** or **page source**. We learn from this that the resource titles are `<h4>` headings that have the class `resource-title`. We can infer from this that `.resource-title` would be the CSS selector for the resource titles. ## Done! - Check the ICA Instructions for how to (a) push your code to GitHub and (b) update your portfolio website

🧩 Learning Goals

When to Use

Ethics

HTML structure

Finding CSS Selectors

Example 1: NIH New Releases

Retrieving Data Using rvest and CSS Selectors

Example 2: NIH STEM Teaching Resources

Retrieving Data Using `rvest` and CSS Selectors