Paperwizard: Scrape News Sites using readability.js

R package
Author

David Schoch

Published

February 12, 2025

In this blog post, I want to introduce my last academically motivated R package, paperwizard. The package is designed to extract readable content (such as news articles) from webpages using Readability.js. To do so, the package leverages Node.js to parse webpages and identify the main content of an article, allowing you to work with cleaner, structured content.

The package is meant as an add-on for paperboy, which implements custom scrapers for many international news websites.

Installation

You can install the package from GitHub

pak::pak("schochastics/paperwizard")

or from r-universe

install.packages("paperwizard", repos = c("https://schochastics.r-universe.dev", "https://cloud.r-project.org"))

Setup

To use paperwizard, you need to have Node.js installed. Download and install Node.js from the official website, which offers instructions for all major operating systems. After installing Node.js, you can confirm the installation by running the following command in your terminal.

node -v

This should return the version of Node.js installed.

To make sure that the package knows where the node command can be found, set

options(paperwizard.node_path = "/path/to/node")

if it is not installed in a standard location.
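
If you are unsure where Node.js ended up, base R's Sys.which() can locate the executable, provided node is on your PATH; a minimal sketch:

# locate the node executable (only works if it is on the PATH)
node_path <- unname(Sys.which("node"))
options(paperwizard.node_path = node_path)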

Once Node.js is installed, you need to install the required Node.js libraries: linkedom, Readability.js, puppeteer, and axios. The package provides a convenient wrapper for this.

pw_npm_install()

Using the package

You can use it either by supplying a URL

pw_deliver(url)

or a data.frame that was created by paperboy::pb_collect()

x <- paperboy::pb_collect(list_or_urls)
pw_deliver(x)

Example

To get more insight into the returned object, let us fetch a recent article from The Conversation.

url <- "https://theconversation.com/generative-ai-online-platforms-and-compensation-for-content-the-need-for-a-new-framework-242847"
article <- paperwizard::pw_deliver(url)
str(article)
tibble [1 × 9] (S3: tbl_df/tbl/data.frame)
 $ url         : chr "https://theconversation.com/generative-ai-online-platforms-and-compensation-for-content-the-need-for-a-new-framework-242847"
 $ expanded_url: chr "https://theconversation.com/generative-ai-online-platforms-and-compensation-for-content-the-need-for-a-new-framework-242847"
 $ domain      : chr "theconversation.com"
 $ status      : num 200
 $ datetime    : POSIXct[1:1], format: "2025-02-10 16:14:29"
 $ author      : chr "Thomas Paris"
 $ headline    : chr "Generative AI, online platforms and compensation for content: the need for a new framework"
 $ text        : chr "The emergence of generative artificial intelligence has put the issue of compensation for content producers bac"| __truncated__
 $ misc        :List of 1
  ..$ :List of 10
  .. ..$ title        : chr "Generative AI, online platforms and compensation for content: the need for a new framework"
  .. ..$ byline       : chr "Thomas Paris"
  .. ..$ dir          : chr ""
  .. ..$ lang         : chr "en-EUROPE"
  .. ..$ content      : chr "<DIV class=\"page\" id=\"readability-page-1\"><div itemprop=\"articleBody\">\n    <p><strong>The emergence of g"| __truncated__
  .. ..$ textContent  : chr "\n    The emergence of generative artificial intelligence has put the issue of compensation for content produce"| __truncated__
  .. ..$ length       : int 5708
  .. ..$ excerpt      : chr "How will content creators be compensated for material used by artificial intelligence? Disputes involving tech "| __truncated__
  .. ..$ siteName     : chr "The Conversation"
  .. ..$ publishedTime: chr "2025-02-10T16:14:29Z"

Most fields should be self-explanatory. The misc field is a dump of the raw return values of the scraper, useful for debugging or for retrieving additional information that is not available in the standard fields.
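
For example, the excerpt and site name reported by Readability.js can be pulled out of misc directly; a small sketch based on the structure shown above:

# misc holds the raw Readability.js output, one list entry per article
raw <- article$misc[[1]]
raw$excerpt    # short teaser generated by Readability.js
raw$siteName   # "The Conversation"
raw$length     # number of characters of the extracted text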

Paperboy vs. Paperwizard

As I said in the introduction, paperwizard is meant to be an add-on to paperboy. Generally, it is always better to have a dedicated scraper for a news site, but building and maintaining such a scraper, let alone dozens of them, is a lot of work. Paperwizard can help in situations where a dedicated scraper is either not available or currently broken. Given its generality, however, there is no guarantee that it will work for any given site without issues. It is always a good idea to at least check a few examples manually to verify that the scraper worked; a quick spot check is sketched below.
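
For instance, a manual spot check could look at the extracted headlines and the beginning of each text. The URLs below are placeholders, and pw_deliver() is called once per URL rather than assuming it is vectorised:

# hypothetical example URLs, replace with sites you actually scrape
urls <- c(
  "https://example.com/article-1",
  "https://example.com/article-2"
)
articles <- do.call(rbind, lapply(urls, paperwizard::pw_deliver))
articles$headline               # do the headlines look plausible?
substr(articles$text, 1, 300)   # skim the start of each extracted text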

Important Considerations

While web scraping is a valuable tool for data collection, it is essential for researchers to approach it responsibly. Responsible web scraping helps ensure that data is collected ethically, legally, and in ways that protect both the integrity of the website and the privacy of individuals whose data may be included. If you are new to the topic, you can find some help in this GESIS DBD Guide.

Citation

BibTeX citation:
@online{schoch2025,
  author = {Schoch, David},
  title = {Paperwizard: {Scrape} {News} {Sites} Using Readability.js},
  date = {2025-02-12},
  url = {http://blog.schochastics.net/posts/2025-02-12_paperwizard-scrape-news-articles/},
  langid = {en}
}
For attribution, please cite this work as:
Schoch, David. 2025. “Paperwizard: Scrape News Sites Using Readability.js.” February 12, 2025. http://blog.schochastics.net/posts/2025-02-12_paperwizard-scrape-news-articles/.