Scraping the web in Node.js using request and cheerio.js

There are many tools for scraping and retrieving content from websites, in many languages. In Java you can use Jaunt, for C# you can use this approach, and in Python there is a popular library called Scrapy. Today, I’m going to show you a simple way to crawl a website using server-side JavaScript (Node.js) with the help of Request and Cheerio.

Request is a library whose main purpose is to make HTTP requests and retrieve web content easily. It works on top of Node's built-in http library and simplifies the handling of calls and responses.
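
To give an idea of what request saves you from writing, here is a minimal sketch of a GET call made directly with Node's built-in http and https modules. The helper name fetchHtml is hypothetical (it is not part of request or of Node), and the sketch assumes the page answers with a 200 status and a UTF-8 body.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: GET a URL and resolve with the response body as a string.
function fetchHtml(url) {
  return new Promise((resolve, reject) => {
    // Pick the built-in client that matches the URL scheme
    const client = url.startsWith('https:') ? https : http;
    client.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status code: ' + res.statusCode));
        return;
      }
      res.setEncoding('utf8');
      let body = '';
      // The body arrives in chunks and has to be assembled by hand
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}
```

It would be called as fetchHtml(someUrl).then(html => ...). Note that this sketch does not follow redirects, which request does by default; that is one of the details the library takes care of.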

Cheerio is the equivalent of jQuery for Node.js. It implements the core functions of jQuery. Remember that in Node, unlike in client-side JavaScript, there isn’t a DOM. Using cheerio we can build a DOM from an HTML string and manipulate it the same way we do in client-side JavaScript using jQuery.

So, the setup for web scraping is quite simple:

  1. We will use request to retrieve HTML/XML documents from the web.
  2. We will parse the HTML content with cheerio to create a DOM.
  3. Using jQuery-like functions, we will manipulate and extract content from the retrieved webpage.

Our example project will consist of two files: the Node package.json, and app.js, which contains the code to scrape the web:

{
  "name": "youtubeScrapingExample",
  "version": "0.0.1",
  "scripts": {
    "start": "node app.js"
  },
  "dependencies": {
    "cheerio": "latest",
    "request": "latest"
  }
}

// app.js
// Import request and cheerio libraries
const request = require('request');
const cheerio = require('cheerio');

// Set an example youtube url
let videoUrl = 'https://www.youtube.com/watch?v=7fYKMCCPh28';

// HTTP GET of the youtube website using request
request.get(videoUrl, function(error, response, html){
    // Stop early if the request failed or the server did not answer with 200 OK
    if (error || response.statusCode !== 200) {
        console.error('Could not retrieve %s: %s', videoUrl, error || response.statusCode);
        return;
    }
    // Use cheerio to parse and create the jQuery-like DOM based on the retrieved html string
    let $ = cheerio.load(html);
    // Find the element node which contains the title and retrieve its text
    let title = $('span#eow-title').text();
    // Output the result
    console.log('The title of the video %s is %s', videoUrl, title);
    // Output: The title of the video https://www.youtube.com/watch?v=7fYKMCCPh28 is The Earth: 4K Extended Edition
});

In this example we are scraping the title of a YouTube video. The video title is the content of the node span#eow-title, as you can see in the image below.

Youtube title scraping node element

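If you open the debug console on a YouTube video page and run the following code, you will get the title of the current video (in the Chrome and Firefox consoles, $ is a shorthand for document.querySelector unless the page defines its own $, so it returns a plain element with an innerText property):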

$('span#eow-title').innerText

In this case we have to take into account that the content of the website is publicly accessible through a plain HTTP GET call. One of the most powerful capabilities of request is that you can define complex requests, using any of the methods of the HTTP protocol. You can also provide your browser’s HTTP Archive (HAR) [in Google Chrome, you can export the HAR from the developer tools] if the website you need to scrape is too complex to access with a simple HTTP call.
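
As a rough illustration of a more complex request, request also accepts an options object instead of a bare URL. The sketch below only builds such an object; the helper name buildRequestOptions and the header values are assumptions chosen for the example, while url, method, headers, gzip, timeout and form are options documented by request.

```javascript
// Hypothetical helper: build an options object for request(options, callback).
function buildRequestOptions(url, formData) {
  const options = {
    url: url,
    // Use POST when there is form data to send, GET otherwise
    method: formData ? 'POST' : 'GET',
    headers: {
      // Illustrative values; some sites answer differently depending on these
      'User-Agent': 'Mozilla/5.0 (compatible; scraping-example)',
      'Accept-Language': 'en-US,en;q=0.8'
    },
    gzip: true,     // accept and decompress gzip-encoded responses
    timeout: 10000  // give up after 10 seconds (value in milliseconds)
  };
  if (formData) {
    options.form = formData; // sent as application/x-www-form-urlencoded
  }
  return options;
}

console.log(buildRequestOptions('https://www.youtube.com/watch?v=7fYKMCCPh28'));
```

The resulting object can be passed as the first argument in place of the URL, for example request(buildRequestOptions(videoUrl), function(error, response, html){ ... }).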

It is quite a simple example, but I hope that this mini-tutorial will help you scrape the entire web!

 
