Scraping the web in Node.js using request and cheerio

There are multiple tools to scrape and retrieve content from websites in many languages. In Java you can use Jaunt, in C# there are similar approaches, and in Python there is a popular library called Scrapy. Today, I’m going to show you a simple way to crawl a website using server-side JavaScript (Node.js) with the help of Request and Cheerio.

Request is a library whose main purpose is to make HTTP requests and retrieve web content easily. It works on top of Node’s built-in http module and simplifies issuing calls and handling responses.
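To give an idea, here is a minimal sketch of a GET call with request (the URL is just an illustrative placeholder):

const request = require('request');

// Fetch a page and inspect the response
request.get('https://example.com', function(error, response, body){
    if (error) {
        return console.error('Request failed:', error);
    }
    console.log('Status code:', response.statusCode);
    console.log('Received %d characters of HTML', body.length);
});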

Cheerio is the Node.js equivalent of jQuery: it implements the core functions of jQuery. Remember that in Node, unlike in client-side JavaScript, there is no DOM. Using cheerio we can build a DOM from an HTML string and manipulate it just as we do in client-side JavaScript with jQuery.
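As a minimal sketch of cheerio on its own (the HTML string is made up for illustration):

const cheerio = require('cheerio');

// Build a DOM from a plain HTML string
const $ = cheerio.load('<ul><li class="first">one</li><li>two</li></ul>');

// Query it with jQuery-like selectors
console.log($('li.first').text()); // -> one
console.log($('li').length);       // -> 2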

So, the setup for web scraping is quite simple:

  1. We will use request to retrieve HTML/XML documents from the web
  2. We will parse the HTML content and create a DOM using cheerio
  3. Using jQuery-like functions we will be able to manipulate and extract content from the retrieved webpage.

Our example project will consist of two files: the package.json and an app.js which contains the code to scrape the web:

{
  "name": "youtubeScrapingExample",
  "version": "0.0.1",
  "scripts": {
    "start": "node app.js"
  },
  "dependencies": {
    "cheerio": "latest",
    "request": "latest"
  }
}

// Import the request and cheerio libraries
const request = require('request');
const cheerio = require('cheerio');

// Set an example youtube url
let videoUrl = 'https://www.youtube.com/watch?v=7fYKMCCPh28';

// HTTP GET of the youtube website using request
request.get(videoUrl, function(error, response, html){
    // Use cheerio to parse and create the jQuery-like DOM based on the retrieved html string
    let $ = cheerio.load(html);
    // Find the element node which contains the title and retrieve it's text
    let title = $('span#eow-title').text();
    // Output the result
    console.log('The title of the video %s is %s', videoUrl, title);
    // Output: The title of the video https://www.youtube.com/watch?v=7fYKMCCPh28 is The Earth: 4K Extended Edition
});
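To run the example, install the dependencies with npm install and launch it with npm start, which runs the start script defined in package.json above.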

In this example we are scraping the title of a YouTube video. The video title is the text content of the node span#eow-title, as you can see in the image below.

[Image: YouTube title scraping node element]

If you open the developer console on a YouTube video page and run the following snippet, you will get the title of the current video (here $ is the DevTools shorthand for document.querySelector):

$('span#eow-title').innerText

In this case we have to take into account that the content of the website is publicly accessible with a simple HTTP GET call. One of the most powerful capabilities of request is that you can define complex requests using any method of the HTTP protocol, along with custom headers, form data, and so on. You can also provide your browser’s HTTP Archive (HAR) [in Google Chrome, you can export the HAR from the developer tools] if the website you need to scrape is too complex to access with a simple HTTP call.
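As a hedged sketch (the URL, headers, and form fields below are illustrative placeholders, not a real endpoint), a more complex request could look like this:

// A POST request with custom headers and form data.
// request also accepts a `har` option containing a HAR request object.
request({
    method: 'POST',
    url: 'https://example.com/login',
    headers: {
        'User-Agent': 'my-scraper/0.0.1'
    },
    form: {
        user: 'john',
        password: 'secret'
    }
}, function(error, response, body){
    if (error) {
        return console.error('Request failed:', error);
    }
    console.log('Status code:', response.statusCode);
});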

It is quite a simple example, but I hope this mini-tutorial will help you scrape the entire web!

 

Recommended reads and related links

Chrome extensions scaffolding

Today I’m going to talk about a great Yeoman generator focused on Chrome extension scaffolding. The idea behind Yeoman, for those who don’t know it, is to generate a project with its file/folder structure, technologies, and so on, ready to start developing. Yeoman has a large catalog of generators (or plugins) to bootstrap projects, client-side web apps, and more, but in this post I will focus on a ready-to-develop Google Chrome extension generator called generator-chrome-extension.

As I said before, Yeoman helps you kickstart a new project, prescribing good practices and an ecosystem for faster development. Developers usually spend time choosing tools and technologies and figuring out how to make them work together. For example, to create a Google Chrome extension we need a particular file/folder structure: a manifest.json, background.js, popup.js, etc. It is a waste of time to define this structure on every project, or to keep a template and adapt it for each one.

Another issue is the lack of standardization in project structure and technologies. This is a handicap when you need to understand third-party code. If Chrome extensions are developed in a similar manner, they become easier to understand, and probably more widely extensible and shareable.

The third advantage of using these tools is the development ecosystem they generate. generator-chrome-extension defines tasks to ease testing, deployment, watching code changes, code review, packaging, etc. In the current version, 0.5.1, it defines Gulp tasks to (see the sketch after this list):

  • transpile ES2015 (the new ECMAScript standard) to JavaScript supported by Google Chrome, using Babel.
  • watch changes in your code and automatically update the extension in your browser, so you don’t need to worry about reloading the extension during development.
  • build and package your extension, ready to deploy to the Chrome Web Store.
  • lint your JavaScript code with ESLint to standardize its style.
  • preprocess CSS using Sass syntax.
  • and much more.
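To give an idea of what these tasks look like, here is a hedged sketch of a transpile-and-watch pair in Gulp (the paths and plugin setup are illustrative, not the generator’s exact gulpfile):

const gulp = require('gulp');
const babel = require('gulp-babel');

// Transpile ES2015 sources into JavaScript Chrome can run
gulp.task('scripts', function(){
    return gulp.src('app/scripts.babel/**/*.js')
        .pipe(babel({ presets: ['es2015'] }))
        .pipe(gulp.dest('app/scripts'));
});

// Re-run the transpile step whenever a source file changes
gulp.task('watch', ['scripts'], function(){
    gulp.watch('app/scripts.babel/**/*.js', ['scripts']);
});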

In general, these tools help a lot during Chrome extension development, but they also help us improve our development skills, standardization, and more.

In conclusion, I talked about generator-chrome-extension, a powerful Yeoman scaffolding project, but there are plenty of other great generators that I recommend giving a try.