Scraping the web in Node.js using request and cheerio

There are many tools for scraping and retrieving content from websites, in many languages. In Java you can use Jaunt, for C# you can use this approach, and in Python there is a popular library called Scrapy. Today, I’m going to show you a simple way to scrape a website using server-side JavaScript (Node.js) with the help of Request and Cheerio.

Request is a library whose main purpose is to make HTTP requests and retrieve web content easily. It works on top of Node’s http module and simplifies the handling of calls and responses.

Cheerio is the equivalent of jQuery for Node.js. It implements the core functions of jQuery. Remember that in Node, unlike in client-side JavaScript, there isn’t a DOM. Using cheerio we can build a DOM-like structure from an HTML string and manipulate it the same way we do in client-side JavaScript using jQuery.

So, the setup for web scraping is quite simple:

  1. We will use request to retrieve HTML/XML documents from the web.
  2. We will parse the HTML content with cheerio to create a DOM.
  3. Using jQuery-like functions, we will be able to manipulate and extract content from the retrieved webpage.

Our example project will consist of two files: the Node package.json, and app.js, which contains the code to scrape the web:

{
  "name": "youtubeScrapingExample",
  "version": "0.0.1",
  "scripts": {
    "start": "node app.js"
  },
  "dependencies": {
    "cheerio": "latest",
    "request": "latest"
  }
}

// app.js
// Import request and cheerio libraries
const request = require('request');
const cheerio = require('cheerio');

// Set an example youtube url
const videoUrl = 'https://www.youtube.com/watch?v=7fYKMCCPh28';

// HTTP GET of the youtube website using request
request.get(videoUrl, function(error, response, html){
    // Stop early if the request failed or the server did not answer with 200 OK
    if (error || response.statusCode !== 200) {
        console.error('Could not retrieve %s: %s', videoUrl, error || response.statusCode);
        return;
    }
    // Use cheerio to parse and create the jQuery-like DOM based on the retrieved html string
    const $ = cheerio.load(html);
    // Find the element node which contains the title and retrieve its text (without surrounding whitespace)
    const title = $('span#eow-title').text().trim();
    // Output the result
    console.log('The title of the video %s is %s', videoUrl, title);
    // Output: The title of the video https://www.youtube.com/watch?v=7fYKMCCPh28 is The Earth: 4K Extended Edition
});

In this example we are scraping the title of a YouTube video. The video title is the content of the node span#eow-title, as you can see in the image below.

Youtube title scraping node element

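If you open the debug console on a YouTube video page and type the following code, you will get the title of the current video (in Chrome’s console, $ works as a shortcut for document.querySelector when the page does not define its own $):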

$('span#eow-title').innerText

In this case we have to take into account that the content of the website is publicly accessible with a simple HTTP GET call. One of the most powerful capabilities of request is that you can define complex requests, using any of the methods of the HTTP protocol. Also, you can provide your browser’s HTTP Archive (HAR) [in Google Chrome, you can export the HAR from the developer tools] if the website you need to scrape is too complex to access with a simple HTTP call.
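
As a rough sketch of what a more complex request could look like, the code below describes a form POST as an options object and passes it to request. The function names buildFormPost and scrapeWithOptions, the example.com URL, the User-Agent value and the q field are all invented for the illustration. The option names url, method, headers, form and timeout are the ones documented by request.

```javascript
// Sketch: describing a non-trivial request as an options object for `request`.
// The URL, header value and form field used below are hypothetical placeholders.

// Build the options for a form POST
function buildFormPost(url, fields) {
    return {
        url: url,
        method: 'POST',
        headers: { 'User-Agent': 'scraping-example/0.0.1' },
        form: fields,      // sent as application/x-www-form-urlencoded
        timeout: 10000     // milliseconds to wait before giving up
    };
}

// Send the request and hand the parsed page to `done(error, $)`.
// The libraries are required here so that building the options needs no dependency.
function scrapeWithOptions(options, done) {
    const request = require('request');
    const cheerio = require('cheerio');
    request(options, function (error, response, html) {
        if (error) {
            return done(error);
        }
        if (response.statusCode !== 200) {
            return done(new Error('Unexpected status code ' + response.statusCode));
        }
        done(null, cheerio.load(html));
    });
}

const options = buildFormPost('https://example.com/search', { q: 'earth 4k' });
console.log(options.method, options.url);
```

Calling scrapeWithOptions(options, function (error, $) { ... }) would then give you the same jQuery-like $ as in app.js. The options object also accepts a har key, which is where an exported HAR request entry would go.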

It is quite a simple example, but I hope that this mini-tutorial will help you to scrape the entire web!

 


Chrome extensions scaffolding

Today I’m going to talk about a great Yeoman generator focused on scaffolding Chrome extensions. The idea behind Yeoman, for people who don’t know it, is to set up a project with its file/folder structure, technologies, and so on, ready to start developing. Yeoman has a large number of generators (or plugins) to bootstrap projects, client-side web apps, and so on, but in this post I will focus on a ready-to-develop generator for Google Chrome extensions called generator-chrome-extension.

As I said before, Yeoman helps you to kickstart a new project, prescribing good practices and an ecosystem that speeds up development. Developers usually spend some time choosing tools and technologies and figuring out how to make them work together. For example, to build a Google Chrome extension we need to create a particular file/folder structure, with files such as manifest.json, background.js, popup.js, etc. It’s a waste of time to define this structure for every project, or to take a template and adapt it each time.

Another issue is the lack of standardization in project structure and in the technologies used. This is a handicap if you need to understand third-party developments. If Chrome extensions are developed in a similar manner, they will be easier to understand, and probably easier to extend or share.

The third advantage of using these tools is the tooling ecosystem they set up for development. generator-chrome-extension defines some tasks to make it easy to test, deploy, watch code changes, review code, package, etc. In the current version, 0.5.1, it defines some Gulp tasks to:

  • transpile ES2015 (the new ECMAScript standard) to JavaScript supported by Google Chrome. It uses Babel ES2015.
  • watch changes in your code and automatically update your extension in your browser. This way, you don’t need to care about reloading your Chrome extension during development.
  • build and package your extension, ready to deploy in the Chrome Store.
  • lint your code using ESLint to standardize your JavaScript code.
  • preprocess CSS using Sass syntax.
  • and much more.

In general, those tools help a lot during Chrome extension development, but they also help to improve our development skills, standardization, etc.

In conclusion, I talked about generator-chrome-extension, a powerful Yeoman app scaffolding project, but there are a lot of other great generators that I recommend you try.

My experience using music identifiers

Maybe at some point, while you were in a public place such as a store or a pub, you heard a song that you liked but didn’t know its name.

You probably already know what a music identifier is, but maybe you only know Shazam, because it is the most used app to identify music just by listening to it. It is great, and it works like a charm if you don’t have great expectations. If a short piece of a song is playing somewhere, it usually finds a match. Usually, but for me it has several issues that made me think about possible alternatives like SoundHound. Yes, there are more alternatives, like Sound Search by Google Play, but for me it was too simple.

In my opinion, one of the most important issues is speed: yes, the time the app needs to open. The problem with music is that it won’t wait for you until your app is open, for example in adverts, which last at most 15 seconds each. And Shazam (if you don’t have an expensive/good smartphone) takes a long time to open and start tracking the song: on my phone, more than 15 seconds, while SoundHound takes around 10.

Another important issue is recognition power. The problem nowadays is not matching the exact same song (because that is quite easy); what matters is matching similar tracks, like remixes. And if we go further, it would be great to also get a match when someone is whistling, humming or singing by themselves. Those are the new challenges. I ran some tests and SoundHound works better than Shazam in those cases, but both are far from recognizing any song sung by someone.

Besides that, tracking songs in some places is difficult due to background noise. For example, some days ago I was on the bus and the driver had the radio turned on. A song was playing and I tried to track it using SoundHound. There was a lot of noise (the engine, people speaking), but it recognized it. I was astonished; I didn’t believe it would work.

Not everything is perfect if you use SoundHound. For me, one of its greatest problems is that there is no app for Windows 10, unlike Shazam, which is a “polyglot” app. There is a website called Midomi which uses the same services as SoundHound. As far as I have read online, and from my personal experience, SoundHound is quite a bit better than Shazam: less spam, it is fast and simple, and it has a better identification success rate. It’s a good alternative to Shazam, isn’t it?

Hello world!

Welcome, everybody!

This is my first post on my new personal blog. It’s not my first time writing on a blog, but it’s my first time writing in English and also using my name. Yes, I need to introduce myself first. My name is Haritz Medina, and that’s all. Just kidding: I have a lot of things to share about my job, feelings and ideas, and that’s the main reason for opening this blog.

It’s usually difficult to give a brief introduction of myself, but I will try. I have been a computer science student, maybe, since I discovered computers; but officially, since 2008. I studied a higher technical degree (something like a two-year degree oriented towards entering the job market) at Uni Eibar-Ermua. However, I decided to continue my studies with a degree in computer science at the Faculty of Computer Science of the University of the Basque Country. And today I’m finishing a postgraduate master’s degree in the same faculty. During these studies I have worked at 3 companies related to software development. Maybe I will need another post to explain what my work was focused on. Apart from my studies and my jobs, I have many hobbies like gaming, programming, running, swimming and travelling, but my favorite one is spending time with the people I love.

As I said before, the main idea behind this blog is to share news (especially about computer science), tutorials, ideas, opinions, etc. But since my mother tongue is Spanish, I also think it is a good way to improve my English skills. So, if you spot any mistake, or you can add something to what I write here, or you are disappointed with something, feel free to comment whatever you want. There is only one rule: try not to spam too much!

I hope you’ll enjoy reading my posts!