Processing Big Data

Let's dive right into it by looking at a classic Node problem: counting all the modules available on npm. The npm registry exposes an HTTP endpoint from which we can fetch the entire contents of the registry as JSON.

Using the command-line tool curl, which is included with (or at least installable on) most operating systems, we can try it out:

$ curl 'https://skimdb.npmjs.com/registry/_changes?include_docs=true'

This prints a newline-delimited JSON stream of all the modules.

The stream returned by the registry contains one JSON object for each module stored on npm, with each object followed by a newline character.
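
Since the full download is huge, a handy way to get a feel for the format is to pipe the output through head and look at just the first few lines (the -s flag silences curl's progress output):

$ curl -s 'https://skimdb.npmjs.com/registry/_changes?include_docs=true' | head -n 5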

A simple Node program that counts all modules could look like this:

var request = require('request')
var npmDb = 'https://skimdb.npmjs.com'
var registryUrl = `${npmDb}/registry/_changes?include_docs=true`

// The request callback receives (error, response, body); body is the
// entire response buffered into memory as a single string
request(registryUrl, function (err, response, body) {
  if (err) throw err
  // One JSON object per line, so the line count roughly equals the module count
  var numberOfLines = body.split('\n').length
  console.log('Total modules on npm: ' + numberOfLines)
})
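
Assuming we save the program as count.js (the file name here is just an example), we can run it with:

$ node count.js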

If we try to run the preceding program, we'll notice a couple of things.

First, the program takes quite a long time to run. Second, depending on the machine we are using, there is a very good chance it will crash with an out-of-memory error.

Why is this happening?

The npm registry stores a very large amount of JSON data, and our program buffers the entire response in memory before it can count anything. Holding all of that data at once takes more memory than many machines have available.

In this recipe, we'll investigate how we can use streams to improve our program.
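
As a rough preview of where we're headed, here is a minimal sketch (not the final recipe code) that consumes the response as a stream and counts newlines chunk by chunk, so only a small piece of the data is held in memory at any one time:

var request = require('request')
var npmDb = 'https://skimdb.npmjs.com'
var registryUrl = `${npmDb}/registry/_changes?include_docs=true`

var count = 0
request(registryUrl)
  .on('data', function (chunk) {
    // chunk is a Buffer; count the newline bytes (10 === '\n')
    // as they arrive instead of buffering the whole response
    for (var i = 0; i < chunk.length; i++) {
      if (chunk[i] === 10) count++
    }
  })
  .on('end', function () {
    console.log('Total modules on npm: ' + count)
  })
  .on('error', function (err) { throw err })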