Sina Samangooei's thoughts. Musings of an engineer trapped in an academic's body.

MagicalThinking.

Learning node.js - Introducing jsgrep

So TL;DR: I made a tool using node.js, jsdom and jQuery, and it is on GitHub. Enjoy.

Last week we finished porting a really cool face detector developed by some guys over at Oxford’s robotics research group back in 2007. To test the face detector I thought it would be cool to be able to detect members of the department. As a training set I thought their public profile pictures from ECS People would be nice to grab. A light amount of investigation seemed to show that the SPARQL endpoint didn’t give easy access to the images. This meant some HTML grep goodness. But this didn’t feel nice. My immediate desire was to do a jQuery DOM selection of all the appropriate image tags, print all their src attributes and then… pass the output to curl or something?

It occurred to me that it would be much easier to do this via the command line rather than using Firebug and some copypasta. But To My Knowledge™ no such tool exists. In the end I decided it would be nice if there was a tool that let me do something like this:

jsgrep -s <jqueryLikeDomSelector> [-a <someAttribute>] [-r] [-nt]

You’d use -r to print the elements recursively (i.e. with all their children) and you’d use -nt to not print the actual tags (i.e. just their content). You could then pipe the output to a large chain of lovely unix tools and get what you wanted done much more efficiently than having to grep your way around HTML tags.

An excuse to use node.js


So jQuery is a JS library, right? I’d recently heard about V8 and node.js and I thought, “finally, an excuse!”. So I went about implementing my first nodejs program, one that read off the standard input and spat stuff to the standard out. Here is what I came up with:

var stdin = process.openStdin();
stdin.setEncoding('utf8');

// Accumulate everything that arrives on stdin
var input = "";
stdin.on('data', function (chunk) {
	input += chunk;
});
// Once stdin is closed, echo the lot back out
stdin.on('end', function () {
	process.stdout.write(input);
});

So everything in nodejs is events, right? What this does is take data from the stdin when there is some, i.e. when the data event happens. It concatenates all this input and when the stdin is done (end event) it spits it all out. Lovely! Weird, and probably the wrong way to do this relatively simple thing, but it served to help me understand “what the fuck was going on”™.
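The same accumulate-then-emit pattern works for any readable stream, not just stdin. Here is a minimal sketch of it pulled out into a function. The name readAll is made up for illustration and is not part of jsgrep:

```javascript
// Hypothetical helper (not part of jsgrep): collect a whole readable
// stream into one string, then hand that string to a callback.
function readAll(stream, done) {
	var input = '';
	stream.setEncoding('utf8');
	// 'data' fires once per chunk, as each chunk arrives
	stream.on('data', function (chunk) {
		input += chunk;
	});
	// 'end' fires once, when there is nothing left to read
	stream.on('end', function () {
		done(input);
	});
}

// Usage (echo stdin back out, like the snippet above):
//   readAll(process.stdin, function (html) {
//       process.stdout.write(html);
//   });
```

Nothing downstream can start until the end event fires, which is fine here because the whole HTML document is needed before it can be parsed into a DOM.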

Next! Let’s load jQuery. So all I need to do is include jQuery, right?

require("./jquery.js")

No, sorry. jQuery needs the DOM. It has the DOM attached to its soul. Luckily the nodejs community have been busy little bees. They’ve built this awesome package manager called npm. Now I can install a lovely thing called jsdom. jsdom is pretty much purpose built for the task of including jQuery into the shit you do; specifically it lets you load HTML as a DOM and… do DOM things. Sweet! You make all that go as follows:

var jsdom = require('jsdom').jsdom

jsdom.env(input, ['jquery.js'], {
	features:{
		FetchExternalResources   : false, 
		ProcessExternalResources : false,
		MutationEvents           : false,
		QuerySelector            : false
	}
},
function(errors, window) {
	// do shit to your window
})
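What goes inside that callback is ordinary jQuery. As a rough sketch of the "select the image tags, print their src" idea from earlier: collectAttr is a hypothetical name, this is not jsgrep's actual source, and it assumes the injected jQuery is reachable as window.$:

```javascript
// Hypothetical helper (not jsgrep's actual source): given a jQuery
// function bound to a document, return one attribute value per match.
function collectAttr($, selector, attr) {
	var values = [];
	$(selector).each(function () {
		// inside .each(), `this` is the raw DOM element
		var value = $(this).attr(attr);
		// skip matched elements that don't carry the attribute
		if (value !== undefined) {
			values.push(value);
		}
	});
	return values;
}

// Inside the jsdom callback (assuming jQuery was injected as window.$):
//   collectAttr(window.$, 'img', 'src').forEach(function (src) {
//       console.log(src);
//   });
```

Passing $ in as an argument keeps the helper independent of how the DOM was built, so it can be tried out against any jQuery-like function.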

Well isn’t this nice? You provide that callback function and, once the HTML is loaded as a DOM object, away it goes! You can even tell it to inject jQuery into your HTML for you. Sweet! Right, so finally, I want to read command line arguments. Now you can do that by dealing with process.argv manually. But fuck that noise, someone somewhere must have done a nice args manager. Yes, yes they have! It is also on npm; js-opts it is called. Once that’s installed you can do lovely things like:

// js-opts, installed via npm as "opts"
var opts = require('opts');

var options = [
	{ short : 'i'
	, long : 'input'
	, description : 'HTML URL or Filename'
	, value : true
	}
];

// Grab the options
opts.parse(options, true);
var input = opts.get('input') || false;

Awesome! That’s it really. The tool I’ve built should even be installable via npm, though not from the online registry, and I haven’t tested that. But by running node src/jsgrep.js you can feed it HTML on the pipe, or via -i, and get some lovely things going on. So here are all the images I use on my blog:

curl -s sinjax.net | node src/jsgrep.js -s img -a src -nt | sort | uniq

outputs:

/graphics/cats.png
/graphics/comments.png
/graphics/date.png
/graphics/globe.png
/graphics/link.png
/style/face

Nice. Clearly using nodejs is a little overkill. Hell, using jQuery is a little overkill. But yeah, I think I learnt something from that voyage, so that’s nice :-)

Update: If you have installed nodejs and npm already, you can now do an npm install jsgrep to have a play.

Stolen, torn apart, slapped together and otherwise created by Sina Samangooei. Licensed under WTFPL