Tuesday, April 15, 2008

Ruby: Create a Web Spider.

If you want to write a spider/crawler that scrapes the HTML source of a given web URL, I know of two ways to do it with Ruby's existing libraries. The first uses open-uri. It doesn't have many features, but it is handy enough for writing a simple crawler. The code below reads the HTML source from Google, then prints the page title by matching a regular expression.
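require 'open-uri'

url = "http://www.google.com/"
# On Ruby 2.5 and later, use URI.open; plain open(url) no longer fetches URLs on Ruby 3.0+
content = URI.open(url).read
# Print the page title if the markup contains one
if content =~ /<title>(.*?)<\/title>/
  print $1, "\n"
end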
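The regular expression above only matches a lowercase title tag that sits on a single line. A slightly more tolerant version could look like the sketch below. Note that `extract_title` is a hypothetical helper name, not something open-uri provides, and it works on any HTML string, so you can try it without a network connection.

```ruby
# Hypothetical helper (not part of open-uri): returns the text of the first
# <title> element in an HTML string, or nil when there is none.
def extract_title(html)
  # i: ignore tag case, m: let . match newlines inside the title
  match = html.match(/<title[^>]*>(.*?)<\/title>/im)
  match && match[1].strip
end

# Made-up sample markup, for illustration only
sample = "<html><head><TITLE>\n  Example Page\n</TITLE></head></html>"
puts extract_title(sample)   # prints "Example Page"
```

You can pass the `content` string read from the URL to this helper in place of the inline match.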
Below is another way: you can also write a crawler with Ruby's net/http library. It has more options than the previous one, since you can handle GET and POST requests, cookies, and other features. My suggestion is to use this one rather than the previous example.
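require 'net/http'

url = URI.parse('http://www.yahoo.com/')
req = Net::HTTP::Get.new(url.path)
# Open a connection and send the GET request
res = Net::HTTP.start(url.host, url.port) { |http|
  http.request(req)
}
content = res.body
# Print the page title if the markup contains one
if content =~ /<title>(.*?)<\/title>/
  print $1, "\n"
end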
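To illustrate the POST and cookie handling that net/http allows, here is a minimal sketch that builds a POST request with a form field and a Cookie header. The URL, form field, and cookie value are made-up placeholders, and `build_post` is a hypothetical helper name. The request is only constructed here, not sent.

```ruby
require 'net/http'

# Hypothetical helper: builds (but does not send) a POST request that carries
# form fields and a Cookie header.
def build_post(url, form, cookie)
  uri = URI.parse(url)
  req = Net::HTTP::Post.new(uri.path)
  req.set_form_data(form)   # URL-encodes the fields into the request body
  req['Cookie'] = cookie    # cookies travel as an ordinary request header
  req
end

# Placeholder URL, field, and cookie, for illustration only
req = build_post('http://www.example.com/search', { 'q' => 'ruby' }, 'session=abc123')
puts req.method   # POST
puts req.body     # q=ruby
```

To actually send the request, pass it to `http.request` inside a `Net::HTTP.start` block.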
Hope this makes you a good spiderman with Ruby scripts.


Anonymous said...

nice, thanks

Anonymous said...

I wouldn't call this simple script a spider/crawler.
