Wednesday, March 26, 2008

Write a web spider to scrape data using Java



I'm going to explain how to write a web spider in the Java programming language. Java would not be my first choice for writing a spider, but I'm explaining it for those who are interested.

First, create a URL object for the page you want to scrape. The code below opens a connection to http://icwow.blogspot.com/ and wraps its input stream in a reader that will read the page's HTML source.

URL url = new URL("http://icwow.blogspot.com/");
URLConnection urlConnection = url.openConnection();
// BufferedReader is used because DataInputStream.readLine() is deprecated; UTF-8 is assumed
BufferedReader reader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
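The snippet above only opens the stream; the read loop itself appears later in the complete program. As a minimal sketch of that step on its own, the helper below reads every line and joins the lines with spaces. The names `PageReader` and `readAll` are made up for this illustration, UTF-8 is an assumption about the page's encoding, and an in-memory stream stands in for the real connection so it runs without network access.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

class PageReader {
  // Hypothetical helper: reads every line from the stream and joins them with spaces
  static String readAll(InputStream in) throws IOException {
    StringBuilder html = new StringBuilder();
    // try-with-resources closes the reader (and the underlying stream) when done
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        html.append(' ').append(line);
      }
    }
    return html.toString();
  }

  public static void main(String[] args) throws IOException {
    // An in-memory stream stands in for urlConnection.getInputStream()
    InputStream fake = new ByteArrayInputStream(
        "<html>\n<title>Demo</title>\n</html>".getBytes(StandardCharsets.UTF_8));
    System.out.println(readAll(fake));
  }
}
```

With a real page you would pass urlConnection.getInputStream() to readAll instead of the in-memory stream.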

After reading all the lines of HTML source from the URL, replace every run of whitespace with a single space. This makes the regular expressions simpler to write.

html = html.replaceAll("\\s+", " ");
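To see why this step matters: by default the `.` in a Java regex does not match line breaks, so the title pattern used in this post would miss a title that is split across lines. Here is a small sketch of that effect. The class name `WhitespaceDemo`, its two helpers, and the sample string are invented for this example.

```java
import java.util.regex.*;

class WhitespaceDemo {
  // Collapse every run of whitespace (spaces, tabs, newlines) into one space
  static String normalize(String html) {
    return html.replaceAll("\\s+", " ");
  }

  // True when the post's title pattern finds a match in the given text
  static boolean hasTitle(String html) {
    return Pattern.compile("<title>(.*?)</title>").matcher(html).find();
  }

  public static void main(String[] args) {
    String raw = "<title>\n   My\tPage\n</title>";
    // No match: the title text contains newlines that "." will not cross
    System.out.println(hasTitle(raw));
    // Match: after normalizing, the title sits on a single line
    System.out.println(hasTitle(normalize(raw)));
    System.out.println(normalize(raw));
  }
}
```

An alternative would be to compile the pattern with Pattern.DOTALL, but collapsing the whitespace also tidies the text you capture.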

Now build the pattern using a regular expression. Here is a simple regex that extracts the title of the web page.

Pattern p = Pattern.compile("<title>(.*?)</title>");

Match the pattern against the HTML source to scrape the title.

Matcher m = p.matcher(html);

Obtain the title from the matches and print it.

while (m.find()) {
   System.out.println("Page Title is "+m.group(1));
}
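The pattern, matcher and find steps can be tried out on a plain string before pointing them at a live page. The sketch below wraps them in a hypothetical `extractTitle` helper (the class and method names are mine). The Pattern.CASE_INSENSITIVE flag and the trim() call are additions to the pattern shown above, so that an upper-case <TITLE> tag is also accepted and padding around the title is dropped.

```java
import java.util.regex.*;

class TitleExtractor {
  // Hypothetical helper: returns the first <title> text, or null when none is found
  static String extractTitle(String html) {
    // CASE_INSENSITIVE is an addition to the original pattern; it also accepts <TITLE>
    Pattern p = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(html);
    // group(0) would be the whole match including the tags;
    // group(1) is only the text captured by the parentheses
    return m.find() ? m.group(1).trim() : null;
  }

  public static void main(String[] args) {
    String html = "<html><head><TITLE> Hello World </TITLE></head></html>";
    System.out.println("Page Title is " + extractTitle(html));
  }
}
```

Returning null for a page without a title is just one possible choice; an empty string or an Optional would work as well.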

Below is the complete sample code of a spider that scrapes the title of the given URL. Copy it and modify it, substituting whatever regular expressions you need to scrape the data you want.

import java.io.*;
import java.net.*;
import java.util.regex.*;

class Spider{
  public static void main(String []argv){
    try {

      URL url = new URL("http://icwow.blogspot.com/");
      URLConnection urlConnection = url.openConnection();
      // BufferedReader is used because DataInputStream.readLine() is deprecated;
      // UTF-8 is assumed here, adjust it if the page uses another encoding
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
      StringBuilder source = new StringBuilder();
      String tmp;
      // read all HTML source from given URL
      while ((tmp = reader.readLine()) != null) {
        source.append(' ').append(tmp);
      }
      reader.close();

      // replace every run of whitespace with a single space
      String html = source.toString().replaceAll("\\s+", " ");
      // build the pattern using regular expression
      Pattern p = Pattern.compile("<title>(.*?)</title>");
      // match the pattern against the HTML source
      Matcher m = p.matcher(html);
      // loop over every match of the pattern
      while (m.find()){
        // print the text captured by the first group, i.e. the title
        System.out.println(m.group(1));
      }
    }catch (Exception e) {
      System.out.println(e);
    }
  }
}
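As an example of swapping in your own regular expressions, here is a rough sketch that pulls more than one kind of data out of the same HTML, in this case the title and every h1 heading. The class `MultiScraper`, the helper `findAll` and the sample HTML are all made up for illustration. The `[^>]*` part is my addition so that tags carrying attributes still match. Regexes like these are fine for simple pages but will not cope with every piece of HTML you meet.

```java
import java.util.*;
import java.util.regex.*;

class MultiScraper {
  // Hypothetical helper: returns the group(1) text of every match of regex in html
  static List<String> findAll(String html, String regex) {
    List<String> results = new ArrayList<>();
    Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
    while (m.find()) {
      results.add(m.group(1).trim());
    }
    return results;
  }

  public static void main(String[] args) {
    // Sample HTML standing in for a downloaded page
    String html = "<html><head><title>Demo</title></head>"
        + "<body><h1>First</h1><p>text</p><h1 class=\"x\">Second</h1></body></html>";
    html = html.replaceAll("\\s+", " ");
    System.out.println(findAll(html, "<title[^>]*>(.*?)</title>"));
    System.out.println(findAll(html, "<h1[^>]*>(.*?)</h1>"));
  }
}
```

To cover several pages, you could put the download and the findAll calls inside a loop over a list of URLs.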




4 comments:

Anonymous said...

Thank you this was very helpful.

Anonymous said...

This works well. Thank you for this spider...

Anonymous said...

Hi, it's working, but how can I do this for multiple URLs, and is it possible to use XPath instead of regex?

Anonymous said...

Hi,

Thanks for this, I've been looking for something like it for a long time.

Could this script retrieve multiple pieces of data from the same site such as h1 tags and other meta data?