Write a web spider to scrape data using Java

I'm going to explain how to write a web spider in the Java programming language. I don't like writing spiders in Java myself, but I'm explaining it for those who are interested.

First, create a URL object for the page you want to scrape. The code below opens a connection to http://icwow.blogspot.com/ and wraps its input stream so that the HTML source can be read from it.


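URL url = new URL("http://icwow.blogspot.com/");
URLConnection urlConnection = url.openConnection();
// BufferedReader replaces DataInputStream, whose readLine() is deprecated; UTF-8 is an assumption
BufferedReader reader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), StandardCharsets.UTF_8));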
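
The snippet above only opens the stream; the page still has to be read into a string. A minimal sketch of that step follows. The class name PageReader and the method readAll are illustrative names, not part of the original post, and the demo reads from an in-memory string so it runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.stream.Collectors;

class PageReader {
    // Hypothetical helper: reads every line from the source and joins the
    // lines with single spaces. The caller is responsible for closing it.
    static String readAll(Reader source) {
        BufferedReader in = new BufferedReader(source);
        return in.lines().collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // For a live page, pass something like
        // new InputStreamReader(urlConnection.getInputStream(), StandardCharsets.UTF_8)
        // instead of the StringReader used here.
        String html = readAll(new StringReader("<html>\n<title>Demo</title>\n</html>"));
        System.out.println(html); // prints "<html> <title>Demo</title> </html>"
    }
}
```

Taking a Reader rather than a URL keeps the reading logic separate from the network call, which makes it easy to try out on sample HTML.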

After reading all lines of the HTML source from the URL, replace every run of white space with a single space. This makes the regular expressions simpler to write.


html = html.replaceAll("\\s+", " ");
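
To see what this replacement does, here is a small sketch on a made-up fragment. The class name WhitespaceDemo and the sample string are assumptions for illustration only. It also shows why the step matters: in a Java regex, "." does not match line breaks by default, so a title spread over several lines would not match until the white space is collapsed.

```java
class WhitespaceDemo {
    // Collapses every run of white space (spaces, tabs, line breaks) into one space.
    static String collapse(String html) {
        return html.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Made-up fragment with a line break, a tab, and repeated spaces.
        String raw = "<title>\n\t  My   Page\n</title>";
        System.out.println(collapse(raw)); // prints "<title> My Page </title>"
    }
}
```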

Now build the pattern using a regular expression. Here is a simple regex that extracts the title of the web page.


Pattern p = Pattern.compile("<title>(.*?)</title>");

Match the pattern against the HTML source to scrape the title.


Matcher m = p.matcher(html);

Obtain the title from the matches and print it.


while (m.find()) {
   System.out.println("Page Title is " + m.group(1));
}
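
The pattern, matcher, and loop can also be wrapped into one small method. The sketch below is one possible variation, not the post's own code: the class name TitleExtractor and the method extractTitle are hypothetical, and it assumes you also want to match an upper-case <TITLE> tag or a title tag that carries attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {
    // CASE_INSENSITIVE lets <TITLE> match as well as <title>;
    // [^>]* tolerates attributes inside the opening tag.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of the first title element, or null if none is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("<html><head><TITLE> My Page </TITLE></head></html>")); // prints "My Page"
        System.out.println(extractTitle("<html><body>no title here</body></html>"));           // prints "null"
    }
}
```

Since a page normally has a single title, returning the first match is usually enough; the while loop is more useful for tags that repeat.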

Below is the complete sample code of a spider that scrapes the title of the given URL. Copy it and modify it: swap in whatever regular expressions you need to scrape the data you require.


import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Spider {
  public static void main(String[] argv) {
    try {
      URL url = new URL("http://icwow.blogspot.com/");
      URLConnection urlConnection = url.openConnection();
      StringBuilder source = new StringBuilder();
      // read all HTML source from the given URL, line by line
      // (BufferedReader replaces the deprecated DataInputStream.readLine(); UTF-8 is an assumption)
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(urlConnection.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          source.append(' ').append(line);
        }
      }

      // replace every run of white space with a single space
      String html = source.toString().replaceAll("\\s+", " ");
      // build the pattern using a regular expression
      Pattern p = Pattern.compile("<title>(.*?)</title>");
      // match the pattern against the HTML source
      Matcher m = p.matcher(html);
      // loop over every match of the pattern
      while (m.find()) {
        // print the text captured by the first group
        System.out.println(m.group(1));
      }
    } catch (Exception e) {
      System.out.println(e);
    }
  }
}
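
As a sketch of swapping in your own regular expressions, the example below collects every match of a pattern instead of printing it. The class name TagScraper, the method findAll, and the sample HTML are illustrative assumptions; the h1 pattern is just one example of a regex you might plug in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagScraper {
    // Hypothetical helper: returns the text captured by group 1
    // for every match of the regex in the given HTML.
    static List<String> findAll(String regex, String html) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        while (m.find()) {
            results.add(m.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up sample; in the spider this would be the collapsed page source.
        String html = "<h1>First</h1> <p>text</p> <h1 class=\"x\">Second</h1>";
        System.out.println(findAll("<h1[^>]*>(.*?)</h1>", html)); // prints [First, Second]
    }
}
```

The regex must contain one capturing group, since the helper reads group(1) from each match.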

Comments

Anonymous said…

Thank you this was very helpful.

July 2, 2008 at 4:20 PM

This works well. Thank you for this spider.

January 9, 2009 at 7:53 PM

Hi, it's working, but how do I do this for multiple URLs, and is it possible to use XPath instead of regex?

December 30, 2011 at 1:33 AM

Hi,

Thanks for this, I've been looking for something like it for a long time.

Could this script retrieve multiple pieces of data from the same site such as h1 tags and other meta data?

August 20, 2012 at 6:07 AM

