I'm going to explain how to write a web spider using the Java programming language. Java is not my first choice for writing spiders, but this walkthrough is for those who are interested.
First, create a URL object for the page you want to scrape. The code below opens a connection to the given URL, http://icwow.blogspot.com/, and wraps its input stream so that the HTML source can be read from it.
URL url = new URL("http://icwow.blogspot.com/");
URLConnection urlConnection = url.openConnection();
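// DataInputStream.readLine() is deprecated, so wrap the stream in a BufferedReader (assumes the page is UTF-8 encoded)
BufferedReader reader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));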
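The lines above open the stream but do not yet read it. As a minimal sketch of the reading step, the hypothetical helper `readAll` below (the class name `SourceReader` is also made up for illustration) joins every line with a single space. A `StringReader` stands in for the network stream so the example runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SourceReader {
    // Hypothetical helper: reads every line from the reader and
    // appends each one to the result, preceded by a single space.
    static String readAll(Reader source) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(' ').append(line);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // The StringReader replaces the real network stream for this demo
        String html = readAll(new StringReader("<html>\n<title>Demo</title>\n</html>"));
        System.out.println(html);
    }
}
```

With a real connection you would pass `new InputStreamReader(urlConnection.getInputStream(), "UTF-8")` instead of the `StringReader`, assuming the page is UTF-8 encoded.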
After reading every line of the HTML source from the URL, replace each run of whitespace with a single space. This makes the regular expressions simpler to write.
html = html.replaceAll("\\s+", " ");
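To see why this step helps: by default `.` in a Java regex does not match line breaks, so a title that spans several lines would be missed by a pattern such as `(.*?)`. The small sketch below (the class and method names are made up for illustration) shows newlines and tabs collapsing into single spaces.

```java
class WhitespaceDemo {
    // Collapse every run of spaces, tabs and line breaks into one space
    static String collapseWhitespace(String html) {
        return html.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String raw = "<title>\n   My\tPage\n</title>";
        System.out.println(collapseWhitespace(raw)); // prints "<title> My Page </title>"
    }
}
```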
Now build the pattern with a regular expression. Here is a simple regex for extracting the title of the web page.
Pattern p = Pattern.compile("<title>(.*?)</title>");
Match the pattern against the HTML source to scrape the title.
Matcher m = p.matcher(html);
Obtain the title from the matches and print it.
while (m.find()) {
    System.out.println("Page Title is " + m.group(1));
}
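The pattern, matcher and loop above can be tried on a plain string without any network access. The sketch below is one possible variation, not the code from this post: the `extractTitle` helper is hypothetical, and it adds `Pattern.CASE_INSENSITIVE` and `[^>]*` on the assumption that some pages write `<TITLE>` in upper case or put attributes on the tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {
    // Case-insensitive variant of the title regex; DOTALL lets '.' match
    // line breaks in case the whitespace was not collapsed beforehand
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the first title found, or null if none
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE> Hello World </TITLE></head></html>";
        System.out.println("Page Title is " + extractTitle(html)); // prints "Page Title is Hello World"
    }
}
```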
Below is the complete sample code of a spider that scrapes the title of the given URL. Copy it and modify it, substituting your own regular expressions to scrape whatever data you need.
import java.io.*;
import java.net.*;
import java.util.regex.*;
class Spider {
    public static void main(String[] argv) {
        try {
            URL url = new URL("http://icwow.blogspot.com/");
            URLConnection urlConnection = url.openConnection();
            // DataInputStream.readLine() is deprecated, so read through a
            // BufferedReader (assumes the page is UTF-8 encoded)
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
            StringBuilder source = new StringBuilder();
            String tmp;
            // read all HTML source from the given URL
            while ((tmp = reader.readLine()) != null) {
                source.append(' ').append(tmp);
            }
            reader.close();
            // replace every run of whitespace with a single space
            String html = source.toString().replaceAll("\\s+", " ");
            // build the pattern using a regular expression
            Pattern p = Pattern.compile("<title>(.*?)</title>");
            // match the pattern against the HTML source
            Matcher m = p.matcher(html);
            // loop over every match of the pattern
            while (m.find()) {
                // print the text captured by the first group (the title)
                System.out.println(m.group(1));
            }
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
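The same approach extends to other elements by swapping in a different regular expression. The sketch below is only an illustration: the `findAll` helper and the sample HTML are made up, and the meta pattern assumes the `name` attribute comes before `content` and that both use double quotes, which real pages do not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiFieldScraper {
    // Hypothetical helper: collects group 1 of every match of the regex
    static List<String> findAll(String html, String regex) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        while (m.find()) {
            results.add(m.group(1).trim());
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample HTML invented for this demo
        String html = "<h1>First</h1> <p>text</p> <h1 class=\"x\">Second</h1> "
                + "<meta name=\"description\" content=\"A demo page\">";
        // every h1 heading, with or without attributes on the tag
        System.out.println(findAll(html, "<h1[^>]*>(.*?)</h1>")); // prints [First, Second]
        // meta description (assumes name precedes content, double quotes)
        System.out.println(findAll(html,
                "<meta\\s+name=\"description\"\\s+content=\"(.*?)\"")); // prints [A demo page]
    }
}
```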
Comments
Thanks for this, I've been looking for something like it for a long time.
Could this script retrieve multiple pieces of data from the same site such as h1 tags and other meta data?