When you read HTML content from a URL, Python by default sends the User-Agent string "Python-urllib/2.5" (on Python 2.5) with the HTTP request, unless you set one yourself. So the header actually sent to the server looks like this:
User-Agent: Python-urllib/2.5
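If you want to check which default your own interpreter would send, here is a minimal sketch. It assumes Python 3, where urllib2 has become urllib.request, and it only inspects the default opener, so no request goes out to any server.

```python
import urllib.request

# Build a default opener; this does not contact any server.
opener = urllib.request.build_opener()

# addheaders lists the headers the opener attaches to every request.
default_headers = dict(opener.addheaders)
default_user_agent = default_headers["User-agent"]

# The version suffix depends on the interpreter you run this with.
print(default_user_agent)
```

The printed value should start with "Python-urllib/" followed by your interpreter's major.minor version.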
You might not want the website to know that you are browsing its pages with a Python library, that is, as a bot rather than a human. The site might block you. In that case you must change the User-Agent string. Also, some sites serve different HTML depending on the User-Agent (browser).
You can do this with the add_header() method of a Python request object. This method takes an HTTP header as a key and a value. So to set the User-Agent header, the code is:
req.add_header('User-agent', 'Mozilla 3.10')
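The same key and value idea can be sketched on Python 3, assuming urllib.request in place of urllib2. One detail worth knowing is that urllib stores header names capitalized, so "User-Agent" is kept as "User-agent" and that is the spelling to use when reading it back. The headers argument shown for the second request is an alternative to calling add_header(). Nothing here touches the network.

```python
import urllib.request

# Option 1: set the header on an existing request object.
req = urllib.request.Request("http://www.google.com/")
req.add_header("User-Agent", "Mozilla 3.10")

# urllib capitalizes the header name, so look it up as "User-agent".
stored = req.get_header("User-agent")
print(stored)

# Option 2: pass the headers as a dict when creating the request.
req2 = urllib.request.Request(
    "http://www.google.com/",
    headers={"User-Agent": "Mozilla 3.10"},
)
print(req2.header_items())
```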
Here is a complete sample that fetches google.com with a Mozilla user agent and prints the returned HTML to the screen.
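import urllib2

# Build the request and override the default User-Agent header
req = urllib2.Request('http://www.google.com/')
req.add_header('User-agent', 'Mozilla 3.10')

# Send the request and print the returned HTML
res = urllib2.urlopen(req)
html = res.read()
print html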
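The sample above is written for Python 2. A rough Python 3 equivalent might look like the sketch below. The helper names build_request and fetch_html are my own, not part of any library, and falling back to UTF-8 when the server does not declare a charset is an assumption.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent):
    """Fetch a page and return its body as text."""
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req) as res:
        # Use the charset declared by the server; UTF-8 is an assumed fallback.
        charset = res.headers.get_content_charset() or "utf-8"
        return res.read().decode(charset, errors="replace")


# Example call (performs a real network request):
# print(fetch_html("http://www.google.com/", "Mozilla 3.10"))
```

Keeping the request building separate from the fetching makes it easy to check the header without going online.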