Saturday, November 27, 2010

Python: Unicode html character escape


It can be very irritating when you need something and cannot find the solution, even though you know there must be an easy way to do it.

I was working on a large Python project that handled foreign-language text. Most of the accented characters were HTML-encoded (i.e. written as character entities such as &aacute;). For example:

&aacute; = á
&oacute; = ó
&eacute; = é
&iacute; = í
&ntilde; = ñ

There are plenty more like these, and I needed to convert them into the appropriate characters with a Python script.
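To get a feel for how many such entities exist and which characters they stand for, the table that ships with Python can be inspected. This is only a minimal sketch, assuming Python 3, where the standard library module html.entities provides the name2codepoint dictionary; the five names listed are simply the examples from above.

```python
from html.entities import name2codepoint

# Entity names taken from the examples in this post.
names = ("aacute", "oacute", "eacute", "iacute", "ntilde")

for name in names:
    # name2codepoint maps an entity name to its Unicode code point.
    code_point = name2codepoint[name]
    print("&{};  U+{:04X}".format(name, code_point))

# Total number of named entities known to this table.
print(len(name2codepoint))
```

Calling chr() on any of these code points gives the actual character.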

Here is the technique I came up with, using BeautifulStoneSoup from BeautifulSoup, and it's really beautiful.

# import the appropriate library (BeautifulSoup 3, Python 2)
from BeautifulSoup import BeautifulStoneSoup

# input content, with accented characters written as HTML entities
content = 'Proceder&aacute; la aplicaci&oacute;n anal&oacute;gica de las normas cuando &eacute;stas no contemplen un supuesto espec&iacute;fico,'

# convert the HTML entities to real characters, then encode the result as UTF-8
content = unicode(BeautifulStoneSoup(content, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)).encode('utf-8')
print content

This code prints the content to the screen after converting the HTML entities (they are named character references, here for accented Latin letters). If your console doesn't support UTF-8 output, the characters will not display properly there.
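For anyone on Python 3, where the BeautifulSoup 3 import used in this post is not available, the standard library can do a similar conversion. The following is a minimal sketch under that assumption (Python 3.4 or later), not a drop-in replacement for the code above; the variable names and the shortened sample string are only illustrative.

```python
import html

# Entity-encoded input, a shortened version of the sample sentence.
encoded = "Proceder&aacute; la aplicaci&oacute;n anal&oacute;gica"

# html.unescape replaces named and numeric character references
# with the corresponding Unicode characters.
decoded = html.unescape(encoded)

# ascii() shows non-ASCII characters as escapes, so the result is visible
# even on a console that cannot display accented letters.
print(ascii(decoded))
```

On a console that does support the characters, print(decoded) shows the readable text directly.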

You can also write the output to a file from Python and open the file in a browser, where you'll see that the conversion succeeded.
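Writing the converted text to a file could look roughly like this. It is a sketch assuming Python 3; the function name write_decoded_html, the file name decoded_output.html and the small HTML wrapper are all made up for illustration.

```python
import html
import os
import tempfile


def write_decoded_html(encoded_text, path):
    """Decode character entities and save the result as a UTF-8 HTML page."""
    decoded = html.unescape(encoded_text)
    # Re-escape only &, < and > so the text is safe inside the page;
    # accented characters stay as real characters.
    body = html.escape(decoded, quote=False)
    page = (
        '<html><head><meta charset="utf-8"></head>'
        "<body><p>{}</p></body></html>".format(body)
    )
    # Open the file explicitly as UTF-8 so the result does not depend
    # on the platform's default encoding.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(page)
    return decoded


# Hypothetical output location in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "decoded_output.html")
write_decoded_html("aplicaci&oacute;n anal&oacute;gica", out_path)
```

The meta charset tag tells the browser to read the file as UTF-8, which is what makes the accented characters show up correctly when the file is opened.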

If you know other ways, please share them.
