
How To Identify Character Encoding From Website?

What I'm trying to do: I'm fetching a list of URIs from a database and downloading each page, removing the stopwords and counting how often each word appears on the page, then storing the result in MongoDB.

Solution 1:

It is not easy to determine the correct character encoding of a website, because the charset declared in the HTTP header (or in a meta tag) may be missing or wrong. BeautifulSoup does a pretty good job of guessing the character encoding and automatically decodes the document to Unicode.
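To see why the header alone is unreliable, here is a small sketch (using only the standard library) of how the declared charset is read out of a Content-Type header; the header value here is a made-up example, and a real server may omit the charset parameter entirely or declare one that doesn't match the bytes:

```python
from email.message import Message

# Simulate a response header as a server might send it
# (hypothetical value, not from the original post).
msg = Message()
msg['Content-Type'] = 'text/html; charset=ISO-8859-1'

# get_content_charset() returns the declared charset, lowercased,
# or None if the server did not declare one.
declared = msg.get_content_charset()
print(declared)  # iso-8859-1
```

This is the same parsing that `urllib.request` responses expose via `response.headers.get_content_charset()`; whether the bytes are actually in that encoding is another matter, which is why a guessing library is useful.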

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'http://www.google.de'
# read() returns raw bytes; the encoding is not known yet
html = urlopen(url).read()
# BeautifulSoup guesses the encoding and decodes to Unicode
soup = BeautifulSoup(html, 'html.parser')

# text is a Unicode string
text = soup.body.get_text()
# encoded_text is a UTF-8 byte string that you can store in mongo
encoded_text = text.encode('utf-8')
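If you want to see which encoding BeautifulSoup actually settled on, it exposes its guess as `original_encoding` after parsing bytes. A minimal offline sketch (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page: the meta tag declares UTF-8 and the bytes really
# are UTF-8, so the guess should be straightforward.
raw = ('<html><head><meta charset="utf-8"></head>'
       '<body><p>Übergröße</p></body></html>').encode('utf-8')

soup = BeautifulSoup(raw, 'html.parser')
print(soup.original_encoding)   # the encoding BeautifulSoup guessed
print(soup.p.get_text())        # already decoded to Unicode
```

Logging `original_encoding` alongside each stored page can help you debug URIs whose word counts come out garbled.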

