How To Identify Character Encoding From Website?
What I'm trying to do: I'm getting from a database a list of uris and download them, removing the stopwords and counting the frequency that the words appears in the webpage, then t
Solution 1:
It is not easy to find out the correct character encoding of a website because the information in the header might be wrong. BeautifulSoup does a pretty good job at guessing the character encoding and automatically decodes it to Unicode.
from bs4 import BeautifulSoup
import urllib
url = 'http://www.google.de'
fh = urllib.urlopen(url)
html = fh.read()
soup = BeautifulSoup(html)
# text is a Unicode string
text = soup.body.get_text()
# encoded_text is a utf-8 string that you can store in mongo
encoded_text = text.encode('utf-8')
See also the answers to this question.
Post a Comment for "How To Identify Character Encoding From Website?"