
Python Encoding Chinese To Special Character

I have a scrape/curl request that fetches HTML from another site. The page contains Chinese text, but some of the result comes out garbled, like this: °¢Àï°Í°ÍΪÄúÌṩÁË×ÔÁ�

Solution 1:

Chinese text can be encoded with several charsets. Three common ones are gb2312, big5, and gbk. Here is a snippet that converts a file from gb2312 to utf-8.

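import codecs

# Decode the whole GB2312 file into text
# (readline() would return only the first line).
with codecs.open("in.txt", "r", "gb2312") as infile:
    text = infile.read()

print(text)

# Write the same text back out, encoded as UTF-8.
with codecs.open("out.txt", "w", "utf-8") as outfile:
    outfile.write(text)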
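
The same idea can be sketched in memory, without files. This assumes the garbled sample in the question starts with the GBK bytes of "阿里巴巴" read as Latin-1; that is an inference from the first few characters, not something the question states.

```python
# Text as the site sends it: Chinese characters encoded as GBK bytes.
original = "阿里巴巴"
gbk_bytes = original.encode("gbk")

# Decoding those bytes with the wrong charset (Latin-1 here, as an assumption)
# produces the garbled look from the question.
garbled = gbk_bytes.decode("latin-1")
print(garbled)  # °¢Àï°Í°Í

# Decoding with the right charset recovers the text,
# which can then be re-encoded as UTF-8.
text = gbk_bytes.decode("gbk")
utf8_bytes = text.encode("utf-8")
print(text)
```

The bytes never change in the first two steps; only the charset used to interpret them does.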

Solution 2:

The solution turned out to be really simple. As @Thu Yein tun mentioned, I checked the Content-Type header in the HTTP response, which showed text/html;charset=GBK, so I decoded the result in my code like this:

result.decode('gbk')
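
Rather than hard-coding the charset, it could be read from the header value itself. The following is a minimal sketch using only the standard library; the helper names `charset_from_content_type` and `decode_body` are hypothetical, and the UTF-8 fallback is an assumption, not something the site guarantees.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lowercase, or None if absent.
    return msg.get_content_charset() or default


def decode_body(body, content_type):
    """Decode raw response bytes using the charset from the header."""
    charset = charset_from_content_type(content_type)
    # errors="replace" keeps going if a few bytes are invalid for that charset.
    return body.decode(charset, errors="replace")


# Stand-in for a real response: GBK bytes plus the header from the answer above.
body = "阿里巴巴".encode("gbk")
page = decode_body(body, "text/html;charset=GBK")
print(page)
```

Pages sometimes declare the charset only in an HTML meta tag, which this sketch does not inspect.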

Solution 3:

Try this block of code.

You can do this by importing the unquote function and first encoding the content as latin1, which turns each garbled character back into its original byte.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib2 import unquote

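# Every character must fit in latin1 so that encode() restores the original bytes;
# \xad is the invisible soft hyphen (byte 0xAD).
bytesquoted = u'å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³'.encode('latin1')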
unquoted = unquote(bytesquoted)
print unquoted.decode('utf8')

Output :

台南 親子餐廳
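
The snippet above targets Python 2 (`urllib2`, the `print` statement). A rough Python 3 equivalent might look like the sketch below. The function name `repair_quoted_mojibake` is hypothetical, and the input literal is written entirely in Latin-1 characters, with the soft hyphen spelled as `\xad`, on the assumption that this is what the garbled string originally contained.

```python
from urllib.parse import unquote_to_bytes


def repair_quoted_mojibake(text):
    """Undo UTF-8 text that was misread as Latin-1 and partly percent-encoded."""
    # Latin-1 maps each character back to the single byte it came from.
    raw = text.encode("latin-1")
    # Replace %XX escapes with the bytes they stand for, then decode as UTF-8.
    return unquote_to_bytes(raw).decode("utf-8")


garbled = "å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³"
repaired = repair_quoted_mojibake(garbled)
print(repaired)  # 台南 親子餐廳
```

`unquote_to_bytes` is given bytes here on purpose: passing a `str` would make it encode the text as UTF-8 first, which would not restore the original byte values.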
