Parsing XHTML Using xml.etree.ElementTree
I want to use xml.etree.ElementTree to parse an XHTML document in Python 3. The document contains entities, so I cannot use the default parser settings. I'd like to do s
Solution 1:
I ran into the same problem. The sample code in the question and the accepted answer may have worked before, but it no longer works in my Python 3.3 and Python 3.4 environments.
I finally got it working, based on this Q&A.
Inspired by that post, we can simply prepend a DOCTYPE declaration that defines the needed entities to the incoming raw HTML content, and ElementTree then works out of the box.
This works in Python 2.6, 2.7, 3.3, and 3.4.
import xml.etree.ElementTree as ET
html = '''<html>
<div>Some reasonably well-formed HTML content.</div>
<form action="login"><input name="foo" value="bar"/><input name="username"/><input name="password"/>
<div>It is not unusual to see &nbsp; in an HTML page.</div>
</form></html>'''
magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY nbsp ' '>
]>''' # You can define more entities here, if needed
et = ET.fromstring(magic + html)
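If more than a handful of entities are needed, the declarations do not have to be typed by hand. The following is a minimal sketch, assuming the standard library's html.entities.name2codepoint table is an acceptable source of entity names; build_doctype and parse_xhtml are hypothetical helper names, not part of ElementTree.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML already predefines; there is no need to declare them again.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def build_doctype():
    """Return a DOCTYPE whose internal subset declares the HTML 4 named entities."""
    decls = "\n".join(
        '<!ENTITY {} "&#{};">'.format(name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return "<!DOCTYPE html [\n" + decls + "\n]>"


def parse_xhtml(text):
    """Prepend the generated DOCTYPE and parse the result with ElementTree."""
    return ET.fromstring(build_doctype() + text)


root = parse_xhtml("<html><body><p>caf&eacute;&nbsp;&copy; 2014</p></body></html>")
print(repr(root.find("body/p").text))
```

This only works when the input does not already start with its own XML declaration or DOCTYPE, since a document may contain just one of each and the XML declaration must come first.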
Solution 2:
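Create the parser yourself, register the entities on it, and feed it the document:
import urllib.request
import xml.etree.ElementTree as ET
from html.entities import entitydefs  # assumed source of entitydefs: HTML entity names mapped to characters
with urllib.request.urlopen(BASE_URL) as url:  # BASE_URL: address of the XHTML page
    body = url.read()
parser = ET.XMLParser()
# The raw expat parser may only be exposed as .parser by the pure-Python XMLParser,
# so skip this call when the attribute is missing (C-accelerated parser).
if hasattr(parser, "parser"):
    parser.parser.UseForeignDTD(True)
parser.entity.update(entitydefs)
parser.feed(body)
root = parser.close()  # returns the root element of the parsed tree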
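The same idea can be tried without a network connection. The sketch below assumes the document carries a DOCTYPE with an external identifier, as XHTML pages normally do; in that case the parser looks up unknown entities in parser.entity instead of failing immediately. The sample markup and the "x" namespace prefix are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.entities import entitydefs

# Hypothetical sample page; the DOCTYPE with a public/system identifier matters here.
XHTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>price:&nbsp;5&euro;</p></body></html>"
)

parser = ET.XMLParser()
parser.entity.update(entitydefs)  # entity name -> replacement character
parser.feed(XHTML)
root = parser.close()

# XHTML elements live in the XHTML namespace, so lookups need a prefix mapping.
ns = {"x": "http://www.w3.org/1999/xhtml"}
paragraph = root.find("x:body/x:p", ns)
print(repr(paragraph.text))
```

Note that the external DTD is not downloaded; the DOCTYPE only changes how undefined entities are reported to the parser.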