
Parse Large Python Xml Using Xmltree

I have a Python script that parses huge XML files (the largest one is 446 MB):

try:
    parser = etree.XMLParser(encoding='utf-8')
    tree = etree.parse(os.path.join(srcDi

Solution 1:

Iterparse is not that difficult to use in this case.

temp.xml is the file presented in your question with a closing </MyRoot> tag added as the last line.

Think of the source = line as boilerplate, if you will: it parses the xml file and returns it element by element, indicating whether each chunk is the 'start' or the 'end' of an element and supplying information about the element.
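To make the 'start' and 'end' chunks concrete, here is a minimal sketch that lists the events iterparse reports for a tiny stand-in document. The SAMPLE bytes and the list_events helper are made up for illustration and are not part of the question.

```python
from io import BytesIO
from xml.etree import ElementTree

# Hypothetical stand-in document, not the file from the question.
SAMPLE = b"<root><item>one</item><item>two</item></root>"


def list_events(source):
    """Return (event, tag) pairs in the order iterparse reports them."""
    return [
        (event, element.tag)
        for event, element in ElementTree.iterparse(source, events=("start", "end"))
    ]


for event, tag in list_events(BytesIO(SAMPLE)):
    print(event, tag)
```

Each element shows up twice: once when its opening tag has been read and once when its closing tag has been read.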

In this case we act only on the 'end' events. An element's text is guaranteed to be available only once its closing tag has been read; at a 'start' event it may still be missing, which matters for a large file that is read in pieces. We watch for the 'PersonName' tags and pick up their texts. Having found the one and only such item in the xml file we abandon the processing.

>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event == 'end' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
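If more than one element has to be read from a large file, a variation along the following lines may help keep memory use down. It is only a sketch: iter_texts, the SAMPLE document and the second name in it are hypothetical. It relies on 'end' events and calls clear() on every element once it has been handled. The emptied elements stay attached to the tree, so memory use is reduced but not strictly constant.

```python
from io import BytesIO
from xml.etree import ElementTree

# Hypothetical sample; the second PersonName is invented for illustration.
SAMPLE = b"""<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2">
  <PersonName>Miracle Smith</PersonName>
  <PersonName>Another Name</PersonName>
</MyRoot>"""


def iter_texts(source, local_name):
    """Yield the text of every element whose local tag name is local_name."""
    for _event, element in ElementTree.iterparse(source, events=("end",)):
        # Tags look like '{namespace}PersonName'; keep only the part after '}'.
        if element.tag.rpartition("}")[2] == local_name:
            yield element.text
        # Drop text, attributes and children of the finished element.
        element.clear()


print(list(iter_texts(BytesIO(SAMPLE), "PersonName")))
```

For a file on disk, a filename such as 'temp.xml' can be passed in place of the BytesIO object.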

Edit, in response to question in a comment:

Normally you wouldn't do this since iterparse is intended for use with large chunks of xml. However, by wrapping a string in a StringIO object it can be processed with iterparse.

>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
...   <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
...     <Description>myData</Description>
...     <Identifier>43hhjh87n4nm</Identifier>
...   </Aliases>
...   <RollNo uom="kPa">39979172.201167159</RollNo>
...   <PersonName>Miracle Smith</PersonName>
...   <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
...     # .text is only guaranteed once the element's 'end' event has fired
...     if an_event == 'end' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
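The endswith test is used because ElementTree reports namespaced tags in the form '{namespace}PersonName'. If a stricter match is wanted, one possible sketch compares against the fully qualified tag instead. The find_person_name helper and the shortened XML_TEXT are illustrative assumptions; the namespace URI is the default namespace declared on MyRoot.

```python
from io import StringIO
from xml.etree import ElementTree

# Default namespace declared on the MyRoot element in the question.
NS = "http://www.example.org/yml/data/litsmlv2"

# Shortened, hypothetical version of the document from the question.
XML_TEXT = '''<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2">
  <PersonName>Miracle Smith</PersonName>
</MyRoot>'''


def find_person_name(source):
    """Return the text of the first PersonName in namespace NS, or None."""
    wanted = "{" + NS + "}PersonName"
    for _event, element in ElementTree.iterparse(source, events=("end",)):
        if element.tag == wanted:
            return element.text
    return None


print(find_person_name(StringIO(XML_TEXT)))
```

An exact comparison avoids accidentally matching a different tag whose name merely ends in 'PersonName'.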
