Remove Xml Nodes Without Child Nodes Using Python
Solution 1:
Since you use the lxml module, consider XSLT, the special-purpose language designed to transform XML files. With this approach, no for loops or if logic is required.
In fact, your XML appears to use XSLT already, judging by its xml-stylesheet processing instruction, so you might be able to include the script below in that stylesheet. The following script runs the Identity Transform plus an empty template that matches any <image> tag with zero child elements. Empty templates remove such nodes.
XSLT (save as .xsl file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <!-- Identity Transform: copy every node and attribute as-is -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: drop any image element with no child elements -->
  <xsl:template match="image[count(*)=0]"/>
</xsl:stylesheet>
Python
import lxml.etree as et
doc = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')
transform = et.XSLT(xsl)
result = transform(doc)
# OUTPUT TO SCREEN
print(result)
# OUTPUT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
Output
<?xml version="1.0"?>
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <images>
    <image file="VideoExtract/testset/10224.jpg">
      <box top="436" left="266" width="106" height="61">
        <label>1</label>
      </box>
    </image>
    <image file="VideoExtract/testset/1044.jpg">
      <box top="507" left="330" width="52" height="27">
        <label>2</label>
      </box>
    </image>
  </images>
</dataset>
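To try the stylesheet without creating any files, the same transform can be run on in-memory strings. This is only a minimal sketch: the sample document and its file names (a.jpg, b.jpg) are made up for illustration, modelled on the output above, and are not the original input.

```python
from lxml import etree

# Hypothetical sample input: one image with a box, one image with no children.
XML = b"""<dataset>
  <images>
    <image file="a.jpg"><box top="1" left="2" width="3" height="4"><label>1</label></box></image>
    <image file="b.jpg"/>
  </images>
</dataset>"""

# Same stylesheet as above: Identity Transform plus an empty template.
XSL = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="image[count(*)=0]"/>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.fromstring(XSL))
result = transform(etree.fromstring(XML))

# Collect the file attribute of every image that survived the transform.
remaining = [img.get("file") for img in result.getroot().iter("image")]
print(remaining)
```

Only the image that has a child element should remain, because the empty template has a higher default priority than the identity template and therefore wins for childless image elements.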
Solution 2:
This code might do exactly what you have asked for in your question. I doubt it's exactly what you want.
>>> from lxml import etree
>>> tree = etree.parse('testxml.xml')
>>> for el in tree.iter():
...     el.tag, len(list(el.iterchildren()))
...     if not len(list(el.iterchildren())):
...         parent = el.getparent()
...         if parent is not None:
...             parent.remove(el)
...
('dataset', 1)
('images', 3)
('image', 1)
('box', 1)
('label', 0)
('image', 1)
('box', 1)
('label', 0)
('image', 0)
>>> tree.write('temp.xml', pretty_print=True)
Here's the resulting XML file.
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <images>
    <image file="VideoExtract/testset/10224.jpg">
      <box top="436" left="266" width="106" height="61"></box>
    </image>
    <image file="VideoExtract/testset/1044.jpg">
      <box top="507" left="330" width="52" height="27"></box>
    </image>
  </images>
</dataset>
I notice that the label nodes contain no child nodes (although they contain text!); therefore, they are removed and are missing from the output. Is this what you really want?
In contrast, this version of the code preserves the label elements.
>>> tree = etree.parse('testxml.xml')
>>> for el in tree.iter():
...     if len(list(el.iterchildren())) or ''.join([_.strip() for _ in el.itertext()]):
...         pass
...     else:
...         parent = el.getparent()
...         if parent is not None:
...             parent.remove(el)
...
Here's the resulting file in this case.
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <images>
    <image file="VideoExtract/testset/10224.jpg">
      <box top="436" left="266" width="106" height="61">
        <label>1</label>
      </box>
    </image>
    <image file="VideoExtract/testset/1044.jpg">
      <box top="507" left="330" width="52" height="27">
        <label>2</label>
      </box>
    </image>
  </images>
</dataset>
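Both loops above remove elements while tree.iter() is still walking the tree. A more cautious variant, sketched below, takes a snapshot of the elements first and wraps the check in a function. The function name remove_empty_elements and the sample document are hypothetical, not part of the original answer.

```python
from lxml import etree


def remove_empty_elements(root):
    """Remove elements that have no child elements and no non-whitespace text."""
    # Snapshot the elements first so removals do not disturb the traversal.
    # iter("*") yields elements only, skipping comments and processing instructions.
    for el in list(root.iter("*")):
        has_children = len(el) > 0
        has_text = bool((el.text or "").strip())
        parent = el.getparent()
        # The root has no parent and is never removed.
        if not has_children and not has_text and parent is not None:
            parent.remove(el)
    return root


# Hypothetical sample input modelled on the files shown above.
root = etree.fromstring(
    '<dataset><images>'
    '<image file="a.jpg"><box top="1"><label>1</label></box></image>'
    '<image file="b.jpg"/>'
    '</images></dataset>'
)
remove_empty_elements(root)

files = [img.get("file") for img in root.iter("image")]
labels = [lab.text for lab in root.iter("label")]
print(files, labels)
```

Like the loops above, this is a single pass: a parent that becomes empty only because its children were removed is left in place.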