
Remove Xml Nodes Without Child Nodes Using Python

I have the following xml output: <

Solution 1:

Since you use the lxml module, consider XSLT, the special-purpose language designed to transform XML files. With this approach, no for loops or if logic is required.

In fact, your XML already appears to use XSLT, judging by its xml-stylesheet processing instruction, so you might be able to add the script below to that stylesheet. The script combines the identity transform, which copies every node and attribute unchanged, with an empty template that matches any <image> element having no child elements. Because an empty template produces no output, the matched elements are removed.

XSLT (save as .xsl file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="image[count(*)=0]"/>
</xsl:stylesheet>

Python

import lxml.etree as et

doc = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')

transform = et.XSLT(xsl)    
result = transform(doc)

# OUTPUT TO SCREEN
print(result)

# OUTPUT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)

Output

<?xml version="1.0"?>
<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <images>
    <image file="VideoExtract/testset/10224.jpg">
      <box top="436" left="266" width="106" height="61">
        <label>1</label>
      </box>
    </image>
    <image file="VideoExtract/testset/1044.jpg">
      <box top="507" left="330" width="52" height="27">
        <label>2</label>
      </box>
    </image>
  </images>
</dataset>
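
To try the stylesheet without creating any files, the same transform can be run on in-memory strings. The following is only a minimal sketch: the sample XML and the file names kept.jpg and empty.jpg are hypothetical stand-ins for the original input, which is not shown in full.

```python
from lxml import etree

# Hypothetical sample input: one <image> with a child <box>,
# and one <image> with no child elements at all.
XML = b"""<dataset>
  <images>
    <image file="kept.jpg">
      <box top="436" left="266" width="106" height="61">
        <label>1</label>
      </box>
    </image>
    <image file="empty.jpg"/>
  </images>
</dataset>"""

# Identity transform plus an empty template for childless <image> elements.
XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="image[count(*)=0]"/>
</xsl:stylesheet>"""

doc = etree.fromstring(XML)
transform = etree.XSLT(etree.fromstring(XSL))
result = transform(doc)

# Collect the file attribute of every <image> that survived the transform.
remaining = [img.get("file") for img in result.getroot().iter("image")]
print(remaining)
```

The predicate-based template wins over the identity template because a pattern with a predicate has a higher default priority in XSLT 1.0, so no explicit priority attribute is needed.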

Solution 2:

This code does literally what you asked for in your question, although I doubt it is what you actually want.

>>> from lxml import etree
>>> tree = etree.parse('testxml.xml')
>>> for el in tree.iter():
...     el.tag, len(list(el.iterchildren()))
...     if not len(list(el.iterchildren())):
...         parent = el.getparent()
...         if parent is not None:
...             parent.remove(el)
...
('dataset', 1)
('images', 3)
('image', 1)
('box', 1)
('label', 0)
('image', 1)
('box', 1)
('label', 0)
('image', 0)
>>> tree.write('temp.xml', pretty_print=True)

Here's the resulting xml file.

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <images>
    <image file="VideoExtract/testset/10224.jpg">
      <box top="436" left="266" width="106" height="61"></box>
    </image>
    <image file="VideoExtract/testset/1044.jpg">
      <box top="507" left="330" width="52" height="27"></box>
    </image>
  </images>
</dataset>

Note that the label elements contain no child elements (although they do contain text!), so they are removed along with the empty image element and are missing from the output. Is this what you really want?

In contrast, this version of the code preserves the label elements.

>>> tree = etree.parse('testxml.xml')
>>> for el in tree.iter():
...     if len(list(el.iterchildren())) or ''.join([_.strip() for _ in el.itertext()]):
...         pass
...     else:
...         parent = el.getparent()
...         if parent is not None:
...             parent.remove(el)

Here's the resulting file in this case.

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
  <images>
    <image file="VideoExtract/testset/10224.jpg">
      <box top="436" left="266" width="106" height="61">
        <label>1</label>
      </box>
    </image>
    <image file="VideoExtract/testset/1044.jpg">
      <box top="507" left="330" width="52" height="27">
        <label>2</label>
      </box>
    </image>
  </images>
</dataset>
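
The two loops above differ only in how they decide that an element is empty, so they can be folded into one helper. The sketch below is one possible way to do that, not the answer's own code: the function name remove_empty, its keep_text flag, and the sample document are all hypothetical. It also snapshots the elements into a list before removing anything, which avoids modifying the tree while iterating over it.

```python
from lxml import etree


def remove_empty(root, keep_text=True):
    """Remove descendants of root that have no child elements.

    With keep_text=True, childless elements that hold non-whitespace
    text (such as <label>1</label>) are preserved.
    """
    # iter("*") yields elements only (no comments or processing
    # instructions); list() takes a snapshot before the tree changes.
    for el in list(root.iter("*")):
        if len(el):  # element has at least one child element
            continue
        if keep_text and (el.text or "").strip():
            continue
        parent = el.getparent()
        if parent is not None:  # never try to remove the root itself
            parent.remove(el)
    return root


# Hypothetical sample document, modelled on the output shown above.
SAMPLE = (
    b'<dataset><images>'
    b'<image file="a.jpg"><box top="1"><label>1</label></box></image>'
    b'<image file="b.jpg"/>'
    b'</images></dataset>'
)

root = remove_empty(etree.fromstring(SAMPLE))
print(etree.tostring(root, pretty_print=True).decode())
```

Like the original loops, this makes a single pass in document order: an element that becomes empty only because its children were removed (for example box when keep_text=False) is not revisited and stays in the tree.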
