Getting Duplicate Keys In Yaml Using Python
Solution 1:
PyYAML will just silently overwrite the first entry; ruamel.yaml¹ will give a DuplicateKeyFutureWarning if used with the legacy API, and raise a DuplicateKeyError with the new API.
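To make the PyYAML behaviour concrete, here is a minimal sketch, assuming PyYAML is installed and using the same two-entry input as the rest of this answer:

```python
import yaml  # PyYAML

yaml_str = """\
build:
  step: 'step1'
build:
  step: 'step2'
"""

# PyYAML keeps only the last value for a repeated key and emits no warning.
loaded = yaml.safe_load(yaml_str)
print(loaded)  # {'build': {'step': 'step2'}}
```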
If you don't want to create a full Constructor for all types, overriding the mapping constructor in SafeConstructor should do the job:
import sys
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor
yaml_str = """\
build:
  step: 'step1'
build:
  step: 'step2'
"""
def construct_yaml_map(self, node):
    # build a list of (key, value) tuples so that duplicate keys are kept
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))
# replace the default mapping constructor on SafeConstructor
SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)
which gives:
[('build', [('step', 'step1')]), ('build', [('step', 'step2')])]
However, it doesn't seem necessary to make step: 'step1' into a list. The following will only create the list if there are duplicate keys (it could be optimised, if necessary, by caching the result of self.construct_object(key_node, deep=True)):
def construct_yaml_map(self, node):
    # test if there are duplicate node keys
    keys = set()
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        if key in keys:
            break
        keys.add(key)
    else:
        # no duplicates found: build a normal dict
        data = {}  # type: Dict[Any, Any]
        yield data
        value = self.construct_mapping(node)
        data.update(value)
        return
    # duplicates found: fall back to a list of (key, value) tuples
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))
which gives:
[('build', {'step': 'step1'}), ('build', {'step': 'step2'})]
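Because this second constructor returns a dict when there are no duplicates and a list of tuples when there are, code that consumes the result has to cope with both shapes. A minimal sketch of one way to do that, using only the standard library; the helper name group_pairs is made up for illustration and is not part of ruamel.yaml:

```python
from collections import defaultdict

def group_pairs(loaded):
    """Group mapping entries by key, whether `loaded` is a dict
    (no duplicates) or a list of (key, value) tuples (duplicates)."""
    pairs = loaded.items() if isinstance(loaded, dict) else loaded
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Result copied from the output shown above
data = [('build', {'step': 'step1'}), ('build', {'step': 'step2'})]
print(group_pairs(data))  # {'build': [{'step': 'step1'}, {'step': 'step2'}]}
```

This only groups the top level; nested mappings are left as they were loaded.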
Some points:
- Probably needless to say, this will not work with YAML merge keys (<<: *xyz).
- If you need ruamel.yaml's round-trip capabilities (yaml = YAML()), that will require a more complex construct_yaml_map.
If you want to dump the output, you should instantiate a new YAML() instance for that, instead of re-using the "patched" one used for loading (it might work, this is just to be sure):
yaml_out = YAML(typ='safe')
yaml_out.dump(data, sys.stdout)
which gives (with the first construct_yaml_map):
- - build
  - - [step, step1]
- - build
  - - [step, step2]
What works in neither PyYAML nor ruamel.yaml is yaml.load('file.yml'): that parses the string 'file.yml' itself as YAML instead of opening the file. If you don't want to open() the file yourself, you can do:
from pathlib import Path  # or: from ruamel.std.pathlib import Path
yaml = YAML(typ='safe')
yaml.load(Path('file.yml'))
¹ Disclaimer: I am the author of that package.
Solution 2:
If you can modify the input data very slightly, you should be able to do this by converting the single YAML-like file into multiple YAML documents. YAML documents can live in the same file if they're separated by --- on a line by itself, and your entries conveniently appear to be separated by a blank line (two consecutive newlines):
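import yaml  # PyYAML
with open('file.yml', 'r') as f:
    data = f.read()
# turn each blank-line separator into a YAML document separator
data = data.replace('\n\n', '\n---\n')
# safe_load_all is used because PyYAML 6+ requires an explicit Loader for load_all()
for document in yaml.safe_load_all(data):
    print(document)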
Output:
{'build': {'step': 'step1'}}
{'build': {'step': 'step2'}}
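One caveat with the plain replace is that it also rewrites blank lines inside an entry, for example inside a block scalar. A slightly more defensive sketch, assuming every new entry starts at column 0; the helper name split_into_documents is made up for illustration:

```python
import re

def split_into_documents(text):
    """Insert a '---' separator wherever one or more blank lines are
    followed by a line that starts at column 0 (a new top-level entry)."""
    return re.sub(r'\n(?:[ \t]*\n)+(?=\S)', '\n---\n', text)

sample = """\
build:
  step: 'step1'

build:
  step: 'step2'
"""

print(split_into_documents(sample))
```

The returned string can then be passed to a multi-document loader in the same way as in the snippet above.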
Solution 3:
You can override how PyYAML loads mappings. For example, you could use a defaultdict that holds a list of values for each key:
from collections import defaultdict
import yaml
def parse_preserving_duplicates(src):
    # We deliberately define a fresh class inside the function,
    # because add_constructor is a class method and we don't want to
    # mutate pyyaml classes.
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass
    def map_constructor(loader, node, deep=False):
        """Walk the mapping, recording any duplicate keys.
        """
        mapping = defaultdict(list)
        for key_node, value_node in node.value:
            key = loader.construct_object(key_node, deep=deep)
            value = loader.construct_object(value_node, deep=deep)
            mapping[key].append(value)
        return mapping
    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor)
    return yaml.load(src, PreserveDuplicatesLoader)
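Note that with this constructor every value ends up wrapped in a list, even for keys that occur only once. If that is inconvenient, the result can be post-processed. A minimal sketch using only the standard library; the helper name collapse_singletons is made up for illustration, and the sample input is a hand-built stand-in for what the loader would return:

```python
from collections import defaultdict

def collapse_singletons(node):
    """Convert nested defaultdict(list) results to plain dicts,
    unwrapping the list for keys that occurred only once."""
    if isinstance(node, defaultdict):
        result = {}
        for key, values in node.items():
            items = [collapse_singletons(v) for v in values]
            result[key] = items[0] if len(items) == 1 else items
        return result
    if isinstance(node, list):
        # a genuine YAML sequence: keep it as a list
        return [collapse_singletons(item) for item in node]
    return node

# Hand-built stand-in for the loader's result on the two-entry input
step1 = defaultdict(list, {'step': ['step1']})
step2 = defaultdict(list, {'step': ['step2']})
parsed = defaultdict(list, {'build': [step1, step2]})
print(collapse_singletons(parsed))  # {'build': [{'step': 'step1'}, {'step': 'step2'}]}
```

The trade-off is that, after collapsing, a repeated key and a key whose single value is a sequence both show up as a list, so this only suits data where that ambiguity does not matter.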