
Python Split At Tag Regex

I'm trying to split lines such as <title>Next stop</title> into their opening tag, their text, and their closing tag.

Solution 1:

Using lookarounds and a capture group to keep the text after splitting:

re.split(r'(?<=>)(.+?)(?=<)', '<label>Olympic Games</label>')
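As a quick check of what that call returns, here is a minimal sketch (the variable names are only illustrative). Because the pattern contains a capture group, re.split keeps the matched text in the result instead of discarding it:

```python
import re

line = '<label>Olympic Games</label>'

# The lookbehind requires a preceding '>', the lookahead a following '<';
# the capture group makes re.split keep the text between the tags.
parts = re.split(r'(?<=>)(.+?)(?=<)', line)
print(parts)  # ['<label>', 'Olympic Games', '</label>']
```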

Solution 2:

This regex works for me:

<(label|title)>([^<]*)</(label|title)>

or, as cwallenpoole suggested:

<(label|title)>([^<]*)</(\1)>


I tested it with http://www.regexpal.com/

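I have used three capturing groups. If you don't need the captures, turn the tag-name groups into non-capturing ones, (?:label|title), and drop the other parentheses. Don't simply delete the () around label|title, because the alternation still needs grouping, and the \1 version only works while the first group stays a capturing group.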
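To show how the backreference pattern could be used from Python, here is a minimal sketch. The helper name split_tagged and the choice of re.fullmatch are assumptions made for illustration; they are not part of the original answer. The pattern captures the tag name rather than the whole tag, so the tags are rebuilt from the name:

```python
import re

# Same pattern as above; \1 forces the closing tag to repeat the opening name.
TAG_RE = re.compile(r'<(label|title)>([^<]*)</(\1)>')


def split_tagged(line):
    """Return [open_tag, text, close_tag], or None if the line does not match."""
    m = TAG_RE.fullmatch(line.strip())
    if m is None:
        return None
    name, text, _ = m.groups()
    return ['<%s>' % name, text, '</%s>' % name]


print(split_tagged('<label>Olympic Games</label>'))
print(split_tagged('<title>Next stop</title>'))
print(split_tagged('<label>mismatch</title>'))  # closing name differs
```

With the first pattern, which repeats (label|title) instead of using \1, the mismatched line would match as well.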

What is wrong with your regex <\*> is that it matches only one thing: the literal text <*>. You have escaped the * by writing \*, so what you are saying is:

  • Match a <, then a literal *, and then a >.
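The difference is easy to see in a short sketch. The sample strings are made up for illustration, and <[^>]*> is just one possible unescaped alternative that matches any tag:

```python
import re

line = '<label>Olympic Games</label>'

# \* is a literal asterisk, so this only ever finds the text "<*>".
escaped = re.findall(r'<\*>', line)
literal = re.findall(r'<\*>', 'a<*>b')

# Unescaped, * is a quantifier: "<", any run of non-">" characters, then ">".
tags = re.findall(r'<[^>]*>', line)

print(escaped)  # []
print(literal)  # ['<*>']
print(tags)     # ['<label>', '</label>']
```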

Solution 3:

Data:

line = """<label>Olympic Games</label>
<title>Next stop</title>"""

With look-ahead / look-behind assertions with re.findall:

import re

# Raw string so that the backslash-free pattern stays safe to extend later.
pattern = re.compile(r"(<.*(?<=>))(.*)((?=</)[^>]*>)")
print(pattern.findall(line))
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]

Without look-ahead / look-behind assertions, just by capturing groups, with re.findall:

# Three plain groups: opening tag, text, closing tag.
pattern = re.compile(r"(<[^>]*>)(.*)(</[^>]*>)")
print(pattern.findall(line))
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]
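One caveat worth checking: both patterns rely on each element sitting on its own line, because the middle (.*) is greedy. The sketch below uses a made-up one-line input to show the effect, and a lazy (.*?) as one possible adjustment (an assumption on my part, not something the answer above proposes):

```python
import re

one_line = '<a>x</a><b>y</b>'  # hypothetical input with two elements on one line

greedy = re.findall(r"(<[^>]*>)(.*)(</[^>]*>)", one_line)
lazy = re.findall(r"(<[^>]*>)(.*?)(</[^>]*>)", one_line)

print(greedy)  # [('<a>', 'x</a><b>y', '</b>')]
print(lazy)    # [('<a>', 'x', '</a>'), ('<b>', 'y', '</b>')]
```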

Solution 4:

If you don't mind punctuation, here is a quick non-regex alternative using itertools.groupby.

Code

import itertools as it


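def split_at(iterable, pred, keep_delimiter=False):
    """Return a list of groups, split wherever pred(item) is true."""
    if keep_delimiter:
        # Keep every run, including the runs of delimiter items.
        return [list(g) for k, g in it.groupby(iterable, pred)]
    # Drop the delimiter runs (k is True) and keep only the items between them.
    return [list(g) for k, g in it.groupby(iterable, pred) if not k]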

Demo

>>> words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
>>> pred = lambda x: "elit" in x
>>> split_at(words, pred, True)
[['Lorem', 'ipsum', '...,', 'consectetur', '...'],
 ['elit,'],
 ['sed', 'do', 'eiusmod', '...']]

>>> words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
>>> pred = lambda x: "consect" in x
>>> split_at(words, pred, True)
[['Lorem', 'ipsum', '...,'],
 ['consectetur'],
 ['...', 'elit,', 'sed', 'do', 'eiusmod', '...']]
