Skip to content Skip to sidebar Skip to footer

Pandas Extract Numbers From Column Into New Columns

I currently have this df where the rect column is all strings. I need to extract the x, y, w and h from it into separate columns. The dataset is very large so I need an efficient a

Solution 1:

Using extractall

df[['x', 'y', 'w', 'h']] = df['rect'].str.extractall('(\d+)').unstack().loc[:,0]
Out[267]: 
match    0    1    2    3
0      120  168  260  120
1      120  168  260  120
2      120  168  260  120
3      120  168  260  120
4      120  168  260  120

Solution 2:

Inline

Produce a copy

df.assign(**dict(zip('xywh', df.rect.str.findall('\d+').str)))

                          rect    x    y    w    h
0  <Rect (120,168),260by120>  1201682601201  <Rect (120,168),260by120>  1201682601202  <Rect (120,168),260by120>  1201682601203  <Rect (120,168),260by120>  1201682601204  <Rect (120,168),260by120>  120168260120

Or just reassign to df

df = df.assign(**dict(zip('xywh', df.rect.str.findall('\d+').str)))

df

                          rect    x    y    w    h
0  <Rect (120,168),260by120>  1201682601201  <Rect (120,168),260by120>  1201682601202  <Rect (120,168),260by120>  1201682601203  <Rect (120,168),260by120>  1201682601204  <Rect (120,168),260by120>  120168260120

Inplace

Modify existing df

df[[*'xywh']] = pd.DataFrame(df.rect.str.findall('\d+').tolist())

df

                          rect    x    y    w    h
0  <Rect (120,168),260by120>  1201682601201  <Rect (120,168),260by120>  1201682601202  <Rect (120,168),260by120>  1201682601203  <Rect (120,168),260by120>  1201682601204  <Rect (120,168),260by120>  120168260120

Solution 3:

If the strings follow a specific format <Rect \((\d+),(\d+)\),(\d+) by (\d+)>, you can use this regular expression with str.extract method:

df[['x','y','w','h']] = df.rect.str.extract(r'<Rect \((\d+),(\d+)\),(\d+) by (\d+)>')

df
#                          rect    x    y    w    h#0  <Rect (120,168),260 by 120>  120  168  260  120#1  <Rect (120,168),260 by 120>  120  168  260  120#2  <Rect (120,168),260 by 120>  120  168  260  120#3  <Rect (120,168),260 by 120>  120  168  260  120#4  <Rect (120,168),260 by 120>  120  168  260  120

Solution 4:

Use str.extract, which extracts groups from regex into columns:

df['rect'].str.extract(r'\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)', expand=True)

Result:

     x    y    w    h
0  120  168  260  120
1  120  168  260  120
2  120  168  260  120
3  120  168  260  120
4  120  168  260  120

Solution 5:

This is one of those cases where it makes sense to "optimize" the data itself instead of trying to morph it into what a consumer wants. It's much easier to change clean data into a specialized format than it is to change a specialized format into something portable.

That said, if you really have to parse this, you can do something like

>>> import re
>>> re.findall(r'\d+', '<Rect (120,168),260 by 120>')
['120', '168', '260', '120']
>>>

Post a Comment for "Pandas Extract Numbers From Column Into New Columns"