How To Identify The Ultimate Parent From A Nested Table Structure Using Python?
Solution 1:
In the same vein as @Vaishali's answer, here is a version that uses Python looping over the major operations, but uses np / pd
operations within the dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{ 'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
}
)
deffind_ultimate_parents(df):
# Make a copy of df, using 'id' as the index so we can lookup parent ids
df2 = df.set_index(df['id'])
df2['nextpar'] = df2['parent_id']
# Next-parent-2 not null - fake it for now
np2nn = df2['nextpar'].notnull()
while np2nn.any():
# Lookup df2[parent-id], since the index is now by id. Get the# parent-id (of the parent-id), put that value in nextpar2.# So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.# Set na_action='ignore' so any Nan doesn't bother looking up, just copies# the Nan to the next generation.
df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')
# Re-evaluate who is a Nan in the nextpar2 column.
np2nn = df2['nextpar2'].notnull()
# Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop# at the root.
df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']
# At this point, we've run out of parents to look up. df2['nextpar'] has# the "ultimate" parents.return df2['nextpar']
df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)
The loop guard checks np2nn.any()
which is a vector op on a boolean Series. Each pass through the loop looks up the "next parent", so the number of passes through the loop will be the maximum depth of any child-parent chain. The worst case in O(N), for a list like 1->2->3->4->...->n. The best case is 0, for a list with no parents.
The loop does a .map
with na_action='ignore'
to simply propagate Nan values. This is O(fast-N) times the cost of the index lookup, which should be O(1).
With the nextpar2
field computed, the loop re-computes np2nn
using a simple .notnull()
which again is O(fast-N).
Finally, the nextpar
field is updated from nextpar2,
which again should be O(fast-N).
Thus, worst-case performance is O(slow-N * fast-N), which is N², but it's a Pandas-N², not a Python-N². Average-case should be O(slow-m * fast-N) where m is the average-case maximum tree depth, and best case is O(fast-N) for 1 fast pass through the rows.
Solution 2:
Here is one solution with use of map and combine_first. First create a dictionary from the df values for mapping. Now use map on parent_id to map those values first and then use map again to map values to id. Combine_first will ensure that values mapped from parent_id gets precedence. Final combine_first to fill out NaN values with id.
d = final_df.dropna().set_index('id').to_dict()
final_df['ultimate_parent_id'] =
final_df['parent_id'].map(d['parent_id'])\
.combine_first(final_df['id'].map(d['parent_id']))\
.combine_first(final_df['id'])
You get
id parent_id ultimate_parent_id
05.051.051.016.01.010.022.0NaN2.0351.0NaN51.041.010.010.0570.0NaN70.0610.0NaN10.0
Solution 3:
Let's first cleanup the DataFrame and get rid of the nan
s. A negative number is a good replacement:
original_df = original_df.fillna(-1).astype(int)
Convert the DataFrame into a dictionary:
d = original_df.set_index('id').to_dict()['parent_id']
#{1: 10, 2: -1, 51: -1, 5: 51, 6: 1, 10: -1, 70: -1}
Now, you need a recursive function to translate an ID into the ultimate parent ID:
deftranslate(x):
return x if d[x] == -1else translate(d[x])
Apply the recursive function to each dictionary key, collect the results into another DataFrame:
ultimate = pd.DataFrame(pd.Series({x: translate(x) for x in d.keys()}),
columns=('ultimate_parent_id', ))
Combine the result with the original DataFrame:
original_df.merge(ultimate, left_on='id', right_index=True)
# id parent_id ultimate_parent_id#0 5 51 51#1 6 1 10#2 2 -1 2#3 51 -1 51#4 1 10 10#5 70 -1 70#6 10 -1 10
Solution 4:
Adding to @adhast's answer, the last line the function(find_ultimate_parents(df)) should be
return df2['nextpar'].values
df2 is using df['id'] as an index so doesn't correspond to df's index.
Below is the complete script.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{ 'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
}
)
deffind_ultimate_parents(df):
# Make a copy of df, using 'id' as the index so we can lookup parent ids
df2 = df.set_index(df['id'])
df2['nextpar'] = df2['parent_id']
# Next-parent-2 not null - fake it for now
np2nn = df2['nextpar'].notnull()
while np2nn.any():
# Lookup df2[parent-id], since the index is now by id. Get the# parent-id (of the parent-id), put that value in nextpar2.# So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.# Set na_action='ignore' so any Nan doesn't bother looking up, just copies# the Nan to the next generation.
df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')
# Re-evaluate who is a Nan in the nextpar2 column.
np2nn = df2['nextpar2'].notnull()
# Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop# at the root.
df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']
# At this point, we've run out of parents to look up. df2['nextpar'] has# the "ultimate" parents.return df2['nextpar'].values
df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)
Post a Comment for "How To Identify The Ultimate Parent From A Nested Table Structure Using Python?"