Skip to content Skip to sidebar Skip to footer

How To Identify The Ultimate Parent From A Nested Table Structure Using Python?

I have the following table: My question is: how do I programmatically identify the ultimate parent? Here are the rules explained through an example: the id 5.0's parent is 51.0.

Solution 1:

In the same vein as @Vaishali's answer, here is a version that uses Python looping over the major operations, but uses np / pd operations within the dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame(
        { 'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
        'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
        }
    )

deffind_ultimate_parents(df):
    # Make a copy of df, using 'id' as the index so we can lookup parent ids
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']

    # Next-parent-2 not null - fake it for now
    np2nn = df2['nextpar'].notnull()

    while np2nn.any():
        # Lookup df2[parent-id], since the index is now by id. Get the# parent-id (of the parent-id), put that value in nextpar2.# So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.# Set na_action='ignore' so any Nan doesn't bother looking up, just copies# the Nan to the next generation.
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')

        # Re-evaluate who is a Nan in the nextpar2 column.
        np2nn = df2['nextpar2'].notnull()

        # Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop# at the root.
        df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']

    # At this point, we've run out of parents to look up. df2['nextpar'] has# the "ultimate" parents.return df2['nextpar']


df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)

The loop guard checks np2nn.any() which is a vector op on a boolean Series. Each pass through the loop looks up the "next parent", so the number of passes through the loop will be the maximum depth of any child-parent chain. The worst case in O(N), for a list like 1->2->3->4->...->n. The best case is 0, for a list with no parents.

The loop does a .map with na_action='ignore' to simply propagate Nan values. This is O(fast-N) times the cost of the index lookup, which should be O(1).

With the nextpar2 field computed, the loop re-computes np2nn using a simple .notnull() which again is O(fast-N).

Finally, the nextpar field is updated from nextpar2, which again should be O(fast-N).

Thus, worst-case performance is O(slow-N * fast-N), which is , but it's a Pandas-N², not a Python-N². Average-case should be O(slow-m * fast-N) where m is the average-case maximum tree depth, and best case is O(fast-N) for 1 fast pass through the rows.

Solution 2:

Here is one solution with use of map and combine_first. First create a dictionary from the df values for mapping. Now use map on parent_id to map those values first and then use map again to map values to id. Combine_first will ensure that values mapped from parent_id gets precedence. Final combine_first to fill out NaN values with id.

d = final_df.dropna().set_index('id').to_dict()
final_df['ultimate_parent_id'] = 
final_df['parent_id'].map(d['parent_id'])\
.combine_first(final_df['id'].map(d['parent_id']))\
.combine_first(final_df['id'])

You get

    id      parent_id   ultimate_parent_id
05.051.051.016.01.010.022.0NaN2.0351.0NaN51.041.010.010.0570.0NaN70.0610.0NaN10.0

Solution 3:

Let's first cleanup the DataFrame and get rid of the nans. A negative number is a good replacement:

original_df = original_df.fillna(-1).astype(int)

Convert the DataFrame into a dictionary:

d = original_df.set_index('id').to_dict()['parent_id']
#{1: 10, 2: -1, 51: -1, 5: 51, 6: 1, 10: -1, 70: -1}

Now, you need a recursive function to translate an ID into the ultimate parent ID:

deftranslate(x):
    return x if d[x] == -1else translate(d[x])

Apply the recursive function to each dictionary key, collect the results into another DataFrame:

ultimate = pd.DataFrame(pd.Series({x: translate(x) for x in d.keys()}), 
                 columns=('ultimate_parent_id', ))

Combine the result with the original DataFrame:

original_df.merge(ultimate, left_on='id', right_index=True)

#   id  parent_id  ultimate_parent_id#0   5         51                  51#1   6          1                  10#2   2         -1                   2#3  51         -1                  51#4   1         10                  10#5  70         -1                  70#6  10         -1                  10

Solution 4:

Adding to @adhast's answer, the last line the function(find_ultimate_parents(df)) should be

return df2['nextpar'].values

df2 is using df['id'] as an index so doesn't correspond to df's index.

Below is the complete script.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    { 'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
    'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
    }
)

deffind_ultimate_parents(df):
    # Make a copy of df, using 'id' as the index so we can lookup parent ids
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']

    # Next-parent-2 not null - fake it for now
    np2nn = df2['nextpar'].notnull()

    while np2nn.any():
        # Lookup df2[parent-id], since the index is now by id. Get the# parent-id (of the parent-id), put that value in nextpar2.# So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.# Set na_action='ignore' so any Nan doesn't bother looking up, just copies# the Nan to the next generation.
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')

        # Re-evaluate who is a Nan in the nextpar2 column.
        np2nn = df2['nextpar2'].notnull()

        # Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop# at the root.
        df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']

    # At this point, we've run out of parents to look up. df2['nextpar'] has# the "ultimate" parents.return df2['nextpar'].values


df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)

Post a Comment for "How To Identify The Ultimate Parent From A Nested Table Structure Using Python?"