
How To Split A Dataframe In Pandas In Predefined Percentages?

I have a pandas dataframe sorted by a number of columns. Now I'd like to split the dataframe in predefined percentages, so as to extract and name a few segments. For example, I wa

Solution 1:

Use numpy.split:

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])

Sample:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074

print (b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395

print (c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625
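
The same ordered split can also be written with plain positional slicing. This may be worth knowing because calling np.split on a DataFrame may emit a deprecation warning in some recent pandas releases, whereas iloc always returns DataFrames. A minimal sketch, assuming the same 20% and 50% cut points as above; the names first_cut and second_cut are illustrative and not part of the answer:

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list('ABCDE'))

# Row positions where the first two segments end: 20% and 50% of the rows.
first_cut = int(.2 * len(df))
second_cut = int(.5 * len(df))

# iloc slices by position, so the existing sort order is preserved.
a = df.iloc[:first_cut]
b = df.iloc[first_cut:second_cut]
c = df.iloc[second_cut:]

print(len(a), len(b), len(c))  # 4 6 10
```

The three pieces are contiguous and cover the whole frame, so concatenating them gives back the original dataframe.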

Solution 2:

  1. Create a dataframe from a random 70% sample of the original rows:

     part_1 = df.sample(frac = 0.7)

  2. Create a second dataframe from the remaining 30% of the rows:

     part_2 = df.drop(part_1.index)
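
The two steps above can be put together as a runnable sketch. It assumes a random split is acceptable, since sample() does not keep the original row order. The random_state=0 argument and the sort_index() calls are additions for reproducibility and for restoring the original order inside each part; they are not part of the answer.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list('ABCDE'))

# Randomly pick 70% of the rows; random_state makes the pick repeatable.
part_1 = df.sample(frac=0.7, random_state=0)

# Every row whose label was not picked above forms the remaining 30%.
part_2 = df.drop(part_1.index)

# sample() returns rows in shuffled order; sort_index() restores the
# original order within each part.
part_1 = part_1.sort_index()
part_2 = part_2.sort_index()

print(part_1.shape, part_2.shape)  # (14, 5) (6, 5)
```

Note that drop() removes rows by index label, so this approach relies on the dataframe having a unique index.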

Solution 3:

I've written a simple function that does the job; it may help you.

P.S:

  • The fractions must sum to 1.
  • It returns len(fracs) new dataframes, so the fractions list can be as long as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).

    import math

    import numpy as np
    import pandas as pd

    np.random.seed(100)
    df = pd.DataFrame(np.random.random((99,4)))
    
    def split_by_fractions(df:pd.DataFrame, fracs:list, random_state:int=42):
        # compare with a tolerance, since floating-point sums can miss 1.0 by a rounding error
        assert math.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
        # one-column frame holding the row labels that have not been assigned yet
        remain = df.index.copy().to_frame()
        res = []
        for i in range(len(fracs)):
            # rescale the fraction relative to the rows that are still unassigned
            fractions_sum=sum(fracs[i:])
            frac = fracs[i]/fractions_sum
            idxs = remain.sample(frac=frac, random_state=random_state).index
            remain=remain.drop(idxs)
            res.append(idxs)
        return [df.loc[idxs] for idxs in res]
    
    train,test,val = split_by_fractions(df, [0.8,0.1,0.1]) # e.g: [train, test, validation]
    print(train.shape, test.shape, val.shape)
    

    outputs:

    (79, 4) (10, 4) (10, 4)
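
A similar random partition can also be obtained by shuffling the rows once and then cutting at cumulative boundaries. This is a sketch of a different technique, not the function above: split_shuffled is a hypothetical name, and because the boundaries are rounded, piece sizes may differ by a row from what repeated sampling gives for other inputs.

```python
import numpy as np
import pandas as pd

def split_shuffled(df, fracs, random_state=42):
    """Hypothetical helper: shuffle df once, then slice it by cumulative fractions."""
    if not np.isclose(sum(fracs), 1.0):
        raise ValueError('fractions must sum to 1.0, got {}'.format(sum(fracs)))
    # frac=1 returns every row exactly once, in random order.
    shuffled = df.sample(frac=1, random_state=random_state)
    # Interior cut positions, e.g. [0.8, 0.1, 0.1] on 99 rows -> [79, 89].
    bounds = np.round(np.cumsum(fracs)[:-1] * len(df)).astype(int)
    starts = [0, *bounds]
    stops = [*bounds, len(df)]
    return [shuffled.iloc[start:stop] for start, stop in zip(starts, stops)]

np.random.seed(100)
df = pd.DataFrame(np.random.random((99, 4)))

train, test, val = split_shuffled(df, [0.8, 0.1, 0.1])
print(train.shape, test.shape, val.shape)  # (79, 4) (10, 4) (10, 4)
```

Because every row appears exactly once in the shuffled frame and the slices are contiguous, the pieces are disjoint and together cover the whole dataframe.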
    
