Random Seed Chose Different Rows

December 27, 2023

I was applying .sample with random_state set to a constant and after using set_index it started selecting different rows. A member dropped that was previously included in the subse

Solution 1:

Applying .sort_index() after reading in the data and before calling .sample() corrected the issue. As long as the data remains the same, this produces the same sample every time, because sorting puts every row back in the same position.
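As a minimal sketch of why this works, the example below builds the same rows in two different orders and samples both with one seed. The frame `members`, its columns `member_id` and `score`, and the seeds are hypothetical and chosen only for illustration; the index is assumed to be unique.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 members keyed by a unique member_id.
members = pd.DataFrame({"member_id": np.arange(1000), "score": np.arange(1000) * 2})
members = members.set_index("member_id")

# The same rows in a different order, e.g. after a re-export or a merge.
reordered = members.sample(frac=1, random_state=7)

# Without sorting, the same seed picks the same positions in both frames,
# and those positions may hold different rows.
raw_a = members.sample(n=5, random_state=0)
raw_b = reordered.sample(n=5, random_state=0)

# After sort_index(), both frames have the same row order, so the samples match.
sorted_a = members.sort_index().sample(n=5, random_state=0)
sorted_b = reordered.sort_index().sample(n=5, random_state=0)

print(sorted_a.equals(sorted_b))  # True
```

Sorting only helps when the index identifies each row; with duplicate index labels the order within a label can still vary.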

Solution 2:

When sampling rows (without weights), the only things that matter besides the seed are n (the sample size), the number of rows in the DataFrame, and whether you sample with replacement. These determine the .iloc positions to take, regardless of the data or the index.

For rows, sampling occurs as:

axis_length = self.shape[0]  # DataFrame length

rs = pd.core.common.random_state(random_state)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)  # np.random.choice
return self.take(locs, axis=axis, is_copy=False)

Just to illustrate the point

Sample Data

import pandas as pd
import numpy as np

n = 100000
np.random.seed(123)
df = pd.DataFrame({'id': list(range(n)), 'gender': np.random.choice(['M', 'F'], n)})
# A scalar 'M' is broadcast to every row; np.nan replaces np.NaN, which NumPy 2.0 removed
df1 = pd.DataFrame({'id': list(range(n)), 'gender': 'M'},
                    index=np.random.choice(['foo', 'bar', np.nan], n)).assign(blah=1)

Sampling will always choose the row at integer position 42083, i.e. df.iloc[42083], for this seed and length:

df.sample(n=1, random_state=123)
#           id gender
# 42083  42083      M

df1.sample(n=1, random_state=123)
#         id gender  blah
# foo  42083      M     1

df1.reset_index().shift(10).sample(n=1, random_state=123)
#       index       id gender  blah
# 42083   nan  42073.0      M   1.0

Even numpy:

np.random.seed(123)
np.random.choice(df.shape[0], size=1, replace=False)
#array([42083])
