Delete Specific Rows From CSV Using Pandas
Solution 1:
sample DataFrame built with @andrew_reece's code
In [9]: df
Out[9]:
           center         left         right  steering  throttle  brake
0   center_54.jpg  left_75.jpg  right_39.jpg         1         0      0
1   center_20.jpg  left_81.jpg  right_49.jpg         3         1      1
2   center_34.jpg  left_96.jpg  right_11.jpg         0         4      2
3   center_98.jpg  left_87.jpg  right_34.jpg         0         0      0
4   center_67.jpg  left_12.jpg  right_28.jpg         1         1      0
5   center_11.jpg  left_25.jpg  right_94.jpg         2         1      0
6   center_66.jpg  left_27.jpg  right_52.jpg         1         3      3
7   center_18.jpg  left_50.jpg  right_17.jpg         0         0      4
8   center_60.jpg  left_25.jpg  right_28.jpg         2         4      1
9   center_98.jpg  left_97.jpg  right_55.jpg         3         3      0
..            ...          ...           ...       ...       ...    ...
90  center_31.jpg  left_90.jpg  right_43.jpg         0         1      0
91  center_29.jpg   left_7.jpg  right_30.jpg         3         0      0
92  center_37.jpg  left_10.jpg  right_15.jpg         1         0      0
93  center_18.jpg   left_1.jpg  right_83.jpg         3         1      1
94  center_96.jpg  left_20.jpg  right_56.jpg         3         0      0
95  center_37.jpg  left_40.jpg  right_38.jpg         0         3      1
96  center_73.jpg  left_86.jpg  right_71.jpg         0         1      0
97  center_85.jpg  left_31.jpg   right_0.jpg         3         0      4
98  center_34.jpg  left_52.jpg  right_40.jpg         0         0      2
99  center_91.jpg  left_46.jpg  right_17.jpg         0         0      0
[100 rows x 6 columns]
In [10]: df.steering.value_counts()
Out[10]:
0    43    # NOTE: 43 zeros
1    18
2    15
4    12
3    12
Name: steering, dtype: int64
In [11]: df.shape
Out[11]: (100, 6)
your solution (unchanged):
In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)
In [13]: df.steering.value_counts()
Out[13]:
1    18
2    15
4    12
3    12
0     4    # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64
In [14]: df.shape
Out[14]: (61, 6)
NOTE: make sure that the steering column has a numeric dtype! If it's a string (object), you would need to change your code as follows:
df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
# NOTE:                          ^ ^  (the 0 is quoted)
After that, you can save the modified (reduced) DataFrame to CSV:
df.to_csv('/path/to/filename.csv', index=False)
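For reference, the whole read, filter, and write round trip can be sketched as below. This is only a minimal sketch: the CSV text, the helper name drop_most_zeros, and the seed argument are assumptions made for illustration, and an in-memory buffer stands in for the real file path.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the real file on disk:
# 40 rows with steering == 0 followed by 10 rows with steering == 1.
csv_text = "center,steering\n" + "".join(
    "c_{}.jpg,{}\n".format(i, 0 if i < 40 else 1) for i in range(50)
)


def drop_most_zeros(df, frac=0.90, seed=None):
    """Drop a random `frac` share of the rows whose steering value is 0."""
    to_drop = df.query("steering == 0").sample(frac=frac, random_state=seed).index
    return df.drop(to_drop)


df = pd.read_csv(io.StringIO(csv_text))
reduced = drop_most_zeros(df, frac=0.90, seed=0)

# Write to an in-memory buffer here; pass a file path instead for a real file.
out = io.StringIO()
reduced.to_csv(out, index=False)
```

With 40 zero rows and frac=0.90, 36 of them are dropped and 4 remain, so reduced has 14 rows. Passing a seed only makes the random choice repeatable; leave it as None for a fresh sample on each run.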
Solution 2:
Here's a one-line approach, using concat() and sample():
import numpy as np
import pandas as pd
# first, some sample data
# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)
# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)
print(df.shape)
(100, 6)
The first few rows of sample output:
df.head()
          center         left         right  steering  throttle  brake
0  center_72.jpg  left_26.jpg  right_59.jpg         3         3      0
1  center_75.jpg  left_68.jpg  right_26.jpg         0         0      2
2  center_29.jpg   left_8.jpg  right_88.jpg         0         1      0
3  center_22.jpg  left_26.jpg  right_23.jpg         1         0      0
4  center_88.jpg   left_0.jpg  right_56.jpg         4         1      0
5  center_93.jpg  left_18.jpg  right_15.jpg         0         0      0
Now drop all but 10% of rows with steering==0:
newdf = pd.concat([df.loc[df.steering!=0],
df.loc[df.steering==0].sample(frac=0.1)])
With the probability weightings I used in this example, you'll see somewhere between 50 and 60 remaining entries in newdf, with about 5 steering==0 cases remaining.
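Two details of the concat() approach may be worth seeing in isolation: the sampled zero rows are appended after the non-zero rows, and the sample changes on every run. The sketch below is an illustration on assumed toy data, not part of the original answer; random_state=42 and the trailing sort_index() are additions.

```python
import pandas as pd

# Toy data: 50 zeros followed by 50 non-zero steering values.
df = pd.DataFrame({"steering": [0] * 50 + [1, 2, 3, 4, 1] * 10})

is_zero = df.steering == 0
newdf = pd.concat([
    df.loc[~is_zero],                                   # keep every non-zero row
    df.loc[is_zero].sample(frac=0.1, random_state=42),  # keep 10% of the zeros
]).sort_index()  # restore the original row order after concatenation

print(newdf.steering.value_counts())
```

Here exactly 5 of the 50 zero rows survive, because sample(frac=0.1) picks a fixed number of rows rather than flipping a coin per row.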
Solution 3:
Using a mask on steering combined with a random number should work:
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]
This does generate some extra random values, but it's nice and compact.
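One difference from the sample()-based solutions is that the mask keeps each zero row independently with probability 0.1, so the number of surviving zeros varies from run to run instead of being exactly 10%. A small sketch on assumed toy data, using a seeded NumPy Generator purely so the run is repeatable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Toy data: 1000 zero rows and 1000 non-zero rows.
df = pd.DataFrame({"steering": [0] * 1000 + [1] * 1000})

# One uniform draw per row; a zero row survives only when its draw is < 0.1.
keep = (df.steering != 0) | (rng.random(len(df)) < 0.1)
reduced = df[keep]

n_zeros = int((reduced.steering == 0).sum())
print(n_zeros)  # near 100, but not guaranteed to be exactly 100
```

If an exact share matters, the sample(frac=...) variants are the safer choice; if approximate is fine, the mask avoids building an intermediate index.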
Edit: That said, I tried your example code and it worked as well. My guess is the error is coming from the fact that your df.query() statement is returning an empty dataframe, which probably means that the "steering" column does not contain any zeros, or alternatively that the column is read as strings rather than numeric. Try converting the column to integer before running the above snippet.
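A quick way to check for the string case and convert the column is sketched below. The toy frame is an assumption standing in for the real CSV data; pd.to_numeric is used here as one possible conversion, and astype(int) would also work when every value is a clean integer string.

```python
import pandas as pd

# Toy frame whose steering column was read as strings, e.g. from a quoted CSV.
df = pd.DataFrame({"steering": ["0", "1", "0", "2", "0", "0"]})

# False here: the column holds strings, not numbers.
was_numeric = pd.api.types.is_numeric_dtype(df["steering"])

# Convert to numbers; errors="raise" surfaces any value that is not numeric.
df["steering"] = pd.to_numeric(df["steering"], errors="raise")

# The numeric comparison now finds the zero rows.
zeros = df.query("steering == 0")
print(was_numeric, df["steering"].dtype, len(zeros))
```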