Subsampling An Unbalanced Dataset In Tensorflow

Tensorflow beginner here. This is my first project and I am working with pre-defined estimators. I have an extremely unbalanced dataset where positive outcomes represent roughly 0.

Solution 1:

You will probably get better results by oversampling your under-represented class rather than throwing away data in your over-represented class. This way you keep the variance in the over-represented class. You might as well use the data you have.

The easiest way to achieve this is probably to create two Datasets, one for each class. Then you can use Dataset.interleave to sample equally from both datasets.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave
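As a rough sketch of the two-dataset idea, `interleave` can alternate between the classes as shown here. The toy data, the variable names, the `filter`-based split, and the `cycle_length`/`block_length` settings are all illustrative assumptions, not something taken from the question:

```python
import tensorflow as tf

# Toy data (assumption): 20 examples, only the first 2 are positive.
features = tf.range(20, dtype=tf.int64)
labels = tf.constant([1, 1] + [0] * 18, dtype=tf.int64)
ds = tf.data.Dataset.from_tensor_slices((features, labels))

def one_class(class_id):
    # All examples of a single class, repeated indefinitely so the
    # rare class never runs out (this is the oversampling part).
    return ds.filter(lambda x, y: tf.equal(y, class_id)).repeat()

# Dataset.range(2) yields the class ids 0 and 1; interleave opens one
# per-class dataset for each id and takes one element from each in turn.
balanced_ds = tf.data.Dataset.range(2).interleave(
    one_class, cycle_length=2, block_length=1
)

# Pull a few elements to inspect the class mix.
sampled_labels = [int(y) for _, y in balanced_ds.take(8)]
print(sampled_labels)
```

Both per-class datasets repeat forever, so the combined dataset is infinite; bound it with `take` or a fixed number of training steps. Shuffling each class before `repeat` would vary which examples come up.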

Solution 2:

Oversampling can be easily achieved with the following code:

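# pos_ds and neg_ds are tf.data.Dataset objects holding the positive and negative examples (requires: import tensorflow as tf)
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.7, 0.3])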
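For context, here is a minimal end-to-end sketch of how `pos_ds` and `neg_ds` might be built before resampling. The toy data and the `filter`/`shuffle`/`repeat` pipeline are assumptions for illustration; the weights are the ones from the line above. It looks up `tf.data.Dataset.sample_from_datasets` first, which newer TensorFlow releases provide, and falls back to the `experimental` name otherwise:

```python
import tensorflow as tf

# Toy data (assumption): 1,000 examples, of which 10 are positive.
features = tf.range(1000, dtype=tf.int64)
labels = tf.concat(
    [tf.ones(10, dtype=tf.int64), tf.zeros(990, dtype=tf.int64)], axis=0
)
ds = tf.data.Dataset.from_tensor_slices((features, labels))

# One dataset per class. repeat() makes each one infinite, so the small
# positive set is revisited as often as the weights require.
pos_ds = ds.filter(lambda x, y: tf.equal(y, 1)).shuffle(10).repeat()
neg_ds = ds.filter(lambda x, y: tf.equal(y, 0)).shuffle(990).repeat()

# Prefer the non-experimental name when this TensorFlow version has it.
sample_from_datasets = (
    getattr(tf.data.Dataset, "sample_from_datasets", None)
    or tf.data.experimental.sample_from_datasets
)
resampled_ds = sample_from_datasets(
    [pos_ds, neg_ds], weights=[0.7, 0.3], seed=0
)

# Check the class mix on a finite slice of the infinite stream.
drawn_labels = [int(y) for _, y in resampled_ds.take(1000)]
positive_fraction = sum(drawn_labels) / len(drawn_labels)
print(f"positive fraction: {positive_fraction:.2f}")
```

With these weights about 70% of the drawn examples should be positive. Because the resampled dataset never ends, training on it needs an explicit number of steps per epoch.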

TensorFlow has a good guide on dealing with unbalanced data; you can find more ideas here: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#oversampling
