Random Under-sampling

Random Under-sampling is an under-sampling method that randomly selects a subset of the majority samples.
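
The idea can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the function name random_under_sketch, the Boolean is_rare flags, and the keep_frac parameter are hypothetical and exist only for this illustration. Rare samples are all kept, while only a random fraction of the majority samples survives.

```python
import random

def random_under_sketch(values, is_rare, keep_frac, seed=0):
    """Keep every rare sample and a random fraction of the majority samples.

    Hypothetical helper for illustration only; not part of the library.
    """
    rng = random.Random(seed)
    rare = [v for v, r in zip(values, is_rare) if r]
    majority = [v for v, r in zip(values, is_rare) if not r]
    n_keep = round(keep_frac * len(majority))
    # random.sample draws without replacement
    kept = rng.sample(majority, n_keep)
    return rare + kept

values = [10, 11, 12, 13, 14, 15, 16, 17, 95, 99]
is_rare = [v > 50 for v in values]  # assumed rarity rule for this toy data
resampled = random_under_sketch(values, is_rare, keep_frac=0.5)
```

With this toy data, the two rare values are retained and half of the eight majority values are drawn at random, so the result has six entries.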

random_under(data, y, samp_method='balance', drop_na_col=True, drop_na_row=True, replacement=False, manual_perc=False, perc_u=-1, rel_thres=0.5, rel_method='auto', rel_xtrm_type='both', rel_coef=1.5, rel_ctrl_pts_rg=None)

Parameters:

data (Pandas dataframe) – Pandas dataframe, the dataset to re-sample.
y (str) – Column name of the target variable in the Pandas dataframe.
samp_method (str) – Method to determine re-sampling percentage. Either balance or extreme.
drop_na_col (bool) – Whether to automatically drop columns containing NaN values. The data frame should not contain any missing values, so keeping the default is recommended.
drop_na_row (bool) – Whether to automatically drop rows containing NaN values. The data frame should not contain any missing values, so keeping the default is recommended.
replacement (bool) – Whether to randomly select samples with replacement (True) or without replacement (False).
manual_perc (bool) – Whether to use the same re-sampling percentage for all bins. If True, perc_u must be a real number in the open interval (0, 1).
perc_u (float) – User-specified fixed percentage of under-sampling for all bins. Must be a real number in the open interval (0, 1) if manual_perc = True.
rel_thres (float) – Relevance threshold, above which a sample is considered rare. Must be a real number in the interval (0, 1].
rel_method (str) – Method to define the relevance function, either auto or manual. If manual, must specify rel_ctrl_pts_rg.
rel_xtrm_type (str) – Distribution focus: high, low, or both. If high, rare cases with small y values are treated as normal, and vice versa.
rel_coef (float) – Coefficient for the box plot used to identify extreme values.
rel_ctrl_pts_rg (2D array) – Manually specify the regions of interest. See SMOGN advanced example for more details.
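
To make the roles of rel_coef and rel_xtrm_type more concrete, here is a simplified sketch. It assumes that the box plot coefficient acts as the usual Tukey fence multiplier, and it replaces the library's relevance function and rel_thres comparison with a plain Boolean test. The function extreme_flags is hypothetical and not part of the library.

```python
import statistics

def extreme_flags(y, coef=1.5, xtrm_type="both"):
    """Flag values outside the box plot fences as extreme (rare).

    Hypothetical helper: assumes fences at Q1 - coef*IQR and Q3 + coef*IQR.
    """
    q1, _, q3 = statistics.quantiles(y, n=4)
    iqr = q3 - q1
    low_fence = q1 - coef * iqr
    high_fence = q3 + coef * iqr
    flags = []
    for v in y:
        is_low = v < low_fence
        is_high = v > high_fence
        if xtrm_type == "high":
            flags.append(is_high)            # only large values count as rare
        elif xtrm_type == "low":
            flags.append(is_low)             # only small values count as rare
        elif xtrm_type == "both":
            flags.append(is_low or is_high)  # both tails count as rare
        else:
            raise ValueError("xtrm_type must be 'high', 'low', or 'both'")
    return flags

y = [1, 20, 21, 22, 23, 24, 25, 26, 27, 60]
both = extreme_flags(y)
high_only = extreme_flags(y, xtrm_type="high")
low_only = extreme_flags(y, xtrm_type="low")
```

In this toy data, 1 and 60 fall outside the fences. With "high" only 60 is flagged, with "low" only 1 is flagged, and with "both" each of them is flagged. A larger coef widens the fences, so fewer values count as extreme.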

Returns:

Re-sampled dataset.

Return type:

Pandas dataframe

Raises:

ValueError – If an input attribute has the wrong data type or an invalid value, if the relevance values are all zero or all one, or if the under-sampled data contains missing values.

Examples

>>> import pandas
>>> from ImbalancedLearningRegression import random_under
>>> housing = pandas.read_csv("https://raw.githubusercontent.com/paobranco/ImbalancedLearningRegression/master/data/housing.csv")
>>> housing_ru = random_under(data = housing, y = "SalePrice")