Introduction of Gaussian Noise
Introduction of Gaussian Noise is an over-sampling method that synthesizes new samples by applying small perturbations to the numeric attributes and the target variable of the seed samples. It can optionally be combined with random under-sampling.
- gn(data, y, pert=0.02, samp_method='balance', under_samp=True, drop_na_col=True, drop_na_row=True, replace=False, manual_perc=False, perc_u=-1, perc_o=-1, rel_thres=0.5, rel_method='auto', rel_xtrm_type='both', rel_coef=1.5, rel_ctrl_pts_rg=None)
- Parameters:
data (Pandas dataframe) – Pandas dataframe, the dataset to re-sample.
y (str) – Column name of the target variable in the Pandas dataframe.
pert (float) – Perturbation amplitude. Must be a real number in the interval (0, 1].
samp_method (str) – Method to determine re-sampling percentage. Either balance or extreme.
under_samp (bool) – If True, random under-sampling will be conducted on the normal bins.
drop_na_col (bool) – Whether to automatically drop columns containing NaN values. The data frame should not contain any missing values, so keeping the default is recommended.
drop_na_row (bool) – Whether to automatically drop rows containing NaN values. The data frame should not contain any missing values, so keeping the default is recommended.
replace (bool) – For the decimal part of the over-sampling percentage, a subset of the original dataset is chosen as base samples to which noise is added; this selection can be made with or without replacement.
manual_perc (bool) – Keep the same percentage of re-sampling for all bins. If True, perc_u is required to be a real number in the open interval (0, 1), and perc_o is required to be a positive real number.
perc_u (float) – User-specified fixed percentage of under-sampling for all bins. Must be a real number in the open interval (0, 1) if manual_perc = True.
perc_o (float) – User-specified fixed percentage of over-sampling for all bins. Must be a positive real number if manual_perc = True.
rel_thres (float) – Relevance threshold, above which a sample is considered rare. Must be a real number in the interval (0, 1].
rel_method (str) – Method to define the relevance function, either auto or manual. If manual, rel_ctrl_pts_rg must be specified.
rel_xtrm_type (str) – Distribution focus, high, low, or both. If high, rare cases having small y values will be considered as normal, and vice versa.
rel_coef (float) – Coefficient for box plot.
rel_ctrl_pts_rg (2D array) – Manually specify the regions of interest. See SMOGN advanced example for more details.
- Returns:
Re-sampled dataset.
- Return type:
Pandas dataframe
- Raises:
ValueError – If an input attribute has the wrong data type or an invalid value, if the relevance values are all zero or all one, or if the synthetic data contains missing values.
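The value constraints listed under Parameters can be checked up front, before any data is touched. The sketch below is only an illustration of those documented ranges: check_gn_args is a hypothetical helper, not part of the library, and the library's own validation may differ in order and wording.

```python
def check_gn_args(pert=0.02, samp_method="balance", manual_perc=False,
                  perc_u=-1, perc_o=-1, rel_thres=0.5, rel_method="auto",
                  rel_xtrm_type="both", rel_ctrl_pts_rg=None):
    """Hypothetical pre-flight check mirroring the documented constraints."""
    # pert and rel_thres must lie in the half-open interval (0, 1].
    if not 0 < pert <= 1:
        raise ValueError("pert must be in (0, 1]")
    if not 0 < rel_thres <= 1:
        raise ValueError("rel_thres must be in (0, 1]")
    if samp_method not in ("balance", "extreme"):
        raise ValueError("samp_method must be 'balance' or 'extreme'")
    # Fixed percentages are only constrained when manual_perc is True.
    if manual_perc:
        if not 0 < perc_u < 1:
            raise ValueError("perc_u must be in (0, 1) when manual_perc is True")
        if not perc_o > 0:
            raise ValueError("perc_o must be positive when manual_perc is True")
    if rel_method not in ("auto", "manual"):
        raise ValueError("rel_method must be 'auto' or 'manual'")
    # A manual relevance function needs user-supplied control points.
    if rel_method == "manual" and rel_ctrl_pts_rg is None:
        raise ValueError("rel_ctrl_pts_rg is required when rel_method is 'manual'")
    if rel_xtrm_type not in ("high", "low", "both"):
        raise ValueError("rel_xtrm_type must be 'high', 'low', or 'both'")


# The documented defaults pass; the default perc_u = -1 is rejected
# as soon as manual_perc is switched on.
check_gn_args()
try:
    check_gn_args(manual_perc=True, perc_o=0.5)
except ValueError as err:
    print(err)
```

Note that the defaults perc_u = -1 and perc_o = -1 are placeholders, so both values have to be supplied explicitly whenever manual_perc = True.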
References
[1] P. Branco, L. Torgo, R. P. Ribeiro, “Pre-processing approaches for imbalanced distributions in regression,” Neurocomputing, 343, pp. 76-99, 2019.
Examples
>>> import pandas
>>> from ImbalancedLearningRegression import gn
>>> housing = pandas.read_csv("https://raw.githubusercontent.com/paobranco/ImbalancedLearningRegression/master/data/housing.csv")
>>> housing_gn = gn(data = housing, y = "SalePrice")
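To give a feel for what "introducing small perturbations" means, here is a minimal sketch of the perturbation step alone. It is not the library's implementation. It assumes that the noise for each numeric column is drawn from a normal distribution with mean 0 and standard deviation pert times that column's sample standard deviation; add_gaussian_noise and the toy data are hypothetical, and non-numeric columns are simply left unchanged.

```python
import numpy as np
import pandas as pd


def add_gaussian_noise(seeds, pert=0.02, rng=None):
    """Hypothetical helper: perturb every numeric column of `seeds`.

    Assumption: each column receives noise drawn from N(0, (pert * std)^2),
    where std is that column's sample standard deviation.
    """
    if not 0 < pert <= 1:
        raise ValueError("pert must be in (0, 1]")
    rng = np.random.default_rng(rng)
    synth = seeds.copy()
    # Only numeric columns (including the target) are perturbed.
    for col in synth.select_dtypes(include="number").columns:
        scale = pert * synth[col].std()
        synth[col] = synth[col] + rng.normal(0.0, scale, size=len(synth))
    return synth


# Toy seed samples; the column names are illustrative only.
seeds = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500.0, 181500.0, 223500.0, 140000.0],
})
synthetic = add_gaussian_noise(seeds, pert=0.02, rng=0)
print(synthetic)
```

Under this assumption a larger pert spreads the synthetic samples further from their seeds, which is why the default of 0.02 keeps them close to the original observations.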