ImbalancedLearningRegression

TomekLinks

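TomekLinks is an under-sampling method that removes samples belonging to Tomek links from the majority class, the minority class, or both. A Tomek link is a pair of samples from different classes that are each other's nearest neighbors.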
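To make the notion of a Tomek link concrete, here is a minimal pure-Python sketch that finds such pairs in a toy data set. The helper names (nearest_neighbor, find_tomek_links), the Euclidean distance, and the toy values are assumptions made for illustration; this is not the library's internal implementation.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: dist(points[i], points[j]))


def find_tomek_links(points, labels):
    """Return index pairs (i, j) with i < j that are mutual nearest
    neighbors and carry different labels (hypothetical helper)."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        # A Tomek link needs a mutual nearest-neighbor pair of different classes.
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Toy data (made-up values): four "normal" samples and two "rare" samples.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.3, 5.0)]
labels = ["normal", "normal", "normal", "rare", "normal", "rare"]

links = find_tomek_links(points, labels)
print(links)  # [(2, 3), (4, 5)]
```

Samples 0 and 1 are mutual nearest neighbors but share a label, so they do not form a Tomek link; the other two pairs do.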

tomeklinks(data, y, option='majority', drop_na_col=True, drop_na_row=True, rel_thres=0.5, rel_method='auto', rel_xtrm_type='both', rel_coef=1.5, rel_ctrl_pts_rg=None)
Parameters:
  • data (Pandas dataframe) – The dataset to re-sample.

  • y (str) – Column name of the target variable in the Pandas dataframe.

  • option (str) – Which class(es) to under-sample. If majority, resample only the majority class; if minority, resample only the minority class; if both, resample both the majority and minority classes.

  • drop_na_col (bool) – Whether to automatically drop columns containing NaN values. The data frame must not contain any missing values, so keeping the default is recommended.

  • drop_na_row (bool) – Whether to automatically drop rows containing NaN values. The data frame must not contain any missing values, so keeping the default is recommended.

  • rel_thres (float) – Relevance threshold, above which a sample is considered rare. Must be a real number in the interval (0, 1].

  • rel_method (str) – Method to define the relevance function, either auto or manual. If manual, must specify rel_ctrl_pts_rg.

  • rel_xtrm_type (str) – Distribution focus: high, low, or both. If high, rare cases having small y values will be considered normal, and vice versa.

  • rel_coef (float) – Coefficient for the box plot used to identify extreme values of y.

  • rel_ctrl_pts_rg (2D array) – Manually specify the regions of interest. See SMOGN advanced example for more details.

Returns:

Re-sampled dataset.

Return type:

Pandas dataframe

Raises:

ValueError – If an input attribute has the wrong data type or an invalid value, if the relevance values are all zero or all one, or if the synthetic data contains missing values.
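The interplay of rel_thres and option can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helpers label_by_relevance and indices_to_drop are hypothetical, the relevance values and Tomek links are made up, and treating "above the threshold" as a strict comparison is an assumption (the library may handle the boundary differently).

```python
def label_by_relevance(relevance, rel_thres=0.5):
    """Label a sample 'minority' (rare) when its relevance is above rel_thres.
    The strict '>' comparison is an assumption made for this sketch."""
    return ["minority" if r > rel_thres else "majority" for r in relevance]


def indices_to_drop(links, labels, option="majority"):
    """Given Tomek links as index pairs, choose which members to remove
    (hypothetical helper mirroring the documented meaning of `option`)."""
    if option not in ("majority", "minority", "both"):
        raise ValueError("option must be 'majority', 'minority', or 'both'")
    drop = set()
    for pair in links:
        for idx in pair:
            # 'both' removes every linked sample; otherwise only the chosen class.
            if option == "both" or labels[idx] == option:
                drop.add(idx)
    return sorted(drop)


# Made-up relevance values for five samples, and two made-up Tomek links.
relevance = [0.1, 0.2, 0.9, 0.3, 0.95]
labels = label_by_relevance(relevance, rel_thres=0.5)
links = [(1, 2), (3, 4)]

print(indices_to_drop(links, labels, option="majority"))  # [1, 3]
print(indices_to_drop(links, labels, option="minority"))  # [2, 4]
print(indices_to_drop(links, labels, option="both"))      # [1, 2, 3, 4]
```

With the default option, only the normal (majority) member of each link is removed, so the rare samples are preserved.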

References

[1] I. Tomek, “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, pp. 769-772, 1976.

[2] T. Elhassan, M. Aljurf, “Classification of imbalance data using tomek link (t-link) combined with random under-sampling (rus) as a data reduction method,” Global J Technol Optim S, 1, 2016.

Examples

>>> import pandas
>>> from ImbalancedLearningRegression import tomeklinks
>>> housing = pandas.read_csv("https://raw.githubusercontent.com/paobranco/ImbalancedLearningRegression/master/data/housing.csv")
>>> # Under-sample with the default option ('majority') by removing Tomek links
>>> housing_tomeklinks = tomeklinks(data = housing, y = "SalePrice")

© Copyright 2022, Wenglei Wu. Revision 5d0598e5.
