The effects of data balancing approaches: A case study (2023)

Abstract

Imbalanced datasets adversely affect the performance of machine learning algorithms. To cope with this problem, several resampling methods have been developed in recent years. In this article, we present a case study investigating the effects of data balancing approaches. The case study concerns the discrimination between growth hormone treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 came from animal studies and were guaranteed to contain growth-stimulating hormones, while the rest were reported to be untreated, making it an imbalanced dataset with a minority class of ∼5%. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model that correctly identifies the samples of treated animals. Furthermore, to cope with the large number of missing data points in the dataset, missing values were replaced with random low values. Our results showed that the replacement method was effective, and that the best-performing models were LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis, in each case after log transformation of the dataset followed by Recursive Feature Elimination.
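The preprocessing and balancing steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation. The toy intensities, the helper names (`fill_missing_low`, `log_transform`, `smote_like`, `balance`), the `low_fraction` range used for the "random low values", and the simplified SMOTE-style interpolation are all assumptions made for illustration. Feature selection and the classifiers themselves are left out.

```python
import math
import random


def fill_missing_low(rows, rng, low_fraction=0.5):
    """Replace each None with a random value below the column's smallest observed value.

    The range [low_fraction * min, min] is an assumption of this sketch.
    """
    n_cols = len(rows[0])
    col_min = [min(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [
        [v if v is not None else rng.uniform(low_fraction * col_min[j], col_min[j])
         for j, v in enumerate(row)]
        for row in rows
    ]


def log_transform(rows):
    """Apply log10 to every (strictly positive) intensity."""
    return [[math.log10(v) for v in row] for row in rows]


def smote_like(minority, n_new, rng, k=3):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(features, labels, rng, minority_label=1):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, y in zip(features, labels) if y == minority_label]
    n_new = (len(labels) - len(minority)) - len(minority)
    synthetic = smote_like(minority, n_new, rng)
    return features + synthetic, labels + [minority_label] * len(synthetic)


rng = random.Random(0)

# Hypothetical toy intensities; None marks a missing data point.
raw = [
    [1200.0, 350.0, None],
    [980.0, None, 40.0],
    [1500.0, 410.0, 55.0],
    [1100.0, 390.0, 47.0],
    [1010.0, 300.0, None],
    [1320.0, 365.0, 52.0],
    [4000.0, 90.0, 300.0],
    [3600.0, None, 280.0],
    [4200.0, 110.0, 310.0],
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # 1 = treated (minority class)

filled = fill_missing_low(raw, rng)
logged = log_transform(filled)
X_bal, y_bal = balance(logged, labels, rng)
print(y_bal.count(0), y_bal.count(1))  # prints "6 6"
```

In practice, oversampling is normally applied to the training split only, so that synthetic samples do not leak into the data used for evaluation.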

Original language	English
Article number	109853
Journal	Applied Soft Computing
Volume	132
DOIs	https://doi.org/10.1016/j.asoc.2022.109853
Publication status	Published - Jan 2023

Keywords

Cattle
Classification
Feature selection
Hormone abuse detection
Imbalanced dataset
LC–MS
Missing data
Resampling
Supervised machine learning

Access to Document

10.1016/j.asoc.2022.109853 (Licence: CC BY)

https://edepot.wur.nl/583554 (Licence: CC BY)

Cite this

Mooijman, P., Catal, C., Tekinerdogan, B., Lommen, A., & Blokland, M. (2023). The effects of data balancing approaches: A case study. Applied Soft Computing, 132, Article 109853. https://doi.org/10.1016/j.asoc.2022.109853

Mooijman, Paul ; Catal, Cagatay ; Tekinerdogan, Bedir et al. / The effects of data balancing approaches : A case study. In: Applied Soft Computing. 2023 ; Vol. 132.

@article{6cf2d37a7ef34ff5ae5b699af15d4e9f,

title = "The effects of data balancing approaches: A case study",

keywords = "Cattle, Classification, Feature selection, Hormone abuse detection, Imbalanced dataset, LC–MS, Missing data, Resampling, Supervised machine learning",

author = "Paul Mooijman and Cagatay Catal and Bedir Tekinerdogan and Arjen Lommen and Marco Blokland",

year = "2023",

month = jan,

doi = "10.1016/j.asoc.2022.109853",

language = "English",

volume = "132",

journal = "Applied Soft Computing",

issn = "1568-4946",

publisher = "Elsevier",

}

Mooijman, P, Catal, C, Tekinerdogan, B, Lommen, A & Blokland, M 2023, 'The effects of data balancing approaches: A case study', Applied Soft Computing, vol. 132, 109853. https://doi.org/10.1016/j.asoc.2022.109853

The effects of data balancing approaches: A case study. / Mooijman, Paul; Catal, Cagatay; Tekinerdogan, Bedir et al.
In: Applied Soft Computing, Vol. 132, 109853, 01.2023.

Research output: Contribution to journal › Article › Academic › peer-review

TY - JOUR

T1 - The effects of data balancing approaches

T2 - A case study

AU - Mooijman, Paul

AU - Catal, Cagatay

AU - Tekinerdogan, Bedir

AU - Lommen, Arjen

AU - Blokland, Marco

PY - 2023/1

Y1 - 2023/1

KW - Cattle

KW - Classification

KW - Feature selection

KW - Hormone abuse detection

KW - Imbalanced dataset

KW - LC–MS

KW - Missing data

KW - Resampling

KW - Supervised machine learning

U2 - 10.1016/j.asoc.2022.109853

DO - 10.1016/j.asoc.2022.109853

M3 - Article

AN - SCOPUS:85143699222

SN - 1568-4946

VL - 132

JO - Applied Soft Computing

JF - Applied Soft Computing

M1 - 109853

ER -

Mooijman P, Catal C, Tekinerdogan B, Lommen A, Blokland M. The effects of data balancing approaches: A case study. Applied Soft Computing. 2023 Jan;132:109853. doi: 10.1016/j.asoc.2022.109853