Abstract
Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing approaches. The case study concerns the discrimination between growth hormone treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 specimens were from animal studies and guaranteed to contain growth-stimulating hormones while the rest has been reported to be untreated, making it a ∼5% imbalanced dataset. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model to correctly identify the samples of treated animals. Furthermore, to cope with a large number of missing data points in the given dataset, a replacement with random low values strategy was applied. Our results showed that the replacement method was effective, and LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis were the best performing models after log transformation of the dataset was followed by Recursive Feature Elimination.
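The preprocessing and balancing steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses only the Python standard library, the toy data are invented, and the helper names (`fill_missing_low`, `log_transform`, `smote_like`) and the `low_fraction` parameter are hypothetical. The interpolation step follows the general idea behind SMOTE rather than any specific library's behaviour. Recursive Feature Elimination and the classifiers are left out.

```python
import math
import random


def fill_missing_low(rows, rng, low_fraction=0.5):
    """Replace None with a random value at or below the column's smallest observed value.

    Assumption: "low" means between low_fraction * column minimum and the column minimum.
    Every column needs at least one observed, positive value.
    """
    n_cols = len(rows[0])
    col_min = [min(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [
        [r[j] if r[j] is not None
         else rng.uniform(low_fraction * col_min[j], col_min[j])
         for j in range(n_cols)]
        for r in rows
    ]


def log_transform(rows):
    """Apply a base-10 log to every (positive) value."""
    return [[math.log10(v) for v in r] for r in rows]


def smote_like(minority, n_new, rng, k=1):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style idea).
    Requires at least two minority samples.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


# Invented toy data: 2 "treated" (label 1) and 6 "untreated" (label 0) samples.
rng = random.Random(0)
X = [
    [120.0, None, 30.0],
    [100.0, 8.0, None],
    [50.0, 9.0, 28.0],
    [55.0, None, 27.0],
    [48.0, 11.0, 26.0],
    [52.0, 10.0, None],
    [47.0, 12.0, 29.0],
    [51.0, 9.5, 31.0],
]
y = [1, 1, 0, 0, 0, 0, 0, 0]

X_filled = fill_missing_low(X, rng)
X_log = log_transform(X_filled)

minority = [x for x, label in zip(X_log, y) if label == 1]
n_new = y.count(0) - y.count(1)  # synthetic samples needed to balance the classes
X_bal = X_log + smote_like(minority, n_new, rng, k=1)
y_bal = y + [1] * n_new
```

In practice, oversampling would normally be applied to the training folds only, so that synthetic samples do not leak into the evaluation data.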
Original language | English |
---|---|
Article number | 109853 |
Journal | Applied Soft Computing |
Volume | 132 |
DOIs | https://doi.org/10.1016/j.asoc.2022.109853 |
Publication status | Published - Jan 2023 |
Keywords
- Cattle
- Classification
- Feature selection
- Hormone abuse detection
- Imbalanced dataset
- LC–MS
- Missing data
- Resampling
- Supervised machine learning
Access to Document
10.1016/j.asoc.2022.109853 (Licence: CC BY)
https://edepot.wur.nl/583554 (Licence: CC BY)
Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver
Mooijman, P., Catal, C., Tekinerdogan, B., Lommen, A., & Blokland, M. (2023). The effects of data balancing approaches: A case study. Applied Soft Computing, 132, Article 109853. https://doi.org/10.1016/j.asoc.2022.109853
Mooijman, Paul; Catal, Cagatay; Tekinerdogan, Bedir et al. / The effects of data balancing approaches: A case study. In: Applied Soft Computing. 2023; Vol. 132.
@article{6cf2d37a7ef34ff5ae5b699af15d4e9f,
title = "The effects of data balancing approaches: A case study",
abstract = "Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing approaches. The case study concerns the discrimination between growth hormone treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 specimens were from animal studies and guaranteed to contain growth-stimulating hormones while the rest has been reported to be untreated, making it a ∼5% imbalanced dataset. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model to correctly identify the samples of treated animals. Furthermore, to cope with a large number of missing data points in the given dataset, a replacement with random low values strategy was applied. Our results showed that the replacement method was effective, and LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis were the best performing models after log transformation of the dataset was followed by Recursive Feature Elimination.",
keywords = "Cattle, Classification, Feature selection, Hormone abuse detection, Imbalanced dataset, LC–MS, Missing data, Resampling, Supervised machine learning",
author = "Paul Mooijman and Cagatay Catal and Bedir Tekinerdogan and Arjen Lommen and Marco Blokland",
year = "2023",
month = jan,
doi = "10.1016/j.asoc.2022.109853",
language = "English",
volume = "132",
journal = "Applied Soft Computing",
issn = "1568-4946",
publisher = "Elsevier",
}
Mooijman, P, Catal, C, Tekinerdogan, B, Lommen, A & Blokland, M 2023, 'The effects of data balancing approaches: A case study', Applied Soft Computing, vol. 132, 109853. https://doi.org/10.1016/j.asoc.2022.109853
The effects of data balancing approaches: A case study. / Mooijman, Paul; Catal, Cagatay; Tekinerdogan, Bedir et al.
In: Applied Soft Computing, Vol. 132, 109853, 01.2023.
Research output: Contribution to journal › Article › Academic › peer-review
TY - JOUR
T1 - The effects of data balancing approaches
T2 - A case study
AU - Mooijman, Paul
AU - Catal, Cagatay
AU - Tekinerdogan, Bedir
AU - Lommen, Arjen
AU - Blokland, Marco
PY - 2023/1
Y1 - 2023/1
AB - Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing approaches. The case study concerns the discrimination between growth hormone treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 specimens were from animal studies and guaranteed to contain growth-stimulating hormones while the rest has been reported to be untreated, making it a ∼5% imbalanced dataset. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model to correctly identify the samples of treated animals. Furthermore, to cope with a large number of missing data points in the given dataset, a replacement with random low values strategy was applied. Our results showed that the replacement method was effective, and LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis were the best performing models after log transformation of the dataset was followed by Recursive Feature Elimination.
KW - Cattle
KW - Classification
KW - Feature selection
KW - Hormone abuse detection
KW - Imbalanced dataset
KW - LC–MS
KW - Missing data
KW - Resampling
KW - Supervised machine learning
U2 - 10.1016/j.asoc.2022.109853
DO - 10.1016/j.asoc.2022.109853
M3 - Article
AN - SCOPUS:85143699222
SN - 1568-4946
VL - 132
JO - Applied Soft Computing
JF - Applied Soft Computing
M1 - 109853
ER -
Mooijman P, Catal C, Tekinerdogan B, Lommen A, Blokland M. The effects of data balancing approaches: A case study. Applied Soft Computing. 2023 Jan;132:109853. doi: 10.1016/j.asoc.2022.109853