The effects of data balancing approaches: A case study

Abstract

Imbalanced datasets adversely affect the performance of machine learning algorithms. To cope with this problem, several resampling methods have been developed. In this article, we present a case study investigating the effects of data balancing approaches. The case study concerns the discrimination between growth-hormone-treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 came from animal studies and were guaranteed to contain growth-stimulating hormones, while the rest were reported to be untreated, making it an imbalanced dataset with a minority class of ∼5%. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model that correctly identifies the samples of treated animals. Furthermore, to cope with the large number of missing data points in the dataset, missing values were replaced with random low values. Our results showed that the replacement method was effective and that the best-performing models, obtained after log transformation of the dataset followed by Recursive Feature Elimination, were LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier with SMOTE, and LinearDiscriminantAnalysis.
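The workflow named in the abstract (random low-value replacement, log transformation, Recursive Feature Elimination, oversampling, LogisticRegression) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation. The data generator, the 20% missing rate, the bounds used for the random low values, the use of log1p, the choice of 10 selected features, the ordering of the steps, and the helper names fill_missing_with_random_low and smote_like_oversample are all assumptions. The oversampler is a simplified SMOTE-style interpolation written in NumPy, standing in for a library SMOTE implementation.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the LC-HRMS table: 1241 samples, 65 treated (label 1).
n_samples, n_treated, n_features = 1241, 65, 30
y = np.zeros(n_samples, dtype=int)
y[:n_treated] = 1
X = rng.lognormal(mean=8.0, sigma=1.0, size=(n_samples, n_features))
X[y == 1, :5] *= 3.0                    # hypothetical marker signals
X[rng.random(X.shape) < 0.2] = np.nan   # hypothetical 20% missing values


def fill_missing_with_random_low(X, rng, low_fraction=0.5):
    """Replace NaNs with random values below each column's observed minimum.

    The range (0 .. low_fraction * column minimum) is an assumption.
    """
    X = X.copy()
    col_min = np.nanmin(X, axis=0)
    missing = np.isnan(X)
    draws = rng.uniform(0.0, 1.0, size=X.shape) * (low_fraction * col_min)
    X[missing] = draws[missing]
    return X


def smote_like_oversample(X, y, rng, minority=1, k=5):
    """Simplified SMOTE-style oversampling: interpolate between a minority
    sample and one of its k nearest minority neighbours until classes match."""
    X_min = X[y == minority]
    n_new = int(np.sum(y != minority)) - len(X_min)
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)      # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


# 1. Missing-value replacement, 2. log transformation.
X_filled = fill_missing_with_random_low(X, rng)
X_log = np.log1p(X_filled)

X_train, X_test, y_train, y_test = train_test_split(
    X_log, y, test_size=0.3, stratify=y, random_state=0
)

# 3. Recursive Feature Elimination on the training split.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=5)
selector.fit(X_train, y_train)

# 4. Oversample the training split only, so the test split stays untouched.
X_res, y_res = smote_like_oversample(selector.transform(X_train), y_train, rng)

# 5. Fit the classifier and report recall on the treated class.
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
recall = recall_score(y_test, model.predict(selector.transform(X_test)))
print(f"minority share: {n_treated / n_samples:.1%}, treated-class recall: {recall:.2f}")
```

Oversampling is applied after the train/test split so that synthetic samples never leak into the evaluation data. Recall on the treated class is reported because plain accuracy is dominated by the majority class when the minority share is around 5%.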

Original language: English
Article number: 109853
Journal: Applied Soft Computing
Volume: 132
DOI: https://doi.org/10.1016/j.asoc.2022.109853
Publication status: Published - Jan 2023

Keywords

  • Cattle
  • Classification
  • Feature selection
  • Hormone abuse detection
  • Imbalanced dataset
  • LC–MS
  • Missing data
  • Resampling
  • Supervised machine learning

Cite this

Mooijman, P., Catal, C., Tekinerdogan, B., Lommen, A., & Blokland, M. (2023). The effects of data balancing approaches: A case study. Applied Soft Computing, 132, Article 109853. https://doi.org/10.1016/j.asoc.2022.109853

Mooijman, Paul ; Catal, Cagatay ; Tekinerdogan, Bedir et al. / The effects of data balancing approaches : A case study. In: Applied Soft Computing. 2023 ; Vol. 132.

@article{6cf2d37a7ef34ff5ae5b699af15d4e9f,
title = "The effects of data balancing approaches: A case study",
abstract = "Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing approaches. The case study concerns the discrimination between growth hormone treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 specimens were from animal studies and guaranteed to contain growth-stimulating hormones while the rest has been reported to be untreated, making it a ∼5% imbalanced dataset. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model to correctly identify the samples of treated animals. Furthermore, to cope with a large number of missing data points in the given dataset, a replacement with random low values strategy was applied. Our results showed that the replacement method was effective, and LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis were the best performing models after log transformation of the dataset was followed by Recursive Feature Elimination.",
keywords = "Cattle, Classification, Feature selection, Hormone abuse detection, Imbalanced dataset, LC–MS, Missing data, Resampling, Supervised machine learning",
author = "Paul Mooijman and Cagatay Catal and Bedir Tekinerdogan and Arjen Lommen and Marco Blokland",
year = "2023",
month = jan,
doi = "10.1016/j.asoc.2022.109853",
language = "English",
volume = "132",
journal = "Applied Soft Computing",
issn = "1568-4946",
publisher = "Elsevier",
}

Mooijman, P, Catal, C, Tekinerdogan, B, Lommen, A & Blokland, M 2023, 'The effects of data balancing approaches: A case study', Applied Soft Computing, vol. 132, 109853. https://doi.org/10.1016/j.asoc.2022.109853

The effects of data balancing approaches: A case study. / Mooijman, Paul; Catal, Cagatay; Tekinerdogan, Bedir et al.
In: Applied Soft Computing, Vol. 132, 109853, 01.2023.

Research output: Contribution to journal › Article › Academic › peer-review

TY - JOUR
T1 - The effects of data balancing approaches
T2 - A case study
AU - Mooijman, Paul
AU - Catal, Cagatay
AU - Tekinerdogan, Bedir
AU - Lommen, Arjen
AU - Blokland, Marco
PY - 2023/1
Y1 - 2023/1
AB - Imbalanced datasets affect the performance of machine learning algorithms adversely. To cope with this problem, several resampling methods have been developed recently. In this article, we present a case study approach for investigating the effects of data balancing approaches. The case study concerns the discrimination between growth hormone treated and non-treated animals using Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS) data. Our LC-HRMS dataset contains 1241 bovine urine samples, of which only 65 specimens were from animal studies and guaranteed to contain growth-stimulating hormones while the rest has been reported to be untreated, making it a ∼5% imbalanced dataset. In this research, classification algorithms, combined with resampling strategies and dimensionality reduction methods, were investigated to find a prediction model to correctly identify the samples of treated animals. Furthermore, to cope with a large number of missing data points in the given dataset, a replacement with random low values strategy was applied. Our results showed that the replacement method was effective, and LogisticRegression combined with the oversampling algorithms SMOTE or ADASYN, GaussianProcessClassifier with the oversampling algorithm SMOTE, and LinearDiscriminantAnalysis were the best performing models after log transformation of the dataset was followed by Recursive Feature Elimination.
KW - Cattle
KW - Classification
KW - Feature selection
KW - Hormone abuse detection
KW - Imbalanced dataset
KW - LC–MS
KW - Missing data
KW - Resampling
KW - Supervised machine learning
U2 - 10.1016/j.asoc.2022.109853
DO - 10.1016/j.asoc.2022.109853
M3 - Article
AN - SCOPUS:85143699222
SN - 1568-4946
VL - 132
JO - Applied Soft Computing
JF - Applied Soft Computing
M1 - 109853
ER -

Mooijman P, Catal C, Tekinerdogan B, Lommen A, Blokland M. The effects of data balancing approaches: A case study. Applied Soft Computing. 2023 Jan;132:109853. doi: 10.1016/j.asoc.2022.109853
