Grid Search Evaluation of ML Algorithms for Early Disease Detection in Imbalanced Medical Datasets

Document Type: Original Article

Authors

1 Information and Decision Support Center, Egyptian Cabinet, Egypt

2 Faculty of Engineering at Shoubra, Benha University & Benha National University, Egypt

3 Faculty of Engineering at Shoubra, Benha University, Egypt

Abstract

As the prevalence of chronic diseases rises, it is critical to identify them in their early stages so that effective treatment can begin, as they may otherwise become incurable and deadly. For this reason, machine learning approaches are increasingly applied to medical data to reveal hidden relationships or abnormalities that are not readily visible to humans and would otherwise require a consortium of experts to uncover. Implementing algorithms for such tasks is difficult, and achieving high accuracy is more challenging still. This paper applies several machine learning algorithms, namely Logistic Regression, Random Forest, XGBoost, Multilayer Perceptron (MLP), and Naïve Bayes, to datasets from the University of California, Irvine (UCI) Machine Learning Repository and Kaggle. The main challenge is that classifiers trained on imbalanced data are biased towards the majority class, which can lead to misdiagnosis. We address this challenge by using grid search to optimize key hyperparameters, which significantly enhances model performance. The disease datasets are analyzed and preprocessed so that they can be used to train the models; the models are then evaluated, and the one with the best accuracy is selected. By tuning hyperparameters, we successfully minimized false negatives, which is critical for medical predictions. These findings suggest that grid search is an essential tool for improving model accuracy on imbalanced medical datasets. The study therefore recommends hyperparameter optimization techniques, such as grid search, for models trained on imbalanced medical datasets, with a specific focus on minimizing false negatives in clinical applications. It also highlights the importance of adopting comprehensive evaluation metrics, such as recall, accuracy, and F1-score, to ensure robust model assessment. Furthermore, the study advocates the use of powerful models such as XGBoost and Random Forest: the former provides a balance between performance and execution time, while the latter achieves the highest accuracy at the expense of longer execution times.
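As an illustration of the approach summarized above, the following is a minimal sketch of grid search on an imbalanced binary classification task, written with scikit-learn. It is not the authors' pipeline: the synthetic dataset (about 10% positive cases), the choice of Logistic Regression, the parameter grid, and the use of recall as the selection criterion are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Assumption: a synthetic stand-in for an imbalanced medical dataset,
# with roughly 90% negative (healthy) and 10% positive (disease) samples.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=0,
)

# Stratified split keeps the class ratio the same in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumption: an illustrative grid, not the hyperparameters used in the paper.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# Select the candidate with the best cross-validated recall on the positive
# class, since missed cases (false negatives) are the costly error here.
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
y_pred = search.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
report = {
    "best_params": search.best_params_,
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "accuracy": accuracy_score(y_test, y_pred),
    "false_negatives": int(fn),
}
print(report)
```

Reporting recall, F1-score, and accuracy together matters on imbalanced data: a model that predicts only the majority class would score about 90% accuracy on this synthetic set while detecting no positive cases at all. Selecting on recall alone can in turn favor models with many false positives, so the other metrics act as a check.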

Keywords