Machine Learning · Researcher

Diabetes Prediction (Stacking)

A stacking ensemble for early diabetes prediction on the PIMA dataset: six hyperparameter-tuned base classifiers feed a Random Forest meta-learner. Co-authored and published at IEEE AIMV 2021.

Role: Researcher
When: 2021
Stack: Python, Scikit-learn, Pandas, NumPy
Scale: 6 + 1 model ensemble

GitHub ↗

6 base models · RF meta-learner

6 + 1 model ensemble

74.46% test accuracy

768 x 8 PIMA dataset

IEEE 2021 co-authored paper

The problem

Single classifiers on the small PIMA Indians Diabetes dataset trade off against each other: one is better on some patients, another on others. The question this project asked was whether stacking (letting a meta-model learn how to combine several tuned base classifiers) would predict diabetes more reliably than any single model.

What it does

Six base classifiers (Gaussian Naive Bayes, Random Forest, Decision Tree, SVM, an MLP neural network, and Logistic Regression), each hyperparameter-tuned with randomized search and cross-validation.
A stacking ensemble in which the six base models' cross-validated predictions become the inputs to a Random Forest meta-learner (scikit-learn StackingClassifier, cv=4).
Standard preprocessing on the PIMA Indians dataset (768 patients, 8 clinical features) with a 70/30 train/test split.
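
The three pieces above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the data is a synthetic stand-in with the PIMA shape (768 x 8), only the Decision Tree is tuned, and the search space, `random_state` values, scaling pipelines, and variable names are all assumptions made for the example. Only the six model families, the Random Forest meta-learner, cv=4, and the 70/30 split come from the project itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in with the PIMA shape (768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)

# 70/30 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Randomized search with cross-validation for one base model.
# Assumption: this search space is illustrative, not the one used in the notebook.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={
        "max_depth": [2, 3, 4, 5, 6, None],
        "min_samples_leaf": [1, 2, 5, 10, 20],
    },
    n_iter=10,
    cv=4,
    random_state=0,
)
search.fit(X_train, y_train)
tuned_tree = search.best_estimator_

# Six base classifiers; the scale-sensitive ones get a StandardScaler in front.
base_models = [
    ("gnb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("dt", tuned_tree),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ("mlp", make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
]

# The meta-learner is trained on out-of-fold predictions from 4-fold CV.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    cv=4,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {test_accuracy:.4f}")
```

Because the meta-learner only ever sees predictions made on folds the base models were not trained on, it learns how much to trust each model without simply memorising their training-set outputs.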

Impact

Co-authored and published at the 2021 International Conference on Artificial Intelligence and Machine Vision (AIMV 2021) on IEEE Xplore (DOI 10.1109/AIMV53313.2021.9670920).
The committed notebook reaches 74.46% accuracy on the 30% held-out test set, against 87.9% on the training set. That gap of roughly 13 percentage points suggests the ensemble overfits, which is a useful finding in itself on a dataset this small.
A single reproducible notebook covers the pipeline end to end: data, preprocessing, six tuned base models, the stacking ensemble, and per-model evaluation.
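
The train/test comparison reported above is easy to reproduce for any fitted model. The sketch below shows one way to do it; `accuracy_gap` is a hypothetical helper written for this example, and the data and the unpruned Decision Tree are synthetic stand-ins chosen because they make a gap easy to see, not the notebook's model or results. The last lines only redo the arithmetic on the two accuracies reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def accuracy_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, gap) for an already fitted model.

    Hypothetical helper for illustration; not part of the project's notebook.
    """
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return train_acc, test_acc, train_acc - test_acc


# Assumption: synthetic stand-in with the PIMA shape (768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# An unpruned tree memorises its training data, so it tends to show a clear gap.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc, test_acc, gap = accuracy_gap(model, X_train, y_train, X_test, y_test)
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")

# The same arithmetic on the accuracies reported for the stacking ensemble.
reported_gap = 87.9 - 74.46
print(f"reported gap: {reported_gap:.2f} percentage points")
```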