Research Project

Diabetes prediction model based on structured electronic health records

Overview

This project was developed for a monthly Kaggle competition and focused on building a machine learning workflow for diabetes prediction. Several models were trained and then blended into the final ensemble. The workflow had three stages:

  1. Feature engineering: Engineered more than 40 clinical features, including blood pressure risk levels and metabolic indicators such as visceral fat, mean arterial pressure, and the atherogenic index of plasma.
  2. AutoGluon baseline: Established a strong performance baseline with AutoGluon, which automates model training through multi-layer stack ensembling and repeated k-fold bagging, applied here to the engineered clinical features.
  3. Optimized stacking and weighting: Built the final ensemble with a hill climbing search over the blending weights of four models (XGBoost, LightGBM, CatBoost, and YDF). The blended ensemble reached an out-of-fold (OOF) AUC of 0.7295.
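
The weight search in stage 3 can be sketched as follows. This is one common variant of hill climbing for blending, not necessarily the one used in the project: it starts from equal weights and repeatedly moves a small amount of weight from one model to another, keeping the move that most improves the out-of-fold AUC. The step size, the stopping rule, and all function names are assumptions. The AUC is computed in plain Python so the example has no dependencies.

```python
def roc_auc(y_true, scores):
    """ROC AUC from the Mann-Whitney U statistic, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def blend(preds, weights):
    """Weighted sum of per-model prediction lists, row by row."""
    n_rows = len(next(iter(preds.values())))
    return [sum(weights[m] * preds[m][i] for m in preds) for i in range(n_rows)]


def hill_climb_weights(oof_preds, y_true, step=0.05, max_iter=200):
    """Greedy search for blending weights that maximize out-of-fold AUC.

    oof_preds maps a model name to its list of out-of-fold predictions.
    Weights stay non-negative and sum to 1 throughout the search.
    """
    names = list(oof_preds)
    weights = {m: 1.0 / len(names) for m in names}
    best_score = roc_auc(y_true, blend(oof_preds, weights))
    for _ in range(max_iter):
        best_move = None
        for src in names:
            if weights[src] < step - 1e-12:
                continue  # not enough weight left to move away from src
            for dst in names:
                if dst == src:
                    continue
                trial = dict(weights)
                trial[src] = max(0.0, trial[src] - step)
                trial[dst] += step
                score = roc_auc(y_true, blend(oof_preds, trial))
                if score > best_score + 1e-12:
                    best_score, best_move = score, trial
        if best_move is None:
            break  # no single move improves the score: local optimum
        weights = best_move
    return weights, best_score
```

In use, oof_preds would hold one entry per base model, for example the keys "xgboost", "lightgbm", "catboost", and "ydf", each with that model's out-of-fold predictions on the training set. Because the weights are tuned on the same out-of-fold predictions used to score them, the resulting AUC is slightly optimistic. Hill climbing also only guarantees a local optimum.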

Code Availability