An end-to-end Machine Learning project that predicts whether an individual's income exceeds $50K/yr based on census data. Features a full pipeline from raw data to SQL normalization, MLflow tracking, and Dockerized deployment.
This project goes beyond simple model training by implementing a robust MLOps workflow. It ingests raw census data, normalizes it into a 3NF SQLite Database, performs advanced feature engineering, tracks experiments using MLflow (via DagsHub), and serves the best model via a FastAPI backend and Streamlit frontend.
Dataset: Adult Census Income Evaluation Dataset
Source: UCI Machine Learning Repository - Adult Dataset
- `adult.data` (training set, 32,561 instances), renamed to `income_evaluation.csv`
- Place the file in the `Income_Classification/` folder, or update the path in the notebook: `file_path = 'your/path/to/income_evaluation.csv'`
- `cleaned_income_evaluation.csv` (automatically generated after running the data cleaning cells)

The dataset contains these exact columns:

`age`, `workclass`, `fnlwgt`, `education`, `education-num`, `marital-status`, `occupation`, `relationship`, `race`, `sex`, `capital-gain`, `capital-loss`, `hours-per-week`, `native-country`, `income` (`<=50K` or `>50K`)

First row: `39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K`
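Since the raw file ships without a header row, the column names above must be supplied at load time. A minimal sketch with pandas, using the first data row above as an inline sample (replace the `StringIO` with your local `income_evaluation.csv` path):

```python
import io
import pandas as pd

# Column names exactly as listed above; the raw file has no header row.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

# Inline sample standing in for the real CSV path.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
)

# skipinitialspace strips the leading blanks after each comma.
df = pd.read_csv(sample, names=COLUMNS, skipinitialspace=True)
print(df.shape)  # (1, 15)
```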
The project pipeline consists of five major stages:
1. **Ingestion**: Download `income_evaluation.csv` from the UCI ML Repository and update `file_path = '/content/income_evaluation.csv'` in notebook Cell 2 to your local CSV location.
2. **SQL Normalization**: Split the raw data into a 3NF SQLite database with five tables: `Personal_Details`, `Employment_Details`, `Education_Details`, `Financial_Details`, `Location_Details`.
3. **Feature Engineering**: Derive new features (e.g., `Age_to_hours_ratio`, `Age_squared`).
4. **Experiment Tracking**: Log runs and metrics to MLflow via DagsHub.
5. **Deployment**: Serve the best model through the FastAPI backend and Streamlit frontend.

Project structure:

```text
income-classification-ml/
├── Income_Classification/
│   ├── app/
│   │   ├── fastapi/
│   │   │   ├── main.py                      # API endpoint for inference
│   │   │   └── random_forest_model.joblib   # Serialized model
│   │   └── streamlit/
│   │       └── income.py                    # User Interface
│   ├── Classification.ipynb                 # Main notebook: Data -> SQL -> MLflow
│   ├── income_evaluation.csv                # [REQUIRED] Raw dataset (download from UCI)
│   ├── cleaned_income_evaluation.csv        # Generated after data cleaning
│   ├── income_evaluation.db                 # SQLite 3NF database (auto-generated)
│   ├── Dockerfile                           # Docker build for the app
│   ├── Dockerfile-fastapi                   # Docker build for API
│   └── requirements.txt                     # Python dependencies
├── ML_Project_Plan.docx                     # Project planning document
└── README.md                                # Documentation
```
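A minimal sketch of the SQL normalization step using the stdlib `sqlite3` module and the five table names from the pipeline. The column split shown here is illustrative only; the notebook's actual DDL may assign columns differently:

```python
import sqlite3

# In-memory database for the sketch; the project writes income_evaluation.db.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical 3NF split of the census columns across the five tables.
cur.executescript("""
CREATE TABLE Personal_Details (
    person_id INTEGER PRIMARY KEY,
    age INTEGER, sex TEXT, race TEXT, marital_status TEXT, relationship TEXT
);
CREATE TABLE Employment_Details (
    person_id INTEGER REFERENCES Personal_Details(person_id),
    workclass TEXT, occupation TEXT, hours_per_week INTEGER
);
CREATE TABLE Education_Details (
    person_id INTEGER REFERENCES Personal_Details(person_id),
    education TEXT, education_num INTEGER
);
CREATE TABLE Financial_Details (
    person_id INTEGER REFERENCES Personal_Details(person_id),
    capital_gain INTEGER, capital_loss INTEGER, fnlwgt INTEGER
);
CREATE TABLE Location_Details (
    person_id INTEGER REFERENCES Personal_Details(person_id),
    native_country TEXT
);
""")

tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Each satellite table references `Personal_Details.person_id`, so a training set can be rebuilt with joins instead of storing the denormalized CSV.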
Experiments were tracked using MLflow. Below are the mean cross-validated F1-scores of the top-performing models:
| Model | Feature Selection | F1-Score (CV Mean) |
|---|---|---|
| XGBoost | Variance Threshold | 0.8087 |
| XGBoost | Correlation Threshold | 0.8072 |
| Random Forest | Feature Importance | 0.7802 |
| Ridge Classifier | None | 0.7490 |
| Logistic Regression | None | 0.6309 |
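The scores above come from cross-validating a feature-selection + classifier pipeline. A sketch of that setup on synthetic data, using scikit-learn's `RandomForestClassifier` as a stand-in (the notebook evaluates XGBoost on the census features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the engineered census features.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Variance-threshold selection followed by the classifier, as in the table.
pipe = make_pipeline(
    VarianceThreshold(threshold=0.0),  # drop zero-variance features
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# 5-fold CV mean F1, the metric reported in the table.
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(round(scores.mean(), 4))
```

Fitting the selector inside the pipeline keeps it refit on each CV training fold, so the reported mean is not leaked by fold-external feature selection.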
## Champion Model

The XGBClassifier with Variance Threshold feature selection was chosen for deployment due to its superior stability and F1-score.