A comprehensive machine learning pipeline for predicting student success outcomes at Bishop State Community College.
This project implements five machine learning models to predict various aspects of student success:
- Retention Prediction - Will the student be retained?
- Early Warning System - Is the student at risk?
- Time-to-Credential - How long until graduation?
- Credential Type - What credential will they earn?
- Course Success - What will their GPA be?
The models use demographic, academic preparation, enrollment, and course performance data to generate actionable predictions for student support services.
codebenders-datathon/
├── ai_model/                                            # Machine learning models and scripts
│   ├── __init__.py                                      # Package initialization
│   ├── complete_ml_pipeline.py                          # Main ML pipeline (5 models)
│   ├── generate_bishop_state_data.py                    # Synthetic data generation
│   └── merge_bishop_state_data.py                       # Data merging script
│
├── data/                                                # Data files (CSV and Excel)
│   ├── ar_bscc_with_zip.csv                             # AR data with zip codes
│   ├── bishop_state_cohorts_with_zip.csv                # Student cohort data
│   ├── bishop_state_courses.csv                         # Course enrollment data
│   ├── bishop_state_student_level_with_zip.csv          # Student-level aggregated data
│   ├── bishop_state_student_level_with_predictions.csv  # Student-level with predictions
│   ├── bishop_state_merged_with_predictions.csv         # Course-level with predictions
│   └── De-identified PDP AR Files.xlsx                  # Original Excel data
│
├── codebenders-dashboard/                               # Next.js web application
├── operations/                                          # Database utilities and configuration
├── DATA_DICTIONARY.md                                   # Detailed data field descriptions
├── ML_MODELS_GUIDE.md                                   # Machine learning models guide
├── requirements.txt                                     # Python dependencies
├── LICENSE                                              # MIT License
└── README.md                                            # This file
- Retention Risk Assessment: Identify students at risk of not returning
- Early Warning Alerts: Four-level alert system (URGENT, HIGH, MODERATE, LOW)
- Graduation Timeline: Predict time to credential completion
- Credential Path: Forecast credential type (Certificate, Associate's, Bachelor's)
- Academic Performance: Predict expected GPA and identify over/underperformers
- XGBoost & Random Forest: State-of-the-art ensemble methods
- Feature Engineering: 40+ engineered features from raw data
- Comprehensive Evaluation: Multiple metrics for each model
- Production-Ready: Generates predictions for all students
- Detailed Reporting: Automated summary reports with model performance
- Python 3.8 or higher
- pip package manager
- Postgres database access via Supabase (for saving predictions)
- Clone the repository:
  - `git clone https://github.com/devcolor/codebenders-datathon.git`
  - `cd codebenders-datathon`
- Create and activate a virtualenv:
  - `python -m venv venv`
  - `source venv/bin/activate`
- Install dependencies:
  - `pip install -r requirements.txt`
- Configure the database (optional; the pipeline falls back to CSV if it is not configured). Copy `codebenders-dashboard/env.example` to `.env` and update:
  - `DB_HOST=127.0.0.1`
  - `DB_USER=postgres`
  - `DB_PASSWORD=postgres`
  - `DB_PORT=54332`
  - `DB_NAME=postgres`
  - `DB_SSL=false`
- Start local Supabase (for local development):
  - `supabase start`
- Test the database connection:
  - `python -m operations.test_db_connection`
- Verify data files: ensure all required CSV files are in the `data/` folder.
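The database settings from the configuration step can be combined into a single connection URL. The sketch below is only an illustration: the function name `build_db_url` is hypothetical, the defaults simply repeat the example values listed above, and the mapping of `DB_SSL` to the libpq `sslmode` parameter is an assumption, since this README does not show how the project reads these variables.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=None):
    """Build a PostgreSQL URL from DB_* settings (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("DB_HOST", "127.0.0.1")
    port = env.get("DB_PORT", "54332")
    user = env.get("DB_USER", "postgres")
    password = env.get("DB_PASSWORD", "postgres")
    name = env.get("DB_NAME", "postgres")
    # Assumption: DB_SSL=false means "do not use SSL" (sslmode=disable).
    ssl_on = env.get("DB_SSL", "false").strip().lower() in {"1", "true", "yes"}
    sslmode = "require" if ssl_on else "disable"
    # quote_plus escapes special characters in the credentials.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{name}?sslmode={sslmode}"
    )


# With an empty mapping, only the example defaults are used.
print(build_db_url({}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to check without touching the real environment.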
Run the complete ML pipeline:
- `cd ai_model`
- `python complete_ml_pipeline.py`

This will:
- Test database connection
- Load and preprocess data
- Train all 5 models
- Generate predictions for all students
- Save results to Postgres database (or CSV files as fallback)
- Save model performance metrics to database
- Create a summary report
If you need to re-merge the source data files:
- `cd ai_model`
- `python merge_bishop_state_data.py`

Expected runtime of the full pipeline:
- Data loading: ~30 seconds
- Model training: ~5-10 minutes
- Prediction generation: ~1 minute
- Total: ~10-15 minutes
The pipeline uses an efficient batch upload system to save predictions to Postgres:
Features:
- Automatic batching: Large datasets are split into manageable chunks (1,000 records per batch)
- Progress tracking: Real-time progress updates during upload
- Connection pooling: SQLAlchemy engine with connection pooling for reliability
- Error handling: Automatic fallback to CSV if database connection fails
- Verification: Automatic record count verification after upload
Example Output:
Saving 99,559 records to table 'course_predictions'...
✓ Successfully saved to 'course_predictions'
- Records: 99,559
- Columns: 45
- Verified: 99,559 records in database
Configuration:
- Default batch size: 1,000 records per chunk
- Adjustable via the `chunksize` parameter of `save_dataframe_to_db()`
- Located in `operations/db_utils.py`
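The batching and verification steps described above can be illustrated with a small, self-contained sketch. It is not the project's `save_dataframe_to_db()`: it uses the standard-library `sqlite3` module as a stand-in for Postgres so that it runs anywhere, and the helper names `iter_chunks` and `save_in_batches` are hypothetical.

```python
import sqlite3


def iter_chunks(rows, chunksize=1000):
    """Yield consecutive slices of at most `chunksize` rows."""
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]


def save_in_batches(conn, table, columns, rows, chunksize=1000):
    """Insert rows batch by batch, then verify the stored row count.

    Table and column names are interpolated directly, so they must come
    from trusted code. The count check assumes the table starts empty.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({column_list})")
    saved = 0
    for batch in iter_chunks(rows, chunksize):
        conn.executemany(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})",
            batch,
        )
        conn.commit()
        saved += len(batch)
        print(f"  uploaded {saved:,} / {len(rows):,} records")
    (stored,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if stored != len(rows):
        raise RuntimeError(f"expected {len(rows)} rows, found {stored}")
    return stored


# Illustrative data only: 2,500 made-up rows in an in-memory database.
conn = sqlite3.connect(":memory:")
rows = [(i, 0.5) for i in range(2500)]
stored = save_in_batches(
    conn, "student_predictions", ["student_id", "retention_probability"], rows
)
```

Committing after each batch keeps memory use and transaction size bounded, at the cost of leaving a partial table behind if a later batch fails.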
Tables Created:
- `student_predictions`: Student-level predictions (~4,000 records)
- `course_predictions`: Course-level predictions (~99,559 records)
- `ml_model_performance`: Model metrics and training history
For more details, see operations/README.md.
- Algorithm: XGBoost Classifier
- Target: Binary (Retained / Not Retained)
- Features: 40+ demographic, academic, and performance features

Output:
- `retention_probability`: Probability of retention (0-1)
- `retention_prediction`: Binary prediction (0/1)
- `retention_risk_category`: Risk level (Critical/High/Moderate/Low)
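To show how the three retention outputs relate to each other, here is a minimal sketch that derives the binary prediction and the risk category from a probability. The 0.5 decision threshold and the 0.25/0.50/0.75 category cutoffs are assumptions chosen for illustration; the actual values used by the pipeline are not stated in this README.

```python
def retention_outputs(probability, decision_threshold=0.5):
    """Derive the three retention columns from one probability.

    All thresholds here are illustrative assumptions.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    prediction = 1 if probability >= decision_threshold else 0
    # A lower retention probability means a higher risk of not returning.
    if probability < 0.25:
        category = "Critical"
    elif probability < 0.50:
        category = "High"
    elif probability < 0.75:
        category = "Moderate"
    else:
        category = "Low"
    return {
        "retention_probability": probability,
        "retention_prediction": prediction,
        "retention_risk_category": category,
    }


print(retention_outputs(0.9))
print(retention_outputs(0.1))
```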
- Algorithm: Composite Risk Score
- Target: Binary (At Risk / Not At Risk)
- Approach: Combines retention probability with performance metrics
Risk Factors:
- Retention probability (50% weight)
- GPA performance (20% weight)
- Course completion rate (20% weight)
- Credit progress (10% weight)
Output:
- `risk_score`: Comprehensive risk score (0-100)
- `at_risk_alert`: Alert level (URGENT/HIGH/MODERATE/LOW)
- `at_risk_probability`: Risk probability (0-1)
- `at_risk_prediction`: Binary prediction (0/1)
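The weighted combination of risk factors can be sketched as follows. Only the four weights (50/20/20/10) come from this README. How each factor is turned into a 0-1 risk value and the alert cutoffs (75/50/25) are assumptions, and `alert_level` is a hypothetical stand-in, not the project's `assign_alert_level()`.

```python
# Weights as listed under "Risk Factors".
WEIGHTS = {"retention": 0.50, "gpa": 0.20, "completion": 0.20, "credits": 0.10}


def clamp01(value):
    """Restrict a value to the range 0..1."""
    return max(0.0, min(1.0, value))


def risk_score(retention_probability, gpa, completion_rate, credit_progress):
    """Combine four factors into a 0-100 score (higher means more risk).

    Assumption: each factor becomes a risk by taking 1 minus its
    normalized value; GPA is normalized on a 0-4 scale.
    """
    risks = {
        "retention": 1.0 - clamp01(retention_probability),
        "gpa": 1.0 - clamp01(gpa / 4.0),
        "completion": 1.0 - clamp01(completion_rate),
        "credits": 1.0 - clamp01(credit_progress),
    }
    weighted = sum(WEIGHTS[name] * risks[name] for name in WEIGHTS)
    return round(100.0 * weighted, 1)


def alert_level(score):
    """Map a score to an alert level using assumed cutoffs."""
    if score >= 75:
        return "URGENT"
    if score >= 50:
        return "HIGH"
    if score >= 25:
        return "MODERATE"
    return "LOW"


score = risk_score(retention_probability=0.4, gpa=2.0,
                   completion_rate=0.5, credit_progress=0.5)
print(score, alert_level(score))
```

Because retention probability carries half the weight, a student with a low retention probability can reach a high score even with average grades.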
- Algorithm: XGBoost Regressor
- Target: Continuous (Years to credential)

Output:
- `predicted_time_to_credential`: Years to completion
- `predicted_graduation_year`: Expected graduation year
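One plausible way to turn a predicted duration into a calendar year is to add it to the cohort start year. The sketch below assumes that approach and rounds partial years up; both the rounding rule and the helper name `graduation_year` are assumptions, not taken from the pipeline.

```python
import math


def graduation_year(cohort_year, years_to_credential):
    """Estimate the graduation year from a cohort year and a duration.

    Assumption: a partial year counts as a full additional year.
    """
    if years_to_credential < 0:
        raise ValueError("years_to_credential must be non-negative")
    return cohort_year + math.ceil(years_to_credential)


print(graduation_year(2021, 2.4))
```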
- Algorithm: Random Forest Classifier
- Target: Multi-class (No Credential / Certificate / Associate's / Bachelor's)

Output:
- `predicted_credential_type`: Numeric code (0-3)
- `predicted_credential_label`: Text label
- `prob_no_credential`, `prob_certificate`, `prob_associate`, `prob_bachelor`: Class probabilities
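The relationship between the class probabilities, the numeric code, and the text label can be sketched like this. The code-to-label mapping (0 = No Credential through 3 = Bachelor's) is an assumption inferred from the order in which the classes are listed; the helper name `credential_outputs` is hypothetical.

```python
# Assumed mapping, following the order of the classes listed above.
CREDENTIAL_LABELS = {
    0: "No Credential",
    1: "Certificate",
    2: "Associate's",
    3: "Bachelor's",
}
PROB_COLUMNS = [
    "prob_no_credential",
    "prob_certificate",
    "prob_associate",
    "prob_bachelor",
]


def credential_outputs(class_probs):
    """Build the credential columns from four class probabilities.

    `class_probs` must be ordered like PROB_COLUMNS.
    """
    if len(class_probs) != len(PROB_COLUMNS):
        raise ValueError("expected one probability per credential class")
    # The predicted class is the one with the highest probability.
    code = max(range(len(class_probs)), key=class_probs.__getitem__)
    outputs = dict(zip(PROB_COLUMNS, class_probs))
    outputs["predicted_credential_type"] = code
    outputs["predicted_credential_label"] = CREDENTIAL_LABELS[code]
    return outputs


print(credential_outputs([0.1, 0.2, 0.6, 0.1]))
```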
- Algorithm: Random Forest Regressor
- Target: Continuous (GPA 0-4 scale)

Output:
- `predicted_gpa`: Expected GPA (0-4 scale)
- `gpa_performance`: Performance vs. expected (Above/Below/As Expected)
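A simple way to label over- and underperformers is to compare the actual GPA with the predicted one and allow a small tolerance band. The sketch below assumes a tolerance of 0.25 grade points and the exact label strings shown; neither is specified in this README.

```python
def gpa_performance(actual_gpa, predicted_gpa, tolerance=0.25):
    """Classify actual GPA relative to the prediction.

    The tolerance and the label strings are illustrative assumptions.
    """
    difference = actual_gpa - predicted_gpa
    if difference > tolerance:
        return "Above Expected"
    if difference < -tolerance:
        return "Below Expected"
    return "As Expected"


print(gpa_performance(3.5, 3.0))
print(gpa_performance(2.9, 3.0))
print(gpa_performance(2.0, 3.0))
```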
| File | Description | Records |
|---|---|---|
| `ar_bscc_with_zip.csv` | AR data with zip codes | ~4K |
| `bishop_state_cohorts_with_zip.csv` | Student cohort information | ~4K |
| `bishop_state_courses.csv` | Course enrollment records | ~100K |
| `bishop_state_student_level_with_zip.csv` | Aggregated student-level data | ~4K |
- Demographics: Age, race, ethnicity, gender, first-generation status
- Academic Preparation: Math/English/Reading placement levels
- Enrollment: Type, intensity, attendance status, cohort term
- Course Performance: Credits, grades, completion rates, gateway courses
- Financial: Pell grant status
- Geographic: Zip code information
Predictions are saved to Postgres (Supabase):
- `student_predictions` (table)
  - Student-level data with all predictions
  - One row per student (~4,000 records)
- `course_predictions` (table)
  - Course-level data with predictions
  - One row per course enrollment (~99,559 records)
- `ml_model_performance` (table)
  - Model performance metrics for each training run
If database connection fails, predictions are saved to CSV:
- `bishop_state_student_level_with_predictions.csv`
- `bishop_state_merged_with_predictions.csv`
- `ML_PIPELINE_REPORT.txt`
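The database-first, CSV-fallback behavior can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline's code: `save_with_fallback` and `unreachable_db` are hypothetical names, and the failing database call is simulated so the fallback path actually runs.

```python
import csv
import os
import tempfile


def save_with_fallback(rows, fieldnames, save_to_db, csv_path):
    """Try the database first; write a CSV file if that fails."""
    try:
        save_to_db(rows)
        return "database"
    except Exception as exc:  # broad on purpose in this sketch
        print(f"Database save failed ({exc}); writing {csv_path} instead")
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        return "csv"


def unreachable_db(rows):
    """Simulated database writer that always fails."""
    raise ConnectionError("could not connect to Postgres")


# Illustrative row only; real predictions have many more columns.
rows = [{"student_id": 1, "retention_probability": 0.82}]
csv_path = os.path.join(
    tempfile.mkdtemp(), "bishop_state_student_level_with_predictions.csv"
)
destination = save_with_fallback(
    rows, ["student_id", "retention_probability"], unreachable_db, csv_path
)
print(destination)
```

Returning the destination lets the caller report where the predictions ended up, which matters when downstream tools expect the database tables.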
- DATA_DICTIONARY.md: Detailed descriptions of all data fields
- ML_MODELS_GUIDE.md: In-depth guide to machine learning models
- DOCKER_SETUP.md: Docker Compose setup for local Postgres
- Model Code: Extensively commented Python scripts in `ai_model/`
Edit `complete_ml_pipeline.py` to adjust:
- XGBoost parameters: `n_estimators`, `max_depth`, `learning_rate`
- Random Forest parameters: `n_estimators`, `max_depth`, `n_jobs`
- Train-test split: `test_size`, `random_state`
- Risk thresholds: Alert levels in `assign_alert_level()`
This project was developed for the Bishop State Datathon. Contributions are welcome!
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
CodeBenders Team, Bishop State Datathon 2025
- Bishop State Community College
- Datathon organizers and mentors
- Open-source ML community (scikit-learn, XGBoost, pandas)
For questions or support, please open an issue on GitHub or contact the team.
Built with ❤️ for student success