Stop building Titanic survival models that every recruiter has seen a thousand times. Here's the project roadmap that actually gets you hired.
You've finished your Python basics. You know what a DataFrame is. You've watched a dozen tutorials. And now you're staring at a blank Jupyter notebook wondering: What should I actually build?
This guide is the answer — not a vague "build something you're passionate about" answer, but a specific, level-by-level roadmap of projects that data science hiring managers actually care about, the tools each one should demonstrate, and exactly why each project earns you a closer look.
Before we talk projects, understand what you're trying to prove. A recruiter reviewing your portfolio is asking five questions:
Build in this sequence. Don't skip levels — each stage teaches you something the next one assumes you already know.
Pick a raw, imperfect dataset — government open data, a Kaggle dataset outside the top-10 most popular, or your own scraped data. Write a complete EDA: missing value analysis, distribution plots, outlier detection, correlation heatmaps, and a written story summarising what the data tells you. The output is a clean, well-commented notebook with a narrative — not just code.
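Two of the checks named above, missing value analysis and outlier detection, can be sketched in a few lines. This is a minimal illustration that uses only the standard library; the `records` data and the helper names are hypothetical, and in a real notebook you would more likely reach for pandas (for example `DataFrame.isna().mean()`).

```python
import statistics

def missing_share(rows, columns):
    """Return the fraction of missing (None) values for each column."""
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical toy records standing in for a raw dataset.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 51000},
    {"age": 38, "income": 950000},  # suspicious value worth investigating
    {"age": 45, "income": 47000},
]

shares = missing_share(records, ["age", "income"])
incomes = [r["income"] for r in records if r["income"] is not None]
outliers = iqr_outliers(incomes)

print(shares)
print(outliers)
```

Flagging a value is only the start: the written narrative should say whether each outlier is a data entry error or a real observation, and what you did about it.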
Use a telecom or SaaS churn dataset (e.g. the IBM Telco Customer Churn dataset on Kaggle). Build a binary classifier that predicts whether a customer will leave. Go beyond accuracy: report a confusion matrix, precision and recall, and ROC-AUC, and explain which features drive churn and why. Add a brief business interpretation: "If the company acts on customers with churn probability above 0.7, it retains X% more revenue."
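To make the 0.7 threshold concrete, here is a small sketch of how precision and recall shift as the cut-off moves. The labels and scores are made up for illustration, and the helper names are hypothetical; with a real model you would typically use scikit-learn's `confusion_matrix` and `precision_recall_curve` instead.

```python
def confusion_at_threshold(y_true, y_prob, threshold):
    """Count TP, FP, FN, TN when predicting churn for prob >= threshold."""
    tp = fp = fn = tn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and actual == 1:
            tp += 1
        elif predicted and actual == 0:
            fp += 1
        elif not predicted and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = churned) and model scores for ten customers.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_prob = [0.92, 0.15, 0.78, 0.55, 0.71, 0.30, 0.85, 0.05, 0.45, 0.40]

results = {}
for threshold in (0.5, 0.7):
    tp, fp, fn, tn = confusion_at_threshold(y_true, y_prob, threshold)
    results[threshold] = precision_recall(tp, fp, fn)
    print(threshold, (tp, fp, fn, tn), results[threshold])
```

In this toy data, raising the threshold from 0.5 to 0.7 means fewer churners are contacted, which is exactly the kind of trade-off the business interpretation should spell out.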
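Use the Ames Housing dataset (richer than Boston Housing, which scikit-learn deprecated in version 1.0 and removed in 1.2 over ethical concerns). The goal isn't just to run Linear Regression; it's to show your feature engineering thinking. Create new features (total living area, age of house, neighbourhood ranking). Avoid features such as price per sq ft that are computed from the sale price itself unless they are aggregated on the training split only, because otherwise they leak the target into the model. Handle skewed distributions with log transforms, compare at least 3 models, and visualise feature importances clearly.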
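The effect of a log transform on a skewed target can be checked numerically. The sketch below uses a handful of invented sale prices (not taken from the Ames data) and a hand-written skewness helper; the `house_age` feature is likewise a hypothetical example of a simple engineered column.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

def house_age(year_sold, year_built):
    """Engineered feature: age of the house at the time of sale."""
    return year_sold - year_built

# Invented, right-skewed sale prices with one expensive outlier.
prices = [105000, 120000, 128000, 135000, 142000,
          150000, 163000, 180000, 240000, 610000]

log_prices = [math.log1p(p) for p in prices]  # log(1 + price)

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
print("age:", house_age(2010, 1985))
```

If you train on the log-transformed price, remember to invert the transform (`math.expm1`) before reporting errors in currency units.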
Collect product reviews from Amazon or an app store using BeautifulSoup or an API, then build a sentiment classifier. Go beyond positive/negative: include neutral, and try aspect-based sentiment (e.g. "battery life is bad" → negative sentiment on the 'battery' aspect). Deploy the model as a simple Streamlit app where users can type a review and get a sentiment score instantly.
Build both a collaborative filtering (user-item matrix) and a content-based filtering approach, then compare them. Use the MovieLens dataset or a public e-commerce review dataset. Implement a hybrid approach as a bonus. Build a simple UI where a user can input a movie/product name and get recommendations. Discuss cold-start problems and how you'd solve them in production.
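As a rough illustration of the collaborative filtering half, the sketch below computes item-to-item cosine similarity from a tiny user-item rating table. The users, titles, and ratings are invented, and a real project would use the MovieLens ratings with sparse matrices rather than dictionaries.

```python
import math

# Hypothetical user -> {item: rating} table.
ratings = {
    "ana":   {"Alien": 5, "Blade Runner": 4, "Heat": 1},
    "ben":   {"Alien": 4, "Blade Runner": 5},
    "chloe": {"Heat": 5, "Casino": 4, "Alien": 1},
    "dev":   {"Heat": 4, "Casino": 5},
}

def item_vector(item):
    """Return {user: rating} for everyone who rated the item."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity of two sparse vectors; missing ratings count as 0."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similar_items(target, top_n=2):
    """Rank the other items by similarity to the target item."""
    items = sorted({item for r in ratings.values() for item in r})
    target_vec = item_vector(target)
    scores = [
        (other, cosine(target_vec, item_vector(other)))
        for other in items
        if other != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

print(similar_items("Alien"))
print(similar_items("Heat"))
```

An item nobody has rated yet has an empty vector and scores 0.0 against everything, which is the cold-start problem in miniature and a natural place to bring in the content-based approach.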
Use a publicly available time series dataset (Walmart sales, airline passenger data, or a public API like Open-Meteo for weather). Implement classical decomposition (trend, seasonality, residuals), then model with ARIMA/SARIMA, and compare against Prophet (originally released by Facebook). Show your understanding of stationarity, ACF/PACF plots, and model diagnostics. Show the forecast's uncertainty interval in your visualisation.
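Classical additive decomposition is simple enough to write by hand, which is a good way to check that you understand what a library is doing. The sketch below assumes an even seasonal period and at least two full seasons of data; the synthetic series and function names are illustrative only, and in a project you would normally call `statsmodels.tsa.seasonal.seasonal_decompose`.

```python
def centred_moving_average(series, period):
    """Trend estimate for an even period; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        # Half weight on both ends so the window covers exactly one period.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

def decompose_additive(series, period):
    """Split a series into trend + seasonal + residual components."""
    trend = centred_moving_average(series, period)
    detrended = [None if m is None else y - m for y, m in zip(series, trend)]

    # Average the detrended values at each position within the season.
    means = []
    for pos in range(period):
        vals = [d for i, d in enumerate(detrended)
                if d is not None and i % period == pos]
        means.append(sum(vals) / len(vals))

    # Centre the seasonal pattern so it sums to zero over one period.
    offset = sum(means) / period
    pattern = [m - offset for m in means]
    seasonal = [pattern[i % period] for i in range(len(series))]
    residual = [None if m is None else y - m - s
                for y, m, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual

# Synthetic quarterly data: linear trend plus a repeating seasonal pattern.
true_pattern = [3.0, -1.0, -4.0, 2.0]
series = [10 + 0.5 * t + true_pattern[t % 4] for t in range(16)]

trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal[:4])
```

Because the synthetic series has no noise, the residuals are essentially zero; on real data, the residual plot is where you look for structure the trend and seasonality failed to capture.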
Load a relational dataset (e.g. Northwind or a public e-commerce database) into PostgreSQL or SQLite. Write complex SQL queries (CTEs, window functions, subqueries) to extract KPIs, then pull results into Python and build an interactive dashboard with Plotly Dash or Streamlit. The key is showing the full pipeline: database → SQL → Python → visualisation — not just Python alone.
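Here is a minimal sketch of the database → SQL → Python part of that pipeline, using Python's built-in `sqlite3` module with an in-memory database. The `orders` table and its rows are invented for illustration, and the window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical orders table standing in for a real relational dataset.
conn.executescript("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer    TEXT,
    order_month TEXT,
    amount      REAL
);
INSERT INTO orders (customer, order_month, amount) VALUES
    ('acme',   '2024-01', 120.0),
    ('globex', '2024-01',  80.0),
    ('acme',   '2024-02', 150.0),
    ('globex', '2024-02', 100.0),
    ('acme',   '2024-03',  90.0),
    ('globex', '2024-03',  60.0);
""")

# CTE for monthly revenue, then window functions for two KPIs:
# running total and month-over-month change.
query = """
WITH monthly AS (
    SELECT order_month, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_month
)
SELECT order_month,
       revenue,
       SUM(revenue) OVER (ORDER BY order_month) AS running_revenue,
       revenue - LAG(revenue) OVER (ORDER BY order_month) AS change_vs_prev
FROM monthly
ORDER BY order_month;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

conn.close()
```

From here, `pandas.read_sql_query` can load the same query straight into a DataFrame for the dashboard layer.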
Take any classification or regression problem and build it as a proper ML pipeline: data ingestion → preprocessing → feature engineering → model training → evaluation → model registry. Use MLflow to track experiments, log metrics, and version models. Add a basic CI/CD step with GitHub Actions that retrains and re-evaluates the model on any code push. Deploy the final model as a REST API with FastAPI.
Use the public credit card fraud dataset (highly imbalanced — 99.8% legitimate, 0.2% fraud). Show exactly how naive accuracy is useless here. Implement SMOTE, class weighting, and threshold tuning. Compare models on precision-recall AUC rather than ROC-AUC. Discuss the real-world trade-off: higher recall means catching more fraud but also flagging more legitimate transactions — how does a business decide where to set the threshold?
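The "naive accuracy is useless" point is easy to demonstrate. The sketch below builds a synthetic label set with the same 0.2% fraud rate and scores a model that never predicts fraud. It also computes class weights with the formula `n_samples / (n_classes * class_count)`, which is what scikit-learn's `class_weight="balanced"` option uses; the helper names are hypothetical.

```python
from collections import Counter

def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on the positive class)."""
    correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    true_pos = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    positives = sum(y_true)
    recall = true_pos / positives if positives else 0.0
    return correct / len(y_true), recall

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Synthetic labels: 2 fraud cases (1) in 1,000 transactions, i.e. 0.2%.
y_true = [1] * 2 + [0] * 998

# A "model" that labels every transaction as legitimate.
always_legit = [0] * len(y_true)

accuracy, recall = accuracy_and_recall(y_true, always_legit)
weights = balanced_class_weights(y_true)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
print(weights)
```

The do-nothing model scores 99.8% accuracy while catching no fraud at all, which is why the comparison belongs on precision and recall rather than accuracy.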
Build a Retrieval-Augmented Generation (RAG) application using a public document corpus (research papers, a company's annual reports, Wikipedia exports). Chunk and embed documents using sentence-transformers, store in a vector database (ChromaDB or FAISS), and connect to an open-source LLM (via HuggingFace or Ollama locally). Build a Streamlit interface where users ask natural language questions and the app retrieves relevant chunks and generates a grounded answer.
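The retrieval step can be prototyped before any model is downloaded. The sketch below is a deliberately simplified stand-in: it uses word-count vectors and cosine similarity in place of sentence-transformers embeddings and a vector database, and it stops at building the prompt rather than calling an LLM. All function names and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, context_chunks):
    """Assemble a grounded prompt to pass to whichever LLM you choose."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "MLflow tracks experiments, logs metrics, and versions models.",
    "FastAPI serves a trained model as a REST API.",
    "SMOTE oversamples the minority class in imbalanced datasets.",
]

question = "How do I serve a model as a REST API?"
top = retrieve(question, chunks, k=1)
print(build_prompt(question, top))
```

Word counts miss synonyms and paraphrases, which is the gap dense embeddings are meant to close; swapping `embed` for a real encoder leaves the rest of the flow unchanged.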
Every project should progressively build your command of these libraries. Don't try to learn them all at once — learn each one in the context of a project.
Pandas, NumPy — non-negotiable. You should be able to write complex groupby, merge, reshape and window operations without googling every line.
Matplotlib + Seaborn for static plots. Plotly for interactive charts. One of these must appear in every project — a notebook with no visuals is a red flag.
Scikit-learn as the foundation. XGBoost / LightGBM for tree-based models. Understand pipelines, cross-validation, and hyperparameter tuning — not just model.fit().
NLTK for basics, spaCy for production NLP, HuggingFace Transformers for state-of-the-art models. Know TF-IDF, word embeddings, and at least one transformer model.
Streamlit for quick demos. FastAPI for REST APIs. Docker for containerisation. Even one deployed project is worth ten static notebooks in an interview.
PostgreSQL or SQLite with SQLAlchemy. Window functions, CTEs, joins. SQL fluency is tested in almost every data science interview — treat it as seriously as Python.
MLflow for experiment tracking. GitHub Actions for CI/CD. Basic cloud exposure (AWS S3, GCP BigQuery, or Azure ML). Increasingly expected even at mid-level roles.
PyTorch or TensorFlow for image and sequence tasks. LangChain / LlamaIndex for LLM applications. HuggingFace for pre-trained models. One project using these is enough at the entry level.
A brilliant project with a poor GitHub presentation is an invisible project. Here is what every repository must have:
A clear folder structure: data/, notebooks/, src/, models/. A flat dump of files signals a beginner immediately.
Generic projects get generic responses. Domain-specific projects get callbacks. Match at least one or two projects to the industry you want to enter:
| Target Industry | Recommended Project Domain | Dataset Sources |
|---|---|---|
| Fintech / Banking | Credit risk scoring, fraud detection, loan default prediction | Kaggle, UCI ML Repository, LendingClub |
| E-commerce / Retail | Customer segmentation (RFM), churn prediction, recommendation engine | Olist, Instacart, Kaggle retail datasets |
| Healthcare | Disease prediction, patient readmission, medical image classification | MIMIC-III, Kaggle health datasets, NIH |
| EdTech | Student performance prediction, dropout risk, content recommendation | Open University Learning Analytics Dataset |
| Logistics / Supply Chain | Demand forecasting, route optimisation, inventory prediction | Kaggle supply chain datasets, M5 competition |
| Media / Entertainment | Content recommendation, sentiment on reviews, engagement prediction | MovieLens, Spotify API, YouTube Data API |
| HR / Talent Analytics | Employee attrition prediction, resume screening, compensation analysis | IBM HR Analytics dataset (Kaggle) |
| Project | Recruiter Reaction | Better Alternative |
|---|---|---|
| Titanic Survival Prediction | Seen it 500 times. Skip. | Customer churn or credit default |
| MNIST Digit Classifier | Tutorial copy. No signal. | Custom image dataset (scrape your own) |
| Iris Flower Classification | Day 1 exercise. Not a portfolio piece. | Multi-class product categorisation |
| Boston Housing (deprecated) | Outdated dataset with known issues | Ames Housing with feature engineering |
| Tutorial notebook, different data | Obvious copy. Damages credibility. | Original problem with your own framing |
| 10+ unfinished notebooks | "Can't complete projects" signal | 3 polished, complete projects |
| Stock price prediction (LSTM) | Overfit, misleading results, cliché | Demand or sales forecasting with ARIMA/Prophet |
If you build exactly this combination, you will have a portfolio that stands out for entry to mid-level data science roles:
Each project should have a clean README, clear metrics, and ideally a live link. Five projects like this will get you more interviews than fifty unfinished Kaggle notebooks ever will.
Three to five well-documented, end-to-end projects outperform a GitHub with twenty unfinished notebooks every time. Recruiters spend fewer than three minutes on your profile — make each project immediately legible, with a clear problem statement, metrics, and ideally a live demo link.
Both — and in that order. Start with strong EDA and data storytelling projects to prove data fluency. Then move to classification and regression with model evaluation. Recruiters are suspicious of candidates who can only run models but can't explore and explain data.
No — not as a primary portfolio project. Every recruiter has seen hundreds of Titanic notebooks. Use it as a learning exercise, then build something with a real, less-common dataset that shows you picked a problem independently.
Not all of them, but at least one deployed project is strongly recommended. Even a simple Streamlit app signals that you understand the end-to-end pipeline beyond Jupyter notebooks — and gives interviewers something live to interact with, which is far more memorable.
At minimum: Pandas and NumPy for data manipulation, Matplotlib or Seaborn for visualisation, and Scikit-learn for modelling. For intermediate roles, add XGBoost/LightGBM, SQL integration via SQLAlchemy, and Streamlit for deployment. For advanced roles, include MLflow, FastAPI, and exposure to HuggingFace or LangChain.
Both, ideally. Use Kaggle notebooks for competition work and community visibility. Use GitHub for your primary portfolio — a well-structured repository with a clear README, folder hierarchy, and requirements file signals engineering maturity that a Kaggle notebook alone cannot.