Should I build machine learning projects or data analysis projects for a data science job?

Both. A strong data science portfolio includes at least one solid EDA and storytelling project (shows data fluency), one or two machine learning projects with model evaluation and deployment, and ideally one real-world dataset project from a domain relevant to your target industry. Pure ML without good data understanding is a red flag for experienced interviewers.

What Python libraries should I use in data science projects?

Core libraries every project should demonstrate: Pandas and NumPy for data manipulation, Matplotlib and Seaborn for visualization, Scikit-learn for classical ML, and either Flask or Streamlit for deployment. For advanced roles, add XGBoost/LightGBM, SQL integration, and exposure to TensorFlow or PyTorch for deep learning tasks.

Is the Titanic dataset good enough for a data science project portfolio?

No, at least not as a standalone project. Every recruiter has seen hundreds of Titanic notebooks. It is a fine learning exercise, but it should not be a primary project in your portfolio. Use real, less-common datasets from Kaggle or government open data portals, or scrape your own data, to show genuine initiative and problem-solving.

Do I need to deploy my data science projects to get a job?

Not all of them, but at least one deployed project (even a simple Streamlit app) signals that you understand the full pipeline from raw data to usable output. Deployment sets you apart from candidates who only know Jupyter notebooks. It also gives interviewers something live to interact with — which is far more memorable than a static notebook.

What domain should my data science projects be in?

Match your projects to your target industry. Applying to fintech? Build a credit risk or fraud detection model. Healthcare? Build a disease prediction or patient readmission model. E-commerce? Build a recommendation system or churn predictor. Domain-specific projects demonstrate that you can apply data science to real business problems — which is exactly what companies hire for.

What Python Projects Should I Build to Get a Data Science Job?

Q: How many Python projects do I need to get a data science job?

Quality beats quantity every time. Three to five well-documented, end-to-end projects with clear problem statements, proper EDA, model selection rationale, and deployed outputs will outperform a GitHub with twenty unfinished notebooks. Most recruiters spend less than 3 minutes per GitHub profile — make every project count.

By Rohan Mehta · May 17, 2026 · 14-min read · Data Science Career

You've finished your Python basics. You know what a DataFrame is. You've watched a dozen tutorials. And now you're staring at a blank Jupyter notebook wondering: What should I actually build?

This guide is the answer — not a vague "build something you're passionate about" answer, but a specific, level-by-level roadmap of projects that data science hiring managers actually care about, the tools each one should demonstrate, and exactly why each project earns you a closer look.

3–5: projects needed in your portfolio

<3 min: time recruiters spend on your GitHub

Deployed project minimum

Titanic models that impress anyone

What Data Science Recruiters Actually Look For
Project Roadmap by Level
Beginner Projects (0–3 months exp.)
Intermediate Projects (3–9 months exp.)
Advanced Projects (9+ months exp.)
The Python Stack You Must Know
How to Present Projects on GitHub
Matching Projects to Your Target Industry
Projects That Waste Your Time
FAQs

What Data Science Recruiters Actually Look For

Before we talk projects, understand what you're trying to prove. A recruiter reviewing your portfolio is asking five questions:

Can you frame a real problem? Not just run a model, but articulate what question you're answering and why it matters.
Do you understand your data? EDA, data quality checks, feature understanding — the grunt work that separates real practitioners from tutorial followers.
Can you choose and justify the right model? Not just "I tried Random Forest" — but why, compared to what alternatives, evaluated how.
Can you communicate findings clearly? Visualisations, narratives, clean notebooks. Data science is communication, not just computation.
Can you take it to production? Even a basic Streamlit app or Flask endpoint shows you understand the full pipeline.

⚠️ The tutorial trap: Running a pre-written notebook with a different dataset is not a project. Changing column names in someone else's code is not a project. Recruiters can tell the difference instantly. Every project in your portfolio must show your decisions — about the problem, the data, the model, and the interpretation.

Project Roadmap by Experience Level

Build in this sequence. Don't skip levels — each stage teaches you something the next one assumes you already know.

🟢 Beginner (0–3 months)

Real-world EDA with storytelling
Regression: house / salary prediction
Classification: churn or loan default

🟡 Intermediate (3–9 months)

Sentiment analysis with NLP
Recommendation system
Time series forecasting
SQL + Python dashboard

🔴 Advanced (9+ months)

End-to-end ML pipeline + MLflow
Image classifier with CNN
Fraud detection (imbalanced data)
LLM / RAG-powered app

Beginner Projects — Build These First

1. Exploratory Data Analysis on a Real, Messy Dataset (Beginner)

Pick a raw, imperfect dataset — government open data, a Kaggle dataset outside the top-10 most popular, or your own scraped data. Write a complete EDA: missing value analysis, distribution plots, outlier detection, correlation heatmaps, and a written story summarising what the data tells you. The output is a clean, well-commented notebook with a narrative — not just code.

Pandas · Matplotlib · Seaborn · NumPy · Jupyter

✅ Why it matters: Shows data intuition and communication — the two most undervalued skills at the entry level. Most beginners skip EDA. Doing it well instantly stands out.
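As a starting point, the checks described above (missing values, outliers, correlations) can be sketched in a few lines of pandas. This is a minimal sketch, not a full EDA: the `eda_summary` helper and the toy columns `price`, `units`, and `region` are made up for illustration, and the 1.5 × IQR rule is only one of several reasonable outlier definitions.

```python
import numpy as np
import pandas as pd

def eda_summary(df: pd.DataFrame) -> dict:
    """Return a few basic data-quality checks for a DataFrame."""
    numeric = df.select_dtypes(include="number")

    # Share of missing values per column, worst first.
    missing = df.isna().mean().sort_values(ascending=False)

    # Flag outliers with the 1.5 * IQR rule, column by column.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)

    return {
        "missing_share": missing,
        "outlier_counts": is_outlier.sum(),
        "correlations": numeric.corr(),
    }

# Hypothetical toy data standing in for a real, messy dataset.
toy = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, 13.0, 250.0, np.nan],
    "units": [5, 6, 5, 7, 6, 4],
    "region": ["N", "S", None, "N", "S", "S"],
})
report = eda_summary(toy)
print(report["missing_share"])
print(report["outlier_counts"])
```

In a real notebook, each of these tables would be followed by a plot and a short written interpretation; the numbers alone are not the story.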

2. Customer Churn Prediction (Beginner)

Use a telecom or SaaS churn dataset (e.g. the IBM Telco Customer Churn dataset on Kaggle). Build a binary classifier that predicts whether a customer will leave. Go beyond accuracy: report a confusion matrix, precision and recall, and ROC-AUC, and explain which features drive churn and why. Add a brief business interpretation: "If the company acts on customers with churn probability above 0.7, it retains X% more revenue."

Scikit-learn · Pandas · Logistic Regression · Random Forest · Seaborn

✅ Why it matters: Churn prediction is one of the most common real business problems. It shows you can translate a model output into a business decision — not just produce a number.
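To make the evaluation step concrete, here is a rough sketch of scoring a churn classifier with ROC-AUC and then applying the 0.7 probability cut-off mentioned above. The data is synthetic (generated with `make_classification`), so the numbers mean nothing on their own; with a real churn table, the same calls would run on your own features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn table: roughly 25% of customers churn.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
churn_prob = model.predict_proba(X_test)[:, 1]

# Threshold-independent ranking quality.
auc = roc_auc_score(y_test, churn_prob)

# Business rule from the text: act only on customers above 0.7.
threshold = 0.7
flagged = churn_prob >= threshold
tn, fp, fn, tp = confusion_matrix(y_test, flagged.astype(int)).ravel()
precision = tp / (tp + fp) if (tp + fp) else float("nan")
recall = tp / (tp + fn)
print(f"ROC-AUC={auc:.3f}  flagged={flagged.sum()}  "
      f"precision={precision:.2f}  recall={recall:.2f}")
```

The precision and recall at the chosen threshold are what feed the business interpretation: they tell you how many flagged customers are real churners and how many churners you miss.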

3. House Price Prediction with Feature Engineering (Beginner)

Use the Ames Housing dataset (richer than Boston Housing, which scikit-learn deprecated and later removed). The goal isn't just to run Linear Regression; it's to show your feature engineering thinking. Create new features from the inputs (total square footage, neighbourhood ranking, age of house), taking care not to derive any feature from the sale price itself, since that leaks the target. Handle skewed distributions with log transforms, compare at least 3 models, and visualise feature importances clearly.

Scikit-learn · Pandas · XGBoost · Feature Engineering · Cross-validation

✅ Why it matters: Feature engineering is where data science jobs are actually won. Showing you can create meaningful features — not just pass raw columns into a model — is a significant differentiator.
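The workflow above (engineer features, log-transform the target, compare several models with cross-validation) might look roughly like this. Everything here is an assumption for illustration: the column names are hypothetical rather than the real Ames schema, the data is synthetic, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost to keep the sketch dependency-free.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
# Hypothetical columns; the real Ames data has different names.
houses = pd.DataFrame({
    "living_area": rng.uniform(600, 3500, n),
    "basement_area": rng.uniform(0, 1500, n),
    "year_built": rng.integers(1900, 2010, n),
    "year_sold": rng.integers(2006, 2011, n),
})
price = (50 * houses["living_area"] + 20 * houses["basement_area"]
         + 300 * (houses["year_built"] - 1900)) * rng.lognormal(0, 0.1, n)

# Engineered features built from inputs only, never from the target.
features = houses.assign(
    total_area=houses["living_area"] + houses["basement_area"],
    house_age=houses["year_sold"] - houses["year_built"],
)
# Log-transform the right-skewed target so errors are relative.
log_price = np.log1p(price)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
scores = {
    name: -cross_val_score(model, features, log_price, cv=5,
                           scoring="neg_root_mean_squared_error").mean()
    for name, model in models.items()
}
for name, rmse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:>18}: CV RMSE (log scale) = {rmse:.3f}")
```

Reporting all three scores side by side, with a sentence on why you picked the winner, is the "model selection rationale" recruiters look for.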

Intermediate Projects — Add These Next

4. Sentiment Analysis on Real Social / Review Data (Intermediate)

Scrape product reviews from Amazon or app store reviews using BeautifulSoup or an API, then build a sentiment classifier. Go beyond positive/negative — include neutral, and try aspect-based sentiment (e.g. "battery life is bad" → negative sentiment on the 'battery' aspect). Deploy the model as a simple Streamlit app where users can type a review and get a sentiment score instantly.

NLTK / spaCy · Streamlit · Scikit-learn · BeautifulSoup · TF-IDF · HuggingFace

✅ Why it matters: NLP skills are in high demand. Adding a live Streamlit deployment makes this interactive for interviewers — they can actually try it, which is far more memorable than a static notebook.
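A baseline three-class sentiment model can be sketched with TF-IDF features and logistic regression, as below. The twelve reviews are hand-written placeholders, far too few to train anything meaningful; the `score_review` helper is a hypothetical name for the function a Streamlit front end might call.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-written examples; a real project would use scraped reviews.
reviews = [
    "battery life is bad", "terrible screen and slow", "awful, broke in a week",
    "worst purchase ever", "love the camera", "great value, works perfectly",
    "excellent sound quality", "really happy with it",
    "it arrived on tuesday", "the box is blue", "it has two buttons",
    "comes with a charger",
]
labels = ["negative"] * 4 + ["positive"] * 4 + ["neutral"] * 4

# TF-IDF on unigrams and bigrams feeding a multiclass linear classifier.
sentiment_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
sentiment_model.fit(reviews, labels)

def score_review(text: str) -> dict:
    """Return the predicted probability of each sentiment class."""
    probs = sentiment_model.predict_proba([text])[0]
    return dict(zip(sentiment_model[-1].classes_, probs.round(3)))

print(score_review("the battery is bad"))
```

A simple baseline like this is also useful later as the yardstick against which a transformer model has to justify its extra cost.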

5. Movie or Product Recommendation System (Intermediate)

Build both a collaborative filtering (user-item matrix) and a content-based filtering approach, then compare them. Use the MovieLens dataset or a public e-commerce review dataset. Implement a hybrid approach as a bonus. Build a simple UI where a user can input a movie/product name and get recommendations. Discuss cold-start problems and how you'd solve them in production.

Surprise / implicit · Pandas · Cosine Similarity · Matrix Factorisation · Streamlit

✅ Why it matters: Recommendation systems power Netflix, Amazon, and Spotify. Showing you understand the underlying maths — not just a library call — signals real competence.
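To show the maths rather than a library call, here is a small sketch of item-based collaborative filtering: cosine similarity between the item columns of a user-item matrix, written in plain NumPy. The ratings matrix and film titles are toy values, and the `item_similarity` and `recommend` helpers are illustrative names, not part of any library.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated". Toy data only.
items = ["Alien", "Aliens", "Amelie", "Heat", "Ronin"]
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 0, 0],
    [1, 0, 0, 5, 4],
    [0, 0, 0, 4, 5],
], dtype=float)

def item_similarity(matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between item columns of a user-item matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unrated items
    normalised = matrix / norms
    return normalised.T @ normalised

def recommend(item: str, k: int = 2) -> list[str]:
    """Return the k items most similar to `item`, excluding itself."""
    sim = item_similarity(ratings)
    idx = items.index(item)
    order = np.argsort(-sim[idx])
    return [items[j] for j in order if j != idx][:k]

print(recommend("Alien"))  # → ['Aliens', 'Heat']
```

Note that an item nobody has rated gets zero similarity to everything, which is the cold-start problem in miniature.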

6. Time Series Forecasting: Sales, Traffic, or Weather (Intermediate)

Use a publicly available time series dataset (Walmart sales, airline passenger data, or a public API like Open-Meteo for weather). Implement classical decomposition (trend, seasonality, residuals), then model with ARIMA/SARIMA, and compare against Prophet (originally released by Facebook). Show your understanding of stationarity, ACF/PACF plots, and model diagnostics. Include a prediction interval in your forecast visualisation.

Prophet · statsmodels · ARIMA · Plotly · Pandas

✅ Why it matters: Almost every business has time-indexed data — sales, traffic, inventory, demand. Time series is one of the most requested data science skills in job descriptions.
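Before reaching for ARIMA or Prophet, it helps to see what classical decomposition actually computes. The sketch below does an additive decomposition by hand with pandas on a synthetic monthly series, then scores a seasonal-naive baseline. It is a simplification (statsmodels' `seasonal_decompose` uses a properly centred moving average for even periods), and all values are generated, not real sales.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + yearly seasonality + noise.
rng = np.random.default_rng(0)
index = pd.date_range("2018-01-01", periods=72, freq="MS")
t = np.arange(72)
sales = pd.Series(100 + 2 * t + 15 * np.sin(2 * np.pi * t / 12)
                  + rng.normal(0, 3, 72), index=index)

period = 12
# Trend: moving average over one full seasonal cycle (approximately centred).
trend = sales.rolling(period, center=True).mean()
# Seasonality: average detrended value for each calendar month.
detrended = sales - trend
seasonal = detrended.groupby(detrended.index.month).transform("mean")
# Residual: whatever trend and seasonality do not explain.
residual = sales - trend - seasonal

# Baseline any ARIMA/Prophet model should beat: repeat last year's value.
train, test = sales.iloc[:-12], sales.iloc[-12:]
seasonal_naive = train.iloc[-12:].to_numpy()
mae = np.abs(test.to_numpy() - seasonal_naive).mean()
print(f"Seasonal-naive MAE on the last 12 months: {mae:.1f}")
```

Always report your model's error next to a baseline like this; a forecast that cannot beat "same month last year" is not worth deploying.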

7. SQL + Python End-to-End Analysis Dashboard (Intermediate)

Load a relational dataset (e.g. Northwind or a public e-commerce database) into PostgreSQL or SQLite. Write complex SQL queries (CTEs, window functions, subqueries) to extract KPIs, then pull results into Python and build an interactive dashboard with Plotly Dash or Streamlit. The key is showing the full pipeline: database → SQL → Python → visualisation — not just Python alone.

SQL (PostgreSQL) · Plotly / Dash · SQLAlchemy · Pandas · Streamlit

✅ Why it matters: 90%+ of data science job descriptions ask for SQL. Most candidates say they know SQL — this project proves it with something tangible.
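The database → SQL → Python hand-off can be sketched with nothing but the standard library's `sqlite3` and pandas. The `orders` table and its rows are invented for illustration; the query combines a CTE with two window functions, which assumes the bundled SQLite is version 3.25 or newer (true of most current Python builds).

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT,
                         order_month TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'acme',  '2024-01', 120.0),
        (2, 'acme',  '2024-02',  80.0),
        (3, 'birch', '2024-01',  50.0),
        (4, 'birch', '2024-02', 150.0),
        (5, 'acme',  '2024-03', 200.0);
""")

# CTE aggregates to monthly revenue; window functions add a running total
# and month-over-month change on top of it.
query = """
WITH monthly AS (
    SELECT order_month, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_month
)
SELECT order_month,
       revenue,
       SUM(revenue) OVER (ORDER BY order_month) AS running_revenue,
       revenue - LAG(revenue) OVER (ORDER BY order_month) AS mom_change
FROM monthly
ORDER BY order_month;
"""
kpis = pd.read_sql_query(query, conn)
conn.close()
print(kpis)
```

The resulting `kpis` DataFrame is what you would pass to Plotly or Streamlit; keeping the aggregation in SQL rather than pandas is the point of the project.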

Advanced Projects — Make the Shortlist

8. End-to-End ML Pipeline with MLflow + CI/CD (Advanced)

Take any classification or regression problem and build it as a proper ML pipeline: data ingestion → preprocessing → feature engineering → model training → evaluation → model registry. Use MLflow to track experiments, log metrics, and version models. Add a basic CI/CD step with GitHub Actions that retrains and re-evaluates the model on any code push. Deploy the final model as a REST API with FastAPI.

MLflow · FastAPI · GitHub Actions · Scikit-learn · Docker · Pandas

✅ Why it matters: This is what separates data scientists from ML engineers — and most companies want both in one person. Showing MLOps awareness immediately puts you in the top 10% of candidates.
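The preprocessing → training → evaluation core of such a pipeline can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`. This sketch deliberately leaves out MLflow, FastAPI, and CI/CD; the `run_record` dictionary is just a hypothetical stand-in for what an experiment tracker would log, and the columns and target are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Ingestion stand-in: a small synthetic table with hypothetical columns.
rng = np.random.default_rng(0)
n = 600
data = pd.DataFrame({
    "tenure": rng.integers(1, 60, n).astype(float),
    "monthly_fee": rng.uniform(10, 100, n),
    "plan": rng.choice(["basic", "plus", "pro"], n),
})
data.loc[rng.choice(n, 30, replace=False), "tenure"] = np.nan
target = (data["monthly_fee"] / 100 - data["tenure"].fillna(30) / 60
          + rng.normal(0, 0.3, n) > 0).astype(int)

# Preprocessing lives inside the pipeline, so it is fitted on training
# data only and reused unchanged at prediction time.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]),
     ["tenure", "monthly_fee"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipeline = Pipeline([("preprocess", preprocess),
                     ("model", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.25, stratify=target, random_state=0)
pipeline.fit(X_train, y_train)

# Parameters and metrics in a form an experiment tracker could log.
run_record = {
    "params": {"model": "LogisticRegression", "max_iter": 1000},
    "metrics": {"roc_auc": roc_auc_score(
        y_test, pipeline.predict_proba(X_test)[:, 1])},
}
print(run_record)
```

Because the fitted `pipeline` object bundles preprocessing with the model, it is the single artefact you would version in a model registry and load behind an API.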

9. Credit Card Fraud Detection: Imbalanced Dataset Problem (Advanced)

Use the public credit card fraud dataset (highly imbalanced: about 99.8% legitimate, under 0.2% fraud). Show exactly why naive accuracy is useless here: a model that never predicts fraud is still about 99.8% accurate. Implement SMOTE, class weighting, and threshold tuning, applying SMOTE to the training folds only so that synthetic samples never reach the evaluation data. Compare models on precision-recall AUC rather than ROC-AUC. Discuss the real-world trade-off: higher recall means catching more fraud but also flagging more legitimate transactions. How does a business decide where to set the threshold?

imbalanced-learn · XGBoost · SMOTE · Scikit-learn · Precision-Recall

✅ Why it matters: Imbalanced data is the norm in real-world ML (fraud, disease, churn, rare events). Showing you know how to handle it — and can explain the business trade-offs — is a massive signal of maturity.
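Two of the techniques above, class weighting and threshold tuning, can be sketched with scikit-learn alone. SMOTE is left out here because it needs the separate imbalanced-learn package. The data is synthetic with about 1% positives, and the 80% recall target is an arbitrary assumption standing in for a business requirement; in a real project the threshold would be tuned on a validation split, not the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_recall_curve)
from sklearn.model_selection import train_test_split

# Synthetic data with about 1% positives (less extreme than the real set,
# so the toy example has enough fraud cases to evaluate).
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           weights=[0.99, 0.01], class_sep=1.5, flip_y=0,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A "model" that never predicts fraud still scores very high accuracy.
naive_accuracy = accuracy_score(y_test, np.zeros_like(y_test))

# Class weighting makes errors on the rare class cost more in training.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
fraud_prob = model.predict_proba(X_test)[:, 1]

# Average precision: a common summary of the precision-recall curve.
pr_auc = average_precision_score(y_test, fraud_prob)

# Threshold tuning: the highest threshold that still reaches 80% recall.
precision, recall, thresholds = precision_recall_curve(y_test, fraud_prob)
meets_recall = recall[:-1] >= 0.80
best = np.flatnonzero(meets_recall)[-1]
print(f"naive accuracy={naive_accuracy:.3f}  PR-AUC={pr_auc:.3f}")
print(f"threshold={thresholds[best]:.3f}  precision={precision[best]:.2f}  "
      f"recall={recall[best]:.2f}")
```

The precision at the chosen threshold is the number to bring to the business discussion: it is the share of flagged transactions that are actually fraud.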

10. LLM-Powered Q&A App: RAG Pipeline (Advanced)

Build a Retrieval-Augmented Generation (RAG) application using a public document corpus (research papers, a company's annual reports, Wikipedia exports). Chunk and embed documents using sentence-transformers, store in a vector database (ChromaDB or FAISS), and connect to an open-source LLM (via HuggingFace or Ollama locally). Build a Streamlit interface where users ask natural language questions and the app retrieves relevant chunks and generates a grounded answer.

LangChain / LlamaIndex · ChromaDB / FAISS · HuggingFace · Sentence Transformers · Streamlit

✅ Why it matters: RAG and LLM integration is the single hottest skill in data science hiring in 2025–26. One working RAG app in your portfolio signals that you're current, not behind.
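The retrieval half of a RAG pipeline (chunk, vectorise, rank by similarity, build a prompt) can be sketched without any LLM or vector database. In this sketch TF-IDF vectors are swapped in for sentence-transformer embeddings and an in-memory matrix replaces ChromaDB or FAISS, purely to keep it small; the `chunk_text` and `retrieve` helpers and the two-document corpus are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk_text(text: str, size: int = 12, overlap: int = 4) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Hypothetical mini-corpus; a real app would load reports or papers.
documents = [
    "Revenue grew twelve percent in the third quarter driven by strong "
    "subscription sales across Europe and steady renewals in Asia.",
    "The company opened two new warehouses to shorten delivery times and "
    "reduce shipping costs for customers in rural regions.",
]
chunks = [c for doc in documents for c in chunk_text(doc)]

# TF-IDF vectors stand in for sentence embeddings in this sketch.
vectorizer = TfidfVectorizer().fit(chunks)
chunk_vectors = vectorizer.transform(chunks)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    scores = cosine_similarity(vectorizer.transform([question]),
                               chunk_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]

context = retrieve("How much did revenue grow?")
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```

In the full project, `prompt` would be sent to the LLM. Keyword matching like this misses paraphrases ("grow" versus "grew"), which is the practical argument for using real embeddings.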

The Python Stack You Must Know

Every project should progressively build your command of these libraries. Don't try to learn them all at once — learn each one in the context of a project.

🐍 Core Data Science Stack

Data Manipulation

Pandas, NumPy — non-negotiable. You should be able to write complex groupby, merge, reshape and window operations without googling every line.

Visualisation

Matplotlib + Seaborn for static plots. Plotly for interactive charts. One of these must appear in every project — a notebook with no visuals is a red flag.

Machine Learning

Scikit-learn as the foundation. XGBoost / LightGBM for tree-based models. Understand pipelines, cross-validation, and hyperparameter tuning — not just model.fit().

NLP

NLTK for basics, spaCy for production NLP, HuggingFace Transformers for state-of-the-art models. Know TF-IDF, word embeddings, and at least one transformer model.

Deployment

Streamlit for quick demos. FastAPI for REST APIs. Docker for containerisation. Even one deployed project is worth ten static notebooks in an interview.

Databases & SQL

PostgreSQL or SQLite with SQLAlchemy. Window functions, CTEs, joins. SQL fluency is tested in almost every data science interview — treat it as seriously as Python.

MLOps (advanced)

MLflow for experiment tracking. GitHub Actions for CI/CD. Basic cloud exposure (AWS S3, GCP BigQuery, or Azure ML). Increasingly expected even at mid-level roles.

Deep Learning (advanced)

PyTorch or TensorFlow for image and sequence tasks. LangChain / LlamaIndex for LLM applications. HuggingFace for pre-trained models. One project using these is enough at the entry level.

How to Present Projects on GitHub — The Recruiter View

A brilliant project with a poor GitHub presentation is an invisible project. Here is what every repository must have:

1. A clear README with a problem statement. First line: what problem does this solve and for whom? Not "this is a machine learning project"; that tells a recruiter nothing.
2. A results summary up top. Put your key metric right in the README: "Model achieves 94.2% ROC-AUC on test set, outperforming baseline by 11 points." Recruiters scan, so lead with impact.
3. A clear folder structure. Separate data/, notebooks/, src/, models/. A flat dump of files signals a beginner immediately.
4. A requirements.txt or environment.yml. If I can't reproduce your project, it effectively doesn't exist. Always include dependency files.
5. Clean, commented notebooks. Use markdown cells between code blocks to explain your thinking. The notebook should read like a technical blog post, not a string of unexplained code cells.
6. A live demo link (if deployed). Link to your Streamlit app, Hugging Face Space, or a recorded Loom walkthrough. Interactive beats static every single time.
7. A "Key Insights" section. What did you learn? What would you do differently? What are the model's limitations? This proves you think critically, not just mechanically.

Matching Projects to Your Target Industry

Generic projects get generic responses. Domain-specific projects get callbacks. Match at least one or two projects to the industry you want to enter:

Target Industry	Recommended Project Domain	Dataset Sources
Fintech / Banking	Credit risk scoring, fraud detection, loan default prediction	Kaggle, UCI ML Repository, LendingClub
E-commerce / Retail	Customer segmentation (RFM), churn prediction, recommendation engine	Olist, Instacart, Kaggle retail datasets
Healthcare	Disease prediction, patient readmission, medical image classification	MIMIC-III, Kaggle health datasets, NIH
EdTech	Student performance prediction, dropout risk, content recommendation	Open University Learning Analytics Dataset
Logistics / Supply Chain	Demand forecasting, route optimisation, inventory prediction	Kaggle supply chain datasets, M5 competition
Media / Entertainment	Content recommendation, sentiment on reviews, engagement prediction	MovieLens, Spotify API, YouTube Data API
HR / Talent Analytics	Employee attrition prediction, resume screening, compensation analysis	IBM HR Analytics dataset (Kaggle)

💡 Power move: Research the company you're applying to before each interview. If they're in fintech, make sure your fraud detection or credit risk project sits at the top of your portfolio and is pinned on GitHub. Tailoring your README introduction to mention the domain you're targeting takes 10 minutes and meaningfully increases recruiter interest.

Projects That Waste Your Time (Or Actively Hurt You)

Project	Recruiter Reaction	Better Alternative
Titanic Survival Prediction	Seen it 500 times. Skip.	Customer churn or credit default
MNIST Digit Classifier	Tutorial copy. No signal.	Custom image dataset (scrape your own)
Iris Flower Classification	Day 1 exercise. Not a portfolio piece.	Multi-class product categorisation
Boston Housing (deprecated)	Outdated dataset with known issues	Ames Housing with feature engineering
Tutorial notebook, different data	Obvious copy. Damages credibility.	Original problem with your own framing
10+ unfinished notebooks	"Can't complete projects" signal	3 polished, complete projects
Stock price prediction (LSTM)	Overfit, misleading results, cliché	Demand or sales forecasting with ARIMA/Prophet

🎯 The Portfolio That Gets You Hired

If you build exactly this combination, you will have a portfolio that stands out for entry to mid-level data science roles:

1 strong EDA project — proves data fluency and communication
1 classification project with business interpretation — churn, fraud, or loan default
1 NLP or time series project — deployed as a Streamlit app
1 SQL + Python dashboard — proves database literacy
1 advanced project — end-to-end pipeline with MLflow, or an LLM/RAG app

Each project should have a clean README, clear metrics, and ideally a live link. Five projects like this will get you more interviews than fifty unfinished Kaggle notebooks ever will.

Frequently Asked Questions

Q: How many Python projects do I need to get a data science job?

Three to five well-documented, end-to-end projects outperform a GitHub with twenty unfinished notebooks every time. Recruiters spend fewer than three minutes on your profile — make each project immediately legible, with a clear problem statement, metrics, and ideally a live demo link.

Q: Should I build ML projects or data analysis projects?

Both, starting with data analysis. Build strong EDA and data storytelling projects first to prove data fluency, then move to classification and regression with model evaluation. Recruiters are suspicious of candidates who can run models but can't explore and explain data.

Q: Is the Titanic dataset good enough for my portfolio?

No — not as a primary portfolio project. Every recruiter has seen hundreds of Titanic notebooks. Use it as a learning exercise, then build something with a real, less-common dataset that shows you picked a problem independently.

Q: Do I need to deploy my data science projects?

Not all of them, but at least one deployed project is strongly recommended. Even a simple Streamlit app signals that you understand the end-to-end pipeline beyond Jupyter notebooks — and gives interviewers something live to interact with, which is far more memorable.

Q: What Python libraries should every data science project demonstrate?

At minimum: Pandas and NumPy for data manipulation, Matplotlib or Seaborn for visualisation, and Scikit-learn for modelling. For intermediate roles, add XGBoost/LightGBM, SQL integration via SQLAlchemy, and Streamlit for deployment. For advanced roles, include MLflow, FastAPI, and exposure to HuggingFace or LangChain.

Q: Should my projects be on Kaggle or GitHub?

Both, ideally. Use Kaggle notebooks for competition work and community visibility. Use GitHub for your primary portfolio — a well-structured repository with a clear README, folder hierarchy, and requirements file signals engineering maturity that a Kaggle notebook alone cannot.

Rohan Mehta · Data Science Career Coach, PredictCollege · Ex-Data Scientist at a Pune-based fintech · 4+ years of hiring-side experience reviewing data science portfolios · Kaggle Contributor
