What Python Projects Should I Build to Get a Data Science Job?

By Rohan Mehta · May 17, 2026 · 14-min read · Data Science Career

You've finished your Python basics. You know what a DataFrame is. You've watched a dozen tutorials. And now you're staring at a blank Jupyter notebook wondering: What should I actually build?

This guide is the answer — not a vague "build something you're passionate about" answer, but a specific, level-by-level roadmap of projects that data science hiring managers actually care about, the tools each one should demonstrate, and exactly why each project earns you a closer look.

  • 3–5: projects needed in your portfolio
  • <3 min: time recruiters spend on your GitHub
  • 1: deployed project minimum
  • 0: Titanic models that impress anyone

In This Article

  1. What Data Science Recruiters Actually Look For
  2. Project Roadmap by Level
  3. Beginner Projects (0–3 months exp.)
  4. Intermediate Projects (3–9 months exp.)
  5. Advanced Projects (9+ months exp.)
  6. The Python Stack You Must Know
  7. How to Present Projects on GitHub
  8. Matching Projects to Your Target Industry
  9. Projects That Waste Your Time
  10. FAQs

What Data Science Recruiters Actually Look For

Before we talk projects, understand what you're trying to prove. A recruiter reviewing your portfolio is asking five questions:

⚠️ The tutorial trap: Running a pre-written notebook with a different dataset is not a project. Changing column names in someone else's code is not a project. Recruiters can tell the difference instantly. Every project in your portfolio must show your decisions — about the problem, the data, the model, and the interpretation.

Project Roadmap by Experience Level

Build in this sequence. Don't skip levels — each stage teaches you something the next one assumes you already know.

🟢 Beginner (0–3 months)

  • Real-world EDA with storytelling
  • Regression: house / salary prediction
  • Classification: churn or loan default

🟡 Intermediate (3–9 months)

  • Sentiment analysis with NLP
  • Recommendation system
  • Time series forecasting
  • SQL + Python dashboard

🔴 Advanced (9+ months)

  • End-to-end ML pipeline + MLflow
  • Image classifier with CNN
  • Fraud detection (imbalanced data)
  • LLM / RAG-powered app

Beginner Projects — Build These First

1. Exploratory Data Analysis on a Real, Messy Dataset (Beginner)

Pick a raw, imperfect dataset — government open data, a Kaggle dataset outside the top-10 most popular, or your own scraped data. Write a complete EDA: missing value analysis, distribution plots, outlier detection, correlation heatmaps, and a written story summarising what the data tells you. The output is a clean, well-commented notebook with a narrative — not just code.

Pandas · Matplotlib · Seaborn · NumPy · Jupyter
✅ Why it matters: Shows data intuition and communication — the two most undervalued skills at the entry level. Most beginners skip EDA. Doing it well instantly stands out.
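
As a rough illustration of the first EDA steps, here is a minimal sketch using Pandas. The DataFrame and its column names are made up for the example (they stand in for whatever raw dataset you pick), and the 1.5 × IQR rule is just one common outlier heuristic, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a raw, messy dataset.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52, 38, 29, 120],  # 120 looks suspicious
    "income": [32000, 41000, 39000, np.nan, 58000, 45000, np.nan, 47000],
    "city": ["Pune", "Delhi", None, "Pune", "Mumbai", "Delhi", "Pune", "Mumbai"],
})

# 1. Missing value analysis: share of missing entries per column.
missing_share = df.isna().mean().sort_values(ascending=False)

# 2. Outlier detection with the 1.5 * IQR rule on one numeric column.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# 3. Correlation between numeric columns (input for a heatmap).
corr = df[["age", "income"]].corr()

print(missing_share)
print(outliers)
print(corr)
```

The numbers are only the starting point. The written story around them (why is income missing, is age 120 a typo or a real record?) is what the project is really about.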

2. Customer Churn Prediction (Beginner)

Use a telecom or SaaS churn dataset (e.g. the IBM Telco dataset on Kaggle). Build a binary classifier that predicts whether a customer will leave. Go beyond accuracy: report a confusion matrix, precision and recall, and ROC-AUC, and explain which features drive churn and why. Add a brief business interpretation: "If the company acts on customers with churn probability above 0.7, it retains X% more revenue."

Scikit-learn · Pandas · Logistic Regression · Random Forest · Seaborn
✅ Why it matters: Churn prediction is one of the most common real business problems. It shows you can translate a model output into a business decision — not just produce a number.
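
To make the "act above 0.7" idea concrete, here is a hedged sketch of evaluating a classifier at two probability cut-offs. It assumes scikit-learn and uses synthetic data from make_classification as a stand-in for a real churn table; the `report` helper is hypothetical, written only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn dataset (roughly 25% churners).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.75], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
churn_prob = model.predict_proba(X_test)[:, 1]

# ROC-AUC measures ranking quality independently of any threshold.
auc = roc_auc_score(y_test, churn_prob)

def report(threshold):
    """Confusion matrix, precision and recall when acting above `threshold`."""
    flagged = (churn_prob >= threshold).astype(int)
    return {
        "threshold": threshold,
        "confusion_matrix": confusion_matrix(y_test, flagged),
        "precision": precision_score(y_test, flagged, zero_division=0),
        "recall": recall_score(y_test, flagged, zero_division=0),
    }

default_cut = report(0.5)
strict_cut = report(0.7)
print(f"ROC-AUC: {auc:.3f}")
print(default_cut["precision"], default_cut["recall"])
print(strict_cut["precision"], strict_cut["recall"])
```

Raising the cut-off can only shrink the set of flagged customers, so recall never goes up. Whether the precision you gain is worth the churners you miss is the business question your write-up should answer.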

3. House Price Prediction with Feature Engineering (Beginner)

Use the Ames Housing dataset (more complex than Boston Housing, which is deprecated). The goal isn't just to run Linear Regression; it's to show your feature engineering thinking. Create new features (age of house, neighbourhood ranking, and a neighbourhood-level price per sq ft computed from training data only, because a per-house price per sq ft would leak the target), handle skewed distributions with log transforms, compare at least 3 models, and visualise feature importances clearly.

Scikit-learn · Pandas · XGBoost · Feature Engineering · Cross-validation
✅ Why it matters: Feature engineering is where data science jobs are actually won. Showing you can create meaningful features — not just pass raw columns into a model — is a significant differentiator.
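
A small sketch of the kind of features described above, assuming Pandas and NumPy. The rows and column names are illustrative and do not match the real Ames schema. The neighbourhood ranking is derived from training rows only, which is one way to keep the sale price from leaking into the features.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names are illustrative, not the real Ames schema.
train = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "C"],
    "living_area_sqft": [1200, 1500, 900, 1100, 2000],
    "year_built": [1995, 2005, 1960, 1972, 2006],
    "year_sold": [2008, 2009, 2008, 2010, 2009],
    "sale_price": [180000, 240000, 99000, 126500, 400000],
})

# Age of the house at the time of sale.
train["house_age"] = train["year_sold"] - train["year_built"]

# Log-transform the skewed target; np.expm1 inverts it at prediction time.
train["log_price"] = np.log1p(train["sale_price"])

# Neighbourhood ranking from median price per sq ft, learned on training rows only.
price_per_sqft = train["sale_price"] / train["living_area_sqft"]
nbhd_median = price_per_sqft.groupby(train["neighbourhood"]).median()
nbhd_rank = nbhd_median.rank(method="dense").astype(int)
train["nbhd_rank"] = train["neighbourhood"].map(nbhd_rank)

print(train[["neighbourhood", "house_age", "log_price", "nbhd_rank"]])
```

At prediction time the same `nbhd_rank` mapping would be applied to unseen houses, so the feature never depends on the price you are trying to predict.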

Intermediate Projects — Add These Next

4. Sentiment Analysis on Real Social / Review Data (Intermediate)

Collect Amazon product reviews or app store reviews using BeautifulSoup or an API, then build a sentiment classifier. Go beyond positive/negative: include neutral, and try aspect-based sentiment (e.g. "battery life is bad" → negative sentiment on the 'battery' aspect). Deploy the model as a simple Streamlit app where users can type a review and get a sentiment score instantly.

NLTK / spaCy · Streamlit · Scikit-learn · BeautifulSoup · TF-IDF · HuggingFace
✅ Why it matters: NLP skills are in high demand. Adding a live Streamlit deployment makes this interactive for interviewers — they can actually try it, which is far more memorable than a static notebook.
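
One possible baseline for the three-class classifier is TF-IDF features feeding a logistic regression, sketched below with scikit-learn. The nine reviews are hand-written placeholders, far too few to train anything meaningful; they only show the shape of the pipeline. Aspect-based sentiment and the Streamlit front end are not covered here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-written reviews standing in for collected data.
reviews = [
    "battery life is great and the screen is lovely",
    "love this phone, fast and reliable",
    "excellent camera, totally worth the price",
    "battery life is bad and it overheats",
    "terrible build quality, broke in a week",
    "awful support, waste of money",
    "it is okay, nothing special",
    "average phone, does the job",
    "works as described, neither good nor bad",
]
labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

# TF-IDF on unigrams and bigrams feeding a multi-class logistic regression.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
clf.fit(reviews, labels)

new_review = ["the battery is bad"]
predicted = clf.predict(new_review)[0]
probabilities = dict(zip(clf.classes_, clf.predict_proba(new_review)[0]))
print(predicted, probabilities)
```

Because the fitted pipeline takes raw strings, a demo app could pass the user's text straight to `clf.predict_proba` and display the class probabilities as the sentiment score.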

5. Movie or Product Recommendation System (Intermediate)

Build both a collaborative filtering (user-item matrix) and a content-based filtering approach, then compare them. Use the MovieLens dataset or a public e-commerce review dataset. Implement a hybrid approach as a bonus. Build a simple UI where a user can input a movie/product name and get recommendations. Discuss cold-start problems and how you'd solve them in production.

Surprise / implicit · Pandas · Cosine Similarity · Matrix Factorisation · Streamlit
✅ Why it matters: Recommendation systems power Netflix, Amazon, and Spotify. Showing you understand the underlying maths — not just a library call — signals real competence.
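
To show the maths behind the library call, here is a minimal item-based collaborative filtering sketch in NumPy. The rating matrix is invented, and treating unrated items as zeros inside the cosine similarity is a simplification that a real project should revisit. The `recommend` helper is hypothetical.

```python
import numpy as np

# Hypothetical user-item ratings (rows: users, columns: items, 0 = unrated).
items = ["Alien", "Blade Runner", "Arrival", "Notting Hill", "Love Actually"]
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 5, 0, 1],
    [5, 0, 4, 1, 0],
    [1, 0, 1, 5, 4],
    [0, 1, 0, 4, 5],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

def recommend(user_idx, top_n=2):
    """Rank unrated items by the similarity-weighted average of the user's ratings."""
    user = ratings[user_idx]
    rated = user > 0
    scores = item_sim[:, rated] @ user[rated] / item_sim[:, rated].sum(axis=1)
    scores[rated] = -np.inf  # never recommend something already rated
    best = np.argsort(scores)[::-1][:top_n]
    return [items[i] for i in best]

print(recommend(0))
```

User 0 rated the two science-fiction titles highly, so "Arrival" ranks above "Love Actually". A brand-new user with no ratings gets no meaningful scores from this approach, which is exactly the cold-start problem the write-up should discuss.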

6. Time Series Forecasting: Sales, Traffic, or Weather (Intermediate)

Use a publicly available time series dataset (Walmart sales, airline passenger data, or a public API like Open-Meteo for weather). Implement classical decomposition (trend, seasonality, residuals), then model with ARIMA/SARIMA, and compare against Facebook Prophet. Show your understanding of stationarity, ACF/PACF plots, and model diagnostics. Include a confidence interval in your forecast visualisation.

Prophet · statsmodels · ARIMA · Plotly · Pandas
✅ Why it matters: Almost every business has time-indexed data — sales, traffic, inventory, demand. Time series is one of the most requested data science skills in job descriptions.
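
Before reaching for ARIMA or Prophet, it helps to see what differencing and an ACF plot are actually computing. The sketch below uses only NumPy on a synthetic monthly series (trend plus yearly seasonality plus noise, all invented); in a real project you would more likely use the plotting helpers in statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(120)  # ten years of monthly observations

# Synthetic series: linear trend + yearly seasonality + noise.
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: x.size - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

# First differencing removes the linear trend, a common step towards stationarity.
diffed = np.diff(series)

acf_raw = acf(series, 24)
acf_diff = acf(diffed, 24)

print("lag 6 and lag 12 after differencing:", np.round(acf_diff[[6, 12]], 2))
```

After differencing, the autocorrelation is strongly positive at lag 12 and strongly negative at lag 6. That pattern is the signature of yearly seasonality, and it is the kind of evidence that justifies a seasonal term in SARIMA.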

7. SQL + Python End-to-End Analysis Dashboard (Intermediate)

Load a relational dataset (e.g. Northwind or a public e-commerce database) into PostgreSQL or SQLite. Write complex SQL queries (CTEs, window functions, subqueries) to extract KPIs, then pull results into Python and build an interactive dashboard with Plotly Dash or Streamlit. The key is showing the full pipeline: database → SQL → Python → visualisation — not just Python alone.

SQL (PostgreSQL) · Plotly / Dash · SQLAlchemy · Pandas · Streamlit
✅ Why it matters: 90%+ of data science job descriptions ask for SQL. Most candidates say they know SQL — this project proves it with something tangible.
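
Here is a minimal sketch of the database → SQL → Python step using the sqlite3 module from Python's standard library. The `orders` table and its rows are invented for the example, and the window functions assume a reasonably recent SQLite (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT,
                         order_month TEXT, amount REAL);
    INSERT INTO orders (customer, order_month, amount) VALUES
        ('alice', '2024-01', 120.0), ('bob',   '2024-01',  80.0),
        ('alice', '2024-02', 200.0), ('carol', '2024-02',  50.0),
        ('bob',   '2024-03', 150.0), ('carol', '2024-03', 100.0);
""")

# The CTE aggregates revenue per month; window functions add a running
# total and the month-over-month change.
query = """
WITH monthly AS (
    SELECT order_month, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_month
)
SELECT order_month,
       revenue,
       SUM(revenue) OVER (ORDER BY order_month) AS running_revenue,
       revenue - LAG(revenue) OVER (ORDER BY order_month) AS change_vs_prev
FROM monthly
ORDER BY order_month;
"""
kpis = conn.execute(query).fetchall()
for row in kpis:
    print(row)
conn.close()
```

In a fuller project the same query text could be handed to pandas.read_sql over an open connection, and the resulting DataFrame passed to Plotly or Streamlit for the dashboard layer.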

Advanced Projects — Make the Shortlist

8. End-to-End ML Pipeline with MLflow + CI/CD (Advanced)

Take any classification or regression problem and build it as a proper ML pipeline: data ingestion → preprocessing → feature engineering → model training → evaluation → model registry. Use MLflow to track experiments, log metrics, and version models. Add a basic CI/CD step with GitHub Actions that retrains and re-evaluates the model on any code push. Deploy the final model as a REST API with FastAPI.

MLflow · FastAPI · GitHub Actions · Scikit-learn · Docker · Pandas
✅ Why it matters: This is what separates data scientists from ML engineers — and most companies want both in one person. Showing MLOps awareness immediately puts you in the top 10% of candidates.
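
The preprocessing → training → evaluation core of such a pipeline can be sketched with scikit-learn alone. The data below is synthetic, and the column names and target rule are made up. Experiment tracking, the model registry, CI/CD, and the FastAPI layer are deliberately left out; in a tracked run, the fold scores would be the values you log, for instance with mlflow.log_metric.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: columns and target rule are invented for the demo.
rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, n).astype(float),
    "monthly_fee": rng.normal(50, 15, n),
    "plan": rng.choice(["basic", "plus", "pro"], n),
})
data.loc[rng.choice(n, 20, replace=False), "monthly_fee"] = np.nan  # inject gaps
target = ((data["tenure_months"] < 24) & (data["plan"] == "basic")).astype(int)

numeric = ["tenure_months", "monthly_fee"]
categorical = ["plan"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Preprocessing is refit inside every fold, so no validation statistics leak in.
scores = cross_val_score(pipeline, data, target, cv=5, scoring="roc_auc")
print("ROC-AUC per fold:", np.round(scores, 3))
```

Keeping imputation, scaling, and encoding inside one Pipeline object also means the exact same transformations travel with the model when it is saved and served behind an API.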

9. Credit Card Fraud Detection: Imbalanced Dataset Problem (Advanced)

Use the public credit card fraud dataset (highly imbalanced — 99.8% legitimate, 0.2% fraud). Show exactly how naive accuracy is useless here. Implement SMOTE, class weighting, and threshold tuning. Compare models on precision-recall AUC rather than ROC-AUC. Discuss the real-world trade-off: higher recall means catching more fraud but also flagging more legitimate transactions — how does a business decide where to set the threshold?

imbalanced-learn · XGBoost · SMOTE · Scikit-learn · Precision-Recall
✅ Why it matters: Imbalanced data is the norm in real-world ML (fraud, disease, churn, rare events). Showing you know how to handle it — and can explain the business trade-offs — is a massive signal of maturity.
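
The sketch below illustrates why accuracy misleads on imbalanced data, plus class weighting and threshold tuning, using scikit-learn only. It runs on synthetic data with about 2% positives, which is milder than the real dataset. The 80% recall target is an arbitrary assumption for the example, and SMOTE (from imbalanced-learn) is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_recall_curve, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 2% positives; the real dataset is far more skewed.
X, y = make_classification(n_samples=10000, n_features=10, n_informative=6,
                           weights=[0.98], class_sep=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Naive baseline: predict "legitimate" for every transaction.
always_legit = np.zeros_like(y_test)
baseline_accuracy = accuracy_score(y_test, always_legit)
baseline_recall = recall_score(y_test, always_legit)

# Class weighting makes mistakes on the rare class cost more during training.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
fraud_prob = model.predict_proba(X_test)[:, 1]
pr_auc = average_precision_score(y_test, fraud_prob)

# Threshold tuning: the highest threshold that still reaches 80% recall.
precision, recall, thresholds = precision_recall_curve(y_test, fraud_prob)
reaches_target = recall[:-1] >= 0.80
chosen_threshold = thresholds[reaches_target].max()
precision_at_choice = precision[:-1][reaches_target][-1]

print(f"baseline accuracy {baseline_accuracy:.3f}, baseline recall {baseline_recall:.1f}")
print(f"PR-AUC {pr_auc:.3f}, threshold {chosen_threshold:.3f}, "
      f"precision at that threshold {precision_at_choice:.3f}")
```

The do-nothing baseline scores very high accuracy while catching zero fraud, which is the whole argument for judging models on precision and recall. The recall target itself is a business input: it encodes how many false alarms the fraud team can afford to review.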

10. LLM-Powered Q&A App: RAG Pipeline (Advanced)

Build a Retrieval-Augmented Generation (RAG) application using a public document corpus (research papers, a company's annual reports, Wikipedia exports). Chunk and embed documents using sentence-transformers, store in a vector database (ChromaDB or FAISS), and connect to an open-source LLM (via HuggingFace or Ollama locally). Build a Streamlit interface where users ask natural language questions and the app retrieves relevant chunks and generates a grounded answer.

LangChain / LlamaIndex · ChromaDB / FAISS · HuggingFace · Sentence Transformers · Streamlit
✅ Why it matters: RAG and LLM integration is the single hottest skill in data science hiring in 2025–26. One working RAG app in your portfolio signals that you're current, not behind.
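
The retrieval half of RAG is easier to reason about once you have seen it without any framework. The sketch below uses only the standard library, with a toy bag-of-words vector standing in for a sentence-transformers embedding and a plain list standing in for ChromaDB or FAISS. The document text and helper names are invented, and the LLM call itself is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

document = (
    "Revenue grew twelve percent in 2023 driven by cloud subscriptions. "
    "The company opened two new offices in Pune and Nairobi. "
    "Operating costs rose because of higher cloud infrastructure spending. "
    "The board approved a share buyback programme."
)

# 1. Chunk: one sentence per chunk (real projects often use overlapping windows).
chunks = [s.strip() for s in document.split(".") if s.strip()]

# 2. Embed and store: a plain list plays the role of the vector database.
index = [(text, embed(text)) for text in chunks]

def retrieve(question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# 3. Build a grounded prompt; sending it to an LLM is outside this sketch.
question = "Why did operating costs rise?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

Swapping `embed` for a real embedding model and `index` for a vector store changes the quality of retrieval, not the structure: chunk, embed, rank by similarity, then pass the top chunks to the model as context.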

The Python Stack You Must Know

Every project should progressively build your command of these libraries. Don't try to learn them all at once — learn each one in the context of a project.

🐍 Core Data Science Stack

Data Manipulation

Pandas, NumPy — non-negotiable. You should be able to write complex groupby, merge, reshape and window operations without googling every line.
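
As a self-check, these four operations fit in a few lines. The sketch below uses two small invented tables; the running total per customer is one simple example of a window-style calculation.

```python
import pandas as pd

# Hypothetical order and customer tables.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "month": ["2024-01", "2024-02", "2024-01", "2024-02", "2024-03", "2024-03"],
    "amount": [100.0, 150.0, 80.0, 120.0, 60.0, 200.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "business"],
})

# merge: attach the segment to every order.
enriched = orders.merge(customers, on="customer_id", how="left")

# groupby: revenue per segment and month.
revenue = enriched.groupby(["segment", "month"], as_index=False)["amount"].sum()

# reshape: months become columns, one row per segment.
wide = revenue.pivot(index="segment", columns="month", values="amount").fillna(0.0)

# window: each customer's running spend, in month order.
enriched = enriched.sort_values(["customer_id", "month"])
enriched["running_spend"] = enriched.groupby("customer_id")["amount"].cumsum()

print(wide)
print(enriched)
```

If you can predict the shape of `wide` and the values in `running_spend` before running the code, you are in good shape for most take-home data manipulation tasks.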

Visualisation

Matplotlib + Seaborn for static plots. Plotly for interactive charts. One of these must appear in every project — a notebook with no visuals is a red flag.

Machine Learning

Scikit-learn as the foundation. XGBoost / LightGBM for tree-based models. Understand pipelines, cross-validation, and hyperparameter tuning — not just model.fit().

NLP

NLTK for basics, spaCy for production NLP, HuggingFace Transformers for state-of-the-art models. Know TF-IDF, word embeddings, and at least one transformer model.

Deployment

Streamlit for quick demos. FastAPI for REST APIs. Docker for containerisation. Even one deployed project is worth ten static notebooks in an interview.

Databases & SQL

PostgreSQL or SQLite with SQLAlchemy. Window functions, CTEs, joins. SQL fluency is tested in almost every data science interview — treat it as seriously as Python.

MLOps (advanced)

MLflow for experiment tracking. GitHub Actions for CI/CD. Basic cloud exposure (AWS S3, GCP BigQuery, or Azure ML). Increasingly expected even at mid-level roles.

Deep Learning (advanced)

PyTorch or TensorFlow for image and sequence tasks. LangChain / LlamaIndex for LLM applications. HuggingFace for pre-trained models. One project using these is enough at the entry level.

How to Present Projects on GitHub — The Recruiter View

A brilliant project with a poor GitHub presentation is an invisible project. Here is what every repository must have:

Matching Projects to Your Target Industry

Generic projects get generic responses. Domain-specific projects get callbacks. Match at least one or two projects to the industry you want to enter:

Target Industry | Recommended Project Domain | Dataset Sources
Fintech / Banking | Credit risk scoring, fraud detection, loan default prediction | Kaggle, UCI ML Repository, LendingClub
E-commerce / Retail | Customer segmentation (RFM), churn prediction, recommendation engine | Olist, Instacart, Kaggle retail datasets
Healthcare | Disease prediction, patient readmission, medical image classification | MIMIC-III, Kaggle health datasets, NIH
EdTech | Student performance prediction, dropout risk, content recommendation | Open University Learning Analytics Dataset
Logistics / Supply Chain | Demand forecasting, route optimisation, inventory prediction | Kaggle supply chain datasets, M5 competition
Media / Entertainment | Content recommendation, sentiment on reviews, engagement prediction | MovieLens, Spotify API, YouTube Data API
HR / Talent Analytics | Employee attrition prediction, resume screening, compensation analysis | IBM HR Analytics dataset (Kaggle)

💡 Power move: Research the company you're applying to before each interview. If they're in fintech, make sure your fraud detection or credit risk project sits at the top of your portfolio and is pinned on your GitHub profile. Tailoring your README introduction to mention the domain you're targeting takes 10 minutes and meaningfully increases recruiter interest.

Projects That Waste Your Time (Or Actively Hurt You)

Project | Recruiter Reaction | Better Alternative
Titanic Survival Prediction | Seen it 500 times. Skip. | Customer churn or credit default
MNIST Digit Classifier | Tutorial copy. No signal. | Custom image dataset (scrape your own)
Iris Flower Classification | Day 1 exercise. Not a portfolio piece. | Multi-class product categorisation
Boston Housing (deprecated) | Outdated dataset with known issues | Ames Housing with feature engineering
Tutorial notebook, different data | Obvious copy. Damages credibility. | Original problem with your own framing
10+ unfinished notebooks | "Can't complete projects" signal | 3 polished, complete projects
Stock price prediction (LSTM) | Overfit, misleading results, cliché | Demand or sales forecasting with ARIMA/Prophet

🎯 The Portfolio That Gets You Hired

If you build exactly this combination, you will have a portfolio that stands out for entry to mid-level data science roles:

Each project should have a clean README, clear metrics, and ideally a live link. Five projects like this will get you more interviews than fifty unfinished Kaggle notebooks ever will.

Frequently Asked Questions

Q: How many Python projects do I need to get a data science job?

Three to five well-documented, end-to-end projects outperform a GitHub with twenty unfinished notebooks every time. Recruiters spend fewer than three minutes on your profile — make each project immediately legible, with a clear problem statement, metrics, and ideally a live demo link.

Q: Should I build ML projects or data analysis projects?

Both — and in that order. Start with strong EDA and data storytelling projects to prove data fluency. Then move to classification and regression with model evaluation. Recruiters are suspicious of candidates who can only run models but can't explore and explain data.

Q: Is the Titanic dataset good enough for my portfolio?

No — not as a primary portfolio project. Every recruiter has seen hundreds of Titanic notebooks. Use it as a learning exercise, then build something with a real, less-common dataset that shows you picked a problem independently.

Q: Do I need to deploy my data science projects?

Not all of them, but at least one deployed project is strongly recommended. Even a simple Streamlit app signals that you understand the end-to-end pipeline beyond Jupyter notebooks — and gives interviewers something live to interact with, which is far more memorable.

Q: What Python libraries should every data science project demonstrate?

At minimum: Pandas and NumPy for data manipulation, Matplotlib or Seaborn for visualisation, and Scikit-learn for modelling. For intermediate roles, add XGBoost/LightGBM, SQL integration via SQLAlchemy, and Streamlit for deployment. For advanced roles, include MLflow, FastAPI, and exposure to HuggingFace or LangChain.

Q: Should my projects be on Kaggle or GitHub?

Both, ideally. Use Kaggle notebooks for competition work and community visibility. Use GitHub for your primary portfolio — a well-structured repository with a clear README, folder hierarchy, and requirements file signals engineering maturity that a Kaggle notebook alone cannot.

Rohan Mehta Data Science Career Coach, PredictCollege · Ex-Data Scientist at a Pune-based fintech · 4+ years of hiring-side experience reviewing data science portfolios · Kaggle Contributor