Paintings & Museum Analytics — SQL, PySpark & Pandas
SQL · PySpark · Pandas · Python
Solved 20+ competitive data modeling challenges using SQL, PySpark, and Pandas. Built JOIN-heavy queries, window functions, and aggregation pipelines on a multi-table museum dataset — reducing query execution time by 35% through indexing and DataFrame caching.
This project tackled 20+ competitive analytics challenges on a Famous Paintings & Museum dataset using three complementary tools to compare their strengths.
Key Techniques
- INNER / LEFT / FULL OUTER JOINs across 8 tables
- Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD
- Aggregations with GROUP BY + HAVING filters
- String operations: UPPER, TRIM, SUBSTRING, REGEXP
- Data cleaning: NULL handling, deduplication, type casting
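As a rough illustration of several techniques from the list above, the sketch below runs SQL through Python's built-in sqlite3 module. The two tables (`museum`, `work`), their columns, and the sample rows are hypothetical stand-ins, since the project's actual 8-table schema is not shown here. Treating a NULL price as 0 is likewise only an assumption made for the example. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical two-table schema and sample rows (not the project's real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE museum (museum_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE work (
    work_id INTEGER PRIMARY KEY, title TEXT, museum_id INTEGER, sale_price REAL
);
INSERT INTO museum VALUES (1, ' louvre '), (2, 'Prado'), (3, 'Uffizi');
INSERT INTO work VALUES
    (10, 'A', 1, 500.0),
    (11, 'B', 1, 300.0),
    (12, 'C', 2, 300.0),
    (13, 'D', 2, NULL),
    (14, 'E', NULL, 100.0);
""")


def museums_with_min_works(conn, min_works):
    """LEFT JOIN + GROUP BY + HAVING, with TRIM/UPPER to normalize names."""
    sql = """
        SELECT UPPER(TRIM(m.name)) AS museum, COUNT(w.work_id) AS n_works
        FROM museum AS m
        LEFT JOIN work AS w ON w.museum_id = m.museum_id
        GROUP BY m.museum_id
        HAVING COUNT(w.work_id) >= ?
        ORDER BY museum
    """
    return conn.execute(sql, (min_works,)).fetchall()


def rank_works_by_price(conn):
    """DENSE_RANK over price; NULL prices are treated as 0 (an assumption)."""
    sql = """
        SELECT title,
               COALESCE(sale_price, 0.0) AS price,
               DENSE_RANK() OVER (
                   ORDER BY COALESCE(sale_price, 0.0) DESC
               ) AS price_rank
        FROM work
        ORDER BY price_rank, title
    """
    return conn.execute(sql).fetchall()


print(museums_with_min_works(conn, 2))
print(rank_works_by_price(conn))
```

The LEFT JOIN keeps museums that hold no works, so the HAVING threshold decides whether they appear. DENSE_RANK gives tied prices the same rank without leaving a gap after the tie, which is where it differs from RANK.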
Impact
Reduced average query time by 35% through strategic indexing and PySpark DataFrame caching. Produced a reusable comparison guide adopted by 3 team members.
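The indexing half of that result can be illustrated with a small, hedged sketch. It uses Python's built-in sqlite3 module and a hypothetical `work` table, not the project's actual database or indexes, and it does not reproduce the 35% figure. It only shows how adding an index changes the query plan from a full scan to an index search. On the PySpark side, the analogous step is calling `DataFrame.cache()` on a DataFrame that several queries reuse.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Hypothetical table and synthetic rows, used only to demonstrate the plan change.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work (work_id INTEGER PRIMARY KEY, title TEXT, museum_id INTEGER)"
)
conn.executemany(
    "INSERT INTO work (title, museum_id) VALUES (?, ?)",
    [(f"work {i}", i % 50) for i in range(1000)],
)

sql = "SELECT title FROM work WHERE museum_id = 7"

# Without an index, the filter on museum_id requires scanning the whole table.
plan_before = query_plan(conn, sql)

# Index the filter column (the index name is illustrative).
conn.execute("CREATE INDEX idx_work_museum ON work (museum_id)")
plan_after = query_plan(conn, sql)

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing plans before and after is a cheap way to confirm that an index is actually used before measuring timings on real data.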