Paintings & Museum Analytics — SQL, PySpark & Pandas

This project solved 20+ competitive analytics challenges on a Famous Paintings & Museum dataset using three complementary tools (SQL, PySpark, and Pandas) to compare their strengths.

Key Techniques

› INNER / LEFT / FULL OUTER JOINs across 8 tables
› Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG/LEAD
› Aggregations with GROUP BY + HAVING filters
› String operations: UPPER, TRIM, SUBSTRING, REGEXP
› Data cleaning: NULL handling, deduplication, type casting
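
As a rough illustration of several techniques listed above, the sketch below runs a LEFT JOIN, a GROUP BY + HAVING filter, TRIM/UPPER cleanup, and two window functions against an in-memory SQLite database from Python. The tables `museum` and `work`, their columns, and every row are hypothetical stand-ins invented for this example; they are not the project's actual schema or queries.

```python
import sqlite3

# Hypothetical miniature schema and rows, invented for illustration only;
# the project's real 8-table schema is not reproduced here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE museum (museum_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE work (work_id INTEGER PRIMARY KEY, title TEXT, museum_id INTEGER);
INSERT INTO museum VALUES (1, '  alpha gallery '), (2, 'Beta Museum'), (3, 'Gamma Hall');
INSERT INTO work VALUES
    (10, 'Study A', 1), (11, 'Study B', 1), (12, 'Study C', 1),
    (13, 'Study D', 2), (14, 'Study E', 2),
    (15, 'Study F', NULL);
""")

# LEFT JOIN keeps museums that hold no works; TRIM/UPPER normalise the name;
# HAVING filters the groups after aggregation.
busy = conn.execute("""
    SELECT UPPER(TRIM(m.name)) AS museum, COUNT(w.work_id) AS n_works
    FROM museum AS m
    LEFT JOIN work AS w ON w.museum_id = m.museum_id
    GROUP BY m.museum_id
    HAVING COUNT(w.work_id) >= 2
    ORDER BY n_works DESC
""").fetchall()

# Window functions: rank museums by work count and read the previous row's count.
ranked = conn.execute("""
    WITH counts AS (
        SELECT m.museum_id, COUNT(w.work_id) AS n_works
        FROM museum AS m
        LEFT JOIN work AS w ON w.museum_id = m.museum_id
        GROUP BY m.museum_id
    )
    SELECT museum_id, n_works,
           DENSE_RANK() OVER (ORDER BY n_works DESC) AS rnk,
           LAG(n_works) OVER (ORDER BY n_works DESC) AS prev_n
    FROM counts
    ORDER BY rnk, museum_id
""").fetchall()

# NULL handling: count works that have no museum assigned.
orphans = conn.execute(
    "SELECT COUNT(*) FROM work WHERE museum_id IS NULL"
).fetchone()[0]

print(busy)
print(ranked)
print(orphans)
```

SQLite is used here only because it ships with Python; the same SQL patterns carry over to other engines with minor dialect differences, and PySpark and Pandas express them through their own join, group-by, and window APIs.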

Impact

Reduced average query time by 35% through strategic indexing and PySpark DataFrame caching. Produced a reusable comparison guide adopted by 3 team members.
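
The indexing idea can be sketched as follows, again with Python's built-in SQLite and a hypothetical `work` table filled with synthetic rows. The example only shows how the query plan changes once an index exists on the filtered column; it does not reproduce the 35% figure, and the index name `idx_work_museum` is an assumption made for illustration. PySpark caching (via `DataFrame.cache()`) is not shown here.

```python
import sqlite3

# Hypothetical table with synthetic rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work (work_id INTEGER PRIMARY KEY, title TEXT, museum_id INTEGER)"
)
conn.executemany(
    "INSERT INTO work (title, museum_id) VALUES (?, ?)",
    [(f"Work {i}", i % 50) for i in range(5000)],
)

query = "SELECT COUNT(*) FROM work WHERE museum_id = 7"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes one plan step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(query)  # expected to be a full table scan
conn.execute("CREATE INDEX idx_work_museum ON work (museum_id)")
plan_after = plan(query)   # expected to mention the new index

matches = conn.execute(query).fetchone()[0]

print(plan_before)
print(plan_after)
print(matches)
```

The exact plan wording varies between SQLite versions, so the useful signal is simply whether the index name appears in the plan after the index is created.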