Data
Data Warehouse Modernization
Cut infrastructure costs by 40% ($120K/year) and improved query performance 3x by migrating legacy Hive/EMR to a modern cloud warehouse.
Hive, Spark, Google Cloud, SQL, EMR, Python
The Problem
Legacy Hive/EMR infrastructure had ballooning operational costs, 4–6 hour query runtimes for business-critical reports, and no pathway to support AI/ML workloads.
The Solution
Designed and executed a phased migration to a modern cloud data warehouse. Implemented columnar storage, intelligent partitioning, and query optimization. Rebuilt ETL pipelines in Spark for 10x throughput. Established AI-ready data layers for downstream ML consumption.
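A phased migration depends on showing that the new warehouse reproduces the legacy results before anything is switched over. The following is a minimal sketch of one way such a parity check could work, not the project's actual tooling: the function names `table_fingerprint` and `outputs_match` are hypothetical, and rows are assumed to arrive as plain Python tuples already fetched from each system.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, order-independent digest) for an iterable of row tuples."""
    count = 0
    combined = 0
    for row in rows:
        # repr() keeps None, "" and numeric types distinguishable from one another.
        encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
        digest = hashlib.sha256(encoded).digest()
        # Summing per-row hashes modulo 2**256 makes the result independent of
        # row order, while the row count still catches dropped or duplicated rows.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, combined


def outputs_match(legacy_rows, candidate_rows):
    """True when both systems produced the same multiset of rows."""
    return table_fingerprint(legacy_rows) == table_fingerprint(candidate_rows)
```

In practice the two row sets would come from running the same report query against the legacy and the new system. Because the comparison uses `repr`, values must be normalized to the same Python types on both sides first (for example, `1` and `1.0` would count as a mismatch).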
The Impact
Reduced infrastructure costs by 40% ($120K annual savings). Improved average query performance by 3x. Enabled the first ML feature pipeline on clean, structured data.
Tech Details
- Phased migration strategy: shadow-run new warehouse in parallel before cutover
- Columnar storage format (Parquet) replacing row-based Hive tables
- Partitioning by date and publisher_id, reducing scan volume by 70%
- Spark-based ETL pipelines replacing Hive MapReduce, for a 10x throughput improvement
- Query optimization: predicate pushdown, broadcast joins, materialized views
- AI-ready data layer: cleaned, normalized feature tables for downstream ML pipelines
- Cost monitoring dashboards tracking compute and storage spend post-migration
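The effect of partitioning by date and publisher_id on scan volume can be illustrated with a small sketch. It models Hive-style partition directories in plain Python and shows how a query filter lets an engine skip whole partitions. The `Partition` class, the `prune` and `scan_reduction` helpers, and the toy sizes are all assumptions made for illustration; the figures it produces are not the project's measurements.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Partition:
    """One Hive-style partition directory, e.g. date=2024-01-01/publisher_id=7."""
    day: date
    publisher_id: int
    size_bytes: int

    @property
    def path(self) -> str:
        return f"date={self.day.isoformat()}/publisher_id={self.publisher_id}"


def prune(partitions, start, end, publisher_ids=None):
    """Keep only partitions whose keys satisfy the query's filters."""
    wanted = None if publisher_ids is None else set(publisher_ids)
    return [
        p for p in partitions
        if start <= p.day <= end and (wanted is None or p.publisher_id in wanted)
    ]


def scan_reduction(partitions, kept):
    """Fraction of bytes skipped thanks to pruning (0.0 means a full scan)."""
    total = sum(p.size_bytes for p in partitions)
    if total == 0:
        return 0.0
    return 1.0 - sum(p.size_bytes for p in kept) / total


# Toy layout: 10 days x 4 publishers, every partition the same size.
layout = [
    Partition(date(2024, 1, d), pub, 100)
    for d in range(1, 11)
    for pub in range(1, 5)
]
# A report covering 5 days and 2 publishers touches 10 of the 40 partitions,
# so 75% of the bytes are skipped in this toy layout.
selected = prune(layout, date(2024, 1, 1), date(2024, 1, 5), publisher_ids=[1, 2])
print(len(selected), scan_reduction(layout, selected))
```

The actual reduction depends on how closely real query filters line up with the partition keys, which is why the keys are usually chosen from the columns that reports filter on most often.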