TB-Scale Search Product
Built a search product that queries multiple terabytes of data with sub-7-second latency, using DuckDB, Trino, and Google Cloud Dataproc as complementary query engines.

The Challenge
Users needed reliable search on multi-terabyte proprietary datasets with strict latency targets and region-sensitive data constraints.
Architecture & Approach
A hybrid query stack combines DuckDB for local analytical paths, Trino for federated distributed queries, and Dataproc for heavy batch workloads, all orchestrated by a routing and scaling layer.
Profiled query classes, matched each class to the best-suited execution engine, and introduced autoscaling policies driven by scan-volume estimates computed before execution.
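
To make the routing idea concrete, here is a minimal sketch in Python. The thresholds, the pre-computed scan estimate, and the `is_federated` flag are illustrative assumptions, not the production values.

```python
# A minimal routing sketch: pick an engine from a pre-execution scan
# estimate. Cutoffs below are illustrative, not the production values.
from enum import Enum


class Engine(Enum):
    DUCKDB = "duckdb"      # local analytical paths, small scans
    TRINO = "trino"        # federated distributed queries
    DATAPROC = "dataproc"  # heavy batch workloads

LOCAL_SCAN_LIMIT_BYTES = 4 * 1024**3        # assumed single-node budget
FEDERATED_SCAN_LIMIT_BYTES = 512 * 1024**3  # beyond this, go to Dataproc


def route_query(estimated_scan_bytes: int, is_federated: bool) -> Engine:
    """Pick an execution engine from the estimated scan volume.

    The estimate is produced before execution (e.g. from table
    statistics), so a query never starts on the wrong engine.
    """
    if estimated_scan_bytes <= LOCAL_SCAN_LIMIT_BYTES and not is_federated:
        return Engine.DUCKDB
    if estimated_scan_bytes <= FEDERATED_SCAN_LIMIT_BYTES:
        return Engine.TRINO
    return Engine.DATAPROC


print(route_query(2 * 1024**3, is_federated=False))  # Engine.DUCKDB
print(route_query(1024**4, is_federated=True))       # Engine.DATAPROC
```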
My Role & Contributions
Built the core query-routing logic, implemented the dynamic pod-scaling strategy, and developed geofencing behavior that enforces user and data locality constraints.
Key Technical Decisions
- Used engine specialization (DuckDB vs Trino vs Dataproc) rather than a single universal engine, which gave more consistent latency across query classes (see the routing sketch above).
- Introduced predictive pod scaling driven by estimated data volume to reduce cold-start impact on large searches (see the first sketch after this list).
- Implemented geofencing rules at routing time so that data residency constraints are enforced before any query executes (see the second sketch after this list).
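
A hedged sketch of the predictive scaling rule follows; `BYTES_PER_POD`, the warm-pool floor, and the cluster cap are assumed figures chosen for illustration.

```python
# Predictive pod scaling: replicas are sized from the estimated scan
# volume *before* execution, so large searches avoid full cold-start
# cost. All constants below are assumed values.
import math

BYTES_PER_POD = 64 * 1024**3  # assumed per-pod scan capacity within SLA
MIN_PODS, MAX_PODS = 2, 48    # assumed warm-pool floor and cluster cap


def target_replicas(estimated_scan_bytes: int) -> int:
    """Return the pod count to pre-warm for an incoming search."""
    needed = math.ceil(estimated_scan_bytes / BYTES_PER_POD)
    return max(MIN_PODS, min(needed, MAX_PODS))


# A 1 TiB scan pre-warms 16 pods instead of scaling reactively mid-query.
print(target_replicas(1024**4))  # 16
```

And a simplified version of the routing-time geofencing check; the residency table, dataset names, and regions are hypothetical stand-ins.

```python
# Geofencing at routing time: the residency check runs before any
# engine is selected, so no engine ever scans data outside its
# permitted region. Table contents are hypothetical.
DATASET_RESIDENCY = {
    "eu_customers": {"eu-west1", "eu-west4"},
    "us_orders": {"us-central1"},
}


def allowed_regions(dataset: str, user_region: str) -> set[str]:
    """Return execution regions that satisfy residency for this request."""
    permitted = DATASET_RESIDENCY.get(dataset, set())
    if user_region not in permitted:
        raise PermissionError(
            f"user in {user_region} may not query {dataset}"
        )
    return permitted


print(allowed_regions("eu_customers", "eu-west1"))  # permitted regions
```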
Results & Impact
- Search latency: <7s
- Data volume: TB-scale
- Scaling: dynamic pod scaling model
- Delivered under 7-second search latency for core user journeys.
- Reduced infrastructure waste during low-traffic periods through adaptive scaling.
- Enabled compliant region-aware search over proprietary datasets.
The system met enterprise performance expectations at scale while honoring compliance constraints and keeping operating costs in check.
Lessons Learned
At very large data volumes, intelligent workload routing and elasticity policies drive performance more than micro-optimizations in a single query engine.