Product-Oriented Technical Operations Leader with 10+ years operating at the intersection of Support, Engineering, and Product within cloud-based data platforms.
I translate production failure patterns and enterprise escalations into structured product improvements. I care about how systems actually behave in the real world, not just how they look in architecture diagrams.
I operate in the "messy middle" between customers and code:
- Pattern Recognition: Spotting recurring failure modes across 1000+ customers and converting them into reliability initiatives
- Product Thinking: Translating support escalations into actionable roadmap items that prevent future issues
- Data-Driven Decisions: Using production telemetry, MTTR trends, and customer impact data to prioritize engineering work
- Bridge Building: Aligning Support, SRE, Product, and Engineering teams around shared reliability goals
TL;DR: I turn "why did this break?" into "how do we make sure it never breaks again?"
- Distributed data pipelines (ingestion, transformation, destinations)
- Source connector reliability (PostgreSQL, MySQL, SaaS APIs)
- CDC, schema evolution, offset management
- Handling real-world edge cases: rate limits, network partitions, zombie connections
- 24x7 global support operations for cloud ELT systems
- P0/P1 production incident command & resolution
- SLA/MTTR optimization through automation and process improvements
- Enterprise escalation management
- AI-assisted ticket classification & RCA extraction
- Operational dashboards & observability improvements
- Knowledge base deflection strategies
- Self-service diagnostic tools
- Documentation, code snippets, utilities
- Real-world debugging scenarios
- Lessons from production incidents
- Python: automation, APIs, data processing, PySpark
- SQL: PostgreSQL, MySQL, Snowflake, Redshift, BigQuery
- Bash: scripting, operational glue, incident response
- Databases: PostgreSQL, MySQL, Snowflake, Redshift
- Streaming: Kafka, Debezium, CDC patterns
- Cloud Platforms: AWS, GCP, Azure environments
- ELT Tools: Experience debugging distributed ingestion systems
- REST APIs, OAuth flows, webhook systems
- Rate limiting, pagination, retry strategies
- Third-party connector troubleshooting (20+ integrations)
- Docker, CI/CD pipelines
- Incident management frameworks
- RCA documentation & post-mortem culture
- Observability & monitoring strategies
- Automated ticket classification using LLMs
- RCA pattern extraction across 10,000+ production incidents
- Integration with Google Sheets for stakeholder reporting
- Source connector edge case handling (auth failures, schema drift, CDC lag)
- Data consistency validation across sources and destinations
- API rate limit & network timeout resilience patterns
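
To illustrate the rate limit and timeout resilience patterns listed above, here is a minimal sketch of retrying a call with capped exponential backoff and full jitter. The names `call_with_backoff` and `RateLimitError` are hypothetical and chosen for illustration; they are not taken from any specific connector or library, and the default delays are assumptions rather than tuned values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error a connector raises on an HTTP 429-style response."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      retry_on=(RateLimitError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying transient failures with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the original error
            # The delay ceiling doubles each attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: pick a random delay below the ceiling so many
            # clients do not retry in lockstep and cause a retry storm.
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is injectable so the behavior can be exercised in tests without real waiting, and the bounded attempt count keeps a persistent outage from turning into unbounded retries.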
- Ticket summarization & issue categorization
- Focus on deterministic outputs and guardrails (not "magic")
- Reducing support engineer toil through intelligent automation
- Production debugging: timeouts, data loss, retry storms
- Root-cause analysis over symptom firefighting
- Implementing preventive measures based on failure patterns
Simple > Clever
Observability before optimization
Evidence over hype
Root causes over symptoms
Pragmatism over perfection
Core Beliefs:
- Systems fail in ways you didn't anticipate. Plan for it.
- The best feature is the one that prevents customer pain.
- Support engineers see patterns product teams don't. Listen to them.
- Reliability is a product feature, not just an SRE concern.
- Good documentation prevents more incidents than good code.
- Website: legolasan.in
- LinkedIn: linkedin.com/in/arunsunderraj
- Email: arunsunderraj@outlook.com
- Location: Bengaluru, India

