Senior Data Platform Reliability Engineer
Hammerjack Pty Ltd
📍 Philippines
Job Description
Your Role
- Run managed services, not just systems. Operate multi-tenant data/AI platforms (Spark, Airflow, Flink, Jupyter) with clear SLAs/SLIs/SLOs, cost guardrails, and capacity plans across AWS/GCP + Kubernetes.
- Be the face of reliability. Lead incidents end-to-end, own customer comms and post-incident reviews (RCA with actions customers can see and feel).
- Design for customer experience. Help data scientists and customers reduce failed/slow jobs, improve time-to-data, and optimize costs so customers notice faster pipelines and fewer surprises.
- Standardize & scale. Build service runbooks, golden paths, and automation that make onboarding and daily ops predictable across customers.
- Automate the toil away. Ship tooling (Bash/Python, GitOps, CI/CD) for backups, DR drills, upgrades, access, and environment bootstrapping.
- Make signals meaningful. Instrument platforms with metrics/logs/traces; tune alerting to cut noise...