- Work with global 24x7 SRE team members
- Own reliability/SRE and incident management, using both reactive and proactive approaches
- Foster a culture of blameless incident analysis and learning beyond root cause, with a focus on reducing MTTD and MTTR and increasing MTBF
- Execute the reliability and resiliency roadmap, delivering the tooling and cross-cutting projects that make the system more resilient
- Drive large initiatives by collaborating across multiple teams and stakeholders
- Regularly evaluate current systems and drive improvements in capability, performance, scale, and readiness for growth
- Promote adoption of modern software and infrastructure development standards
- Drive simplicity and increase the resilience and scale of the services
- Strong experience running AWS infrastructure at scale with IaC
- Strong experience in building immutable infrastructure with Packer, Terraform, and Ansible or Salt
- GitOps CI/CD experience with GitLab and Kubernetes
- Experience with application/infrastructure monitoring and observability using OpenTelemetry
- Experience in CDN and caching technologies
- Strong architecture and system design sense – experience developing and deploying microservices and APIs at scale
- Influence the team to build quality in from the start
- Experience in delivering complex software and infrastructure releases on tight schedules with high quality
- Bachelor's or Master's degree in Computer Science, Computer or Electrical Engineering, Mathematics, or a related field