- To perform this job successfully, an individual must be able to perform each primary duty satisfactorily.
- Collaborate with development, operations and infrastructure teams to ensure availability of services and to work through implementation issues
- Develop automation to respond to incidents and prevent problem recurrence
- Create and enhance runbooks to respond to service outages or degradations
- Assess the production readiness of services
- Define and track operational metrics for production performance, reliability, scalability and availability
- Develop and maintain shared services and tools to improve reliability and reduce toil across the organization
- Contribute to the team’s continuous improvement through research, retrospectives, discussion groups, code
- Experience maintaining and troubleshooting large-scale distributed systems
- Experience with Agile / Scrum methodology
- Experience managing infrastructure in public cloud environments like AWS (preferred), Azure or Google Cloud Platform
- Experience providing visibility using monitoring and alerting tools like Splunk, SignalFx, AppDynamics, Datadog, Stackdriver, Sysdig, Prometheus or Grafana
- Programming/scripting experience in languages like Java, Bash, Python or Go
- Experience with distributed messaging systems like Kafka, RabbitMQ, or ActiveMQ
- Experience with container orchestration systems like Kubernetes, Mesos, Docker Swarm or Rancher
- Experience using Continuous Integration and Continuous Delivery (CI/CD) tools like Jenkins, Travis, Harness, Spinnaker, AppVeyor, CodeBuild or CodePipeline
- Bachelor’s degree
- 3-5 years of experience in Site Reliability Engineering / DevOps