- Responsible for how code is deployed, configured, and monitored, as well as the availability, latency, change management, emergency response, and capacity management of services already in or going to production
- Design, code, test, and deliver software that automates manual operational work and enables self-service, auto-detection, and self-healing
- Develop software for reliability and scale, ensuring minimal refactoring or changes
- Define, monitor and defend SLOs
- Deploy closed-loop remediation (continuous testing and remediation) to fix problems in pre-production before software is released to production.
- Build custom tooling from scratch to meet specific needs in the incident management workflow.
- Resolve complex incidents across public cloud, private cloud, third-party, and on-premises technology.
- Leverage Chaos Engineering to find and prevent future problems and to confirm that fixes from past incidents function as intended.
- Focus on end-user experiences and partner with development teams to implement changes to increase uptime and performance based on empirical evidence.
- Troubleshoot priority incidents, facilitate blameless post-incident reviews, and ensure permanent closure of incidents
- Identify application patterns and analytics in support of better service level objectives
- Design performance tests, identify bottlenecks and opportunities for optimization and capacity demands, and present solutions for continuous improvements
- Design best-in-class monitoring frameworks that provide end-to-end flow monitoring and noiseless alerting
- Design automated solutions for software and product upgrades, change management, and release management
- Bachelor’s degree or equivalent experience in a software engineering discipline
- 2-3 years of SRE or System Engineering experience.
- Expert in designing, coding, testing, and delivering software in at least one technology stack, e.g., Java, Python, C, Go, etc.
- Deep knowledge of Internet protocols and web services technologies, e.g., HTTP, DNS, TCP/UDP, SOAP, JSON, Apache, Tomcat, and REST
- Experience working with containers e.g., Docker, Kubernetes, Cloud Foundry, etc.
- Experience working with automation tools, e.g., Ansible, Puppet, Selenium, etc.
- In-depth OS experience, e.g., RHEL, Ubuntu, Windows Server, with strong debugging, troubleshooting, and problem-solving skills
- Experience with testing and build automation in a continuous integration/continuous delivery (CI/CD) pipeline, e.g., Travis CI, Maven, Gradle, Groovy, Git, Terraform, Jenkins, etc.
- Experience deploying and managing services on modern platforms e.g., AWS, Google Cloud Platform, Azure.
- Strong experience using industry-standard monitoring tools, e.g., AppDynamics, Dynatrace, APICA, Splunk, ELK, FluentD, Prometheus, Kibana, Elasticsearch, Grafana, Nagios, Datadog, New Relic, etc.
- Advanced understanding of the application monitoring stack (logs, events, metrics, and alerts) and the ability to visualize and set up end-to-end observability
- Certification in one or more cloud technologies, e.g., AWS, Azure, Google Cloud Platform, or Red Hat, is a big plus.