Job Description
The specific responsibilities of an SRE managing a large, distributed application built on microservices, Spring Boot, and Google Cloud may include:
- Strong background in software development and systems administration, as well as excellent problem-solving and communication skills.
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Developing, improving, and operating the deployment and orchestration of a complex distributed system
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Collaborating with development teams to design, build, and operate scalable and resilient software systems
- Automating deployment, monitoring, and incident response processes
- Performing root cause analysis of production incidents and implementing preventive measures
- Conducting performance analysis and optimization of the system
- Implementing and maintaining disaster recovery processes
- Participating in an on-call rotation for incident response and support.
- Four-year college degree in Computer Science or equivalent.
- 4+ years' experience developing multi-tier applications with Java, J2EE, NoSQL/SQL datastores, Spring Boot, Google Cloud Platform/AWS/Azure, and Docker/Kubernetes.
- Programming skills (Perl, Python, Ruby, Java/Scala or C).
- Experience with RESTful APIs and microservices platforms is a must.
- Working knowledge of the TCP/IP stack, internet routing and load balancing
- 2-3 years of experience with APM and other monitoring tools such as Dynatrace, New Relic, ELK, Splunk, Prometheus, Sensu, Nagios, Kafka, Datadog, or PagerDuty.
- Experience working with product and development teams to establish error budgets by identifying the right SLOs (service level objectives), SLIs (service level indicators), and KPIs (key performance indicators), and effectively driving the use of the budget to ensure maximum domain availability/uptime.
- Debug production issues across services and levels of the stack.
- Thorough understanding of the software development cycle and agile programming environments.
- Architect, design & develop automation to reduce toil, improve recoverability, availability, latency & scalability of supported applications.
- Triage, analyze, and provide solutions to critical and high-priority technical issues occurring in the ecosystem; optimize incident management processes.
- Respond, react, and communicate as per the ITSM incident management process. This process involves detection of the incident, timely communication to leadership during the life of the incident, service restoration, and root cause analysis to prevent the incident from recurring.
- Practice destructive testing to discover vulnerabilities in environments powered by distributed software systems.