Site Reliability Engineer

29 Jun 2024

Vacancy expired!

Job Title: Site Reliability Engineer (SRE)

Location: Pleasanton, CA (remote until COVID-19 restrictions lift)

Duration: Long-term contract

Required Skills & Experience
  • Solid hands-on SRE experience
  • Experience supporting web-based (Java) applications
  • Experience with APM tools

Application Production Support exposure
  • Experience with application operations, cloud platforms, system uptime, system recovery, performance, latency, monitoring, and root cause analysis.
  • 4-6+ years of experience as an automation and tooling engineer.
  • Solid knowledge of and experience with scripting (Python / Bash) for Java/Node.js runtime environments.
  • Deep understanding of and experience with microservices, APIs, and web services.
  • Strong hands-on experience developing applications using Java, Node.js / AngularJS, Python, Go, etc.
  • Experience with cloud-native applications, Docker, Kubernetes, etc.
  • Experience writing clean, modular TypeScript code using external libraries or custom code.
  • Experience with CI/CD pipelines using Jenkins and GitHub.
  • Nice to have: experience with tools such as BlueTriangle, writing Splunk queries, and monitoring tools such as Dynatrace.
  • Excellent verbal and written communication skills.
  • Prior experience supporting web and mobile apps
  • Basic knowledge of CDNs (Akamai)
  • Exposure to monitoring tools (APM, synthetic and log monitoring, etc.)
  • Exposure to Azure (or any other cloud)
  • Unix and scripting for automation
  • eCommerce experience (supporting web applications, etc.)

Roles & Responsibilities
  • Reduce toil, implement identified improvement opportunities, and handle minor enhancements and non-ticketed activities.
  • Define and monitor service level metrics, including incident management KPIs such as MTTD, MTTR, MTBF, MTTF, unavailability rate, and incident count.
  • Create rules that optimize incident response using metrics, streamline alert flows, and improve collaboration and communication across squads.
  • Proactively identify issues that might disrupt the service in production.
  • Route incoming service requests to the appropriate support groups via Jira.
  • Create and maintain alerts.
  • Handle change validation and change planning requests.
  • Assist business stakeholders in determining SLOs or adjusting threshold limits.
  • Manage demand and capacity, and correct SLI/SLO threshold limits as needed.
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and uplifts
  • Balance feature development speed and reliability with well-defined service level objectives and indicators (SLOs, SLIs).
  • Debug production issues across services and levels of the stack.
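
As an illustration of the incident management KPIs named in the responsibilities above, the following is a minimal sketch in Python, not a description of any tooling used in this role. The `Incident` record and the `incident_kpis` function are hypothetical names. The sketch assumes MTTD is measured from fault start to detection, MTTR from detection to resolution, and MTBF as total uptime divided by the number of incidents; teams often define these terms differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def incident_kpis(incidents, window_start, window_end):
    """Compute basic incident KPIs over an observation window."""
    count = len(incidents)
    window = window_end - window_start
    # Sum the durations, starting from a zero timedelta.
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    detect = sum((i.detected - i.started for i in incidents), timedelta())
    repair = sum((i.resolved - i.detected for i in incidents), timedelta())
    uptime = window - downtime
    return {
        "incident_count": count,
        # Means are undefined when there were no incidents.
        "mttd": detect / count if count else None,
        "mttr": repair / count if count else None,
        "mtbf": uptime / count if count else None,
        # Fraction of the window during which the service was down.
        "unavailability_rate": downtime / window,
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    sample = [
        Incident(start + timedelta(hours=10),
                 start + timedelta(hours=10, minutes=10),
                 start + timedelta(hours=11)),
    ]
    print(incident_kpis(sample, start, start + timedelta(hours=100)))
```

Metrics like these are typically tracked per service and per reporting period, so that changes in alerting or release practices can be compared against a baseline.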

  • ID: #43687038
  • State: California (Pleasanton, 94566, USA)
  • City: Pleasanton
  • Salary: Depends on Experience
  • Job type: Contract
  • Posted: 2022-06-29
  • Deadline: 2022-08-27
  • Category: Et cetera