Lead Site Reliability Engineer

10 May 2024

Job Description

The specific responsibilities of an SRE managing a large, distributed application built on microservices, Spring Boot, and Google Cloud may include:
  • Strong background in software development and systems administration, as well as excellent problem-solving and communication skills.
  • Run the production environment by monitoring availability and taking a holistic view of system health.
  • Developing, improving, and operating the deployment and orchestration of a complex distributed system
  • Improve reliability, quality, and time-to-market of our suite of software solutions
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
  • Provide primary operational and engineering support for multiple large, distributed software applications
  • Identify and reduce or eliminate toil via automation to maximize the time spent on engineering and innovation
  • Collaborating with development teams to design, build, and operate scalable and resilient software systems
  • Automating deployment, monitoring, and incident response processes
  • Performing root cause analysis of production incidents and implementing preventive measures
  • Conducting performance analysis and optimization of the system
  • Ensuring compliance with security and regulatory standards
  • Implementing and maintaining disaster recovery processes
  • Providing technical guidance and mentorship to other team members
  • Participating in an on-call rotation for incident response and support.

Qualifications
  • Four-year college degree in Computer Science or equivalent.
  • 7-9 years' experience developing multi-tier applications with Java, J2EE, NoSQL/SQL datastores, Spring Boot, Google Cloud Platform/AWS/Azure, and Docker/Kubernetes.
  • Programming skills (Perl, Python, Ruby, Java/Scala or C).
  • Experience with RESTful APIs and microservices platform is a must
  • Working knowledge of the TCP/IP stack, internet routing and load balancing
  • 4-5 years of experience with APM and other monitoring tools such as Dynatrace, New Relic, ELK, Splunk, Prometheus, Sensu, Nagios, Kafka, DataDog, PagerDuty.
  • Experience working with product & development teams to establish error budgets by identifying the right SLOs (service level objectives), SLIs (service level indicators), and KPIs (key performance indicators), and effectively driving the use of the budget to ensure maximum domain availability/uptime.
  • Regularly review key site technical metrics such as transaction errors, logging, response times, caching strategies, conversion/bounce rates, capacity & resource utilization.
  • Debug production issues across services and levels of the stack.
  • Proactively identify stability risks & work with engineering leadership to establish appropriate mitigation plans.
  • Recognize, validate & evangelize emerging technologies & architectures that align with business objectives
  • Solve complex architecture/design & business problems, work to simplify, optimize, remove bottlenecks, etc.
  • Collaborate closely with architects & other cross functional teams to create secure, reliable, and scalable software solutions.
  • Thorough understanding of software development cycle and agile programming environment.
  • Architect, design & develop automation to reduce toil, improve recoverability, availability, latency & scalability of supported applications.
  • Triage, analyze, and provide solutions to critical & high-priority technical issues occurring in the ecosystem; optimize incident management processes.
  • Respond, react & communicate as per the ITSM incident management process. This process involves detection of the incident, timely communication to leadership during the life of the incident, service restoration, followed by root cause analysis to prevent the incident from occurring in the future.
  • Drive blameless postmortem culture.
  • Practice destructive testing to discover vulnerabilities in environments powered by distributed software systems.
  • Implement an effective observability strategy to improve MTTD (Mean Time to Detection) & MTTR (Mean Time to Resolution).
  • Maintain a knowledge repository that includes standard operating procedures, release checklists, and runbooks for incident recovery.

What you'll receive in return: As part of the Ford family, you'll enjoy excellent compensation and a comprehensive benefits package that includes generous PTO, retirement, savings and stock investment plans, incentive compensation and much more. You'll also experience exciting opportunities for professional and personal growth and recognition.

Candidates for positions with Ford Motor Company must be legally authorized to work in the United States permanently. Verification of employment eligibility will be required at the time of hire. Visa sponsorship is available for this position.

We are an Equal Opportunity Employer committed to a culturally diverse workforce. All qualified applicants will receive consideration for employment without regard to race, religion, color, age, sex, national origin, sexual orientation, gender identity, disability status or protected veteran status.

For information on Ford's salary and benefits, please visit:

At Ford, the health and safety of our employees is our top priority. Vaccination has been proven to play a critical role in combating COVID-19. As a result, Ford has made the decision to require U.S. salaried employees to be fully vaccinated against COVID-19, unless employees require an accommodation for religious or medical reasons. Being fully vaccinated means that an individual is at least two weeks past their final dose of an authorized COVID-19 vaccine regimen. As a condition of employment, newly hired employees will be required to provide proof of their COVID-19 vaccination or an approved medical or religious exemption.

  • ID: #49902500
  • Location: Dearborn, Michigan 48126, USA
  • City: Dearborn
  • Salary: TBD (USD)
  • Job type: Permanent
  • Showed: 2023-05-10
  • Deadline: 2023-07-09
  • Category: Et cetera