Site Reliability Engineer

04 Feb 2025


Hi,

Greetings from Apexon.

Title: SRE/Site Reliability Engineer
Location: Dallas, TX

Required Skills & Experience
  • Experience with monitoring tools (Datadog, Splunk, New Relic, Prometheus, Grafana, Nagios, etc.); any experience with or exposure to modern DevOps tooling (AWS, Kubernetes, Terraform) is a plus.
  • Utilize existing tools to create telemetry streams from each system that DevOps maintains.
  • Track trends of key metrics to build a repeatable snapshot of the current state of all systems within DevOps and predict failures.
  • Correlate data from disparate systems to determine the underlying causes of issues that may be occurring in seemingly unrelated parts of the enterprise.
  • Review existing logging and monitoring systems, cutting unnecessary logging and correcting improperly tuned monitoring probes.
  • Develop a suite of dashboards and tools that enable the SRE to track all incoming metrics and surface the most pressing issues.
  • Continually improve these dashboards to make their information more useful in real time as well as for after-the-fact analysis.
  • Generate "Postmortem" reports for unplanned outages or system failures.
  • Prepare "Scope of Impact" reports for upcoming planned outages or system changes.
  • Work with the other members of DevOps and the Infrastructure team to ensure that underlying resources are ready for failover and to help plan for future growth.
  • Maintain failover documentation and S.O.P.s.
  • Perform regularly scheduled failover testing in conjunction with the rest of the DevOps team, Infrastructure, and our business teams.
  • Continually seek to improve our failover procedures.

Desired Skills & Experience
  • Mastery of at least two programming languages (e.g., Python, Java, Go) with respect to designing, coding, testing, and software delivery.
  • At least two years of experience working with data systems.
  • The SRE is the "Control Tower" of DevOps. As such, they need to be familiar with how our data systems work and interact with one another.
  • The candidate should have a basic understanding of computer programming and data systems architecture.
  • Ability to interact with various groups within the business to inform them of the basic details of upcoming changes or to communicate the current state of system failures or outages.
  • Ability to interact with other developers and management to help define, implement, and enforce patterns for proper metric telemetry from systems, proper logging, and resilient failover patterns.
  • A drive to continually improve our system telemetry, uptime, and recoverability.

  • ID: #49015004
  • State: Texas, USA (ZIP 75201)
  • City: Dallas / Fort Worth
  • Salary: Depends on Experience
  • Job type: Permanent
  • Posted: 2023-02-04
  • Deadline: 2023-03-30
  • Category: Et cetera