Hi,
Greetings from Apexon.
Title: SRE/Site Reliability Engineer
Location: Dallas, TX
Required Skills & Experience
- Experience with monitoring tools (Datadog, Splunk, New Relic, Prometheus, Grafana, Nagios, etc.). Any experience with or exposure to modern DevOps tooling (AWS, Kubernetes, Terraform) is a plus.
- Utilize existing tools to create telemetry streams from each system that DevOps maintains.
- Track trends of key metrics to build a repeatable snapshot of the current state of all systems within DevOps and predict failures.
- Correlate data from disparate systems to determine the underlying causes of issues that may be occurring in seemingly unrelated parts of the enterprise.
- Monitor existing logging and monitoring systems, reduce unnecessary logging, and fix improperly tuned monitoring probes.
- Develop a suite of dashboards and tools that enable the SRE to track all incoming metrics and surface the most pressing issues.
- Continually improve these dashboards to make their information more useful in real time as well as for after-the-fact analysis.
- Generate "Postmortem" reports for unplanned outages or system failures.
- Prepare "Scope of Impact" reports for upcoming planned outages or system changes.
- Work with the other members of DevOps and the Infrastructure team to ensure that underlying resources are ready for failover and to help plan for future growth.
- Maintain failover documentation and S.O.P.s.
- Perform regularly scheduled failover testing in conjunction with the rest of the DevOps team, Infrastructure, and our business teams.
- Continually seek to improve our failover procedures.
- Mastery of at least two programming languages (e.g., Python, Java, Go) with respect to design, coding, testing, and software delivery.
- At least two years of experience working with data systems.
- The SRE is the "Control Tower" of DevOps. As such, they need to be familiar with how our data systems work and interact with one another.
- The candidate should have a basic understanding of computer programming and data systems architecture.
- Ability to interact with various groups within the business to inform them of the basic details of upcoming changes or to communicate the current state of system failures or outages.
- Ability to interact with other developers and management to help define, implement, and enforce patterns for proper metric telemetry from systems, proper logging, and resilient failover patterns.
- Continually seek to improve our system telemetry, uptime, and recoverability.
- ID: #49015004
- State: Texas (Dallas / Fort Worth, 75201, USA)
- City: Dallas / Fort Worth
- Salary: Depends on Experience
- Job type: Permanent
- Posted: 2023-02-04
- Deadline: 2023-03-30
- Category: Et cetera