IT Site Reliability Engineer (SRE)

30 Nov 2024

Vacancy expired!

#6593

Seeking an IT Site Reliability Engineer (SRE) in the Costa Mesa, CA area for a full-time position.

Critical features of this job are described under the headings below. These features may be subject to change at any time, due to reasonable accommodation or for other reasons. Nothing in this job description shall restrict management's right to assign or reassign job duties and responsibilities at any time.

The primary objective of the IT Site Reliability Engineer (SRE) is to provide day-to-day multi-cloud operations support for business-critical, large-scale enterprise systems. The role covers strategic provisioning, governance, security, and availability, in coordination with Network and DevOps Engineers, to create, maintain, and support the public and private cloud server and network infrastructure that meets the technical demands of the company.

Reporting Relationships: This job reports to: Manager, Cloud Systems. Reporting directly (or indirectly) to this position are the following job titles:

Duties & Responsibilities:

  • Develops an enterprise-wide instrumentation strategy to support real-time observability, health checks, and remediation.
  • Continuously improves observability tools by actively collaborating with Software Engineering to identify and resolve visibility gaps.
  • Collaborates with Engineering teams to ensure applications emit the right metrics, debugs issues to learn how to improve and automate tool visibility and usability, and enables teams to set up instrumentation quickly.
  • Conducts presentations and trains engineers on observability tool usage.
  • Serves as subject matter expert on Observability.
  • Owns monitoring, logging, and alerting.
  • Provides advanced troubleshooting skills to resolve technical problems.
  • Provides day-to-day support of production-critical applications and ensures the highest availability of critical systems, 24 hours a day, 7 days a week.
  • Proactively collaborates with the Company to expand the monitoring environment and develop a specific early-warning monitoring capability.
  • Develops operational dashboards and reviews reports tracking key performance indicators (KPIs) and trends to present to team and management.
  • Provides analysis and consultation to appropriate teams.
  • Promotes automation to replace manual processes and implements changes to improve processes and workflows.
  • Performs preventative maintenance tasks and creates, maintains, and updates runbooks.
  • Participates in ongoing development, patching, and updating of released products using an Agile process.
  • Builds and designs web services in the cloud and implements geographically redundant services.
  • Identifies and implements system improvements by evaluating system performance and by upgrading, installing, tuning, and configuring the system.
  • Uses knowledge of APIs to design RESTful services and integrate them with existing data providers, using JSON or XML as needed.
  • Researches and recommends new solutions to business and management problems.
  • Works with vendors and business units on department and company projects to accomplish goals.
  • Manages business continuity in cloud-based environments, including server and file backup and recovery; prepares and tests disaster recovery procedures.
  • Assists in maintaining company compliance with internal policies and regulatory standards.
  • Stays current with industry trends and makes recommendations as needed to help the company excel.
  • Performs other duties or special projects as assigned.

Minimum Qualifications:

  • Three to five years of experience in a Site Reliability Engineer (SRE) role or related position.
  • Customer-service oriented.
  • Knowledge of regulatory frameworks and their impact on design considerations (HIPAA, PCI, ITAR, etc.).
  • Experience with load balancing, autoscaling, multi-zone network operations, and software applications.
  • Experience with common AWS services (EC2, RDS, S3, VPC, CloudFormation, etc.).
  • Experience with monitoring systems and APMs including but not limited to SolarWinds, Datadog, AppDynamics, CloudWatch, Dynatrace, and PagerDuty.
  • Knowledge of networking and internet protocols, including TCP/IP, DNS, SMTP, and HTTP, and of distributed networks.
  • Good communication skills, both verbal and written; experience documenting processes.
  • Excellent problem-solving ability and technical and analytical skills.
  • Ability to work independently and with moderate supervision.

Preferred Qualifications:

  • Experience deploying Infrastructure as Code using tools such as Terraform and CloudFormation.
  • Experience with Kubernetes and enterprise Kubernetes management tools such as Rancher.
  • Experience working with OpenStack, Linux, Rackspace, Docker and Microsoft Azure.
  • AWS Cloud Security Certification and/or OpenStack Administrator Certification is a plus.
  • Experience with performance management of database engines (DynamoDB, MongoDB, Elasticsearch, Kafka) is a plus.
  • Experience with mobile operating system platforms.
  • Certifications across multiple cloud platforms (AWS, Microsoft) are a definite plus.

Education: Bachelor's degree preferred, preferably in Computer Science or Engineering; or a minimum of five (5) years of related work experience; or any equivalent education and/or experience from which comparable knowledge, skills, and abilities have been demonstrated or achieved.

Physical Requirements/Work Environment:

This position primarily works in an office environment. It requires frequent sitting, standing and walking. Daily use of a computer and other computing and digital devices is required. May stand for extended periods when facilitating meetings or walking in the facilities. Some local travel is necessary, so the ability to operate a motor vehicle and maintain a valid Driver's license is required.

Ability to bend, squat, crawl, or climb, and to lift up to 70-100 pounds.

The physical demands of the position described herein are essential functions of the job and employees must be able to successfully perform these tasks for extended periods. Reasonable accommodations may be made for those individuals with real or perceived disabilities to perform the essential functions of the job described.

No Corp to Corp. No sponsorship. No third-party candidates will be considered for this position. Local candidates are encouraged to apply. COVID-19 vaccination is required for employment.

If qualified and interested in this opportunity, please reply to JO#6593 along with a copy of your updated resume.

#MantekPriority