Observability Monitoring Engineer

17 Oct 2024

(CANNOT WORK C2C) Must work W2. Candidates must be willing to work onsite 3 days a week. No exceptions.

Job Description:

We are seeking a highly skilled Observability Monitoring Engineer with expert knowledge in Prometheus, Grafana, or Git. This role involves developing and managing telemetry for large-scale datasets and implementing strategies to enhance AI system reliability and performance, as well as assisting in capacity management.

Key Responsibilities:

Develop and manage telemetry systems for large-scale datasets.

Implement monitoring and alerting solutions to ensure system reliability.

Collect and analyze data to improve AI system performance.

Automate processes to enhance efficiency and reduce manual intervention.

Manage and maintain Kubernetes clusters and Docker containers.

Utilize Prometheus and Grafana for monitoring and visualization.

Work with DCGM/DCGM Exporter (Nvidia Stack) for telemetry.

Collaborate with data scientists to support AI/ML platforms.

Troubleshoot and resolve issues related to telemetry systems.

Primary Skills:

Telemetry/Observability, Monitoring and Alerting, Data Collection and Analysis, Automation

Prometheus and Grafana

JSON/YAML

Kubernetes and Docker/Container Technologies

DCGM/DCGM Exporter (Nvidia Stack)

Solid understanding of telemetry concepts, metrics, logs, and tracing

Benefits:

Eligibility requirements apply to some benefits and may depend on your job classification and length of employment.

Benefits are subject to change and may be subject to specific elections, plan, or program terms.

If eligible, the benefits available for this temporary role may include the following:

Medical, dental & vision

Critical Illness, Accident, and Hospital

401(k) Retirement Plan – Pre-tax and Roth post-tax contributions available

Life Insurance (Voluntary Life & AD&D for the employee and dependents)

Short and long-term disability

Health Spending Account (HSA)

Transportation benefits

Employee Assistance Program

Time Off/Leave (PTO, Vacation or Sick Leave)

About TEKsystems:

We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company. The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.

Full-time
  • ID: #52722092
  • State: New Jersey Hopewell 08525 Hopewell USA
  • City: Hopewell
  • Salary: TBD (USD)
  • Showed: 2024-10-17
  • Deadline: 2024-12-16
  • Category: Et cetera