Principal SRE / Vanilla Kubernetes / AWS / Monitoring

15 Jan 2022

We are looking for an experienced Site Reliability Engineer to join our Technical Operations team. Site Reliability Engineers are hybrid software/systems engineers whose overarching goal is to ensure that Production Services are "Always On." They strive to build the most reliable and performant systems on the planet. SREs work closely with cross-functional teams to ensure we have the right set of tools to generate, collect, analyze, visualize and alert on operational data, so that we know exactly what is happening across the ecosystem, can spot problems before they occur, and can address them as quickly as possible.

  • Supervise capacity & utilization and work closely with cross-functional teams to orchestrate scale-up/down of the services
  • Own & operate critical open-source services like Elasticsearch, Kafka, RabbitMQ, Redis
  • Build tools and design processes that help improve observability and system resiliency of the platform
  • Triage Site Availability Incidents and proactively work towards reducing MTTR for customer impacting incidents
  • Partner with Service owners to implement Service Level Metrics & Service Level Objectives that act as service level health indicators
  • Establish design patterns for monitoring, benchmarking and deploying new features for the backend services
  • Develop and maintain technical documentation, network diagrams, runbooks, and procedures
  • Drive initiatives to evolve our current platform to increase efficiency and keep it in line with current standards and best practices
  • Respond to production incidents and use your experience in software development, systems engineering, and networking to prevent recurring issues
  • Provide immediate relief and sustainable resolutions for issues within our infrastructure
  • Drive initiatives with partner teams to improve the reliability and performance of the infrastructure through improved system design

Skills and Qualifications
  • Systematic problem-solving approach, combined with a sense of ownership and drive
  • Full-stack debugging and performance optimization ability, including knowledge of Cloud systems (load balancing, caching, content distribution, etc.), continuous integration/build systems, Java, SQL and NoSQL databases
  • Track record of monitoring and analyzing system performance, and of isolating issues or bottlenecks that could impact reliability, performance and scalability
  • Strong experience with observability tools such as Grafana, Prometheus, Zabbix, etc.
  • Good experience with at least one scripting/programming language: Python, Go, etc.
  • Experience with one or more OSS technologies like Elasticsearch, Kafka and Redis
  • Familiarity with container technologies such as Docker, Kubernetes, Mesos, etc.
  • Understanding of and experience with SRE concepts and practices, including advocating for the elimination of toil and driving simple solutions
  • Good verbal and written communication skills, and the ability to work effectively with geographically remote teams

Good to have
  • Experience operating and maintaining big data components (Hadoop, YARN, HBase, Hive, Spark, etc.)
  • Solid understanding of Linux systems is a big plus