- Hands-on experience with infrastructure provisioning and configuration automation (Terraform, Ansible)
- Hands-on experience with Kubernetes administration (AKS preferred)
- Validate the current High Availability and Disaster Recovery designs
- Design and conduct failover tests
- Support legacy on-prem and cloud-native applications
- Map external system dependencies and develop fault-tolerant capabilities
- Review and sign off on new changes, environment/system upgrades, and major releases
- Take ownership of non-functional designs for performance, reliability, and scalability
- Design load testing and chaos testing
- Conduct capacity planning exercises
- Work with both IT and business teams to understand business demands and ensure platform readiness for peak demand
- Troubleshoot critical incidents and take part in post-incident reviews
- Publish and follow through on post-incident review action items
- Continuously monitor system performance and detect deviations from the baseline
- Improve the effectiveness and coverage of monitoring systems
- Keep track of known issues, risks, and KPIs for Service Level Objectives (SLO), Service Level Agreements (SLA), Recovery Time Objective (RTO), Recovery Point Objective (RPO), Mean Time to Repair (MTTR), and Mean Time Between Failures (MTBF)
- At least 10 years of experience in the IT industry with a track record of career progression
- Experience designing and supporting large-scale, mission-critical, customer-facing web and mobile applications
- Experience troubleshooting failures and performance bottlenecks within databases, applications, or infrastructure
- Experience with capacity planning and load testing
- Experience with public cloud, Azure preferred
- Experience architecting for performance, scalability, and reliability
- Understanding of observability concepts: metrics, logs, and distributed tracing
- Experience with application performance monitoring (APM) and observability tools (at least one of the following: AppDynamics, DataDog, Dynatrace, New Relic, Prometheus, Grafana, ELK, or Splunk)