Vacancy expired!
HTC Global Services wants you. Come build new things with us and advance your career. At HTC Global you'll collaborate with experts. You'll join successful teams contributing to our clients' success. You'll work side by side with our clients and have long-term opportunities to advance your career with the latest emerging technologies.
At HTC Global Services our consultants have access to a comprehensive benefits package. Benefits can include Paid-Time-Off, Paid Holidays, 401K matching, Life and Accidental Death Insurance, Short & Long Term Disability Insurance, and a variety of other perks.
Position Description:
This role is for a Software Reliability Engineer (SRE). The Command Center mission is to help maintain a stable production environment through effective change, incident, and problem management. We do this by quickly identifying, communicating, and facilitating containment of unplanned application and infrastructure outages. We are looking for a strong communicator and problem solver to join our team and help us transform through our SRE journey so we can improve the reliability of our software and serve our customers more effectively. If you are a team player, have a passion for problem solving, and want to learn new skills and tools, then this may be the role for you.
Responsibilities include:
- Lead critical situation bridges to facilitate the containment of outages that impact our operations and facilitate blameless post-mortems for major incidents.
- Serve as a liaison between Dev and Ops teams to ensure reliability is built into our software platforms.
- Assist in design of SRE standards for new application onboarding and monitoring of existing applications and infrastructure.
- Work with the Change Enablement team to ensure only quality changes are released into production.
- Partner with corporate and business liaisons to improve our change enablement processes.
- Perform follow-up of incidents to ensure resolution and gather all required metric information.
- Help identify and eliminate toil by process redesign and automation.
- Ensure the right tools are in place to assess availability, latency, performance, efficiency, monitoring capabilities, emergency response actions, and capacity planning.
- Utilize effective problem management to ensure permanent corrective action is implemented and repeat incidents are avoided.
- Proactively monitor application health using Dynatrace and Splunk.
- Knowledge of RDBMS, cloud technologies (preferably Google Cloud Platform), and automation tools, plus programming experience.
- Experience with monitoring tools such as Dynatrace and Splunk.
- Understanding of various operating systems (including Unix/Linux), network protocols, and databases.
- Experience with cloud technologies and various reporting / analytic tools.
- Motivated self-starter, able to work independently and in a fast-paced environment.
- Proven ability to develop strong working relationships.
- Capable of influencing and motivating people.
- Strong analytical skills with a logical mindset and problem-solving approach.
- Excellent ability to manage multiple high-priority efforts and competing priorities, with the flexibility to adjust to changing requirements and schedules.
- Minimum 5 years’ experience with application monitoring, advanced telemetry, and relational database management systems.
- Experience in Java and other development technologies.
- Expertise in designing, analyzing, and troubleshooting distributed systems.
- Ability to debug, optimize code, and automate routine tasks.
- Familiarity with cybersecurity tools, processes, and controls.
- Prior Rally/PDO experience and familiarity with ITIL ITSM processes.
- Bachelor's degree in Computer Science, a related technical field involving programming, or equivalent practical experience.