Site Reliability Engineer III

This description is a summary of our understanding of the job description. Click the ‘Apply’ button to find out more.

Role Description

As a Site Reliability Engineer at Guidewire, you’ll join a passionate team dedicated to automating every process to ensure our systems run efficiently. Our Platform team is fully committed to developing and managing software that enhances the reliability of production systems—systems that serve hundreds of customers and support millions of transactions every day.

Drive Reliability & Automation: Take a dedicated SRE approach to managing shared multi-tenant infrastructure for resilient SaaS microservice-based systems and customer-centric applications.
Oversee and continuously enhance our team’s presence in AWS by automating deployment and operational tasks.
Innovate and Improve Core Systems: Contribute to the development of our core infrastructure systems—adding features, fixing bugs, and implementing reliability enhancements.
Engineer and maintain a complex single sign-on (SSO) authentication platform based on SAML/OAuth to ensure secure, seamless access for our users.
Enhance Observability & Incident Management: Build and maintain comprehensive observability tooling, metrics, and dashboards to support our global platform infrastructure.
Improve our incident management lifecycle by identifying, mitigating, and learning from reliability risks, while helping to create a self-healing environment.
Empower the Team: Develop system documentation and training materials to educate and empower your teammates.
Collaborate with various engineering teams, providing valuable feedback and contributing code when needed to enhance our products.

Qualifications

Bachelor’s Degree in Computer Science or a related field.
Proven software engineering and automation skills using Bash, Python, and/or Go.
Well-versed in agile development methodologies (Scrum, Kanban, etc.), with a deep background in Linux systems.

Requirements

Significant experience in automating and managing systems on Amazon Web Services (AWS) and supporting live production environments (Java/Apache/Tomcat).
Proficient with Infrastructure as Code (IaC) tools such as Terraform, Terragrunt, or Terraspace.
Experience with DevOps/GitOps tools (Git, Bitbucket, Flux CD, TeamCity) for smooth code promotions.
Hands-on experience in containerization (Docker, Helm, Kubernetes/EKS, CNI, and Ingress networking).
Strong understanding of single sign-on (SSO), SAML, and OAuth (bonus if you’ve worked with Okta).
Experienced with observability tools (Datadog, CloudWatch, PagerDuty) and familiar with event store/stream-processing technologies like Kafka or AWS SQS.
Experience working with relational databases such as Aurora Postgres or Oracle RDS.
Advanced exposure to application development, web UI design, JSON, and overall application architecture.
Exposure to Open Application Model systems like KubeVela or Crossplane is a plus.

Benefits

Opportunity to make a direct impact by ensuring our cloud platform is both robust and customer-focused.
Exciting challenges of solving problems at scale with technologies like AWS, Kubernetes, and Aurora.
