Role Description
As a Staff Site Reliability Engineer (Staff SRE) at SailPoint, you will be a key member of our Reliability Engineering team, driving the reliability practices that support the Identity Security Cloud platform. You are passionate about reliability practices and operational excellence.
- Make it easy for everyone to create, consume, manage, and scale reliable cloud production services to achieve more
- Keep up with industry trends to improve end-to-end reliability and maintainability for all services
- Coach engineering teams on observability best practices such as setting up well-defined Service Level Objectives (SLOs)
- Analyze performance of services and recommend infrastructure/code changes that will improve capacity and performance
- Enable our engineering teams to scale our enterprise operations by providing guidance, best practices, and support as part of an SRE Center of Excellence
- Manage cross-functional requirements working with Engineering, Product, Services, and other departments
- Mentor engineers on quality in design reviews, code, test cases, automation, observability, root cause analysis, and self-healing
- Influence architectural design, implementation, consolidation, and simplification for global scale
- Drive operational excellence to deliver frictionless operations, a healthy on-call experience, and an optimal customer experience
Qualifications
- 8+ years of experience in SRE or DevOps production operations supporting a highly available environment for a SaaS product or cloud service provider
- Strong proficiency with one or more programming languages (Java, Python, Go, etc.)
- Bachelor’s degree in Computer Science or another technical discipline, or equivalent experience, is preferred but not required
Requirements
- Due to FedRAMP requirements, US Citizenship is required to be considered for this role
- Experience with cloud infrastructure environments, preferably AWS, and infrastructure as code, preferably Terraform
- Strong proficiency with containerization technology and/or Kubernetes
- In-depth experience with metrics, tracing, and logging observability tools such as Prometheus, Grafana, Honeycomb, and Kibana
- Experience with incident management, including conducting incident reviews
- Strong understanding of Linux, software development, systems, networking, and cloud concepts
- A positive and collaborative demeanor, combined with the ability to coach, mentor, and delegate
- Excellent communication skills
- Lifelong learner – you stay up to date with technology trends, spend time learning new technologies, and share what you learn with your team
Benefits
- Health and wellness coverage: Medical, dental, and vision insurance
- Disability coverage: Short-term and long-term disability
- Life protection: Life insurance and Accidental Death & Dismemberment (AD&D)
- Flexible spending accounts for health care and dependent care; limited-purpose flexible spending account
- Financial security: 401(k) Savings and Investment Plan with company matching
- Time off benefits: Flexible vacation policy
- Holidays: 8 paid holidays annually
- Sick leave
- Parental support: Paid parental leave
- Employee Assistance Program (EAP) and Care Counselors
- Voluntary benefits: Legal Assistance, Critical Illness, Accident, Hospital Indemnity and Pet Insurance options
- Health Savings Account (HSA) with employer contribution