At Weights & Biases, our mission is to build the best tools for AI developers. We founded our company on the insight that while there were excellent tools for developers to build better code, there were no similarly great tools to help ML practitioners build better models. Starting with our first experiment tracking product, we have since expanded our solution into a comprehensive AI developer platform for organizations focused on building their own deep learning models and generative AI applications.
Weights & Biases is a Series C company with $250M in funding and over 200 employees. We proudly serve over 1,000 customers and more than 30 foundation model builders, including OpenAI, NVIDIA, Microsoft, and Toyota.
We are seeking a highly skilled and experienced SA/SRE (Solutions Architect / Site Reliability Engineer) Manager to lead our efforts in supporting customer-managed deployments, improving on-premise deployments, streamlining upgrades, and building scalable systems and processes through automation. This role requires a strong technical background, leadership capabilities, and a deep understanding of deployment automation and reliability engineering.
Responsibilities:
Lead and manage a team of SA engineers focused on supporting and scaling customer-managed and on-premise deployments.
Design, implement, and enhance deployment architectures to improve reliability, scalability, and security.
Develop and optimize upgrade processes to minimize downtime and operational risk.
Build and maintain automation frameworks to streamline deployment, monitoring, and incident management.
Collaborate closely with product and engineering teams to enhance software deliverability and maintainability for on-premise environments.
Establish and enforce best practices for configuration management, infrastructure as code (IaC), and CI/CD pipelines.
Lead incident response and root cause analysis for critical production issues, ensuring continuous improvement and proactive problem prevention.
Drive a culture of operational excellence, automation, and continuous improvement across the organization.
Demonstrate customer empathy and communicate with customer stakeholders in a timely manner.
Requirements:
7+ years of experience in SRE, DevOps, or Solutions Architecture roles, with at least 2 years in a managerial or leadership capacity.
Strong background in managing on-premise and customer-managed deployments at scale.
Proficiency in infrastructure as code (Terraform, Ansible, or similar tools) and CI/CD automation.
Experience with Kubernetes, Docker, and cloud/on-prem hybrid architectures.
Expertise in monitoring, logging, and alerting tools (Prometheus, Grafana, ELK, etc.).
Strong scripting and programming skills (Python, Go, Bash, etc.).
Experience with security and compliance considerations in enterprise software deployments.
Excellent communication and stakeholder management skills, with the ability to influence technical and business decisions.
Experience working in SaaS and enterprise environments is a plus.
Why Join Us?
Opportunity to drive large-scale transformation in enterprise software deployment and automation.
Work with cutting-edge technology and a team of talented engineers.
Competitive salary, benefits, and career growth opportunities.
Our Benefits:
🏝️ Flexible time off
🩺 Medical, dental, and vision coverage for employees and their families
🏠 Remote-first culture with in-office flexibility in San Francisco
💵 Home office budget with a new high-powered laptop
🥇 Truly competitive salary and equity
🚼 12 weeks of Parental leave (U.S. specific)
📈 401(k) (U.S. specific)
Supplemental benefits may be available depending on your location
We encourage you to apply even if your experience doesn’t perfectly align with the job description as we seek out diverse and creative perspectives. Team members who love to learn and collaborate in an inclusive environment will flourish with us. We are an equal opportunity employer and do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. If you need additional accommodations to feel comfortable during your interview process, reach out at
[email protected].