- Country
- USA
- Salary
- $145,000 – $180,000

Staff Site Reliability Engineer
The high rating reflects the transparent salary range, the remote work format, and a socially significant mission in healthcare. The role offers substantial influence over the company's architecture and culture.
Job difficulty
The role carries exceptional responsibility for 99.99% availability and requires deep knowledge of Kubernetes and GCP architecture. The candidate must have experience designing disaster recovery (DR) systems and the skills to mentor Senior-level engineers.
Salary analysis
The offered salary ($145k - $180k) is in line with market standards for a Staff SRE position in the US, although the upper bound may be higher in top tech hubs. Stock options and flexible vacation increase the overall value of the compensation package.
Cover letter
I am writing to express my strong interest in the Staff Site Reliability Engineer position at Synthesis Health. With over 8 years of experience in SRE and DevOps, specifically focusing on high-scale Kubernetes environments and multi-region disaster recovery, I am confident in my ability to help your team achieve and maintain the 99.99% availability SLA for your critical healthcare services.
In my previous roles, I have successfully architected active-active multi-region failover strategies and implemented sophisticated auto-scaling policies on GKE that significantly reduced operational toil. I am particularly drawn to Synthesis Health’s mission of revolutionizing healthcare and your commitment to a blameless post-mortem culture. My expertise in Terraform, Go, and observability stacks like Prometheus and Datadog aligns perfectly with your technical requirements.
I am excited about the opportunity to serve as a technical leader and mentor within your organization, embedding SRE principles into the development lifecycle. Thank you for considering my application; I look forward to the possibility of discussing how my background in building resilient, HIPAA-compliant infrastructures can contribute to your mission-driven team.

Job description
Synthesis Health
Who We Are
We’re a mission- and values-driven company with tremendous dedication to our customers. Our 100% remote team is dedicated to a common goal – to revolutionize healthcare through innovation, collaboration, and commitment to our core values and behaviors.
About the Opportunity
We are looking for a Staff Site Reliability Engineer (SRE) to serve as the guardian of our platform’s availability and the architect of our operational maturity.
In this high-impact role, you will own the strategy and execution required to achieve and maintain a 99.99% availability SLA for our critical healthcare services. You will not just respond to incidents; you will build the automated systems that prevent them. You will design the auto-scaling architectures and disaster recovery protocols that allow us to handle bursty medical imaging traffic and catastrophic failures without flinching.
This is a hands-on leadership role. You will define the standards for reliability engineering across the organization, mentor Senior (L4) engineers, and embed SRE principles into our development culture. You will serve as the technical face of reliability to our enterprise customers, providing the architectural assurances they need to trust us with their most critical workflows.
If you are obsessed with automation, intolerant of manual toil, and ready to lead the reliability strategy for a life-critical platform, we want to hear from you.
Key Responsibilities
Uptime & Reliability Strategy
- Own the 99.99% Target: You will define the Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for our critical user journeys. You will be accountable for tracking our Error Budgets and governing the release velocity based on platform stability.
- Incident Management & Forensics: You will own the incident response process, serving as the ultimate escalation point for complex production outages. You will lead blameless post-mortems (RCAs) to identify root causes and ensure systemic fixes are implemented to prevent recurrence.
- Eliminate Toil: You will ruthlessly identify and automate manual operational tasks. Your goal is to engineer yourself out of operations work so you can focus on high-value reliability architecture.
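To give a sense of scale for the 99.99% target and the error budget mentioned above, the arithmetic can be sketched as follows. This is an illustrative sketch only, not Synthesis Health's tooling: the function names and the 30-day window are assumptions, and availability is treated simply as uptime over wall-clock time.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction (0.9999 for 99.99%). The 30-day default window is
    an assumption made for illustration.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # 99.99% over 30 days leaves roughly 4.32 minutes of downtime.
    print(round(error_budget_minutes(0.9999, 30), 2))
    # After a hypothetical 1.08-minute outage, about 75% of the budget remains.
    print(round(budget_remaining(0.9999, 30, 1.08), 2))
```

A budget this small is one reason the role pairs error-budget tracking with control over release velocity: a single short incident can consume most of the monthly allowance.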
Business Continuity & Disaster Recovery (BC/DR)
- Architect for Catastrophe: You will design and implement our Business Continuity and Disaster Recovery strategy. You will orchestrate our regional failover capabilities, ensuring we meet aggressive Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO).
- Enterprise-Grade Resilience: You will build the technical credibility required to win grueling enterprise audits. You will demonstrate that our platform is robust, stable, and resistant to unexpected failures through rigorous documentation and proof-of-concept demonstrations.
- "Game Day" Simulations: You will lead regular disaster recovery drills and chaos engineering experiments to validate our failover mechanisms, ensuring our team is practically prepared for real-world scenarios.
Scalability & Performance
- Intelligent Auto-Scaling: You will design and implement sophisticated auto-scaling strategies (HPA/VPA/Cluster Autoscaler) on Kubernetes (GKE) to handle unpredictable spikes in medical data ingestion.
- Capacity Planning: You will lead capacity planning and cost optimization initiatives, ensuring our infrastructure scales efficiently with our business growth.
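As background for the auto-scaling work described above, the core scaling rule documented for the Kubernetes Horizontal Pod Autoscaler can be sketched in a few lines. This is a simplified sketch, not the controller's implementation: the function name and default bounds are assumptions, and it leaves out pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the documented HPA rule:

        desired = ceil(current_replicas * current_metric / target_metric)

    Scaling is skipped when the ratio is within `tolerance` of 1.0
    (0.1 is the documented default), and the result is clamped to the
    configured replica bounds.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods at 90% average CPU against a 60% target scale up to 6.
    print(desired_replicas(4, 90, 60, max_replicas=20))
    # A 5x burst is capped by max_replicas, which is where node-level
    # scaling (Cluster Autoscaler) and capacity planning come in.
    print(desired_replicas(4, 300, 60, max_replicas=10))
```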
Architectural Leadership
- Resilience Patterns: You will work with the Architecture Review Board (ARB) to enforce resilience patterns (circuit breakers, retries, fallbacks, bulkheads) in our application code and service mesh.
- Mentorship & Culture: You will advocate for SRE culture across the engineering organization, mentoring feature teams on how to build operable, observable, and reliable software.
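Of the resilience patterns listed above, the circuit breaker is perhaps the easiest to illustrate. The following is a minimal single-threaded sketch under stated assumptions: the class name, thresholds, and injectable clock are hypothetical, and a production service would more likely rely on a service mesh or an established library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures,
            # (re)opens the circuit and restarts the timeout.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The point of the pattern is that callers of a failing dependency get an immediate error instead of piling up on timeouts, which keeps one degraded service from dragging down its neighbors.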
What We’re Looking For
- Deep SRE Experience: 8+ years of engineering experience, with a significant focus on Site Reliability Engineering or DevOps in a high-scale, 24/7 production environment.
- BC/DR Orchestration: Proven experience designing active-passive or active-active multi-region architectures. You have successfully executed regional failovers and managed the complexities of data replication and consistency during outages.
- Kubernetes Mastery: Deep, hands-on expertise with Kubernetes (GKE preferred). You understand the internals of scheduling, networking (CNI), and storage (CSI).
- Infrastructure as Code: You treat infrastructure as software. You have expert-level proficiency with Terraform or similar IaC tools.
- Observability Expert: You have deep experience implementing and tuning observability stacks (Prometheus, Grafana, Datadog, or similar). You know how to extract meaningful signals from noise.
- Coding Proficiency: You are a capable coder in Go, Python, or TypeScript. You can dive into application code to debug production issues or build complex automation tooling.
- Cloud Native: Deep experience with public cloud providers (GCP preferred) and their managed services.
Preferred Qualifications
- Healthcare Experience: Experience supporting HIPAA-compliant environments or handling PHI (Protected Health Information).
- Global Traffic Management: Experience with multi-region architectures, global load balancing, and CDN tuning.
- Chaos Engineering: Experience designing and running chaos experiments to validate system resilience.
Why You Should Join Us
- Solve Our Toughest Puzzles: This is a high-leverage role. You will be working on the most impactful technical challenges that are critical to the company's success.
- Define the Architecture: You won't just be maintaining a system; you will be a primary author of its future state, with the autonomy to make it happen.
- Lead from the Front: This is a chance to establish yourself as a key technical voice in a rapidly growing company.
- Competitive Compensation & Benefits: We offer a strong salary, a 100% remote culture, and significant opportunities for growth.
We are a values-driven company. Our values:
- Clinical service first.
- Collaborate with our customers.
- Listen, respect, learn.
- Innovate to excel.
The behaviors we look for:
- Be nice.
- Be creative.
- Be honest.
- Be helpful.
Compensation and Benefits
The typical salary range for this position is $145,000 - $180,000. However, Synthesis participates in location-based hiring, and salary ranges can be adjusted based on the candidate's residence.
Other benefits include, but are not limited to: Medical, Dental, Vision, “Use as needed” vacation policy, and participation in our employee option program.
Synthesis Health is an Equal Employment/Affirmative Action employer. We do not discriminate in hiring on the basis of sex, gender identity, sexual orientation, race, color, religious creed, national origin, physical or mental disability, protected veteran status, or any other characteristic protected by federal, state, or local law.

Skills
- TypeScript
- Python
- Terraform
- GCP
- Kubernetes
- Prometheus
- Grafana
- Infrastructure as Code
- Chaos Engineering
- Go
- Service Mesh
- Datadog
- GKE
Possible interview questions
Tests experience designing fault-tolerant systems, which is a key requirement of the role.
Describe your experience designing and implementing a disaster recovery (DR) strategy for a multi-region architecture. What data replication challenges did you encounter?
The role requires a deep understanding of K8s to handle sharp spikes in medical data traffic.
How would you configure autoscaling strategies (HPA/VPA) in GKE to handle unpredictable traffic spikes while keeping the system stable?
The role involves ownership of Error Budgets and SLOs.
How do you approach error budget management, and in which cases would you recommend halting new feature releases in favor of stabilizing the platform?
One of the responsibilities is eliminating toil.
Give an example of a complex process you successfully automated in order to "engineer yourself out" of operations work. What was the measurable result?
A Staff-level position implies leadership and fostering an SRE culture.
How do you introduce SRE principles to product teams that have historically focused only on development speed rather than reliability?