Country: Netherlands

Remote, full-time

Technical Product Manager - Mission Control

Nebius is one of the most promising players in the AI cloud market, with its own unique infrastructure. The role offers work with advanced technologies (NVIDIA GPUs, InfiniBand) and a high degree of influence over the product at a fast-growing company.

Listing from Quick Offer Global, a list of international companies

Role difficulty

The role requires deep technical knowledge of HPC, GPU platforms, and distributed systems. The candidate must be able to work at the intersection of hardware infrastructure and software ML solutions, which significantly raises the barrier to entry.

Salary analysis

Median: €110,000

Market range: €90,000 – €140,000

The role offered at Nebius corresponds to a Senior/Lead Technical PM level in the European technology sector. Expected salaries in Amsterdam for this niche are usually above the market average because of a shortage of specialists at the intersection of ML and infrastructure.

I am writing to express my strong interest in the Technical Product Manager position for Mission Control at Nebius. With a solid background in distributed systems and cloud infrastructure, I am excited by the opportunity to lead the reliability and performance initiatives for a next-generation AI compute platform. My experience in managing complex technical projects and my deep understanding of ML orchestration environments like Kubernetes and Slurm align perfectly with the requirements of this role.

Throughout my career, I have focused on bridging the gap between advanced engineering research and scalable product capabilities. I am particularly impressed by Nebius's commitment to building its own full-stack infrastructure, from hardware to UI, and I am eager to apply my analytical skills to optimize cluster experience metrics such as Goodput and MFU. I am confident that my technical foundation and product mindset will allow me to drive cross-functional execution and deliver high-impact results for your global AI economy customers.

Join Nebius to shape the future of AI infrastructure and manage the reliability of the largest GPU clusters in Europe!

Job description

Why work at Nebius

Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside some of the most experienced and innovative leaders and engineers in the field.

Where we work

Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team of over 1400 employees includes more than 400 highly skilled engineers with deep expertise across hardware and software engineering, as well as an in-house AI R&D team.

The role

At Nebius, we’re building a next-generation AI compute platform for large-scale ML training and inference — from a few nodes to thousands of GPUs.

We’re looking for a Technical Product Manager to lead Mission Control — the product area responsible for reliability and performance across the full infrastructure stack.

As PM for Mission Control, you will own foundational capabilities that determine how well AI infrastructure performs in real-world training and inference workloads — from bare metal and networking to scheduler/runtime behavior and user-facing outcomes. This is a deeply technical PM role.

A prior PM title is not mandatory: strong candidates from HPC, ML infrastructure, distributed systems, SRE, cloud engineering, or ML solution architecture who want to grow into product are welcome.

Your responsibilities will include:

• Own reliability and performance opportunities across the Nebius stack: from bare metal to applications.

• Define product direction end-to-end: problem discovery → design → delivery → adoption.

• Drive cross-functional execution across compute, networking, storage, observability, platform, and hardware teams.

• Lead deep problem research using customer interviews, analytics, workload studies, and log investigations.

• Identify and prioritize bottlenecks affecting large-scale training/inference performance and stability.

• Translate advanced ML/infrastructure research into practical, scalable product capabilities.

• Define and operationalize product metrics for cluster experience (e.g. reliability, efficiency, latency-to-start, utilization, throughput).

We expect you to have:

• 3–5+ years of experience in one or more of: product management, HPC, ML infrastructure/MLOps, distributed systems, SRE, cloud architecture, or GPU platforms.

• Strong technical foundation in distributed systems, cloud infrastructure, or ML platforms.

• Hands-on familiarity with ML orchestration environments (e.g. Slurm, Kubernetes, Ray, or similar).

• Experience delivering technically complex initiatives with multiple engineering teams.

• Strong communication skills and ability to influence engineering, research, and customer stakeholders.

• Experience using analytics and data to prioritize roadmap decisions.

• High ownership, learning speed, and comfort in fast-evolving AI infrastructure environments.

It will be an added bonus if you have:

• Experience with GPU platforms and HPC technologies (InfiniBand/RDMA, topology-aware systems).

• Familiarity with modern ML training stacks (PyTorch, DeepSpeed, FSDP/ZeRO, NCCL).

• Understanding of training efficiency metrics and operational signals (Goodput, MFU, scheduling quality, health checks).

• Exposure to large-scale LLM training or inference systems.

• Background in observability, performance tuning, or reliability engineering.

• Customer-facing technical experience supporting ML or infrastructure workloads.

About Nebius

Nebius AI is an AI cloud platform with one of the largest GPU capacities in Europe. Launched in November 2023, the Nebius AI platform provides high-end, training-optimized infrastructure for AI practitioners. As an NVIDIA preferred cloud service provider, Nebius AI offers a variety of NVIDIA GPUs for training and inference, as well as a set of tools for efficient multi-node training.

Nebius AI owns a data center in Finland, built from the ground up by the company’s R&D team and showcasing our commitment to sustainability. The data center is home to ISEG, the most powerful commercially available supercomputer in Europe and the 16th most powerful globally (Top 500 list, November 2023).

Nebius’s headquarters are in Amsterdam, Netherlands, with teams working out of R&D hubs across Europe and the Middle East.

Nebius AI is built with the talent of more than 500 highly skilled engineers with a proven track record in developing sophisticated cloud and ML solutions and designing cutting-edge hardware. This allows all the layers of the Nebius AI cloud – from hardware to UI – to be built in-house, distinctly differentiating Nebius AI from the majority of specialized clouds: Nebius customers get a true hyperscaler-cloud experience tailored for AI practitioners. We’re growing and expanding our products every day.

What we offer

• Competitive salary and comprehensive benefits package.

• Opportunities for professional growth within Nebius.

• Flexible working arrangements.

• A dynamic and collaborative work environment that values initiative and innovation.

If you’re up to the challenge and are as excited about AI and ML as we are, join us!

Skills

Product Management
HPC
Machine Learning Infrastructure
MLOps
Distributed Systems
SRE
Cloud Infrastructure
GPU
Slurm
Kubernetes
Ray
InfiniBand
RDMA
PyTorch
DeepSpeed
Observability

Possible interview questions

Tests understanding of the specifics of working with GPUs and network latency in the context of training large models.

How would you approach prioritizing network throughput optimization (InfiniBand/RDMA) against the reliability of individual nodes in a cluster used for LLM training?

Assesses the candidate's ability to define product success through concrete metrics.

Which key metrics (KPIs) would you introduce to evaluate the effectiveness of Mission Control, and how would you explain their importance to stakeholders?

Tests experience with orchestration in high-load systems.

Describe your experience with Kubernetes or Slurm: what were the hardest performance problems you encountered, and how did you solve them?

Assesses skills in coordinating hardware and software development teams.

How would you structure collaboration between the hardware and software teams if their priorities on infrastructure reliability diverge?

Tests product thinking and the ability to work with uncertainty.

Imagine that a major customer complains about low Goodput. What would be your step-by-step plan for investigating the problem and developing a product solution?
