- Country
- USA
- Salary
- $180,000 – $440,000

Member of Technical Staff - Site Infrastructure (US Government)
An exceptional opening for top-tier engineers: a cutting-edge hardware stack (NVIDIA B200), a flat company structure, and a very high salary range, although the security and travel requirements are extremely strict.
Job difficulty
The maximum difficulty rating reflects the need for an active TS/SCI clearance with polygraph, willingness to travel up to 75% of the time, and deep knowledge at the intersection of hardware (B200/GB200 GPUs) and complex software in isolated environments.
Salary analysis
The offered salary ($180k - $440k) is well above market rates for standard SRE roles, which is justified by the unique combination of security requirements (TS/SCI) and deep expertise in AI infrastructure.
Cover letter
I am writing to express my strong interest in the Member of Technical Staff - Site Infrastructure position at xAI. With over five years of experience in infrastructure engineering and a deep background in managing bare-metal Kubernetes deployments, I am particularly drawn to the challenge of scaling AI inference platforms within air-gapped, classified environments. My technical expertise aligns perfectly with your requirements for racking GPU servers, provisioning via PXE, and maintaining complex network fabrics like InfiniBand and RoCE.
Throughout my career, I have developed a robust understanding of the full infrastructure stack, from physical hardware to container orchestration. I have extensive experience working with NVIDIA GPU stacks and implementing security compliance measures such as DISA STIGs and FIPS 140-3. Having operated in secure facilities, I am comfortable with the unique constraints of classified work and possess the active TS/SCI with CI Poly clearance required to hit the ground running. I am eager to bring my hands-on approach and commitment to engineering excellence to the xAI team.

Job description
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
ABOUT THE ROLE:
You will be the person who turns a hardware listing and a software bundle into a running AI inference platform — from bare metal to serving production traffic. This is a hands-on role at the intersection of physical datacenter infrastructure and platform engineering. You will rack GPU servers, cable network fabrics, provision bare metal via PXE, deploy Kubernetes clusters, stand up monitoring and network telemetry stacks, and validate end-to-end inference pipelines — all in air-gapped, classified environments with no internet access.
You are the high side. Everything the platform engineering team builds on the unclassified side — deployment tooling, signed software bundles, switch configurations, OS images — you execute on classified infrastructure. You own the full stack from physical hardware through running GPU workloads, including the cross-domain solution (CDS) receive pipeline that automates software delivery into the classified environment. When something breaks on-site, you fix it. When an update arrives through the data diode or on physical media, you apply it. You are the bridge between xAI's engineering organization and the classified compute facilities where our infrastructure operates.
This role requires significant time on-site at classified compute facilities. You will work closely with customer IT and security teams, cleared facility personnel, and xAI's uncleared platform engineering team (via approved communication channels).
RESPONSIBILITIES:
- Rack, cable, and power GPU server infrastructure (e.g., Dell XE9680L with NVIDIA B200/GB200) and network switching fabric (NVIDIA SN5600, Mellanox QM9700, management switches) in classified data center environments.
- Execute bare metal provisioning using PXE/OSP: deploy squashfs boot images, NVIDIA drivers, MOFED/DOCA packages, and join nodes to Kubernetes clusters (RKE2/kubeadm) — all from pre-staged air-gap bundles with zero internet access.
- Deploy and operate the full Kubernetes platform stack: GPU/network operators, engine-operator, podgroup-operator, xai-scheduler, ingress controllers, storage provisioners, and RBAC.
- Deploy and operate the monitoring and network telemetry stack: VictoriaMetrics, VMAlert, AlertManager, NetCollector, Grafana — configured for local operation without central dependencies.
- Set up and maintain the CDS receive pipeline: data diode receive proxy, local container registry (Harbor), cosign signature verification, and bundle application automation.
- Apply signed software update bundles to classified infrastructure, verify acceptance tests pass, and execute rollback procedures when needed.
- Validate network fabric correctness using LLDP verification, BGP peering checks, and InfiniBand fabric topology validation after initial deployment and hardware changes. Serve as the keyboard operator for network troubleshooting directed by the Network Architect — you execute commands on classified network devices while the architect directs the session on-site or via approved channels.
- Execute compliance and security validation: run STIG scans (OpenSCAP) against deployed systems, verify FIPS 140-3 mode on all nodes, validate AV agent status, and execute pre-admission security checklists before nodes are allowed to serve classified workloads. Document and report compliance status for ATO packages.
- Troubleshoot GPU inference workloads (SGLang, engine-operator, sampling-loadbalancer) in classified environments, working with uncleared engineering teams via approved channels for guidance on complex issues.
- Interface with customer IT, security, and facility teams. Participate in change control board (CCB) processes for classified system modifications. Train customer operations teams on monitoring dashboards, alert response procedures, and basic operational runbooks during deployment handoff.
- Maintain and create operational documentation: site-specific runbooks, deployment validation reports, incident response procedures, and post-deployment handoff materials.
- Participate in on-call rotation for classified site incident response.
- Up to 75% travel to classified compute facilities required.
BASIC QUALIFICATIONS:
- Active Top Secret / SCI (TS/SCI) security clearance with Counterintelligence Polygraph (CI Poly).
- 5+ years of experience in infrastructure engineering, site reliability engineering, or systems engineering, with hands-on datacenter experience (racking, cabling, power, iDRAC/BMC).
- Deep understanding of the Kubernetes stack: container runtimes, CNI (Calico, Cilium), CSI, CRI, Helm, and operator patterns.
- Experience with bare metal Linux provisioning: PXE boot, cloud-init, disk partitioning, driver installation, kernel configuration.
- Proficiency with Infrastructure-as-Code tools (Pulumi, Terraform, or Ansible).
- Experience deploying and operating monitoring stacks (Prometheus, VictoriaMetrics, Grafana, AlertManager).
- Comfortable working independently in classified environments with limited real-time support from uncleared teams.
- Excellent communication and documentation skills — you will be the primary interface between classified operations and uncleared engineering.
PREFERRED SKILLS AND EXPERIENCE:
- Experience with NVIDIA GPU infrastructure: driver installation, CUDA, NCCL, InfiniBand/RoCE, ConnectX NICs, BlueField DPUs.
- Experience with air-gapped or disconnected deployments where all software must be pre-staged.
- Familiarity with network switch configuration and troubleshooting (Cumulus/NVUE, Junos, EOS, NX-OS).
- Experience with cross-domain solutions (CDS), data diodes, or secure transfer mechanisms in classified environments.
- Experience with container image signing and verification (cosign, Sigstore, SBOM tooling).
- Familiarity with RKE2, k3s, or kubeadm for Kubernetes cluster bootstrapping.
- Experience working in SCIF or other classified facility environments.
- Hands-on experience with DISA STIG scanning (OpenSCAP, SCAP Compliance Checker), FIPS 140-3 validation, and CIS benchmark execution (kube-bench).
- Experience with ATO package preparation and change control board (CCB) processes in classified environments.
COMPENSATION AND BENEFITS:
$180,000 - $440,000 USD
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.

Skills
- Linux
- Terraform
- Kubernetes
- Helm
- Prometheus
- Grafana
- Docker
- CUDA
- Ansible
- BGP
- Pulumi
- InfiniBand
- PXE
- NVIDIA GPU
- Harbor
Possible interview questions
Tests experience working without internet access, which is critical for this role.
Describe your experience deploying and upgrading Kubernetes clusters in fully isolated (air-gapped) environments. What were the main difficulties you ran into?
The role involves working with the latest NVIDIA hardware.
How would you approach diagnosing GPU inference performance problems if you suspect the InfiniBand topology or the NCCL configuration?
Security is a priority for government contracts.
What is your experience passing STIG compliance checks and preparing documentation for an ATO (Authorization to Operate)?
The work requires coordination between the classified environment and the main engineering team.
How do you organize knowledge transfer and reporting when working at a site where standard communication tools are prohibited?
Tests infrastructure-as-code skills under unusual constraints.
How would you adapt your Terraform or Ansible scripts to work through cross-domain solutions (CDS) and data diodes?