- Country
- USA

Applied Data Scientist, Evaluation & Model Behavior
An exceptional opportunity to work at a stealth startup with founders from OpenAI and DeepMind. The high score reflects the strong team, top-tier investors, and work on frontier AGI technology.
Role difficulty
The role demands deep expertise in LLMs and statistics, plus the ability to operate in an early-stage startup. The bar is high because the job is not merely to apply models but to build systems for evaluating their behavior from scratch.
Salary analysis
The posting does not list a salary, but for an Applied Data Scientist at a Tier-1 startup in San Francisco, market ranges run well above the national average. Expect a competitive base salary plus a substantial equity package.
Cover letter
I am writing to express my strong interest in the Applied Data Scientist position at agi-inc. With over three years of experience in machine learning and a deep focus on model evaluation and behavior, I am excited by your mission to build trustworthy, consumer-grade agents. My background in developing rigorous evaluation harnesses and managing complex data lifecycles aligns perfectly with your need for someone to define the technical standards of model quality.
In my previous roles, I have specialized in translating ambiguous product requirements into quantifiable technical metrics, much like the work you are doing with Computer Use Agents. I have extensive experience in prompt engineering, designing algorithms for data filtration, and conducting deep root-cause analysis of model regressions. I am particularly drawn to agi-inc's 'ship by default' culture and the opportunity to work in person with a stealth team of industry leaders to solve the hardest problems in agent reliability.

Apply to agi-inc now
Join an elite team of OpenAI and DeepMind alumni building next-generation AGI: apply now and get a response within 48 hours!
Job Description
Think Different. Build the Future. 🚀
Our Mission
Build everyday AGI. Trustworthy, consumer-grade agents that redefine human–AI collaboration for millions. Software shouldn’t wait for commands; it should partner with you, amplifying what you can do every single day.
Why AGI, Inc.
We’re a stealth team of elite founders and AI researchers, with backgrounds spanning Stanford, OpenAI, and DeepMind. We’re industry leaders in mobile and computer-use agents, bringing these capabilities to consumer scale.
Grounded in years of agent research, our AI is designed with trustworthiness and reliability as core pillars, not afterthoughts.
We are supported by tier-1 investors who funded the first generation of AI giants; now they're backing us to build the next: everyday AGI.
If you see possibility where others see limits, read on.
About the Role
As an Applied Scientist focused on Evaluation & Model Behavior, you will design and implement the systems used to measure and improve the performance of Computer Use Agents.
This is not a support role. You will be responsible for the technical definition of model quality, including the design of evaluation metrics, the curation of training datasets, and the engineering of system prompts. You'll work directly with the engineering team to translate product requirements into technical specifications and quantifiable benchmarks.
You'll focus on rigor, clarity, and impact, ensuring every metric, dataset, and prompt moves us toward more reliable, trustworthy agents.
What You'll Do
Model Behavior Design: Translate product requirements into technical specifications for model behavior. Engineer system prompts and few-shot examples to address specific capability gaps and behavioral failures.
Evaluation Design: Define metrics for reasoning, tool usage, and safety, and validate these metrics against human judgment to ensure statistical rigor.
Data Strategy: Design algorithms to filter, score, and select training data. Write Python scripts to sanitize inputs and manage the training data lifecycle from raw logs to high-quality datasets.
Failure Analysis: Investigate regressions in model benchmarks. Diagnose root causes, distinguishing among data-quality issues, prompt-instruction failures, and underlying model capability gaps, and implement fixes.
Ground Truth Management: Define rubrics and guidelines for human annotation. Maintain reference datasets ("Golden Sets") to establish a consistent baseline for model performance evaluation.
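The Data Strategy work above ("filter, score, and select training data") can be sketched as a minimal pass over raw logs. Everything here is illustrative: the `LogRecord` shape and the length-based `score` heuristic are assumptions for the sketch, not agi-inc's actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical record shape; real logs would carry far more metadata.
@dataclass
class LogRecord:
    prompt: str
    response: str

def score(rec: LogRecord) -> float:
    """Toy quality score in [0, 1]: penalize empty or very short responses."""
    if not rec.response.strip():
        return 0.0
    return min(len(rec.response) / 200.0, 1.0)

def filter_and_select(records: list[LogRecord], k: int,
                      min_score: float = 0.1) -> list[LogRecord]:
    """Deduplicate, drop records below a quality floor, keep the top k by score."""
    seen: set[tuple[str, str]] = set()
    unique: list[LogRecord] = []
    for r in records:
        key = (r.prompt.strip(), r.response.strip())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    ranked = sorted((r for r in unique if score(r) >= min_score),
                    key=score, reverse=True)
    return ranked[:k]
```

In practice the scoring function is where the real work lives (model-based quality signals, safety filters, diversity constraints); the dedupe-filter-rank skeleton stays the same.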
Minimum Qualifications
- Master's degree or PhD in Computer Science, Data Science, Statistics, or a related technical field, or equivalent practical experience
- 3+ years of experience in Data Science, Machine Learning, or Applied Science
- Proficiency in Python, with experience writing production-quality code for data pipelines or evaluation harnesses
- Experience with experimental design, A/B testing, or statistical analysis
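The experimental-design requirement above typically means comparisons like "did variant B complete more tasks than variant A?". A stdlib-only sketch of a two-sided, two-proportion z-test (illustrative; real evaluations would also report confidence intervals and check sample-size assumptions):

```python
import math

def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in success rates, e.g. task-completion
    rates of two agent variants. Returns (z statistic, p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via erf; p-value is the two-tailed area beyond |z|.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For example, 60/100 successes vs. 40/100 gives z ≈ 2.83 and p ≈ 0.005, i.e. a significant difference at the usual 5% level.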
Preferred Qualifications
- Experience with Large Language Models (LLMs), including prompt engineering, fine-tuning, or RLHF workflows
- Experience building automated evaluation systems or implementing model-based evaluation frameworks
- Ability to translate product requirements into measurable technical metrics
- Experience managing human-in-the-loop data pipelines or annotation quality control
Why This Role Matters
You can't improve what you can't measure. You can't ship what you can't trust.
You will define the technical definition of quality for our agents — the metrics that predict real-world success, the datasets that encode user intent, and the prompts that shape model behavior. Your work will directly determine how quickly we can iterate and how confidently we can ship.
Our Culture
🏢 All in, in person — work moves faster face-to-face
🚀 Ship by default — speed and polish can coexist
🤝 One band, one sound — radical candor, zero politics
Perks
🏥 Competitive company-sponsored medical, dental, and vision insurance
✈️ Top-tier relocation and immigration support
How to Apply
Send us:
- A link — or 60-second video — of something you built and why it matters
- Your resume or LinkedIn
- Two sentences on the hardest problem you've cracked
Every exceptional candidate hears back within 48 hours.
If you see possibility where others see limits, we'd love to meet you.

Skills
- A/B Testing
- Python
- Machine Learning
- Large Language Models
- Fine-tuning
- Statistics
- Prompt Engineering
- Data Pipelines
- Data Science
- RLHF
Possible interview questions
Tests debugging skills and understanding of how LLMs work.
How would you diagnose a situation where a model suddenly starts hallucinating in response to system prompts after an update to the training dataset?
Assesses the ability to translate business requirements into metrics.
How would you measure the 'trustworthiness' of an agent that operates a user's computer, using quantitative indicators?
Tests experience with data and data quality.
Describe your approach to building a 'Golden Set' for evaluating a model's reasoning ability. How do you prevent data leakage?
Assesses prompt-engineering skills.
In your experience, what is the difference between using few-shot examples and fine-tuning to change a specific agent behavior?
Tests statistical literacy.
How do you validate automated evaluation metrics (for example, an LLM-as-judge setup) against human judgments?
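One concrete way to answer the judge-validation question: compare the judge's labels with human labels using a chance-corrected agreement statistic. A minimal sketch with Cohen's kappa (illustrative labels; real validation would also examine the per-category confusion matrix and confidence intervals):

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and LLM-judge labels.

    1.0 means perfect agreement; values near 0.0 mean chance-level agreement.
    """
    assert len(human) == len(judge) and human, "need paired, non-empty labels"
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label]
                   for label in set(human) | set(judge)) / (n * n)
    if expected == 1.0:  # degenerate case: a single label on both sides
        return 1.0
    return (observed - expected) / (1 - expected)
```

A judge whose kappa against trusted human raters is low is measuring something other than the intended quality, no matter how plausible its scores look in isolation.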