feat(evals): add OdysseysBench agent benchmark #2275
Open: miguelg719 wants to merge 5 commits into main from miguelgonzalez/evals-odysseysbench
Changes from 1 commit
Commits (5), all by miguelg719:
dd52819 feat(evals): add OdysseysBench agent benchmark
ce39f7d fix(evals): address review on OdysseysBench suite
2170942 fix(evals): address cubic review on OdysseysBench
64bc653 fix(evals): register OdysseysBench in the modern CLI discovery override
29ccd1a fix(evals): register OdysseysBench in the modern CLI + external-harne…
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| --- | ||
| "@browserbasehq/stagehand-evals": minor | ||
| --- | ||
|
|
||
| Add OdysseysBench as a supported agent benchmark in the evals CLI. OdysseysBench is a 200-task web-agent benchmark (45 easy / 46 medium / 109 hard); each task ships a weighted rubric that is baked into the verifier's `precomputed_rubric` format so process + outcome are scored against the published criteria. Run with `--eval-name agent/odysseysbench` (or the `external_agent_benchmarks` category); supports `EVAL_ODYSSEYSBENCH_LIMIT`, `EVAL_ODYSSEYSBENCH_SAMPLE`, `EVAL_ODYSSEYSBENCH_LEVEL`, and `EVAL_ODYSSEYSBENCH_IDS`. |
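The four knobs named in the changeset can be summarized in one place. The sketch below is a hypothetical helper (`readKnobs` and `OdysseysBenchKnobs` are not part of the PR) that mirrors how the suite loader in the diff on this page appears to read them: `EVAL_MAX_K` takes precedence over `EVAL_ODYSSEYSBENCH_LIMIT`, and the cap falls back to 25 when neither is set.

```typescript
// Hypothetical helper, not part of the PR: it restates the env handling
// visible in the suite loader diff so the precedence rules are easy to see.
type Env = Record<string, string | undefined>;

interface OdysseysBenchKnobs {
  maxCases: number;
  sampleCount: number | undefined;
  levels: Set<string> | null;
  explicitIds: string[] | null;
}

// Split a comma-separated value into trimmed, non-empty parts.
function splitCsv(value: string): string[] {
  return value
    .split(",")
    .map((s) => s.trim())
    .filter(Boolean);
}

function readKnobs(env: Env): OdysseysBenchKnobs {
  const maxK = env["EVAL_MAX_K"];
  const limit = env["EVAL_ODYSSEYSBENCH_LIMIT"];
  const sample = env["EVAL_ODYSSEYSBENCH_SAMPLE"];
  const level = env["EVAL_ODYSSEYSBENCH_LEVEL"];
  const ids = env["EVAL_ODYSSEYSBENCH_IDS"];

  return {
    // EVAL_MAX_K wins over EVAL_ODYSSEYSBENCH_LIMIT; the fallback is 25.
    maxCases: maxK ? Number(maxK) : limit ? Number(limit) : 25,
    sampleCount: sample ? Number(sample) : undefined,
    // Levels are lowercased so "Easy" and "easy" match the same rows.
    levels: level
      ? new Set(splitCsv(level).map((s) => s.toLowerCase()))
      : null,
    // Task IDs are kept in the order given.
    explicitIds: ids ? splitCsv(ids) : null,
  };
}

const knobs = readKnobs({
  EVAL_ODYSSEYSBENCH_LIMIT: "10",
  EVAL_ODYSSEYSBENCH_LEVEL: "Easy, hard",
});
console.log(knobs.maxCases, [...(knobs.levels ?? [])]);
```

Passing the environment in as an argument, rather than reading `process.env` directly, is a choice made here only to keep the sketch testable.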
packages/evals/datasets/odysseysbench/OdysseysBench_data.jsonl
200 changes: 200 additions & 0 deletions. Large diffs are not rendered by default.
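Because the JSONL diff is not rendered, the row shape has to be inferred from the `OutputRow` interface in the build script. The sketch below shows one way such a file could be read line by line with a type guard like the one in the suite loader. The `parseJsonl` function and all sample values are invented for illustration, and skipping malformed lines is an assumption about the desired behavior, not a description of the repo's `parseJsonlRows`.

```typescript
// Illustrative sketch: the row shape is inferred from OutputRow in the build
// script, and every sample value below is made up.
interface RubricItem {
  criterion: string;
  description: string;
  max_points: number;
}

interface Row {
  task_id: string;
  confirmed_task: string;
  website?: string;
  level?: "easy" | "medium" | "hard";
  reference_length?: number;
  categories?: string[];
  precomputed_rubric?: { items: RubricItem[] };
}

// Minimal guard: only the two fields the loader requires are checked.
function isRow(parsed: unknown): parsed is Row {
  if (parsed === null || typeof parsed !== "object") return false;
  const obj = parsed as Record<string, unknown>;
  return (
    typeof obj["task_id"] === "string" &&
    typeof obj["confirmed_task"] === "string"
  );
}

// Assumption: blank, malformed, or non-conforming lines are skipped.
function parseJsonl(text: string): Row[] {
  const rows: Row[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue;
    }
    if (isRow(parsed)) rows.push(parsed);
  }
  return rows;
}

const sample =
  [
    JSON.stringify({
      task_id: "demo-1",
      confirmed_task: "Find the store's opening hours",
      website: "https://example.com",
      level: "easy",
      reference_length: 5,
      precomputed_rubric: {
        items: [
          { criterion: "Opens the store page", description: "…", max_points: 40 },
          { criterion: "Reports the hours", description: "…", max_points: 60 },
        ],
      },
    }),
    "not json",
    JSON.stringify({ task_id: 42 }),
  ].join("\n") + "\n";

const parsedRows = parseJsonl(sample);
console.log(parsedRows.map((r) => r.task_id));
```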
packages/evals/datasets/odysseysbench/source/tasks.json
8,532 changes: 8,532 additions & 0 deletions. Large diffs are not rendered by default.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,151 @@ | ||
| /** | ||
| * Build packages/evals/datasets/odysseysbench/OdysseysBench_data.jsonl from the | ||
| * published OdysseysBench task set. | ||
| * | ||
| * OdysseysBench (https://odysseysbench.com) is a 200-task web-agent benchmark | ||
| * (45 easy / 46 medium / 109 hard). Every task ships a weighted rubric whose | ||
| * weights sum to 1.0. This script converts each task's `rubrics` map into the | ||
| * verifier's `precomputed_rubric` shape ({ items: [{ criterion, description, | ||
| * max_points }] }) so the suite can hand it straight to V3Evaluator.verify() | ||
| * without generating a rubric. | ||
| * | ||
| * Source of truth is the committed snapshot at | ||
| * packages/evals/datasets/odysseysbench/source/tasks.json | ||
| * (mirrored from https://odysseysbench.com/assets/data/tasks.json). Re-fetch | ||
| * with `--fetch` to refresh that snapshot before rebuilding. | ||
| * | ||
| * Run after pulling the branch (or whenever the source snapshot changes): | ||
| * pnpm tsx packages/evals/scripts/build-odysseysbench-dataset.ts | ||
| * | ||
| * Idempotent — regenerates the JSONL deterministically from the snapshot. | ||
| */ | ||
| import fs from "node:fs/promises"; | ||
| import path from "node:path"; | ||
|
|
||
| const SOURCE_URL = "https://odysseysbench.com/assets/data/tasks.json"; | ||
|
|
||
| const DATASET_DIR = path.join( | ||
| path.resolve(import.meta.dirname, ".."), | ||
| "datasets", | ||
| "odysseysbench", | ||
| ); | ||
| const SOURCE_PATH = path.join(DATASET_DIR, "source", "tasks.json"); | ||
| const JSONL_PATH = path.join(DATASET_DIR, "OdysseysBench_data.jsonl"); | ||
|
|
||
| interface SourceRubric { | ||
| requirement: string; | ||
| verification: string; | ||
| weight: number; | ||
| } | ||
|
|
||
| interface SourceTask { | ||
| task_id: string; | ||
| confirmed_task: string; | ||
| website: string; | ||
| reference_length: number; | ||
| level: "easy" | "medium" | "hard"; | ||
| rubrics: Record<string, SourceRubric>; | ||
| categories?: string[]; | ||
| num_categories?: number; | ||
| } | ||
|
|
||
| interface RubricItem { | ||
| criterion: string; | ||
| description: string; | ||
| max_points: number; | ||
| } | ||
|
|
||
| interface OutputRow { | ||
| task_id: string; | ||
| confirmed_task: string; | ||
| website: string; | ||
| level: "easy" | "medium" | "hard"; | ||
| reference_length: number; | ||
| categories?: string[]; | ||
| precomputed_rubric: { items: RubricItem[] }; | ||
| } | ||
|
|
||
| /** Order rubric keys R1, R2, … R10 numerically rather than lexicographically. */ | ||
| function sortRubricKeys(keys: string[]): string[] { | ||
| return [...keys].sort((a, b) => { | ||
| const na = Number.parseInt(a.replace(/^\D+/, ""), 10); | ||
| const nb = Number.parseInt(b.replace(/^\D+/, ""), 10); | ||
| if (Number.isFinite(na) && Number.isFinite(nb) && na !== nb) return na - nb; | ||
| return a.localeCompare(b); | ||
| }); | ||
| } | ||
|
|
||
| /** | ||
| * Convert one OdysseysBench rubric entry into a verifier rubric item. | ||
| * | ||
| * `weight` (summing to 1.0 across a task) is scaled to integer points so the | ||
| * scoring model reasons over a natural 0–100 scale; the process score is a | ||
| * ratio, so the exact scale is immaterial. `max(1, …)` keeps every criterion | ||
| * worth at least one point. | ||
| */ | ||
| function toRubricItem(key: string, r: SourceRubric): RubricItem { | ||
| return { | ||
| criterion: r.requirement, | ||
| description: `${r.requirement}\n\nHow a grader verifies this: ${r.verification}`, | ||
| max_points: Math.max(1, Math.round(r.weight * 100)), | ||
| }; | ||
| } | ||
|
|
||
| async function loadSource(): Promise<SourceTask[]> { | ||
| if (process.argv.includes("--fetch")) { | ||
| const res = await fetch(SOURCE_URL); | ||
| if (!res.ok) { | ||
| throw new Error(`Failed to fetch ${SOURCE_URL}: ${res.status}`); | ||
| } | ||
| const text = await res.text(); | ||
| await fs.mkdir(path.dirname(SOURCE_PATH), { recursive: true }); | ||
| await fs.writeFile(SOURCE_PATH, text); | ||
| console.log(`Refreshed snapshot: ${SOURCE_PATH}`); | ||
| return JSON.parse(text) as SourceTask[]; | ||
| } | ||
| const text = await fs.readFile(SOURCE_PATH, "utf8"); | ||
| return JSON.parse(text) as SourceTask[]; | ||
| } | ||
|
|
||
| async function main(): Promise<void> { | ||
| const tasks = await loadSource(); | ||
| if (!Array.isArray(tasks) || tasks.length === 0) { | ||
| throw new Error("Source tasks.json is empty or not an array"); | ||
| } | ||
|
|
||
| const lines: string[] = []; | ||
| for (const task of tasks) { | ||
| const rubricKeys = sortRubricKeys(Object.keys(task.rubrics ?? {})); | ||
| if (rubricKeys.length === 0) { | ||
| throw new Error(`Task ${task.task_id} has no rubrics`); | ||
| } | ||
| const items = rubricKeys.map((k) => toRubricItem(k, task.rubrics[k])); | ||
|
|
||
| const row: OutputRow = { | ||
| task_id: task.task_id, | ||
| confirmed_task: task.confirmed_task, | ||
| website: task.website, | ||
| level: task.level, | ||
| reference_length: task.reference_length, | ||
| ...(Array.isArray(task.categories) && task.categories.length > 0 | ||
| ? { categories: task.categories } | ||
| : {}), | ||
| precomputed_rubric: { items }, | ||
| }; | ||
| lines.push(JSON.stringify(row)); | ||
| } | ||
|
|
||
| await fs.writeFile(JSONL_PATH, lines.join("\n") + "\n"); | ||
| const byLevel = tasks.reduce<Record<string, number>>((acc, t) => { | ||
| acc[t.level] = (acc[t.level] ?? 0) + 1; | ||
| return acc; | ||
| }, {}); | ||
| console.log( | ||
| `Wrote ${lines.length} rows to ${JSONL_PATH} (${JSON.stringify(byLevel)})`, | ||
| ); | ||
| } | ||
|
|
||
| main().catch((err) => { | ||
| console.error(err); | ||
| process.exit(1); | ||
| }); | ||
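A small worked example may help show what the key ordering and the weight scaling in the script above do in practice. The two functions below are copied in spirit from the diff (with the unused `key` parameter dropped); the rubric entries and their weights are invented. The sample is chosen so that rounding and the one-point floor push the total to 101 rather than 100, which is harmless if, as the script's comment says, the process score is a ratio.

```typescript
interface SourceRubric {
  requirement: string;
  verification: string;
  weight: number;
}

interface RubricItem {
  criterion: string;
  description: string;
  max_points: number;
}

// Order R1, R2, ... R10 by their numeric suffix; fall back to string order.
function sortRubricKeys(keys: string[]): string[] {
  return [...keys].sort((a, b) => {
    const na = Number.parseInt(a.replace(/^\D+/, ""), 10);
    const nb = Number.parseInt(b.replace(/^\D+/, ""), 10);
    if (Number.isFinite(na) && Number.isFinite(nb) && na !== nb) return na - nb;
    return a.localeCompare(b);
  });
}

// Scale a 0..1 weight to integer points, with a floor of one point.
function toRubricItem(r: SourceRubric): RubricItem {
  return {
    criterion: r.requirement,
    description: `${r.requirement}\n\nHow a grader verifies this: ${r.verification}`,
    max_points: Math.max(1, Math.round(r.weight * 100)),
  };
}

// Invented rubric: keys are deliberately out of order, weights sum to 1.0.
const rubrics: Record<string, SourceRubric> = {
  R10: { requirement: "Closes the dialog", verification: "Dialog is gone", weight: 0.004 },
  R2: { requirement: "Applies the filter", verification: "Filter chip shown", weight: 0.496 },
  R1: { requirement: "Opens the search page", verification: "URL matches", weight: 0.5 },
};

const orderedKeys = sortRubricKeys(Object.keys(rubrics));
const items = orderedKeys.map((k) => toRubricItem(rubrics[k]));
const totalPoints = items.reduce((sum, item) => sum + item.max_points, 0);

console.log(orderedKeys); // ["R1", "R2", "R10"]
console.log(items.map((i) => i.max_points)); // [50, 50, 1]
console.log(totalPoints); // 101
```

A plain lexicographic sort would have placed `R10` before `R2`, which is the case the numeric comparison guards against.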
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,139 @@ | ||
| import type { Testcase, EvalInput, AgentModelEntry } from "../types/evals.js"; | ||
| import { normalizeRubric, type AvailableModel } from "@browserbasehq/stagehand"; | ||
| import { tasksConfig } from "../taskConfig.js"; | ||
| import { getPackageRootDir } from "../runtimePaths.js"; | ||
| import { | ||
| readJsonlFile, | ||
| parseJsonlRows, | ||
| applySampling, | ||
| normalizeAgentModelEntries, | ||
| } from "../utils.js"; | ||
|
|
||
| /** | ||
| * Build OdysseysBench testcases. | ||
| * | ||
| * OdysseysBench (https://odysseysbench.com) is a 200-task web-agent benchmark | ||
| * spanning easy/medium/hard difficulty. Every task ships a weighted rubric | ||
| * (baked into `precomputed_rubric` by scripts/build-odysseysbench-dataset.ts), | ||
| * so the verifier scores against the published criteria directly rather than | ||
| * generating its own. | ||
| * | ||
| * Env knobs: | ||
| * - EVAL_MAX_K / EVAL_ODYSSEYSBENCH_LIMIT — cap the number of tasks (default 25). | ||
| * - EVAL_ODYSSEYSBENCH_SAMPLE — random sample size (overrides the limit cap). | ||
| * - EVAL_ODYSSEYSBENCH_LEVEL — comma-separated difficulty filter (easy,medium,hard). | ||
| * - EVAL_ODYSSEYSBENCH_IDS — comma-separated task_ids to run exactly, in order | ||
| * (ignores sampling / limit / level knobs). | ||
| */ | ||
| export const buildOdysseysBenchTestcases = ( | ||
| models: string[] | AgentModelEntry[], | ||
| ): Testcase[] => { | ||
| const odysseysbenchFilePath = | ||
| getPackageRootDir() + "/datasets/odysseysbench/OdysseysBench_data.jsonl"; | ||
|
|
||
| const lines = readJsonlFile(odysseysbenchFilePath); | ||
|
|
||
| const maxCases = process.env.EVAL_MAX_K | ||
| ? Number(process.env.EVAL_MAX_K) | ||
| : process.env.EVAL_ODYSSEYSBENCH_LIMIT | ||
| ? Number(process.env.EVAL_ODYSSEYSBENCH_LIMIT) | ||
| : 25; | ||
| const sampleCount = process.env.EVAL_ODYSSEYSBENCH_SAMPLE | ||
| ? Number(process.env.EVAL_ODYSSEYSBENCH_SAMPLE) | ||
| : undefined; | ||
|
|
||
| type OdysseysBenchRow = { | ||
| task_id: string; | ||
| confirmed_task: string; | ||
| website?: string; | ||
| level?: "easy" | "medium" | "hard"; | ||
| reference_length?: number; | ||
| categories?: string[]; | ||
| /** | ||
| * Per-task weighted rubric in verifier `{ items: [...] }` shape, produced | ||
| * from the published rubrics by scripts/build-odysseysbench-dataset.ts. | ||
| */ | ||
| precomputed_rubric?: unknown; | ||
| [key: string]: unknown; | ||
| }; | ||
|
|
||
| function isOdysseysBenchRow(parsed: unknown): parsed is OdysseysBenchRow { | ||
| if (parsed === null || typeof parsed !== "object") return false; | ||
| const obj = parsed as Record<string, unknown>; | ||
| return ( | ||
| typeof obj.task_id === "string" && typeof obj.confirmed_task === "string" | ||
| ); | ||
| } | ||
|
|
||
| const candidates = parseJsonlRows(lines, isOdysseysBenchRow); | ||
|
|
||
| // EVAL_ODYSSEYSBENCH_IDS restricts the suite to exactly those task IDs, | ||
| // preserving the order given and ignoring sampling / limit / level knobs. | ||
| const explicitIds = process.env.EVAL_ODYSSEYSBENCH_IDS | ||
| ? process.env.EVAL_ODYSSEYSBENCH_IDS.split(",") | ||
| .map((s) => s.trim()) | ||
| .filter(Boolean) | ||
| : null; | ||
|
|
||
| let rows: OdysseysBenchRow[]; | ||
| if (explicitIds && explicitIds.length > 0) { | ||
| const byId = new Map(candidates.map((r) => [r.task_id, r])); | ||
| rows = explicitIds | ||
| .map((id) => byId.get(id)) | ||
| .filter((r): r is OdysseysBenchRow => Boolean(r)); | ||
| } else { | ||
| // Optional difficulty filter, applied before sampling. | ||
| const levelFilter = process.env.EVAL_ODYSSEYSBENCH_LEVEL | ||
| ? new Set( | ||
| process.env.EVAL_ODYSSEYSBENCH_LEVEL.split(",") | ||
| .map((s) => s.trim().toLowerCase()) | ||
| .filter(Boolean), | ||
| ) | ||
| : null; | ||
| const filtered = levelFilter | ||
| ? candidates.filter((r) => r.level && levelFilter.has(r.level)) | ||
| : candidates; | ||
| rows = applySampling(filtered, sampleCount, maxCases); | ||
| } | ||
|
|
||
| const allTestcases: Testcase[] = []; | ||
| for (const modelEntry of normalizeAgentModelEntries(models)) { | ||
| for (const row of rows) { | ||
| const input: EvalInput = { | ||
| name: "agent/odysseysbench", | ||
| modelName: modelEntry.modelName as AvailableModel, | ||
| agentMode: modelEntry.mode, | ||
| isCUA: modelEntry.mode === "cua", | ||
| params: { | ||
| task_id: row.task_id, | ||
| confirmed_task: row.confirmed_task, | ||
| website: row.website, | ||
| level: row.level, | ||
| reference_length: row.reference_length, | ||
| precomputed_rubric: normalizeRubric(row.precomputed_rubric), | ||
| }, | ||
| }; | ||
| const taskCategories = | ||
| tasksConfig.find((t) => t.name === input.name)?.categories || []; | ||
| allTestcases.push({ | ||
| input, | ||
| name: input.name, | ||
| tags: [modelEntry.modelName, modelEntry.mode, "odysseysbench"], | ||
| metadata: { | ||
| model: modelEntry.modelName as AvailableModel, | ||
| test: `${input.name}:${row.task_id}`, | ||
| tier: "bench", | ||
| task: input.name, | ||
| category: taskCategories[0] || "agent", | ||
| categories: taskCategories, | ||
| dataset: "odysseysbench", | ||
| task_id: row.task_id, | ||
| task_category: row.level, | ||
| }, | ||
| expected: true, | ||
| }); | ||
| } | ||
| } | ||
|
|
||
| return allTestcases; | ||
| }; | ||
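The row-selection logic in the loader above has two branches, and a reduced sketch can make the difference concrete. Here `selectRows` and `Selection` are hypothetical names, and `takeFirst` is only a stand-in for `applySampling` from `../utils.js`, whose real behavior is not shown on this page; the stand-in ignores the sample size and simply keeps the first `maxCases` rows.

```typescript
interface Row {
  task_id: string;
  level?: "easy" | "medium" | "hard";
}

// Hypothetical bundle of the already-parsed knobs.
interface Selection {
  explicitIds: string[] | null;
  levels: Set<string> | null;
  maxCases: number;
}

// Stand-in for applySampling (assumption): keep the first maxCases rows.
function takeFirst<T>(rows: T[], maxCases: number): T[] {
  return rows.slice(0, maxCases);
}

function selectRows(candidates: Row[], sel: Selection): Row[] {
  // Branch 1: explicit IDs win, keep the caller's order, and skip the
  // level filter and the cap entirely. Unknown IDs are dropped silently.
  if (sel.explicitIds && sel.explicitIds.length > 0) {
    const byId = new Map<string, Row>(
      candidates.map((r): [string, Row] => [r.task_id, r]),
    );
    return sel.explicitIds
      .map((id) => byId.get(id))
      .filter((r): r is Row => r !== undefined);
  }
  // Branch 2: optional difficulty filter first, then the cap.
  const levels = sel.levels;
  const filtered = levels
    ? candidates.filter((r) => r.level !== undefined && levels.has(r.level))
    : candidates;
  return takeFirst(filtered, sel.maxCases);
}

const candidates: Row[] = [
  { task_id: "a", level: "easy" },
  { task_id: "b", level: "hard" },
  { task_id: "c", level: "medium" },
  { task_id: "d", level: "hard" },
];

const picked = selectRows(candidates, {
  explicitIds: ["d", "missing", "a"],
  levels: new Set(["easy"]),
  maxCases: 1,
});
console.log(picked.map((r) => r.task_id)); // ["d", "a"]
```

One consequence worth noting: a mistyped ID produces a shorter run rather than an error, so callers who depend on exact coverage may want to compare the requested and returned counts.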
P3: Uses `path.join` for repo-internal dataset paths; this emits backslashes on Windows and violates the repo’s '/' path convention.
(Based on your team's feedback about forward-slash path separators.)
Declining this one:
`path.join` is the correct choice for runtime filesystem paths (it emits the OS-native separator, which is what `fs` wants on Windows), and it matches the sibling dev script `scripts/backfill-webtailbench-rubrics.ts`, which also uses `path.join` for its dataset paths. The forward-slash convention applies to in-code/URL/import paths; the suite loader that builds an embedded dataset path does use `/`. Keeping `path.join` here for consistency with the existing converter script.
Thanks for the feedback. This comment was influenced by this learning. Open the link to edit it, or reply here to edit or delete it.
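The separator behavior discussed in this thread can be checked on any platform, because Node exposes both flavors of the `path` API as `path.win32` and `path.posix`. The sketch below only demonstrates what each flavor emits for the dataset path segments; it does not settle which convention the repo should follow.

```typescript
import * as path from "node:path";

const segments = ["datasets", "odysseysbench", "OdysseysBench_data.jsonl"];

// What path.join produces when running on Windows.
const windowsStyle = path.win32.join(...segments);

// What path.join produces on Linux and macOS.
const posixStyle = path.posix.join(...segments);

// One way to force forward slashes regardless of platform.
const normalized = windowsStyle.split(path.win32.sep).join(path.posix.sep);

console.log(windowsStyle); // datasets\odysseysbench\OdysseysBench_data.jsonl
console.log(posixStyle); // datasets/odysseysbench/OdysseysBench_data.jsonl
console.log(normalized === posixStyle); // true
```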