Datasets: anon/dsn26

problem (string, 36 distinct values)	Checkpoint number (int64, 1 to 8)	language (string, 1 distinct value)	difficulty (string, 3 distinct values)	tags (list, 7 to 15 items)	instruction (string, 515 to 15.3k characters)	tests (string, 5.59k to 5.79k characters)
problem_0001	1	cpp	Easy	[ "cli-tools", "config-management", "argument-parsing", "schema-validation", "http-integration", "watch-mode", "type-system", "deterministic-output" ]	## cfgpipe - Core Resolution ## 1. Purpose `cfgpipe` is a command-line configuration resolver. It reads a JSON schema document from disk, resolves each declared parameter from local sources, and writes the resolved configuration to stdout as JSON. Every resolved leaf value is a string. ## 2. Invocation ```text ./cf...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0001 / checkpoint_1. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain cfgpipe.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0001	2	cpp	Easy	[ "cli-tools", "config-management", "argument-parsing", "schema-validation", "http-integration", "watch-mode", "type-system", "deterministic-output" ]	## cfgpipe - Primary Store ## 1. Feature Increment Adds a network-backed primary store to the resolution pipeline. Global flags may now precede the schema path. Resolved leaf values remain strings. ```text ./cfgpipe [global-flags...] <schema-file> [arg-candidates...] ``` ## 2. Primary-Store Configuration `--primar...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0001 / checkpoint_2. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain cfgpipe.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0001	3	cpp	Easy	[ "cli-tools", "config-management", "argument-parsing", "schema-validation", "http-integration", "watch-mode", "type-system", "deterministic-output" ]	## cfgpipe - Groups and Custom Types ## 1. Feature Increment Introduces parameter groups, custom parameter types, and stricter schema validation. Parameters may appear under nested object containers. Every output path and every error path must use the full dotted path. Resolved leaf values remain strings. ## 2. Defi...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0001 / checkpoint_3. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain cfgpipe.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0001	4	cpp	Easy	[ "cli-tools", "config-management", "argument-parsing", "schema-validation", "http-integration", "watch-mode", "type-system", "deterministic-output" ]	## cfgpipe - Watch Mode and Change Events ## 1. Feature Increment Adds watch mode, structured change events, primary-store runtime updates, and a poll-based secondary store. Resolved configuration leaf values remain strings. The built-in custom type `port` parses a base-10 integer string, accepts values 0-65535 inclu...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0001 / checkpoint_4. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain cfgpipe.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0001	5	cpp	Easy	[ "cli-tools", "config-management", "argument-parsing", "schema-validation", "http-integration", "watch-mode", "type-system", "deterministic-output" ]	## cfgpipe - Extended Type System ## 1. Feature Increment Expands the type system and switches resolved configuration output from string-only leaves to type-native JSON serialization. Change events continue using string representations. Watch mode, grouped output, and both store integrations remain in effect. ## 2. ...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0001 / checkpoint_5. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain cfgpipe.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0001	6	cpp	Easy	[ "cli-tools", "config-management", "argument-parsing", "schema-validation", "http-integration", "watch-mode", "type-system", "deterministic-output" ]	## cfgpipe - Store Refinements ## 1. Feature Increment Refines store access and runtime fault handling. Primary store gains folder-prefix composition. Secondary store gains dynamic key prefixing. Secondary-store monitoring isolates per-key operational failures. Grouped schemas, watch mode, type-native resolved output...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0001 / checkpoint_6. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain cfgpipe.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0002	1	cpp	Medium	[ "cli-tools", "parsing", "digital-circuits", "boolean-logic", "optimization", "graphviz", "equivalence-checking", "truth-tables" ]	# Part 1 — CLI basics, scalar `.circ` format, and `check` ## What this part adds The executable with `--help`, `--version`, JSON output mode, and the first command `check` for validating scalar `.circ` circuit files. --- ## Definitions ### Exit codes \| Code \| Category \| Description \| \|------\|----------\|----------...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0002 / checkpoint_1. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain circopt.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0002	2	cpp	Medium	[ "cli-tools", "parsing", "digital-circuits", "boolean-logic", "optimization", "graphviz", "equivalence-checking", "truth-tables" ]	# Part 2 — Scalar 2-valued evaluation: `eval` ## Error types \| Error type \| Exit code \| \|------------\|-----------\| \| `MissingInputError` \| 1 \| \| `UnknownInputError` \| 1 \| \| `InputValueParseError` \| 2 \| --- ## `eval` command ```bash ./circopt eval <file.circ> [--set name=value ...] [--default 0\|1] [--allow-extra] [...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0002 / checkpoint_2. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain circopt.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0002	3	cpp	Medium	[ "cli-tools", "parsing", "digital-circuits", "boolean-logic", "optimization", "graphviz", "equivalence-checking", "truth-tables" ]	# Part 3 — Vectors, richer expressions, and `--radix` ## What this part adds Vector signals in `.circ`, width checking, slicing/indexing/concatenation, richer literal formats, new operators (`MUX`, reductions, `EQ`), and `--radix` output formatting. --- ## New error types These errors use exit code 3 (semantic err...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0002 / checkpoint_3. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain circopt.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0002	4	cpp	Medium	[ "cli-tools", "parsing", "digital-circuits", "boolean-logic", "optimization", "graphviz", "equivalence-checking", "truth-tables" ]	# Part 4 — 3-valued evaluation (`--mode 3val`) ## What this part adds The `eval` command now supports 3-valued logic with `X` (unknown/don't-care) as an input value. Circuit files still cannot contain literal `X` values. --- ## New error types \| Error type \| Exit code \| Description \| \|------------\|-----------\|----...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0002 / checkpoint_4. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain circopt.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0002	5	cpp	Medium	[ "cli-tools", "parsing", "digital-circuits", "boolean-logic", "optimization", "graphviz", "equivalence-checking", "truth-tables" ]	# Part 5 — JSON and BENCH input formats ## What this part adds The `check` and `eval` commands can now read `.json` and `.bench` circuit formats, controlled by the `--format` flag. --- ## New error types Format-specific errors follow existing conventions: parse/syntax errors exit `2`, validation/semantic errors ex...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0002 / checkpoint_5. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain circopt.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0002	6	cpp	Medium	[ "cli-tools", "parsing", "digital-circuits", "boolean-logic", "optimization", "graphviz", "equivalence-checking", "truth-tables" ]	# Part 6 — Analysis commands: `stats`, `lint`, and `dot` ## What this part adds Three analysis commands: * `stats` — Circuit metrics and statistics * `lint` — Warnings about potential issues * `dot` — Graphviz DOT export All commands accept `--format` and work with all circuit formats. --- ## `stats` command ```b...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0002 / checkpoint_6. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain circopt.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0002	7	cpp	Medium	[ "cli-tools", "parsing", "digital-circuits", "boolean-logic", "optimization", "graphviz", "equivalence-checking", "truth-tables" ]	# Part 7 — Subcircuit extraction and equivalence: `cone`, `truth-table`, `equiv` ## What this part adds * `cone` — Extract subcircuit and write deterministically * `truth-table` — Exhaustive enumeration of outputs * `equiv` — Equivalence checking between circuits * `--seed` flag for reproducible randomness --- ## N...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0002 / checkpoint_7. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain circopt.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0002	8	cpp	Medium	[ "cli-tools", "parsing", "digital-circuits", "boolean-logic", "optimization", "graphviz", "equivalence-checking", "truth-tables" ]	# Part 8 — Optimization: `opt` ## What this part adds The `opt` command: a circuit optimizer with configurable passes, deterministic output, optional equivalence verification, and BENCH export support. --- ## `opt` command ```bash ./circopt opt <circuit_file> -o <out_file> [--format ...] [--out-format circ\|jso...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0002 / checkpoint_8. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain circopt.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid # clo...
problem_0003	1	cpp	Easy	[ "cli-tools", "ast-parsing", "multi-language", "pattern-matching", "metavariables", "code-transformation", "refactoring", "static-analysis", "regex", "code-search" ]	Build a command-line code searcher for Python codebases. It takes a directory of source files and a set of rules, then prints one JSON object per match (JSON Lines) to STDOUT. Supported language (by file extension only): Python (`.py`). Supported rule types (exactly these two): exact match and **va...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0003 / checkpoint_1. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain code_search.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid #...
problem_0003	2	cpp	Easy	[ "cli-tools", "ast-parsing", "multi-language", "pattern-matching", "metavariables", "code-transformation", "refactoring", "static-analysis", "regex", "code-search" ]	Extend your code searcher to support JavaScript and C++ source files. --- ## New Requirements ### File type → language Scan these extensions: \| Language \| Extensions \| \| ---------- \| -------------------------------------------- \| \| Python \| `.py` ...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0003 / checkpoint_2. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain code_search.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid #...
problem_0003	3	cpp	Easy	[ "cli-tools", "ast-parsing", "multi-language", "pattern-matching", "metavariables", "code-transformation", "refactoring", "static-analysis", "regex", "code-search" ]	Extend your code searcher to support structure-aware patterns with metavariables. Supported rule kinds: `exact`, `regex`, and `pattern`. # Deliverable Your existing executable `code_search.cpp` is extended to understand `kind: "pattern"` in the rules file: ``` ./code_search <root_dir> --rules <r...	#!/usr/bin/env bash set -euo pipefail # Benchmark test runner for problem_0003 / checkpoint_3. # Usage: # bash test.sh /path/to/submission [extra pytest args...] # # The submission directory should contain code_search.cpp and, optionally, # requirements.txt. Set BENCH_PROBLEMS_REPO to an existing checkout to avoid #...

Anonymized Iterative Code Benchmark Problems

This anonymized dataset contains multi-checkpoint coding tasks for evaluating agents under iterative specification refinement. Problem identifiers, source repository identifiers, and author fields are intentionally omitted.

Dataset summary

Language subsets: python, js, cpp, rust, java
Problems: 36
Rows: 980
Columns: problem, Checkpoint number, language, difficulty, tags, instruction, tests

Rows per subset:

python: 196 rows
js: 196 rows
cpp: 196 rows
rust: 196 rows
java: 196 rows

Category distribution:

developer-tools: 8
web: 7
data-processing: 6
cli-tools: 5
algorithms: 2
configuration-management: 2
dsl: 2
databases: 1
file-systems: 1
networking: 1
simulation: 1

Difficulty distribution:

Easy: 12
Medium: 12
Hard: 12
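
The figures above can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from this summary into plain dictionaries (the variable and function names are illustrative, not part of the dataset) and assumes each problem carries exactly one category and one difficulty label, which the card does not state explicitly but which the totals are consistent with.

```python
# Counts copied by hand from the dataset summary; nothing is downloaded here.
ROWS_PER_SUBSET = {"python": 196, "js": 196, "cpp": 196, "rust": 196, "java": 196}
CATEGORY_COUNTS = {
    "developer-tools": 8, "web": 7, "data-processing": 6, "cli-tools": 5,
    "algorithms": 2, "configuration-management": 2, "dsl": 2,
    "databases": 1, "file-systems": 1, "networking": 1, "simulation": 1,
}
DIFFICULTY_COUNTS = {"Easy": 12, "Medium": 12, "Hard": 12}
TOTAL_ROWS = 980
TOTAL_PROBLEMS = 36


def summary_is_consistent() -> bool:
    """Check that the per-subset and per-label counts add up to the stated totals."""
    return (
        sum(ROWS_PER_SUBSET.values()) == TOTAL_ROWS
        and sum(CATEGORY_COUNTS.values()) == TOTAL_PROBLEMS
        and sum(DIFFICULTY_COUNTS.values()) == TOTAL_PROBLEMS
    )


print(summary_is_consistent())  # prints True
```

Since every subset has 196 rows for the same 36 problems, the number of checkpoints per problem varies (the preview shows between 1 and 8).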

Using the `tests` field

Each row includes a `tests` value containing a complete Bash test runner for that checkpoint. Save it as a file, then run it against a submission directory that contains the entry file named in the `instruction` field.
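
Because a problem spans several checkpoints, it is often convenient to collect its rows in checkpoint order and save each runner under its own name. The following is a minimal sketch: the helper names and the `<problem>_checkpoint_<n>.sh` file naming are hypothetical, and the only things taken from the dataset are the column names `problem`, `Checkpoint number`, and `tests`. It works on any iterable of row dictionaries, so a loaded subset could be passed in place of the in-memory sample.

```python
from pathlib import Path


def checkpoints_for(rows, problem):
    """Return the rows of one problem, sorted by checkpoint number."""
    selected = [row for row in rows if row["problem"] == problem]
    return sorted(selected, key=lambda row: row["Checkpoint number"])


def save_test_script(row, directory):
    """Write a row's Bash runner into `directory` and return its path.

    The file name pattern is an illustrative choice, not a dataset convention.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    name = f"{row['problem']}_checkpoint_{row['Checkpoint number']}.sh"
    path = directory / name
    path.write_text(row["tests"], encoding="utf-8")
    path.chmod(0o755)
    return path


# Tiny in-memory stand-in for dataset rows, deliberately out of order.
sample_rows = [
    {"problem": "problem_0002", "Checkpoint number": 2, "tests": "#!/usr/bin/env bash\n"},
    {"problem": "problem_0001", "Checkpoint number": 1, "tests": "#!/usr/bin/env bash\n"},
    {"problem": "problem_0002", "Checkpoint number": 1, "tests": "#!/usr/bin/env bash\n"},
]
ordered = checkpoints_for(sample_rows, "problem_0002")
print([row["Checkpoint number"] for row in ordered])  # prints [1, 2]
```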

Test script examples

Write and run one checkpoint from Python:

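from pathlib import Path
import subprocess

from datasets import load_dataset

# Load one language subset; each row is one checkpoint of one problem.
dataset = load_dataset("anon/dsn26", "python", split="test")
row = dataset[0]

# Save the Bash runner stored in the `tests` field and make it executable.
test_path = Path("test.sh")
test_path.write_text(row["tests"], encoding="utf-8")
test_path.chmod(0o755)

# check=True raises CalledProcessError if the runner exits with a non-zero status.
subprocess.run(["bash", str(test_path), "/path/to/submission"], check=True)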

Run a saved test script directly:

bash test.sh /path/to/submission

Use a local problems checkout to avoid cloning:

BENCH_PROBLEMS_REPO=/path/to/problems bash test.sh /path/to/submission

Keep the temporary workspace for debugging:

BENCH_KEEP_WORKDIR=1 bash test.sh /path/to/submission

The runner creates a temporary workspace, copies the submission into it, materializes the relevant checkpoint tests and static assets, installs the Python test dependencies in a virtual environment, and invokes pytest. Without BENCH_PROBLEMS_REPO it clones the problems repository; without BENCH_KEEP_WORKDIR=1 the temporary workspace is not kept.
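
When the runner is driven from Python rather than a shell, the two environment variables can be passed through `subprocess`. This is a sketch under stated assumptions: `runner_env` and `run_checkpoint` are hypothetical helper names, and treating a zero exit status as "all tests passed" is an assumption based on the script starting with `set -euo pipefail`, not something the card guarantees.

```python
import os
import subprocess


def runner_env(problems_repo=None, keep_workdir=False, base_env=None):
    """Build the environment for the test runner.

    Only the two variables documented above are set; everything else is
    inherited from `base_env` (or the current process environment).
    """
    env = dict(os.environ if base_env is None else base_env)
    if problems_repo is not None:
        env["BENCH_PROBLEMS_REPO"] = str(problems_repo)
    if keep_workdir:
        env["BENCH_KEEP_WORKDIR"] = "1"
    return env


def run_checkpoint(test_script, submission_dir, **env_options):
    """Run one saved runner script; return True if it exits with status 0.

    Assumption: a non-zero exit status means the checkpoint failed.
    """
    result = subprocess.run(
        ["bash", str(test_script), str(submission_dir)],
        env=runner_env(**env_options),
    )
    return result.returncode == 0
```

A call such as `run_checkpoint("test.sh", "/path/to/submission", problems_repo="/path/to/problems", keep_workdir=True)` then corresponds to combining the two shell invocations shown earlier.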

For the cpp, rust, and java subsets, the runner compiles the declared source file with g++, rustc, or javac, respectively, before running the Python test harness. The js subset expects node to be available.
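
A quick pre-flight check can report missing tools before a long run starts. The sketch below maps each subset to the tool named above and looks it up with `shutil.which`. The helper name `missing_tools` is hypothetical, and including `bash` and `python3` for every subset is an assumption drawn from the runner being a Bash script that invokes pytest; the exact interpreter name the runner uses is not stated here.

```python
import shutil

# Tool each subset needs, as described above. The python subset has no
# extra compiler or runtime beyond the common tools.
SUBSET_TOOL = {"cpp": "g++", "rust": "rustc", "java": "javac", "js": "node"}

# Assumed common requirements: the runner is a Bash script and runs pytest.
COMMON_TOOLS = ["bash", "python3"]


def missing_tools(language, which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    tools = list(COMMON_TOOLS)
    extra = SUBSET_TOOL.get(language)
    if extra is not None:
        tools.append(extra)
    return [tool for tool in tools if which(tool) is None]
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use `missing_tools("cpp")` returns an empty list when everything is installed.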
