feat(coding-agent): FastContext explore adapter with hint and agent modes #3164
Open
oldschoola wants to merge 26 commits into can1357:main from oldschoola:fastcontext-explore-adapter
Commits
222e8f8 feat(coding-agent): FastContext explore adapter with hint and agent m… (oldschoola)
dbf2061 Definition-site boost + CamelCase identifier extraction. Fixed 2/5 mi… (oldschoola)
823cc76 Glob-before-grep merge order + identifier segment globs + identifier-… (oldschoola)
298fa89 Graduated multiplicative penalties (semble_rs-inspired). Replaced bin… (oldschoola)
e22775d Added programming keywords (function, class, enum, interface, struct,… (oldschoola)
165337b Added CHANGELOG entry for FastContext ranking improvements. No code c… (oldschoola)
c4a2947 Lower-camelCase extraction with property-access and verb-position fil… (oldschoola)
60e7d56 Updated CHANGELOG entry to include lower-camelCase extraction with pr… (oldschoola)
34530b6 Fixed untilAborted stress-case miss. Two changes: (1) segment prefix … (oldschoola)
8801893 Updated CHANGELOG and skill to include segment prefix globs and defin… (oldschoola)
6a0452e Remove dead strongIdentifierKeywords() and fix stale comment. The fun… (oldschoola)
cb5ef72 Fix barrel file retrieval: expand directory glob matches + barrel boo… (oldschoola)
1627a96 Add CHANGELOG entry for barrel file retrieval fix (oldschoola)
4174989 Fix directory expansion to respect gitignore/hidden filtering. Replac… (oldschoola)
33d8aea Fix three issues: symlink P2, agent-tool-type miss, directory-segment… (oldschoola)
3014fa4 Add CHANGELOG entries for symlink fix and agent-tool-type fix (oldschoola)
c391f12 feat(coding-agent): FastContext registry backend — route through any … (oldschoola)
9ca2954 feat(coding-agent): recommend devin/swe-1-6-fast for FastContext; str… (oldschoola)
659afe0 style(coding-agent): biome-format the hint-timeout block (oldschoola)
09ced99 feat(coding-agent): make FastContext MAX_READ_LINES env-tunable (FC_M… (oldschoola)
f37b603 feat(coding-agent): FastContext ranking improvements, prompt refineme… (oldschoola)
941c10f chore: remove autoresearch scratch files (_fc_*.md) from the branch (oldschoola)
0997238 fix(devin): clamp temperature:0 to 0.01 floor — Devin API rejects tem… (oldschoola)
e9d6e65 feat(coding-agent): add FC_MAX_TURNS/FC_RESULT_MAX_LINES/FC_LINE_MAX_… (oldschoola)
0340f10 refactor(fast-context): remove useless env var aliases, keep document… (oldschoola)
0d665ed feat(fast-context): add fastContext.maxTurns UI setting (oldschoola)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,213 @@ | ||
| # FastContext | ||
|
|
||
| FastContext is an opt-in local model adapter that accelerates codebase exploration. It runs a small local model (FastContext-1.0-4B) to expand natural-language queries into search plans, then executes them with native ripgrep/glob — returning a compact ranked file list and optional snippets in ~2.5s instead of 10–30s. | ||
|
|
||
| ## What it does | ||
|
|
||
| When enabled, the bundled `explore` subagent calls `fast_context` **first** for broad repository-retrieval queries. Without FastContext, explore uses multiple `search`/`find`/`read` tool calls (10K–180K tokens per exploration). FastContext compresses this into a single ~70-token packet — **~95% token savings**. | ||
|
|
||
| If FastContext returns no results, the explore subagent automatically falls back to normal search/find/read. | ||
|
|
||
| ## Setup guide | ||
|
|
||
| This guide is written so that an omp agent can follow it step-by-step to set up FastContext for a user. | ||
|
|
||
| ### Step 1: Install llama.cpp | ||
|
|
||
| llama.cpp provides the `llama-server` executable that serves an OpenAI-compatible API locally. | ||
|
|
||
| **Windows (prebuilt):** | ||
| 1. Download the latest Windows build from [llama.cpp releases](https://github.com/ggml-org/llama.cpp/releases): the CUDA build (`llama-*-bin-win-cuda-cu*.*.zip`) if you have an NVIDIA GPU, otherwise the CPU build. | ||
| 2. Extract to a permanent location, e.g. `C:\llama\llama.cpp\`. | ||
| 3. Verify: `C:\llama\llama.cpp\llama-server.exe --version` | ||
|
|
||
| **macOS:** | ||
| ```bash | ||
| brew install llama.cpp | ||
| ``` | ||
| The binary will be at `$(brew --prefix)/bin/llama-server`. | ||
|
|
||
| **Linux:** Build from source — see [llama.cpp build instructions](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md). | ||
|
|
||
| ### Step 2: Download a FastContext model | ||
|
|
||
| Download a FastContext-1.0-4B GGUF model. Two variants are available: | ||
|
|
||
| | Model | File | Hit rate | Recommendation | | ||
| |---|---|---|---| | ||
| | **FastContext-1.0-4B-RL** | `fastcontext-1.0-4b-rl-q4_k_m.gguf` | 100% (2/2 runs) | ✅ **Use this one** | | ||
| | FastContext-1.0-4B-SFT | `fastcontext-1.0-4b-sft-q4_k_m.gguf` | 93.75% avg (missed cases) | Not recommended | | ||
|
|
||
| The RL model is fine-tuned with a retrieval reward signal — it learns which search plans actually find the right file, not just which plans look plausible. In benchmarks, the RL model hit 8/8 cases in every run, while the SFT model occasionally missed cases by surfacing plausible-sounding but wrong files (e.g. `model-roles.ts` instead of `model-resolver.ts`). Token cost is identical (~95% savings either way); the difference is purely in retrieval accuracy. | ||
|
|
||
| Both are ~2.5GB (Q4_K_M quantization, 4B parameters). Place the `.gguf` file in a models directory, e.g. `C:\llama\models\`. | ||
|
|
||
| ### Step 3: Start the server | ||
|
|
||
| #### Quick start (CPU) | ||
|
|
||
| ```bash | ||
| llama-server --model fastcontext-1.0-4b-rl-q4_k_m.gguf --port 8080 --ctx-size 4096 | ||
| ``` | ||
|
|
||
| #### GPU-accelerated (recommended for NVIDIA GPUs) | ||
|
|
||
| For a 16GB VRAM GPU, use a large context with quantized KV cache for whole-repo FastContext queries: | ||
|
|
||
| ```bash | ||
| llama-server \ | ||
| --model fastcontext-1.0-4b-rl-q4_k_m.gguf \ | ||
| -dev CUDA0 \ | ||
| -ngl auto \ | ||
| -c 200000 \ | ||
| -ctk q8_0 \ | ||
| -ctv q8_0 \ | ||
| -fa on \ | ||
| -np 1 \ | ||
| -n 512 \ | ||
| -fitt 6144 \ | ||
| --host 127.0.0.1 \ | ||
| --port 8080 | ||
| ``` | ||
|
|
||
| Key flags: | ||
| - `-dev CUDA0`: use the first CUDA device (the first NVIDIA GPU) | ||
| - `-ngl auto`: let llama-server decide how many layers to offload to the GPU (all of them when they fit) | ||
| - `-c 200000`: 200K-token context window (fits large workspace listings) | ||
| - `-ctk q8_0 -ctv q8_0`: quantize the KV cache to Q8 (roughly halves KV-cache memory compared with the default f16, with negligible quality loss) | ||
| - `-fa on`: enable flash attention for faster inference | ||
| - `-np 1`: single slot (the full 200K context is available to each request) | ||
| - `-n 512`: cap output at 512 tokens per request (FastContext plans are ~30–80 tokens) | ||
| - `-fitt 6144`: short form of `--fit-target`, the free-memory margin in MiB that llama-server keeps when automatically fitting the model and context to the device (confirm with `llama-server --help` for your build) | ||
|
|
||
| #### Windows batch script | ||
|
|
||
| Create `C:\llama\server-fastcontext-gpu.bat`: | ||
|
|
||
| ```bat | ||
| @echo off | ||
| setlocal | ||
| set "ROOT=%~dp0" | ||
| set "MODEL=%ROOT%models\fastcontext-1.0-4b-rl-q4_k_m.gguf" | ||
|
|
||
| if not exist "%MODEL%" ( | ||
| echo Model not found: "%MODEL%" | ||
| exit /b 1 | ||
| ) | ||
|
|
||
| "%ROOT%llama.cpp\llama-server.exe" -m "%MODEL%" -dev CUDA0 -ngl auto -c 200000 -ctk q8_0 -ctv q8_0 -fa on -np 1 -n 512 -fitt 6144 --host 127.0.0.1 --port 8080 %* | ||
| ``` | ||
|
|
||
| Then start the server: | ||
| ```cmd | ||
| C:\llama\server-fastcontext-gpu.bat | ||
| ``` | ||
|
|
||
| #### Verify the server is running | ||
|
|
||
| ```bash | ||
| curl http://127.0.0.1:8080/v1/models | ||
| ``` | ||
|
|
||
| Should return JSON with the model id. Also check health: | ||
| ```bash | ||
| curl http://127.0.0.1:8080/health | ||
| ``` | ||
|
|
||
| Should return `{"status":"ok"}`. | ||
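The two checks above can also be scripted. The following is a minimal TypeScript sketch, not code from this PR: `isHealthy`, `modelIds`, and `checkServer` are hypothetical helper names, and the response shapes are assumed to be `{"status":"ok"}` for `/health` and an OpenAI-style `{"data":[{"id": ...}]}` list for `/v1/models`.

```typescript
// Hypothetical helpers for illustration; not part of omp or llama.cpp.

// Minimal shape of the fetch function we need, so a fake can be injected in tests.
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// True when a /health body looks like {"status":"ok"}.
function isHealthy(body: unknown): boolean {
  return typeof body === "object" && body !== null && (body as { status?: unknown }).status === "ok";
}

// Extracts model ids from an OpenAI-style /v1/models body; ignores malformed entries.
function modelIds(body: unknown): string[] {
  if (typeof body !== "object" || body === null) return [];
  const data = (body as { data?: unknown }).data;
  if (!Array.isArray(data)) return [];
  return data
    .map((entry) => (entry as { id?: unknown } | null)?.id)
    .filter((id): id is string => typeof id === "string");
}

// Queries both endpoints and summarizes the result.
async function checkServer(
  baseUrl: string,
  fetchImpl: FetchLike,
): Promise<{ healthy: boolean; modelIds: string[] }> {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const health = await fetchImpl(`${root}/health`);
  const models = await fetchImpl(`${root}/v1/models`);
  return {
    healthy: health.ok && isHealthy(await health.json()),
    modelIds: models.ok ? modelIds(await models.json()) : [],
  };
}
```

In a runtime with a global `fetch` (Bun, or Node 18 and later), `checkServer("http://127.0.0.1:8080", fetch)` should resolve to `healthy: true` and a non-empty `modelIds` list once the model is loaded.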
|
|
||
| ### Step 4: Enable FastContext in omp | ||
|
|
||
| ```bash | ||
| omp config set fastContext.enabled true | ||
| ``` | ||
|
|
||
| Or interactively: `/settings` → **Context** tab → **Fast Context** group → toggle **Enable FastContext**. | ||
|
|
||
| If you are logged in to **Devin**, you can skip the server and model setup in Steps 1–3 entirely: FastContext automatically uses `devin/swe-1-6-fast`, so no local llama.cpp server is needed. Otherwise, complete the local server setup above, or pick a provider model from the **FastContext Model** dropdown in `/settings`. | ||
|
|
||
| ### Step 5: Verify it works | ||
|
|
||
| Start an omp session and ask the explore subagent to find something: | ||
| ``` | ||
| explore "Find where the FastContext adapter tool class is defined" | ||
| ``` | ||
|
|
||
| If FastContext is working, the explore subagent will call `fast_context` first and return results in ~2–3s. If it fails or returns nothing, the subagent falls back to normal search automatically. | ||
|
|
||
| ## Settings | ||
|
|
||
| All settings appear in `/settings` under **Context → Fast Context**. The `model` and `baseUrl` fields are hidden until `enabled` is toggled on. | ||
|
|
||
| | Setting | Default | Description | | ||
| |---|---|---| | ||
| | `fastContext.enabled` | `false` | Toggle the FastContext adapter on/off. | | ||
| | `fastContext.model` | *(auto)* | Model for query expansion. Pick a provider model (e.g. `devin/swe-1-6-fast`, `zai/glm-5-turbo`) to route through your provider credentials — no local server needed — or **Local llama.cpp server**. When unset and Devin is logged in, `devin/swe-1-6-fast` is used automatically (a `local` sentinel or bare id forces the local server). | | ||
| | `fastContext.mode` | `hint` | Retrieval mode: **Hint** (default — one turn → native search, ~2-3s) or **Agent** (full multi-turn Read/Glob/Grep loop, slower and more thorough). Set in `/settings` → Context → Fast Context. | | ||
| | `fastContext.baseUrl` | `http://127.0.0.1:8080` | Base URL for the local OpenAI-compatible chat completions endpoint. Only shown when the model is set to the local server. | | ||
|
|
||
| ### YAML config | ||
|
|
||
| ```yaml | ||
| fastContext: | ||
|   enabled: true | ||
|   model: "" # auto: devin/swe-1-6-fast if Devin is logged in, else local server | ||
|   mode: hint # hint (fast, default) or agent (full multi-turn loop) | ||
|   baseUrl: http://127.0.0.1:8080 # only used by the local server backend | ||
| ``` | ||
|
|
||
| ### Using LM Studio or Ollama instead of llama.cpp | ||
|
|
||
| Any OpenAI-compatible local endpoint works — just point `baseUrl` at the port: | ||
|
|
||
| ```bash | ||
| omp config set fastContext.baseUrl http://127.0.0.1:1234 # LM Studio | ||
| omp config set fastContext.baseUrl http://127.0.0.1:11434 # Ollama | ||
| ``` | ||
|
|
||
| ### Using a cloud provider model instead of a local server | ||
|
|
||
| If you don't want to run a local model, point FastContext at any registered provider model by setting `fastContext.model` to a provider-prefixed id. FastContext resolves it through the model registry using your configured credentials and calls it directly — no llama.cpp/LM Studio/Ollama required. | ||
|
|
||
| ```bash | ||
| omp config set fastContext.enabled true | ||
| omp config set fastContext.model devin/swe-1-6-fast # Devin SWE-1.6 fast tier (login: /login devin) — 100% retrieval @ ~1.6s hint / ~3.3s agent, no local GPU | ||
| # other devin tiers: devin/swe-1-6 (standard), devin/swe-1-6-slow (reasoning-heavy, ~34s hint) | ||
| # or any provider model: zai/glm-5-turbo, openai-codex/gpt-5.5, pi/smol, ... | ||
| ``` | ||
|
|
||
| A provider-prefixed value (containing `/`) always selects the registry path; a bare id or blank keeps the local-endpoint behavior above. Both hint and agent modes are supported. Agent mode reuses one cascade id across all turns so the provider can thread the conversation. The `devin/swe-1-6-fast` tier (Cerebras 950 tok/s, same intelligence as `swe-1-6`) is the fastest option — faster than even a local 4B model — while `swe-1-6-slow` is reasoning-heavy and much slower per turn. | ||
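The selection rule described above can be sketched as a small function. This is an illustration under stated assumptions, not the adapter's actual code: `resolveBackend` and the `Backend` type are hypothetical names, and the blank-value behavior follows the settings table (Devin default when logged in, otherwise the local server).

```typescript
// Hypothetical sketch of backend selection; names and types are not from the PR.
type Backend =
  | { kind: "registry"; modelId: string } // resolved through the model registry
  | { kind: "local"; modelId: string | undefined }; // local OpenAI-compatible endpoint

const DEVIN_DEFAULT = "devin/swe-1-6-fast";

function resolveBackend(model: string | undefined, devinLoggedIn: boolean): Backend {
  const value = (model ?? "").trim();
  // A provider-prefixed id (contains "/") always selects the registry path.
  if (value.includes("/")) return { kind: "registry", modelId: value };
  // Unset: use the Devin default when logged in, otherwise the local server.
  if (value === "") {
    return devinLoggedIn
      ? { kind: "registry", modelId: DEVIN_DEFAULT }
      : { kind: "local", modelId: undefined };
  }
  // The "local" sentinel or a bare id forces the local server.
  return { kind: "local", modelId: value === "local" ? undefined : value };
}
```

Keeping the rule in one pure function makes the precedence explicit: an explicit provider id wins, then the Devin default, then the local endpoint.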
|
|
||
| ## How it works | ||
|
|
||
| ### Hint mode (default, ~2.5s) | ||
|
|
||
| 1. The explore subagent calls `fast_context` with a natural-language query. | ||
| 2. FastContext sends the query to the local model, which returns a plan: keywords, glob patterns, grep patterns, and search paths. | ||
| 3. Native ripgrep and glob execute the plan in parallel — no model inference during search. | ||
| 4. Results are ranked by path-keyword matches, content-keyword density, and grep/glob match signals. | ||
| 5. A compact packet (`[FC hint: N files]` + file list + optional snippets) is returned. | ||
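The ranking in step 4 can be pictured with a small scoring function. The sketch below is only an illustration: the `Candidate` shape, the function names, and every weight are assumptions chosen for readability, not the values used by the adapter. It combines the three signals named above and caps the result at 20 files, with grep matches weighted above glob matches.

```typescript
// Hypothetical ranking sketch; the weights and names are illustrative assumptions.
interface Candidate {
  path: string;
  content: string;
  grepHit: boolean; // file matched a grep pattern from the plan
  globHit: boolean; // file matched a glob pattern from the plan
}

function scoreCandidate(c: Candidate, keywords: string[]): number {
  const path = c.path.toLowerCase();
  const words = c.content.toLowerCase().split(/[^a-z0-9_]+/).filter(Boolean);
  let pathMatches = 0;
  let contentMatches = 0;
  for (const kw of keywords.map((k) => k.toLowerCase())) {
    if (path.includes(kw)) pathMatches += 1;
    contentMatches += words.filter((w) => w.includes(kw)).length;
  }
  // Density: share of content words that contain a keyword.
  const density = words.length === 0 ? 0 : contentMatches / words.length;
  // Grep hits are weighted above glob hits.
  return pathMatches * 3 + density * 10 + (c.grepHit ? 2 : 0) + (c.globHit ? 1 : 0);
}

function rankCandidates(candidates: Candidate[], keywords: string[], limit = 20): string[] {
  return candidates
    .map((c) => ({ path: c.path, score: scoreCandidate(c, keywords) }))
    .sort((a, b) => b.score - a.score || a.path.localeCompare(b.path)) // ties broken by path
    .slice(0, limit)
    .map((s) => s.path);
}
```

With keywords such as `resolver` and `model`, a grep-matched `model-resolver.ts` would outrank a glob-only `model-roles.ts` under these weights, because it matches both keywords in its path and carries the grep signal.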
|
|
||
| If the model returns an empty plan, a query-derived fallback extracts keywords from the query itself and runs the same grep/glob/ranking pipeline. | ||
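As a rough picture of that fallback, the sketch below derives keywords from the query text alone. It is an assumption-laden illustration: the stop-word list, the minimum length of three characters, and the CamelCase splitting are guesses at reasonable behavior, and `fallbackKeywords` is a hypothetical name.

```typescript
// Hypothetical query-derived keyword extraction; not the adapter's actual rules.
const STOP_WORDS = new Set([
  "the", "and", "for", "are", "where", "find", "how", "what", "which", "that", "this", "does",
]);

// "FastContext" -> ["Fast", "Context"]; plain lowercase words come back unchanged.
function splitIdentifier(token: string): string[] {
  return token.replace(/([a-z0-9])([A-Z])/g, "$1 $2").split(" ");
}

function fallbackKeywords(query: string): string[] {
  const tokens = query.match(/[A-Za-z_][A-Za-z0-9_]*/g) ?? [];
  const seen = new Set<string>();
  const out: string[] = [];
  for (const token of tokens) {
    // Keep the whole identifier first, then its CamelCase segments.
    for (const part of [token, ...splitIdentifier(token)]) {
      const kw = part.toLowerCase();
      if (kw.length < 3 || STOP_WORDS.has(kw) || seen.has(kw)) continue;
      seen.add(kw);
      out.push(kw);
    }
  }
  return out;
}
```

The resulting keywords would then feed the same grep, glob, and ranking pipeline as a model-produced plan.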
|
|
||
| ### Agent mode (~25–45s) | ||
|
|
||
| Agent mode runs a full multi-turn agentic loop where the model calls `Read`, `Glob`, and `Grep` tools directly. Slower but the model can read file contents and refine searches. Hint mode is recommended for interactive use. | ||
|
|
||
| ## Performance | ||
|
|
||
| Measured on the oh-my-pi repo (8 cross-package queries, FastContext-1.0-4B-RL-Q4_K_M, NVIDIA GPU): | ||
|
|
||
| | Metric | Without FastContext | With FastContext (hint) | | ||
| |---|---|---| | ||
| | Hit rate | — | 95–100% | | ||
| | Latency | 10–30s (multiple tool calls) | ~2.5s (single LLM turn + native search) | | ||
| | Token cost | 10K–180K per exploration | ~70 tokens per packet | | ||
| | Token savings | — | ~95% aggregate | | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| - **"FastContext hint failed: HTTP connection refused"**: The llama.cpp server isn't running. Start it with the batch script or command from Step 3. | ||
| - **No results / empty hint**: The model may return an empty plan, in which case FastContext automatically falls back to query-derived grep. Check that the model is loaded (`curl http://127.0.0.1:8080/v1/models`). | ||
| - **Slow responses**: Without GPU offload (`-ngl`), the 4B model takes ~4–6s per turn on CPU; with GPU offload it takes ~1.5s. Make sure `-ngl auto` is set. | ||
| - **Out of memory (OOM)**: Reduce the context size (`-c 8192` instead of `-c 200000`) and keep KV cache quantization enabled (`-ctk q8_0 -ctv q8_0`), since removing it increases memory use. The 200K context requires ~16GB VRAM with Q8 KV cache. | ||
| - **Wrong files returned**: FastContext returns up to 20 candidate files. The ranking pipeline prioritizes files with query keywords in their path or content. Grep-matched files are boosted above glob-matched files. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| import { describe, expect, test } from "bun:test"; | ||
| import { resolveDevinTemperature } from "@oh-my-pi/pi-ai/providers/devin"; | ||
|
|
||
| describe("resolveDevinTemperature", () => { | ||
| test("clamps temperature 0 to a near-zero floor", () => { | ||
| // The Devin agent API rejects temperature: 0 with invalid_argument. | ||
| // Callers requesting deterministic output (FastContext hint mode) must | ||
| // get a clamped value instead of passing 0 through. | ||
| expect(resolveDevinTemperature(0)).toBeGreaterThan(0); | ||
| expect(resolveDevinTemperature(0)).toBeLessThanOrEqual(0.01); | ||
| }); | ||
|
|
||
| test("passes through non-zero temperatures unchanged", () => { | ||
| expect(resolveDevinTemperature(0.3)).toBe(0.3); | ||
| expect(resolveDevinTemperature(0.4)).toBe(0.4); | ||
| expect(resolveDevinTemperature(1)).toBe(1); | ||
| }); | ||
|
|
||
| test("defaults to 0.4 when undefined", () => { | ||
| expect(resolveDevinTemperature(undefined)).toBe(0.4); | ||
| }); | ||
| }); |
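For orientation, a function that would satisfy these three tests could look like the sketch below. The real implementation in `@oh-my-pi/pi-ai/providers/devin` is not shown on this page, so the constant names and the choice to clamp any value at or below zero are assumptions; only the 0.01 floor and the 0.4 default come from the commit message and the tests.

```typescript
// Hypothetical sketch consistent with the tests above; not the PR's implementation.
const DEVIN_MIN_TEMPERATURE = 0.01; // floor taken from the commit message
const DEVIN_DEFAULT_TEMPERATURE = 0.4; // default asserted by the third test

function resolveDevinTemperature(temperature: number | undefined): number {
  // No explicit temperature: fall back to the default.
  if (temperature === undefined) return DEVIN_DEFAULT_TEMPERATURE;
  // The Devin API rejects 0, so clamp to the floor (assumed to cover negatives too).
  return temperature <= 0 ? DEVIN_MIN_TEMPERATURE : temperature;
}
```

Clamping only at zero, instead of applying `Math.max(temperature, 0.01)` everywhere, keeps every positive value untouched, which is what the pass-through test requires for 0.3, 0.4, and 1.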
This opens a ```bash fence for the provider-model example but never closes it, so the explanatory paragraph and the rest of the FastContext guide render as part of the code block. In rendered docs, the “How it works” and troubleshooting sections stop being headings/lists, making the setup page hard to follow; add a closing fence after the command snippet.