# ADR 046: Custom Agentic Code Review

- HTML version: https://robbiepalmer.me/projects/personal-site/adrs/046-custom-agentic-code-review
- Project: Personal Site (https://robbiepalmer.me/projects/personal-site.md)
- Status: Proposed
- Date: 2026-06-21

# Context

[ADR 009: CodeRabbit](/projects/personal-site/adrs/009-code-rabbit) records why this project uses automated review. [ADR 041: Gemini Code Assist](/projects/personal-site/adrs/041-gemini-code-assist), [ADR 042: Greptile](/projects/personal-site/adrs/042-greptile), [ADR 044: Codex Code Review](/projects/personal-site/adrs/044-codex-code-review), and [ADR 045: Qodo](/projects/personal-site/adrs/045-qodo) record the successive attempts to maintain diverse review coverage. This ADR does not repeat that rationale.

The specialist review services are becoming a less reliable foundation for this workflow:

1. Free allowances are being reduced or exhausted, while pricing is increasingly used to ration review frequency.
2. Gemini Code Assist's consumer review service is being retired.
3. Review vendors are investing more heavily in enterprise requirements such as compliance controls, centralised reporting, administration, integrated linting, and workflow consolidation.
4. Those capabilities can be valuable to an enterprise, where consolidating procurement, policy, and reporting may outweigh having the best individual tool for every task. They provide little value to this solo open-source workflow, which can select a specialised tool when equivalent functionality is actually needed.

The issue is not that enterprise features are inherently undesirable. It is that paying for a broader platform can produce a worse fit when its differentiating investment does not improve the review feedback or iteration speed this project needs.

ADR 045 concludes that **CodeRabbit Pro at $30 per month** is the strongest tactical paid response if review limits continue to constrain velocity. It preserves a dedicated product with known review quality and five reviews per hour. However, that remains exposure to a vendor whose free allowance, pricing, and roadmap have already shifted.

OpenRouter is already part of the technology stack and provides a single API across a large, fast-changing model catalogue with pay-as-you-go pricing, **[model fallbacks, provider routing and failover](https://openrouter.ai/docs/guides/routing/provider-selection)**, budget controls, and optional Zero Data Retention. It can route a requested model across available providers and try configured fallback models when a model or provider fails. Frontier open-weight model families such as **[Kimi K2](https://huggingface.co/collections/moonshotai/kimi-k2)** increasingly target coding, reasoning, and agentic workloads. This creates an opportunity to buy inference directly and own the small amount of review-specific orchestration rather than buying another review platform.

This challenges the normal preference to buy rather than build. A focused vendor should usually produce a better product through domain expertise and sustained investment. Coding agents change the implementation economics: a narrow workflow tailored to one repository can now be created and changed cheaply, while SaaS vendors must price and design for a much broader market. That does not make custom software automatically better; it makes a bounded experiment economically credible.

# Decision

I propose evaluating the repository's custom **OpenRouter-based multi-model Pull Request reviewer** as a complement to the existing services.

The current implementation sends each eligible diff to four configurable scout models. The scouts independently return structured findings covering correctness, security, reliability, data loss, concurrency, and material performance. A separate merger model deduplicates findings and reconciles them with resolved GitHub review threads without judging whether a finding is correct.

The initial scout configuration favours frontier open-weight coding and reasoning model families, including Kimi, DeepSeek, GLM, and Qwen. These exact models are configuration rather than architecture: they should be replaced as stronger or more cost-effective models become available, and OpenRouter fallbacks can preserve a review run when a preferred model or provider is unavailable. A closed-weight merger may be used where it improves reliable orchestration because foundation-model purity is not the objective; useful, diverse, economical review is.

The workflow records findings retained from each model, invalid and out-of-scope responses, failures, and cumulative cost. This scorecard makes model selection empirical rather than reputational and creates a continuing evaluation of frontier open-weight models that can inform other projects.

The experiment will remain advisory and will not become a required merge check initially. It will be accepted only if representative Pull Requests show that it:

* produces actionable findings that are complementary to the managed reviewers;
* keeps false positives and duplicated findings at a manageable level;
* remains at or below the $30 monthly cost of CodeRabbit Pro for the required review volume;
* does not require enough maintenance or prompt tuning to distract materially from product work; and
* provides useful evidence about which open-weight models perform well on real repository changes.

If these conditions are not met, the workflow should be disabled rather than expanded into a general review platform. CodeRabbit Pro remains the default tactical fallback.

# Alternatives

## Subscribe to CodeRabbit Pro

* **Pros**: Buys the strongest existing reviewer, retains accumulated learnings, requires almost no maintenance, and provides five reviews per hour for $30 month-to-month.
* **Cons**: Retains exposure to CodeRabbit's pricing and roadmap and pays for an increasingly broad platform whose enterprise-oriented additions do not improve this workflow proportionally.
* **Decision**: Deferred, not rejected. This remains the lower-risk tactical response and the fallback if the custom experiment cannot match its useful coverage within the same budget.

## Continue Adding Free Managed Reviewers

* **Pros**: Avoids implementation and direct inference costs while each programme remains available.
* **Cons**: Repeats the provider churn documented by the preceding ADRs. Free allowances can be restricted, correlated reviewers can create an illusion of diversity, and product discontinuations force repeated replacement work.
* **Decision**: Rejected as the sole strategy. Free reviewers can remain useful inputs, but they do not provide a stable core.

## Adopt an Existing Open-Source Review Framework

* **Pros**: Starts with an established integration, community-maintained prompts, and support for common Git providers and model APIs.
* **Cons**: The evaluated frameworks impose abstractions and workflows that do not fit the desired scout-and-merger ensemble. Their model assumptions and integration patterns already appear behind the current frontier, while adapting or removing those constraints would require comparable effort to the focused implementation.
* **Decision**: Rejected for this experiment. Reconsider if a framework demonstrates that it reduces maintenance without constraining model evaluation or review behaviour.

## Use One Frontier Model

* **Pros**: Simpler orchestration, lower inference cost, and easier diagnosis of poor output.
* **Cons**: Recreates the blind spots and model dependency that motivated multiple reviewers. It also provides less comparative evidence about open-weight model performance.
* **Decision**: Rejected as the default design. A single model remains a valid outcome if the scorecard shows that the ensemble adds cost without retaining additional useful findings.

## Build a General Review Platform

* **Pros**: Could eventually provide dashboards, policy management, repository indexing, linting, analytics, and reusable integrations comparable to commercial services.
* **Cons**: Recreates the enterprise feature surface that is irrelevant to this decision and turns a focused experiment into a distracting product.
* **Decision**: Rejected. The scope is one repository's review loop, not a CodeRabbit competitor.

# Consequences

## Positive

* **Control of Review Behaviour**: Prompts, severity policy, model mix, deduplication, ignored files, and review frequency can match this repository rather than a vendor's median customer.
* **Model Agility**: OpenRouter allows models and providers to be changed without replacing the integration.
* **Inference Resilience**: Provider routing and model fallbacks reduce the chance that one unavailable or rate-limited model endpoint prevents a review.
* **Potential Cost Efficiency**: Pay-as-you-go inference may provide more useful reviews within the $30 CodeRabbit Pro comparison budget.
* **Reduced Review-SaaS Risk**: A specialist review vendor can change its allowance, pricing, or product direction without removing the custom workflow.
* **Open-Weight Evaluation**: Real Pull Requests provide continuous evidence about frontier open-weight coding models that can inform other work.
* **Measurable Model Selection**: Per-model retention, failure, validity, and cost data support removing weak or uneconomical scouts.
* **Focused Product Surface**: The implementation only needs review functionality that improves this workflow; compliance dashboards and adjacent enterprise features remain out of scope.

## Negative

* **Maintenance Ownership**: GitHub API changes, model deprecations, schema failures, prompt regressions, and provider errors become this project's responsibility.
* **Opportunity Cost**: Improving review orchestration can become an attractive distraction from delivering the site and Recipe Site.
* **Uncertain Economics**: Large diffs, long context, multiple scouts, and a merger can make per-review cost unpredictable. OpenRouter also charges a credit-purchase fee and model prices can change.
* **Unproven Quality**: Model benchmark strength does not guarantee useful review comments on this repository. The ensemble may remain worse than CodeRabbit despite using more models.
* **Remaining Platform Dependencies**: OpenRouter routing and fallbacks reduce dependence on any selected model or inference provider, but the design still depends on GitHub Actions and OpenRouter itself. It reduces review-SaaS and inference-provider risk rather than eliminating platform risk.
* **Security Sensitivity**: The workflow uses a secret in `pull_request_target`. It must continue executing trusted reviewer code from the base branch, treat the Pull Request as untrusted text, and never check out or execute contributor code while the secret is available.
* **Data Exposure**: Diffs and selected repository context pass through OpenRouter and inference providers. OpenRouter's [`data_collection: "deny"`](https://openrouter.ai/docs/guides/routing/provider-selection) filters out providers that may collect or store user data, but it is not the same guarantee as [`zdr: true`](https://openrouter.ai/docs/guides/features/zdr), which restricts routing to endpoints with a formal Zero Data Retention policy. Both controls should be enabled when their model availability and cost trade-offs are acceptable.
* **No Vendor Support for Findings**: False positives, outages, billing anomalies, and review-policy decisions cannot be delegated to a dedicated review vendor.

# Experiment Review

The proposal should move to **Accepted** only after its scorecard demonstrates useful complementary findings, acceptable noise, cost at or below the CodeRabbit Pro comparison, and low ongoing maintenance. It should move to **Rejected** if achieving those outcomes requires building broader platform features or sustained work that exceeds the value of buying CodeRabbit Pro.

---

Markdown index of this site: https://robbiepalmer.me/llms.txt