Tricking GitHub Copilot into reviewing PRs in Gitea
Building an AI code reviewer for Gitea using the Copilot SDK and MCP tools.
Xe Iaso wrote about building a review bot that uses a self-hosted LLM to review GitHub pull requests. It was an interesting idea, and when I found out about SDKs for "AI" CLIs, I wondered if I could do something similar, but using a harness built and maintained by a billion-dollar corporation instead of handling the loop and execution environment myself.
Disclaimer: I chose Copilot for three reasons: it was free to use, there's novelty in having a GitHub thing do work for a platform other than GitHub, and these companies don't need more money, especially for a silly experiment.
Contrary to the title, I didn't actually "trick" Copilot. Turns out these AI coding CLIs have SDKs that let you interact with them programmatically, kinda like you would with a RESTful API, except in the same non-deterministic way the CLIs behave, plus they can still go through their normal loop and call tools. I happened to use Copilot, but it seems like this is a common pattern amongst these tools. I should also note that I have reservations about AI tooling in general (a topic for another post that's been sitting in my drafts). This post is about a pattern I found interesting, not an endorsement of any particular product, and especially not an endorsement/condoning of the practices of the companies behind them.
The approach
Xe's setup triggers via a GitHub Actions workflow when someone comments /reviewbot on a PR. It uses an OpenAI-compatible API backed by a self-hosted model running on a DGX Spark, with an agentic loop that has two tools: one for executing Python (to analyze the codebase) and one for submitting the review. The bot also clones the repo so the model has filesystem access through Python execution.
My approach was different, yet similar. A webhook listener watches for a specific pattern in PR comments and, when a comment matches, fires off a review. The review starts with the Go process fetching information about the PR (the diff, existing comments, and so on). It then spins up the Copilot CLI with a prompt crafted from the collected context and a set of custom tools. The agent explores the codebase and calls the tools as needed, and when it's done, the review summary gets posted back to the PR.
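To make the trigger step concrete, here is a minimal sketch of a webhook listener in Go. It is not the code from my bot: the `/review` trigger, the `commentEvent` type, and the `startReview` callback are all hypothetical, and the JSON field names (`action`, `is_pull`, `comment.body`) are assumptions modelled on Gitea's issue-comment payload. A real listener would also verify the webhook signature before trusting the body.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"regexp"
)

// triggerPattern is a hypothetical trigger: a comment line that is exactly "/review".
var triggerPattern = regexp.MustCompile(`(?m)^/review\s*$`)

// commentEvent holds the few fields this sketch reads from the webhook body.
// The field names are assumptions based on Gitea's issue-comment payload.
type commentEvent struct {
	Action  string `json:"action"`
	IsPull  bool   `json:"is_pull"`
	Comment struct {
		Body string `json:"body"`
	} `json:"comment"`
}

// shouldReview reports whether a comment event should start a review:
// a newly created comment, on a pull request, containing the trigger line.
func shouldReview(ev commentEvent) bool {
	return ev.Action == "created" && ev.IsPull && triggerPattern.MatchString(ev.Comment.Body)
}

// handleWebhook decodes the payload and hands matching events to startReview
// in a goroutine, so the webhook response is not held up by the review.
func handleWebhook(startReview func(commentEvent)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var ev commentEvent
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		if shouldReview(ev) {
			go startReview(ev)
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

func main() {
	// Exercise the matching logic without starting a server.
	var ev commentEvent
	ev.Action = "created"
	ev.IsPull = true
	ev.Comment.Body = "Ready for another look.\n/review"
	fmt.Println(shouldReview(ev))
}
```

Anchoring the pattern to a whole line keeps the bot from firing when someone merely mentions the command in the middle of a sentence.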
The tools
The agent gets three tools. These are the bridge between the PR and the agent itself. They translate between "AI wants to give feedback" and "Gitea needs specific API calls." The agent has no idea it's talking to Gitea.
post_inline_comment posts a comment on a specific file and line, with a severity level, either blocker (must-fix before merge) or suggestion (recommended improvement). The PR author can triage quickly: address the blockers, consider the suggestions.
note_low_confidence records an observation the agent isn't confident about. More on this one in a moment.
submit_review posts the final summary as a top-level PR comment. The agent calls this exactly once, as its last action.
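The contract behind those three tools can be sketched in plain Go. This is an illustration under assumptions, not my actual implementation: `ReviewSession`, `InlineComment`, and the method names are hypothetical, the handlers only record data in memory instead of calling Gitea, and how they get registered with the SDK is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Severity is the level attached to an inline comment.
type Severity string

const (
	Blocker    Severity = "blocker"    // must fix before merge
	Suggestion Severity = "suggestion" // recommended improvement
)

// InlineComment is what post_inline_comment receives from the agent.
type InlineComment struct {
	Path     string
	Line     int
	Severity Severity
	Body     string
}

// ReviewSession collects what the agent reports through its three tools.
// In a real bot each method would also translate into Gitea API calls.
type ReviewSession struct {
	Inline        []InlineComment
	LowConfidence []string
	Summary       string
	submitted     bool
}

// PostInlineComment validates and records a comment on a file and line.
func (s *ReviewSession) PostInlineComment(c InlineComment) error {
	if c.Severity != Blocker && c.Severity != Suggestion {
		return fmt.Errorf("unknown severity %q", c.Severity)
	}
	if c.Path == "" || c.Line < 1 {
		return errors.New("a path and a positive line number are required")
	}
	s.Inline = append(s.Inline, c)
	return nil
}

// NoteLowConfidence records an observation the agent is unsure about.
func (s *ReviewSession) NoteLowConfidence(note string) {
	s.LowConfidence = append(s.LowConfidence, note)
}

// SubmitReview stores the final summary and rejects a second submission.
func (s *ReviewSession) SubmitReview(summary string) error {
	if s.submitted {
		return errors.New("review already submitted")
	}
	s.submitted = true
	s.Summary = summary
	return nil
}

func main() {
	s := &ReviewSession{}
	fmt.Println(s.PostInlineComment(InlineComment{
		Path: "main.go", Line: 42, Severity: Blocker, Body: "possible nil dereference",
	}))
	s.NoteLowConfidence("this loop might be quadratic")
	fmt.Println(s.SubmitReview("One blocker found."))
	fmt.Println(s.SubmitReview("again"))
}
```

Returning an error from a tool, rather than silently dropping bad input, gives the agent a chance to correct itself on the next turn of its loop.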
The note_low_confidence tool came from Angie Jones' post on teaching Copilot to think like a maintainer. Her key insight: set a confidence threshold (>80%) so the reviewer only comments when it's fairly sure something is wrong. Without that threshold, AI reviewers dump every observation as an equally-weighted comment, and people learn to ignore the bot fast.
Note: While we may describe these machines as "thinking", it is important to recognize our anthropomorphization of what is really statistical analysis being done by the model.
I agree with having a confidence threshold, but my confidence in these models is also not high enough to trust their weighting. If I were building a product, or if this were a production environment where overloading the developers with noise would be a problem, then I would probably have the bot stay silent. But for an experiment, I wanted to see everything these models had to offer and make the evaluation myself.
These low-confidence observations are tucked into a collapsible section at the bottom of the main review, so they don't distract from it but are still there for anyone who wants to see them. This way, I get to be my true Hannah Montana self, and have the best of both worlds.
The system prompt
Angie's post was full of great ideas, especially around instructions on what to skip, such as linting, since presumably there are other CI workflows already catching those errors, and they're better and more efficient at finding them.
The prompt also explains the confidence model (high confidence gets an inline comment, low confidence gets a note_low_confidence call) and includes the expected output format for the final summary.
PR context goes in as a structured template: title, author, branches, description, the diff, existing comments (filtered to exclude bot noise). If the diff is too large, it's truncated with a note telling the agent to use git diff on specific files. The agent has the full cloned repo, so it can always look at more than what's in the prompt.
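The truncation step might look something like the sketch below. The function name, the byte budget, and the exact wording of the note are all assumptions for illustration; the only part taken from my setup is the idea of cutting the diff and pointing the agent at git diff.

```go
package main

import (
	"fmt"
	"strings"
)

// truncationNote is hypothetical wording for the hint left in the prompt.
const truncationNote = "\n[diff truncated: run `git diff` on specific files to see the rest]\n"

// truncateDiff returns diff unchanged when it fits in maxBytes. Otherwise it
// cuts at the last complete line that fits and appends the note. Cutting on a
// line boundary also avoids splitting a multi-byte character.
func truncateDiff(diff string, maxBytes int) string {
	if maxBytes < 0 {
		maxBytes = 0
	}
	if len(diff) <= maxBytes {
		return diff
	}
	cut := strings.LastIndex(diff[:maxBytes], "\n")
	if cut < 0 {
		cut = 0 // no complete line fits, so only the note is kept
	}
	return diff[:cut] + truncationNote
}

func main() {
	diff := "line1\nline2\nline3\n"
	fmt.Print(truncateDiff(diff, 8))
}
```

A byte budget is a crude stand-in for a token budget, but it is cheap to compute and errs on the side of leaving the agent to fetch the rest itself.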
The feedback loop
Once the agent is done, the review summary comment gets updated. Low-confidence notes are appended in a collapsible <details> section, and usage metadata (token counts, model used, API calls, cost if available) goes in another collapsible section at the bottom.
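As a rough sketch of that final assembly step, the function below appends the two collapsible sections to the summary. The `Usage` struct, its fields, and the section titles are hypothetical, and cost is left out for brevity; it only assumes that the comment body is Markdown in which `<details>` blocks render as collapsible sections.

```go
package main

import (
	"fmt"
	"strings"
)

// Usage is a hypothetical container for the metadata shown under the review.
type Usage struct {
	Model        string
	APICalls     int
	InputTokens  int
	OutputTokens int
}

// renderReview builds the final comment body: the summary first, then an
// optional collapsible list of low-confidence notes, then collapsible usage.
func renderReview(summary string, notes []string, u Usage) string {
	var b strings.Builder
	b.WriteString(strings.TrimSpace(summary))
	b.WriteString("\n")
	if len(notes) > 0 {
		b.WriteString("\n<details>\n<summary>Low-confidence observations</summary>\n\n")
		for _, n := range notes {
			fmt.Fprintf(&b, "- %s\n", n)
		}
		b.WriteString("\n</details>\n")
	}
	b.WriteString("\n<details>\n<summary>Usage</summary>\n\n")
	fmt.Fprintf(&b, "%s, %d API calls, %d input tokens, %d output tokens\n",
		u.Model, u.APICalls, u.InputTokens, u.OutputTokens)
	b.WriteString("\n</details>\n")
	return b.String()
}

func main() {
	// Sample values for illustration only.
	u := Usage{Model: "claude-sonnet-4.5", APICalls: 12, InputTokens: 45000, OutputTokens: 3000}
	fmt.Print(renderReview("One blocker found.", []string{"this loop might be quadratic"}, u))
}
```

The blank lines around each section's content matter: many Markdown renderers only process list syntax inside `<details>` when it is separated from the HTML tags by an empty line.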
I include the usage metadata because when someone reads a bot-generated review, they should be able to see what produced it and roughly what it cost. Having it say "claude-sonnet-4.5, 12 API calls, 45k input tokens..." makes sure that the developer is aware of the resources that went into producing the review, and hopefully encourages their future contributions to be focused. If a contribution is too large, it may overflow the context window, causing reviews to be less effective.
The pattern
So little of this is Copilot-specific. The SDK, the model, the hosting. Those are all interchangeable. The tools are what make it work somewhere it was never built for.
Code is left as an exercise to the reader. Although, if you are a VC, I am happy to send you my routing number in exchange for $10 Billion for this unicorn.