Why this matters now

On the morning of 2026-06-03, GitHub paused new Copilot sign-ups. The official reason: inference compute demand outran the capacity they had pre-paid. The unofficial reason, per the threads on Hacker News that day: a wave of enterprise procurement teams are writing checks for AI coding tools without understanding the unit economics, and vendors are absorbing the cost of customers who do not.

I have spent the last two weeks talking to engineering leaders at four companies that are scaling Copilot, Cursor, and Devin across teams of 50-500 engineers. The number that keeps coming up is this: for every $19/seat/month of Copilot Business, the actual inference cost is between $40 and $90 per engineer per month once you include chat, agent mode, and PR-review compute, and between $80 and $180 on the $39 Enterprise tier. The per-seat model is a marketing abstraction. The per-completion model is the financial reality.

This piece is the FinOps wedge I wish someone had handed me in April 2025: how to measure, attribute, alert, and cap the cost of AI coding agents before the bill lands and the CFO calls. It is built on the same principles as our LLM FinOps guide and the cost monitoring tools article, but applied to the specific failure modes of coding agents: long context, agentic loops, and the pricing-model mismatch between vendors and the buyers they sell to.


1. The per-engineer cost reality vs. the per-seat sticker price

Let us anchor the conversation with numbers. The sticker prices are the public, vendor-confirmed prices as of June 2026; the realistic-cost ranges are estimates. Both will move, but the order of magnitude is what matters.

| Product                     | Sticker price                   | Realistic cost / engineer / month | Primary cost driver                    |
|-----------------------------|---------------------------------|-----------------------------------|----------------------------------------|
| GitHub Copilot Business     | $19 / seat / month              | $40-90                            | Chat, PR review, agent mode            |
| GitHub Copilot Enterprise   | $39 / seat / month              | $80-180                           | Agent mode, knowledge base indexing    |
| Cursor Pro                  | $20 / seat / month              | $50-120                           | Long-context completions, Composer     |
| Cursor Business             | $40 / seat / month              | $90-220                           | Composer agent, large context windows  |
| Devin Team                  | $500 / seat / month             | $400-700 (with usage caps)        | Async multi-step task execution        |
| Continue.dev + BYOK         | $0 (open source) + provider     | $30-200                           | Whichever model you wire in            |
| Cody Sourcegraph Enterprise | $30+ / seat / month             | $60-150                           | Repo-wide context, multi-file edits    |

Where the multiplier comes from:

  • Completion count, not seat count, drives inference. A heavy Copilot user triggers 200-400 completions per day, of which 30-50% are accepted. The accepted completions are mostly small (under 200 tokens), but the rejected ones still cost money, because the model has to generate a suggestion before the human can accept or reject it.
  • Chat and agent modes are 10-100x more expensive than inline completions. A single Copilot Chat turn with file context (5-20 files) is an 8,000-30,000-token prompt. An agent-mode task that reads the repo and writes a refactor is 50,000-200,000 tokens. One engineer doing 5-10 agent-mode tasks per day can single-handedly burn $30-50 of inference a day.
  • PR review (Copilot Code Review, Cursor Bugbot) bills per review, not per seat. At 5-20 PRs per engineer per month, with reviews running 4,000-12,000 tokens, this adds another $5-15 per engineer per month.
  • Background indexing is metered. Copilot Enterprise indexes the full repo graph, and Cursor and Cody do the same. For a 5-million-line monorepo, indexing can run into tens of dollars in one-time cost, plus an ongoing query cost on every long-context request.

The practical implication: if you are budgeting $19 × 200 engineers × 12 months = $45.6K for Copilot Business, your real annual cost is likely between $96K and $216K. That is the same order of magnitude as a senior engineer's fully-loaded cost. Finance will ask.
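The budget gap above is plain multiplication, and it is worth re-running with your own seat count. A minimal sketch, using the Copilot Business sticker price and the realistic-cost range quoted in this section; the function and variable names are illustrative, not from any vendor tool.

```python
def annual_cost(seats: int, monthly_per_engineer: float) -> float:
    """Annual spend for a team at a given per-engineer monthly cost."""
    return seats * monthly_per_engineer * 12


SEATS = 200
sticker = annual_cost(SEATS, 19)    # Copilot Business sticker price
real_low = annual_cost(SEATS, 40)   # low end of the realistic range
real_high = annual_cost(SEATS, 90)  # high end of the realistic range

print(f"sticker:    ${sticker:,.0f}")                        # sticker:    $45,600
print(f"realistic:  ${real_low:,.0f} to ${real_high:,.0f}")  # realistic:  $96,000 to $216,000
print(f"multiplier: {real_low / sticker:.1f}x to {real_high / sticker:.1f}x")  # 2.1x to 4.7x
```

Swap in the Enterprise or Cursor rows from the table to see how the multiplier changes by product.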

2. The pricing-misalignment problem (why vendors sell per-seat)

Coding agents are sold per-seat because that is the procurement model engineering leaders already understand. It mirrors JetBrains, GitLab, and the rest of the developer-tools category. But the underlying cost curve is not per-seat — it is per-completion, and completions are log-normally distributed across your engineering team.

This creates three asymmetries that hurt buyers:

Power users are subsidized by light users. The 20% of engineers who use agent mode heavily (5+ tasks/day) consume 70-80% of the inference budget. The 80% who use it lightly (a few chat turns per week) consume almost nothing. A per-seat price flattens this. Vendors know this and price accordingly — they are pricing for the heavy users and using the light users as marketing.

Vendors are exposed to commodity-style inference cost shocks. The reason GitHub paused sign-ups in June 2026 is that the marginal cost of a Copilot completion (especially for code generation with long context) is sensitive to the price Anthropic and OpenAI charge for the underlying model. When those model providers raise prices, or when capacity is constrained, vendors either eat the cost or throttle features. Per-seat pricing does not give them a clean way to pass that through.

Buyers cannot attribute cost to value. The question every VP Engineering should be asking is: which engineers, on which tasks, are generating the most value per dollar of Copilot spend? Per-seat billing cannot answer that. Per-completion billing, with a way to attribute completions to PRs, can.

The honest framing: the per-seat sticker is a marketing price, and the per-completion cost is a FinOps problem. Your job in 2026 is to bridge the two.
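Before accepting the power-user skew as a given, you can check it against your own usage export. The sketch below computes the share of spend attributable to the heaviest 20% of engineers; the sample figures are invented for illustration and `top_share` is a hypothetical helper name.

```python
def top_share(usage: list[float], top_fraction: float = 0.2) -> float:
    """Fraction of total usage consumed by the heaviest `top_fraction` of users."""
    total = sum(usage)
    if not usage or total == 0:
        return 0.0
    ranked = sorted(usage, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # at least one user
    return sum(ranked[:k]) / total


# Hypothetical monthly inference cost (USD) for a 10-engineer team.
monthly_cost = [310, 240, 45, 30, 25, 20, 15, 10, 5, 0]
share = top_share(monthly_cost)
print(f"top 20% of engineers account for {share:.0%} of spend")  # 79%
```

If the share on your real data is far below the 70-80% range, per-seat pricing is costing you less than this section assumes.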

3. Per-LOC and per-PR cost attribution

Per-engineer cost tells you how much. Per-LOC and per-PR cost tell you whether it was worth it. This is the metric engineering leaders actually want, and almost nobody has it.

Per-LOC attribution: take the total monthly inference cost attributed to an engineer (or a team), divide by the lines of code that engineer's PRs added (or modified, depending on your philosophy). For 2026, a reasonable range is $0.05-0.50 per modified LOC for high-leverage AI-assisted work, and $2-10 per modified LOC for low-leverage work (lots of churn, lots of throwaway code).

Per-PR attribution: take the total monthly cost attributed to an engineer, divide by the number of PRs merged. This normalizes for LOC density — a 50-line bug fix and a 2,000-line feature should both count as one PR if both delivered value. A reasonable target is $5-50 per merged PR; outliers above $200 are almost always a sign of an agent loop, a runaway context, or a junior engineer treating agent mode as a search engine.
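Both ratios are simple enough to pin down in a few lines. This is a sketch under stated assumptions: the `EngineerMonth` record shape is hypothetical, and the $200 threshold is the per-PR outlier figure from the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional

PER_PR_OUTLIER_USD = 200.0  # outlier threshold quoted in the text


@dataclass
class EngineerMonth:  # hypothetical record shape, one row per engineer per month
    email: str
    inference_cost_usd: float
    lines_modified: int
    prs_merged: int


def cost_per_loc(m: EngineerMonth) -> Optional[float]:
    """Monthly cost divided by modified lines; None when nothing was modified."""
    return m.inference_cost_usd / m.lines_modified if m.lines_modified else None


def cost_per_pr(m: EngineerMonth) -> Optional[float]:
    """Monthly cost divided by merged PRs; None when nothing was merged."""
    return m.inference_cost_usd / m.prs_merged if m.prs_merged else None


def is_outlier(m: EngineerMonth) -> bool:
    per_pr = cost_per_pr(m)
    return per_pr is not None and per_pr > PER_PR_OUTLIER_USD


rows = [
    EngineerMonth("ana@example.com", 600.0, 3000, 12),
    EngineerMonth("bo@example.com", 900.0, 400, 3),
]
for m in rows:
    print(m.email, cost_per_loc(m), cost_per_pr(m), is_outlier(m))
```

Returning `None` rather than zero for engineers with no merged work keeps them out of averages while still leaving them visible in the report.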

How to actually compute these:

  1. Export the vendor's usage data (most vendors expose per-user usage APIs; Copilot Enterprise has an admin API, Cursor has a per-user dashboard export, Devin has a per-engineer session log).
  2. Join that data with your git host's PR data on engineer email (GitHub, GitLab, Bitbucket all have APIs for this).
  3. For per-LOC, run git diff --shortstat between the merge commit and its first parent (for example, git diff --shortstat MERGE_SHA^1 MERGE_SHA) to get lines added/modified. For per-PR, just count.
  4. Load both into the same warehouse table (BigQuery, Snowflake, even a CSV join in Python). One SQL query gives you the per-engineer and per-team breakdown.
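The four steps above can be sketched with the standard library before you commit to a warehouse. Everything about the input shape is an assumption: the field names (`email`, `cost_usd`, `author_email`, `shortstat`) are hypothetical, and insertions plus deletions is used as a stand-in for "lines modified".

```python
import re
from collections import defaultdict


def parse_shortstat(line: str) -> int:
    """Insertions + deletions from a `git diff --shortstat` summary line."""
    ins = re.search(r"(\d+) insertion", line)
    dele = re.search(r"(\d+) deletion", line)
    return (int(ins.group(1)) if ins else 0) + (int(dele.group(1)) if dele else 0)


def attribute(usage_rows: list[dict], pr_rows: list[dict]) -> dict:
    """Join vendor usage with PR data on lower-cased engineer email."""
    prs = defaultdict(lambda: {"prs": 0, "loc": 0})
    for pr in pr_rows:
        key = pr["author_email"].lower()
        prs[key]["prs"] += 1
        prs[key]["loc"] += parse_shortstat(pr["shortstat"])

    cost = defaultdict(float)
    for row in usage_rows:
        cost[row["email"].lower()] += row["cost_usd"]

    report = {}
    for email, total in cost.items():  # left join: keep spenders with no PRs
        stats = prs.get(email, {"prs": 0, "loc": 0})
        report[email] = {
            "cost_usd": total,
            "cost_per_pr": total / stats["prs"] if stats["prs"] else None,
            "cost_per_loc": total / stats["loc"] if stats["loc"] else None,
        }
    return report


usage = [
    {"email": "Ana@example.com", "cost_usd": 120.0},
    {"email": "ana@example.com", "cost_usd": 30.0},
    {"email": "bo@example.com", "cost_usd": 80.0},
]
merged = [
    {"author_email": "ana@example.com",
     "shortstat": " 3 files changed, 120 insertions(+), 30 deletions(-)"},
    {"author_email": "ana@example.com",
     "shortstat": " 1 file changed, 1 insertion(+)"},
]
report = attribute(usage, merged)
print(report)
```

The join starts from the cost side on purpose: an engineer with spend and no merged PRs is exactly the row you want to see, and an inner join would hide it.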

The 80/20 will jump out: a small number of engineers are generating outsized cost-per-LOC because they are using agent mode on tasks where the agent is not actually helpful (debugging, exploratory reading, large refactors with ambiguous requirements). Those are the engineers who need either training or a different tool, not more capacity.

4. Cost anomaly detection: the 200k-token refactor

The single biggest cost anomaly in AI coding agents is the accidental long-context blast: an engineer opens a chat with the entire monorepo attached (5,000-50,000 files indexed), asks a question, and the agent decides to "helpfully" re-read every file before answering. The result is a single session that consumes more tokens than the rest of the engineer's month combined.

I have seen this pattern at three of the four companies I talked to. The signatures:

  • One session, one engineer, 200K-1M tokens consumed in a single day.
  • The session belongs to either a junior engineer (who doesn't know what context the agent is loading) or a senior engineer debugging a thorny issue (who is willing to throw compute at the problem).
  • The session produces a PR that gets reverted within a week, because the long-context refactor is not what the team actually wanted.

How to detect this in real time:

  1. Per-session token cap — Helicone, Portkey, and LangSmith all let you set hard limits on per-session or per-user spend. 100K tokens is a reasonable default for inline completions; 500K for chat; 2M for agent mode. Anyone hitting the cap gets a Slack ping, not a silent failure.
  2. Per-user daily spend cap — $20/day per engineer is generous for normal use. Anyone hitting it triggers a review.
  3. Per-team weekly burn rate — alert when a team's burn rate exceeds 1.5x the prior 4-week average. This catches the slow drift, not just the spike.
  4. Acceptance rate monitoring — if an engineer's acceptance rate drops below 20% on completions, they are generating cost without value. That is a training opportunity, not a compute problem.
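The four checks above reduce to four small predicates. A minimal sketch using the thresholds from the list; the function names and the mode labels (`inline`, `chat`, `agent`) are hypothetical, and wiring the results to Slack is left out.

```python
from statistics import mean

SESSION_TOKEN_CAPS = {"inline": 100_000, "chat": 500_000, "agent": 2_000_000}
DAILY_SPEND_CAP_USD = 20.0
BURN_RATE_MULTIPLIER = 1.5
MIN_ACCEPTANCE_RATE = 0.20


def session_over_cap(mode: str, tokens: int) -> bool:
    """True when a single session has reached the cap for its mode."""
    return tokens >= SESSION_TOKEN_CAPS[mode]


def user_over_daily_cap(spend_usd: float) -> bool:
    return spend_usd >= DAILY_SPEND_CAP_USD


def team_burn_anomaly(this_week_usd: float, prior_weeks_usd: list[float]) -> bool:
    """True when this week exceeds 1.5x the average of the prior weeks."""
    return this_week_usd > BURN_RATE_MULTIPLIER * mean(prior_weeks_usd)


def low_acceptance(accepted: int, shown: int) -> bool:
    """True when fewer than 20% of shown completions were accepted."""
    return shown > 0 and accepted / shown < MIN_ACCEPTANCE_RATE


print(session_over_cap("chat", 640_000))                      # True
print(user_over_daily_cap(12.50))                             # False
print(team_burn_anomaly(1600.0, [1000.0, 1100.0, 900.0, 1000.0]))  # True
print(low_acceptance(accepted=30, shown=200))                 # True
```

Keeping each signal as its own predicate makes it easy to alert on them separately, since the right response differs: a cap hit is a ping, a low acceptance rate is a training conversation.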

For the technical implementation, the simplest path is the one we cover in the cost monitoring tools piece: wrap the IDE's API calls (where possible) or use the vendor's admin API to pull session-level token counts into Prometheus. Then alert in Grafana. The whole thing is a 2-3 day project for an SRE.
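For the Prometheus leg, one low-dependency option is to render the pulled token counts in the Prometheus text exposition format yourself and serve or write the result wherever your scraper expects it. This is a sketch, not the approach from the cost monitoring piece: the metric name `ai_agent_tokens_today` and the session record shape are assumptions.

```python
from collections import defaultdict


def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_exposition(sessions: list[dict]) -> str:
    """Aggregate session token counts per (engineer, mode) and render them."""
    totals = defaultdict(int)
    for s in sessions:
        totals[(s["engineer"], s["mode"])] += s["tokens"]

    lines = [
        "# HELP ai_agent_tokens_today Tokens consumed today, by engineer and mode.",
        "# TYPE ai_agent_tokens_today gauge",
    ]
    for (engineer, mode), tokens in sorted(totals.items()):
        labels = f'engineer="{escape_label(engineer)}",mode="{escape_label(mode)}"'
        lines.append(f"ai_agent_tokens_today{{{labels}}} {tokens}")
    return "\n".join(lines) + "\n"


sessions = [
    {"engineer": "ana@example.com", "mode": "agent", "tokens": 120_000},
    {"engineer": "ana@example.com", "mode": "agent", "tokens": 30_000},
    {"engineer": "bo@example.com", "mode": "chat", "tokens": 9_000},
]
text = to_exposition(sessions)
print(text)
```

Sessions are summed per engineer and mode rather than exported one series per session, because session IDs as labels would create a new time series for every session and inflate cardinality quickly.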

FinOps Helicone

Helicone is the fastest path to per-engineer AI cost visibility. Drop-in proxy for OpenAI, Anthropic, and any OpenAI-compatible coding-agent API. Per-user cost dashboards, session-level token caps, cache hit analysis, and Slack/PagerDuty alerting on burn-rate anomalies. Free tier covers 100K events/month — enough for a 50-engineer team to start.

5. The four attribution metrics that matter

After two years of watching teams instrument this, here is the shortlist of metrics that actually drive decisions. Skip the rest.

1. Cost per merged PR, per engineer, per week. This is the single most useful number. It captures the value side of the equation. Trends over time are more useful than absolute values.

2. Tokens per accepted completion, by model. This tells you whether your engineers are using the right model for the task. A senior engineer using Claude Opus for boilerplate is burning money. A junior using Claude Haiku for architecture review is wasting time. Both are visible in this metric.

3. Agent-mode sessions per engineer per week, and median session length. Agent mode is where the money goes. If the median session length is creeping up week-over-week, you are paying for more agentic work, which is fine if the PR throughput is also going up, and a problem if it is not.

4. Cost per PR that ships to production. Not merged — shipped. A PR that gets reverted the next day is not a delivery. If you can wire this to your deploy log, you get the truest measure of whether the AI spend is producing business value.

Most observability backends can do this with a few custom metrics. The trick is getting the cost data into the same warehouse as the git data. Once that join exists, the rest is SQL.
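Two of the four metrics are easy to get subtly wrong, so here is one way to define them. The sketch assumes that "tokens per accepted completion" means all generated tokens (accepted or not) divided by the number of accepted completions, and that a shipped PR is one that was deployed and not reverted; the event and PR field names are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def tokens_per_accepted_completion(events: list[dict]) -> dict:
    """Per model: total generated tokens / number of accepted completions."""
    tokens = defaultdict(int)
    accepted = defaultdict(int)
    for e in events:
        tokens[e["model"]] += e["tokens"]
        if e["accepted"]:
            accepted[e["model"]] += 1
    # Models with zero accepted completions are omitted rather than divided by zero.
    return {m: tokens[m] / accepted[m] for m in tokens if accepted[m]}


def cost_per_shipped_pr(total_cost_usd: float, prs: list[dict]) -> Optional[float]:
    """Cost divided by PRs that reached production and were not reverted."""
    shipped = [p for p in prs if p["deployed"] and not p["reverted"]]
    return total_cost_usd / len(shipped) if shipped else None


events = [
    {"model": "large", "tokens": 900, "accepted": True},
    {"model": "large", "tokens": 700, "accepted": False},
    {"model": "small", "tokens": 120, "accepted": True},
    {"model": "small", "tokens": 80, "accepted": True},
]
prs = [
    {"deployed": True, "reverted": False},
    {"deployed": True, "reverted": True},
    {"deployed": False, "reverted": False},
]
print(tokens_per_accepted_completion(events))  # {'large': 1600.0, 'small': 100.0}
print(cost_per_shipped_pr(300.0, prs))         # 300.0
```

Counting rejected tokens in the numerator is deliberate: they are billed either way, so a model with a low acceptance rate should look expensive in this metric.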

6. Enterprise policy: per-team budgets, hard-token caps, completion-only modes

Once you have the attribution working, the next question is enforcement. Three patterns have held up well in production at the companies I have looked at:

Per-team monthly budgets with soft and hard stops. Each team gets a monthly budget (e.g., $2K for a 10-engineer team on Copilot Business). At 80% of budget, the team lead gets a Slack ping. At 100%, the team's IDEs switch to a read-only mode for non-essential features. At 120%, the overage requires VP approval to continue. This is the model that has worked best because it puts the decision in the hands of the people closest to the work.
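The soft and hard stops amount to a threshold ladder. A minimal sketch using the 80%, 100%, and 120% thresholds from the paragraph above; the action names are hypothetical labels for whatever your gateway or admin console actually does.

```python
def budget_action(spend_usd: float, budget_usd: float) -> str:
    """Map month-to-date team spend to the enforcement step it triggers."""
    if budget_usd <= 0:
        raise ValueError("budget must be positive")
    ratio = spend_usd / budget_usd
    if ratio >= 1.2:
        return "require_vp_approval"
    if ratio >= 1.0:
        return "read_only_non_essential"
    if ratio >= 0.8:
        return "notify_team_lead"
    return "ok"


BUDGET = 2000.0  # the $2K example budget for a 10-engineer team
for spend in (900.0, 1700.0, 2100.0, 2500.0):
    print(f"${spend:,.0f}: {budget_action(spend, BUDGET)}")
```

Evaluating the highest threshold first means a team that jumps straight past 100% in one day still lands on the strictest applicable step.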

Hard per-session and per-user token caps. No single session should burn more than 500K tokens. No single user should burn more than $50/day. These are not negotiable defaults; they are guardrails. Engineers who need to exceed them can do so by filing a one-line request, which gives you a paper trail for the cost. This single control has saved every company I have seen implement it from at least one $10K surprise bill.

Completion-only mode for junior engineers and contractors. For engineers in their first 90 days, or for contractors with limited scope, restrict the IDE to inline completions only. No chat, no agent mode. This is both a cost control and a security control (junior engineers cannot accidentally exfiltrate repo context through agent mode). After the 90-day ramp, the restriction lifts automatically.
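The automatic lift after the ramp is a date comparison, which is worth encoding so nobody has to remember to flip a switch. A sketch under one assumption not stated above: contractors stay on completion-only for the length of their engagement, while employees get full access once 90 days have passed. The mode names are hypothetical.

```python
from datetime import date, timedelta

RAMP = timedelta(days=90)
ALL_MODES = frozenset({"inline", "chat", "agent"})
COMPLETION_ONLY = frozenset({"inline"})


def allowed_modes(start_date: date, today: date, is_contractor: bool = False) -> frozenset:
    """Return the IDE modes an engineer may use on `today`."""
    if is_contractor or today - start_date < RAMP:
        return COMPLETION_ONLY
    return ALL_MODES


print(sorted(allowed_modes(date(2026, 1, 1), date(2026, 3, 1))))   # ['inline']
print(sorted(allowed_modes(date(2026, 1, 1), date(2026, 4, 1))))   # ['agent', 'chat', 'inline']
print(sorted(allowed_modes(date(2025, 1, 1), date(2026, 4, 1), is_contractor=True)))  # ['inline']
```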

For the policy engine itself, you have two real options: build it in the gateway (Portkey, Helicone) or build it in the IDE admin console (Copilot has org-level policies, Cursor has team-level policies, Devin has usage caps per seat). The gateway approach is more flexible but requires you to route traffic through it, which is invasive. The vendor-admin approach is less flexible but ships in an afternoon.

FinOps Portkey

Portkey's AI gateway gives you policy-as-code for AI coding agents — per-team budgets, hard token caps, automatic fallbacks when caps are hit, and unified cost dashboards across OpenAI, Anthropic, and any OpenAI-compatible provider. The same gateway also handles retries, fallbacks, and load balancing, so you are not standing up a separate control plane. Generous free tier for small teams.

7. Self-hosted and BYOK as a cost hedge

If your engineering team is large enough (200+ engineers, sustained Copilot usage), the unit economics of running a self-hosted model for completions start to make sense. Cursor and Continue.dev both support bring-your-own-key (BYOK) and bring-your-own-model. The math:

  • A 200-engineer team on Copilot Business at full sticker is $45.6K/year. Real cost, with chat and agent mode, is $96K-$216K/year.
  • The same team running a self-hosted Qwen2.5-Coder-32B or DeepSeek-Coder-V2 on 4x H100s (reserved, $2/hr on CoreWeave, $1.50/hr on Lambda) for inline completions only: ~$53K-$70K/year in GPU cost if those rates are per GPU-hour and the node runs around the clock (4 × 8,760 hours × $1.50-$2.00), plus 0.5 FTE of ML platform engineering. The completions are 80% as good for boilerplate and 60% as good for novel code, which is fine for the long tail of completions.
  • Hybrid model: self-hosted Qwen for inline completions, vendor API (Copilot or Cursor) for chat and agent mode where model quality matters more. This is what several of the teams I talked to have landed on, and it cuts the bill by 40-60% while preserving the user experience on the high-value interactions.
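The GPU line in this comparison is sensitive to two inputs that are easy to misread, so it helps to make them explicit. The sketch assumes the quoted hourly rates are per GPU-hour and that the GPUs are reserved around the clock; both are assumptions, and the total scales linearly with GPU count and with the fraction of hours you actually reserve.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_gpu_cost(gpus: int, usd_per_gpu_hour: float, utilization: float = 1.0) -> float:
    """Yearly GPU spend; `utilization` is the fraction of hours reserved."""
    return gpus * usd_per_gpu_hour * HOURS_PER_YEAR * utilization


low = annual_gpu_cost(4, 1.50)   # lower quoted rate
high = annual_gpu_cost(4, 2.00)  # higher quoted rate
print(f"4x H100, 24/7: ${low:,.0f} to ${high:,.0f} per year")  # $52,560 to $70,080

# Halving either the GPU count or the reserved hours halves the bill.
print(f"2x H100, 24/7: ${annual_gpu_cost(2, 1.50):,.0f} to ${annual_gpu_cost(2, 2.00):,.0f}")
```

Add the 0.5 FTE of platform engineering on top before comparing against the vendor bill; at this team size the people cost is likely to be the larger number.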

The hybrid approach is not free — you have to maintain the self-hosted model, keep up with upstream Qwen/DeepSeek releases, monitor quality, and run inference infrastructure. For a 200-engineer team, the break-even is real but not dramatic. For a 1,000-engineer team, it is obvious.

Observability Grafana

Grafana is the de facto dashboard layer for AI coding agent cost monitoring. Send OpenTelemetry cost attributes from your coding-agent gateway (or vendor admin API) into Grafana, and you get per-engineer, per-team, per-PR cost dashboards next to your existing infrastructure metrics. Alert on burn-rate anomalies the same way you alert on any other infra metric. Free tier covers small teams; self-hosted is free.

8. A 30-day rollout plan

If you are starting from zero, here is the order I would do it in. This compresses two months of work at one of the companies I advised into four weeks.

Week 1 — Visibility. Wire the vendor's admin API (Copilot, Cursor, or Devin) into your data warehouse. Build the per-engineer, per-team, per-week cost table. Do not build dashboards yet — just get the table. The first time you see the distribution, you will be shocked.

Week 2 — Attribution. Join the cost table with your git host's PR data on engineer email. Compute per-LOC and per-PR cost. Publish the top-10 cost-per-LOC engineers to their managers. This is uncomfortable but it works.

Week 3 — Guardrails. Set per-session and per-user hard caps in the gateway (Helicone, Portkey) or in the vendor admin console. Default caps: 500K tokens per session, $50/day per user, 1M tokens per session for agent mode. Document an override process.

Week 4 — Budgets. Set per-team monthly budgets with soft (Slack ping at 80%) and hard (read-only mode at 100%) stops. Hand the controls to the team leads. Stand up the Grafana dashboard for ongoing monitoring.

After 30 days, you will have cut 20-40% of the waste, identified the engineers who need training, and built the data foundation to negotiate with the vendor from a position of strength. The vendor pricing team will be more flexible once they know you can measure usage.

9. What the vendors will not tell you

A few things I have learned from talking to the procurement and engineering teams on both sides of these deals:

Negotiate on usage caps, not seat count. Most vendors will give you a 30-50% discount on the sticker if you commit to a minimum seat count. The trap is that the seat count is fixed but the usage is not. Negotiate the seat count down and add a usage ceiling clause — if you go over, the overage is at cost-plus, not at sticker.

Ask for the model tier, not the brand. "Powered by GPT-4o" is a marketing claim. What you actually want to know is: are the inline completions running on a frontier model or a distilled one? Vendors route different products to different model tiers to manage their own cost. You have the right to know which tier your completions are running on.

Data retention is the real cost driver. Copilot Enterprise and Devin both retain chat history and PR-review history by default, indexed and queryable. The retention is part of the value, but it is also part of the inference cost (re-ranking, embedding, retrieval). Turn it off if you do not need it.

The agent-mode tax is going up, not down. Anthropic and OpenAI have been raising prices on long-context models for two years. The vendors that resell those models cannot absorb the increases forever. Expect either feature throttling (Copilot's "premium requests" model is the first sign) or price increases (Cursor raised Business pricing 25% in Q1 2026). Plan for the bill to grow 2-3x over the next 18 months even if your usage stays flat.
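To turn "2-3x over 18 months" into something you can put in a forecast, convert it to a compound monthly rate. A small sketch assuming smooth compounding, which real price changes will not follow (they arrive in steps); the function names are illustrative.

```python
def implied_monthly_growth(multiple: float, months: int) -> float:
    """Monthly compound growth rate that yields `multiple` after `months`."""
    return multiple ** (1 / months) - 1


def projected_bill(current_usd: float, monthly_growth: float, months: int) -> float:
    """Bill after `months` of compounding at `monthly_growth`."""
    return current_usd * (1 + monthly_growth) ** months


low_rate = implied_monthly_growth(2, 18)
high_rate = implied_monthly_growth(3, 18)
print(f"implied growth: {low_rate:.1%} to {high_rate:.1%} per month")  # 3.9% to 6.3%

# A $10,000/month bill twelve months from now under each rate.
for rate in (low_rate, high_rate):
    print(f"${projected_bill(10_000, rate, 12):,.0f}")
```

A forecast built this way is a planning range, not a prediction; its main use is making sure the budget line has headroom before the next renewal.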


10. Putting it together

AI coding agents are the first category of developer tool where the cost scales with usage, not with seats, and where the vendors' pricing model does not reflect that. That mismatch is a FinOps problem, not a procurement problem. The companies that will get the most value out of Copilot, Cursor, and Devin in 2026 are the ones that build the attribution and guardrails now, while the bills are still small enough to absorb.

Start with the per-engineer, per-week cost table. Everything else follows from there.

This piece builds on the FinOps wedge established in our LLM FinOps 2026 guide, the tooling deep-dive in LLM Cost Monitoring Tools, and the GPU/infra side of cost in AWS Savings Plans vs Reserved Instances. The multimodal cost pattern is covered separately in Multimodal LLM Cost Optimization and the routing strategy in Multi-LLM Routing.

Open Source Continue.dev

Continue.dev is the leading open-source AI coding agent — bring-your-own-model, bring-your-own-key, fully self-hostable. The only coding agent that lets you swap between GPT-4o, Claude, Qwen-Coder, and DeepSeek-Coder without changing your editor workflow. Free for individual use; team and enterprise plans available. For teams that want cost control via self-hosting, Continue is the most flexible foundation in 2026.