AI spend is the fastest-growing line item on most GCP bills. Teams experimenting with Gemini, Vertex AI, or any of the Cloud AI services often have no idea what they're actually spending until the invoice arrives. The reason: AI costs are buried in billing exports under generic service names, split across multiple SKUs, and priced in units (tokens, characters, seconds) that don't map intuitively to dollars.
The good news is that once you understand the billing structure, tracking and optimizing AI costs follows the same principles as any other GCP service — labels, billing export queries, and architectural discipline. This guide covers the practical steps to get there.
## Where AI Costs Hide in GCP Billing
The first challenge is finding your AI spend. GCP doesn't group all AI costs under a single "AI" category. Instead, they're scattered across several service names in billing exports:
| Billing Service Name | What It Covers |
|---|---|
| Vertex AI | Model training, online/batch predictions, endpoints, Vertex AI Studio, custom models |
| Cloud AI APIs | Vision API, Natural Language API, Translation API, Speech-to-Text, Text-to-Speech |
| Vertex AI Generative AI | Gemini API calls, PaLM (legacy), Imagen, embeddings |
| Compute Engine | GPUs/TPUs used for training or serving (billed as compute, not AI) |
| Cloud Storage | Training data, model artifacts, pipeline outputs |
A quick query to find all AI-related charges:

```sql
SELECT
  service.description AS service,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS total_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE invoice.month = '202603'
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR service.description LIKE '%Natural Language%'
    OR sku.description LIKE '%GPU%'
    OR sku.description LIKE '%TPU%'
    OR sku.description LIKE '%Gemini%'
    OR sku.description LIKE '%PaLM%'
    OR sku.description LIKE '%Imagen%'
  )
GROUP BY service, sku
ORDER BY total_cost DESC;
```
Run this against your billing export to get the full picture of AI-related charges, including GPU compute that might be hiding under Compute Engine.
## Understanding Vertex AI and Gemini API Pricing
AI pricing on GCP is fundamentally different from traditional cloud pricing. Instead of paying for compute time, you pay per token (for generative models) or per prediction (for custom models). Understanding the unit economics is essential for forecasting and optimization.
### Gemini API pricing (input and output tokens)
Gemini models charge separately for input and output tokens. The price gap between models is significant:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Gemini 2.5 Pro | $1.25 - $10.00 | $10.00 - $30.00 | Complex reasoning, multi-step tasks |
| Gemini 2.5 Flash | $0.15 - $0.60 | $1.00 - $3.50 | High-volume, cost-sensitive workloads |
| Gemini 2.0 Flash | $0.10 | $0.40 | Simple tasks, high throughput |
The output-to-input price ratio matters more than absolute price. Output tokens cost 3-8x more than input tokens, so tasks that generate long responses (code generation, summarization) are disproportionately expensive.
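These unit prices translate directly into per-request cost: (input tokens x input price + output tokens x output price) / 1M. A minimal estimator, seeded with the lower-bound list prices from the table above (illustrative figures only — verify against current GCP pricing before relying on them):

```python
# Rough per-request cost estimator for Gemini models.
# Prices are illustrative lower-bound figures from the table above,
# in dollars per 1M tokens -- check current GCP pricing before use.
PRICES = {
    "gemini-2.5-pro":   {"input": 1.25, "output": 10.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 1.00},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 10K-token prompt with a 2K-token response, across the three models:
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.5f}")
```

Running the numbers this way before choosing a model makes the output-token penalty visible: doubling the response length moves cost far more than doubling the prompt.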
### Custom model pricing
For custom-trained models or fine-tuned models deployed on Vertex AI endpoints:
| Component | Pricing |
|---|---|
| Training (GPUs) | Per GPU-hour ($1.10 - $12.00+/hr depending on GPU type) |
| Online prediction | Per node-hour (based on machine type + accelerator) |
| Batch prediction | Per node-hour (typically 40-60% cheaper than online) |
| Model storage | Per GB/month |
### Provisioned throughput vs pay-per-use
For high-volume Gemini API usage, GCP offers provisioned throughput — you buy a guaranteed capacity (tokens per minute) at a discount. This only makes sense if:
- You have consistent, predictable traffic
- Your sustained volume is high enough that paying for committed capacity costs less than paying per token for the same traffic
- You can accurately forecast demand
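The break-even question can be made concrete: a commitment only wins if your utilization of the committed capacity stays above the ratio of the committed price to the equivalent pay-per-use price. A sketch with entirely hypothetical numbers (substitute your actual quote and observed traffic):

```python
# Break-even sketch for provisioned throughput vs pay-per-use.
# All figures below are hypothetical placeholders.
def breakeven_utilization(committed_cost_per_hour: float,
                          payg_cost_per_1m_tokens: float,
                          committed_tokens_per_minute: int) -> float:
    """Fraction of committed capacity you must actually consume for the
    commitment to beat pay-per-use pricing."""
    tokens_per_hour = committed_tokens_per_minute * 60
    payg_cost_at_full_capacity = tokens_per_hour / 1_000_000 * payg_cost_per_1m_tokens
    return committed_cost_per_hour / payg_cost_at_full_capacity

# e.g. a $50/hr commitment for 1M tokens/minute vs $1.25 per 1M tokens on demand
util = breakeven_utilization(50.0, 1.25, 1_000_000)
print(f"Commitment wins above {util:.0%} utilization")
```

If your traffic is bursty and average utilization sits below that threshold, pay-per-use is cheaper despite the higher unit price.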
## Cost Allocation by Model and Team
Once you can find AI costs, the next step is attributing them — which model, which team, which use case is driving spend.
### Using labels on Vertex AI resources
Label everything:
```python
from google.cloud import aiplatform

# When creating a Vertex AI endpoint
endpoint = aiplatform.Endpoint.create(
    display_name="customer-support-gemini",
    labels={
        "team": "support",
        "model": "gemini-2-5-flash",
        "environment": "production",
        "use_case": "ticket-classification",
    },
)

# When submitting a training job
job = aiplatform.CustomTrainingJob(
    display_name="churn-model-v3",
    labels={
        "team": "data-science",
        "model_type": "custom",
        "environment": "production",
    },
    # ...
)
```
Labels propagate to billing exports, enabling cost queries by team, model, or use case.
### Billing export query: AI spend by label

```sql
SELECT
  labels.value AS team,
  service.description AS service,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS total_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS labels
WHERE invoice.month = '202603'
  AND labels.key = 'team'
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR sku.description LIKE '%Gemini%'
  )
GROUP BY team, service, sku
ORDER BY total_cost DESC;
```
### Tracking Gemini API costs per application
If multiple applications call the Gemini API through the same project, costs blend together. Separate them by:
- Using different projects per application (cleanest, but adds management overhead)
- Using labels on API requests (if your framework supports passing labels through the Vertex AI SDK)
- Logging request metadata and joining with billing data post-hoc
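The third option — logging metadata per request and reconciling later — can start as simply as recording token counts tagged by application at call time. A minimal in-process sketch (the price figures and record fields are illustrative; in practice you'd write these records to BigQuery or Cloud Logging and join against the billing export):

```python
import time
from collections import defaultdict

# Illustrative per-1M-token prices (input, output); verify against current pricing.
PRICE_PER_1M = {"gemini-2.5-flash": (0.15, 1.00)}

usage_log = []

def record_usage(app: str, model: str, input_tokens: int, output_tokens: int):
    """Record one API call's token usage, tagged by application."""
    in_price, out_price = PRICE_PER_1M[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    usage_log.append({"ts": time.time(), "app": app, "model": model,
                      "input_tokens": input_tokens,
                      "output_tokens": output_tokens, "est_cost": cost})

def cost_by_app() -> dict:
    """Aggregate estimated spend per application."""
    totals = defaultdict(float)
    for row in usage_log:
        totals[row["app"]] += row["est_cost"]
    return dict(totals)

record_usage("support-bot", "gemini-2.5-flash", 4_000, 500)
record_usage("doc-search", "gemini-2.5-flash", 12_000, 300)
print(cost_by_app())
```

The estimated costs won't match the invoice to the cent, but the per-application ratios are what you need for attribution.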
## Common AI Cost Traps
These are the mistakes we see most often. Each one is easy to make and expensive to ignore.
### 1. Over-provisioned endpoints sitting idle
This is the #1 source of AI cost waste. A Vertex AI endpoint with a GPU runs 24/7 whether it's serving requests or not. Teams deploy a model for testing, forget about it, and discover months later they've been paying $800+/month for an endpoint handling zero traffic.
How to find them (note this query reads `resource.name`, which is only available in the detailed billing export, not the standard one):

```sql
SELECT
  resource.name AS endpoint_name,
  ROUND(SUM(cost), 2) AS monthly_cost,
  COUNT(DISTINCT usage_start_time) AS billing_periods
FROM `project.dataset.gcp_billing_export_resource_v1_XXXXXX`
WHERE service.description = 'Vertex AI'
  AND sku.description LIKE '%Prediction%'
  AND invoice.month = '202603'
GROUP BY endpoint_name
ORDER BY monthly_cost DESC;
```
Cross-reference these with actual traffic in Vertex AI monitoring. If an endpoint costs $500+/month but handles < 100 requests/day, it's a candidate for right-sizing or removal.
### 2. Using expensive models for simple tasks
Gemini 2.5 Pro is powerful, but it costs 10-30x more than Flash models. If you're using Pro for tasks like classification, extraction, or simple Q&A, you're likely overpaying.
A practical test: Run your task on Gemini 2.0 Flash first. If the quality is acceptable, you're done. If not, try 2.5 Flash. Only escalate to Pro for tasks that genuinely need complex reasoning.
| Task Type | Recommended Model | Why |
|---|---|---|
| Text classification | Gemini 2.0 Flash | Simple input/output, doesn't need reasoning |
| Summarization | Gemini 2.5 Flash | Needs some comprehension, but not complex reasoning |
| Code generation | Gemini 2.5 Pro | Complex reasoning significantly improves output quality |
| Data extraction | Gemini 2.0 Flash | Structured input/output, pattern matching |
| Multi-step analysis | Gemini 2.5 Pro | Requires chain-of-thought reasoning |
### 3. Forgotten training jobs and notebooks
GPU-backed Vertex AI Workbench notebooks and training jobs that aren't cleaned up keep the underlying compute running — and billing. A single notebook with an NVIDIA T4 costs ~$330/month if left running.
Set idle shutdown policies on all notebooks, and configure training jobs with maximum runtime limits.
### 4. Context window bloat
Every token you send to a model costs money. Teams often send entire documents when a summary or a relevant excerpt would do. A 100K-token prompt to Gemini 2.5 Pro can cost up to ~$1.00 per request at the top of the input-price range; at 1,000 requests/day, that's up to $30,000/month in input tokens alone.
Strategies to reduce input tokens:
- Trim irrelevant context before sending to the model
- Use retrieval-augmented generation (RAG) to send only relevant chunks
- Cache system prompts when using models that support prompt caching
- Summarize long documents before including them as context
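The second strategy can be prototyped without any embedding infrastructure: split the document into chunks and keep only the ones most relevant to the question. A naive sketch using term overlap as a stand-in for real retrieval (a production RAG system would use an embedding model instead):

```python
# Naive relevance-based context trimming: send only the chunks that
# share the most terms with the question, instead of the whole document.
def top_chunks(document: str, question: str, k: int = 3, chunk_size: int = 500):
    """Split the document and keep the k chunks most relevant to the question."""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    q_terms = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_terms & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

doc = "billing export tables live in BigQuery. " * 200
context = top_chunks(doc, "Where do billing export tables live?", k=2)
print(sum(len(c) for c in context), "chars sent instead of", len(doc))
```

Even this crude filter caps input tokens per request at k x chunk_size, regardless of document length.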
### 5. Streaming when batch would work
The Gemini Batch API is significantly cheaper than the standard API for workloads that don't need real-time responses. If you're processing documents, generating reports, or running analysis jobs, batch mode can cut costs by 50%.
## Practical Optimization Strategies
### Model selection framework
Build a decision tree for your team:

```text
Is the task simple (classification, extraction, short answers)?
  → Start with Gemini 2.0 Flash

Does quality suffer with Flash?
  → Move to Gemini 2.5 Flash

Does the task require complex reasoning or long-form generation?
  → Use Gemini 2.5 Pro

Is the volume high (> 10K requests/day)?
  → Evaluate provisioned throughput pricing
```
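The decision tree above can be encoded as a small routing function so the default choice is the cheap one. The task categories and the 10K-requests/day threshold are illustrative, not an official taxonomy:

```python
def choose_model(task_type: str, needs_complex_reasoning: bool,
                 requests_per_day: int) -> dict:
    """Route a task to the cheapest plausible model per the decision tree."""
    if needs_complex_reasoning or task_type in {"code-generation", "multi-step-analysis"}:
        model = "gemini-2.5-pro"
    elif task_type in {"classification", "extraction", "short-qa"}:
        model = "gemini-2.0-flash"  # verify quality here first; escalate if it suffers
    else:
        model = "gemini-2.5-flash"
    return {"model": model,
            "evaluate_provisioned_throughput": requests_per_day > 10_000}

print(choose_model("classification", False, 50_000))
```

Centralizing this choice in one function also gives you a single place to log which model each workload actually used.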
### Prompt caching
For applications that use the same system prompt across many requests, Gemini's context caching reduces costs by caching the repeated prefix. Instead of paying full input token price for the system prompt every time, you pay a reduced rate for cached tokens.
This is especially impactful for applications with large system prompts (1K+ tokens) and high request volumes.
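To see why, compare monthly input cost with and without caching the system prompt. A back-of-the-envelope sketch — the 75% cached-token discount below is a placeholder, and real context caching also bills for cache storage time, which this ignores; check current pricing for your model:

```python
def monthly_input_cost(system_tokens: int, user_tokens: int,
                       requests_per_month: int,
                       input_price_per_1m: float,
                       cached_discount: float = 0.75):
    """Monthly input-token cost, with and without caching the system prompt.
    cached_discount is the assumed fraction saved on cached tokens (placeholder)."""
    full = (system_tokens + user_tokens) * requests_per_month \
        * input_price_per_1m / 1_000_000
    cached = (system_tokens * (1 - cached_discount) + user_tokens) \
        * requests_per_month * input_price_per_1m / 1_000_000
    return full, cached

# 4K-token system prompt, 500-token user messages, 300K requests/month
full, cached = monthly_input_cost(4_000, 500, 300_000, 0.15)
print(f"without caching ${full:.2f}/mo, with caching ${cached:.2f}/mo")
```

The bigger the ratio of system-prompt tokens to per-request tokens, the larger the saving.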
### Batch API for non-real-time workloads
If your use case doesn't need a response in seconds, use the Batch Prediction API. A sketch using the aiplatform SDK's batch-job creation (the job name is illustrative; check the current SDK docs, as the batch interface for Gemini models has evolved):

```python
from google.cloud import aiplatform

batch_job = aiplatform.BatchPredictionJob.create(
    job_display_name="document-processing-batch",
    model_name="publishers/google/models/gemini-2.5-flash",
    instances_format="jsonl",
    gcs_source=["gs://bucket/input.jsonl"],
    predictions_format="jsonl",
    gcs_destination_prefix="gs://bucket/output/",
    labels={"team": "analytics", "use_case": "document-processing"},
)
```
Batch prediction typically costs 50% less than online prediction and doesn't require a persistent endpoint.
### Right-sizing GPU/TPU for training
Don't default to the largest available GPU. Match the GPU to your model size:
| Model Parameters | Recommended GPU | Cost/hour |
|---|---|---|
| < 1B | NVIDIA T4 (16 GB) | ~$0.35 |
| 1-7B | NVIDIA L4 (24 GB) | ~$0.70 |
| 7-13B | NVIDIA A100 (40 GB) | ~$3.67 |
| 13B+ | NVIDIA A100 (80 GB) or H100 | $3.67 - $12.00+ |
### Set budget alerts

Configure budget alerts filtered to AI services:

- Billing → Budgets & alerts → Create budget
- Filter by service: Vertex AI, Cloud AI APIs
- Set thresholds at 50%, 80%, 100% of expected monthly spend
- Route alerts to your engineering channel, not just finance

Note that GCP budgets only notify — they don't cap spending. Treat the alerts as a trigger for human review, not a safety net.
## Monitoring AI Spend
### Daily cost trend (last 30 days)

```sql
SELECT
  DATE(usage_start_time) AS date,
  service.description AS service,
  ROUND(SUM(cost), 2) AS daily_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR sku.description LIKE '%Gemini%'
  )
GROUP BY date, service
ORDER BY date DESC;
```
Look for:
- Sudden spikes: Usually a new endpoint deployed or a batch job running
- Steady upward trend: Growing usage that needs capacity planning
- Weekend costs equal weekday costs: Likely idle endpoints running 24/7
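Spike detection doesn't have to stay manual: flag any day whose cost exceeds a multiple of the trailing average. A sketch over the `daily_cost` values the query above produces (the 7-day window and 1.5x threshold are arbitrary starting points — tune them to your noise level):

```python
def find_spikes(daily_costs: list, window: int = 7, threshold: float = 1.5) -> list:
    """Return indices of days whose cost exceeds `threshold` times the
    average of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = sum(daily_costs[i - window:i]) / window
        if baseline > 0 and daily_costs[i] > threshold * baseline:
            spikes.append(i)
    return spikes

costs = [100, 105, 98, 102, 99, 101, 103, 310, 104, 100]
print(find_spikes(costs))
```

Wire the flagged days into the same engineering channel as your budget alerts so spikes surface within a day, not at invoice time.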
### Forecasting AI spend growth
AI costs tend to grow faster than traditional cloud costs because:
- Teams add new use cases faster than they optimize existing ones
- Token volumes increase as adoption spreads across the organization
- Model upgrades (Flash → Pro) happen without cost review
Track your weekly token consumption alongside costs. If token volume grows 20%/month, your costs will grow at least that fast — and faster if teams migrate to more expensive models.
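Compounding is what makes this dangerous: 20% monthly growth nearly triples costs in six months. A quick projection helper to make the trajectory concrete:

```python
def project_monthly_cost(current_cost: float, monthly_growth: float,
                         months: int) -> list:
    """Compound-growth projection of monthly AI spend, month 0 through `months`."""
    return [round(current_cost * (1 + monthly_growth) ** m, 2)
            for m in range(months + 1)]

# $5,000/month today, token volume growing 20% per month:
print(project_monthly_cost(5_000, 0.20, 6))
```

If teams are simultaneously migrating to pricier models, apply the growth rate to token volume and the price change separately, then multiply.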
## Putting It All Together
### This week (30 minutes)
- [ ] Run the billing export query to identify all AI-related charges
- [ ] List all active Vertex AI endpoints and their monthly costs
- [ ] Check for forgotten notebooks running on GPUs
- [ ] Verify you're using the cheapest model that works for each use case
### This month (2-3 hours)
- [ ] Add labels to all Vertex AI endpoints and training jobs (team, model, use case)
- [ ] Set up budget alerts filtered to AI services
- [ ] Evaluate whether any Gemini Pro usage can be replaced with Flash
- [ ] Implement idle shutdown policies on all Workbench notebooks
- [ ] Switch non-real-time workloads to the Batch API
### This quarter
- [ ] Build a cost-per-request dashboard for each AI application
- [ ] Evaluate provisioned throughput vs pay-per-use for high-volume applications
- [ ] Implement prompt caching for applications with repeated system prompts
- [ ] Set up automated alerts for new endpoint deployments
### Expected results
| Optimization | Typical Savings |
|---|---|
| Model downgrade (Pro → Flash) | 70-90% per request |
| Remove idle endpoints | $500-2,000+/month per endpoint |
| Batch instead of online prediction | 50% on eligible workloads |
| Context window optimization | 20-60% on input token costs |
| Prompt caching | 30-50% on repeated system prompts |
| Right-sizing training GPUs | 30-70% on training costs |
Struggling to track AI spend across your GCP projects? GCP FinOps helps growing companies identify and eliminate cloud waste without enterprise complexity.