
AI Cost Tracking on GCP: A Practical Guide to Vertex AI, Gemini API, and Model Spend

AI spend is the fastest-growing line item on GCP bills. This guide covers how to find AI costs in billing exports, understand Vertex AI and Gemini API pricing, allocate costs by model and team, and avoid the most common AI cost traps.

Matias Coca · 12 min read

AI spend is the fastest-growing line item on most GCP bills. Teams experimenting with Gemini, Vertex AI, or any of the Cloud AI services often have no idea what they're actually spending until the invoice arrives. The reason: AI costs are buried in billing exports under generic service names, split across multiple SKUs, and priced in units (tokens, characters, seconds) that don't map intuitively to dollars.

The good news is that once you understand the billing structure, tracking and optimizing AI costs follows the same principles as any other GCP service — labels, billing export queries, and architectural discipline. This guide covers the practical steps to get there.


Where AI Costs Hide in GCP Billing

The first challenge is finding your AI spend. GCP doesn't group all AI costs under a single "AI" category. Instead, they're scattered across several service names in billing exports:

| Billing Service Name | What It Covers |
|---|---|
| Vertex AI | Model training, online/batch predictions, endpoints, Vertex AI Studio, custom models |
| Cloud AI APIs | Vision API, Natural Language API, Translation API, Speech-to-Text, Text-to-Speech |
| Vertex AI Generative AI | Gemini API calls, PaLM (legacy), Imagen, embeddings |
| Compute Engine | GPUs/TPUs used for training or serving (billed as compute, not AI) |
| Cloud Storage | Training data, model artifacts, pipeline outputs |

The problem gets worse when you look at SKU descriptions. A charge for "Online Prediction - Gemini 2.5 Pro" looks different from "Custom Prediction - N1 with NVIDIA T4 GPU," but both are AI inference costs. And the GPU compute for training jobs shows up under Compute Engine, not Vertex AI — so a simple filter on "Vertex AI" misses a significant chunk of your actual AI spend.
```sql
SELECT
  service.description AS service,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS total_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE invoice.month = '202603'
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR service.description LIKE '%Natural Language%'
    OR sku.description LIKE '%GPU%'
    OR sku.description LIKE '%TPU%'
    OR sku.description LIKE '%Gemini%'
    OR sku.description LIKE '%PaLM%'
    OR sku.description LIKE '%Imagen%'
  )
GROUP BY service, sku
ORDER BY total_cost DESC;
```

Run this against your billing export to get the full picture of AI-related charges, including GPU compute that might be hiding under Compute Engine.


Understanding Vertex AI and Gemini API Pricing

AI pricing on GCP is fundamentally different from traditional cloud pricing. Instead of paying for compute time, you pay per token (for generative models) or per prediction (for custom models). Understanding the unit economics is essential for forecasting and optimization.

Gemini API pricing (input and output tokens)

Gemini models charge separately for input and output tokens. The price gap between models is significant:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Gemini 2.5 Pro | $1.25 - $10.00 | $10.00 - $30.00 | Complex reasoning, multi-step tasks |
| Gemini 2.5 Flash | $0.15 - $0.60 | $1.00 - $3.50 | High-volume, cost-sensitive workloads |
| Gemini 2.0 Flash | $0.10 | $0.40 | Simple tasks, high throughput |

Note: Prices vary based on context length thresholds. Gemini 2.5 Pro charges more for prompts exceeding 200K tokens.

The output-to-input price ratio matters more than absolute price. Output tokens cost 3-8x more than input tokens, so tasks that generate long responses (code generation, summarization) are disproportionately expensive.
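The unit economics are easiest to see with a small calculator. A minimal sketch, using the illustrative standard-tier prices from the table above (check the current Vertex AI price list before relying on them):

```python
# Rough per-request cost estimator for token-priced models.
# Prices are illustrative (USD per 1M tokens); verify against the
# current price list before using these for forecasting.
PRICES = {
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
    "gemini-2.5-flash": {"input": 0.15, "output": 1.00},
    "gemini-2.5-pro":   {"input": 1.25, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 2,000-token prompt producing a 500-token answer:
pro = request_cost("gemini-2.5-pro", 2000, 500)      # 0.0075
flash = request_cost("gemini-2.0-flash", 2000, 500)  # 0.0004
```

For this request shape, Pro costs roughly 19x more than 2.0 Flash, and the output tokens dominate the Pro bill despite being only a fifth of the volume.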

Custom model pricing

For custom-trained models or fine-tuned models deployed on Vertex AI endpoints:

| Component | Pricing |
|---|---|
| Training (GPUs) | Per GPU-hour ($1.10 - $12.00+/hr depending on GPU type) |
| Online prediction | Per node-hour (based on machine type + accelerator) |
| Batch prediction | Per node-hour (typically 40-60% cheaper than online) |
| Model storage | Per GB/month |

The critical cost here is the endpoint. A Vertex AI endpoint running 24/7 with an NVIDIA L4 GPU costs ~$800/month — whether you send it zero requests or a million. This is where the biggest waste happens.

Provisioned throughput vs pay-per-use

For high-volume Gemini API usage, GCP offers provisioned throughput — you buy a guaranteed capacity (tokens per minute) at a discount. This only makes sense if:

  • You have consistent, predictable traffic
  • Your volume is high enough that the commitment discount exceeds your average usage
  • You can accurately forecast demand
For most teams experimenting with AI, pay-per-use is the safer choice.
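The break-even condition is simple: provisioned capacity costs committed-tokens × rate × (1 − discount), while pay-per-use costs average-tokens × rate, so the commitment wins only when average utilization of the committed capacity exceeds (1 − discount). A sketch, where the discount figure is a hypothetical placeholder for whatever GCP quotes you:

```python
# Break-even check for provisioned throughput vs pay-per-use.
# The discount is a hypothetical placeholder -- substitute your
# actual quote from GCP.
def provisioned_wins(avg_tokens_per_min: float,
                     committed_tokens_per_min: float,
                     discount: float) -> bool:
    """Provisioned throughput pays off only if average utilization
    of the committed capacity exceeds (1 - discount)."""
    utilization = avg_tokens_per_min / committed_tokens_per_min
    return utilization > (1 - discount)

# At a hypothetical 30% discount you need >70% average utilization:
provisioned_wins(80_000, 100_000, discount=0.30)  # True: 80% > 70%
provisioned_wins(60_000, 100_000, discount=0.30)  # False: 60% < 70%
```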

Cost Allocation by Model and Team

Once you can find AI costs, the next step is attributing them — which model, which team, which use case is driving spend.

Using labels on Vertex AI resources

Label everything:

When creating a Vertex AI endpoint

```python
from google.cloud import aiplatform

endpoint = aiplatform.Endpoint.create(
    display_name="customer-support-gemini",
    labels={
        "team": "support",
        "model": "gemini-2-5-flash",
        "environment": "production",
        "use_case": "ticket-classification",
    },
)
```

When submitting a training job

```python
job = aiplatform.CustomTrainingJob(
    display_name="churn-model-v3",
    labels={
        "team": "data-science",
        "model_type": "custom",
        "environment": "production",
    },
    # ...
)
```

Labels propagate to billing exports, enabling cost queries by team, model, or use case.

Billing export query: AI spend by label

```sql
SELECT
  label.value AS team,
  service.description AS service,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS total_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS label
WHERE invoice.month = '202603'
  AND label.key = 'team'
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR sku.description LIKE '%Gemini%'
  )
GROUP BY team, service, sku
ORDER BY total_cost DESC;
```

Tracking Gemini API costs per application

If multiple applications call the Gemini API through the same project, costs blend together. Separate them by:

  1. Using different projects per application (cleanest, but adds management overhead)
  2. Using labels on API requests (if your framework supports passing labels through the Vertex AI SDK)
  3. Logging request metadata and joining with billing data post-hoc
The project-per-application approach is the most reliable because billing export natively groups by project. Labels work but require discipline to maintain.
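The third option — logging request metadata — can be as simple as emitting one structured record per API call. A minimal sketch; the field names and the log sink are assumptions to adapt to your own logging stack:

```python
import json
import time

# Sketch of per-request usage logging so token spend can be attributed
# per application post-hoc. Field names and the sink are assumptions --
# in practice you would ship these records to Cloud Logging or BigQuery.
def log_gemini_usage(app: str, model: str,
                     input_tokens: int, output_tokens: int,
                     sink=print):
    record = {
        "ts": time.time(),
        "app": app,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
    sink(json.dumps(record))

# After each API call, pull the token counts from the response's
# usage metadata and log them:
log_gemini_usage("support-bot", "gemini-2.5-flash", 1200, 350)
```

Joining these records against daily billing-export totals gives you a per-application cost split without per-app projects.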

Common AI Cost Traps

These are the mistakes we see most often. Each one is easy to make and expensive to ignore.

1. Over-provisioned endpoints sitting idle

This is the #1 source of AI cost waste. A Vertex AI endpoint with a GPU runs 24/7 whether it's serving requests or not. Teams deploy a model for testing, forget about it, and discover months later they've been paying $800+/month for an endpoint handling zero traffic.

How to find them (note: `resource.name` is only populated in the detailed billing export, not the standard one):

```sql
SELECT
  resource.name AS endpoint_name,
  ROUND(SUM(cost), 2) AS monthly_cost,
  COUNT(DISTINCT usage_start_time) AS billing_periods
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE service.description = 'Vertex AI'
  AND sku.description LIKE '%Prediction%'
  AND invoice.month = '202603'
GROUP BY endpoint_name
ORDER BY monthly_cost DESC;
```

Cross-reference these with actual traffic in Vertex AI monitoring. If an endpoint costs $500+/month but handles < 100 requests/day, it's a candidate for right-sizing or removal.

2. Using expensive models for simple tasks

Gemini 2.5 Pro is powerful, but it costs 10-30x more than Flash models. If you're using Pro for tasks like classification, extraction, or simple Q&A, you're likely overpaying.

A practical test: Run your task on Gemini 2.0 Flash first. If the quality is acceptable, you're done. If not, try 2.5 Flash. Only escalate to Pro for tasks that genuinely need complex reasoning.

| Task Type | Recommended Model | Why |
|---|---|---|
| Text classification | Gemini 2.0 Flash | Simple input/output, doesn't need reasoning |
| Summarization | Gemini 2.5 Flash | Needs some comprehension, but not complex reasoning |
| Code generation | Gemini 2.5 Pro | Complex reasoning significantly improves output quality |
| Data extraction | Gemini 2.0 Flash | Structured input/output, pattern matching |
| Multi-step analysis | Gemini 2.5 Pro | Requires chain-of-thought reasoning |

3. Forgotten training jobs and notebooks

GPU-backed Vertex AI Workbench notebooks and training jobs that aren't cleaned up keep the underlying compute running — and billing. A single notebook with an NVIDIA T4 costs ~$330/month if left running.

Set idle shutdown policies on all notebooks, and configure training jobs with maximum runtime limits.

4. Context window bloat

Every token you send to a model costs money. Teams often send entire documents when a summary or relevant excerpt would work. At standard-tier rates ($1.25 per 1M input tokens), a 100K-token prompt to Gemini 2.5 Pro costs ~$0.125 per request. If you're making 1,000 requests/day, that's roughly $3,750/month in input tokens alone — before a single output token is generated.

Strategies to reduce input tokens:

  • Trim irrelevant context before sending to the model
  • Use retrieval-augmented generation (RAG) to send only relevant chunks
  • Cache system prompts when using models that support prompt caching
  • Summarize long documents before including them as context
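One way to enforce this is a pre-flight budget check before each call. A sketch: the 4-characters-per-token heuristic and the `max_usd` threshold are rough assumptions (use the API's token-counting endpoint for exact counts):

```python
# Pre-flight guard: estimate prompt cost and refuse oversized contexts.
# The chars-per-token heuristic and price are assumptions; for exact
# counts, use the model's count-tokens API before sending.
PRO_INPUT_PRICE_PER_M = 1.25  # USD per 1M input tokens, standard tier

def estimated_prompt_cost(prompt: str) -> float:
    est_tokens = len(prompt) / 4  # crude heuristic for English text
    return est_tokens * PRO_INPUT_PRICE_PER_M / 1_000_000

def check_budget(prompt: str, max_usd: float = 0.05) -> None:
    cost = estimated_prompt_cost(prompt)
    if cost > max_usd:
        raise ValueError(
            f"Prompt would cost ~${cost:.4f}; trim context or use RAG")

check_budget("short prompt")  # well under budget, passes silently
```

A guard like this turns context bloat from a silent billing problem into a visible error at development time.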

5. Streaming when batch would work

The Gemini Batch API is significantly cheaper than the standard API for workloads that don't need real-time responses. If you're processing documents, generating reports, or running analysis jobs, batch mode can cut costs by 50%.


Practical Optimization Strategies

Model selection framework

Build a decision tree for your team:

Is the task simple (classification, extraction, short answers)?
  → Start with Gemini 2.0 Flash

Does quality suffer with Flash?
  → Move to Gemini 2.5 Flash

Does the task require complex reasoning or long-form generation?
  → Use Gemini 2.5 Pro

Is the volume high (>10K requests/day)?
  → Evaluate provisioned throughput pricing
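The decision tree above can be encoded as a small helper so model choice is a reviewed default rather than an ad-hoc decision. The task categories and names here are illustrative — tune them to your own workloads:

```python
# The decision tree above as code. Task categories are illustrative
# assumptions; adjust them to match your team's workload taxonomy.
def select_model(task: str, flash_quality_ok: bool = True) -> str:
    complex_tasks = {"code-generation", "multi-step-analysis"}
    if task in complex_tasks:
        return "gemini-2.5-pro"       # genuinely needs complex reasoning
    if flash_quality_ok:
        return "gemini-2.0-flash"     # cheapest model that works
    return "gemini-2.5-flash"         # quality fallback before Pro

select_model("text-classification")                    # 'gemini-2.0-flash'
select_model("summarization", flash_quality_ok=False)  # 'gemini-2.5-flash'
select_model("code-generation")                        # 'gemini-2.5-pro'
```

Centralizing this choice in one function also makes later downgrades (Pro → Flash) a one-line change instead of a hunt through every call site.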

Prompt caching

For applications that use the same system prompt across many requests, Gemini's context caching reduces costs by caching the repeated prefix. Instead of paying full input token price for the system prompt every time, you pay a reduced rate for cached tokens.

This is especially impactful for applications with large system prompts (1K+ tokens) and high request volumes.
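The savings scale with prompt size and request volume. A back-of-envelope sketch — the 75% cached-token discount is an illustrative assumption, so check the current price list for your model's actual cached-input rate:

```python
# Back-of-envelope monthly savings from caching a shared system prompt.
# The 75% cached-token discount is an illustrative assumption.
def caching_savings(system_prompt_tokens: int, requests_per_day: int,
                    input_price_per_m: float,
                    cached_discount: float = 0.75) -> float:
    """Monthly USD saved by caching the repeated system-prompt prefix."""
    monthly_tokens = system_prompt_tokens * requests_per_day * 30
    full_cost = monthly_tokens * input_price_per_m / 1_000_000
    return full_cost * cached_discount

# 4K-token system prompt, 50K requests/day, $0.15 per 1M input tokens:
caching_savings(4_000, 50_000, 0.15)  # 675.0 USD/month saved
```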

Batch API for non-real-time workloads

If your use case doesn't need a response in seconds, use the Batch Prediction API:

```python
from google.cloud import aiplatform

batch_job = aiplatform.BatchPredictionJob.create(
    job_display_name="document-processing-batch",
    model_name="publishers/google/models/gemini-2.5-flash",
    instances_format="jsonl",
    predictions_format="jsonl",
    gcs_source=["gs://bucket/input.jsonl"],
    gcs_destination_prefix="gs://bucket/output/",
    labels={"team": "analytics", "use_case": "document-processing"},
)
```

Batch prediction typically costs 50% less than online prediction and doesn't require a persistent endpoint.

Right-sizing GPU/TPU for training

Don't default to the largest available GPU. Match the GPU to your model size:

| Model Parameters | Recommended GPU | Cost/hour |
|---|---|---|
| < 1B | NVIDIA T4 (16 GB) | ~$0.35 |
| 1-7B | NVIDIA L4 (24 GB) | ~$0.70 |
| 7-13B | NVIDIA A100 (40 GB) | ~$3.67 |
| 13B+ | NVIDIA A100 (80 GB) or H100 | $3.67 - $12.00+ |

Start with the smallest GPU that fits your model in memory. Scale up only if training speed is unacceptable.
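A quick way to sanity-check "does it fit" before provisioning: estimate training memory per parameter. With Adam in mixed precision, a common rule of thumb is ~12 bytes per parameter (2 for weights, 2 for gradients, ~8 for fp32 optimizer state) plus activation overhead. This is a rough sizing heuristic, not a guarantee:

```python
# Rule-of-thumb GPU memory needed to train/fine-tune with Adam in
# mixed precision: ~12 bytes/parameter (weights + grads + optimizer
# state) plus an activation overhead factor. A heuristic, not a guarantee.
def training_memory_gb(params_billions: float,
                       bytes_per_param: int = 12,
                       activation_overhead: float = 1.2) -> float:
    return params_billions * bytes_per_param * activation_overhead

# A 7B model lands around ~100 GB -- beyond a single 40 GB A100, so
# plan for an 80 GB card, multiple GPUs, or memory-saving techniques
# (LoRA, gradient checkpointing) before reaching for bigger hardware.
training_memory_gb(7)  # ~100.8
```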

Set spending limits

Configure budget alerts filtered to AI services:

  1. Billing → Budgets & alerts → Create budget
  2. Filter by service: Vertex AI, Cloud AI APIs
  3. Set thresholds at 50%, 80%, 100% of expected monthly spend
  4. Route alerts to your engineering channel, not just finance

Monitoring AI Spend

Weekly cost trend query

```sql
SELECT
  DATE(usage_start_time) AS date,
  service.description AS service,
  ROUND(SUM(cost), 2) AS daily_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR sku.description LIKE '%Gemini%'
  )
GROUP BY date, service
ORDER BY date DESC;
```

Look for:

  • Sudden spikes: Usually a new endpoint deployed or a batch job running
  • Steady upward trend: Growing usage that needs capacity planning
  • Weekend costs equal weekday costs: Likely idle endpoints running 24/7

Forecasting AI spend growth

AI costs tend to grow faster than traditional cloud costs because:

  1. Teams add new use cases faster than they optimize existing ones
  2. Token volumes increase as adoption spreads across the organization
  3. Model upgrades (Flash → Pro) happen without cost review

Track your weekly token consumption alongside costs. If token volume grows 20%/month, your costs will grow at least that fast — and faster if teams migrate to more expensive models.
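Compounding makes 20%/month far worse than it sounds. A one-line projection makes the point:

```python
# Project monthly AI cost under steady compound growth.
def projected_cost(current_monthly: float, growth_rate: float,
                   months: int) -> float:
    return current_monthly * (1 + growth_rate) ** months

# $10K/month growing 20%/month is roughly $89K/month a year from now --
# an ~8.9x increase from compounding alone.
projected_cost(10_000, 0.20, 12)
```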


Putting It All Together

This week (30 minutes)

  • [ ] Run the billing export query to identify all AI-related charges
  • [ ] List all active Vertex AI endpoints and their monthly costs
  • [ ] Check for forgotten notebooks running on GPUs
  • [ ] Verify you're using the cheapest model that works for each use case

This month (2-3 hours)

  • [ ] Add labels to all Vertex AI endpoints and training jobs (team, model, use case)
  • [ ] Set up budget alerts filtered to AI services
  • [ ] Evaluate whether any Gemini Pro usage can be replaced with Flash
  • [ ] Implement idle shutdown policies on all Workbench notebooks
  • [ ] Switch non-real-time workloads to the Batch API

This quarter

  • [ ] Build a cost-per-request dashboard for each AI application
  • [ ] Evaluate provisioned throughput vs pay-per-use for high-volume applications
  • [ ] Implement prompt caching for applications with repeated system prompts
  • [ ] Set up automated alerts for new endpoint deployments

Expected results

| Optimization | Typical Savings |
|---|---|
| Model downgrade (Pro → Flash) | 70-90% per request |
| Remove idle endpoints | $500-2,000+/month per endpoint |
| Batch instead of online prediction | 50% on eligible workloads |
| Context window optimization | 20-60% on input token costs |
| Prompt caching | 30-50% on repeated system prompts |
| Right-sizing training GPUs | 30-70% on training costs |

Struggling to track AI spend across your GCP projects? GCP FinOps helps growing companies identify and eliminate cloud waste without enterprise complexity.


Written by Matias Coca

Building GCP cost optimization tools for growing companies. Questions or feedback? Let's connect.
