AI spend is the fastest-growing line item on most GCP bills. Teams experimenting with Gemini, Vertex AI, or any of the Cloud AI services often have no idea what they're actually spending until the invoice arrives. The reason: AI costs are buried in billing exports under generic service names, split across multiple SKUs, and priced in units (tokens, characters, seconds) that don't map intuitively to dollars.
The good news is that once you understand the billing structure, tracking and optimizing AI costs follows the same principles as any other GCP service — labels, billing export queries, and architectural discipline. This guide covers the practical steps to get there.
## Where AI Costs Hide in GCP Billing
The first challenge is finding your AI spend. GCP doesn't group all AI costs under a single "AI" category. Instead, they're scattered across several service names in billing exports:
| Billing Service Name | What It Covers |
|---|---|
| Vertex AI | Model training, online/batch predictions, endpoints, Vertex AI Studio, custom models |
| Cloud AI APIs | Vision API, Natural Language API, Translation API, Speech-to-Text, Text-to-Speech |
| Vertex AI Generative AI | Gemini API calls, PaLM (legacy), Imagen, embeddings |
| Compute Engine | GPUs/TPUs used for training or serving (billed as compute, not AI) |
| Cloud Storage | Training data, model artifacts, pipeline outputs |
A quick query to find all AI-related charges:

```sql
SELECT
  service.description AS service,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS total_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE invoice.month = '202603'
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR service.description LIKE '%Natural Language%'
    OR sku.description LIKE '%GPU%'
    OR sku.description LIKE '%TPU%'
    OR sku.description LIKE '%Gemini%'
    OR sku.description LIKE '%PaLM%'
    OR sku.description LIKE '%Imagen%'
  )
GROUP BY service, sku
ORDER BY total_cost DESC;
```
Run this against your billing export to get the full picture of AI-related charges, including GPU compute that might be hiding under Compute Engine.
## Understanding Vertex AI and Gemini API Pricing
AI pricing on GCP is fundamentally different from traditional cloud pricing. Instead of paying for compute time, you pay per token (for generative models) or per prediction (for custom models). Understanding the unit economics is essential for forecasting and optimization.
### Gemini API pricing (input and output tokens)
Gemini models charge separately for input and output tokens. The price gap between models is significant:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Gemini 2.5 Pro | $1.25 - $10.00 | $10.00 - $30.00 | Complex reasoning, multi-step tasks |
| Gemini 2.5 Flash | $0.15 - $0.60 | $1.00 - $3.50 | High-volume, cost-sensitive workloads |
| Gemini 2.0 Flash | $0.10 | $0.40 | Simple tasks, high throughput |
The output-to-input price ratio matters more than absolute price. Output tokens cost 3-8x more than input tokens, so tasks that generate long responses (code generation, summarization) are disproportionately expensive.
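These unit prices translate directly into per-request cost: (input tokens x input price + output tokens x output price) / 1M. A minimal estimator, seeded with the lower-bound list prices from the table above (illustrative figures only — verify against current GCP pricing before relying on them):

```python
# Rough per-request cost estimator for Gemini models.
# Prices are illustrative lower-bound figures from the table above,
# in dollars per 1M tokens -- check current GCP pricing before use.
PRICES = {
    "gemini-2.5-pro":   {"input": 1.25, "output": 10.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 1.00},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 10K-token prompt with a 2K-token response, across the three models:
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 2_000):.5f}")
```

Running the numbers this way before choosing a model makes the output-token penalty visible: doubling the response length moves cost far more than doubling the prompt.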
### Custom model pricing
For custom-trained models or fine-tuned models deployed on Vertex AI endpoints:
| Component | Pricing |
|---|---|
| Training (GPUs) | Per GPU-hour ($1.10 - $12.00+/hr depending on GPU type) |
| Online prediction | Per node-hour (based on machine type + accelerator) |
| Batch prediction | Per node-hour (typically 40-60% cheaper than online) |
| Model storage | Per GB/month |
### Provisioned throughput vs pay-per-use
For high-volume Gemini API usage, GCP offers provisioned throughput — you buy a guaranteed capacity (tokens per minute) at a discount. This only makes sense if:
- You have consistent, predictable traffic
- Your sustained volume is high enough that paying for committed capacity costs less than paying per token for the same traffic
- You can accurately forecast demand
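The break-even question can be made concrete: a commitment only wins if your utilization of the committed capacity stays above the ratio of the committed price to the equivalent pay-per-use price. A sketch with entirely hypothetical numbers (substitute your actual quote and observed traffic):

```python
# Break-even sketch for provisioned throughput vs pay-per-use.
# All figures below are hypothetical placeholders.
def breakeven_utilization(committed_cost_per_hour: float,
                          payg_cost_per_1m_tokens: float,
                          committed_tokens_per_minute: int) -> float:
    """Fraction of committed capacity you must actually consume for the
    commitment to beat pay-per-use pricing."""
    tokens_per_hour = committed_tokens_per_minute * 60
    payg_cost_at_full_capacity = tokens_per_hour / 1_000_000 * payg_cost_per_1m_tokens
    return committed_cost_per_hour / payg_cost_at_full_capacity

# e.g. a $50/hr commitment for 1M tokens/minute vs $1.25 per 1M tokens on demand
util = breakeven_utilization(50.0, 1.25, 1_000_000)
print(f"Commitment wins above {util:.0%} utilization")
```

If your traffic is bursty and average utilization sits below that threshold, pay-per-use is cheaper despite the higher unit price.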
## Cost Allocation by Model and Team
Once you can find AI costs, the next step is attributing them — which model, which team, which use case is driving spend.
### Using labels on Vertex AI resources
Label everything:
```python
from google.cloud import aiplatform

# When creating a Vertex AI endpoint
endpoint = aiplatform.Endpoint.create(
    display_name="customer-support-gemini",
    labels={
        "team": "support",
        "model": "gemini-2-5-flash",
        "environment": "production",
        "use_case": "ticket-classification",
    },
)

# When submitting a training job
job = aiplatform.CustomTrainingJob(
    display_name="churn-model-v3",
    labels={
        "team": "data-science",
        "model_type": "custom",
        "environment": "production",
    },
    # ...
)
```
Labels propagate to billing exports, enabling cost queries by team, model, or use case.
### Billing export query: AI spend by label

```sql
SELECT
  labels.value AS team,
  service.description AS service,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS total_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS labels
WHERE invoice.month = '202603'
  AND labels.key = 'team'
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR sku.description LIKE '%Gemini%'
  )
GROUP BY team, service, sku
ORDER BY total_cost DESC;
```
### Tracking Gemini API costs per application
If multiple applications call the Gemini API through the same project, costs blend together. Separate them by:
- Using different projects per application (cleanest, but adds management overhead)
- Using labels on API requests (if your framework supports passing labels through the Vertex AI SDK)
- Logging request metadata and joining with billing data post-hoc
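The third option — logging metadata per request and reconciling later — can start as simply as recording token counts tagged by application at call time. A minimal in-process sketch (the price figures and record fields are illustrative; in practice you'd write these records to BigQuery or Cloud Logging and join against the billing export):

```python
import time
from collections import defaultdict

# Illustrative per-1M-token prices (input, output); verify against current pricing.
PRICE_PER_1M = {"gemini-2.5-flash": (0.15, 1.00)}

usage_log = []

def record_usage(app: str, model: str, input_tokens: int, output_tokens: int):
    """Record one API call's token usage, tagged by application."""
    in_price, out_price = PRICE_PER_1M[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    usage_log.append({"ts": time.time(), "app": app, "model": model,
                      "input_tokens": input_tokens,
                      "output_tokens": output_tokens, "est_cost": cost})

def cost_by_app() -> dict:
    """Aggregate estimated spend per application."""
    totals = defaultdict(float)
    for row in usage_log:
        totals[row["app"]] += row["est_cost"]
    return dict(totals)

record_usage("support-bot", "gemini-2.5-flash", 4_000, 500)
record_usage("doc-search", "gemini-2.5-flash", 12_000, 300)
print(cost_by_app())
```

The estimated costs won't match the invoice to the cent, but the per-application ratios are what you need for attribution.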
## Common AI Cost Traps
These are the mistakes we see most often. Each one is easy to make and expensive to ignore.
### 1. Over-provisioned endpoints sitting idle
This is the #1 source of AI cost waste. A Vertex AI endpoint with a GPU runs 24/7 whether it's serving requests or not. Teams deploy a model for testing, forget about it, and discover months later they've been paying $800+/month for an endpoint handling zero traffic.
How to find them (note this query reads `resource.name`, which is only available in the detailed billing export, not the standard one):

```sql
SELECT
  resource.name AS endpoint_name,
  ROUND(SUM(cost), 2) AS monthly_cost,
  COUNT(DISTINCT usage_start_time) AS billing_periods
FROM `project.dataset.gcp_billing_export_resource_v1_XXXXXX`
WHERE service.description = 'Vertex AI'
  AND sku.description LIKE '%Prediction%'
  AND invoice.month = '202603'
GROUP BY endpoint_name
ORDER BY monthly_cost DESC;
```
Cross-reference these with actual traffic in Vertex AI monitoring. If an endpoint costs $500+/month but handles < 100 requests/day, it's a candidate for right-sizing or removal.
### 2. Using expensive models for simple tasks
Gemini 2.5 Pro is powerful, but it costs 10-30x more than Flash models. If you're using Pro for tasks like classification, extraction, or simple Q&A, you're likely overpaying.
A practical test: Run your task on Gemini 2.0 Flash first. If the quality is acceptable, you're done. If not, try 2.5 Flash. Only escalate to Pro for tasks that genuinely need complex reasoning.
| Task Type | Recommended Model | Why |
|---|---|---|
| Text classification | Gemini 2.0 Flash | Simple input/output, doesn't need reasoning |
| Summarization | Gemini 2.5 Flash | Needs some comprehension, but not complex reasoning |
| Code generation | Gemini 2.5 Pro | Complex reasoning significantly improves output quality |
| Data extraction | Gemini 2.0 Flash | Structured input/output, pattern matching |
| Multi-step analysis | Gemini 2.5 Pro | Requires chain-of-thought reasoning |
### 3. Forgotten training jobs and notebooks
GPU-backed Vertex AI Workbench notebooks and training jobs that aren't cleaned up keep the underlying compute running — and billing. A single notebook with an NVIDIA T4 costs ~$330/month if left running.
Set idle shutdown policies on all notebooks, and configure training jobs with maximum runtime limits.
### 4. Context window bloat
Every token you send to a model costs money. Teams often send entire documents when a summary or a relevant excerpt would do. A 100K-token prompt to Gemini 2.5 Pro can cost up to ~$1.00 per request at the top of the input-price range; at 1,000 requests/day, that's up to $30,000/month in input tokens alone.
Strategies to reduce input tokens:
- Trim irrelevant context before sending to the model
- Use retrieval-augmented generation (RAG) to send only relevant chunks
- Cache system prompts when using models that support prompt caching
- Summarize long documents before including them as context
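The second strategy can be prototyped without any embedding infrastructure: split the document into chunks and keep only the ones most relevant to the question. A naive sketch using term overlap as a stand-in for real retrieval (a production RAG system would use an embedding model instead):

```python
# Naive relevance-based context trimming: send only the chunks that
# share the most terms with the question, instead of the whole document.
def top_chunks(document: str, question: str, k: int = 3, chunk_size: int = 500):
    """Split the document and keep the k chunks most relevant to the question."""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    q_terms = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_terms & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

doc = "billing export tables live in BigQuery. " * 200
context = top_chunks(doc, "Where do billing export tables live?", k=2)
print(sum(len(c) for c in context), "chars sent instead of", len(doc))
```

Even this crude filter caps input tokens per request at k x chunk_size, regardless of document length.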
### 5. Streaming when batch would work
The Gemini Batch API is significantly cheaper than the standard API for workloads that don't need real-time responses. If you're processing documents, generating reports, or running analysis jobs, batch mode can cut costs by 50%.
## Practical Optimization Strategies
### Model selection framework
Build a decision tree for your team:

```text
Is the task simple (classification, extraction, short answers)?
  → Start with Gemini 2.0 Flash

Does quality suffer with Flash?
  → Move to Gemini 2.5 Flash

Does the task require complex reasoning or long-form generation?
  → Use Gemini 2.5 Pro

Is the volume high (> 10K requests/day)?
  → Evaluate provisioned throughput pricing
```
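The decision tree above can be encoded as a small routing function so the default choice is the cheap one. The task categories and the 10K-requests/day threshold are illustrative, not an official taxonomy:

```python
def choose_model(task_type: str, needs_complex_reasoning: bool,
                 requests_per_day: int) -> dict:
    """Route a task to the cheapest plausible model per the decision tree."""
    if needs_complex_reasoning or task_type in {"code-generation", "multi-step-analysis"}:
        model = "gemini-2.5-pro"
    elif task_type in {"classification", "extraction", "short-qa"}:
        model = "gemini-2.0-flash"  # verify quality here first; escalate if it suffers
    else:
        model = "gemini-2.5-flash"
    return {"model": model,
            "evaluate_provisioned_throughput": requests_per_day > 10_000}

print(choose_model("classification", False, 50_000))
```

Centralizing this choice in one function also gives you a single place to log which model each workload actually used.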
### Prompt caching
For applications that use the same system prompt across many requests, Gemini's context caching reduces costs by caching the repeated prefix. Instead of paying full input token price for the system prompt every time, you pay a reduced rate for cached tokens.
This is especially impactful for applications with large system prompts (1K+ tokens) and high request volumes.
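To see why, compare monthly input cost with and without caching the system prompt. A back-of-the-envelope sketch — the 75% cached-token discount below is a placeholder, and real context caching also bills for cache storage time, which this ignores; check current pricing for your model:

```python
def monthly_input_cost(system_tokens: int, user_tokens: int,
                       requests_per_month: int,
                       input_price_per_1m: float,
                       cached_discount: float = 0.75):
    """Monthly input-token cost, with and without caching the system prompt.
    cached_discount is the assumed fraction saved on cached tokens (placeholder)."""
    full = (system_tokens + user_tokens) * requests_per_month \
        * input_price_per_1m / 1_000_000
    cached = (system_tokens * (1 - cached_discount) + user_tokens) \
        * requests_per_month * input_price_per_1m / 1_000_000
    return full, cached

# 4K-token system prompt, 500-token user messages, 300K requests/month
full, cached = monthly_input_cost(4_000, 500, 300_000, 0.15)
print(f"without caching ${full:.2f}/mo, with caching ${cached:.2f}/mo")
```

The bigger the ratio of system-prompt tokens to per-request tokens, the larger the saving.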
### Batch API for non-real-time workloads
If your use case doesn't need a response in seconds, use the Batch Prediction API. A sketch using the aiplatform SDK's batch-job creation (the job name is illustrative; check the current SDK docs, as the batch interface for Gemini models has evolved):

```python
from google.cloud import aiplatform

batch_job = aiplatform.BatchPredictionJob.create(
    job_display_name="document-processing-batch",
    model_name="publishers/google/models/gemini-2.5-flash",
    instances_format="jsonl",
    gcs_source=["gs://bucket/input.jsonl"],
    predictions_format="jsonl",
    gcs_destination_prefix="gs://bucket/output/",
    labels={"team": "analytics", "use_case": "document-processing"},
)
```
Batch prediction typically costs 50% less than online prediction and doesn't require a persistent endpoint.
### Right-sizing GPU/TPU for training
Don't default to the largest available GPU. Match the GPU to your model size:
| Model Parameters | Recommended GPU | Cost/hour |
|---|---|---|
| < 1B | NVIDIA T4 (16 GB) | ~$0.35 |
| 1-7B | NVIDIA L4 (24 GB) | ~$0.70 |
| 7-13B | NVIDIA A100 (40 GB) | ~$3.67 |
| 13B+ | NVIDIA A100 (80 GB) or H100 | $3.67 - $12.00+ |
### Set budget alerts

Configure budget alerts filtered to AI services:

- Billing → Budgets & alerts → Create budget
- Filter by service: Vertex AI, Cloud AI APIs
- Set thresholds at 50%, 80%, 100% of expected monthly spend
- Route alerts to your engineering channel, not just finance

Note that GCP budgets only notify — they don't cap spending. Treat the alerts as a trigger for human review, not a safety net.
## Monitoring AI Spend
### Daily cost trend (last 30 days)

```sql
SELECT
  DATE(usage_start_time) AS date,
  service.description AS service,
  ROUND(SUM(cost), 2) AS daily_cost
FROM `project.dataset.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND (
    service.description LIKE '%Vertex AI%'
    OR service.description LIKE '%Cloud AI%'
    OR sku.description LIKE '%Gemini%'
  )
GROUP BY date, service
ORDER BY date DESC;
```
Look for:
- Sudden spikes: Usually a new endpoint deployed or a batch job running
- Steady upward trend: Growing usage that needs capacity planning
- Weekend costs equal weekday costs: Likely idle endpoints running 24/7
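Spike detection doesn't have to stay manual: flag any day whose cost exceeds a multiple of the trailing average. A sketch over the `daily_cost` values the query above produces (the 7-day window and 1.5x threshold are arbitrary starting points — tune them to your noise level):

```python
def find_spikes(daily_costs: list, window: int = 7, threshold: float = 1.5) -> list:
    """Return indices of days whose cost exceeds `threshold` times the
    average of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_costs)):
        baseline = sum(daily_costs[i - window:i]) / window
        if baseline > 0 and daily_costs[i] > threshold * baseline:
            spikes.append(i)
    return spikes

costs = [100, 105, 98, 102, 99, 101, 103, 310, 104, 100]
print(find_spikes(costs))
```

Wire the flagged days into the same engineering channel as your budget alerts so spikes surface within a day, not at invoice time.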
### Forecasting AI spend growth
AI costs tend to grow faster than traditional cloud costs because:
- Teams add new use cases faster than they optimize existing ones
- Token volumes increase as adoption spreads across the organization
- Model upgrades (Flash → Pro) happen without cost review
Track your weekly token consumption alongside costs. If token volume grows 20%/month, your costs will grow at least that fast — and faster if teams migrate to more expensive models.
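Compounding is what makes this dangerous: 20% monthly growth nearly triples costs in six months. A quick projection helper to make the trajectory concrete:

```python
def project_monthly_cost(current_cost: float, monthly_growth: float,
                         months: int) -> list:
    """Compound-growth projection of monthly AI spend, month 0 through `months`."""
    return [round(current_cost * (1 + monthly_growth) ** m, 2)
            for m in range(months + 1)]

# $5,000/month today, token volume growing 20% per month:
print(project_monthly_cost(5_000, 0.20, 6))
```

If teams are simultaneously migrating to pricier models, apply the growth rate to token volume and the price change separately, then multiply.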
## Putting It All Together
### This week (30 minutes)
- [ ] Run the billing export query to identify all AI-related charges
- [ ] List all active Vertex AI endpoints and their monthly costs
- [ ] Check for forgotten notebooks running on GPUs
- [ ] Verify you're using the cheapest model that works for each use case
### This month (2-3 hours)
- [ ] Add labels to all Vertex AI endpoints and training jobs (team, model, use case)
- [ ] Set up budget alerts filtered to AI services
- [ ] Evaluate whether any Gemini Pro usage can be replaced with Flash
- [ ] Implement idle shutdown policies on all Workbench notebooks
- [ ] Switch non-real-time workloads to the Batch API
### This quarter
- [ ] Build a cost-per-request dashboard for each AI application
- [ ] Evaluate provisioned throughput vs pay-per-use for high-volume applications
- [ ] Implement prompt caching for applications with repeated system prompts
- [ ] Set up automated alerts for new endpoint deployments
### Expected results
| Optimization | Typical Savings |
|---|---|
| Model downgrade (Pro → Flash) | 70-90% per request |
| Remove idle endpoints | $500-2,000+/month per endpoint |
| Batch instead of online prediction | 50% on eligible workloads |
| Context window optimization | 20-60% on input token costs |
| Prompt caching | 30-50% on repeated system prompts |
| Right-sizing training GPUs | 30-70% on training costs |
Struggling to track AI spend across your GCP projects? GCP FinOps helps growing companies identify and eliminate cloud waste without enterprise complexity.