Serve Gemma Serverless on Google Cloud Run
Deploy Google's Gemma model on Cloud Run for serverless, auto-scaling LLM inference with pay-per-request pricing and zero idle costs.
TL;DR
- Scale-to-zero eliminates idle costs, with compute billed per use at $0.000024/vCPU-second
- Gemma 7B deployment costs $68/month for 100K requests versus $146/month for always-on VM
- INT8 quantization provides 2-3x faster CPU inference with up to 75% memory reduction versus FP32
- Cold starts take roughly 5-15 seconds for Gemma 2B and 15-30 seconds for Gemma 7B; min-instances keeps latency-critical endpoints warm
Introduction
Google's Gemma models provide high-quality LLM capabilities in compact sizes optimized for efficient inference. Gemma 2B delivers fast responses for lightweight tasks, Gemma 7B balances quality with performance, and Gemma 9B offers maximum capability for complex reasoning. All variants ship as instruction-tuned (-it) checkpoints and run efficiently on Cloud Run's serverless platform without requiring GPU acceleration.
Cloud Run eliminates infrastructure management through fully managed serverless containers. Applications scale automatically from zero to thousands of instances based on request volume, with cold starts typically in the 5-30 second range for Gemma models depending on size. Pay-per-request pricing charges only for actual computation time, rounded up to the nearest 100ms, making Cloud Run cost-effective for variable workloads where traffic patterns change throughout the day or week.
This guide covers Cloud Run deployment architecture, container optimization for fast cold starts, CPU-based inference optimization through quantization, auto-scaling configuration, cost analysis comparing serverless versus always-on infrastructure, and security patterns for production deployments. You'll learn to build and deploy Gemma inference servers, implement request batching for throughput optimization, configure IAM authentication, and monitor performance with Cloud Operations. These patterns enable production Gemma serving at 50-70% lower costs than VM-based deployments for variable traffic workloads.
Deployment Architecture and Model Selection
- Gemma 2B: ~2.5B parameters, ~5GB memory, ~300 tokens/sec on CPU; handles chat and classification
- Gemma 7B: 7B parameters, ~14GB memory, ~150 tokens/sec; serves general-purpose generation
- Gemma 9B: 9B parameters, ~18GB memory, ~120 tokens/sec; excels at complex reasoning and coding
Choose based on quality requirements and available Cloud Run memory limits (up to 32GB per instance).
Deployment Guide
Deploy Gemma to Cloud Run.
Build Container Image
# Dockerfile
FROM python:3.10-slim

# Install inference dependencies
RUN pip install --no-cache-dir \
    transformers \
    torch \
    accelerate \
    fastapi \
    uvicorn \
    gunicorn

# Copy application code. Model weights are downloaded from Hugging Face at
# container startup unless they are baked into the image; google/gemma-2b-it
# is a gated model, so an access token (e.g. the HF_TOKEN environment variable)
# must be available at download time.
WORKDIR /app
COPY inference_server.py .

# Cloud Run routes traffic to the port in $PORT (8080 by default)
EXPOSE 8080

# Serve the FastAPI (ASGI) app with gunicorn's uvicorn worker class
CMD exec gunicorn -k uvicorn.workers.UvicornWorker --bind :$PORT --workers 1 --timeout 0 inference_server:app
Inference server:
# inference_server.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os

app = FastAPI()

# Load the model once at startup so every request reuses the same weights
print("Loading Gemma model...")
model_name = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bfloat16 works on modern CPUs; float16 is poorly supported for CPU inference
    device_map="auto"
)
print("Model loaded successfully")

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.9

@app.post("/generate")
async def generate(request: GenerateRequest):
    try:
        inputs = tokenizer(request.prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model.generate(
                inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True
            )
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return {
            "generated_text": response,
            # Count only newly generated tokens, not the prompt
            "tokens_generated": outputs.shape[1] - inputs.input_ids.shape[1]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model": model_name}

@app.get("/")
async def root():
    return {"message": "Gemma API - send POST to /generate"}
Build and Push to Artifact Registry
# Set variables
PROJECT_ID="your-project-id"
REGION="us-central1"
REPOSITORY="llm-models"
IMAGE="gemma-inference"
# Create Artifact Registry repository
gcloud artifacts repositories create $REPOSITORY \
    --repository-format=docker \
    --location=$REGION \
    --description="LLM models"
# Build image
gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$IMAGE:latest
# Or build locally
docker build -t $REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$IMAGE:latest .
docker push $REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$IMAGE:latest
Deploy to Cloud Run
# Deploy service
gcloud run deploy gemma-api \
    --image=$REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$IMAGE:latest \
    --platform=managed \
    --region=$REGION \
    --memory=16Gi \
    --cpu=4 \
    --timeout=300 \
    --concurrency=10 \
    --min-instances=0 \
    --max-instances=10 \
    --allow-unauthenticated
# Get service URL
gcloud run services describe gemma-api --region=$REGION --format='value(status.url)'
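Once deployed, a quick smoke test confirms the service responds (the SERVICE_URL variable below is assumed to capture the URL from the previous command):
# Capture the URL and send a test prompt (works because the service allows unauthenticated access)
SERVICE_URL=$(gcloud run services describe gemma-api --region=$REGION --format='value(status.url)')
curl -X POST "$SERVICE_URL/generate" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a haiku about serverless computing", "max_tokens": 64}'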
Performance Optimization
Maximize throughput and reduce costs.
CPU Optimization
The deployments in this guide run on CPU-only Cloud Run instances, so optimize for CPU inference:
# INT8 post-training quantization with Optimum Intel
# (requires `pip install optimum[neural-compressor]`; shown as a sketch,
# see the Optimum Intel docs for the full calibration setup)
from transformers import AutoModelForCausalLM
from optimum.intel import INCQuantizer, INCModelForCausalLM
from neural_compressor.config import PostTrainingQuantConfig

# Quantize the full-precision model to INT8
model = AutoModelForCausalLM.from_pretrained(model_name)
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="static"),
    calibration_dataset=dataset,  # small representative calibration set
    save_directory="./gemma-int8"
)

# Load the quantized model for serving
model = INCModelForCausalLM.from_pretrained("./gemma-int8")
Benefits:
- 2-3x faster inference
- Up to 75% memory reduction versus FP32 (about 50% versus FP16)
- Lower Cloud Run costs
Request Batching
Implement batch endpoints to process multiple prompts simultaneously, improving throughput by 2-3x through parallel tokenization and generation with padding.
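A minimal sketch of such an endpoint, reusing the app, tokenizer, and model from inference_server.py; the /generate_batch route and BatchRequest model are illustrative names, not part of the server above:
# Batch endpoint: tokenize all prompts together with padding, generate in one call
from typing import List
from pydantic import BaseModel

class BatchRequest(BaseModel):
    prompts: List[str]
    max_tokens: int = 256

@app.post("/generate_batch")
async def generate_batch(request: BatchRequest):
    # Left-pad so new tokens are appended correctly for a decoder-only model
    tokenizer.padding_side = "left"
    inputs = tokenizer(request.prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=request.max_tokens,
            do_sample=False
        )
    return {
        "generated_texts": [
            tokenizer.decode(output, skip_special_tokens=True) for output in outputs
        ]
    }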
Caching Strategy
Configure min-instances=1 during business hours to keep containers warm, eliminating cold starts. Use Cloud Scheduler to scale to zero overnight for cost optimization.
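One way to implement that schedule is sketched below: two commands flip min-instances between 1 and 0, and Cloud Scheduler runs them at fixed times via a Cloud Build trigger, a small Cloud Function, or the Cloud Run Admin API (Cloud Scheduler cannot run gcloud directly). The times are illustrative.
# 08:00 - keep one instance warm for business hours
gcloud run services update gemma-api --region=$REGION --min-instances=1

# 20:00 - allow scale-to-zero overnight
gcloud run services update gemma-api --region=$REGION --min-instances=0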
Cost Analysis
Understand Cloud Run pricing.

Pricing components:
- vCPU: $0.00002400/vCPU-second
- Memory: $0.00000250/GiB-second
- Requests: $0.40/million requests
Example calculation (Gemma 7B):
- Configuration: 4 vCPU, 16GB memory
- Average inference: 5 seconds
- Traffic: 100,000 requests/month
Cost breakdown:
- vCPU: 100K × 5s × 4 × $0.000024 = $48
- Memory: 100K × 5s × 16 × $0.0000025 = $20
- Requests: 100K × $0.40/1M = $0.04
- Total: ~$68/month
Compare to VM (n1-standard-4, 24/7):
- VM cost: ~$146/month
- Savings: $78/month (53%)
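The same arithmetic as a small Python helper for plugging in your own traffic assumptions. Rates are copied from the list above and may change; the estimate assumes one request per instance at a time and ignores the monthly free tier.
# Rough Cloud Run cost estimate for a CPU-based Gemma service
def estimate_monthly_cost(requests_per_month, seconds_per_request, vcpus, memory_gib):
    VCPU_RATE = 0.000024              # $/vCPU-second
    MEMORY_RATE = 0.0000025           # $/GiB-second
    REQUEST_RATE = 0.40 / 1_000_000   # $/request

    compute_seconds = requests_per_month * seconds_per_request
    vcpu_cost = compute_seconds * vcpus * VCPU_RATE
    memory_cost = compute_seconds * memory_gib * MEMORY_RATE
    request_cost = requests_per_month * REQUEST_RATE
    return vcpu_cost + memory_cost + request_cost

# Gemma 7B example from above: prints ~68.04
print(estimate_monthly_cost(100_000, 5, 4, 16))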
Scaling Configuration
Optimize auto-scaling behavior.
Instance Limits
# Production configuration
gcloud run services update gemma-api \
    --region=$REGION \
    --min-instances=1 \
    --max-instances=50 \
    --concurrency=5 \
    --cpu-throttling \
    --execution-environment=gen2
Scaling parameters:
- concurrency: requests handled per instance
- Lower concurrency = better latency
- Higher concurrency = lower cost
Request Timeout
# Set appropriate timeout
gcloud run services update gemma-api \
    --region=$REGION \
    --timeout=300  # 5 minutes max
Recommendations:
- Short prompts: 60s timeout
- Long generation: 300s timeout
- Batch processing: 900s or longer (Cloud Run allows request timeouts up to 60 minutes)
Security and Authentication
Secure your Cloud Run service.
IAM Authentication
# Remove public access
gcloud run services remove-iam-policy-binding gemma-api \
    --region=$REGION \
    --member="allUsers" \
    --role="roles/run.invoker"

# Grant access to specific service account
gcloud run services add-iam-policy-binding gemma-api \
    --region=$REGION \
    --member="serviceAccount:api-client@project.iam.gserviceaccount.com" \
    --role="roles/run.invoker"
Client authentication:
import google.auth.transport.requests
from google.oauth2 import id_token
import requests

# The service URL is also the audience for the ID token
service_url = "https://gemma-api-xxxxx.run.app"

# Mint an ID token for the Cloud Run service using the service account
# available through Application Default Credentials
auth_req = google.auth.transport.requests.Request()
token = id_token.fetch_id_token(auth_req, service_url)

# Make an authenticated request
response = requests.post(
    f"{service_url}/generate",
    headers={"Authorization": f"Bearer {token}"},
    json={"prompt": "Explain quantum computing"}
)
API Key Protection
# Add API key validation
import os
from fastapi import Header, HTTPException

API_KEY = os.environ.get("API_KEY")

@app.post("/generate")
async def generate(
    request: GenerateRequest,
    x_api_key: str = Header(...)
):
    # Reject callers that don't present the shared key
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Process request...
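Callers then pass the key in the x-api-key header, for example (placeholder URL; the key is read from the environment):
curl -X POST "https://gemma-api-xxxxx.run.app/generate" \
    -H "x-api-key: $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Summarize the benefits of serverless inference"}'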
Monitoring with Cloud Operations
Track performance and errors.
Structured Logging
import logging
import time
from google.cloud import logging as cloud_logging

# Route standard-library logging to Cloud Logging
client = cloud_logging.Client()
client.setup_logging()
logger = logging.getLogger(__name__)

@app.post("/generate")
async def generate(request: GenerateRequest):
    start = time.time()

    # Generate response
    result = model.generate(...)

    latency = (time.time() - start) * 1000

    # Structured log entry; the Cloud Logging handler promotes "json_fields" into jsonPayload
    logger.info("Inference completed", extra={
        "json_fields": {
            "latency_ms": latency,
            "tokens": len(result[0]),
            "prompt_length": len(request.prompt),
            "model": model_name
        }
    })
    return result
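The resulting entries can then be filtered by service from the CLI, for example:
# Show the latest inference logs for the service
gcloud logging read \
    'resource.type="cloud_run_revision" AND resource.labels.service_name="gemma-api"' \
    --limit=10 \
    --format=json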
Custom Metrics
import time
from google.cloud import monitoring_v3

def write_latency_metric(latency_ms, project_id):
    # Write one data point to a custom metric attached to this Cloud Run revision
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{project_id}"

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/gemma/latency"
    series.resource.type = "cloud_run_revision"
    series.resource.labels["service_name"] = "gemma-api"
    series.resource.labels["project_id"] = project_id

    interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
    point = monitoring_v3.Point({"interval": interval, "value": {"double_value": latency_ms}})
    series.points = [point]

    client.create_time_series(name=project_name, time_series=[series])
Conclusion
Cloud Run provides optimal serverless infrastructure for Gemma models when traffic patterns vary and infrastructure management overhead must stay minimal. Pay-per-request pricing eliminates idle costs, making Cloud Run 50-70% cheaper than always-on VM deployments for workloads with <60% average utilization. Cold starts of roughly 5-30 seconds for Gemma 2B-7B still make scaling to zero practical during low-traffic periods, with only a modest first-request penalty.
Deploy Gemma 7B on Cloud Run with 4 vCPU and 16GB memory for balanced performance at ~$68/month for 100K requests, compared to $146/month for equivalent always-on VM capacity. Use INT8 quantization to reduce memory requirements and fit larger models within Cloud Run's 32GB limit. Configure min-instances=1 during business hours to eliminate cold starts for critical workloads while maintaining cost efficiency.
Cloud Run's serverless model excels for API endpoints, batch processing jobs, and development environments where infrastructure simplicity trumps maximum performance. For sustained high-volume traffic (>1M requests/month) or GPU-dependent workloads, consider GKE or Vertex AI Prediction. Start with Cloud Run for rapid deployment and migrate to managed infrastructure only when traffic patterns justify operational complexity.
Frequently Asked Questions
How do I optimize Gemma cold start times on Cloud Run?
Minimize cold starts with --min-instances=1, use smaller Gemma variants (2B starts in 5-15s versus 15-30s for 7B), enable CPU boost, and pre-load models during container build. For production, allocate 1-2 minimum instances for always-warm containers. Use Cloud Scheduler to ping endpoints every 5 minutes during off-peak hours to prevent complete scale-to-zero.
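A minimal sketch of that warm-instance configuration, assuming a gcloud release that supports the startup CPU boost flag:
# Keep two instances warm and boost CPU while containers start up
gcloud run services update gemma-api \
    --region=$REGION \
    --min-instances=2 \
    --cpu-boost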
What's the cost difference between Cloud Run and GKE for Gemma deployment?
Cloud Run costs less for variable workloads with idle time (pay only for requests, no idle costs). Gemma 7B serving 10M requests/month at 2 seconds each costs ~$280/month on Cloud Run (assuming ~10 concurrent requests share each instance, matching the concurrency setting above) versus ~$180/month for an always-on GKE e2-standard-8 node. Cloud Run becomes more expensive at >60% sustained utilization. Choose Cloud Run for bursty traffic and low-to-moderate volume. Choose GKE for consistent high-volume traffic or multi-model deployments.
Can I deploy larger Gemma models on Cloud Run?
No. Cloud Run's 32GB memory limit rules out the largest variants: Gemma 2 27B needs roughly 54GB at FP16, and 70B-class open models need ~140GB. Gemma 7B fits comfortably at 14-16GB. For larger models, use GKE with GPU nodes, Vertex AI Prediction with A100 GPUs, or aggressive INT4 quantization to shrink mid-sized checkpoints below the limit (with quality trade-offs).