Explicit Context Caching with Google Gemini
TL;DR: use context/prompt caching and get:
- Much lower TTFT (time-to-first-token) latency. I've observed it go from the order of seconds to hundreds of milliseconds.
- Cost savings that encourage and enable longer prompts.
- Longer prompts with more examples enable smaller/cheaper models, further saving costs and improving TTFT.
OpenAI (including OpenAI models hosted on Azure) performs context caching automatically for supported models.
For Gemini, however, we had trouble getting implicit (automatic) context caching to work reliably. [I'm not alone](https://github.com/googleapis/python-genai/issues/1880), and we have observed this issue since mid-2025. Below is an example of how to cache explicitly instead.
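To check whether implicit caching is being applied at all, you can inspect usageMetadata on each response. A minimal sketch, where longSystemPrompt is a placeholder for your own large prompt:

import { GoogleGenAI } from "@google/genai";

const genAI = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });
const longSystemPrompt = "..."; // placeholder: must exceed the 1024-token caching minimum

// Send the same large prompt twice. If implicit caching kicks in, the second
// response should report a non-zero cachedContentTokenCount.
for (let i = 1; i <= 2; i++) {
  const response = await genAI.models.generateContent({
    model: "gemini-2.5-flash",
    config: { systemInstruction: longSystemPrompt },
    contents: [{ role: "user", parts: [{ text: "Hello!" }] }],
  });
  console.log(`call ${i}:`, response.usageMetadata?.cachedContentTokenCount);
}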
Cost and Latency
On one of our prompts, we saw TTFT go from a few seconds to a few hundred milliseconds.
Cost is an obvious and major factor. A 90% saving in cost can be the difference between positive and negative unit economics, and is therefore essential for profitability and sustainability.
Combine that with the assumption that the price per unit of intelligence will likely continue to fall (10x every 6 months 🤯), and we can trade those cost savings for performance.
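To make that concrete, here is some illustrative arithmetic; every number below is made up, so substitute your model's actual rates:

// Illustrative unit economics only - all numbers are hypothetical.
const inputPricePerMTok = 0.3; // $ per 1M input tokens (made-up rate)
const cachedDiscount = 0.9; // cached tokens billed at ~10% of the normal rate
const promptTokens = 50_000; // shared system prompt + few-shot examples
const requestsPerDay = 100_000;

const uncachedPerDay = ((promptTokens * requestsPerDay) / 1e6) * inputPricePerMTok;
const cachedPerDay = uncachedPerDay * (1 - cachedDiscount);
console.log(`uncached: $${uncachedPerDay.toFixed(0)}/day, cached: $${cachedPerDay.toFixed(0)}/day`);
// uncached: $1500/day, cached: $150/day (cache storage is billed separately)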
Considerations
- Cached tokens have a storage cost. Be mindful: it can add up. Delete caches when they're no longer needed.
- The local in-memory cache doesn't persist across server restarts. The wrapper below recovers by looking up caches in Google's service by displayName, at the cost of an extra API call after each restart.
- This example is not suitable for distributed systems as-is. Each instance maintains its own local cache, and concurrent instances can race to create duplicate caches in Google's service.
- This example does not automatically clean up expired caches from Google's service. Consider implementing a cleanup job (the sketch after this list shows one approach) or relying on TTL expiration.
- It's useful to version your prompt names (e.g., with your build number) so caches are automatically invalidated when prompts change.
- However, be mindful to clean up the old caches appropriately to save costs.
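For the last three points, here is a sketch of versioned prompt names plus a cleanup job. BUILD_NUMBER and the -v suffix convention are assumptions for illustration, not part of the wrapper below:

import { GoogleGenAI } from "@google/genai";

const genAI = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });
const currentBuild = process.env.BUILD_NUMBER ?? "dev"; // assumed deploy-time env var

// Versioned prompt name: a new build automatically gets a fresh cache.
export const promptName = `my-assistant-prompt-v${currentBuild}`;

// Cleanup job: delete caches whose displayName belongs to an older build.
export async function deleteStaleCaches(): Promise<void> {
  const pager = await genAI.caches.list({ config: { pageSize: 1000 } });
  for await (const cache of pager) {
    if (cache.name && !cache.displayName?.endsWith(`-v${currentBuild}`)) {
      await genAI.caches.delete({ name: cache.name });
    }
  }
}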
Usage
import { GoogleGenAI, GenerateContentParameters } from "@google/genai";
const genAI = new GoogleGenAI({ apiKey: process.env.GOOGLE_API_KEY });
const cache = new GeminiContextCache(86400); // 1 day TTL
const params: GenerateContentParameters = {
model: "gemini-2.5-flash",
config: {
systemInstruction: "You are a helpful assistant that...", // Your long system prompt
},
contents: [{ role: "user", parts: [{ text: "Hello!" }] }],
};
// First call to `applyCache` creates it, subsequent calls reuse it
const cachedParams = await cache.applyCache(genAI, params, "my-assistant-prompt");
const response = await genAI.models.generateContent(cachedParams);
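To verify the latency win, you can time the first streamed chunk with and without the cached params. A rough sketch, reusing cachedParams from above:

const start = performance.now();
const stream = await genAI.models.generateContentStream(cachedParams);
for await (const _chunk of stream) {
  console.log(`TTFT: ${Math.round(performance.now() - start)}ms`);
  break; // only the first chunk matters for TTFT
}

Implementation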
import {
CachedContent,
GenerateContentParameters,
GoogleGenAI,
Tool,
} from "@google/genai";
/**
* A wrapper for Google Gemini's explicit context caching.
*/
class GeminiContextCache {
/** Local in-memory cache */
private cachedContents: Map<string, CachedContent> = new Map();
/** Tracks prompts that don't meet Gemini's 1024 token minimum for caching */
private promptsWithTooFewTokens: Set<string> = new Set();
/** Time-to-live for cached content (e.g., "86400s" for 1 day) */
private readonly cacheTtl: string;
/**
* @param cacheTtlSeconds How long caches should live in seconds. Default is 1 day.
*/
constructor(cacheTtlSeconds: number = 86400) {
this.cacheTtl = `${cacheTtlSeconds}s`;
}
/**
* Applies cached content to generation parameters if available.
*
* This is the main entry point. Call this before generateContent() to
* automatically use caching for your system instruction and tools.
*
* Cache lookup order:
* 1. Local in-memory cache (fast path, no API call)
* 2. Google's cache service (by displayName)
* 3. Create new cache if none found
*
* @param genAI - The GoogleGenAI client instance
* @param params - Original generation parameters with systemInstruction
* @param promptName - A unique name for this prompt (e.g., "my-assistant-prompt")
* @returns Modified params with cachedContent set, ready for generateContent()
*/
async applyCache(
genAI: GoogleGenAI,
params: GenerateContentParameters,
promptName: string
): Promise<GenerateContentParameters> {
    const newParams = structuredClone(params);
    if (!newParams.config) {
      // Nothing to cache without a config (no systemInstruction or tools)
      return params;
    }
newParams.config.cachedContent = await this.getOrCreateCache(
genAI,
params,
promptName
);
// When using cachedContent, these fields must be unset - they're already in the cache
if (newParams.config.cachedContent) {
newParams.config.toolConfig = undefined;
newParams.config.tools = undefined;
newParams.config.systemInstruction = undefined;
}
return newParams;
}
/**
* Retrieves an existing cache or creates a new one.
* Checks local memory first, then Google's service, then creates if needed.
*/
private async getOrCreateCache(
genAI: GoogleGenAI,
params: GenerateContentParameters,
promptName: string
): Promise<string | undefined> {
const cacheKey = `${params.model}-${promptName}`;
// Fast path: check local in-memory cache (no API call)
const localCache = this.cachedContents.get(cacheKey);
if (localCache && this.isCacheValid(localCache)) {
return localCache.name;
}
// Skip API calls if we already know the prompt is too short
if (this.promptsWithTooFewTokens.has(cacheKey)) {
return undefined;
}
// Check Google's cache service for an existing cache
const existingCache = await this.findCacheByDisplayName(genAI, cacheKey);
if (existingCache && this.isCacheValid(existingCache)) {
this.cachedContents.set(cacheKey, existingCache);
return existingCache.name;
}
// No valid cache exists - create a new one
return this.createCache(genAI, params, cacheKey);
}
/**
* Checks if a cache is still valid with a 5-minute safety buffer.
* The buffer prevents using a cache that might expire mid-request.
*/
private isCacheValid(cache: CachedContent): boolean {
if (!cache.expireTime) return true;
const expirationDate = new Date(cache.expireTime);
const safetyBufferMs = 5 * 60 * 1000; // 5 minutes
return expirationDate.getTime() - Date.now() > safetyBufferMs;
}
/**
* Searches Google's cache service for a cache matching our displayName.
* This allows cache reuse across application restarts.
*/
private async findCacheByDisplayName(
genAI: GoogleGenAI,
displayName: string
): Promise<CachedContent | undefined> {
try {
const caches = await genAI.caches.list({ config: { pageSize: 1000 } });
for await (const cache of caches) {
if (cache.displayName === displayName) {
return cache;
}
}
} catch {
return undefined;
}
return undefined;
}
/**
* Creates a new cache in Google's service.
*
* Note: Gemini requires at least 1024 tokens for caching. If your system
* instruction is shorter, caching will be skipped and the prompt is
* remembered to avoid repeated token counting.
*/
private async createCache(
genAI: GoogleGenAI,
params: GenerateContentParameters,
cacheKey: string
): Promise<string | undefined> {
const tools = this.extractFunctionDeclarationTools(params);
try {
// Count tokens to check if we meet the 1024 minimum
const tokens = await genAI.models.countTokens({
        contents: params.config?.systemInstruction || "",
model: params.model,
config: { tools },
});
      if ((tokens.totalTokens ?? 0) < 1024) {
// Remember this to skip future API calls
this.promptsWithTooFewTokens.add(cacheKey);
return undefined;
}
const cache = await genAI.caches.create({
model: params.model,
config: {
systemInstruction: params.config.systemInstruction,
displayName: cacheKey,
tools,
toolConfig: params.config?.toolConfig,
ttl: this.cacheTtl,
},
});
this.cachedContents.set(cacheKey, cache);
return cache.name;
} catch {
return undefined;
}
}
/**
* Extracts only function declaration tools from params.
* Other tool types (like code execution) aren't supported in cached content.
*/
private extractFunctionDeclarationTools(
params: GenerateContentParameters
): Tool[] | undefined {
if (!params.config?.tools) return undefined;
const tools: Tool[] = params.config.tools.filter(
(tool) => "functionDeclarations" in tool
);
return tools.length > 0 ? tools : undefined;
}
}