30-second answer: Gemini Video Search is Google's ability to ingest a video file, understand its visual content frame by frame, and produce searchable embeddings from what it sees. You upload a video, Gemini watches it and creates a mathematical fingerprint of each moment, and then you can search through all of it with plain English — "find when the red car arrives" or "show me every moment the error screen appears" — and get sub-second results. A working open-source demo called SentrySearch proved this out on real security camera footage. Any builder with a Google AI API key can do this today.
Jump to: Why This Matters · How It Works · Real Use Cases · Getting Started · What Can Go Wrong · FAQ · What to Learn Next
Why This Matters for Builders
Until very recently, video was one of the hardest types of content to make searchable. The only practical approaches were:
- Manual tagging: Someone watches the video and writes descriptions. Slow, expensive, doesn't scale.
- Caption-based search: Auto-generated speech transcripts you can search with keywords. Only captures what people say, not what appears on screen.
- Timestamp metadata: Someone writes chapter markers or notes. Still manual work.
None of these approaches let you search for visual events — objects appearing, actions happening, text on screen, scene changes — unless a human watched and described them first. That bottleneck made video search fundamentally limited compared to text search.
Gemini's video understanding changes this completely. The model doesn't need a human to describe the video. It watches it directly, understands what's in each frame, and produces searchable representations of that visual content automatically. The result is that searching video becomes as natural as searching text — and almost as fast.
A Show HN project called SentrySearch demonstrated this concretely in early 2026. The project showed a security camera setup where you could type "find when the package arrives" or "show me when the motion light turns on" and get sub-second results over hours of footage — with no manual tagging, no captions, and no pre-built database of what happened. The post hit 100 points on Hacker News, and the reactions were overwhelmingly "I didn't know you could do this yet."
The reason this matters for non-traditional builders — people using AI as their primary coding partner rather than coming from a computer vision background — is that this capability is now accessible through a normal API call. You don't need to know anything about optical flow algorithms, convolutional neural networks, or frame extraction pipelines. You upload a video file, call the Gemini API, and ask questions about it. Your AI coding assistant can write most of that code for you today.
New to embeddings?
Video search with Gemini uses the same fundamental concept as text search with embeddings — the model converts content into numerical representations that can be compared for similarity. If "embeddings" is a new concept for you, the embeddings explainer covers exactly what they are and why they're useful before you dive into the video-specific details here.
How It Works (Plain English)
You don't need to understand the internals to use this. But a basic mental model will help you make good decisions about when to use it and why the results behave the way they do.
Step 1: Upload the Video
Before Gemini can analyze a video, it needs to receive the file. Google provides a File API specifically for this — you upload a video file (MP4, MOV, AVI, and others are supported) and get back a file URI that you can reference in subsequent API calls. Think of it like attaching a document: the file lives on Google's servers temporarily, and you reference it by its identifier.
The upload process is separate from the analysis. For a 10-minute video, the upload might take a few seconds depending on your connection. Once it's uploaded, the analysis calls are fast — the file is already there, so Gemini is working from a local copy rather than re-downloading it every time.
Step 2: Gemini Watches the Video
When you make an API call that references a video file, Gemini samples the video — approximately one frame per second — and processes each frame through its visual understanding system. It recognizes objects, people, actions, text on screen, spatial relationships between things, and how the scene changes over time.
This is the core capability that's new. Gemini isn't running optical character recognition on the frames (though it can read text on screen). It's doing visual scene understanding — the same capability that lets it describe a photograph or explain what a diagram means. Applied to video, that same understanding runs over every sampled frame in sequence, giving the model a temporal picture of what happens and when.
Step 3: Creating the Searchable Fingerprints (Embeddings)
For one-off questions ("what happens at 2:34?") you can just send Gemini the video and ask a natural language question directly. But for search — where you want to query across multiple videos or query the same video many times — you need a different approach.
Embeddings are the answer. Instead of asking Gemini a question every time, you run the video through Gemini once and store a set of numerical vectors — one per chunk of video — that represent what Gemini understood about each moment. These vectors are stored in a vector database or even a simple in-memory structure.
When you want to search, your text query ("find when the whiteboard appears") is also converted to a vector using the same embedding model. Then you run a similarity comparison between the query vector and all the stored video vectors. The closest matches are your results. This is the same architecture as Retrieval Augmented Generation (RAG), but instead of chunked text documents, your data is chunked video moments.
The speed advantage of this approach is significant. The slow part (Gemini understanding the video) happens once, offline, when you first process the video. Every subsequent search query is just a fast vector similarity lookup — which is why SentrySearch achieved sub-second search times even on hours of footage.
The Key Insight: One Processing Pass, Infinite Searches
This is the architectural point most worth internalizing. The Gemini API call is the expensive step — in time and money. Once you've run a video through Gemini and stored the embeddings, you can run as many search queries against it as you want at near-zero cost. The search is a math operation, not an AI call.
This means the economics work for high-query scenarios. A surveillance system might process one hour of footage (expensive once), then serve a hundred different search queries throughout the day (each query is essentially free). That's a very different cost structure from calling Gemini on every search.
Want to understand the token math?
Video is billed in tokens like text — roughly 258 tokens per second of video. A 60-minute video is about 930,000 tokens for the processing pass. That's significant but bounded — and you only pay it once if you store the embeddings. The AI tokens explainer covers how token pricing works across different model types.
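To make that token math concrete, here is a back-of-envelope sketch. The roughly 258 tokens per second figure is from the text above; the price per million tokens is a made-up placeholder, so check current Gemini pricing before trusting the dollar output.

```python
# Rough token-cost estimate for a one-time video processing pass.
# The 258 tokens/second rate is approximate, and the price per million
# tokens below is a hypothetical placeholder, not real pricing.
TOKENS_PER_SECOND = 258

def video_tokens(duration_seconds: float) -> int:
    """Approximate input tokens for one processing pass over a video."""
    return round(duration_seconds * TOKENS_PER_SECOND)

def processing_cost(duration_seconds: float, usd_per_million_tokens: float) -> float:
    """Estimated one-time cost of the Gemini processing pass."""
    return video_tokens(duration_seconds) / 1_000_000 * usd_per_million_tokens

one_hour = 60 * 60
print(video_tokens(one_hour))                      # about 930k tokens, as above
print(round(processing_cost(one_hour, 0.10), 4))   # at a hypothetical $0.10/M tokens
```

The point the numbers make: the processing pass is bounded and paid once, after which every search query costs nothing in video tokens.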
Real Use Cases for Non-Traditional Builders
The interesting thing about this capability becoming API-accessible is the range of products suddenly within reach for solo builders and small teams. Here are the use cases that are practical today.
Security Camera Footage Search
This is exactly what SentrySearch built. A camera records continuously. You want to find specific events: "when did the delivery arrive?", "show me every time someone opened the back gate", "find the moment the car alarm went off." Without AI, you either scrub through footage manually or install expensive purpose-built software. With Gemini video search, a solo builder can ship this feature in days.
The approach: process overnight footage in a background job, store the embeddings, give the user a search bar. Users can query in natural language and jump directly to the relevant moment. This is a completely buildable weekend project with Gemini's API.
Video Tutorial and Course Navigation
Online courses and tutorial videos have a navigation problem. A 3-hour course on React has no good way to answer "find the part where they explain useCallback" unless the creator manually timestamped every section. AI video search fixes this automatically.
Process each course video, embed it, and give students a search bar. Type "dependency injection" in a backend course, get timestamps of every moment it's discussed or demonstrated on screen. This is a significant quality-of-life improvement that course platforms and individual educators could offer without a large engineering team.
Screen Recording and Meeting Search
Tools like Loom, Zoom, and Riverside generate enormous amounts of video content that's almost impossible to navigate retroactively. "I know someone demoed the new feature in one of our calls last month but I can't find which one" is a problem every team has. Video search turns your meeting archive into something actually searchable.
Process recordings as they come in, embed them, and give your team a search interface. "Find every meeting where we discussed the pricing page" or "show me when Marcus demoed the checkout flow" become answerable queries rather than excavation projects.
Product Demo and Sales Call Analysis
Sales teams record calls. QA teams record user testing sessions. Design teams record usability studies. In every case, they face the same problem: hours of footage, and no good way to find the moments that matter.
AI video search lets a sales manager ask "show me every call where the prospect asked about enterprise pricing" across every recorded call this quarter. A UX researcher can ask "find all the moments users looked confused at the onboarding flow" across a library of session recordings.
Sports and Event Highlight Extraction
Sports footage is another naturally high-volume, hard-to-navigate use case. "Find every goal in this match" or "show me all the moments the referee called a foul" can now be answered programmatically without manual video editing or specialized computer vision tooling. For smaller leagues, amateur sports, and esports that lack the resources of professional broadcast teams, this opens up highlight creation and analysis that was previously out of reach.
Content Moderation at Scale
Any platform that accepts video uploads faces content moderation challenges. Traditional approaches rely on classifiers trained on labeled datasets — expensive to build and slow to adapt. Gemini's video understanding can be used to flag specific types of content, check for brand safety issues, or identify policy violations using natural language descriptions rather than trained models. For a small platform that can't afford a dedicated trust-and-safety team, this is a meaningful tool.
Getting Started
Here's the practical path from zero to a working video search implementation. You'll need a Google AI API key (get one free from Google AI Studio) and either Python or JavaScript.
Step 1: Upload the Video
The File API handles video uploads. Once a file is uploaded, it stays available for 48 hours, so you can reference it in multiple API calls without re-uploading.
```python
# Python — upload a video file to Google's File API
import google.generativeai as genai
import time
import os

genai.configure(api_key=os.environ.get("GOOGLE_AI_API_KEY"))

# Upload the video file
print("Uploading video...")
video_file = genai.upload_file(path="security_footage.mp4")

# Wait for processing to complete (Gemini needs to extract frames)
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")

print(f"Upload complete. File URI: {video_file.uri}")
```
Step 2: Ask Gemini Questions About the Video
Once uploaded, you can ask natural language questions directly. This is the simplest pattern — useful for one-off analysis or prototyping before you build the full search system.
```python
# Python — ask Gemini a question about the uploaded video
model = genai.GenerativeModel(model_name="gemini-2.0-flash")

response = model.generate_content([
    video_file,
    "Describe everything that happens in this video in chronological order, "
    "including timestamps in seconds for each notable event.",
])

print(response.text)
```
For a 10-minute security camera clip, this will return a detailed description of every notable event with approximate timestamps — all from one API call. No manual review required.
Step 3: Build the Searchable Index
For a real search system, you want to process the video once and store embeddings you can query repeatedly. The pattern breaks the video into chunks and embeds each chunk separately.
```python
# Python — chunk a video description and generate searchable embeddings
import json

# First, get a structured description with timestamps from Gemini
model = genai.GenerativeModel(model_name="gemini-2.0-flash")
description_response = model.generate_content([
    video_file,
    """Analyze this video and return a JSON array of events.
Each event should have:
- "start_seconds": when it begins
- "end_seconds": when it ends
- "description": plain English description of what happens
Be specific about objects, actions, and any text visible on screen.
Return only valid JSON, no other text."""
])

# Parse the structured events (strip markdown fences in case the model adds them)
raw = description_response.text.strip()
raw = raw.removeprefix("```json").removeprefix("```").removesuffix("```")
events = json.loads(raw)

# Now embed each event description for semantic search
embedded_events = []
for event in events:
    embedding_result = genai.embed_content(
        model="models/text-embedding-004",
        content=event["description"],
        task_type="retrieval_document",
    )
    embedded_events.append({
        "start_seconds": event["start_seconds"],
        "end_seconds": event["end_seconds"],
        "description": event["description"],
        "embedding": embedding_result["embedding"],
    })

# Save to disk (or a vector DB in production)
with open("video_index.json", "w") as f:
    json.dump(embedded_events, f)

print(f"Indexed {len(embedded_events)} video segments")
```
Step 4: Search the Index
```python
# Python — search the video index with a natural language query
import json
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def search_video(query: str, index: list, top_k: int = 3):
    # Embed the search query
    query_embedding = genai.embed_content(
        model="models/text-embedding-004",
        content=query,
        task_type="retrieval_query",
    )["embedding"]

    # Score every segment
    scored = []
    for segment in index:
        score = cosine_similarity(query_embedding, segment["embedding"])
        scored.append((score, segment))

    # Return top matches
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:top_k]

# Load the index
with open("video_index.json") as f:
    index = json.load(f)

# Search
results = search_video("the package is delivered", index)
for score, segment in results:
    minutes = int(segment["start_seconds"] // 60)
    seconds = int(segment["start_seconds"] % 60)
    print(f"[{minutes:02d}:{seconds:02d}] Score: {score:.3f}")
    print(f"  {segment['description']}")
    print()
```
This search call is instant — it's just vector math on data you've already computed. No additional Gemini API calls needed. That's the architecture that enables sub-second search over large video libraries.
Study SentrySearch
The SentrySearch project (github.com/ssrajadh/sentrysearch) is a working implementation of exactly this pattern applied to security footage. Cloning it and reading through the code is the fastest way to see a complete, production-tested version of these patterns in action. The Show HN comments also have useful discussion about architectural trade-offs.
In Production: Use a Real Vector Database
The JSON file approach above is fine for a prototype. For anything with more than a few hours of footage or multiple users, you'll want a proper vector database — options like Pinecone, Qdrant, Supabase's vector extension, or PocketBase with a vector plugin. The query pattern stays the same; you're just replacing the flat JSON lookup with a database query that handles indexing and filtering efficiently.
The RAG explainer covers vector database options and trade-offs in more detail, since the infrastructure decisions are identical whether your source content is video, text, or any other media type.
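Between a flat JSON scan and a hosted vector database there is a useful middle step worth knowing: if the embeddings fit in memory, stacking them into one NumPy matrix turns the per-segment Python loop into a single matrix-vector product. This is a sketch of that idea, with a toy three-dimensional index standing in for real embeddings:

```python
import numpy as np

def build_matrix(index: list) -> np.ndarray:
    """Stack all segment embeddings into one (n_segments, dim) matrix,
    L2-normalized so a plain dot product equals cosine similarity."""
    mat = np.array([seg["embedding"] for seg in index], dtype=np.float32)
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def search_matrix(query_embedding, mat, index, top_k=3):
    """Score every segment at once with a single matrix-vector product."""
    q = np.asarray(query_embedding, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = mat @ q
    top = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), index[i]) for i in top]

# Toy index with 3-dimensional embeddings (real ones are hundreds of dimensions)
index = [
    {"start_seconds": 0, "description": "car arrives", "embedding": [1.0, 0.0, 0.0]},
    {"start_seconds": 10, "description": "door opens", "embedding": [0.0, 1.0, 0.0]},
]
mat = build_matrix(index)
results = search_matrix([0.9, 0.1, 0.0], mat, index, top_k=1)
print(results[0][1]["description"])  # the closest segment
```

This scales comfortably to tens of thousands of segments before a real vector database becomes necessary.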
What Can Go Wrong
Video search with Gemini is genuinely useful, but there are several failure modes and limitations you should understand before you ship something that depends on it.
Frame Sampling Misses Fast Events
Gemini processes approximately one frame per second of video. For most use cases — someone arriving at a door, text appearing on screen, an object being placed on a table — that's more than enough resolution. But for fast events that last less than a second — a muzzle flash, a ball crossing a goal line, a specific card flip in a card game — the relevant frame may simply not be sampled.
If your use case involves high-speed events, test explicitly with your actual footage before committing to this approach. In many cases, one frame per second is fine. In some, it's a hard limitation.
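One way to reason about this: under a simplified model where frames are sampled at a fixed interval with a random offset relative to the event, the chance of catching an event is proportional to its duration. The helper below encodes that model; it is an approximation for building intuition, not a guarantee about Gemini's actual sampling behavior.

```python
def capture_probability(event_duration_s: float, sample_interval_s: float = 1.0) -> float:
    """Probability that at least one sampled frame lands inside an event,
    under a simple model: one frame every `sample_interval_s` seconds,
    with a uniformly random offset relative to the event."""
    return min(1.0, event_duration_s / sample_interval_s)

# A 2-second event is always caught at 1 fps; a 0.25-second event
# (a muzzle flash, a fast card flip) is caught only about 25% of the time.
print(capture_probability(2.0))
print(capture_probability(0.25))
```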
Hallucinated Details
Like all large language models, Gemini can hallucinate — produce confident descriptions of things that weren't in the video. This is more likely to affect specific details ("the person was wearing a red jacket") than general events ("someone entered the room"), and more likely when the video quality is poor, lighting is bad, or the scene is ambiguous.
For high-stakes use cases (security, legal, medical), treat Gemini's descriptions as indicators that point you to the right timestamp, then verify by watching that segment yourself. Don't rely solely on the AI description as evidence of what happened.
Cost Scales with Video Length
Video is expensive in tokens. At roughly 258 tokens per second, a one-hour video costs approximately 930,000 tokens for the processing pass. At current Gemini Flash pricing, that's a small but real cost — and it multiplies quickly when you're processing dozens or hundreds of videos.
The mitigating strategy is the one-and-done processing architecture: process each video once, store the embeddings, never process it again. If you're processing the same video repeatedly (re-running analysis on every search), you'll burn money fast. Build the caching layer early. Understanding context windows and token costs will help you size this correctly for your use case.
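A minimal version of that caching layer is just a check before the expensive call. The `process_with_gemini` argument below is a placeholder for whatever function runs the Step 3 pipeline:

```python
import json
import os

def get_or_build_index(video_path: str, index_path: str, process_with_gemini) -> list:
    """Return the stored index if it exists; otherwise run the expensive
    Gemini pass once and cache the result to disk.
    `process_with_gemini` is a placeholder for your Step 3 pipeline."""
    if os.path.exists(index_path):
        with open(index_path) as f:
            return json.load(f)              # cache hit: no API call
    index = process_with_gemini(video_path)  # expensive, happens once
    with open(index_path, "w") as f:
        json.dump(index, f)
    return index
```

With this in place, re-running your search server never triggers a second processing pass for a video that has already been indexed.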
The 48-Hour File Expiration
Files uploaded to Google's File API are automatically deleted after 48 hours. This is important for your architecture: if you're planning to reference a video file URI in future API calls, you need to re-upload periodically, or store the processed embeddings so you don't need the original file URI anymore. For a search system where you process videos once and store embeddings, this usually isn't a problem — you only need the file URI during the initial processing window. But if your system re-runs analysis on-demand, you need to handle re-uploading.
Video Quality and Compression Artifacts
Heavily compressed video, low-resolution footage, or videos with significant noise degrade Gemini's understanding. Security cameras, in particular, often produce low-bitrate footage that makes it harder to distinguish between similar-looking scenes. Test your actual footage quality early, not just with pristine sample videos.
Supported Format Limits
The File API supports common formats but not all of them. Unusual container formats or codecs may fail silently or produce degraded results. The maximum file size is 2GB per file. Videos longer than approximately one hour need to be split before processing.
When Asking Your AI Coding Assistant About This
AI assistants trained before 2026 may not know that Gemini supports native video understanding via the standard API. They may tell you to use third-party computer vision services, suggest extracting frames manually and sending them as images, or recommend older approaches like transcription-only search. If your assistant gives you a solution that doesn't use Gemini's video file support directly, it may be working from outdated knowledge. Ask specifically about the Gemini File API and video embeddings to get current approaches.
Additionally, AI assistants may conflate two different Gemini video patterns:
- Direct video Q&A: Send a video, ask a question, get an answer. Simple but you pay for every query.
- Embeddings-based search: Process once, store embeddings, run fast free queries. More complex to build but dramatically cheaper at scale.
Make sure you're getting advice about the right architecture for your use case. If you're building a search feature used many times per day, the embeddings approach is the right one — but some AI assistants default to the simpler direct Q&A approach without flagging the cost implications.
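A quick back-of-envelope comparison shows why the architecture choice matters. The figures below reuse the rough 258 tokens per second rate from earlier; the 20-token query size is an illustrative guess, since short search queries embed very cheaply.

```python
def direct_qa_tokens(video_seconds: float, queries: int,
                     tokens_per_second: float = 258) -> float:
    """Direct Q&A: the whole video is re-sent as input for every query."""
    return video_seconds * tokens_per_second * queries

def embeddings_tokens(video_seconds: float, queries: int,
                      tokens_per_second: float = 258,
                      query_embed_tokens: float = 20) -> float:
    """Embeddings search: one processing pass, then only the short text
    query is embedded per search (20 tokens is an illustrative guess)."""
    return video_seconds * tokens_per_second + queries * query_embed_tokens

one_hour, daily_queries = 3600, 100
print(f"{direct_qa_tokens(one_hour, daily_queries):,.0f}")   # tokens/day, direct Q&A
print(f"{embeddings_tokens(one_hour, daily_queries):,.0f}")  # tokens/day, embeddings
```

For one hour of footage and a hundred daily queries, the direct approach spends roughly a hundred times more tokens per day than the embeddings approach.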
Beyond Search: Other Things Gemini Can Do With Video
Search is the most immediately practical video use case, but Gemini's video understanding enables several other capabilities worth knowing about.
Video Summarization
Send Gemini a one-hour meeting recording and ask for a five-bullet summary of what was decided. Ask it to generate meeting minutes in a specific format. Ask it to identify all the action items that were mentioned. This works the same way as video Q&A — upload the file, send a prompt, get structured output.
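Here is a sketch of what that looks like in code, reusing the `model` and `video_file` objects from the Getting Started section. The helper names, section list, and prompt wording are illustrative choices, not a fixed API.

```python
def build_minutes_prompt(sections) -> str:
    """Assemble a meeting-minutes prompt from a list of section names
    (the sections themselves are up to you)."""
    wanted = "\n".join(f"- {s}" for s in sections)
    return (
        "Summarize this meeting recording as structured minutes with "
        "the following sections:\n" + wanted + "\n"
        "Include approximate timestamps for each decision and action item."
    )

def summarize_meeting(model, video_file,
                      sections=("Decisions", "Action items", "Open questions")):
    """One-shot summarization: the same upload-then-prompt pattern as video Q&A."""
    response = model.generate_content([video_file, build_minutes_prompt(sections)])
    return response.text
```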
Content Extraction
Any text that appears on screen in the video — slides, whiteboards, screen shares, captions — can be extracted by Gemini. For a lecture recording where the presenter shares slides, you can pull all slide content without having the original deck files. For a coding tutorial, you can extract the actual code being written. This is more reliable and more comprehensive than OCR tools that only process still frames.
Automated Captioning and Translation
Gemini can generate detailed captions for video content — more descriptive than speech-to-text alone because it includes visual context. "The presenter points to the diagram on the left" is information that speech-to-text can't capture but Gemini can. Combined with Gemini's multilingual capabilities, this makes automated localization of video content more accessible for small teams.
Quality Assurance for Video Content
For platforms that produce video content — course creators, YouTubers, marketing teams — Gemini can review footage for specific quality issues: check that a product is displayed correctly, verify that a presenter's slides match their spoken content, confirm that a specific demonstration worked as expected. This kind of automated review would otherwise require a human to watch every video before it ships.
The Gemini Ecosystem
Video understanding is one capability in a broader Gemini toolkit that includes text, images, audio, and code. If you're new to the Gemini model family and want a full picture of what's available before diving into video specifically, the Gemini for coding overview explains the model lineup, pricing tiers, and how all these capabilities relate to each other.
FAQ
Does Gemini literally watch every frame?
Not literally — Gemini samples frames at a rate of about one frame per second rather than processing every single frame. For most practical use cases this is more than sufficient: it captures actions, objects, text on screen, and scene changes with high accuracy. For anything requiring analysis of very rapid motion (sports at 60fps, fast-cut editing), the sampling may miss specific moments. Test on your actual footage to know whether this is a limitation for your use case.
How is this different from searching on YouTube?
YouTube search works on titles, descriptions, tags, and auto-generated speech transcripts. Gemini video search works on what actually appears in the video visually — objects, actions, people, scenes, on-screen text — even if none of it was ever described in words. You can search for "the moment the whiteboard diagram appears" or "when the error message shows on screen" and Gemini will find it, even in a video with no captions or metadata at all.
What video formats and sizes are supported?
Gemini's File API supports common video formats including MP4, AVI, MOV, MKV, WebM, FLV, MPEG, and MPG. The maximum file size is 2GB per file, and uploaded files are available for 48 hours before automatic deletion. For videos longer than approximately one hour, you'll need to split them before processing, as the context window has limits on how much video content can be processed in a single call.
How much does video processing cost?
Video is billed based on tokens — roughly 258 tokens per second of video. A 10-minute video is approximately 155,000 tokens. At Gemini 2.0 Flash pricing, that's a small fraction of a cent per minute for the embedding pass. The key is the architecture: if you process each video once and store the embeddings, subsequent search queries cost essentially nothing — they're just vector similarity comparisons, not additional API calls. Re-processing the same video repeatedly is where costs climb quickly.
Is there a real working example I can study?
Yes — SentrySearch (github.com/ssrajadh/sentrysearch) is an open-source project that demonstrated sub-second video search using Gemini's native video understanding. The Show HN post reached 100 points with active discussion in the comments about the implementation. It's a working proof-of-concept you can clone and study — it shows exactly how to upload video via the File API, generate embeddings, store them, and run semantic search queries. A solid foundation to build from rather than starting from scratch.
What to Learn Next
You've got the full picture on Gemini video search. Here's where to go next to build out the surrounding knowledge: