Technical SEO

How Search Engines Work: Crawling, Indexing & Ranking Fundamentals

Google handles 8.5 billion searches daily, yet 96.55% of web pages get zero organic traffic. Here is exactly how search engines discover, store, and rank your pages in 2026.

Published: April 4, 2026
Updated: April 4, 2026
18 min read
Editorial Standards
We uphold a strict editorial policy on factual accuracy, relevance, and impartiality. A team of seasoned editors meticulously reviews our in-house content to ensure compliance with the highest standards in reporting and publishing.
Overview

Every search result you see went through a three step process: crawling, indexing, and ranking. Google's infrastructure for doing this has gotten dramatically more complex since the early PageRank days, and in 2026 AI systems are changing the game at every level. This guide breaks down each stage with current data, explains what has shifted with AI Overviews and Gemini, and tells you what actually matters for getting your site found.

8.5B
Google searches per day
96.55%
of pages get zero organic traffic
58.5%
of US searches end with zero clicks

The Three-Stage Pipeline

Search engines do three things, in order. They crawl the web to discover pages. They index those pages by parsing and storing the content. And they rank indexed pages against your query, sorting results by relevance and quality. That is the entire model. Everything else is detail on how each stage works.

01
Crawl
Bots follow links, read sitemaps, and download pages from across the web.
02
Index
Pages are parsed, rendered, deduplicated, and stored in a searchable database.
03
Rank
Algorithms score indexed pages against each query using hundreds of signals.

If your page fails at any stage, it won't show up in results. A page that isn't crawled can't be indexed. A page that isn't indexed can't be ranked. And a page that is indexed but has weak signals will be buried. Understanding where your site is breaking down in this pipeline is the first step toward fixing your organic search performance.

Stage 1: Crawling

Web crawlers are automated programs that browse the internet, fetch pages, and follow links to find new content. Google's crawler, Googlebot, is by far the most active. But Googlebot is not a single bot.

A March 2026 post by Google's Gary Illyes revealed that Googlebot is really just one client of a centralized crawling platform shared across dozens of Google products. Search, Shopping, AdSense, Gemini, and Google News all route crawl requests through the same infrastructure under different crawler names. In November 2025, Google even moved its crawling documentation to a separate site (developers.google.com/crawling), which tells you something about how they now think of crawling as a company wide service rather than a search specific one.

The scale is enormous. In 2025, Googlebot generated more than 25% of all verified bot traffic observed by Cloudflare and accounted for 4.5% of all HTML request traffic globally. That is more than every AI crawler combined.

How URLs Get Discovered

Googlebot finds new URLs four ways:

  • Link discovery is the primary method: following hyperlinks from pages it already knows about, using internal link density to estimate which pages matter most.
  • XML sitemaps let you hand Google a structured list of URLs, though Google ignores the priority and changefreq tags and only trusts lastmod timestamps when they are actually accurate.
  • Direct submission through Google Search Console's URL Inspection tool gives you manual control.
  • The persistent crawl queue holds previously known URLs, so once Google has seen a URL, it never fully forgets it.
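Following that guidance, a minimal sitemap might look like the sketch below. The URLs and dates are placeholders; priority and changefreq are omitted because Google ignores them, and lastmod should only be set when it reflects a real content change:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Only canonical, indexable, 200-status URLs belong here -->
  <url>
    <loc>https://www.example.com/guides/technical-seo</loc>
    <!-- lastmod should mark a real content update, not a site rebuild -->
    <lastmod>2026-03-18</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/guides/crawl-budget</loc>
    <lastmod>2026-02-02</lastmod>
  </url>
</urlset>
```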

Key Takeaway

Internal links are still the most reliable way to get pages discovered. Every page you want indexed needs at least one internal link pointing to it. Orphan pages, those with no internal links at all, are essentially invisible to crawlers.

Crawl Budget and the New 2MB HTML Limit

Crawl budget is the number of URLs Google can and wants to crawl on your site. Two forces determine it: crawl capacity limit, which is how many simultaneous requests your server can handle before slowing down, and crawl demand, which is Google's interest level based on your content's popularity, freshness, and overall site size.

For most websites, crawl budget is not something you need to think about. Google can easily handle sites with a few hundred pages. It starts mattering at 1 million+ pages with moderate content changes, or 10,000+ pages that change daily.

What you should think about: in February 2026, Google reduced Googlebot's HTML fetch limit from 15MB down to 2MB per URL. That is an 86.7% reduction. Anything past 2MB gets ignored entirely. PDFs still have a 64MB limit, and external resources like CSS and JS each get their own 2MB cap, but for your actual HTML content, lean pages are now more important than they have been in years. This matters for your site architecture and development decisions.
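If you want to sanity-check your own templates against that limit, a quick script like the following works. This is an illustrative sketch: the exact byte boundary Google applies is not published, so 2 MiB is an assumption here.

```python
# Rough check of a page's HTML payload against Googlebot's reported
# 2 MB fetch limit. The exact boundary Google uses is not documented;
# 2 MiB (2 * 1024 * 1024 bytes) is assumed for illustration.

GOOGLEBOT_HTML_LIMIT = 2 * 1024 * 1024  # assumed 2 MiB cap

def html_within_limit(html: str) -> bool:
    """Return True if the UTF-8 encoded HTML fits under the limit."""
    return len(html.encode("utf-8")) <= GOOGLEBOT_HTML_LIMIT

def bytes_over_limit(html: str) -> int:
    """Return how many bytes past the limit the page is (0 if within)."""
    return max(0, len(html.encode("utf-8")) - GOOGLEBOT_HTML_LIMIT)
```

Run it against the raw HTML of your heaviest templates; anything over the limit means content past the cutoff may never be seen.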

Robots.txt: What It Does and Doesn't Do

Robots.txt controls crawler access using User-agent, Disallow, Allow, and Sitemap directives. A few things people consistently get wrong about it:

  • Google ignores the Crawl-delay directive. Only Bing and Yandex respect it.
  • Blocking a page with robots.txt does not prevent indexing. If other sites link to a blocked URL, Google can still index it, just without a snippet. To prevent indexing, you need a noindex meta tag or X-Robots-Tag header.
  • The file has a 500 KiB size limit, and path matching is case sensitive.
  • You should never block CSS or JavaScript files, because Google needs them to render your pages correctly.
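Putting those points together, a robots.txt might look like this sketch. The paths are hypothetical; adapt them to your own low value URL patterns:

```
# Example robots.txt -- paths are illustrative
User-agent: *
# Block low-value crawl paths, not rendering resources
Disallow: /search/
Disallow: /*?ref=
# Keep CSS and JS crawlable so pages render correctly
Allow: /assets/css/
Allow: /assets/js/

Sitemap: https://www.example.com/sitemap.xml
```

Remember that this only controls crawling. To keep a page out of the index, use a noindex directive instead: a `<meta name="robots" content="noindex">` tag in the HTML head, or an `X-Robots-Tag: noindex` HTTP header for non-HTML files.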


Stage 2: Indexing

Once Googlebot fetches a page, the indexing pipeline takes over. This is where Google decides what your page is about and whether it deserves a spot in the index.

The pipeline runs through five stages. First, HTML parsing extracts text, title tags, heading structure, alt attributes, images, and structured data. Second, the Web Rendering Service (WRS) executes JavaScript. Third, canonicalization groups similar pages and picks the best version. Fourth, signal collection gathers quality, language, and usability data. Fifth, the canonical page and all its metadata are stored across Google's distributed index.

The architecture underneath all of this still builds on Caffeine, Google's indexing system from 2010. Before Caffeine, Google processed the web in batch updates that took weeks. Caffeine introduced continuous indexing, where pages move through the pipeline and go live almost immediately after being crawled. At launch, it delivered a 50% fresher index. Everything since then, including the tiered indexing and quality based filtering exposed by the 2024 API leak, builds on top of that foundation.

JavaScript Rendering

Google's WRS uses an evergreen version of headless Chromium that matches the latest stable Chrome release. It processes CSS, handles AJAX requests, and discovers content and links injected by JavaScript. One thing it does not do: simulate user interactions. No clicking, no scrolling, no typing.

In early 2026, Google removed its longstanding warning about building pages that work without JavaScript. They said rendering capabilities had improved enough to make that guidance unnecessary. And a 2025 Vercel/MERJ study found Google does render 100% of HTML pages including complex JS, with most spending fewer than 20 seconds in the rendering queue.

But here is the catch. Most AI crawlers, including GPTBot, ClaudeBot, and PerplexityBot, still cannot execute JavaScript. If you want your content cited in AI generated answers, which is becoming its own competitive dimension, server side rendering or static site generation are still the safer bets. This is part of a broader AI search visibility strategy.

Mobile-First Indexing

Google completed its migration to mobile first indexing in July 2024. That means Googlebot Smartphone is now the primary crawler for virtually all websites. Desktop only content risks being missed entirely.

With over 60% of global web traffic coming from mobile devices and the September 2025 core update reinforcing mobile performance as a ranking signal, there is no argument for treating mobile as an afterthought. If your mobile and desktop versions serve different content, the mobile version is what Google uses for indexing.

Canonicalization

When Google finds multiple URLs with similar content (say, example.com/page and example.com/page?ref=social), it picks one as the canonical version to represent them all. You can suggest which URL to prefer using rel="canonical" tags.

Google treats your canonical tag as a strong hint, not a command. They override it roughly 35% of the time when other signals disagree. Common reasons for override: the canonical URL returns a redirect, serves a noindex tag, or points to content that is substantially different. Self referencing canonical tags using absolute URLs on every page is still the recommended approach.
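In practice, that recommendation looks like the snippet below. The URLs are placeholders; the key points are the absolute URL and the self reference on the clean version:

```html
<!-- On https://www.example.com/page: a self-referencing
     canonical using an absolute URL -->
<link rel="canonical" href="https://www.example.com/page">

<!-- On the parameterized duplicate https://www.example.com/page?ref=social:
     the same tag, pointing back at the clean version -->
<link rel="canonical" href="https://www.example.com/page">
```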

Stage 3: Ranking

Ranking is not one algorithm. It is a collection of systems that evaluate hundreds of signals in a multi stage pipeline.

During the 2023 DOJ antitrust trial, Google VP Pandu Nayak described four stages:

  • Retrieval pulls tens of thousands of candidate documents using keyword matching and Neural Matching (RankEmbed), a dual encoder model that finds relevant results even without keyword overlap.
  • Coarse ranking narrows that to a few hundred using RankBrain, which converts queries into mathematical vectors to handle ambiguous searches.
  • Fine ranking applies BERT (deployed as DeepRank) to the top 20 to 30 results, understanding language context bidirectionally.
  • Re-ranking through NavBoost refines the final order using 13 months of Chrome click data, tracking which results users actually find useful.

Top Ranking Factors in 2026

First Page Sage's ongoing algorithm study provides the most detailed public estimates of how Google weighs different ranking factors:

Factor | Estimated Weight | Trend
Content quality & relevance | ~26% | Stable (dominant factor)
Backlinks | ~13% | Declining (was 50%+ historically)
User engagement | ~12% | Increasing each year
Core Web Vitals / page experience | ~10-15% | Stable
Content freshness | ~6% | Stable (+4.6 positions for yearly updates)

Backlinks are still meaningful. The number one result in Google has on average 3.8x more backlinks than positions two through ten. But the weight has shifted dramatically from the PageRank era. Quality matters much more than quantity now, and only 1 in 20 pages without any backlinks receives organic traffic at all. If you are working on your link building strategy, focus on contextually relevant, authoritative sources rather than volume.

Core Web Vitals

Core Web Vitals became official ranking factors through the Page Experience update. The three current metrics are:

  • Largest Contentful Paint (LCP): How fast the main content loads. Target: 2.5 seconds or less.
  • Interaction to Next Paint (INP): How quickly the page responds to user input. Target: 200 milliseconds or less. This replaced First Input Delay in March 2024.
  • Cumulative Layout Shift (CLS): How much the page layout moves unexpectedly. Target: 0.1 or less.
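Two of the most common fixes can be expressed in a few lines of markup. This is a sketch with placeholder file names:

```html
<!-- LCP: preload the hero image so the main content paints sooner -->
<link rel="preload" as="image" href="/img/hero.webp" fetchpriority="high">

<!-- CLS: explicit width and height reserve layout space before the
     image loads, so surrounding content does not shift -->
<img src="/img/hero.webp" width="1200" height="630" alt="Hero image">
```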

Google evaluates these using real field data from the Chrome User Experience Report. At least 75% of page visits need to hit "Good" thresholds. As of 2025, 54.2% of websites fail all three metrics. Sites that pass them see 24% higher CTR and 19% lower bounce rates. If your web development team has not addressed CWV yet, that is where your competitors are gaining ground.

E-E-A-T

E-E-A-T stands for Experience, Expertise, Authoritativeness, and Trustworthiness. It is not a direct ranking factor or an algorithmic score. It is a framework from Google's 176-page Search Quality Rater Guidelines, used by 10,000+ human quality raters worldwide to evaluate how well Google's algorithms are doing.

Trust is the most important pillar. For YMYL (Your Money or Your Life) content covering health, finance, safety, and civic topics, Google applies extra scrutiny. The September 2025 QRG update expanded YMYL categories to include election content and flagged purely AI generated content without human review as the lowest quality tier.

How you build E-E-A-T signals: named authors with real credentials and linked bios, original research or first-hand experience, consistent brand presence across channels, and transparent trust signals like HTTPS, contact info, and privacy policies. This is where content marketing strategy and E-E-A-T overlap directly.

Key Takeaway

E-E-A-T is not something you optimize for with a meta tag. It is built over time through content quality, author credibility, and trust signals. The sites that rank well have it. The sites that don't, struggle. There is no shortcut.

How AI Is Changing Search

The most disruptive change to search in two decades is happening right now. Google is shifting from a retrieval engine to something closer to a reasoning engine, and the implications for organic traffic are significant.

AI Overviews and the CTR Impact

Google AI Overviews, the AI generated summaries at the top of search results, launched in May 2024 and now reach 1.5 billion users monthly across 200+ countries. Their appearance has grown from 6.49% of queries in January 2025 to roughly 25-48% by early 2026, depending on which keyword set you measure.

The effect on organic click through rates is hard to ignore. A Seer Interactive study of 25.1 million impressions found organic CTR dropped 61% for queries with AI Overviews. Ahrefs' December 2025 analysis of 300,000 keywords showed AI Overviews cut position one CTR by 34.5 to 58%. And a Pew Research study of 68,879 actual searches found only 8% of users who saw AI Overviews clicked a traditional result.

The flip side: brands that get cited within AI Overviews earn roughly 35% more organic clicks than uncited brands. So AI search visibility is becoming its own competitive dimension, separate from traditional ranking.

Google's AI Mode, a conversational search tab powered by Gemini, hit 75 million daily active users by early 2026. An alarming 93% of AI Mode queries end without any click to an external site. Meanwhile, Google Search generated $63 billion in Q4 2025 revenue alone, partly because ads now appear in 25.5% of AI Overview results, up 394% from early 2025.


Modern SERPs and Featured Snippets

The simple ten blue links layout is long gone. Semrush Sensor data shows only 1.49% of first-page results have no SERP features at all. Google assembles each results page dynamically based on intent, location, device, and user context, mixing in AI Overviews, People Also Ask boxes (appearing in ~75% of searches), Knowledge Panels, local map packs, image and video carousels, sitelinks (68% of SERPs), and rich snippets driven by structured data markup.

Featured snippets have taken a hit from AI Overviews. Their SERP visibility dropped 64% between January and June 2025, falling from 15.41% to 5.53% of US desktop queries. But when present, they still capture 42.9% of total clicks, the highest of any SERP element. Paragraph snippets make up 70% of all featured snippets, and the sweet spot is 40 to 60 words directly answering a question under a heading that matches the query.

Zero click searches now account for 58.5% of US searches and 59.7% of EU searches. For queries with AI Overviews, that jumps to 80-83%. Gartner projects 25% of organic search traffic will migrate to AI chatbots and voice assistants by the end of 2026. The question is no longer whether this shift is happening, but how fast.

What to Do About All This

The fundamentals have not changed. The execution has. Here is what matters at each stage of the pipeline, written for people who actually manage websites and need to make decisions about where to spend their time.

For crawling

Keep your robots.txt clean. Block the low value stuff (faceted navigation, internal search results, URL parameters) and leave CSS and JavaScript accessible. Submit XML sitemaps containing only canonical, indexable 200-status URLs with accurate lastmod timestamps. Use standard <a href> links with descriptive anchor text for internal linking. Run crawl audits quarterly and eliminate orphan pages. If you are managing large sites, watch the 2MB HTML limit; it will bite you if your page templates are bloated.

For indexing

Add self referencing canonical tags with absolute URLs on every page. If you use JavaScript frameworks, prioritize server side rendering or static generation. Do not inject canonical tags or meta robots directives via JavaScript. Check Google Search Console's Pages report monthly for indexing problems, especially "Crawled - currently not indexed" (usually a quality issue) and "Discovered - currently not indexed" (usually a crawl budget issue).

For ranking

Match content format to search intent. Lead with direct answers, then elaborate. Update content at least quarterly; pages updated yearly gain an average of 4.6 ranking positions. Optimize Core Web Vitals: preload LCP resources, break up long JS tasks for INP, and set explicit dimensions on all media for CLS. Build E-E-A-T through named authors with credentials, original data, and consistent brand presence. Implement structured data in JSON-LD format, particularly Article, FAQPage, HowTo, and Organization schema.
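As a sketch, an Article schema block in JSON-LD could look like the following. All names, dates, and URLs here are placeholders; map them to your real page metadata:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How Search Engines Work",
  "datePublished": "2026-04-04",
  "dateModified": "2026-04-04",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "url": "https://www.example.com/authors/jane-doe"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Example Media"
  }
}
```

Embed it in the page head inside a `<script type="application/ld+json">` tag, and validate it with Google's Rich Results Test before shipping.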

For AI visibility

This is the newer, less understood optimization layer. Lead with concise answers using a "bottom line up front" structure. Use bullet points, numbered lists, and HTML tables that AI models parse easily. Build topical authority through content clusters rather than isolated pages. Cite authoritative external sources. About 48% of URLs cited in AI responses across ChatGPT, Perplexity, Copilot, and Google AI Mode do not rank in Google's traditional top 100, which means AI citation and organic ranking are becoming separate competitive games.

Key Takeaway

You now need to optimize for two things at once: traditional organic ranking and AI citation. They overlap in some areas (quality content, structured data, topical authority) but diverge in others. Sites that treat them as one problem will lose ground to competitors who address both.

Frequently Asked Questions

How do search engines work?

Search engines work in three stages: crawling (discovering pages by following links and reading sitemaps), indexing (parsing, rendering, and storing page content in a searchable database), and ranking (scoring indexed pages against a query using hundreds of signals like content relevance, backlinks, and user engagement to determine result order).

What is crawl budget, and does my site need to worry about it?

Crawl budget is the number of URLs Google can and wants to crawl on your site within a given timeframe. It is determined by your server's capacity and Google's interest in your content. Crawl budget mainly affects large sites with over 1 million pages or medium sites with 10,000+ pages that change frequently. Small sites with a few hundred pages rarely need to worry about it.

Why isn't my page showing up in Google?

Common reasons include: the page has a noindex tag blocking indexing, robots.txt is blocking crawlers from accessing it, it lacks internal links so crawlers never discover it, Google considers the content too low quality or too similar to other pages, or the page has not been crawled yet. Check Google Search Console's Pages report for specific indexing status and errors.

How do AI Overviews affect organic traffic?

AI Overviews reduce organic click through rates significantly. Studies show organic CTR drops by roughly 34-61% for queries that trigger an AI Overview. However, websites cited within AI Overview responses see approximately 35% more organic clicks than uncited sites, making AI citation a new competitive dimension alongside traditional ranking.

What are the most important ranking factors in 2026?

The top ranking factors in 2026 are content quality and relevance (estimated at 26% weight), backlink profile (13%), user engagement signals like click behavior (12%), Core Web Vitals and page experience (10-15%), and content freshness (6%). Google uses AI systems including BERT for language understanding and NavBoost for measuring real user satisfaction through Chrome click data.

References & Sources

  1. In-Depth Guide to How Google Search Works — Google Search Central
  2. Inside Googlebot: Demystifying Crawling, Fetching, and the Bytes We Process — Google (March 2026)
  3. Crawl Budget Management — Google Crawling Infrastructure
  4. Build and Submit a Sitemap — Google Search Central
  5. Google Crawler (User Agent) Overview — Google Crawling Infrastructure
  6. Our New Search Index: Caffeine — Google Search Central Blog
  7. Understanding Core Web Vitals and Google Search Results — Google Search Central
  8. A Guide to Google Search Ranking Systems — Google Search Central
  9. Creating Helpful, Reliable, People-First Content — Google Search Central
  10. The 2025 Google Algorithm Ranking Factors — First Page Sage
  11. Google's AI Ranking: RankBrain, BERT, DeepRank & NavBoost — SEO-Kreativ
  12. AI Overviews Killed CTR 61%: 9 Strategies to Show Up — Dataslayer
  13. Google AI Overview SEO Impact: 2026 Data & Statistics — Stackmatix
  14. Google AI Mode: 75M Users, Ads in 25% of AI Results — Digital Applied
  15. 50+ Zero Click Search Statistics for 2026 — CLICKVISION Digital
  16. SERP Features: What They Are & Why They Matter — Backlinko
  17. Featured Snippets: How to Win Position Zero — Search Engine Land
  18. Google AI Overviews Surge 58% Across 9 Industries — ALM Corp

Author: Michael Timi
Partner & Marketing Manager, eMac Media
Drives strategic partnerships and revenue growth through high impact marketing initiatives, business development, and lead generation.

Editor: Princess Pitts
Director of Communications Strategy, eMac Media
Specializes in editorial strategy, content governance, and brand communications at scale.
