Google’s Next Move: Could Your Robots.txt Files Be at Risk?


Google may expand the list of unsupported robots.txt rules in its documentation based on analysis of real-world robots.txt data collected through HTTP Archive.

Gary Illyes and Martin Splitt described the project on the latest episode of Search Off the Record. The work started after a community member submitted a pull request to Google’s robots.txt repository proposing two new tags be added to the unsupported list.

Illyes explained why the team broadened the scope beyond the two tags in the PR:

“We tried to not do things arbitrarily, but rather collect data.”

Rather than add only the two tags proposed, the team decided to look at the top 10 or 15 most-used unsupported rules. Illyes said the goal was “a decent starting point, a decent baseline” for documenting the most common unsupported tags in the wild.

How The Research Worked

The team used HTTP Archive to study what rules websites use in their robots.txt files. HTTP Archive runs monthly crawls across millions of URLs using WebPageTest and stores the results in Google BigQuery.

The first attempt hit a wall. The team “quickly figured out that no one is actually requesting robots.txt files” during the default crawl, meaning the HTTP Archive datasets don’t typically include robots.txt content.

After consulting with Barry Pollard and the HTTP Archive community, the team wrote a custom JavaScript parser that extracts robots.txt rules line by line. The custom metric was merged before the February crawl, and the resulting data is now available in the custom_metrics dataset in BigQuery.
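Google hasn't published the custom metric itself, but the line-by-line extraction described above can be sketched roughly as follows. This is a minimal Python illustration of the approach, not Google's actual parser (which is written in JavaScript), and the sample file is invented:

```python
import re

# Matches a "field: value" robots.txt line; field names are case-insensitive.
RULE_PATTERN = re.compile(r"^\s*([A-Za-z-]+)\s*:\s*(.*?)\s*$")

def extract_rules(robots_txt: str) -> list[tuple[str, str]]:
    """Extract (field, value) pairs line by line, skipping comments and junk."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # strip trailing comments
        match = RULE_PATTERN.match(line)
        if match:
            rules.append((match.group(1).lower(), match.group(2)))
    return rules

sample = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10  # unsupported by Google
Sitemap: https://example.com/sitemap.xml
"""
print(extract_rules(sample))
```

Lines that don't match the field-colon-value shape, such as HTML served in place of a robots.txt file, are simply dropped, which mirrors how junk data falls out of the analysis.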

What The Data Shows

The parser extracted every line that matched a field-colon-value pattern. Illyes described the resulting distribution:

“After allow and disallow and user agent, the drop is extremely drastic.”

Beyond those three fields, rule usage falls into a long tail of less common directives, plus junk data from broken files that return HTML instead of plain text.
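The drop-off Illyes describes is the kind of pattern a simple frequency count over the extracted field names would show. A sketch with invented counts, purely to illustrate the shape of the distribution (the real numbers live in the BigQuery dataset):

```python
from collections import Counter

# Hypothetical field counts, invented to illustrate the sharp drop-off
# after the three dominant fields, followed by a long tail.
fields = (
    ["disallow"] * 9000 + ["user-agent"] * 7000 + ["allow"] * 3000
    + ["sitemap"] * 1500 + ["crawl-delay"] * 400 + ["noindex"] * 120
    + ["host"] * 60
)

counts = Counter(fields)
for field, count in counts.most_common():
    print(f"{field:12s} {count}")
```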

Google currently supports four fields in robots.txt. Those fields are user-agent, allow, disallow, and sitemap. The documentation says other fields “aren’t supported” without listing which unsupported fields are most common in the wild.
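For reference, a robots.txt file that sticks to the four supported fields looks like the example below; the crawl-delay line is one example of a directive Google ignores (the URL and paths are placeholders):

```
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml

# Ignored by Google (unsupported field):
Crawl-delay: 5
```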

Google has clarified that unsupported fields are ignored. The current project extends that work by identifying specific rules Google plans to document.

The top 10 to 15 most-used rules beyond the four supported fields are expected to be added to Google’s unsupported rules list. Illyes did not name specific rules that would be included.

Typo Tolerance May Expand

Illyes said the analysis also surfaced common misspellings of the disallow rule:

“I’m probably going to expand the typos that we accept.”

His phrasing implies the parser already accepts some misspellings. Illyes didn’t commit to a timeline or name specific typos.
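Google hasn't published which misspellings its parser tolerates, but the general technique is straightforward normalization. A hypothetical sketch; the typo list here is invented for illustration, not Google's actual list:

```python
# Hypothetical map of common misspellings to the canonical field name.
# Google has not published which typos its parser actually accepts.
TYPO_MAP = {
    "dissallow": "disallow",
    "disalow": "disallow",
    "dissalow": "disallow",
    "useragent": "user-agent",
}

def normalize_field(field: str) -> str:
    """Return the canonical field name, tolerating known misspellings."""
    field = field.strip().lower()
    return TYPO_MAP.get(field, field)

print(normalize_field("Dissallow"))  # → disallow
```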

Why This Matters

Search Console already flags some unrecognized robots.txt rules. Documenting the most common unsupported directives would bring Google's public documentation more closely in line with what site owners already see in those reports.

Looking Ahead

The planned update would affect Google's public documentation and how disallow typos are handled. Anyone maintaining a robots.txt file with rules beyond user-agent, allow, disallow, and sitemap should audit it for directives Google has never supported.
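One quick way to run that audit is to flag every field name outside the supported set. A minimal Python sketch, with fetching omitted and a made-up sample file:

```python
import re

# The four fields Google's documentation lists as supported.
SUPPORTED_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}
RULE_PATTERN = re.compile(r"^\s*([A-Za-z-]+)\s*:")

def unsupported_fields(robots_txt: str) -> set[str]:
    """Return field names in the file that Google does not support."""
    found = set()
    for line in robots_txt.splitlines():
        match = RULE_PATTERN.match(line.split("#", 1)[0])
        if match:
            found.add(match.group(1).lower())
    return found - SUPPORTED_FIELDS

sample = "User-agent: *\nDisallow: /tmp/\nCrawl-delay: 10\nHost: example.com"
print(sorted(unsupported_fields(sample)))  # → ['crawl-delay', 'host']
```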

The HTTP Archive data is publicly queryable on BigQuery for anyone who wants to examine the distribution directly.


Featured Image: Screenshot from: YouTube.com/GoogleSearchCentral, April 2026. 
