Shocking Reveal: Over 80% of ‘AI Assistant’ Traffic Was Fake—Googlebot Data Tells an Even Darker Story

Ever wonder how many of those “AI assistant” visits in your site logs are actually pretenders wearing a bot costume? When I launched CitationIQ.com, I figured the 33 AI hits in two weeks were a decent start, until I dug a little deeper and realized only six were the real deal. The rest were fakes, some even trying to swipe sensitive files under the guise of ChatGPT. Googlebot’s number was even more suspect: out of nearly 800 Googlebot-named requests, only about a hundred came from a genuine Google IP. Bots spoofing the big players is nothing new, but the scale here is eye-opening. So what is the truth behind those server logs? How do you separate the trustworthy crawlers from the masqueraders? And why is this more than a numbers game, and instead the foundation for protecting your site, your data, and your SEO strategy? Stick around as I walk through the sleuthing process, the Python code that powers it, and why you should run this check on your own logs before those fakes start shaping your analytics.

I launched CitationIQ.com recently. Over the last two weeks, my logs claimed 33 AI assistants visited, a little better than two a day. That number is a lie. The real number? Six.

Googlebot looked worse. Of 799 requests carrying its name, only 107 were real, though we all know scammers love to spoof Googlebot. And some of those fake AI visits, while wearing ChatGPT’s name, asked my server to hand over its secrets file.

I run this brand-new platform, and I have spent zero dollars promoting it thus far, so traffic remains modest. I went looking for a quiet, accurate read of who (robots and crawlers, since Google Analytics 4 handles the rest) was visiting, expecting small numbers, and I got them. What I did not expect was that most of even these modest numbers were lies. Here is what happened, how I checked, how I chased the stubborn cases to proof, and why the most useful thing you can do this week is run the same check on your own logs.

The Thing Nobody Checks

When a bot fetches your page, it announces a name. ChatGPT-User. Claude-User. Googlebot. CCBot, or whoever they say they are. Your server writes that name into the log, your analytics counts it, and you draw conclusions from it.

The name is self-reported, merely a string in the request header, and anyone can put anything they like there. Claiming to be Googlebot costs nothing and proves nothing. It is a stranger at your door in a delivery uniform, and the uniform is easy to fake.

The real check is not complicated. The major operators publish the actual IP addresses their bots use, as plain files you can open right now, and a request is legitimate only if the name matches and the address sits inside the published list. The name is the claim. The IP is the proof.

  • ChatGPT-User https://openai.com/chatgpt-user.json
  • Claude (all bots) https://claude.com/crawling/bots.json
  • Perplexity-User https://www.perplexity.com/perplexity-user.json
  • Googlebot https://developers.google.com/static/crawling/ipranges/common-crawlers.json
  • CCBot https://index.commoncrawl.org/ccbot.json

I built my check with three outcomes, not two. Verified means the IP is in the published range. Spoofed means the ranges loaded, and the IP is not in them. Unverifiable means I could not determine it, because a list failed to load or a record was missing. I never call something fake just because I failed to confirm it, and later that restraint is exactly what kept one investigation honest long enough to reach the truth.

The check is about 15 lines of Python using only the standard library, because deciding whether an address sits inside a network range is a solved problem.

import ipaddress, json, urllib.request

# A vendor’s published list of the IPs its bot really uses.
url = "https://openai.com/chatgpt-user.json"
data = json.loads(urllib.request.urlopen(url).read())

# Pull every address range out of the file.
nets = []

def collect(node):
    # Walk the JSON recursively; keep any string that parses as a network.
    if isinstance(node, dict):
        for v in node.values():
            collect(v)
    elif isinstance(node, list):
        for v in node:
            collect(v)
    elif isinstance(node, str):
        try:
            nets.append(ipaddress.ip_network(node, strict=False))
        except ValueError:
            pass

collect(data)

# A request claiming to be ChatGPT-User is only real if its
# source IP sits inside one of those ranges.
def is_real(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)

That snippet is the heart of the check, not the whole thing. It is read-only and standard-library, but it is not a finished verifier. As written, it loads one vendor’s list, so on its own, it would wrongly flag every real Claude, Perplexity, and Google request as fake. A working version wraps this core in four things the example leaves out: It reads your actual log lines instead of one hardcoded address, maps each bot name to its own published list, adds the unverifiable state for cases a list cannot settle, and falls back to reverse DNS for an operator like Common Crawl that leans on it.
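To make the first three of those wrappers concrete, here is a minimal sketch, not the script used for this article. It assumes logs in the common Combined Log Format, and the names `RANGE_URLS`, `claimed_bot`, `classify`, and `check_line` are hypothetical. The mapping reuses the list URLs given earlier, and `ranges_by_bot` is assumed to hold the networks already loaded for each bot, or nothing for a bot whose list failed to load.

```python
import ipaddress
import re

# Hypothetical mapping from a user-agent token to the vendor list that
# vouches for it. The URLs are the ones listed earlier in this article.
RANGE_URLS = {
    "ChatGPT-User": "https://openai.com/chatgpt-user.json",
    "Claude-User": "https://claude.com/crawling/bots.json",
    "Perplexity-User": "https://www.perplexity.com/perplexity-user.json",
    "Googlebot": "https://developers.google.com/static/crawling/ipranges/common-crawlers.json",
    "CCBot": "https://index.commoncrawl.org/ccbot.json",
}

# Assumed format (Combined Log Format):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def claimed_bot(agent):
    """Return the first known bot token found in a user-agent string, else None."""
    lowered = agent.lower()
    for token in RANGE_URLS:
        if token.lower() in lowered:
            return token
    return None


def classify(ip, nets):
    """Return 'verified', 'spoofed', or 'unverifiable' for one request IP.

    `nets` is a list of ip_network objects, or None/empty when the
    vendor's list could not be loaded.
    """
    if not nets:
        # Nothing to compare against: unknown, not fake.
        return "unverifiable"
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # A malformed address cannot be settled either way.
        return "unverifiable"
    return "verified" if any(addr in net for net in nets) else "spoofed"


def check_line(line, ranges_by_bot):
    """Parse one log line; return (bot, ip, verdict), or None if no known bot name."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    bot = claimed_bot(match["agent"])
    if bot is None:
        return None
    ip = match["ip"]
    return bot, ip, classify(ip, ranges_by_bot.get(bot))
```

Filling `ranges_by_bot` means running the `collect` walk from the snippet above once per URL in `RANGE_URLS` and storing `None` whenever a download fails, so that a network hiccup produces unverifiable rows rather than false accusations.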

The Demand Gap

Start with the demand signal, the requests that come not from a scheduled crawl but from an assistant fetching my page live during a real user’s session. That is what these agent names mark: a fetch triggered in real time by someone using the assistant, not the routine background crawling everything else here is doing. What the log cannot tell me is what that person was after, whether they asked about me by name or something broader where my page got pulled in to ground an answer, so I will not claim either. What I can say is that 33 requests carried one of those live-fetch names. Six came from an IP the vendor publishes. Twenty-seven did not. That is an 81.8% spoof rate among the requests I could check.

The fakes gave themselves away by where they went. A real assistant fetch lands on a real page. The spoofed ones, still wearing the assistant’s name, went hunting for .env.production, secrets.yaml, and config.json. Nobody asked an assistant to read my environment variables. Those were credential scanners borrowing a trusted name to slip past filters, and the IP check caught every one.
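The path-based tell described above can be sketched as a small filter. This is an illustration only: the function name is hypothetical, the filename list contains just the three probes named here, and a site that legitimately serves a `config.json` would need to adjust it.

```python
from urllib.parse import urlsplit

# Filenames taken from the probes described above; extend for your own site.
SENSITIVE_NAMES = ("secrets.yaml", "config.json")


def looks_like_credential_probe(path):
    """True if the last path segment looks like an env, secrets, or config file."""
    last_segment = urlsplit(path).path.rsplit("/", 1)[-1].lower()
    # ".env", ".env.production", ".env.local", and so on.
    return last_segment.startswith(".env") or last_segment in SENSITIVE_NAMES
```

A request that carries an assistant's name and also trips this filter is a strong candidate for the spoofed pile, but the IP check remains the proof.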

Hold these numbers loosely. Six verified is only six, one small new site over 14 days, and you cannot build a theory on a sample that thin. Treat it as my baseline, not a finding about the world. Your numbers will matter far more than mine.

The Bigger Number, Which Is Not News

Of 799 requests carrying the Googlebot name, only 107 came from a verified Google address. The other 692, roughly 87%, were not Google.
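For anyone reproducing the arithmetic on their own logs, the spoof rate is just the share of name-claiming requests that failed the IP check. A minimal sketch with a hypothetical helper name, assuming every request could be checked (no unverifiable rows, as was the case for these two counts):

```python
def spoof_rate(total_claimed, verified):
    """Fraction of requests carrying a bot's name that did not verify by IP."""
    if total_claimed <= 0:
        raise ValueError("need at least one request carrying the name")
    return (total_claimed - verified) / total_claimed


print(f"Demand fetches: {spoof_rate(33, 6):.1%}")     # 27 of 33  -> 81.8%
print(f"Googlebot:      {spoof_rate(799, 107):.1%}")  # 692 of 799 -> 86.6%
```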

This is not a discovery. Googlebot has been the most impersonated name on the web for the better part of two decades, which is exactly why Google publishes its ranges and tells you to verify by IP rather than trust the string. What the data does is confirm the pattern and show its scale on a brand-new site with no traffic to speak of. The most trusted crawler name draws the most impersonation, and it draws it immediately. Some fakes even used Googlebot strings tied to products Google retired years ago, a scanner copying an old user-agent off a list and never looking back.

So the reminder holds, old as it is. The Googlebot line in your logs is not a Google number. It is a “claims to be Google” number, and the gap can be enormous.

Two Different Games

First, a clarification, because the numbers are about to get bigger. Everything so far counted demand: Live fetches an assistant makes during a real conversation, the agents whose names end in -User. What follows is a separate population, the scheduled crawlers that index and train in the background, and they are different bots. ChatGPT-User is not GPTBot, and Claude-User is not ClaudeBot. So these counts run larger than the six, and they do not overlap with them. Strip the fakes away, and the verified crawl tells a more interesting story than the demand fetches did, because the crawlers themselves play two different games people lump together.

Some do retrieval. They build the index that gets pulled into an answer today. When a person asks an assistant a question, and it reaches for current sources, this is the machinery behind that. Retrieval is about whether you show up this week.

Others do training. They harvest content that may be folded into the weights of the next model. When a training crawler takes your page, that is not a visit you measure in referral traffic. It is a deposit into a corpus used to build models that will answer questions for years, often without ever fetching you again. The payoff is delayed, compounding, and invisible to every dashboard you own.

Here is my verified crawl data (two weeks, one new site, a snapshot, and nothing more). The most active verified crawler on my domain was not Google. It was Anthropic’s ClaudeBot at 166 confirmed crawls, ahead of verified Googlebot at 107, with OpenAI’s GPTBot at 46 and its search crawler at 40 behind. Is that a trend? No, it is 14 days on a site nobody has heard of. But the composition is worth seeing, because who spends crawl budget on a brand-new, unpromoted domain is the kind of signal that turns strategic once the volume is real.
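One way to keep the two games apart in a report is to tag each verified bot with a role before tallying. The sketch below is an assumption-laden illustration: the role labels follow my reading of each vendor's public descriptions, `OAI-SearchBot` is assumed to be the OpenAI search crawler referred to above, and Googlebot is tagged "bundled" because its single crawl cannot be split by purpose from the logs.

```python
from collections import Counter

# Assumed roles; check each vendor's documentation before relying on them.
ROLE = {
    "ChatGPT-User": "demand",
    "Claude-User": "demand",
    "Perplexity-User": "demand",
    "OAI-SearchBot": "retrieval",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Googlebot": "bundled",
}


def tally_by_role(verified_counts):
    """Sum verified request counts per role; unmapped bots land in 'unknown'."""
    totals = Counter()
    for bot, count in verified_counts.items():
        totals[ROLE.get(bot, "unknown")] += count
    return dict(totals)


# The verified crawl counts reported above.
verified = {"ClaudeBot": 166, "Googlebot": 107, "GPTBot": 46, "OAI-SearchBot": 40}
print(tally_by_role(verified))
```

Under those assumptions the snapshot splits into 212 training fetches, 107 bundled Googlebot fetches, and 40 retrieval fetches.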

Retrieval is your visibility today. Training is whether the model knows you tomorrow, without having to look you up at all. Most measurement fixates on the first. The second is quieter, arguably matters more, and almost nobody is watching it.

The One I Had To Chase: CCBot

Which brings me to what might be the most consequential training crawler of all, and the best illustration of why that unverifiable column exists. Common Crawl, fetched by CCBot, produces the open dataset that sits underneath a large share of the models trained in recent years. So when my report showed CCBot at zero verified, four spoofed, and 16 unverifiable, the 16 bothered me. Unverifiable swings both ways. It does not mean fake, and it does not mean real. It means go find out. So I did, and the path is one you can copy.

First, the published list. Common Crawl publishes its crawler IP ranges, and not one of the 20 CCBot-labeled requests fell inside them.

Second, reverse DNS. Real CCBot resolves to a commoncrawl.org hostname. Four of mine resolved to something that was not Common Crawl, and the other sixteen had no reverse record at all, which is precisely why the script would not vouch for them.
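The reverse DNS step can be sketched with the standard library. This is an illustration under stated assumptions, not the script behind these results: the function names are hypothetical, the `.commoncrawl.org` suffix comes from the paragraph above, and the forward-confirmation step (resolving the hostname back to an IP) is an addition of mine, since whoever controls an address can set its reverse record to any name.

```python
import socket


def _reverse(ip):
    """Return the PTR hostname for an IP, or None if there is no record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def _forward(hostname):
    """Return the set of IP strings a hostname resolves to (empty on failure)."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return set()


def reverse_dns_verdict(ip, suffixes=(".commoncrawl.org",),
                        reverse=_reverse, forward=_forward):
    """Return 'verified', 'spoofed', or 'unverifiable' from DNS evidence alone."""
    hostname = reverse(ip)
    if not hostname:
        # No reverse record: absence of evidence, not evidence of fraud.
        return "unverifiable"
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(tuple(suffixes)):
        return "spoofed"
    forward_ips = forward(hostname)
    if not forward_ips:
        return "unverifiable"
    # Plain string comparison; fine for IPv4, IPv6 may need normalizing first.
    return "verified" if ip in forward_ips else "spoofed"
```

The `reverse` and `forward` parameters exist only so the logic can be exercised without live DNS. In normal use, call `reverse_dns_verdict(ip)` and let the defaults do the lookups.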

Third, the corpus itself. Common Crawl runs a public index where you can ask whether a domain has been captured. I checked the three most recent monthly crawls for my domain, with wildcards, so I was not merely matching the homepage. Nothing.
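A sketch of that index lookup follows. It rests on my understanding of Common Crawl's public CDX index, which should be confirmed against their documentation: that `collinfo.json` lists crawls newest first, that each entry carries a `cdx-api` endpoint, that the endpoint accepts `url`, `output`, and `limit` parameters, and that an empty result comes back as HTTP 404. The function names are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumed location of the crawl listing.
COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def build_query(cdx_api, domain, limit=5):
    """Build a wildcard query URL so every path on the domain is matched."""
    params = urllib.parse.urlencode(
        {"url": f"{domain}/*", "output": "json", "limit": str(limit)}
    )
    return f"{cdx_api}?{params}"


def recent_captures(domain, crawls=3, timeout=30):
    """Return {crawl_id: [capture records]} for the most recent crawls."""
    with urllib.request.urlopen(COLLINFO_URL, timeout=timeout) as resp:
        collections = json.load(resp)
    results = {}
    for info in collections[:crawls]:
        url = build_query(info["cdx-api"], domain)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                lines = resp.read().decode("utf-8").splitlines()
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise
            lines = []  # assumed meaning: no captures in this crawl
        # Each non-empty line is assumed to be one JSON capture record.
        results[info["id"]] = [json.loads(line) for line in lines if line.strip()]
    return results
```

An empty list for every crawl is the "Nothing" result described above. Any record at all would mean the real CCBot had been there.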

Fourth, ownership. I pulled the raw IPs out of my logs and ran a WHOIS lookup on each. Every one traced to commodity hosting across several countries (most in Europe), the cheap rented infrastructure scanners run on.

Four independent angles, one answer. All 20 were impostors. The teaching point is the part an SEO will appreciate. The automated check correctly refused to call those 16 fake, since an absent record is not evidence of fraud, and it took manual digging to close the loop. So when your own report shows unverifiable rows, that is not a dead end. It is an invitation: pull the IPs, check the owner, check the corpus, and the picture resolves.

The One I Could Not Measure: Gemini

There is one major player I could not measure at all, and the reason is the point. Gemini.

OpenAI, Anthropic, and Perplexity each expose distinct, verifiable signals. You can separate their training crawler from their retrieval crawler from their live, user-driven fetch, and confirm each by IP. Google does not work this way. There is one Googlebot crawl. Whether the content it gathers feeds Gemini training is governed by a robots.txt token called Google-Extended, which is not a crawler. It never fetches anything. It is a permission flag on a crawl that already happened. There is no Gemini fetcher in your logs by design, and so no way to measure Gemini demand by name, the way you can for ChatGPT or Claude.

My script looked for it. It found nothing claiming to be Gemini, which tells you even the impersonators have not bothered with that name. It did catch four requests announcing themselves as Google-Extended while fetching pages, and since Google-Extended cannot fetch, those four are fake on their face, disproved by the name alone before any IP check runs.
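That name-only disproof is easy to encode, since it needs no IP list at all. A minimal sketch with hypothetical names; the token list holds only Google-Extended because that is the only non-fetching token discussed here.

```python
# robots.txt control tokens that, per the explanation above, never fetch pages.
NON_FETCHING_TOKENS = ("Google-Extended",)


def fake_by_name(user_agent):
    """True if the user-agent claims a name that cannot issue requests at all."""
    lowered = user_agent.lower()
    return any(token.lower() in lowered for token in NON_FETCHING_TOKENS)
```

Running this before the IP check lets such rows be marked spoofed immediately, with the reason recorded as the name itself.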

If you have done this work as long as I have, this is familiar. In 2011, Google encrypted search referrers, and the keyword data we depended on collapsed into “(not provided).” The granularity went away, and we were handed a flag in place of a measurement. The AI era is repeating that pattern. Where its competitors expose training, retrieval, and demand as separate, verifiable events, Google bundles them into a single crawl and an invisible token. You can confirm Googlebot and nothing past it. The rest is, once again, not provided.

Two Honest Asterisks

Perplexity is murkier than a clean pass or fail. Its crawler failed my IP check on 24 of 36 requests, but Perplexity has been documented fetching from addresses outside its own published ranges, so some failures may be impersonators, and some may be Perplexity operating off-list. For that one, spoofed is ambiguous in both directions. And again, all of this is two weeks of data on one small site.

Go Make Your Own Baseline

Do not take my numbers; take the method.

My data is thin because my site is new, and yours probably is not. If you have any real traffic, you are sitting on a far better dataset than mine, in your own access logs, right now, and you can run this check this afternoon. Pull a date range, match the names, verify the IPs against the published lists, and find your real fraction. Then look at your Googlebot line and brace yourself.

When you hit unverifiable rows, do what I did with CCBot. Pull the IPs, check the owner, query the corpus, and chase it until the picture resolves. There is nothing an SEO enjoys more than running down proof, and this is a target-rich place to do it.

What You Are Measuring, And What You Are Not

Think about what even a verified number does, and does not, tell you. A confirmed crawl tells you a real bot took your content. It does not tell you what happened next: whether your page ended up in the answer a person saw, whether you were cited, paraphrased without credit, or left out entirely, or whether the model that trained on you will ever surface your name or quietly absorb you and move on. The fetch is the visit. The outcome is a separate question.

That gap, between being fetched and being used, is the question I spend my days on, and it is the reason I built CitationIQ.

If you run this on your own logs, reply and tell me two numbers: your demand spoof rate, and your Googlebot one.

This post was originally published on Duane Forrester Decodes.


Featured Image: Prostock-studio/Shutterstock; Paulo Bobita/Search Engine Journal
