Trakkr behind a WAF
How Trakkr crawler tracking works when your site sits behind Sucuri, Cloudflare, Wordfence, AWS WAF, or any other firewall. Cache caveats, IP-header handling, and how to allowlist AI bot user-agents per platform.
A WAF or security plugin is the single most common reason crawler data looks lower than expected. To most rule engines, AI crawlers look like scrapers — and the same defences that block scrapers block them.
This page covers the three things that matter when Trakkr sits behind one: which install method to pick, how the real client IP makes it through proxy layers, and how to allowlist AI bots per platform.
Pick your install method
Pick the deepest layer Trakkr can read directly. CDN-side integrations see traffic that origin-side installs can't, so they're the better choice when available.
| Your setup | Best install | Why |
|---|---|---|
| Site proxied through Cloudflare | Cloudflare | Reads Cloudflare's GraphQL analytics — sees blocked and cached requests too. |
| Site behind AWS WAF + CloudFront | AWS CloudFront Lambda@Edge | Runs at the edge on every viewer request, including cache hits. |
| Site behind Sucuri Firewall | Origin-side: WordPress, Next.js, Node, or Nginx | Sucuri's logs aren't exposed to site owners; Trakkr reads at the origin. |
| Site behind Imperva, StackPath, or another generic WAF | Origin-side matching your stack | Most enterprise WAF analytics don't expose a feed Trakkr can read. |
| WordPress + Wordfence or iThemes | WordPress plugin + allowlist (below) | Plugin runs at origin; allowlist makes sure bots reach WP. |
What each install method sees
When your site sits behind a CDN or caching WAF, requests served from cache never reach your origin — so origin-side integrations (WordPress plugin, Express, Nginx) don't record them. Same for requests the WAF blocks before they hit your stack.
| Install method | Cache hits | WAF-blocked requests |
|---|---|---|
| Cloudflare integration | ✓ | ✓ |
| AWS CloudFront Lambda@Edge | ✓ (Viewer Request) | ✓ |
| Vercel Log Drain | ✓ (Vercel cache) | Partial — depends on edge config |
| Netlify Edge Function | ✓ (Netlify cache) | ✓ when WAF is at Netlify or downstream |
| WordPress / Next.js / Node / Nginx | ✗ | ✗ |
In practice the origin-side gap is smaller than it looks: AI crawler volume is low relative to overall traffic, and most CDN configurations bypass cache for non-browser user-agents. But if you want full coverage on a heavily-cached site, the Cloudflare or CloudFront integration is the one to use.
Real client IP detection
When a request flows through a proxy, the TCP connection your origin sees comes from the proxy — not the client. The real IP rides along in a request header, and there's no single standard.
| Header | Proxy that sets it |
|---|---|
| CF-Connecting-IP | Cloudflare |
| True-Client-IP | Cloudflare Enterprise, Akamai |
| X-Sucuri-ClientIP | Sucuri Firewall |
| X-Real-IP | Nginx, generic reverse proxies |
| X-Forwarded-For | Almost everything (comma-separated chain; the client is first) |
| REMOTE_ADDR / connection IP | Fallback when no proxy header is set |
The Trakkr WordPress plugin walks this list and picks the first valid IP:
CF-Connecting-IP → True-Client-IP → X-Sucuri-ClientIP → X-Real-IP → X-Forwarded-For → REMOTE_ADDR
So a WordPress site behind Sucuri correctly resolves the real bot IP from X-Sucuri-ClientIP, even though every TCP connection comes from a Sucuri edge server.
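As a rough illustration of that precedence walk for a Node-style origin, here is a minimal sketch. It is not the plugin's actual code: `resolveClientIp` and `looksLikeIp` are hypothetical names, the validity check is deliberately loose, and the header map is assumed to use lower-cased keys (as Node's `req.headers` does).

```typescript
// Header precedence copied from the list above, lower-cased because Node
// normalises incoming header names to lower case.
const IP_HEADERS: string[] = [
  "cf-connecting-ip",
  "true-client-ip",
  "x-sucuri-clientip",
  "x-real-ip",
  "x-forwarded-for",
];

// Loose validity check (hypothetical helper): accepts dotted IPv4 and
// anything shaped like a hex/colon IPv6 literal. Not a full validator.
function looksLikeIp(value: string): boolean {
  const ipv4 = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(value);
  if (ipv4) return ipv4.slice(1).every((octet) => Number(octet) <= 255);
  return value.includes(":") && /^[0-9a-fA-F:.]+$/.test(value);
}

// Walk the headers in order and return the first valid IP,
// falling back to the connection address (REMOTE_ADDR).
function resolveClientIp(
  headers: Record<string, string | undefined>,
  remoteAddr: string,
): string {
  for (const name of IP_HEADERS) {
    const raw = headers[name];
    if (!raw) continue;
    // X-Forwarded-For is a comma-separated chain; the client is the first entry.
    const candidate = (raw.split(",")[0] ?? "").trim();
    if (looksLikeIp(candidate)) return candidate;
  }
  return remoteAddr;
}

// Example: behind Sucuri, the Sucuri header wins over X-Forwarded-For.
const ip = resolveClientIp(
  { "x-sucuri-clientip": "203.0.113.7", "x-forwarded-for": "198.51.100.1, 10.0.0.1" },
  "192.0.2.10",
);
console.log(ip);
```

These headers are only trustworthy when the origin accepts traffic exclusively from the proxy; a client that can reach the origin directly can set any of them itself.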
Other origin-side integrations ship with the common headers but may need a one-line tweak when you're behind a WAF whose header isn't in the default set:
| Method | Reads by default | Behind Sucuri / Cloudflare |
|---|---|---|
| Next.js middleware | x-forwarded-for (first entry) | Add CF-Connecting-IP / X-Sucuri-ClientIP lookup for precise attribution |
| Express middleware | req.ip (with app.set('trust proxy', true)) | Same — prepend the proxy-specific header lookup |
| Nginx / OpenResty | ngx.var.remote_addr | Configure real_ip_header + set_real_ip_from for your proxy first |
| Lambda@Edge | request.clientIp | CloudFront resolves the real client IP for you |
Allowlist AI bots
If the verification ping arrives but real crawler traffic stays empty for 48 hours, the bots are probably being blocked or challenged before they reach your origin. Cover the major user-agents:
GPTBot
ChatGPT-User
OAI-SearchBot
ClaudeBot
Claude-User
Claude-SearchBot
PerplexityBot
Perplexity-User
Bytespider
CCBot
Amazonbot
MistralAI-User
Meta-ExternalFetcher
Google-Agent
Applebot
Sucuri Firewall
1. Sucuri dashboard → your site → Settings → Whitelist & Blacklist.
2. Under Whitelist User-Agent, paste a regex covering the bots above, e.g.
   (GPTBot|ChatGPT-User|ClaudeBot|Claude-User|PerplexityBot|Perplexity-User|OAI-SearchBot|Bytespider|CCBot|Amazonbot|Applebot).
3. Save. Matched user-agents stop being challenged immediately.
If you block /wp-json/ (a common WordPress lockdown), also whitelist /wp-json/trakkr/* under Settings → Access Control; otherwise the WordPress connection works but no crawler visits ever sync.
Cloudflare
Multiple bot layers can challenge crawlers. Handle whichever you're using:
- Bot Fight Mode (free): the only control is on/off. If you can't see crawler traffic and Bot Fight Mode is on, toggle it off for 24 hours as a diagnostic.
- Super Bot Fight Mode (Pro+): under Security → Bots, set Verified bots to Allow. Cloudflare's verified list covers most major AI crawlers but not all; add a custom rule for the rest.
- Custom WAF Rule (any plan): Security → WAF → Custom rules. Field: User Agent, Operator: contains, Value: GPTBot (repeat with OR for each bot). Action: Skip. Place above other blocking rules.
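Repeating the contains clause by hand for every bot is error-prone, so the custom rule can also be pasted as a single expression via the rule builder's expression editor. The sketch below generates that expression, assuming Cloudflare's rules-language syntax (the `http.user_agent` field with the `contains` operator); `buildCloudflareExpression` is a hypothetical helper name, not part of Trakkr or Cloudflare.

```typescript
// Build one OR-chained expression from a list of user-agent substrings.
// Bot names are assumed to contain no double quotes or backslashes.
function buildCloudflareExpression(bots: string[]): string {
  return bots
    .map((bot) => `(http.user_agent contains "${bot}")`)
    .join(" or ");
}

// Example with three of the user-agents listed above.
const expression = buildCloudflareExpression(["GPTBot", "ClaudeBot", "PerplexityBot"]);
console.log(expression);
```

Pass the full user-agent list from this page to cover every bot, then set the rule's action to Skip as described above.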
Wordfence (WordPress)
1. Wordfence → Firewall → Blocking → Advanced Blocking: check for any User-Agent blocks matching AI crawler strings.
2. Wordfence → Live Traffic: filter by user-agent (e.g. GPTBot, ClaudeBot). Rows tagged Blocked are your culprits.
AI bots rarely publish static IPs, so IP-based allowlists usually don't help — work from the user-agent side.
iThemes / Solid Security (WordPress)
1. Security → Settings → Network Brute Force Protection → API Settings → Banned Hosts and Banned User Agents: remove wildcard matches like *bot* or *scraper* that catch AI crawlers.
2. Check the Firewall section for User Agent filters under System Tweaks.
AWS WAF
1. AWS WAF & Shield → Web ACLs → your ACL.
2. If you use a managed rule group (AWSManagedRulesCommonRuleSet, AWSManagedRulesBotControlRuleSet), set its bot rules to Count (not Block), or add a User-Agent Exception.
3. For custom rules: add a higher-priority Allow rule matching your AI bot regex.
Imperva, StackPath, generic WAFs
Most enterprise WAFs let you create a User-Agent-based Allow rule that takes priority over the default bot-protection layer:
1. New rule, priority above default bot rules.
2. Match: User-Agent matches <regex>.
3. Action: Allow (or Bypass, or Skip; terminology varies).
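Before pasting an allowlist regex into any of these WAFs, it can help to check it against real user-agent strings. The following is a minimal sketch that builds the alternation pattern from the bot list on this page and tests it; `buildAllowlistPattern` and `isAllowlisted` are hypothetical helper names, and individual WAFs may use a different regex dialect from JavaScript's.

```typescript
// User-agent tokens from the allowlist section of this page.
const AI_BOTS: string[] = [
  "GPTBot", "ChatGPT-User", "OAI-SearchBot",
  "ClaudeBot", "Claude-User", "Claude-SearchBot",
  "PerplexityBot", "Perplexity-User",
  "Bytespider", "CCBot", "Amazonbot",
  "MistralAI-User", "Meta-ExternalFetcher", "Google-Agent", "Applebot",
];

// Escape regex metacharacters so each token is matched literally.
function escapeRegex(token: string): string {
  return token.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Produce a pattern of the form (A|B|C), as in the Sucuri example.
function buildAllowlistPattern(bots: string[]): string {
  return `(${bots.map(escapeRegex).join("|")})`;
}

// True when the user-agent contains any allowlisted token.
function isAllowlisted(userAgent: string, pattern: string): boolean {
  return new RegExp(pattern).test(userAgent);
}

const pattern = buildAllowlistPattern(AI_BOTS);
const gptBotUa = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)";
console.log(pattern);
console.log(isAllowlisted(gptBotUa, pattern));
```

The match is an unanchored substring match, which mirrors how the contains-style rules above behave; it says nothing about whether the request genuinely comes from the named crawler.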
Find out who's blocking
If you're not sure which layer is blocking, curl your site as a bot from outside the WAF:
curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  -I https://your-site.com/
A 200 OK means the bot would reach you. A 403, 503, or challenge page means a WAF is in the way; the response headers usually identify the layer (Server: sucuri/cloudproxy, Server: cloudflare plus cf-ray, x-amzn-requestid, etc.).
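To run the same check across several URLs or user-agents, the interpretation step can be scripted. This sketch only classifies a status code and header set that you have already fetched; `identifyLayer` and its return shape are hypothetical, and the header hints are limited to the ones named above.

```typescript
type ProbeResult = {
  verdict: "ok" | "blocked" | "inconclusive";
  layer: "sucuri" | "cloudflare" | "aws" | "unknown";
};

// Classify a response to a bot-user-agent request.
// A challenge page served with a 2xx status is not detected here.
function identifyLayer(status: number, headers: Record<string, string>): ProbeResult {
  // Normalise header names and values so matching is case-insensitive.
  const h: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    h[name.toLowerCase()] = value.toLowerCase();
  }
  const server = h["server"] ?? "";

  let layer: ProbeResult["layer"] = "unknown";
  if (server.includes("sucuri") || server.includes("cloudproxy")) layer = "sucuri";
  else if (server.includes("cloudflare") || "cf-ray" in h) layer = "cloudflare";
  else if ("x-amzn-requestid" in h) layer = "aws";

  let verdict: ProbeResult["verdict"] = "inconclusive"; // e.g. redirects
  if (status >= 200 && status < 300) verdict = "ok";
  else if (status === 403 || status === 503) verdict = "blocked";

  return { verdict, layer };
}

// Example: a 403 answered by a Sucuri edge server.
const probe = identifyLayer(403, { Server: "Sucuri/Cloudproxy" });
console.log(probe.verdict, probe.layer);
```

Feed it the status and headers from a HEAD request sent with a bot user-agent, such as the curl call above. An "inconclusive" result on a 3xx usually just means the redirect needs following first.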
Troubleshooting
Three places to look when you suspect a WAF is interfering:
1. Send verification (Crawler header): fires synthetic GPTBot, PerplexityBot, and ChatGPT-User rows through the ingest pipeline. If they appear in the Feed with a Verified badge, the Trakkr side is healthy and the issue is bots not reaching your origin.
2. Access tab: cross-references your robots.txt with actual bot visit data. Two findings here flag WAF problems:
   - Traffic dropped: visits down sharply with no robots.txt change. Usually a new Cloudflare bot mode, WAF rule, or rate limit.
   - Access mismatch: robots.txt allows the bot but most of its requests get denied at the origin.
3. robots.txt check: the Access tab also parses your robots.txt and flags a Disallow: / under User-agent: GPTBot (etc.). Bots respect this even if your WAF allowlist is correct.
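The robots.txt condition in the last step is easy to check locally as well. The sketch below is a simplified stand-in, not Trakkr's parser: `botsBlockedAtRoot` is a hypothetical function that reports which named bots sit in a group containing an exact `Disallow: /`. It ignores wildcard (`*`) groups and Allow overrides.

```typescript
// Return the bots (from `bots`) whose own robots.txt group has "Disallow: /".
function botsBlockedAtRoot(robotsTxt: string, bots: string[]): string[] {
  const wanted = new Map<string, string>();
  for (const bot of bots) wanted.set(bot.toLowerCase(), bot);

  const blocked = new Set<string>();
  let group: string[] = []; // user-agents named by the current group
  let inRules = false;      // true once the current group has a rule line

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // A User-agent line after rules starts a new group.
      if (inRules) {
        group = [];
        inRules = false;
      }
      group.push(value.toLowerCase());
    } else if (field === "disallow" || field === "allow") {
      inRules = true;
      if (field === "disallow" && value === "/") {
        for (const agent of group) {
          const original = wanted.get(agent);
          if (original !== undefined) blocked.add(original);
        }
      }
    }
  }
  return bots.filter((bot) => blocked.has(bot));
}

const sampleRobots = [
  "User-agent: *",
  "Allow: /",
  "",
  "User-agent: GPTBot",
  "User-agent: CCBot",
  "Disallow: /",
  "",
  "User-agent: ClaudeBot",
  "Disallow: /private/",
].join("\n");

const blockedBots = botsBlockedAtRoot(sampleRobots, ["GPTBot", "ClaudeBot", "CCBot"]);
console.log(blockedBots);
```

In the sample, GPTBot and CCBot share a group that disallows the whole site, while ClaudeBot is only kept out of one directory and is therefore not reported.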
Going further
Install crawler tracking
If you're still deciding which install method to use, the picker walks through all 18 options with prerequisites and step-by-step setup.
Crawlers dashboard
How to read crawler data once it's flowing — the three bot categories, the page funnel, alerts.
