Trakkr behind a WAF

How Trakkr crawler tracking works when your site sits behind Sucuri, Cloudflare, Wordfence, AWS WAF, or any other firewall. Cache caveats, IP-header handling, and how to allowlist AI bot user-agents per platform.


A WAF or security plugin is the single most common reason crawler data looks lower than expected. To most rule engines, AI crawlers look like scrapers — and the same defences that block scrapers block them.

This page covers the three things that matter when Trakkr sits behind one: which install method to pick, how the real client IP makes it through proxy layers, and how to allowlist AI bots per platform.

Pick your install method

Pick the outermost layer Trakkr can read directly, meaning the one closest to the client. CDN-side integrations see traffic that origin-side installs can't, so they're the better choice when available.

Your setup	Best install	Why
Site proxied through Cloudflare	Cloudflare	Reads Cloudflare's GraphQL analytics — sees blocked and cached requests too.
Site behind AWS WAF + CloudFront	AWS CloudFront Lambda@Edge	Runs at the edge on every viewer request, including cache hits.
Site behind Sucuri Firewall	Origin-side: WordPress, Next.js, Node, or Nginx	Sucuri's logs aren't exposed to merchants; Trakkr reads at the origin.
Site behind Imperva, StackPath, or another generic WAF	Origin-side matching your stack	Most enterprise WAF analytics don't expose a feed Trakkr can read.
WordPress + Wordfence or iThemes	WordPress plugin + allowlist (below)	Plugin runs at origin; allowlist makes sure bots reach WP.

What each install method sees

When your site sits behind a CDN or caching WAF, requests served from cache never reach your origin — so origin-side integrations (WordPress plugin, Express, Nginx) don't record them. Same for requests the WAF blocks before they hit your stack.

Install method	Cache hits	WAF-blocked requests
Cloudflare integration	✓	✓
AWS CloudFront Lambda@Edge	✓ (Viewer Request)	✓
Vercel Log Drain	✓ (Vercel cache)	Partial — depends on edge config
Netlify Edge Function	✓ (Netlify cache)	✓ when WAF is at Netlify or downstream
WordPress / Next.js / Node / Nginx	✗	✗

In practice the origin-side gap is smaller than it looks: AI crawler volume is low relative to overall traffic, and most CDN configurations bypass cache for non-browser user-agents. But if you want full coverage on a heavily-cached site, the Cloudflare or CloudFront integration is the one to use.

Real client IP detection

When a request flows through a proxy, the TCP connection your origin sees comes from the proxy — not the client. The real IP rides along in a request header, and there's no single standard.

Header	Proxy that sets it
`CF-Connecting-IP`	Cloudflare
`True-Client-IP`	Cloudflare Enterprise, Akamai
`X-Sucuri-ClientIP`	Sucuri Firewall
`X-Real-IP`	Nginx, generic reverse proxies
`X-Forwarded-For`	Almost everything (comma-separated chain; the client is first)
`REMOTE_ADDR` / connection IP	Fallback when no proxy header is set

The Trakkr WordPress plugin walks this list and picks the first valid IP:

Text

CF-Connecting-IP → True-Client-IP → X-Sucuri-ClientIP → X-Real-IP → X-Forwarded-For → REMOTE_ADDR

So a WordPress site behind Sucuri correctly resolves the real bot IP from X-Sucuri-ClientIP, even though every TCP connection comes from a Sucuri edge server.
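The same precedence walk can be sketched in TypeScript for illustration. This is not the plugin's code: `resolveClientIp` and `looksLikeIp` are hypothetical names, and the validity check is a loose syntactic one chosen for brevity.

```typescript
// Header precedence from the list above, lowercased for lookup.
const IP_HEADER_ORDER = [
  "cf-connecting-ip",
  "true-client-ip",
  "x-sucuri-clientip",
  "x-real-ip",
  "x-forwarded-for",
];

// Loose syntactic check (an assumption, not the plugin's validator):
// a dotted-quad IPv4 address, or a hex-and-colon string for IPv6.
function looksLikeIp(value: string): boolean {
  const parts = value.split(".");
  if (parts.length === 4) {
    return parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  }
  return /:/.test(value) && /^[0-9a-fA-F:.]+$/.test(value);
}

// Returns the first valid IP found in the proxy headers, else the connection IP.
function resolveClientIp(
  headers: Record<string, string | undefined>,
  remoteAddr: string,
): string {
  // Normalise header names once so lookups are case-insensitive.
  const lower: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    const value = headers[name];
    if (value !== undefined) lower[name.toLowerCase()] = value;
  }
  for (const name of IP_HEADER_ORDER) {
    const raw = lower[name];
    if (raw === undefined) continue;
    // X-Forwarded-For is a comma-separated chain; the client is the first entry.
    const candidate = (raw.split(",")[0] ?? "").trim();
    if (looksLikeIp(candidate)) return candidate;
  }
  return remoteAddr;
}
```

Called with `{ "X-Sucuri-ClientIP": "198.51.100.7" }` and a Sucuri edge address as `remoteAddr`, the function returns the header value; with no usable header it falls back to the connection IP.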

Warning

These headers are spoofable. Trakkr records the IP for analytics only (country attribution, anomaly detection) — never for an access decision. Don't use them for rate limiting or blocking without trust-proxy logic.
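As a rough illustration of what trust-proxy logic means here: only honour a forwarded IP when the TCP connection itself comes from a proxy you trust. The sketch below is IPv4-only, the function names are hypothetical, and the CIDR ranges are documentation placeholders; substitute the ranges your proxy provider publishes.

```typescript
// Placeholder proxy ranges (RFC 5737 documentation blocks), NOT real provider ranges.
const TRUSTED_PROXY_CIDRS = ["192.0.2.0/24", "198.51.100.0/24"];

// Convert a dotted-quad IPv4 address to an unsigned 32-bit number; null when malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let result = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    result = result * 256 + octet;
  }
  return result;
}

// True when `ip` falls inside the IPv4 block `cidr` (e.g. "192.0.2.0/24").
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base ?? "");
  const bits = Number(bitsText);
  if (ipInt === null || baseInt === null || !(bits >= 0 && bits <= 32)) return false;
  // Compare the network part by dividing away the host part.
  const hostSize = Math.pow(2, 32 - bits);
  return Math.floor(ipInt / hostSize) === Math.floor(baseInt / hostSize);
}

// Use the forwarded IP only when the connection comes from a trusted proxy.
function ipForAccessDecision(
  remoteAddr: string,
  forwardedIp: string | undefined,
  trusted: string[] = TRUSTED_PROXY_CIDRS,
): string {
  const fromTrustedProxy = trusted.some((cidr) => inCidr(remoteAddr, cidr));
  return fromTrustedProxy && forwardedIp ? forwardedIp : remoteAddr;
}
```

A request that connects from outside the trusted ranges keeps its connection IP regardless of what its headers claim, which is the property rate limiting and blocking need.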

Other origin-side integrations ship with the common headers but may need a one-line tweak when you're behind a WAF whose header isn't in the default set:

Method	Reads by default	Behind Sucuri / Cloudflare
Next.js middleware	`x-forwarded-for` (first entry)	Add `CF-Connecting-IP` / `X-Sucuri-ClientIP` lookup for precise attribution
Express middleware	`req.ip` (with `app.set('trust proxy', true)`)	Same — prepend the proxy-specific header lookup
Nginx / OpenResty	`ngx.var.remote_addr`	Configure `real_ip_header` + `set_real_ip_from` for your proxy first
Lambda@Edge	`request.clientIp`	CloudFront resolves the real client IP for you

Allowlist AI bots

If the verification ping arrives but real crawler traffic stays empty for 48 hours, the bots are probably being blocked or challenged before they reach your origin. Cover the major user-agents:

Text

GPTBot
ChatGPT-User
OAI-SearchBot
ClaudeBot
Claude-User
Claude-SearchBot
PerplexityBot
Perplexity-User
Bytespider
CCBot
Amazonbot
MistralAI-User
Meta-ExternalFetcher
Google-Agent
Applebot
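If you want to check a log line or request against this list yourself, a single alternation regex is usually enough. The sketch below builds one from the tokens above; `matchAiBot` is a hypothetical helper, and case-insensitive matching is an assumption made to be lenient.

```typescript
// User-agent tokens from the list above.
const AI_BOT_TOKENS = [
  "GPTBot", "ChatGPT-User", "OAI-SearchBot",
  "ClaudeBot", "Claude-User", "Claude-SearchBot",
  "PerplexityBot", "Perplexity-User",
  "Bytespider", "CCBot", "Amazonbot",
  "MistralAI-User", "Meta-ExternalFetcher", "Google-Agent", "Applebot",
];

// Escape regex metacharacters so each token is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const AI_BOT_PATTERN = new RegExp(
  "(" + AI_BOT_TOKENS.map(escapeRegExp).join("|") + ")",
  "i", // case-insensitive: an assumption, tighten if you need exact casing
);

// Returns the bot token as it appears in the user-agent, or null when none matches.
function matchAiBot(userAgent: string): string | null {
  const match = AI_BOT_PATTERN.exec(userAgent);
  return match ? (match[1] ?? null) : null;
}
```

The string `AI_BOT_PATTERN.source` can also be pasted into any firewall field that accepts a regex, provided that firewall's regex dialect supports plain alternation.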

Sucuri Firewall

1. Sucuri dashboard → your site → Settings → Whitelist & Blacklist.
2. Under Whitelist User-Agent, paste a regex covering the bots above, e.g. (GPTBot|ChatGPT-User|ClaudeBot|Claude-User|PerplexityBot|Perplexity-User|OAI-SearchBot|Bytespider|CCBot|Amazonbot|Applebot).
3. Save. Matched user-agents stop being challenged immediately.

Tip

If Sucuri hardening blocks /wp-json/ (a common WordPress lockdown), also whitelist /wp-json/trakkr/* under Settings → Access Control — otherwise the WordPress connection works but no crawler visits ever sync.

Cloudflare

Multiple bot layers can challenge crawlers. Handle whichever you're using:

Bot Fight Mode (free) — only knob is on/off. If you can't see crawler traffic and Bot Fight Mode is on, toggle it off for 24 hours as a diagnostic.
Super Bot Fight Mode (Pro+) — Security → Bots, set Verified bots to Allow. Cloudflare's verified list covers most major AI crawlers but not all; add a custom rule for the rest.
Custom WAF Rule (any plan) — Security → WAF → Custom rules. Field: User Agent, Operator: contains, Value: GPTBot (repeat OR for each bot). Action: Skip. Place above other blocking rules.
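Clicking through an OR condition per bot gets tedious. The custom rule editor also accepts a typed expression, so one option is to generate it. The sketch below assumes Cloudflare's `http.user_agent` field and `contains` operator; `buildCloudflareExpression` is a hypothetical helper, and the output is worth checking in the expression editor before you deploy it.

```typescript
// Builds an expression of the form:
//   (http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot")
function buildCloudflareExpression(bots: string[]): string {
  return bots
    .map((bot) => {
      // Escape quotes and backslashes so the token stays a valid string literal.
      const literal = bot.replace(/(["\\])/g, "\\$1");
      return `(http.user_agent contains "${literal}")`;
    })
    .join(" or ");
}

// Example with two of the bots listed earlier on this page.
const expression = buildCloudflareExpression(["GPTBot", "ClaudeBot"]);
```

Note that `contains` compares case-sensitively, so the tokens should be spelled exactly as the crawlers send them.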

Wordfence (WordPress)

1. Wordfence → Firewall → Blocking → Advanced Blocking: check for any User-Agent blocks matching AI crawler strings.
2. Wordfence → Live Traffic: filter by user-agent (e.g. GPTBot, ClaudeBot). Rows tagged Blocked are your culprits.

AI bots rarely publish static IPs, so IP-based allowlists usually don't help — work from the user-agent side.

iThemes / Solid Security (WordPress)

1. Security → Settings → Network Brute Force Protection → API Settings → Banned Hosts and Banned User Agents: remove wildcard matches like *bot* or *scraper* that catch AI crawlers.
2. Check the Firewall section for User Agent filters under System Tweaks.

AWS WAF

1. AWS WAF & Shield → Web ACLs → your ACL.
2. If you use a managed rule group (AWSManagedRulesCommonRuleSet, AWSManagedRulesBotControlRuleSet), set its bot rules to Count (not Block), or add a User-Agent Exception.
3. For custom rules: add a higher-priority Allow rule matching your AI bot regex.

Imperva, StackPath, generic WAFs

Most enterprise WAFs let you create a User-Agent-based Allow rule that takes priority over the default bot-protection layer:

1. New rule, priority above default bot rules.
2. Match: User-Agent matches <regex>.
3. Action: Allow (or Bypass, or Skip; terminology varies).

Find out who's blocking

If you're not sure which layer is blocking, curl your site as a bot from outside the WAF:

Terminal

curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  -I https://your-site.com/

A 200 OK means the bot would reach you. A 403, 503, or challenge page means a WAF is in the way. The response headers usually identify the layer: a Server value of Sucuri/Cloudproxy or cloudflare, a cf-ray header, an x-amzn-requestid header, and so on.
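If you run this probe against many URLs, the header hints above can be folded into a small classifier. The sketch below takes a status code and response headers you have already collected; `classifyProbe` and the layer labels are hypothetical, and the hints are heuristics that identify who answered, not proof of who blocked.

```typescript
type ProbeVerdict = { reachable: boolean; layer: string };

// Classifies one probe response using the status and header hints described above.
function classifyProbe(status: number, headers: Record<string, string>): ProbeVerdict {
  // Header names are case-insensitive, so normalise them first.
  const lower: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    lower[name.toLowerCase()] = headers[name] ?? "";
  }
  const server = (lower["server"] ?? "").toLowerCase();

  let layer = "unknown";
  if (/sucuri|cloudproxy/.test(server)) {
    layer = "sucuri";
  } else if (/cloudflare/.test(server) || "cf-ray" in lower) {
    layer = "cloudflare";
  } else if ("x-amzn-requestid" in lower) {
    layer = "aws";
  }
  // Per the rule of thumb above: 200 means the bot got through.
  return { reachable: status === 200, layer };
}
```

For example, a 403 with `Server: Sucuri/Cloudproxy` comes back as `{ reachable: false, layer: "sucuri" }`, which points you at the Sucuri allowlist steps.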

Troubleshooting

Three places to look when you suspect a WAF is interfering:

1. Send verification (Crawler header): fires synthetic GPTBot, PerplexityBot, and ChatGPT-User rows through the ingest pipeline. If they appear in the Feed with a Verified badge, the Trakkr side is healthy and the issue is bots not reaching your origin.
2. Access tab: cross-references your robots.txt with actual bot visit data. Two findings here flag WAF problems:

- Traffic dropped: visits down sharply with no robots.txt change. Usually a new Cloudflare bot mode, WAF rule, or rate limit.
- Access mismatch: robots.txt allows the bot but most of its requests get denied at the origin.

3. robots.txt check: the Access tab also parses your robots.txt and flags a Disallow: / under User-agent: GPTBot (etc.). Bots respect this even if your WAF allowlist is correct.
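To make the robots.txt finding concrete, here is a simplified sketch of that kind of check. It is not Trakkr's parser: `hasBlanketDisallow` is a hypothetical name, and it only looks for a literal `Disallow: /` in a group that names the bot, ignoring `*` fallback groups and Allow overrides.

```typescript
// True when a robots.txt group that names `bot` contains a bare "Disallow: /".
function hasBlanketDisallow(robotsTxt: string, bot: string): boolean {
  const wanted = bot.toLowerCase();
  let agents: string[] = []; // user-agents named by the current group
  let inRules = false;       // true once the current group has seen a rule line

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // A User-agent line after rules starts a new group.
      if (inRules) {
        agents = [];
        inRules = false;
      }
      agents.push(value.toLowerCase());
    } else if (field === "disallow" || field === "allow") {
      inRules = true;
      if (field === "disallow" && value === "/" && agents.indexOf(wanted) !== -1) {
        return true;
      }
    }
  }
  return false;
}
```

A file with `User-agent: GPTBot` followed by `Disallow: /` returns true for `GPTBot`, and no WAF allowlist change will bring that crawler back until the rule is removed.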

Going further

Install crawler tracking

If you're still deciding which install method to use, the picker walks through all 18 options with prerequisites and step-by-step setup.

Crawlers dashboard

How to read crawler data once it's flowing — the three bot categories, the page funnel, alerts.