Nginx Log Retention Best Practices for SEO

Accurate crawl budget optimization relies on uninterrupted server log analysis, and that depends entirely on how long — and how cleanly — you keep your Nginx access logs. When retention is misconfigured, SEO teams hit three failure modes at once: data gaps where bot requests vanish before a parser reads them, disk exhaustion that silently halts ingestion, and skewed crawl metrics caused by premature truncation. This guide gives you exact Nginx and logrotate configuration, anonymization at write-time, and validation commands to maintain compliant, SEO-ready archives without dropping a single Googlebot hit.

The goal is a retention window long enough to survive batched parser cycles, short enough to satisfy privacy law, and structured so historical crawl-trend analysis stays intact. Align this with your broader log retention policies so crawler behavior remains fully traceable, and lean on solid log rotation strategies so the rotation that enforces retention never drops requests in the act.

Diagnosis: Confirming a Retention-Induced Data Gap

The symptom that sends most teams here is an SEO dashboard reporting artificially low Googlebot crawl rates despite verified live traffic. Before you touch any config, confirm the gap is real and caused by retention rather than by a parser bug. Count Googlebot hits per rotated file and look for a cliff:

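# zgrep reads both plain and gzip-compressed files, so no grep fallback is needed
# (a fallback would print a second "0" whenever the first count is zero)
for f in /var/log/nginx/access.log*; do
  printf '%s\t%s\n' "$f" "$(zgrep -c -i googlebot "$f" 2>/dev/null)"
done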

Expected Output:

/var/log/nginx/access.log	14820
/var/log/nginx/access.log.1	13977
/var/log/nginx/access.log.2.gz	14102
/var/log/nginx/access.log.3.gz	0
/var/log/nginx/access.log.4.gz	0

The two trailing 0 files are the tell: rotation is producing files, but Googlebot data stops three days back. Either the contents were truncated or lost before the parser reached them, or compression mangled the archive. Files removed outright by rotate or maxage would be missing from the listing rather than showing zero. A healthy retention setup shows a roughly flat bot count across every retained file.
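To make the cliff check repeatable, the per-file counts can be piped through a small filter. The following is a minimal sketch, not part of any Nginx or logrotate tooling; the function name flag_empty_generations is hypothetical, and it assumes the tab-separated "file, count" lines that the loop above prints.

```shell
# Hypothetical helper: reads "file<TAB>count" lines on stdin and prints
# the name of every file whose Googlebot count is exactly zero.
flag_empty_generations() {
    awk -F '\t' 'NF >= 2 && $2 == 0 { print $1 }'
}
```

Piping the diagnostic loop into flag_empty_generations would list only the suspect generations, which is easier to alert on than a visual scan of the full table.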

Check disk and inode pressure too, since a full filesystem is the other common cause of a silent ingestion halt:

df -h /var/log && df -i /var/log

Expected Output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   49G  1.1G  98% /var/log
Filesystem     Inodes  IUsed  IFree IUse% Mounted on
/dev/sda1      3.2M    3.1M    98K   97% /var/log

At 98% blocks and 97% inodes, Nginx will soon fail to open new log files. Both numbers must come down before retention can be trusted.
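The same check can be scripted so it runs before every rotation. This is a rough sketch under stated assumptions: the helper names pct_used and over_limit are hypothetical, the 90 percent default threshold is an arbitrary illustration, and the parser expects the single-line-per-filesystem layout that df -P produces.

```shell
# Extract the Use% (or IUse%) value from `df -P` style output on stdin.
# Assumes the header is line 1 and the filesystem row is line 2.
pct_used() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the given percentage is at or above the limit.
# The default limit of 90 is an assumed threshold, not a recommendation.
over_limit() {
    [ "$1" -ge "${2:-90}" ]
}
```

A possible call would be over_limit "$(df -P /var/log | pct_used)" for blocks and over_limit "$(df -Pi /var/log | pct_used)" for inodes, with the script refusing to proceed when either succeeds.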

Concept: Why Retention and Parser Cadence Collide

Retention is not just "how many days to keep." It is a contract between three clocks that rarely tick together: Nginx writes continuously, logrotate rotates on a fixed schedule or size trigger, and your log analyzer ingests in batches that often lag hours behind real time. If rotate or maxage deletes a file before the batch parser has read it, the crawl data in that file is gone forever — there is no replaying an access log.

The fix is a tiered window. Keep raw, uncompressed logs long enough to outlast the slowest parser cycle (commonly 14 days), then compress and hold for historical trend analysis (commonly 90 days), then delete. Delayed compression keeps the most recent file readable for real-time tooling while everything older is squeezed for storage. Privacy law adds a fourth constraint: raw IPs are personal data, so you anonymize at write-time rather than retaining identifiable data you would later have to purge. This is the same tiering logic behind log storage and archival best practices, applied specifically to keeping crawl telemetry intact.
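The tiered window reduces to a simple age-to-tier mapping. The sketch below is only an illustration of that logic; tier_for_age is a hypothetical name, and the 14 and 90 day defaults are the commonly used values mentioned above rather than fixed requirements.

```shell
# Map a file's age in days to its retention tier.
# Arguments: age_days [raw_days=14] [max_days=90]
tier_for_age() {
    age_days="$1"
    raw_days="${2:-14}"
    max_days="${3:-90}"
    if [ "$age_days" -lt "$raw_days" ]; then
        echo raw          # still inside the parser-safe window
    elif [ "$age_days" -lt "$max_days" ]; then
        echo compressed   # held for historical trend analysis
    else
        echo expired      # past the ceiling, eligible for deletion
    fi
}
```

If your slowest parser cycle is longer than two weeks, passing a larger raw_days value keeps the raw tier ahead of it.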

Step-by-Step: A Retention-Safe Nginx Configuration

Step 1: Anonymize IPs at write-time.
Stripping the host portion before the line is written means full client addresses never reach the access log, so retention length can be driven mainly by SEO need rather than by privacy risk. Add a map and a custom log format.

# /etc/nginx/conf.d/anonymize_ip.conf
# Regexes are quoted because an unquoted { would be parsed as a block delimiter.
map $remote_addr $anonymized_ip {
    "~^(?P<ip>\d+\.\d+\.\d+)\.\d+$"              $ip.0;
    "~^(?P<ip>[0-9a-fA-F:]+):[0-9a-fA-F]{1,4}$"  $ip:0;
    default                                      0.0.0.0;
}

log_format combined_anon '$anonymized_ip - $remote_user [$time_local] '
                         '"$request" $status $body_bytes_sent '
                         '"$http_referer" "$http_user_agent"';

access_log /var/log/nginx/access.log combined_anon;

Validate the config before reloading.

sudo nginx -t

Expected Output:

nginx: configuration file /etc/nginx/nginx.conf test is successful

This zeroes the last IPv4 octet and the trailing IPv6 group while preserving path, status, and user-agent — everything crawl analysis needs. Tie the privacy rationale back to your GDPR compliance controls so retention length is defensible.
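Before relying on the map in production, it can help to preview what the masking rules do to sample addresses. The sketch below approximates the two patterns with sed; it is not how Nginx evaluates the map, the function name anonymize_ip is hypothetical, and the sample addresses come from documentation ranges.

```shell
# Approximate the Nginx map with sed: zero the last IPv4 octet or the
# trailing IPv6 group, and fall back to 0.0.0.0 for anything else.
# `-e t` jumps to the end of the script once a substitution has matched.
anonymize_ip() {
    printf '%s\n' "$1" | sed -E \
        -e 's/^([0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+$/\1.0/' -e t \
        -e 's/^([0-9a-fA-F:]+):[0-9a-fA-F]{1,4}$/\1:0/' -e t \
        -e 's/.*/0.0.0.0/'
}
```

For example, anonymize_ip 203.0.113.57 prints 203.0.113.0, and anonymize_ip 2001:db8::abcd:1234 prints 2001:db8::abcd:0. After the reload, the first field of new access-log lines should follow the same shape.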

Step 2: Define the tiered logrotate policy.
This is the core retention rule: daily rotation, compression after a one-rotation delay, and a hard 90-day ceiling.

# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    daily
    missingok
    rotate 90
    compress
    delaycompress
    notifempty
    maxage 90
    dateext
    create 0640 www-data adm
    sharedscripts
    postrotate
        # if/fi keeps the script's exit status zero when the pid file is absent
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 "$(cat /var/run/nginx.pid)"
        fi
    endscript
}

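Explanation: rotate 90 keeps 90 generations, maxage 90 deletes anything older even if the count is not reached, and delaycompress leaves the most recent rotated file uncompressed for real-time parsers. Because delaycompress postpones compression by one rotation only, this stanza as written keeps a single raw rotated file, not 14 days of them. The postrotate block sends SIGUSR1 so Nginx reopens its log files gracefully. dateext names archives by date (access.log-20260619.gz with the default date format, where the numbered scheme would give access.log.2.gz), so a parser can map files to days unambiguously; the numbered names in this guide's sample output are what you would see without dateext.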
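Holding rotated files raw for a full 14 days takes a step outside logrotate's compress and delaycompress pair. One possible approach, sketched below, is to remove both directives from the stanza and age files out in a separate scheduled job. Treat this as an assumption-laden sketch: age_out_archives is a hypothetical name, it assumes dateext-style names (access.log-YYYYMMDD), and it enforces the 90-day ceiling itself in case logrotate no longer tracks archives it did not compress.

```shell
# Compress rotated logs older than the raw window, then delete
# compressed archives older than the ceiling.
# Arguments: dir [raw_days=14] [max_days=90]
age_out_archives() {
    dir="$1"
    raw_days="${2:-14}"
    max_days="${3:-90}"
    # gzip keeps the original modification time, so age is still measurable.
    find "$dir" -maxdepth 1 -type f -name 'access.log-*' ! -name '*.gz' \
        -mtime +"$raw_days" -exec gzip {} \;
    find "$dir" -maxdepth 1 -type f -name 'access.log-*.gz' \
        -mtime +"$max_days" -delete
}
```

A daily cron entry calling age_out_archives /var/log/nginx would be one way to run it. The live access.log is never matched because the name pattern requires the date suffix.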

Step 3: Dry-run the rotation before trusting it.
Never deploy a retention change without simulating it first.

sudo logrotate -d /etc/logrotate.d/nginx

Expected Output:

rotating pattern: /var/log/nginx/*.log  after 1 days (90 rotations)
considering log /var/log/nginx/access.log
  log needs rotating
rotating log /var/log/nginx/access.log, log->rotateCount is 90
... (no errors)

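Production Warning: logrotate -d is a debug dry run and changes nothing, but the very next step (-f) forces a real rotation and fires postrotate against the live Nginx master. SIGUSR1 only reopens log files and does not reload configuration, so the Step 1 format change still needs a reload to take effect. Run the forced rotation in a maintenance window, and confirm nginx -t passed before any reload or restart: restarting Nginx with a broken config can take the site down.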

Step 4: Force one real rotation and confirm Nginx reopened cleanly.

sudo logrotate -f /etc/logrotate.d/nginx
sudo tail -n 2 /var/log/nginx/error.log

Expected Output:

2026/06/19 03:00:01 [notice] 8123#8123: signal 10 (SIGUSR1) received, reopening logs
2026/06/19 03:00:01 [notice] 8123#8123: reopened "/var/log/nginx/access.log"

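The reopening notice indicates Nginx now writes to the fresh file rather than the renamed one. These are notice-level messages, so they appear only when error_log is set to notice or a more verbose level, and the exact wording varies by Nginx version. With the default error level, confirm instead that the new access.log is receiving lines. If neither check passes, the postrotate signal did not land and new writes are going to the rotated file, which is the exact cause of post-rotation bot gaps.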
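Where the error log is too quiet to show the handoff, watching the fresh file grow is a workable substitute. The following is a minimal sketch; log_grew is a hypothetical helper, and the 5-second default wait assumes a site busy enough to log at least one request in that time.

```shell
# Succeed if the file gained bytes during the wait interval.
# Arguments: file [wait_seconds=5]
log_grew() {
    file="$1"
    wait_s="${2:-5}"
    before=$(( $(wc -c < "$file") ))   # arithmetic strips any padding from wc
    sleep "$wait_s"
    after=$(( $(wc -c < "$file") ))
    [ "$after" -gt "$before" ]
}
```

Running log_grew /var/log/nginx/access.log 10 right after the forced rotation would be one way to use it. On a low-traffic host a failure is inconclusive, so request a page yourself during the wait.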

Edge Cases

Parser still mid-ingestion when rotation fires. Batched analyzers can lag the rotation clock. Gate deletion on a parser-completion flag rather than time alone. Have your ingestion job touch a marker, and refuse to archive files newer than the marker:

# parser writes this on successful ingestion of a day's file
touch /var/log/nginx/.ingested-$(date +%F)
# archival job skips any access.log not yet marked ingested
[ -f "/var/log/nginx/.ingested-$(date +%F -d yesterday)" ] || exit 0

This makes retention parser-aware: a slow batch run delays cleanup instead of losing data.
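The gating idea extends naturally from archival to deletion. Below is a minimal sketch of a marker-gated cleanup step; expire_if_ingested is a hypothetical name, and how an archive maps to its marker file (which day's marker covers which file) is an assumption you would need to settle for your own parser.

```shell
# Delete an archive only when its ingestion marker exists.
# Arguments: archive_path marker_path
expire_if_ingested() {
    archive="$1"
    marker="$2"
    if [ -f "$marker" ]; then
        rm -f -- "$archive"
        echo "expired: $archive"
    else
        # No marker means the parser has not confirmed this file yet.
        echo "kept (no ingestion marker): $archive"
    fi
}
```

A cleanup job could call this once per candidate archive, so a parser that falls behind only ever produces "kept" lines instead of lost data.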

Compressed archives unreadable by the analyzer. If zgrep on a .gz returns binary garbage or zero hits where the raw file had thousands, compression corrupted mid-write — usually because rotation fired while Nginx still held the descriptor. delaycompress avoids this by never compressing the file Nginx might still be writing. Confirm with zcat access.log.2.gz | head returning readable log lines.
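An integrity sweep across every archive catches corruption before the analyzer does. This sketch relies on gzip -t, which tests a compressed file without extracting it; the function name list_corrupt_archives is hypothetical.

```shell
# Print the path of every .gz file in a directory that fails gzip's
# built-in integrity test. Prints nothing when all archives are sound.
list_corrupt_archives() {
    dir="$1"
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue            # directory has no .gz files
        gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
    done
}
```

Any path printed by list_corrupt_archives /var/log/nginx is a file whose bot counts should not be trusted until it is restored from a backup or excluded from analysis.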

Verification

Prove the full retention window is intact and parseable end to end with one command that counts Googlebot hits across every retained generation, decompressing as needed, and sorts the lowest counts to the top so an empty file cannot hide:

zgrep -c -i googlebot /var/log/nginx/access.log* 2>/dev/null | sort -t: -k2 -n | head

Expected Output:

/var/log/nginx/access.log.1:13977
/var/log/nginx/access.log.88.gz:13980
/var/log/nginx/access.log.2.gz:14102
/var/log/nginx/access.log:14820

Even the lowest-count generations report a healthy, comparable bot count with no zero-files, so retention is keeping crawl data continuous from today back through the full window.

Common Mistakes

  • Deleting raw logs before the parser finishes. Setting rotate/maxage purely by storage budget ignores parser lag and silently erases crawl telemetry. Gate cleanup on an ingestion-completion marker (see Edge Cases) so a slow batch delays deletion instead of destroying data.
  • Retaining unanonymized IPs to extend the window. Keeping raw IPs for 90 days to "have more data" turns a retention policy into a compliance liability. Anonymize at write-time (Step 1); the crawl-relevant fields survive and full client addresses never enter the archive.
  • Omitting delaycompress. Compressing the just-rotated file forces real-time parsers to stream .gz immediately and risks corrupting a file Nginx may still hold open. Always keep the newest rotated generation uncompressed.

Frequently Asked Questions

How long should Nginx logs be retained for accurate SEO analysis?
Keep raw, uncompressed logs for about 14 days so the slowest batched parser cycle completes, then hold compressed archives to roughly 90 days for historical crawl-trend analysis. Tie the upper bound to your documented retention policy and to privacy limits — anonymizing IPs at write-time lets you keep the structural crawl data longer without retaining personal data.

Does log compression affect crawl budget parser accuracy?
No, provided your parser supports gzip decompression and you use delaycompress. Delayed compression keeps the most recent rotated file readable for real-time ingestion while older files are squeezed for storage. Corruption only appears when a file is compressed while Nginx still holds its descriptor, which delaycompress prevents.

Can I automate log archiving without breaking Nginx or losing bot data?
Yes. Use a postrotate block that sends SIGUSR1 to the Nginx master so it reopens log files without dropping connections, and gate archival on a parser-completion marker so files are never deleted before ingestion confirms it read them. The reopening notice in the Nginx error log (visible at notice level or more verbose), or a fresh access.log that keeps growing, is your proof the handoff worked.

Part of the Log Retention Policies series.