Identifying Orphan Pages from Log Analysis
Orphan pages are URLs that exist on your site — they get requested by crawlers or appear in your XML sitemap — but have no internal links pointing to them. Search engines reach them anyway (from old links, the sitemap, or external references), spend crawl budget on them, yet your own navigation never surfaces them. This guide shows how to extract three URL sets from your data — crawled documents, sitemap entries, and link-reachable pages — and use sort, comm, and join to find the gap that defines an orphan, then how to tell a genuine orphan from an intentionally noindexed page.
The technique is part of diagnosing crawl budget waste: orphans soak up crawl requests that should be flowing to your linked, rankable content. By the end you will have a deduplicated list of orphan candidates and a decision rule for each one.
Diagnosis: Spotting the Orphan Signature in Logs
An orphan does not announce itself. It looks like a perfectly normal 200 in the access log — that is exactly the problem. The signature is a URL that Googlebot fetches but that your crawler (Screaming Frog, Sitebulb, or a custom spider) never discovers by following links.
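Start by confirming a candidate exists. Pull document-type requests (HTML, not assets) that returned 200 to Googlebot. The filter assumes the combined log format (request path in field 7, status in field 9) and matches the user-agent string only, so verify crawler identity separately (for example by reverse DNS) before trusting it:
awk '/Googlebot/ && $9 == 200 && $7 !~ /\.(css|js|png|jpg|jpeg|gif|svg|woff2?|ico|map)(\?|$)/ {print $7}' access.log \
| sort -u | head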
Expected Output:
/about/team/
/blog/2019/legacy-launch-notes/
/products/discontinued-widget/
/tag/uncategorized/
That list is a sample of the documents a bot successfully fetched (head shows only the first ten lines). Some are linked; some are not. The orphans are hiding in there, and we isolate them by set arithmetic against the pages your internal link graph can actually reach.
Concept: Why Orphans Happen and Why They Waste Budget
Orphans accumulate from predictable sources: a page was unpublished from the menu but never 410ed, a product was discontinued and de-linked but stayed live, a CMS auto-generates /tag/ and /author/ archives that no template links to, or a sitemap plugin keeps emitting URLs long after the content stopped being linked. Each one is a node in your URL space with inbound crawler traffic but zero inbound internal links.
The cost is twofold. First, crawl budget: Googlebot re-fetches orphans on its normal schedule, and those fetches come out of the same budget as your important pages — the same waste mechanism you fight when finding crawl waste from URL parameters. Second, ranking signals: a page with no internal links receives no internal PageRank flow, so even if it is valuable it ranks far below its potential.
Step-by-Step: Build the Three URL Sets
The whole method reduces to one set-difference question: which URLs are crawled-or-in-sitemap but NOT reachable by internal links? You need three normalized, sorted, deduplicated files.
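As a minimal sketch of that set-difference question, the toy files below stand in for the three real sets. Every path, the temp directory, and the linked.sorted file name are illustrative assumptions; the real extraction pipelines follow in the steps.

```shell
# Toy stand-ins for the three sets; every path and file name here is illustrative.
work=$(mktemp -d)
printf '%s\n' /about/team/ /blog/2019/legacy-launch-notes/ /tag/uncategorized/ > "$work/crawled.txt"
printf '%s\n' /about/team/ /landing/orphaned-campaign/ > "$work/sitemap.txt"
printf '%s\n' /blog/2025/spring-update/ /about/team/ > "$work/linked.txt"

# comm needs sorted input; LC_ALL=C pins sort and comm to the same byte-order collation.
LC_ALL=C sort -u "$work/linked.txt" > "$work/linked.sorted"

# (crawled union sitemap) minus linked: "-" makes comm read the union from stdin.
orphans=$(LC_ALL=C sort -u "$work/crawled.txt" "$work/sitemap.txt" \
  | LC_ALL=C comm -23 - "$work/linked.sorted")
printf '%s\n' "$orphans"
# prints:
# /blog/2019/legacy-launch-notes/
# /landing/orphaned-campaign/
# /tag/uncategorized/
```

/about/team/ drops out because it is linked, and /blog/2025/spring-update/ never appears because it is linked but was not in either candidate set.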
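Step 1: Extract the crawled document set from logs. Take the document URLs Googlebot actually fetched, reduce each to a path with no query string and a consistent trailing slash, and sort uniquely.
awk '/Googlebot/ && $9 ~ /^(200|304)$/ && $7 !~ /\.(css|js|png|jpg|jpeg|gif|svg|woff2?|ico|map|txt|xml)(\?|$)/ {print $7}' access.log \
| sed -E 's/\?.*$//; s#([^/])$#\1/#' \
| sort -u > crawled.txt
Explanation: The awk filter keeps lines that mention Googlebot (a user-agent string match, not a verified identity), whose status is 200 or 304 (a 304 is still a crawl of a live URL), and whose path does not end in a known asset extension. It filters by extension, not by Content-Type, so extend the list if your logs show other asset types. The sed strips the query string (?...) and appends a trailing slash to any path missing one, so /about and /about/ collapse to one entry. File-style URLs get the slash too (/page.html becomes /page.html/); that is harmless for the set comparison as long as all three files are normalized the same way, but remember it when you request these paths later. Expected Output (sample of crawled.txt):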
/about/team/
/blog/2019/legacy-launch-notes/
/products/discontinued-widget/
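Because all three sets must be normalized identically, one option is to keep the rules in a single shell function and pipe every source through it. The sketch below assumes a helper named normalize_paths (a hypothetical name, not part of any tool) and reuses the sed expressions from this guide.

```shell
# normalize_paths (hypothetical helper): strip scheme and host, strip the query string,
# append a missing trailing slash, then sort uniquely in byte order.
normalize_paths() {
  sed -E 's#^https?://[^/]+##; s/\?.*$//; s#([^/])$#\1/#' | LC_ALL=C sort -u
}

normalized=$(printf '%s\n' \
  '/about' \
  '/about/' \
  '/about/?utm_source=news' \
  'https://example.com/pricing.html' | normalize_paths)
printf '%s\n' "$normalized"
# prints:
# /about/
# /pricing.html/
```

The three /about variants collapse to one line. The last line shows the side effect of the trailing-slash rule on file-style URLs, which only matters if you later request the normalized path verbatim.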
Step 2: Extract the sitemap URL set. Pull every <loc> from your XML sitemap, strip the scheme and host so it matches the path-only log format, and sort.
curl -s https://example.com/sitemap.xml \
| grep -oE '<loc>[^<]+</loc>' \
| sed -E 's#</?loc>##g; s#https?://[^/]+##; s#([^/])$#\1/#' \
| sort -u > sitemap.txt
Explanation: grep -oE isolates each <loc> element; sed removes the tags and the https://host prefix, leaving /path/. For a sitemap index, expand the child sitemaps first. Expected Output:
/about/team/
/blog/2025/spring-update/
/products/flagship-widget/
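Expanding a sitemap index could look roughly like the sketch below. The function names (fetch, extract_locs, expand_sitemap) are hypothetical, and the sample files use local paths in their child <loc> entries so the example runs offline; on a live site the body of fetch would be something like curl -s "$1", and gzip-compressed child sitemaps would need decompressing first.

```shell
work=$(mktemp -d)

# fetch: how one sitemap document is retrieved. Offline stand-in for `curl -s "$1"`.
fetch() { cat "$1"; }

# extract_locs: print the text of every <loc> element read from stdin.
extract_locs() { grep -oE '<loc>[^<]+</loc>' | sed -E 's#</?loc>##g'; }

# expand_sitemap: a <sitemapindex> lists child sitemaps, so recurse into each;
# a plain <urlset> lists pages, so print them.
expand_sitemap() {
  doc=$(fetch "$1")
  if printf '%s\n' "$doc" | grep -q '<sitemapindex'; then
    printf '%s\n' "$doc" | extract_locs | while read -r child; do
      expand_sitemap "$child"
    done
  else
    printf '%s\n' "$doc" | extract_locs
  fi
}

# Illustrative sample data: one index pointing at two child sitemaps.
cat > "$work/index.xml" <<EOF
<sitemapindex>
  <sitemap><loc>$work/pages.xml</loc></sitemap>
  <sitemap><loc>$work/posts.xml</loc></sitemap>
</sitemapindex>
EOF
cat > "$work/pages.xml" <<'EOF'
<urlset>
  <url><loc>https://example.com/about/team/</loc></url>
  <url><loc>https://example.com/products/flagship-widget</loc></url>
</urlset>
EOF
cat > "$work/posts.xml" <<'EOF'
<urlset>
  <url><loc>https://example.com/blog/2025/spring-update/</loc></url>
</urlset>
EOF

# Same normalization as Step 2, applied to the expanded URL list.
expand_sitemap "$work/index.xml" \
  | sed -E 's#https?://[^/]+##; s#([^/])$#\1/#' \
  | LC_ALL=C sort -u > "$work/sitemap.txt"
cat "$work/sitemap.txt"
```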
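Step 3: Extract the link-reachable set from a crawl export. Run a link crawler from your homepage (Screaming Frog → Bulk Export → "All Inlinks", or any spider) and export the list of internally linked URLs. Normalize it identically. The cut below assumes the linked URL sits in the first CSV column and contains no commas; change -f to match your export's column layout.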
cut -d, -f1 inlinks_export.csv \
| sed -E 's/^"|"$//g; s#https?://[^/]+##; s#\?.*$##; s#([^/])$#\1/#' \
| grep '^/' | sort -u > linked.txt
Safety Note: Crawl your own site at a polite rate. A spider hammering production at full concurrency can itself look like an attack and skew the very logs you are analyzing.
Expected Output:
/about/team/
/blog/2025/spring-update/
/products/flagship-widget/
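Crawl exports differ in column order, so picking the link-target column by header name can be less brittle than a fixed cut position. The sketch below assumes an export where every field is double-quoted and the target column is headed "Destination"; both are assumptions about your tool's format, and the sample rows are invented.

```shell
work=$(mktemp -d)
# Hypothetical export: the header names and rows are illustrative only.
cat > "$work/inlinks_sample.csv" <<'EOF'
"Type","Source","Destination","Anchor"
"Hyperlink","https://example.com/","https://example.com/about/team","Team"
"Hyperlink","https://example.com/","https://example.com/blog/2025/spring-update/?utm=nav","Blog"
"Hyperlink","https://example.com/about/team/","https://example.com/products/flagship-widget/","Widget"
EOF

# Split on the "," between quoted fields, find the column whose header equals col,
# then print that column from every data row with stray quotes removed.
awk -F'","' -v col='Destination' '
  NR == 1 { for (i = 1; i <= NF; i++) { h = $i; gsub(/"/, "", h); if (h == col) c = i }; next }
  c       { v = $c; gsub(/"/, "", v); print v }
' "$work/inlinks_sample.csv" \
  | sed -E 's#https?://[^/]+##; s#\?.*$##; s#([^/])$#\1/#' \
  | grep '^/' | LC_ALL=C sort -u > "$work/linked_sample.txt"
cat "$work/linked_sample.txt"
# prints:
# /about/team/
# /blog/2025/spring-update/
# /products/flagship-widget/
```

If the header is not found, c stays unset and the pipeline produces an empty file, which is easier to notice than a silently wrong column.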
Step-by-Step: Find the Orphans with comm and join
With three sorted files, the orphan set is pure set arithmetic.
Step 4: Orphans that are crawled but not linked. comm -23 prints lines in the first file that are absent from the second.
comm -23 crawled.txt linked.txt > orphans_crawled.txt
cat orphans_crawled.txt
Explanation: comm requires both inputs sorted (they are). -23 suppresses column 2 (linked-only) and column 3 (in both), leaving only "crawled but never linked." Expected Output:
/blog/2019/legacy-launch-notes/
/products/discontinued-widget/
/tag/uncategorized/
Step 5: Orphans that are in the sitemap but not linked. Same operation against the sitemap set — these are URLs you are actively advertising to Google with no on-site link to support them.
comm -23 sitemap.txt linked.txt > orphans_sitemap.txt
Expected Output:
/blog/2019/legacy-launch-notes/
/landing/orphaned-campaign/
Step 6: Merge to a single orphan candidate list. Union both orphan files and deduplicate.
sort -u orphans_crawled.txt orphans_sitemap.txt > orphans_all.txt
wc -l orphans_all.txt
Expected Output:
14 orphans_all.txt
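The merged list loses the information about which source flagged each URL. A possible way to keep it is to run comm with no flags on the two orphan files and label its three columns; the label names below are arbitrary, and the toy files reuse the sample paths from Steps 4 and 5.

```shell
work=$(mktemp -d)
printf '%s\n' /blog/2019/legacy-launch-notes/ /products/discontinued-widget/ \
  /tag/uncategorized/ > "$work/orphans_crawled.txt"
printf '%s\n' /blog/2019/legacy-launch-notes/ /landing/orphaned-campaign/ \
  > "$work/orphans_sitemap.txt"

# comm with no flags prints three tab-indented columns:
# column 1 = only in file 1, column 2 = only in file 2, column 3 = in both.
labeled=$(LC_ALL=C comm "$work/orphans_crawled.txt" "$work/orphans_sitemap.txt" \
  | awk -F'\t' '{
      if      ($1 != "") print "crawled-only\t" $1
      else if ($2 != "") print "sitemap-only\t" $2
      else               print "both\t" $3
    }')
printf '%s\n' "$labeled"
```

URLs labeled both are crawled and advertised in the sitemap with no internal link, which makes them reasonable first candidates for review.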
Step 7 (optional): Attach crawl frequency with join. To prioritize, join the orphan list against a per-URL hit count so you fix the budget-heaviest orphans first.
awk '/Googlebot/ && $7 ~ /^\// {sub(/\?.*$/,"",$7); print $7}' access.log \
| sed -E 's#([^/])$#\1/#' | sort | uniq -c \
| awk '{print $2","$1}' | sort -t, -k1,1 > hits.csv
join -t, orphans_all.txt hits.csv \
| sort -t, -k2,2nr
Explanation: The first pipeline builds path,hitcount from Googlebot requests. join matches each orphan path to its hit count on the shared first field; orphans_all.txt is passed as-is because it is already sorted and each line is a single field (appending a comma to it would add an empty second field and break the numeric sort). Orphans with no hits in the log, such as sitemap-only URLs, have no match and are dropped by join. The final sort ranks by hit count, numerically descending. Expected Output:
/tag/uncategorized/,1880
/products/discontinued-widget/,640
/blog/2019/legacy-launch-notes/,212
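An alternative sketch, for cases where join's sort-order requirement is inconvenient or where zero-hit orphans should stay visible: load the counts into an awk lookup table and print every orphan with its count or 0. The toy files mirror the shapes of hits.csv and orphans_all.txt; their contents are invented.

```shell
work=$(mktemp -d)
# Toy inputs in the same shapes as hits.csv (path,count) and orphans_all.txt (one path per line).
printf '%s\n' '/about/team/,950' '/blog/2019/legacy-launch-notes/,212' \
  '/products/discontinued-widget/,640' '/tag/uncategorized/,1880' > "$work/hits.csv"
printf '%s\n' /blog/2019/legacy-launch-notes/ /landing/orphaned-campaign/ \
  /products/discontinued-widget/ /tag/uncategorized/ > "$work/orphans_all.txt"

# First file: load counts into a lookup table keyed by path.
# Second file: print every orphan with its count, or 0 when the log never saw it.
ranked=$(awk -F, '
  NR == FNR { hits[$1] = $2; next }
  { print $0 "," (($0 in hits) ? hits[$0] : 0) }
' "$work/hits.csv" "$work/orphans_all.txt" | LC_ALL=C sort -t, -k2,2nr)
printf '%s\n' "$ranked"
# prints:
# /tag/uncategorized/,1880
# /products/discontinued-widget/,640
# /blog/2019/legacy-launch-notes/,212
# /landing/orphaned-campaign/,0
```

The zero-hit row is a sitemap-only orphan: it costs no crawl budget yet, but it is still being advertised without an internal link.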
Edge-Case Handling: Orphan vs. Intentionally Noindexed
Not every unlinked URL is a problem to fix. Before you act, classify each candidate.
Gotcha 1 — Noindex pages look like orphans. A /thank-you/ or /cart/ page is deliberately unlinked from crawlable navigation and often carries noindex. It will appear in comm -23 output but is correct as-is. Check the live header before treating it as an orphan to reclaim:
while read -r path; do
code=$(curl -s -o /dev/null -w '%{http_code}' "https://example.com${path}")
robots=$(curl -sI "https://example.com${path}" | grep -i 'x-robots-tag' | tr -d '\r')
printf '%s\t%s\t%s\n' "$code" "${robots:-none}" "$path"
done < orphans_all.txt
Expected Output:
200 x-robots-tag: noindex /cart/
200 none /products/discontinued-widget/
410 none /blog/2019/legacy-launch-notes/
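A 200 with none is a true orphan to either link or retire. A noindex is intentional. A 410/404 is already being retired; confirm it is no longer in your sitemap. Note that the loop only sees the X-Robots-Tag response header: a noindex set in a robots meta tag in the HTML shows up as none, so inspect the page source before calling such a page a true orphan. Reading status codes precisely matters here; see understanding HTTP status codes in server logs.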
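The decision rule can be written down as a small function that also looks for a robots meta tag in the body, since many sites set noindex there rather than in a header. This is a sketch under assumptions: the function names and label strings are hypothetical, the meta-tag match is a simple pattern rather than an HTML parser, and the inputs are inline strings so it runs without network access.

```shell
# has_meta_noindex: succeeds if the HTML on stdin has a robots or googlebot meta tag containing "noindex".
has_meta_noindex() {
  tr '\n' ' ' \
    | grep -oiE '<meta[^>]*>' \
    | grep -iE "name=[\"']?(robots|googlebot)" \
    | grep -qi 'noindex'
}

# classify CODE XROBOTS < body.html : apply the rule from this section to one candidate.
classify() {
  code=$1
  xrobots=$2
  case $code in
    404|410) echo retired; return ;;
    200) ;;
    *) echo "review-$code"; return ;;
  esac
  if printf '%s' "$xrobots" | grep -qi 'noindex' || has_meta_noindex; then
    echo intentional-noindex
  else
    echo orphan-candidate
  fi
}

r1=$(classify 200 'x-robots-tag: noindex' < /dev/null)
r2=$(printf '%s' '<html><head><meta name="robots" content="noindex, follow"></head></html>' | classify 200 '')
r3=$(printf '%s' '<html><head><title>Widget</title></head></html>' | classify 200 '')
r4=$(classify 410 '' < /dev/null)
printf '%s\n' "$r1" "$r2" "$r3" "$r4"
# prints:
# intentional-noindex
# intentional-noindex
# orphan-candidate
# retired
```

On a live site the three inputs would come from curl (status via -w '%{http_code}', header via -sI, body via -s). An orphan-candidate result still needs the soft-404 check before it goes on a link-building list.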
Gotcha 2 — Soft 404s masquerade as live orphans. A discontinued product page may return 200 with a "no longer available" body. It is not a real page worth linking, and it is a separate waste category — handle it with the workflow for detecting soft 404s in server logs before adding it to your link-building list.
Gotcha 3: Trailing-slash and case mismatches create false orphans. If linked.txt has /About/ and crawled.txt has /about/, comm reports a phantom orphan. The sed normalization above handles slashes; if your platform is case-insensitive, add tr 'A-Z' 'a-z' to every pipeline, placed before sort -u so the lowercased lines are sorted and deduplicated in their final form.
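A small reproduction of the case-mismatch gotcha, using two invented two-line files. It shows the phantom orphan and then the same comparison after lowercasing; the .lc file names are arbitrary.

```shell
work=$(mktemp -d)
printf '%s\n' /about/ /pricing/ > "$work/crawled.txt"
printf '%s\n' /About/ /pricing/ > "$work/linked.txt"

# Without case folding, /about/ and /About/ are different lines to comm.
phantom=$(LC_ALL=C comm -23 "$work/crawled.txt" "$work/linked.txt")

# Fold case first, then sort: lowercasing can change both order and uniqueness.
tr 'A-Z' 'a-z' < "$work/crawled.txt" | LC_ALL=C sort -u > "$work/crawled.lc"
tr 'A-Z' 'a-z' < "$work/linked.txt"  | LC_ALL=C sort -u > "$work/linked.lc"
fixed=$(LC_ALL=C comm -23 "$work/crawled.lc" "$work/linked.lc")

printf 'before: [%s]\nafter:  [%s]\n' "$phantom" "$fixed"
# prints:
# before: [/about/]
# after:  []
```

Only apply the fold when the server really treats paths case-insensitively; on a case-sensitive platform /About/ and /about/ can be different pages.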
Verification: Confirm the Orphan Count Drops
After fixing orphans (adding internal links, or 410ing dead ones and removing them from the sitemap), re-run the diff on fresh logs and confirm the candidate list shrinks.
comm -23 <(sort -u crawled_new.txt) <(sort -u linked_new.txt) | wc -l
Expected Output:
2
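A drop to 2 (the remaining two being legitimate noindex utility pages) confirms the fix. Compare like with like: this command counts only crawled-but-unlinked URLs, so measure it against the earlier line count of orphans_crawled.txt, or rebuild orphans_all.txt from fresh data to compare against the merged total of 14. Track the number over successive log windows; a re-climbing count signals a CMS that keeps generating unlinked archives.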
Common Mistakes
- Comparing un-sorted files with comm. comm assumes both inputs are sorted in the same collation. Forgetting sort -u produces silently wrong output: it will report URLs as orphans that are actually linked. Always pipe through sort -u (or use LC_ALL=C sort consistently across all three files).
- Including assets in the crawled set. If you do not exclude .css, .js, and image extensions, your "orphan" list fills with stylesheets and sprites. Filter document responses only, as the awk extension blacklist does; borrow tighter patterns from awk and grep commands for log filtering.
- Treating every unlinked URL as a defect. Cart, checkout, account, and thank-you pages are supposed to be unlinked and noindexed. Run the x-robots-tag check before opening tickets, or you will waste effort "fixing" pages that are working as designed.
Frequently Asked Questions
Do I need a sitemap to find orphan pages, or are logs enough?
Logs alone find orphans that crawlers already reach (the crawled-but-not-linked set), which catches most real problems. The sitemap set adds URLs you advertise to Google that may not yet appear in logs, and is the better source for catching freshly orphaned pages before they accumulate wasted crawls. Use both when available; use logs alone when you cannot get a clean crawl export.
Why use comm instead of grep -v to find the difference between two URL lists?
grep -vf linked.txt crawled.txt treats each linked URL as a regex substring, so /about/ would wrongly match /about/team/ and hide real orphans. comm -23 does exact, line-by-line set difference on sorted input, which is both correct and dramatically faster on large files because it is a single linear merge rather than an O(n×m) pattern scan.
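The difference is easy to reproduce on two invented files where /about/ is linked and /about/team/ is the orphan. The sketch also shows grep -vxFf (fixed strings, whole-line match), which is the grep form that agrees with comm.

```shell
work=$(mktemp -d)
printf '%s\n' /about/ /about/team/ > "$work/crawled.txt"   # /about/team/ is the orphan
printf '%s\n' /about/ > "$work/linked.txt"

# grep -vf: the pattern /about/ also matches inside /about/team/, so the orphan is dropped.
by_grep=$(grep -vf "$work/linked.txt" "$work/crawled.txt" || true)

# comm -23: whole-line comparison on sorted input keeps it.
by_comm=$(LC_ALL=C comm -23 "$work/crawled.txt" "$work/linked.txt")

# grep -vxFf: fixed strings matched against whole lines, same answer as comm.
by_grep_exact=$(grep -vxFf "$work/linked.txt" "$work/crawled.txt")

printf 'grep -vf:   [%s]\ncomm -23:   [%s]\ngrep -vxFf: [%s]\n' "$by_grep" "$by_comm" "$by_grep_exact"
# prints:
# grep -vf:   []
# comm -23:   [/about/team/]
# grep -vxFf: [/about/team/]
```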
How do I tell a true orphan from a page that is unlinked on purpose?
Fetch each candidate and inspect its status, its X-Robots-Tag header, and any robots meta tag in the HTML. A 200 with no noindex and real content is a true orphan: either link it from relevant navigation or retire it with a 410. A 200 carrying noindex, or a utility page like /cart/, is intentional, so leave it. A 404/410 is already retired; just make sure it is gone from your sitemap so you stop advertising it.
Related Guides
- Finding Crawl Budget Waste from URL Parameters — the sibling waste pattern where faceted URLs, not orphans, drain budget.
- Detecting Soft 404s in Server Logs — separate orphan "200s" from pages that return OK but have no content.
- Understanding HTTP Status Codes in Server Logs — read the status field correctly when classifying orphan candidates.
- awk and grep Commands for Log Filtering — the extraction primitives behind every pipeline on this page.
Part of the Diagnosing Crawl Budget Waste series.