Index bloat is what happens when Google knows more URLs on your WordPress site than you do. The symptoms are always the same: “Crawled — currently not indexed” grows like mold, the pages you actually care about wobble in rankings, and your server logs look like a bot convention.
You don’t fix this by “adding more content” or “submitting the sitemap again.” You fix it by deciding what deserves to be indexed, what should be crawled but not indexed, and what should not be crawled at all—without accidentally cutting internal links that help Google understand your site.
What index bloat really is (and why WordPress makes it easy)
Index bloat is not “Google indexing too much” in the abstract. It’s a measurable mismatch:
- URLs Google can crawl (discovered via internal links, sitemaps, external links, parameters)
- URLs Google chooses to index (kept in the index and eligible to rank)
- URLs you intended to exist (your canonical set: pages you want users to land on)
WordPress makes bloat easy because it manufactures “helpful” pages by default: tag archives, date archives, author archives, attachment pages, paginated archives, internal search result pages, and sometimes multiple URL versions for the same content via parameters and tracking codes.
The trap: many of those pages can still be useful for navigation and topical clustering inside the site. You want Google to understand that structure. You just don’t want Google to index the low-value variants.
Your target state: Google can crawl enough to understand the site, but indexes a tight set of canonical URLs that represent your best pages.
Operationally, treat index bloat like any other production problem: define the desired state, measure drift, then apply changes with guardrails. SEO is just SRE for content.
Facts and context: how we got here
Some historical context helps because WordPress SEO advice tends to fossilize. Here are a few concrete facts that still matter:
- Robots.txt started in 1994 as a voluntary convention; it was never an “access control” mechanism. It’s a polite sign, not a lock.
- Meta robots “noindex” predates modern SEO tooling and remains the cleanest way to say “you can crawl this, but don’t index it.”
- Canonical tags became mainstream in 2009 to reduce duplicate-content chaos. They’re a hint, not a guarantee, but still critical.
- Google’s crawl budget conversations became popular in the 2010s, but the practical takeaway is older: waste less crawler time on junk.
- WordPress attachment pages have been an SEO footgun for years because each media item can become a thin page with near-zero value.
- Tag archives were invented for navigation and discovery, not to be the 40th doorway page ranking for the same keyword.
- Faceted navigation exploded with e-commerce UX, and with it, parameterized URL bloat; WordPress “filters” and plugin-built facets recreate that mess.
- “Crawled — currently not indexed” is not a punishment; it’s Google being selective. Your job is to make selection easy.
- Modern Google can discover URLs without sitemaps through links and feeds; sitemaps are guidance, not a discovery dependency.
One quote worth keeping above your desk:
“Hope is not a strategy.” — traditional SRE saying
It’s not about being clever. It’s about being explicit.
Rules of the road: noindex, canonical, robots, and link equity
Noindex is about indexing, not crawling
noindex says: “Do not include this page in search results.” It does not automatically stop crawling. Google may still crawl it periodically to re-check signals, links, and directives.
Disallow is about crawling, not indexing (mostly)
Disallow in robots.txt says: “Do not crawl.” But if a URL is discovered from external links, Google can still index the URL without content (the dreaded “indexed, though blocked by robots.txt” patterns). That’s how you get ghost entries that never consolidate signals properly.
Canonical consolidates duplicates—when it’s believable
A canonical tag is you proposing the “main” URL. It works best when:
- The content is substantially the same.
- Internal linking reinforces the canonical.
- The canonical target is accessible and indexable.
Internal links can still point to noindex pages—but do it with intent
Noindex doesn’t “kill” a link. But it changes how valuable that URL is as a destination. The right pattern is usually:
- Link to indexable pages when possible (posts, core categories you want ranking).
- Use noindex on low-value collections while still letting them exist for users.
- Don’t obsess over “link juice” like it’s a fluid you’re spilling on the floor. Think in terms of crawl paths and canonical targets.
Joke #1: Crawl budget is like your meeting calendar—if you let every invite in, you’ll be busy and still get nothing done.
The “indexable set” is a product decision, not a plugin setting
Before touching rules, write down what should be indexable:
- All posts? All pages? Products only?
- Category archives: yes/no (often yes, but only the ones with real editorial intent).
- Tag archives: usually no, unless you treat tags like curated hubs.
- Author/date archives: almost always no for single-author blogs or small editorial teams.
- Search results: no.
- Pagination: depends, but usually index page 1, noindex pages 2+ (or canonicalize carefully).
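“Write it down” can be as literal as a file that scripts can check. A minimal sketch, assuming a hypothetical inventory file and a made-up intended_state helper (the families and path are examples, not a recommendation):

```shell
# Hypothetical inventory file mapping URL families to their intended state
# (index | noindex | block). Keep it in version control next to infra docs.
cat > /tmp/indexable-set.txt <<'EOF'
/blog/ index
/category/ index
/tag/ noindex
/author/ noindex
/?s= noindex
/wp-admin/ block
EOF

# Hypothetical helper: look up the intended state for a URL family.
intended_state() {
  awk -v fam="$1" '$1 == fam {print $2}' /tmp/indexable-set.txt
}

intended_state "/tag/"   # prints: noindex
```

Once the map exists, every directive change in a plugin or template has a single source of truth to be diffed against.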
Fast diagnosis playbook: find the real bottleneck first
When an SEO team says “index bloat,” it can mean three different operational failures. Check in this order.
First: is it a discovery problem, a duplication problem, or a quality problem?
- Discovery problem: important URLs not found/crawled. Symptoms: key pages missing, sitemap ignored, weak internal linking.
- Duplication problem: too many URL variants for the same content. Symptoms: parameters, multiple paths, inconsistent canonicals.
- Quality problem: Google sees lots of thin pages. Symptoms: “Crawled — currently not indexed,” “Duplicate without user-selected canonical.”
Second: validate directives and their conflicts
Find contradictions like:
- noindex pages included in the XML sitemap
- Disallow blocking pages you want indexed
- Canonical pointing to a URL that 301s, 404s, or is noindexed
- noindex combined with nofollow everywhere (a great way to make Google blind)
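The first contradiction (noindex pages sitting in the sitemap) is easy to hunt with a few lines of shell. A sketch: extract_locs is a hypothetical helper that only does text extraction, and the domain in the commented live loop is a placeholder:

```shell
# Hypothetical helper: pull every <loc> URL out of sitemap XML on stdin.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's#<loc>##' -e 's#</loc>##'
}

# Self-contained demo on an inline sitemap fragment:
printf '%s\n' '<urlset><url><loc>https://example.com/a/</loc></url><url><loc>https://example.com/tag/b/</loc></url></urlset>' | extract_locs
# prints:
#   https://example.com/a/
#   https://example.com/tag/b/

# Against a live site you would then flag contradictions, e.g.:
#   curl -sS https://example.com/post-sitemap.xml | extract_locs | \
#     while read -r u; do
#       curl -sS "$u" | grep -qi 'name="robots" content="noindex' && echo "CONFLICT: $u"
#     done
```

Any URL the loop flags is telling Google “please index this” and “don’t index this” at the same time.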
Third: check the crawler workload (logs), not your feelings
Server access logs will tell you what Googlebot is spending time on: tag pages, search, attachment pages, or infinite parameters. Fix the hotspots first.
Hands-on tasks: commands, outputs, and decisions (14 total)
These are the tasks I actually run when I’m diagnosing WordPress index bloat in production. Each one includes: command, sample output, what it means, and what decision you make next.
Task 1: Count unique URL patterns Googlebot is hitting (top offenders)
cr0x@server:~$ zcat -f /var/log/nginx/access.log* | awk '$0 ~ /Googlebot/ {print $7}' | sed -E 's#^(/\?s=).*#\1#; s#^(/[^/?]+/).*#\1#' | sort | uniq -c | sort -nr | head
18231 /tag/
12044 /wp-content/
9312 /page/
7441 /category/
5220 /?s=
4109 /author/
What it means: Googlebot is spending real time on tag archives, pagination, search, authors, and even static assets.
Decision: Prioritize directives for /tag/, /?s=, and author archives. Also confirm static assets aren’t being linked as crawlable pages (they shouldn’t be in sitemaps).
Task 2: Identify parameter-heavy crawling (possible faceted/filter bloat)
cr0x@server:~$ zcat -f /var/log/nginx/access.log* | awk '$0 ~ /Googlebot/ {print $7}' | grep -E '\?.+=' | sed -e 's/.*?/?/' -e 's/=.*/=/' | sort | uniq -c | sort -nr | head
6421 ?replytocom=
3220 ?utm_source=
1411 ?amp=
1207 ?orderby=
988 ?filter=
What it means: Parameters like replytocom and UTM tracking are creating alternate URLs. Those are classic index-bloat accelerants.
Decision: Normalize with canonical handling, parameter rules (where available), and consider redirecting or stripping specific parameters at the edge if safe.
Task 3: Check sitemap URLs vs indexable directives (sample with curl)
cr0x@server:~$ curl -sS -I https://example.com/post-sitemap.xml | head
HTTP/2 200
content-type: application/xml; charset=UTF-8
cache-control: max-age=300, must-revalidate
What it means: Sitemap is accessible. Now confirm it contains only indexable URLs.
Decision: Fetch a few URLs from the sitemap and verify they are not noindex and return 200 with correct canonical.
Task 4: Validate a page’s robots meta and canonical (headers + HTML grep)
cr0x@server:~$ curl -sS -D- https://example.com/tag/widgets/ | sed -n '1,25p'
HTTP/2 200
content-type: text/html; charset=UTF-8
cr0x@server:~$ curl -sS https://example.com/tag/widgets/ | grep -iE 'robots|canonical' | head
<meta name="robots" content="noindex,follow">
<link rel="canonical" href="https://example.com/tag/widgets/" />
What it means: Tag page is set to noindex,follow. That’s often correct if tags are not meant to rank but still help users navigate.
Decision: Keep follow so internal links remain discoverable. Ensure tag pages are not included in sitemaps.
Task 5: Confirm robots.txt isn’t blocking something you want indexed
cr0x@server:~$ curl -sS https://example.com/robots.txt
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-json/
Disallow: /tag/
Sitemap: https://example.com/sitemap_index.xml
What it means: /tag/ is disallowed, which prevents crawling and can leave tag URLs indexed without content if discovered externally.
Decision: Prefer allowing crawl + noindex for tag archives, unless you have a strong reason to block crawl. Remove Disallow: /tag/ and handle via meta robots.
Task 6: Find attachment pages returning 200 (thin content candidates)
cr0x@server:~$ wp --path=/var/www/html --allow-root post list --post_type=attachment --post_status=inherit --fields=ID,post_title,post_date --format=table | head
+------+------------------------+------------+
| ID | post_title | post_date |
+------+------------------------+------------+
| 4123 | datasheet-widget-01 | 2023-05-10 |
| 4124 | widget-diagram | 2023-05-10 |
+------+------------------------+------------+
What it means: Attachments exist and likely have attachment pages unless disabled/redirected by theme/plugin settings.
Decision: Redirect attachment pages to the media file or parent post, and/or apply noindex to attachments site-wide.
Task 7: List public post types and taxonomies (what can generate archives)
cr0x@server:~$ wp --path=/var/www/html --allow-root eval 'print_r(get_post_types(["public"=>true]));'
Array
(
[post] => post
[page] => page
[attachment] => attachment
[product] => product
)
cr0x@server:~$ wp --path=/var/www/html --allow-root eval 'print_r(get_taxonomies(["public"=>true]));'
Array
(
[category] => category
[post_tag] => post_tag
[product_cat] => product_cat
[product_tag] => product_tag
)
What it means: You have taxonomies that can create archives: post_tag, product_tag, etc.
Decision: Decide which taxonomy archives should be indexable. Most sites only need a subset (often categories/product categories).
Task 8: Detect near-duplicate URL versions (http/https, www/non-www)
cr0x@server:~$ curl -sS -I http://example.com/ | head -n 5
HTTP/1.1 301 Moved Permanently
Location: https://example.com/
cr0x@server:~$ curl -sS -I https://www.example.com/ | head -n 5
HTTP/2 301
location: https://example.com/
What it means: Redirects are in place. Good. Duplication here is controlled.
Decision: Confirm all variants consolidate to one canonical host and scheme everywhere, including sitemaps and canonical tags.
Task 9: Check for trailing slash inconsistencies (duplicate paths)
cr0x@server:~$ curl -sS -I https://example.com/category/widgets | head -n 6
HTTP/2 301
location: https://example.com/category/widgets/
What it means: WordPress is normalizing to the trailing-slash version via redirect.
Decision: Ensure canonical tags always use the normalized version to avoid “duplicate, Google chose different canonical” noise.
Task 10: Find “thin” archives by counting posts per term
cr0x@server:~$ wp --path=/var/www/html --allow-root term list post_tag --fields=term_id,name,count --format=csv | awk -F',' 'NR>1 && $3+0<3 {print}' | head
102,blue-widgets,1
141,widget-ideas,2
177,old-campaign,1
What it means: Many tags have 1–2 posts. Indexing those archive pages is usually worthless and increases bloat.
Decision: Set tag archives to noindex,follow, or curate a small subset of “hub tags” and noindex the rest (advanced).
Task 11: Confirm internal search pages exist and are indexable (they shouldn’t be)
cr0x@server:~$ curl -sS -I "https://example.com/?s=widgets" | head -n 12
HTTP/2 200
content-type: text/html; charset=UTF-8
cr0x@server:~$ curl -sS "https://example.com/?s=widgets" | grep -i 'robots' | head
<meta name="robots" content="noindex,follow">
What it means: Search results are correctly set to noindex. If they weren’t, you’d get infinite low-quality pages.
Decision: Keep them noindex,follow. Also avoid linking to internal search results in templates.
Task 12: Spot pagination bloat (page/2, page/3…)
cr0x@server:~$ zcat -f /var/log/nginx/access.log* | awk '$0 ~ /Googlebot/ {print $7}' | grep -E '/page/[0-9]+/' | awk -F/ '{print "/"$2"/"}' | sort | uniq -c | sort -nr | head
8022 /category/
2210 /tag/
What it means: Googlebot is crawling deep pagination, especially category pages. That can be normal, but it’s often wasted if paginated pages have low unique value.
Decision: Generally: keep page 1 indexable, set pages 2+ to noindex,follow. Ensure pagination links are crawlable so Google can reach deeper posts if needed.
Task 13: Check response codes for suspicious URL families (404/soft 404)
cr0x@server:~$ zcat -f /var/log/nginx/access.log* | awk '$9 ~ /404|410/ {print $7}' | head
/tag/obsolete/
/category/typoo/
/wp-content/uploads/2019/ghost.png
What it means: Broken archives and missing media can create crawl waste and index noise.
Decision: Fix internal links, redirect intentionally, and return 410 for truly gone URL sets when appropriate.
Task 14: Confirm you’re not shipping “noindex” on pages you want ranking
cr0x@server:~$ curl -sS https://example.com/important-landing-page/ | grep -i 'meta name="robots"' -n
24:<meta name="robots" content="noindex,nofollow">
What it means: Someone (plugin setting, environment toggle, or template) noindexed a money page. This is a fire.
Decision: Remove noindex immediately, purge caches, re-submit for indexing, and audit the setting that caused it so it doesn’t recur.
Noindex rules that don’t kill internal links (the practical list)
Here’s the set of rules I deploy most often on WordPress. They’re biased toward editorial sites and content marketing sites, but the logic applies broadly. For e-commerce, you’ll keep more taxonomy pages indexable, but you still noindex the junk variants.
1) Internal search results: noindex, follow
Do: noindex,follow on /?s=query and any pretty search path you use.
Why: Search results are infinite, unstable, and often thin. They’re also trivially spammed by query permutations.
Keep internal links: follow lets Google still traverse to real posts from those pages if it crawls them.
2) Tag archives: default to noindex, follow
Do: noindex tag archives unless you actively curate them.
Why: Tags are typically undisciplined. People create “blue-widget,” “blue-widgets,” “bluewidget,” then leave them all with one post each.
Exception: If you have a controlled vocabulary and each tag is a proper hub page with unique intro copy and meaningful curation, index it. Most teams don’t.
3) Author archives: noindex for single-author or small teams
Do: noindex author archives if they don’t provide unique value beyond “posts by X.”
Why: They duplicate post listings. They also create weird thin pages for “Admin” or legacy accounts.
Keep internal links: Leave author links for users if you must, but ensure the archive is noindexed.
4) Date archives: noindex almost always
Do: noindex date archives (monthly/daily archives).
Why: Dates are not topical. They create a parallel navigation system that’s useless for search and produces tons of low-value pages.
5) Attachment pages: redirect or noindex hard
Do: either:
- 301 redirect attachment pages to the parent post (preferred when parent exists), or
- Redirect to the media file URL, or
- Noindex attachment pages site-wide.
Why: Attachment pages are the SEO equivalent of a hallway with no doors.
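If you go the redirect route at scale, a rough sketch of generating a redirect map with WP-CLI. The live invocation is commented out because it needs a real install; make_redirects is a hypothetical formatter and all paths are placeholders:

```shell
# Hypothetical helper: format "from to" pairs as nginx map entries.
make_redirects() {
  awk '{printf "  %s %s;\n", $1, $2}'
}

# Local demo with a placeholder pair (prints one nginx map entry):
printf '%s\n' '/widget-diagram/ /widgets-guide/' | make_redirects

# On a live install (sketch; needs WP-CLI and a parent-post check):
#   for id in $(wp --path=/var/www/html post list --post_type=attachment --format=ids); do
#     parent=$(wp --path=/var/www/html post get "$id" --field=post_parent)
#     echo "$(wp post url "$id") $(wp post url "$parent")"
#   done | make_redirects
```

Generating the map from the database beats hand-maintaining hundreds of redirects, and it regenerates cleanly after content changes.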
6) Pagination: usually index page 1, noindex page 2+
Do: apply noindex,follow on paginated archives beyond page 1 (e.g., /category/widgets/page/2/).
Why: Paginated pages are often just “more of the same list,” and index bloat balloons fast. But you still want them crawled so deeper posts can be discovered.
Be careful: If page 2 has unique content or the category is huge and page 1 cannot represent it, you may allow indexing. That’s not common for blogs; more common for large catalogs.
7) Parameterized URLs: canonicalize and reduce at the source
Do:
- Keep canonical pointing to the clean URL (no tracking params).
- Stop generating links with UTM parameters inside your own site.
- For notorious params like replytocom, disable the behavior or redirect.
Why: Parameters multiply. Google is good, but it’s not here to solve your math homework.
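A sketch of what “reduce at the source” looks like as a normalization step. strip_params is a hypothetical helper, and the parameter list is an example, not exhaustive; pathological multi-parameter combinations may need a second pass:

```shell
# Hypothetical helper: drop known tracking/junk parameters from URLs,
# leaving legitimate ones (like pagination) intact.
strip_params() {
  sed -E 's/([?&])(utm_[a-z]+|replytocom|fbclid)=[^&]*&/\1/g; s/[?&](utm_[a-z]+|replytocom|fbclid)=[^&]*$//'
}

echo 'https://example.com/post/?utm_source=news&utm_medium=email' | strip_params
# prints: https://example.com/post/
echo 'https://example.com/post/?utm_source=x&page=2' | strip_params
# prints: https://example.com/post/?page=2
```

The same normalization logic can drive a CDN/edge rewrite rule or a log-analysis pre-filter, so you compare canonical URLs instead of their tracking-code costumes.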
8) Filter/facet pages (plugin-driven): decide “indexable facets” explicitly
If you have WooCommerce or a faceted navigation plugin, you must decide which filters represent real landing pages.
- Indexable facets: high-intent combinations you can maintain, with stable content and demand.
- Noindex facets: everything else, especially multi-filter permutations that create millions of pages.
Keep internal links: It’s fine for users to filter; it’s not fine for Google to index every permutation. Use noindex,follow or canonical to a parent category where appropriate.
9) Duplicate taxonomy archives across post types: consolidate
WordPress setups sometimes create multiple taxonomies that look identical (“topics,” “tags,” “labels”). If two archives compete for the same intent, pick one and noindex or redirect the other.
10) Preview, staging, and query pages: block properly
Do: Make sure staging is blocked at the server (auth) and also with robots, and never exposed in sitemaps.
And for previews like ?preview=true: canonicalize to the published URL and avoid indexing. These URLs leak.
Joke #2: If you let every tag archive index, your SERP strategy becomes “hope the algorithm adopts us.” That’s not a plan; that’s a cry for help.
What “noindex rules that don’t kill internal links” actually means
It means you avoid nofollow by default. The pattern for low-value pages is:
- noindex,follow for pages that can help discovery but shouldn’t rank.
- 301 redirect for duplicates and obsolete URL families.
- Disallow only when crawling the content is actively harmful (infinite spaces, private areas, heavy endpoints), and you’re okay with the indexing side effects.
Three corporate mini-stories from the trenches
Mini-story 1: The incident caused by a wrong assumption
The company was mid-replatform, but “mid-replatform” is corporate for “everything is on fire, just politely.” Their WordPress blog sat behind a CDN and a caching layer, and someone noticed Googlebot hammering /?s= URLs. The quick fix seemed obvious: block internal search in robots.txt.
They added Disallow: /?s= and went home feeling productive. For about a week, it looked fine—crawl volume dropped and dashboards got quieter. Then organic traffic started sliding, slowly but steadily, like a storage array that’s losing a disk every Friday.
The wrong assumption: “If we disallow crawl, it won’t index.” What actually happened was worse. Google had already discovered thousands of search URLs through internal links and external referrers. With crawling blocked, it couldn’t re-fetch them to see a noindex directive or canonical, and it couldn’t consolidate signals cleanly.
Search Console started showing an ugly mix of “Indexed, though blocked by robots.txt” and canonical confusion. The indexed set became polluted with URL-only entries and low-quality snippets. Meanwhile, some real posts were crawled less because Googlebot was still discovering garbage, just with less ability to evaluate it.
The fix was boring and exact: remove the disallow, apply noindex,follow to search pages, and stop linking to search URLs in navigation. Crawl went up briefly, then normalized. Index cleaned up over the next weeks. The lesson: don’t use robots.txt as a broom; it’s more like a “do not enter” sign that still lets people write your address down.
Mini-story 2: The optimization that backfired
An enterprise marketing team decided to “simplify SEO” by setting noindex,nofollow on all tag pages, category pages, and author pages. The thinking was: only posts should rank, so everything else is noise.
It worked—sort of. Index bloat dropped. But so did discovery of older posts. The site had years of content and the primary internal navigation was category-based. Categories were the connective tissue that led crawlers and users into deep archives.
With nofollow everywhere on those pages, the crawl graph thinned. Google started treating parts of the site like isolated islands. A bunch of long-tail posts quietly lost impressions because they weren’t being reached as frequently, and internal anchor context weakened.
They rolled back the nofollow portion, leaving noindex,follow on low-value archives. For key categories that served as real landing pages, they made them indexable and added short editorial intros. The result wasn’t sexy, but it was stable: fewer junk URLs indexed, and internal linking did its job again.
Mini-story 3: The boring but correct practice that saved the day
A publisher ran WordPress at scale with multiple custom post types and a revolving door of plugins. Their SEO team wanted aggressive noindexing to reduce “Crawled — currently not indexed.” The SRE team’s response: “Fine, but we’re doing it with change control.” Everyone groaned, as is tradition.
They kept a simple inventory document: which URL types exist, which are indexable, which are noindex, which are redirected, and which are blocked. Not a spreadsheet masterpiece—just a living map. Every proposed change had to name the URL family and expected outcomes in logs and Search Console.
They also ran weekly log sampling: top Googlebot paths, top 404s, and top parameterized URLs. When a plugin update suddenly introduced ?amp=1 variants and started linking them internally, it was caught within days, not months.
That boring practice prevented a major mess. They didn’t need heroics. They needed a feedback loop: deploy, observe, adjust. It’s the same discipline you’d apply to storage performance regressions—measure before and after, and don’t trust “should be fine.”
Common mistakes: symptoms → root cause → fix
1) Symptom: “Indexed, though blocked by robots.txt” grows
Root cause: You disallowed crawling of URLs that are discoverable via links, so Google indexes the URL without being able to evaluate content or directives.
Fix: Remove the disallow for that URL family and use noindex on-page instead. For truly infinite spaces, consider blocking plus removing internal links and returning 404/410 where appropriate.
2) Symptom: Important pages show “Duplicate without user-selected canonical”
Root cause: Multiple URL variants exist (parameters, trailing slash, http/https, www/non-www, pagination weirdness), and canonicals are inconsistent or implausible.
Fix: Normalize with redirects, ensure canonical tags point to the final 200 URL, and stop internal links from pointing to non-canonical variants.
3) Symptom: Tag pages outrank posts (and conversions tank)
Root cause: Tags are indexed, thin, and accidentally match broad queries. Google picks them as the “best” page because everything else is diluted or duplicated.
Fix: Noindex tag archives by default; make only curated hubs indexable. Strengthen internal links to the intended landing page.
4) Symptom: Lots of “Crawled — currently not indexed” for archives and pagination
Root cause: Google crawls them but doesn’t see enough unique value, or sees duplication signals.
Fix: Noindex low-value archives and deep pagination; add canonical where duplication is the main issue; reduce the number of discoverable junk URLs.
5) Symptom: Sudden deindexing after a plugin update
Root cause: SEO plugin or theme changed robots meta globally, or toggled “discourage search engines” flags.
Fix: Audit robots meta on key templates, lock configuration via environment checks, and add a monitoring test that curls a few critical pages for index,follow.
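That monitoring test can be a few lines of shell in cron or CI. A sketch with hypothetical names (assert_indexable, notify-team) and placeholder URLs:

```shell
# Hypothetical guardrail: fail loudly if a money page ships noindex.
assert_indexable() {  # usage: <html on stdin> | assert_indexable <url>
  if grep -qi 'name="robots" content="[^"]*noindex'; then
    echo "ALERT: $1 is noindex"
    return 1
  fi
  echo "OK: $1"
}

# Local demo:
printf '%s' '<meta name="robots" content="index,follow">' | assert_indexable "/pricing/"
# prints: OK: /pricing/

# Live usage (placeholder URLs and alert hook):
#   for u in / /pricing/ /important-landing-page/; do
#     curl -sS "https://example.com$u" | assert_indexable "$u" || notify-team
#   done
```

The non-zero return on failure is deliberate: it lets the check double as a CI gate so a plugin update can’t ship noindex on a key template unnoticed.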
6) Symptom: Attachment pages are indexed with weird snippets
Root cause: Attachment pages are thin and sometimes auto-generated with little context; they can be linked from image search or internal media listings.
Fix: Redirect attachment pages to parent posts or media files; noindex as a fallback.
7) Symptom: Crawl rate is high but rankings don’t improve
Root cause: Crawl effort is spent on low-value URL sets: parameters, tag pages, internal search, and thin archives.
Fix: Reduce discoverability of junk and tighten indexable set. Then improve the pages that remain indexable.
Checklists / step-by-step plan
Step 1: Define your indexable set (write it down)
- Indexable: posts, pages, products, selected categories/product categories.
- Noindex: internal search, tag archives, date archives, most author archives, attachment pages, paginated pages 2+ (usually).
- Redirect: duplicates (http→https, www→non-www), attachment pages (preferred), obsolete URL sets.
- Block crawl: admin, login, heavy API endpoints you never want crawled.
Step 2: Align sitemaps with that set
- Only indexable URLs belong in XML sitemaps.
- Remove noindexed URL types from sitemaps via plugin settings.
- Confirm sitemap URLs return 200 and are canonical.
Step 3: Fix the biggest bloat multipliers first
- Internal search pages: ensure noindex,follow.
- Tags: set to noindex,follow unless curated.
- Attachments: redirect or noindex.
- Parameters: stop generating them internally; canonicalize; consider edge normalization for known garbage params.
- Pagination: noindex page 2+ (usually).
Step 4: Validate via logs and spot checks
- Sample Googlebot requests weekly and track top URL families.
- Curl a handful of representative pages and confirm robots meta + canonical.
- Watch for new URL families after plugin/theme releases.
Step 5: Add guardrails (monitoring)
- Automate checks for robots meta on 10–20 key pages.
- Alert on spikes in parameterized crawling or 404s.
- Keep an “SEO directives inventory” alongside infrastructure docs.
FAQ
Should I noindex category pages?
Usually no. Categories often represent real topics and can be strong landing pages. Index categories you’re willing to curate; noindex only if they’re thin and redundant.
Is noindex,follow still valid, or does Google ignore it?
It’s valid. The nuance is that Google may treat nofollow as a hint in some contexts, but noindex remains the right directive for “don’t index this page.”
What’s better: noindex or canonical for duplicates?
If the page is a true duplicate variant that shouldn’t exist as a destination, prefer redirects and canonicalization. Use noindex for pages that have a user purpose but shouldn’t rank (search results, some archives).
Will noindexing tag pages hurt internal linking?
If you use noindex,follow, you keep crawl paths. The bigger risk is setting nofollow everywhere or blocking crawl in robots.txt.
Should I disallow tag archives in robots.txt to save crawl budget?
Rarely. Disallow prevents crawling, which can create “indexed but blocked” artifacts if tag URLs are discovered elsewhere. Let them be crawled and noindexed instead.
How long does it take for index bloat to shrink after changes?
Expect weeks, not days. Google has to recrawl pages to see directives, and index cleanup isn’t instantaneous. Your logs will reflect behavior changes sooner than the index will.
Do I need to remove noindexed URLs from sitemaps?
Yes. Sitemaps are a list of “please index these.” Including noindexed URLs sends mixed signals and wastes attention.
What about paginated pages—should I canonicalize page 2+ to page 1?
Be careful. Canonicalizing page 2 to page 1 can hide deeper items and confuse Google about distinct lists. A common safer pattern is noindex,follow on page 2+ while keeping proper pagination links.
Can I just delete tag pages?
You can remove internal links to them and stop generating them, but existing URLs may persist externally. Noindex and/or 301 redirects are cleaner than pretending they never existed.
Does blocking /wp-json/ help?
Sometimes. For many sites it’s fine to disallow if it creates crawl noise, but be cautious: some front-end features and plugins rely on it. Test before you slam the door.
Conclusion: next steps that actually work
Stop treating index bloat like a mystical SEO curse. It’s just an uncontrolled URL surface area. WordPress will happily generate pages until the heat death of the universe; your job is to decide which ones matter.
Do this next, in order:
- Inventory URL families (posts, pages, categories, tags, authors, dates, attachments, search, parameters, pagination).
- Set defaults: search, tags, date, attachment → noindex,follow (or redirect attachments). Keep categories indexable only if they’re real hubs.
- Remove contradictions: noindex URLs out of sitemaps; remove robots.txt disallows that create “indexed but blocked” problems.
- Watch logs weekly until the crawl mix looks sane (less parameters, fewer tag pages, fewer deep paginations).
- Add guardrails so a plugin update can’t silently noindex your best pages again.
If you do that, you’ll end up with a smaller, cleaner index footprint and a site that’s easier for crawlers to understand. That’s the whole game: make the intended structure obvious, and stop offering Google a buffet of leftovers.