WebSecurity.mobi

Focused legacy troubleshooting archive


XML Sitemap Generator Not Reading Past First Page

Troubleshoot sitemap runs that stop at the first page, skip deeper URLs, or fail after the home page in the legacy AuditMyPC tool.

Problem Summary

The archive pattern behind this guide is not a single bug report. It is a cluster of very similar failures: the crawler opens, accepts the site address, and then either stops at the home page, starts throwing connection-reset errors, or produces a crawl that is far smaller than the site owner expects.

Across threads like removing-dynamic-urls, status-failed-error-connection-reset, sitemap-stops-after-only-2191-of-9600-pages, and google-sitemap-stops-after-the-index-page, the same theme kept showing up: the tool was usually not refusing to crawl for no reason. It was reacting to the way the site exposed URLs, redirects, filters, or alternate hostnames.

Comment Highlights

  • One user reported the exact same failure on a catalog site, which shows this was not limited to a single setup or one broken crawl job.
  • Another run produced 173 connection-reset errors on a site the owner described as only about five pages, including errors on URLs that looked like redirect variants rather than real content pages.
  • A beginner-level thread described the symptom as the index page always showing up but deeper pages never appearing, even though the crawl had been submitted correctly from the user's point of view.
  • A separate thread's crawl only started working after the user stopped reloading an old saved project and reran the crawl with fresh settings, a good reminder that filter changes and project state were easy to misunderstand in this tool.

Likely Causes

  • The site resolves under more than one hostname or default path, such as www and non-www, or the crawler sees both the bare address and a default file like index.php as separate paths.
  • Exclude rules or crawl settings were changed, but the user restarted from an old project file and assumed the new parameters were in effect when they were not.
  • Dynamic or rewritten URLs generated redirect chains, alternate versions of the same page, or crawl paths that looked much larger than the site owner expected. That overlap often turned into the duplicate patterns covered in Duplicate Title Tags and Sitemap Crawling.
  • The failure was local to one browser or one machine. Several archive cases suddenly worked on a different run, another browser, or a second system without any site-side change.
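The hostname and default-file overlap in the first cause above can be made concrete with a small sketch. This is an illustrative Python helper, not part of the original tool: the hostname www.example.com and the default-file names are placeholder assumptions, and real canonicalization would also need to respect the site's actual redirect rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Placeholder list of default files a server may serve for "/".
DEFAULT_FILES = {"index.php", "index.html", "index.htm"}

def canonicalize(url, canonical_host="www.example.com"):
    """Collapse www/non-www and default-file variants of the same page."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    canon_bare = (canonical_host[4:] if canonical_host.startswith("www.")
                  else canonical_host)
    if bare == canon_bare:
        host = canonical_host
    path = parts.path or "/"
    # Treat /index.php and / as the same page.
    head, _, last = path.rpartition("/")
    if last in DEFAULT_FILES:
        path = head + "/"
    # Drop query strings and fragments for duplicate detection only.
    return urlunsplit((parts.scheme or "http", host, path, "", ""))

urls = [
    "http://example.com/",
    "http://www.example.com/index.php",
    "http://www.example.com/?session=abc123",
]
print({canonicalize(u) for u in urls})  # all three collapse to one URL
```

A crawler that does not apply this kind of collapsing sees three distinct URLs here, which is exactly how a "five page" site turns into a much larger crawl queue.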

What Still Applies

  • Start with the final canonical site address you actually want indexed. If the site should live on www, use that version consistently and redirect the alternate version to it.
  • Keep the crawl scope narrow until the basic crawl works. Exclude obvious noise, rerun the crawl from a clean project, and confirm that rewritten URLs are not multiplying the same page under different paths. If they are, the next place to look is Duplicate Title Tags and Sitemap Crawling.
  • If the home page works but deeper paths fail, compare the crawler's failed URLs to the site's redirect rules, default filenames, and generated internal links before blaming the sitemap export itself. When the crawl also slows down badly on a large queue, compare the symptoms with Google Sitemap Restrictions for Large Sites.
  • When results look inconsistent, test from another browser or machine. That does not prove the site is correct, but it helps separate a local runtime problem from a repeatable crawl problem.
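One way to do the redirect comparison in the third point above offline is to model the site's redirect rules as a mapping and walk each failed URL through it, looking for chains and loops. This is a hedged sketch under that assumption; the `redirects` dict is a stand-in for whatever .htaccess or CMS rules the real site uses, not output from the original crawler.

```python
def follow_redirects(url, redirects, max_hops=10):
    """Walk a redirect mapping from `url`.

    Returns (final_url, hops_taken, status), where status is "ok" or
    "loop-or-too-long" when the chain revisits a URL or exceeds max_hops.
    """
    seen = []
    while url in redirects:
        if url in seen or len(seen) >= max_hops:
            return url, seen, "loop-or-too-long"
        seen.append(url)
        url = redirects[url]
    return url, seen, "ok"

# Hypothetical rules: canonical-host redirect, default-file redirect,
# and an accidental two-step loop of the kind that stalls crawlers.
rules = {
    "http://example.com/": "http://www.example.com/",
    "http://www.example.com/index.php": "http://www.example.com/",
    "http://www.example.com/old": "http://www.example.com/new",
    "http://www.example.com/new": "http://www.example.com/old",
}

print(follow_redirects("http://example.com/", rules))
print(follow_redirects("http://www.example.com/old", rules))
```

Running each failed URL from the crawl report through a model like this makes it obvious whether the crawler gave up on a real page or on a redirect variant that never resolves.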

Legacy Notes

The original threads came from a Java-based webmaster tool and an older Google Sitemaps workflow. Specific browser advice is dated, but the canonical-host and redirect logic is still relevant.

Some examples refer to index.php, .htaccess, and early SEO-friendly URL modules. The names are old; the underlying lesson is not. A crawler can only follow the version of the site your links and redirects actually expose.
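The canonical-host lesson can still be written in the same .htaccess idiom those threads used. The fragment below is a sketch for Apache mod_rewrite, assuming the site should live on www.example.com; the hostname is a placeholder, and real sites today would normally also redirect to https.

```apache
# Permanently redirect bare-domain requests to the www host,
# so crawlers only ever see one hostname. Hostname is a placeholder.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```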

Related Guides

Parent Hub


XML Sitemap Generator Help

Legacy support hub for the AuditMyPC XML Sitemap Generator, including crawl limits, Java errors, odd exports, and duplicate URL problems.