WebSecurity.mobi

Focused legacy troubleshooting archive

Curated guide

XML Sitemap Strange Characters in HTML Sitemap

Troubleshoot garbled characters, encoding issues, and odd text output in generated HTML sitemaps from the legacy AuditMyPC tool.

Problem Summary

The archive threads behind this guide all describe the same unsettling result: the crawler seems to read the site, but the exported HTML sitemap contains mangled characters, broken symbols, or title text that no longer matches what the site owner sees in the browser.

The key detail is that the problem often showed up in the export, not necessarily in the crawl itself. One user noted that the title column inside the tool displayed UTF-8 text correctly, but the exported HTML file broke the same characters on output.
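The symptom described above can be reproduced in a few lines. This is a minimal sketch, not the AuditMyPC exporter's actual code: the `misread` helper and the sample title are hypothetical, and the assumption is that the export wrote bytes in one encoding while the reader interpreted them in another.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text one way and decode it another, as a mismatched export would."""
    return text.encode(written_as).decode(read_as, errors="replace")


title = "Müller"  # hypothetical page title; it is correct in memory

# Assumed scenario: bytes saved as UTF-8, then interpreted as Windows-1252.
garbled = misread(title, "utf-8", "cp1252")
print(garbled)  # MÃ¼ller
```

The in-memory string never changes, which matches the reports that the title column looked fine while the saved file did not.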

Comment Highlights

  • One site owner reproduced the issue on both a larger store and a smaller blog, which suggests the output problem was not tied to one particular page size or crawl depth.
  • Another comment explained that the titles looked correct inside the webmaster tool but became corrupted only after export to HTML, especially where the page title contained umlauts or other non-ASCII characters.
  • The Russian-symbol thread confirms this was not an isolated Western-encoding problem. The archive owner explicitly acknowledged a known issue with Windows-1251 and Russian characters.
  • At least one reproduction disappeared on a later run in both Firefox and Internet Explorer, a sign that the display layer and the export layer could behave inconsistently from one session to the next.

Likely Causes

  • The site content and the export file were using different character encodings, so the crawl data was correct but the rendered output was not.
  • The HTML export step handled extended characters differently than the internal title display inside the tool.
  • A browser or local display issue made the broken output look like a crawl failure even when the crawler had captured the page titles correctly.
  • Legacy encodings such as Windows-1251 or mixed UTF-8 handling pushed the exporter into cases it did not normalize cleanly.
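One way to narrow down which of these causes applies is to take a title you know and the garbled version from the export, then test which encoding pair turns one into the other. The sketch below assumes a short candidate list and a hypothetical `find_mismatches` helper; it is a diagnostic aid, not a description of how the legacy tool worked.

```python
# Candidate encodings to test; extend this list for other encoding families.
CANDIDATES = ["utf-8", "cp1252", "cp1251", "latin-1"]


def find_mismatches(garbled, expected):
    """Return (written_as, read_as) pairs that turn `expected` into `garbled`."""
    matches = []
    for written_as in CANDIDATES:
        for read_as in CANDIDATES:
            if written_as == read_as:
                continue
            try:
                candidate = expected.encode(written_as).decode(read_as)
            except (UnicodeEncodeError, UnicodeDecodeError):
                # This pair cannot represent or interpret the text at all.
                continue
            if candidate == garbled:
                matches.append((written_as, read_as))
    return matches


print(find_mismatches("MÃ¼ller", "Müller"))
```

More than one pair can match, because Windows-1252 and Latin-1 agree on most byte values. Treat the result as a shortlist rather than a verdict.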

What Still Applies

  • Separate crawl correctness from export correctness. If the tool sees the page titles properly but the saved file does not, you are likely dealing with an encoding or export issue, not a discovery issue.
  • Check the source pages for a consistent charset declaration and compare that to the encoding used when the sitemap or report is saved.
  • Test the same output in another browser or on another machine before assuming the site itself is corrupt. If the crawl itself is also failing or stopping early, start instead with XML Sitemap Generator Not Reading Past First Page.
  • When non-ASCII titles matter, a small test crawl on a few known pages can expose whether the problem is global or limited to one encoding family.
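The charset check in the list above can be partly automated. The following is a rough sketch under stated assumptions: the regular expression only looks for a `<meta>` charset declaration near the top of the page (it ignores HTTP headers and byte-order marks), and the helper names are hypothetical.

```python
import re

# Matches both <meta charset="..."> and the older http-equiv content form.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE
)


def declared_charset(html_bytes):
    """Return the charset named in the first <meta> declaration, or None."""
    match = META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def decodes_as(html_bytes, encoding):
    """Report whether the raw bytes decode without errors in `encoding`."""
    try:
        html_bytes.decode(encoding)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


# Hypothetical page: declares UTF-8 but the title byte is Windows-1252.
page = b'<html><head><meta charset="utf-8"><title>M\xfcller</title></head></html>'
declared = declared_charset(page)
print(declared, decodes_as(page, declared))
```

A failed decode is strong evidence of a mismatch. A clean decode is weaker evidence, since single-byte encodings such as Windows-1251 accept almost any byte sequence.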

Legacy Notes

This guide describes an older exporter and an older browser landscape. The exact browser behavior is dated, but charset mismatches still occur in many modern tools and pipelines.

The Windows-1251 example is especially old, but it is a useful reminder that encoding bugs usually surface at export or render time first, especially when the source pages mix character sets.

Related Guides

Parent Hub

XML Sitemap Generator Help

Legacy support hub for the AuditMyPC XML Sitemap Generator, including crawl limits, Java errors, odd exports, and duplicate URL problems.