Can’t index all pages since upgrade

I loved the old version, but now I’ve started to enjoy the new.

I am trying to build a sitemap of http://iomfats.org. It always used to collect all pages faithfully. Currently it starts failing at a random page, keeps failing for a while, then resumes mapping pages. Retrying failed pages makes no difference.

The old version always ran to normal EOJ with no errors.

Java is up to date. I use Firefox (current build) and XP Professional (fully up to date).

I alter settings to 9 threads (always have done) and otherwise take all the defaults.

An example failure is:
20.02.07 10:13:20, Warning: Entry processing failed, url=http://iomfats.org/poetrycorner/poets/skylor, error=Connect timeout

Suggestions, please

Comments

  1. AMPC says:

    The old version cruised along nicely, but it was not as detailed and skipped over a number of errors without notice.

    For example, if I visit:
    http://iomfats.org/poetrycorner/poets/skylor, I receive an HTTP/1.1 301 Moved Permanently
    but, if I visit
    http://iomfats.org/poetrycorner/poets/skylor/ I get a 200 OK.

    I just ran the webmaster tool on your site and it found only one error:
    http://iomfats.org/oddsandends/guestbooks/%20http://members.xoom.com/Bossy_Boots%20

    It completed to EOJ and took only a few minutes.

    It noticed the redirect, followed it to the new url and spidered it without problems.

    Does this happen every time you try to create a sitemap (same error)?

    What happens if you visit that url itself?

    So – to sum it up, I am not seeing the error you mention on my end.

    Can you tell me what the free, total, and max memory values are under the System tab of the webmaster tool?

    Regards,

    Jim.
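For anyone who wants to sidestep the 301 before the spider ever sees it, here is a minimal Python sketch that appends the trailing slash to directory-style URLs. It is not part of the webmaster tool; the function name `with_trailing_slash` is hypothetical, and it assumes that a last path segment without a dot is a directory, which will not hold on every site.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url: str) -> str:
    """Return url with '/' appended when the last path segment looks like a directory.

    Assumption (not from the thread): a last segment without a dot,
    such as 'skylor', is a directory and not a file.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(with_trailing_slash("http://iomfats.org/poetrycorner/poets/skylor"))
```

Links written this way in the site's own HTML would be requested in the form that returned 200 OK above, so the redirect hop is skipped.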

  2. IOMfAtS says:

    Interesting. I ran it with threads=5 and a delay of 0.2 sec and it also ran to EOJ. I suspect what happened was overdriving the server in some manner. The old version ran with 9 threads "happily".

    Oddly the new version finds fewer pages. Very odd indeed. So I guess the answer is to go slowly.

    Sorry I had no chance to get back to you quickly.
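To make "go slowly" concrete, here is a rough Python sketch of a crawl loop with a fixed number of threads and a pause after every request. It is not how the webmaster tool is implemented; `crawl` and the `fetch` callable are hypothetical, and it assumes the delay applies per thread after each request, which may differ from what the tool's delay setting does.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def crawl(urls, fetch, threads=5, delay=0.2):
    """Fetch every URL with a bounded thread pool, pausing after each request.

    fetch is a caller-supplied function (hypothetical) that takes a URL and
    returns something, or raises on failure. Returns a list of
    (url, result, error) tuples in the same order as urls.
    """
    def worker(url):
        try:
            return url, fetch(url), None
        except Exception as exc:  # record the failure instead of stopping the crawl
            return url, None, exc
        finally:
            time.sleep(delay)  # per-thread pause so the server is not overdriven

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(worker, urls))


if __name__ == "__main__":
    # Stand-in fetch that only measures the URL length, so nothing hits the network.
    for url, result, error in crawl(["http://iomfats.org/"], fetch=len):
        print(url, result, error)
```

With 5 threads and a 0.2 sec pause, each thread starts at most five requests a second, so the whole pool stays under roughly 25 requests a second regardless of how fast the server answers.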

  3. AMPC says:

    Running at 9 threads or 5 has absolutely no impact on the tool’s ability to grab all the pages, just time and memory.

    The new sitemap generator will not count a page in error or other non-standard state (but will report it), so this would be the only reason the old tool produced more pages.

    If your web host is delivering content based on the speed of the request, then this is a serious issue. You can change the user agent to that of the googlebot and see if your results change.
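If you want to test the user agent theory outside the tool, a small Python sketch like the following builds a request that identifies itself as Googlebot. The `build_request` helper is hypothetical; the user agent string is the one Google publishes for its crawler, and whether the host treats it differently is exactly what the experiment would show.

```python
from urllib.request import Request

# Googlebot's published desktop user agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Return a urllib Request for url carrying the given User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("http://iomfats.org/poetrycorner/poets/skylor/")
    print(req.full_url, req.get_header("User-agent"))
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Running the same URL once with this header and once with a browser string, and comparing status and timing, would show whether the host behaves differently per agent.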

  4. IOMfAtS says:

    As I discovered empirically, you are right. But you would be, of course.

    The settings you need:

    Free: 5.51M
    Total: 14.6M
    Max: 95.3M

    I just kicked the job off again. It may or may not reach normal EOJ.

    I have not yet changed the user agent, but I will when this finishes to see if there is a difference, and report back.

  5. IOMfAtS says:

    It failed again with default settings. The user agent dropdown does not have "googlebot" in the list as far as I can see, just browser names.

    Oddly it has hung on to 4 files for over 27 minutes and refuses to come to EOJ. The server is running normally with no untoward incidents reported.

  6. AMPC says:

    It took my webmaster tool 2:54 to spider your entire site (without images). I ended up with a total of 1698 pages. I found only one page that was in error (good job!), which was guestbooks.

    I am running with -Xmx256m, which gives me 254M in my Java environment.

    Follow these instructions on Increasing Java Memory, bump yours up to 256, and try this again.

    Also, can you try this on another machine?

    You know, this gives me a great idea

  7. IOMfAtS says:

    Before I saw your reply I reran on a different machine (no difference), retried the failed pages several times and came down to a core of 10 pages it failed on come hell or high water, and tried every user agent with no joy.

    The guestbooks page I have been in to and it is a weird anomaly, or it was. The html was fine! But I have ripped that out and trashed the link it burped on.

    I run htcheck so the site will be "perfect" for lack of 404s. I use 301 redirects anyway to make sure I mop them up.

    Java memory is now increased. Wish it only took me 3 minutes to run. It takes at least 15 from here.

    Settings are now:

    6.41
    16.4 (Wheeeee, an anagram!)
    254 (someone stole 2!)

    While I am wittering, the Java window is about 20 pixels too wide for my Firefox browser at the max width of my laptop, so the vertical slider control is a swine to get at. I run at 1024x768.

    Regrettably the run is still giving failures. It tends to fail for a while and then come "back" and process successfully. I am using the default agent and 9 threads.

    Failures are fairly arbitrary. Sometimes pics (etc), sometimes html.
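One way to check whether the failures really come in bursts is to tally the warning lines by minute and by error type. The Python sketch below is a hedged example: `summarize_failures` is a hypothetical helper, and the line format is assumed from the single warning quoted in the original post, so other message types in the tool's log are simply skipped.

```python
import re
from collections import Counter

# Pattern assumed from the one example line in this thread.
FAILURE = re.compile(
    r"^(?P<ts>\S+ \S+), Warning: Entry processing failed, "
    r"url=(?P<url>.*), error=(?P<error>.*)$"
)


def summarize_failures(lines):
    """Count failed entries per minute and per error message.

    Returns (per_minute, per_error) as two Counter objects.
    Lines that do not match the assumed format are ignored.
    """
    per_minute = Counter()
    per_error = Counter()
    for line in lines:
        match = FAILURE.match(line.strip())
        if match is None:
            continue
        per_minute[match["ts"][:14]] += 1  # drop the seconds: 'dd.mm.yy hh:mm'
        per_error[match["error"]] += 1
    return per_minute, per_error


if __name__ == "__main__":
    sample = [
        "20.02.07 10:13:20, Warning: Entry processing failed, "
        "url=http://iomfats.org/poetrycorner/poets/skylor, error=Connect timeout",
    ]
    per_minute, per_error = summarize_failures(sample)
    print(per_minute, per_error)
```

If most of the counts land in a few consecutive minutes, that points to an outage somewhere on the path; a thin, even spread would suggest something else.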

  8. AMPC says:

    I want to figure this out, so let

  9. IOMfAtS says:

    OK, I quoted the lot, but I’m escaping from the board’s syntax!

    To change the user agent, you can select from the dropdown list – or

  10. AMPC says:

    Can someone else here run the tool against his site, see what type of results you receive, and post them? Did you manage to see all the pages? How many? Any errors?

    Thanks!

    Jim.

  11. IOMfAtS says:

    Just a note, please, to prevent off topic discussions or offence. The site is a gay literature site. If the concept offends anyone then please simply leave it to someone else.

  12. alph says:

    I ran it against the site with no problems. See screenshots below.

  13. AMPC says:

    Thanks Alph!

    Thanks for warning others, but I have a clear policy about off topic discussions and won’t tolerate that here.

    This forum is all about helping people, anything else just won’t fly!

    As for the site, if anyone gets the time, I’d appreciate it – but I suspect that it’s on your end.

    I too have tested it on two computers using different ISPs in different states and had no problems.

    Regards,

    Jim.

  14. IOMfAtS says:

    I guess it must be a bizarre issue with the ISP I connect through, in that case. All I can do is to thank you gentlemen for proving it.

    It can’t be the server. It isn’t my clients. The common area of concern has to be the ISP.

    Glad you have the clear policy against off topic discussions. I tend to warn people because some folks live in less open minded parts of the world.
