
Support Forum



Unknown Spider problems??

Submitted by paul30 on Wed, 2009-04-08 07:07

Does anyone have any problems with spiders??

I have a pretty strong VPS (1.5GB guaranteed RAM, 2+GHz guaranteed processor power, etc.) and a medium-sized website (around 50K products).

The problem is that lately the server cannot handle the pressure; the site uses over 100% of the RAM available to my account, and because of that the DB sometimes crashes and requires a hard reset to get back to normal.

I haven't had an increase in traffic in the last month, and the whole issue came out of the blue, so I checked my statistics closely and I see that I get lots of requests from "unknown robots" (identified by an empty user agent string) - in 5 days those unknown robots have spidered over 6.5GB of bandwidth.

Does anyone think those two could be related? Any ideas how to track/block these "unknown robots"?

Thanks
Paul

Submitted by support on Wed, 2009-04-08 08:40

Hi Paul,

Do you use robots.txt to limit spider activity to the main pages? That might help - it has before, where people have been hammered by robots, particularly on big sites where a robot starts requesting hundreds of search pages with different sort orders etc.

This is the robots.txt that I use on my Price Tapestry sites, and it's the sort of thing I'd recommend - it should help a lot, as even good robots will be wasting resources otherwise.

User-agent: Mediapartners-Google*
Disallow:
User-agent: ia_archiver
Disallow: /
User-agent: *
Disallow: /categories.php
Disallow: /brands.php
Disallow: /reviews.php
Disallow: /category/
Disallow: /brand/
Disallow: /review/
Disallow: /admin/
Disallow: /search.php
Disallow: /jump.php
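
As an aside, some robots also support a non-standard Crawl-delay directive that asks them to wait a number of seconds between requests. Only well-behaved robots honour it (Google ignores it, and a rogue bot certainly will), but it can take some load off when a legitimate spider is being over-eager, e.g.

User-agent: *
Crawl-delay: 10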

Cheers,
David.

Submitted by webie on Sat, 2009-04-25 14:22

Hi Paul,

I had a problem with a crawler crashing my Price Tapestry install on my server. Even after I blocked it via robots.txt it kept coming, and I even sent an email to the crawler's operator. I sorted it out by blocking it within the .htaccess file.

I found this on the forum and it did the trick for me. You can grab a copy of OpenWebSpider, change the bot name in its config and do a crawl of your own site - it should refuse you and give a Forbidden message.

If you're using Linux and have SSH access, cd into the directory where your logs live and issue the command 'tail -f access_log'. Do this when the bot hits your server and look for the rogue bot.
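
For example - just a rough sketch, assuming Apache's standard "combined" log format, where the user agent is the sixth double-quote-delimited field - you could count requests per user agent with something like:

# count requests per user agent in a "combined" format access log
awk -F'"' '{print $6}' access_log | sort | uniq -c | sort -rn | head

The agents with the biggest counts (or an empty string, like the "unknown robots" Paul described) are the ones to look at.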

######### Block Rogue Web Crawlers ###############
# make sure the rewrite engine is enabled
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BadBot
RewriteRule ^.* - [F]

Kind Regards

Darren

Submitted by support on Sat, 2009-04-25 16:09

Thanks for posting that, Darren.

Anybody using this should note that when the text to match in a RewriteCond rule begins with ^, it means "anything starting with the following text", so you don't need to include the entire user agent here - just the main part of the name at the beginning. So as well as just "BadBot", the above rule would also block:

BadBot1.0
BadBot2.0

etc.
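
For example - an untested sketch, with "BadBot" and "EvilCrawler" standing in for whatever names actually show up in your log - you can chain several conditions with [OR], match case-insensitively with [NC], and catch requests with an empty user agent string (like the ones Paul described) with ^$:

RewriteEngine On
# block user agents beginning with "BadBot" or "EvilCrawler"
# (hypothetical names - substitute what you find in your log),
# case-insensitively, plus requests with no user agent at all
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EvilCrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^.* - [F]

One caveat: a few legitimate clients send no user agent at all, so the empty-string condition is a little blunt.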

Cheers,
David.