Google

Submitted by babrees on Sun, 2014-03-30 02:51

Hi. Bit of a long shot but wondered if anybody else has had this problem?

One of my sites' bandwidth is being eaten alive by Googlebot. AWStats shows it ate 43.88 GB yesterday. Now, I don't want to discourage Google from visiting, but does anybody have any ideas how I can stop them from eating me?

Submitted by support on Sun, 2014-03-30 11:10

Hi Jill,

Wow - that's some resource quota your site has earned!

The first thing to consider, of course, is robots.txt and which pages you want indexed. On my sites, I tend to limit the /category/ and /brand/ A-Z pages, leaving the /merchant/ A-Z pages available to search engines, as that is the only A-Z index guaranteed to link to all /product/ pages on your site. However, many users (particularly on niche sites) don't limit the Category and Brand index pages, as they can contain a rich variety of keyword pages which produce good results.

Depending on what you choose, edit your robots.txt as required, for example I use:

User-agent: *
Disallow: /category/
Disallow: /brand/
Disallow: /jump.php
Crawl-delay: 2
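
If you want to double-check what a rule set permits before going live, Python's built-in robots.txt parser can help. A minimal sketch (the test paths below are just examples, so substitute your own):

from urllib.robotparser import RobotFileParser

# Parse the rules directly; alternatively use set_url()/read()
# to fetch the live file from your site.
rules = """
User-agent: *
Disallow: /category/
Disallow: /brand/
Disallow: /jump.php
Crawl-delay: 2
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Hypothetical example paths -- adjust to match your own site
print(rp.can_fetch("Googlebot", "/category/widgets/"))  # False (disallowed)
print(rp.can_fetch("Googlebot", "/merchant/acme/"))     # True (still crawlable)
print(rp.crawl_delay("Googlebot"))                      # 2 (Python 3.6+)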

Another option is to block search.php. When clean URLs are enabled it is only requested for user searches, and when an A-Z index (merchant, category or brand) is sorted from the search results page, so it can safely be added to the Disallow list:

Disallow: /search.php

(that could make a big difference)

Crawl-delay can be useful in taming well-behaved robots, although bear in mind that Googlebot itself ignores Crawl-delay; Google's crawl rate is set in Google Webmaster Tools instead. If usage still seems excessive, could you let me know how many products are in the database? (43.88 GB does sound like a lot!)

Cheers,
David.
--
PriceTapestry.com

Submitted by babrees on Sun, 2014-03-30 12:33

Hi David,

Unfortunately I already had those in my robots.txt, plus others! Here is what I had...

User-agent: *
Disallow: /merchant.php
Disallow: /categories.php
Disallow: /brands.php
Disallow: /reviews.php
Disallow: /category/
Disallow: /brand/
Disallow: /review/
Disallow: /admin/
Disallow: /search.php
Disallow: /jump.php
User-agent: Googlebot
Crawl-delay: 10

user-agent: AhrefsBot
disallow: /

There are 644,439 products in total.
---------
Jill

Submitted by support on Sun, 2014-03-30 15:05

Hi Jill,

Would it be at all possible for you to email me a short extract from your access log showing Googlebot requests?

It's possible that a modification has inadvertently created more crawlable paths, which I should be able to identify from the logs. If not, if you could let me know the installation URL (I'll remove it before publishing your reply) I'll check it out and see if there's anything else that could be added to robots.txt for you...
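
In the meantime, if you want to total the Googlebot bandwidth yourself, something along these lines would do it. This is a rough sketch that assumes Apache's standard combined log format and a hypothetical access.log filename:

import re

# Matches Apache's combined log format:
# ip - - [date] "request" status size "referer" "user-agent"
line_re = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"')

hits = 0
total = 0
with open("access.log") as log:  # hypothetical filename
    for line in log:
        m = line_re.match(line)
        if not m:
            continue
        ip, request, status, size, agent = m.groups()
        if "Googlebot" in agent and size != "-":
            hits += 1
            total += int(size)

print("%d Googlebot requests, %.2f MB served" % (hits, total / 1048576.0))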

Thanks,
David.
--
PriceTapestry.com

Submitted by Rocket32 on Mon, 2014-07-14 00:04

Hello, I am trying to work out how to block a list of known bots and IP addresses from spidering the site. If I have a list of bots along with their IP addresses, how do I block them?

Submitted by support on Mon, 2014-07-14 07:58

Hi,

To block by user-agent, add the following to your .htaccess file:

Order Allow,Deny
Allow from all
SetEnvIfNoCase User-Agent "Blocked-User-Agent-1" block
SetEnvIfNoCase User-Agent "Blocked-User-Agent-2" block
Deny from env=block

Simply repeat the SetEnvIfNoCase line for each user agent you wish to block.
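
If your server is running Apache 2.4, note that the Order/Allow/Deny directives are deprecated there; an equivalent using the newer authorization syntax would be along these lines (an untested sketch):

SetEnvIfNoCase User-Agent "Blocked-User-Agent-1" block
SetEnvIfNoCase User-Agent "Blocked-User-Agent-2" block
<RequireAll>
  Require all granted
  Require not env block
</RequireAll>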

A specific IP address can be blocked by adding:

Deny from a.b.c.d

...or a whole address block, by giving a partial address:

Deny from a.b.c.
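
Apache also accepts a network/netmask pair or a CIDR range here if you need finer control than whole octets, for example:

Deny from 192.168.0.0/24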

Hope this helps!

Cheers,
David.
--
PriceTapestry.com