Bandwidth Limitation for Robots
I noticed the increasing amount of search bots trying to spider my blog. Especially "slurp" and "msnbot" are reaching a critical stage concerning my already limited bandwidth. Half of the bandwidth is already wasted on the robots, and the more active and publicly visible my blog becomes, the sooner the bandwidth limit will be reached.
I suggest allowing the users at Blogspirit to create their own robots.txt file to limit, restrict and possibly exclude the robots' aggressive behavior. For example, the entry "User-Agent: Slurp" combined with "Crawl-Delay: 20" already helps to slow the robots down, while the meta tag still invites them to index the complete blog. Here is an example from the "Stats > Detailed Statistics" page:
Robots (from search engines)
Robots | Hits | Bandwidth | Last Visit |
slurp | 410 | 17 MB | 15/01/2005 |
msnbot | 101 | 2 MB | 15/01/2005 |
googlebot | 70 | 3 MB | 14/01/2005 |
crawl | 24 | 517 KB | 14/01/2005 |
ia_archiver | 5 | 146 KB | 14/01/2005 |
spider | 4 | 164 KB | 11/01/2005 |
webclipping.com | 3 | 125 KB | 13/01/2005 |
bbot | 1 | 42 KB | 09/01/2005 |
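A minimal robots.txt along the lines suggested above might look like this (the delay values are only an illustration, not a recommendation):

```
User-Agent: Slurp
Crawl-Delay: 20

User-Agent: msnbot
Crawl-Delay: 20
```

Note that Crawl-Delay is a non-standard directive: Slurp and msnbot honored it at the time, while Googlebot ignores it.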
Do you ever get the feeling that someone is watching you – or better yet, looking for you? That is how I feel when I see these on my sites. Spooky…
Well, in fact… ah – no. I don't get this feeling. I know these are robots and automated software systems, so there's nothing to be afraid of. So far they haven't had a chance to threaten my digital existence, you see? :)
Hi Mike,
I noticed the major increase in traffic from slurp etc., as well as lots of commercial links, in my detailed statistics report in January.
Do I need to worry about intrusions as a result of this traffic?
So much information at your blog – and so much that I simply cannot understand!
Best wishes,
Ann
Dear Ann,
the robots are basically harmless, since they are nothing but services run by search engine providers like Google. The robots crawl (or "spider") through your whole site and download all accessible data; they follow the links and add the content to the search engine's cache (this is again "spidering"). Since there is usually no limitation for robots, they crawl the site continuously – the more links and entries you include in your blog, and the more people link to your site, the more often the robots return. You can restrict them with a robots.txt file in the server's root directory, e.g. https://mikeschnoor.com/robots.txt, or include special meta tags in the HTML – refer to my source code for example:
Here I allow the robots to index the site and instruct them to follow the links.
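The snippet referred to here is presumably the standard robots meta tag:

```
<meta name="robots" content="index, follow">
```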
Here I basically inform the browser (ms ie, firefox) and any other robot not to cache the site.
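This is most likely done with the http-equiv meta tags – "pragma" for older browsers such as MS IE, "cache-control" for newer clients and robots:

```
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="cache-control" content="no-cache">
```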
This expiry is set for 1 hour. 60 seconds x 60 minutes = 3600 seconds.
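In meta-tag form this would presumably be (the value being the number of seconds calculated above):

```
<meta http-equiv="expires" content="3600">
```

Strictly speaking, the expires header expects an HTTP date rather than a number of seconds; a value a client cannot parse is generally treated as "already expired".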
The robot is instructed to revisit this site in 14 days, and not within one day which is the usual routine.
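The corresponding tag is the informal revisit-after meta tag, which not all robots honor:

```
<meta name="revisit-after" content="14 days">
```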
Best regards
Mike