Bandwidth Limitation for Robots

I noticed the increasing amount of search bots trying to spider my blog. Especially "slurp" and "msnbot" are reaching a critical stage, given my already limited bandwidth. Half of the bandwidth is already wasted on the robots, and the more active and publicly visible my blog becomes, the sooner the bandwidth limit will be reached.

I suggest allowing the users at Blogspirit to create their own robots.txt file to limit, restrict, and possibly exclude the robots' aggressive behavior. For example, an entry of "User-agent: slurp" combined with "Crawl-delay: 20" already slows the robots down, while the meta tag still invites them to index the complete blog (a robots.txt sketch follows after the table below). Here is an example from the "Stats > Detailed Statistics" page:

Robots (from search engines)

Robots             Hits   Bandwidth   Last Visit
slurp               410   17 MB       15/01/2005
msnbot              101   2 MB        15/01/2005
googlebot            70   3 MB        14/01/2005
crawl                24   517 KB      14/01/2005
ia_archiver           5   146 KB      14/01/2005
spider                4   164 KB      11/01/2005
webclipping.com       3   125 KB      13/01/2005
bbot                  1   42 KB       09/01/2005
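
A minimal robots.txt along these lines could look like the sketch below. The crawler names and the 20-second delay come from the figures above; note that Crawl-delay is only honored by some crawlers (Slurp and msnbot respect it, Googlebot does not), and a Disallow rule excludes a robot completely.

  # Sketch only: crawler names and the 20-second delay are taken from the stats above
  User-agent: Slurp
  Crawl-delay: 20

  User-agent: msnbot
  Crawl-delay: 20

  # Exclude a robot completely
  User-agent: ia_archiver
  Disallow: /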

4 Comments
  1. Tweedle DEE said:

    Do you ever get the feeling that someone is watching you, or better yet, looking for you? That is how I feel when I see these on my sites. Spooky…..

  2. Mike Schnoor said:

    Well, in fact… ah – no. I don’t get this feeling. I know these are robots and automated software systems, so there’s nothing to be afraid of. Until now, they haven’t had a chance to threaten my digital existence, you see? :)

  3. Ann said:

    Hi Mike,
    I noticed the major increase in traffic from slurp, etc. as well as lots of commercial links in my detailed status report in January.

    Do I need to worry about intrusions as a result of this traffic?

    So much information at your blog – and so much that I simply cannot understand!

    Best wishes,
    Ann

  4. Mike Schnoor said:

    Dear Ann,

    The robots are basically harmless, since they are nothing but services run by search engine providers like Google. The robots crawl (or "spider") through your whole site and download all the data they can reach; they follow the links and add the content to the search engine's cache (again called "spidering"). Since there is usually no limitation for robots, they continuously crawl the site: the more links and entries you include in your blog, and the more people link to your own site, the more often the robots return. You can restrict them with a robots.txt file in the server's root directory, in this case https://mikeschnoor.com/robots.txt, or by including special meta tags in the HTML. Refer to my source code for an example:

    Here I allow the robots to index the site and instruct them to follow the links.
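
    Such a directive is usually a single meta tag in the page head, along these lines:

      <meta name="robots" content="index,follow" />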

    Here I basically inform the browser (MS IE, Firefox) and any other robot not to cache the site.
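
    That is typically done with tags of this form:

      <meta http-equiv="pragma" content="no-cache" />
      <meta http-equiv="cache-control" content="no-cache" />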

    This expiry is set for 1 hour. 60 seconds x 60 minutes = 3600 seconds.
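
    In the form described here (a number of seconds rather than a full GMT date), the tag would read something like:

      <!-- 3600 seconds = 1 hour -->
      <meta http-equiv="expires" content="3600" />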

    The robot is instructed to revisit this site in 14 days instead of within one day, which is its usual routine.
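
    That hint is usually given with the revisit-after tag:

      <meta name="revisit-after" content="14 days" />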

    Best regards
    Mike

Comments are closed.