Of Evil Bots and Good Ones

Some Context

A recent estimate attributes about half of all internet traffic to bots. If you have any webpage out on the internet and happen to glance at the logs or any other form of web stats once in a while, it does not take much effort to be persuaded of that.

It is mostly a jungle: the minute you put out a static index.html, tons of foreign IP addresses will start probing for a phpmyadmin.php, wp-admin/, some dotfiles, anything framework-ish like Spring or Ruby on Rails, or something weirder that at times is not even possible to identify.

Blocking the Bots

As a matter of fact, it would not be that difficult to block that frenzy. If you have anything like a reverse proxy or a load balancer in front of your precious assets, it is quite doable to use it to kill traffic that has a strange shape.

Putting the "strange shape" into practical terms: it's quite unlikely for a human being to reload the same page multiple time in a short span of seconds to retry a password login.

If the human in question is actually reading what you put on your beloved webpage, it is quite unlikely that they would keep jumping from page to page in less than, say, a few seconds.

I will not go as far as saying that the user agent could be fingerprinted or the like... those timing thresholds alone should be enough to discern real users from bots.

If a bot is smart enough to wait random intervals of minutes, it could still be detected, but at that point you can probably afford such a visitor anyway.
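
Just to make the idea concrete, here is a minimal sketch of such a timing check in Python. The window and the threshold are made-up numbers, and in real life this logic would sit in the proxy rather than in the application:

    import time
    from collections import defaultdict, deque

    # Hypothetical thresholds: a human is unlikely to need more than
    # MAX_HITS page loads from the same address inside WINDOW seconds.
    WINDOW = 10    # seconds
    MAX_HITS = 5   # requests allowed per IP inside the window

    _recent_hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def looks_like_a_bot(ip, now=None):
        """Return True when an address reloads pages faster than a human plausibly would."""
        now = now if now is not None else time.time()
        hits = _recent_hits[ip]
        hits.append(now)
        # Forget requests that fell out of the observation window.
        while hits and now - hits[0] > WINDOW:
            hits.popleft()
        return len(hits) > MAX_HITS

Whatever sits in front of the application would call looks_like_a_bot() on every request and answer with a 429, or silently drop the connection, when it trips; nginx, HAProxy and most load balancers ship rate-limiting facilities that do the same job without custom code.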

The above should be enough to safeguard, for example, an API from abusers.

Bots you need

That being said, you still cannot kill all the bots because, you know... there seem to be good ones.

Granted, then, that such a thing as a "good bot" actually exists: if you are publishing any material on the net, I would argue that you are interested in people accessing it.

And if that is so... search engines are not going to find you unless you allow their spiders to crawl your pages, that is: bots, again.

Put simply: no bots, no traffic.

The convoluted solution to such a conundrum is to basically gather intelligence about what exactly a "good bot" looks like, what kind of behavior it exhibits and, ultimately, what IP addresses it can come from.

Again, this is somewhat doable: the most prominent search engines have started publishing the lists of addresses used by their bots, see for example Google, Bing and DuckDuckGo.
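
As a rough sketch, here is how such a published list could be consumed in Python. The URL and the JSON field names below are how I understand Google's published Googlebot ranges; Bing and DuckDuckGo expose similar lists, so treat the exact endpoint and layout as assumptions to check against each vendor's documentation:

    import ipaddress
    import json
    import urllib.request

    # Googlebot address ranges as published by Google; the URL and the
    # "prefixes"/"ipv4Prefix"/"ipv6Prefix" layout are assumptions to verify.
    GOOGLEBOT_RANGES_URL = "https://developers.google.com/search/apis/ipranges/googlebot.json"

    def load_good_bot_networks(url=GOOGLEBOT_RANGES_URL):
        """Download the published prefixes and parse them into network objects."""
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        networks = []
        for prefix in data.get("prefixes", []):
            cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
            if cidr:
                networks.append(ipaddress.ip_network(cidr))
        return networks

    def is_good_bot(ip, networks):
        """True when the client address falls inside one of the published ranges."""
        address = ipaddress.ip_address(ip)
        return any(address in network for network in networks)

With that in hand, the allowlist check is a call to is_good_bot(client_ip, networks) before letting a self-declared crawler through, refreshing the downloaded ranges every so often.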

Reflections

I will make the point here that, if you are building an allowlist out of those big players' IP addresses, what you are doing is basically reinforcing the information monopoly of the existing search engines.

That means that no alternative to the existing search engines is going to emerge until we stop filtering out everything else: as long as a newcomer's crawler is blocked by default, it just cannot happen.

This is an open issue, and a big one, you might argue.

You definitely want to block nasty bot traffic that, as the estimate above suggests, can account for half of your bandwidth costs.

You are definitely not going to block search engine crawlers, but then you are left with the duty of periodically assessing whether some bot traffic is something you might want to allow or not.

APIs are, of course, a different story: you definitely do not want any unexpected automated client hitting them.

There you can just drop the bad traffic.
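
For completeness, a deny-by-default sketch; the allowlist below is a placeholder, and a real API would more likely key on an API token or a client certificate, but the principle is the same:

    # Placeholder allowlist of known API consumers; a real deployment would
    # rather check an API key, a token or a client certificate.
    KNOWN_CLIENTS = {"203.0.113.10", "203.0.113.11"}

    def handle_api_request(client_ip, handler):
        """Serve known clients, refuse everything else (no crawler exception here)."""
        if client_ip not in KNOWN_CLIENTS:
            return 403, "forbidden"
        return 200, handler()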


