Figuring out which traffic is legitimate and which isn’t can be tricky. If a (ro)bot script is written well, and they often are, it appears to be just like any other normal visitor.
The challenge that we’re presented with is being able to spot the bad bots and isolate their traffic from the “good” traffic – the traffic we want to keep.
Bot, or Not?
Bots don’t load pages randomly. They do it for a reason, and that reason depends on the bot’s purpose and what it’s trying to achieve.
If the bot is a web crawler, like Google or Bing, then the reason is clear.
But other bots can have more nefarious motives.
It could be a bot designed to test your login. It may try a dictionary-based attack to “guess” login credentials to gain access to the admin area.
It could be a SPAM bot whose sole mission in life is posting SPAM comments to your blog articles.
Or it could simply be probing your site, building up a picture of it so it has a better idea of how your site is structured and what it’s “made of”. The reasons for doing this are endless, but nothing good is likely to come from it.
Shield Security has protection against various types of requests and uses different tools to thwart bots depending on what exactly is happening. It uses the particular nature of the requests to identify them as coming from a bot, or not.
But in the case of a “probing” bot that’s just browsing your site, it can be difficult to isolate this traffic and “see” a bot request when it looks just like a normal visitor.
We don’t want our sites probed. We don’t want bots building up a picture of our site for later use.
So then, how can we reliably identify bots that don’t look like bots?
Legitimate Bots Follow The Rules; Bad bots don’t care.
There’s a file on all your WordPress sites, called robots.txt. You can see ours here: robots.txt
It’s not a real file, but one that WordPress creates on-the-fly as it’s needed.
The purpose of robots.txt is to provide a list of rules for bots and web crawlers, like Google and Bing. It tells them what they’re allowed to index, and what they need to ignore. The search engines will only index pages that are permitted, and won’t even scan the ones that are disallowed.
Legitimate bots and crawlers will honour these directives.
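As a rough illustration of what honouring these directives looks like from the crawler’s side, here is a minimal sketch using Python’s standard urllib.robotparser module. The rules, the /private-area/ path, the ExampleBot name and the example.com URLs are all made up for the example; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, similar in shape to what a robots.txt might contain.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private-area/
"""


def is_allowed(rules_text: str, user_agent: str, url: str) -> bool:
    """Return True if a rule-following crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler would run a check like this before every request.
allowed_blog = is_allowed(
    EXAMPLE_RULES, "ExampleBot", "https://example.com/blog/"
)
allowed_private = is_allowed(
    EXAMPLE_RULES, "ExampleBot", "https://example.com/private-area/page"
)
```

A legitimate crawler simply skips any URL where the check comes back false. Nothing forces a bot to run this check at all, which is the whole point of what follows.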
Since malicious bots don’t care about the rules, they’ll completely ignore your robots.txt and do whatever they like.
That’s outrageous, you might say, but we can use this to our advantage, as we’ll see a bit later.
There’s another rule that web crawlers look at when they’re indexing your site: rel="nofollow". This attribute is added to web links to tell crawlers not to follow the link to the page behind it, to basically ignore it.
Again, good bots should adhere to these rules; bad bots aren’t going to care.
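To make the nofollow rule concrete, here is a small sketch of how a polite crawler might sort the links on a page, using Python’s standard html.parser module. The LinkCollector class, the split_links helper and the sample markup are hypothetical names invented for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect link targets, separating those marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.follow = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.follow.append(href)


def split_links(html):
    """Return (links to follow, links to leave alone) for a page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.follow, collector.nofollow


SAMPLE = '<a href="/about/">About</a> <a rel="nofollow" href="/trap/">x</a>'
follow_links, nofollow_links = split_links(SAMPLE)
```

A rule-following crawler would only queue the first list. A careless or malicious bot tends to grab every href it finds, without looking at rel at all.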
How to use robots.txt rules and rel="nofollow" to identify bad bots
Now that we know bad bots will blatantly ignore the rules, we can use this to our advantage.
Imagine you placed a link on your page:
- which was hidden, so only bots could see it and human visitors were oblivious
- which had rel="nofollow" set
- which linked to a page on your site that was disallowed by your robots.txt rules.
Now imagine that you saw traffic to that link. How could that traffic possibly have been generated?
The only traffic that would go to that link is from bots that could see the link on your page and ignored all the rules that said “don’t go there”.
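The reasoning above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular plugin implements it: the /example-trap/ path, the function names and the sample log entries are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical path that is hidden from humans, marked nofollow,
# and disallowed in robots.txt.
TRAP_PREFIX = "/example-trap/"


def is_trap_hit(request_url, trap_prefix=TRAP_PREFIX):
    """Return True if the requested URL points into the trap path."""
    return urlsplit(request_url).path.startswith(trap_prefix)


def find_trap_visitors(requests):
    """Given (ip, url) pairs, return the set of IPs that touched the trap."""
    return {ip for ip, url in requests if is_trap_hit(url)}


SAMPLE_LOG = [
    ("203.0.113.5", "https://example.com/blog/hello/"),
    ("198.51.100.7", "https://example.com/example-trap/abc"),
    ("203.0.113.5", "https://example.com/about/"),
]
suspects = find_trap_visitors(SAMPLE_LOG)
```

Because no human can see the link and no rule-following crawler will request it, any IP that lands in the resulting set has identified itself as a bot that ignores the rules.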
Mouse Trap for Bad Bots comes to Shield Security Pro 7.3
Shield Security Pro 7.3+ has a Mouse Trap feature for detecting bad bots. This new feature arrives along with a few other bot detection signals which we’ll discuss elsewhere in more detail.
The Mouse Trap works by leaving a bit of “cheese” in the form of a link that alerts Shield to a bad bot whenever it’s been accessed.
It’s a simple matter of switching on the option and Shield will take care of the rest. When enabled, Shield Security will:
- automatically update your robots.txt file to indicate the virtual links that are not to be indexed.
- automatically include an invisible, fake link at the footer of all pages on your site. This is the cheese to tempt the bad bots.
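For readers who want a feel for what those two pieces might look like, here is a hypothetical sketch. It is not Shield’s actual output: the /example-trap/ path, the inline styling and both function names are assumptions chosen purely for illustration.

```python
from html import escape


def robots_disallow_rule(trap_path):
    """Build a robots.txt fragment telling all crawlers to avoid the trap."""
    return f"User-agent: *\nDisallow: {trap_path}\n"


def hidden_trap_link(trap_path):
    """Build a link that humans won't see and polite bots won't follow."""
    href = escape(trap_path, quote=True)
    return (
        f'<a href="{href}" rel="nofollow" '
        'style="display:none" aria-hidden="true" tabindex="-1">link</a>'
    )


# Hypothetical trap path used for both pieces of the trap.
rule = robots_disallow_rule("/example-trap/")
link = hidden_trap_link("/example-trap/")
```

The important design point is that the same path appears in both places: the robots.txt rule gives good bots fair warning, and the hidden link gives bad bots something to trip over.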
Options for handling bots that nibble on the link cheese
There are 3 options for dealing with any bots that access the fake link in the mouse trap:
- Log the event in the Audit Trail
- Increment the transgression counter for that IP
- Immediately block the IP address
How your site responds is entirely up to you. To err on the side of caution, we recommend one of the more lenient options, such as incrementing the transgression counter. But if your patience is a little thin for bad bots, choose the last option and immediately block them from further probing.
We definitely advocate a more lenient approach in the beginning. Test this feature on your site by assigning black marks against an IP address several times before outright blocking it. In this way you can be sure the feature works well for your site before firming up your protection and instantly blocking offenders.
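The difference between the lenient and strict approaches can be sketched as follows. This is an illustrative model only: the TrapHandler class, the limit of 3 transgressions, and the choice to write to the audit trail on every hit are assumptions for the example, not a description of Shield’s internals.

```python
from collections import Counter
from enum import Enum


class TrapResponse(Enum):
    LOG_ONLY = "log"
    INCREMENT = "increment"
    BLOCK = "block"


class TrapHandler:
    """Toy model of the responses to a trap hit (names are hypothetical)."""

    def __init__(self, response, limit=3):
        self.response = response
        self.limit = limit  # assumed number of black marks before a block
        self.audit_trail = []
        self.transgressions = Counter()
        self.blocked = set()

    def handle_hit(self, ip):
        """Record a trap hit and return True if the IP is now blocked."""
        # Assumption: every hit is logged, whichever response is chosen.
        self.audit_trail.append(f"trap link accessed by {ip}")
        if self.response is TrapResponse.BLOCK:
            self.blocked.add(ip)
        elif self.response is TrapResponse.INCREMENT:
            self.transgressions[ip] += 1
            if self.transgressions[ip] >= self.limit:
                self.blocked.add(ip)
        return ip in self.blocked


lenient = TrapHandler(TrapResponse.INCREMENT, limit=3)
lenient_results = [lenient.handle_hit("198.51.100.7") for _ in range(3)]

strict = TrapHandler(TrapResponse.BLOCK)
strict_result = strict.handle_hit("198.51.100.7")
```

With the lenient handler the IP is only blocked on its third visit to the trap, which leaves room to spot false positives in the log first. The strict handler blocks on the very first hit.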
Please do leave us comments and questions below and we’ll get right back to you.
[At the time of writing, Shield 7.3 hasn’t been released as it’s undergoing final testing. It won’t be long though… ]
[2019/04/10] Update regarding SEO implications
We received some comments about this feature, its potential impact on SEO, and the view Google might take on it with respect to “cloaking”. While this, of course, isn’t bad cloaking, it’s still cloaking: the act of hiding or changing content based on whether a visitor is a Google Bot.
As a result, we’ve changed how the feature operates. Now, it’ll always print the link regardless of whether the visitor is an official web crawler, such as Google Bot, or not. And since legitimate bots honour directives set out in robots.txt, they’ll not follow the links anyway.