How to make a Bot Trap using php

Post by Saman » Fri Aug 19, 2011 4:23 pm

Many bad bots either carry well-known bad user-agent names or ignore the robots.txt standard (http://www.robotstxt.org/wc/exclusion.html). Most modern bad bots no longer announce themselves: they hide behind faked UA strings, either other bots' names or browser UA strings such as "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)". Some old, silly ones still do, for example "Internet Explore 5.x" or "Mozilla/3.0 (compatible; Indy Library)".

The Indy Library is a legitimate programming library, available at http://www.nevrona.com/Indy or http://indy.torry.net under an Open Source license. It is included with Borland Delphi 6 and 7, C++Builder 6, and all of the Kylix versions. Unfortunately, this library has been hijacked and abused by some Chinese spam bots. All recent user-agents with the unmodified "Indy Library" string were of Chinese origin.
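For the old-style bots that still announce themselves, a simple substring check on the user-agent can be enough. The following is only a sketch: the function name has_bad_user_agent and the $bad list are my own illustration, with the two substrings taken from the examples above.

```php
<?php
// Hypothetical helper: true when $userAgent contains any known bad substring.
function has_bad_user_agent($userAgent, array $badSubstrings) {
    foreach ($badSubstrings as $needle) {
        // stripos() is case-insensitive; skip empty needles, which would match everything.
        if ($needle !== '' && stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// "Internet Explore 5." (with the space) does not match a correctly spelled
// "Internet Explorer 5." string. Extend the list from your own logs.
$bad = array('Indy Library', 'Internet Explore 5.');
```

This catches only the careless bots; anything hiding behind a faked browser string passes straight through, which is why the trap below is needed.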

The following may help against rude and robots.txt-ignorant bots. However, there are more bad but smarter bots out there, which need more sophisticated countermeasures.

Trap them in a special /bot-trap directory:
Disclaimer: The following scripts work for me. I share them just for your information. If you use them, you do so at your own risk. Be careful not to break your web site or web server.

[*]Set up a special subdirectory /bot-trap (choose your own name),


[*]Put an exclude statement into your /robots.txt,

User-agent: *
Disallow: /bot-trap/

[*]Include a hidden link (a transparent 1x1 pixel GIF) at the beginning of your main entrance page /

<a href="/bot-trap/"><img src="images/pixel.gif" border="0" alt=" " width="1" height="1"></a> 
Then watch your web server's log to see who gets trapped. Most human users will never see this link, and good bots (like googlebot, etc.) honour the robots.txt directives and won't visit your /bot-trap.

Caveats: this method is not bullet-proof, and there may be collateral damage. Check the trapped addresses regularly and promptly, so that you can research and unblock any that turn out to be special but innocent user tools rather than bots.

In my experience, about 1 or 2 out of 100 hits are false positives.
To prevent certain user-agents or IP address ranges from being trapped, you can add some form of whitelisting.
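One possible way to do the whitelisting is to compare the visitor's address against a few CIDR ranges and the user-agent against a few substrings. This is a minimal sketch, not part of the original scripts: ip_in_cidr and is_whitelisted are names I made up, it handles IPv4 only, and it assumes a 64-bit PHP build.

```php
<?php
// Hypothetical helper: true when IPv4 address $ip lies inside $cidr,
// e.g. "10.22.33.0/24". A bare address is treated as a /32.
function ip_in_cidr($ip, $cidr) {
    list($subnet, $bits) = explode('/', $cidr . '/32');
    $ipLong = ip2long($ip);
    $subnetLong = ip2long($subnet);
    $bits = (int) $bits;
    if ($ipLong === false || $subnetLong === false || $bits < 0 || $bits > 32) {
        return false;
    }
    // Build the netmask; a /0 mask matches every address.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// Hypothetical helper: true when the visitor matches a whitelisted
// address range or carries a whitelisted user-agent substring.
function is_whitelisted($ip, $userAgent, array $ranges, array $agents) {
    foreach ($ranges as $cidr) {
        if (ip_in_cidr($ip, $cidr)) {
            return true;
        }
    }
    foreach ($agents as $needle) {
        if ($needle !== '' && stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

In the trap script you would call is_whitelisted() first and exit early when it returns true. Keep in mind that a user-agent is trivially faked, so an address range is the more trustworthy of the two checks.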


[*]Additionally, we have put a /bot-trap/index.php in place to notify the hostmaster by mail and automatically append the bot's IP address to an active blacklist file, blacklist.dat. Before the first run, create an empty ../blacklist.dat file and make it readable and writable for the web server. Here is the text of the /bot-trap/index.php:

<html>
<head><title> </title></head>
<body>
<p>There is nothing here to see. So what are you doing here ?</p>
<p><a href="http://your.domain.tld/">Go home.</a></p>
<?php
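  /* whitelist: stop processing and exit */
  /* the pattern is anchored so that e.g. 110.22.33.44 does not match too */
  if (preg_match("/^10\.22\.33\.44$/",$_SERVER['REMOTE_ADDR'])) { exit; }
  /* preg_match() needs delimiters around the pattern; the agent header may be missing */
  if (preg_match("/Super Tool/",isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "")) { exit; }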
  /* end of whitelist */
  $badbot = 0;
  /* scan the blacklist.dat file for addresses of SPAM robots
     to prevent filling it up with duplicates */
  $filename = "../blacklist.dat";
  $fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
  while ($line = fgets($fp,255)) {
    $u = explode(" ",$line);
    $u0 = $u[0];
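    /* exact comparison: used as a regex, the stored address would treat "." as a wildcard and match substrings */
    if (trim($u0) === $_SERVER['REMOTE_ADDR']) {$badbot++;}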
  }
  fclose($fp);
  if ($badbot == 0) { /* we just see a new bad bot not yet listed ! */
  /* send a mail to hostmaster */
    $tmestamp = time();
    $datum = date("Y-m-d (D) H:i:s",$tmestamp);
    /* set these to your own sender and recipient addresses */
    $from = "[email protected]";
    $to = "[email protected]";
    $subject = "domain-tld alert: bad robot";
    /* array elements with quoted keys must be wrapped in {} inside double quotes */
    $msg = "A bad robot hit {$_SERVER['REQUEST_URI']} $datum \n";
    $msg .= "address is {$_SERVER['REMOTE_ADDR']}, agent is {$_SERVER['HTTP_USER_AGENT']}\n";
    mail($to, $subject, $msg, "From: $from");
  /* append bad bot address data to blacklist log file: */
    $fp = fopen($filename,'a+');
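    /* referer and agent headers are optional, so fall back to "-" as web server logs do */
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : "-";
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "-";
    fwrite($fp,"{$_SERVER['REMOTE_ADDR']} - - [$datum] \"{$_SERVER['REQUEST_METHOD']} {$_SERVER['REQUEST_URI']} {$_SERVER['SERVER_PROTOCOL']}\" $referer $agent\n");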
    fclose($fp);
  }
?>
</body>
</html>
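If several bots hit the trap at the same moment, two requests may write to blacklist.dat at once. A hedged sketch of a safer append follows; append_to_blacklist is a hypothetical helper of mine, not part of the script above. It relies on file_put_contents() with the FILE_APPEND and LOCK_EX flags, and it strips line breaks so that a crafted header cannot smuggle extra entries into the file.

```php
<?php
// Hypothetical helper: append one entry to the blacklist under an exclusive lock.
// Returns true on success, false on an invalid address or a failed write.
function append_to_blacklist($file, $ip, $logLine) {
    // Refuse anything that is not a valid IP address.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return false;
    }
    // Replace line breaks so one request can only ever add one line.
    $logLine = str_replace(array("\r", "\n"), ' ', $logLine);
    $written = file_put_contents($file, "$ip $logLine\n", FILE_APPEND | LOCK_EX);
    return $written !== false;
}
```

The address stays the first space-separated field of each line, so the lookup code that splits on " " keeps working unchanged.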

[*]To exclude those known bad bots from further visits of my pages, all my active PHP pages begin with a check of the blacklist.dat file by including this statement at the beginning:

<?php include($_SERVER['DOCUMENT_ROOT'] . "/blacklist.php"); ?>
Here is the text of the blacklist.php include file to exclude the bad ones:

<?php
$badbot = 0;
/* look for the IP address in the blacklist file */
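/* an absolute path, so the lookup works from pages at any directory depth;
   this is the same file that /bot-trap/index.php opens as ../blacklist.dat */
$filename = $_SERVER['DOCUMENT_ROOT'] . "/blacklist.dat";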
$fp = fopen($filename, "r") or die ("Error opening file ... <br>\n");
while ($line = fgets($fp,255))  {
  $u = explode(" ",$line);
  $u0 = $u[0];
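  /* exact comparison: used as a regex, the stored address would treat "." as a wildcard and match substrings */
  if (trim($u0) === $_SERVER['REMOTE_ADDR']) {$badbot++;}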
}
fclose($fp);
if ($badbot > 0) { /* this is a bad bot, reject it */
  sleep(12);
  print ("<html><head>\n");
  print ("<title>Site unavailable, sorry</title>\n");
  print ("</head><body>\n");
  print ("<center><h1>Welcome ...</h1></center>\n");
  print ("<p><center>Unfortunately, due to abuse, this site is temporarily not available ...</center></p>\n");
  print ("<p><center>If you feel this is in error, send a mail to the hostmaster at this site,<br>
         if you are an anti-social ill-behaving SPAM-bot, then just go away.</center></p>\n");
  print ("</body></html>\n");
  exit;
}
?>
If you try to locate and access /bot-trap yourself, your own IP address will be blacklisted immediately.
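The lookup loop appears in both scripts, so it could be pulled into one function. Below is a minimal sketch under my own assumptions: is_blacklisted is a hypothetical name, each line of the file starts with the address followed by a space, and an unreadable blacklist is treated as blocking nobody.

```php
<?php
// Hypothetical helper: true when $ip is the first field of any line in $file.
function is_blacklisted($ip, $file) {
    $lines = @file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return false; // assumption: a missing or unreadable file blocks nobody
    }
    foreach ($lines as $line) {
        $fields = explode(' ', $line);
        // Strict string comparison, so 1.2.3.4 does not also block 11.2.3.45.
        if ($fields[0] === $ip) {
            return true;
        }
    }
    return false;
}
```

Reading the whole file on every request is fine while the blacklist is small. Once it grows to many thousands of lines, moving the rejected addresses into the web server's own deny rules is probably the better option.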

Article courtesy of Kloth