Verifying search bots with forward and reverse DNS in PHP

The following function returns true if the current user is one of the bots we consider “good.” (Not that they are better than other legit bots; they are just the ones we provide a special service to.)

This code can be used to allow search engines to index content that is normally behind a registration form. For a good user experience, though, it should be used in conjunction with a “first click free” type implementation (which I’m still working on).

In a self-hosted WordPress install you can put this in your theme’s functions.php file and then call it in the loop as part of the decision to output the_excerpt() or the_content().


# Is the remote client (the browser or bot requesting the page that calls
# this function) one of the search bots that we would like to serve
# alternate content to? (e.g. should they get the full text version of
# content pages, or should we show them the preview + registration form?)
function is_good_bot()
{
    
    # to avoid unnecessary lookups, only check if the UA matches one of
    # the bots we like (the header can be missing, so default to '')
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if(
        preg_match("/Yahoo! Slurp/i", $ua) ||
        preg_match("/googlebot/i", $ua) ||
        # for testing purposes, put something from your current user
        # agent string in below
        # preg_match("/2009042315/", $ua) ||
        preg_match("/msnbot/i", $ua)
        )
    {
    
        # user agent contains one of the magic phrases, now do a
        # forward and reverse DNS check.
        # each of the search providers that we use asserts that their
        # bot hostnames will always end in the strings in the
        # preg_match calls below.
        # checking both forward and reverse makes IP address / hostname
        # spoofing very hard.
        $ip = $_SERVER['REMOTE_ADDR'];
        $hostname = gethostbyaddr($ip);
        $ip_by_hostname = gethostbyname($hostname);
        if ($ip_by_hostname == $ip) {
            if(
                preg_match("/\.googlebot\.com$/", $hostname) ||
                preg_match("/\.search\.msn\.com$/", $hostname) ||
                # testing: enter your hostname here
                # preg_match("/\.example\.com$/", $hostname) ||
                preg_match("/\.crawl\.yahoo\.net$/", $hostname)
                )
            {
                # good bot. 
                return true;
            } else {
                # bad bot, and possible bad person all around.
                return false;
            }
        } else {
            # bad bot, and possible bad person all around.
            return false;
        }

    } else {
        # If the UA of a preferred bot isn't present, just skip the 2x
        # DNS checks
        return false;
    }     
}
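
One drawback of the function above is that it is hard to test without real DNS, and gethostbyname() only returns a single address even when a hostname has several A records. Here is a minimal sketch of the same forward/reverse idea with the lookups passed in as parameters. The names host_has_bot_suffix() and is_verified_bot_ip() are made up for this example, and switching to gethostbynamel() is my assumption about what would be more robust, not something the original code does.

```php
<?php
// Sketch only: same forward/reverse check as above, with the DNS calls
// injectable so the logic can be exercised without network access.
// host_has_bot_suffix() and is_verified_bot_ip() are hypothetical names.

function host_has_bot_suffix($hostname)
{
    // Suffixes taken from the preg_match checks above. The leading dot
    // keeps a name like "notgooglebot.com" from matching.
    $suffixes = array('.googlebot.com', '.search.msn.com', '.crawl.yahoo.net');
    $hostname = strtolower(rtrim($hostname, '.'));
    foreach ($suffixes as $suffix) {
        if (substr($hostname, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

function is_verified_bot_ip($ip, $reverse = 'gethostbyaddr', $forward = 'gethostbynamel')
{
    // Reverse lookup: gethostbyaddr() returns the unchanged IP when the
    // lookup fails and false on malformed input.
    $hostname = call_user_func($reverse, $ip);
    if (!is_string($hostname) || $hostname === $ip || !host_has_bot_suffix($hostname)) {
        return false;
    }
    // Forward lookup: gethostbynamel() returns an array of IPv4 addresses,
    // or false if the name does not resolve.
    $addresses = call_user_func($forward, $hostname);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

In production you would call is_verified_bot_ip($_SERVER['REMOTE_ADDR']) and let the defaults do the real lookups. In a test you can pass two closures that return canned hostnames and address lists instead.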

Checking IP Ranges for Sanity / Validity

This doesn’t cover a lot of cases, so it is really just a sanity check for IP ranges. It takes two IP addresses, a low one and a high one, makes sure both are valid IPv4 addresses, and then checks (at least I hope it does; anyone want to help me out here?) that the low_ip is ‘lower’ than the high_ip.

This is used by some code that allows a user to specify an IP address range. When the application ‘sees’ a client coming from within a known range, it associates the client with the ‘user’ who set up the range. (Think library systems that restrict content via institutional IPs.)

$low_ip = '18.0.0.0';
$high_ip = '18.255.255.255';

if( validate_ip_range($low_ip, $high_ip) ) {
    print "yes\n";
} else {
    print "no\n";
}

function validate_ip_address($ip)
{
    # ip2long() returns false for strings it can't parse as IPv4; the
    # round trip through long2ip() rejects anything that doesn't come
    # back unchanged
    $long_ip = ip2long($ip);
    return $long_ip !== false && $ip == long2ip($long_ip);
}

function validate_ip_range($low_ip, $high_ip)
{
    if(validate_ip_address($low_ip) && validate_ip_address($high_ip)) {
        $long_low_ip = ip2long($low_ip);
        $long_high_ip = ip2long($high_ip);
        if( ($low_ip == long2ip($long_low_ip)) && ($high_ip == long2ip($long_high_ip)) ) {
            if($long_low_ip <= $long_high_ip) {
                # now check that 1st octet matches in each ip (18.* == 18.*)
                $low_octets = explode(".", $low_ip);
                $high_octets = explode(".", $high_ip);
                if($low_octets[0] != $high_octets[0]) { return false; }
                # for each of the remaining 3 octets, low_ip's octet must be
                # <= high_ip's octet, e.g.
                # 18.151.1.0 - 18.151.1.255
                # 18.57.0.41 - 18.57.0.100
                for($i = 1; $i < 4; $i++) {
                    if($low_octets[$i] > $high_octets[$i]) { return false; }
                }
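                # distance between the two ends, using unsigned values so the
                # math also works where ip2long() returns negative integers
                $num_ips = ( sprintf("%u", $long_high_ip) - sprintf("%u", $long_low_ip) );
                if($num_ips > 65535) { echo "warn: range is larger than a class B (/16) address space\n"; }
                # if we made it this far, should be a valid range for our purposes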
                return true;
            } else {
                return false;
            }
        } else {
            return false;
        }
    } else {
        return false;
    }
}

Attached is a text file disguised as a .doc with the test “harness” for this function and the function suggested by tevis below. For malformed IP ranges such as $low_ip = '18.0.0.14' and $high_ip = '18.255.255.1', the suggestion below doesn’t seem to work, but the more cumbersome process above does correctly indicate that there’s a problem with that kind of IP range.
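
For the other half of the job described earlier (deciding whether a client falls inside a stored range), here is a minimal sketch. ip_in_range() is a hypothetical helper, not part of the code above, and it assumes the range has already passed validate_ip_range().

```php
<?php
// Sketch: does a client IPv4 address fall inside a stored low/high range?
// ip_in_range() is a made-up name for illustration.
function ip_in_range($ip, $low_ip, $high_ip)
{
    $ip_long   = ip2long($ip);
    $low_long  = ip2long($low_ip);
    $high_long = ip2long($high_ip);
    if ($ip_long === false || $low_long === false || $high_long === false) {
        return false;
    }
    // sprintf("%u") yields the unsigned value, which matters on 32-bit
    // PHP where ip2long() can return negative integers. A float holds
    // any 32-bit unsigned value exactly.
    $ip_u   = (float) sprintf('%u', $ip_long);
    $low_u  = (float) sprintf('%u', $low_long);
    $high_u = (float) sprintf('%u', $high_long);
    return $low_u <= $ip_u && $ip_u <= $high_u;
}
```

With the example range from above, ip_in_range('18.7.22.69', '18.0.0.0', '18.255.255.255') returns true and ip_in_range('19.0.0.1', '18.0.0.0', '18.255.255.255') returns false.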

PHP, split string in half / Insert into the middle of a string

There’s got to be an easier way of doing this… I wanted to harden an existing random password generator by inserting a random special character into the middle of the string. Task at hand: split the string into two parts, left and right, and return a string with the special character between the left and right parts.

Here’s what I came up with in PHP:

$str = "somestringofcharacters";
$middle = "^";
$half = (int) (strlen($str) / 2); // cast to int in case the string length is odd
$left = substr($str, 0, $half);
$right = substr($str, $half);
echo $left.$middle.$right;

This is what it would look like in Python (using //, which is integer division in both Python 2 and 3):

>>> s = "somestringofcharacters"
>>> m = "^"
>>> s[:len(s)//2] + m + s[len(s)//2:]
'somestringo^fcharacters'
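
Back in PHP, the “easier way” is probably substr_replace(): with a length of 0 it inserts at the given offset instead of overwriting anything. A minimal sketch, where insert_in_middle() is a made-up helper name:

```php
<?php
// Sketch: insert a string at the midpoint of another string.
// insert_in_middle() is a hypothetical name for illustration.
function insert_in_middle($str, $insert)
{
    // cast to int in case the string length is odd
    $half = (int) (strlen($str) / 2);
    // offset = $half, length = 0: nothing is replaced, $insert is added
    return substr_replace($str, $insert, $half, 0);
}

echo insert_in_middle("somestringofcharacters", "^"), "\n"; // somestringo^fcharacters
```

For an odd-length string the extra character ends up on the right-hand side, same as with the substr() version.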