How to scrape web sites as feeds using php

Post by Neo » Tue Mar 16, 2010 3:03 am

Code: Select all

<?php

        $url = 'https://robot.lk/';
        $title = 'ROBOT.LK Engineering Community';
        $description = 'Links';

        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        header('Content-type: text/xml; charset=utf-8', true);

        // Write out the top of the RSS feed using the params above
        echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . "\n";
        echo '<rss version="2.0">' . "\n";
        echo '<channel>' . "\n";
        echo '  <title>' . $title . '</title>' . "\n";
        echo '  <link>' . $url . '</link>' . "\n";
        echo '  <description>' . $description . '</description>' . "\n";

        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt($curl, CURLOPT_TIMEOUT, 2 );                

        $html = curl_exec( $curl );

        $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');   

        curl_close( $curl );

        $dom = new DOMDocument();

        @$dom->loadHTML($html);

        $nodes = $dom->getElementsByTagName('*');

        $date = '';

        foreach($nodes as $node){

                if($node->nodeName == 'h2'){
                        $date =  strtotime($node->nodeValue);
                }

                if($node->nodeName == 'dt'){

                        $inodes = $node->childNodes;

                        foreach($inodes as $inode){

                                if($inode->nodeName == 'a' && $inode->getAttribute('class') == 'permalink'){
                                        // Write one RSS item per permalink anchor
                                        echo '<item>' . "\n";
                                        echo '<title>' . @mb_convert_encoding(htmlspecialchars($inode->getAttribute('title')), 'utf-8') . '</title>' . "\n";
                                        echo '<link>' . $inode->getAttribute('href') . '</link>' . "\n";
                                        if($date){
                                                echo '<pubDate>' . date(DATE_RSS, $date) . '</pubDate>' . "\n";
                                        }
                                        echo '</item>' . "\n";
                                }
                        }
                }
        }

        echo '</channel></rss>';
?>
The first bit sets the params I'll use below in the RSS feed I generate - the original URL, the title of the site, and the description. These are the basics for RSS and will be fine for our purposes. Then, to make sure the script isn't easily bounced, I set a variable that makes the User Agent look like Googlebot. That helps a lot with the many sites out there that check for that and nothing else. (If you're doing any sort of Googlebot check without verifying the IP, you're not being smart.) Next I just echo out the top part of the RSS feed with my params, nothing special. I'm using UTF-8 to encode the XML; this matters when I'm writing out the items later, so keep it in mind.
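
For the curious, "verifying the IP" means a forward-confirmed reverse DNS lookup. Here's a rough sketch of that check - the function name is made up for this example, and in a real request handler you'd pass in $_SERVER['REMOTE_ADDR']:

Code: Select all

<?php
// Rough sketch of a forward-confirmed reverse DNS check for Googlebot.
// looks_like_real_googlebot() is just an illustrative name.
function looks_like_real_googlebot($ip)
{
        // Reverse lookup: genuine Googlebot IPs resolve to *.googlebot.com or *.google.com
        $host = gethostbyaddr($ip);
        if (!$host || !preg_match('/\.(googlebot|google)\.com$/', $host)) {
                return false;
        }
        // Forward-confirm: the host name has to resolve back to the same IP
        return gethostbyname($host) === $ip;
}
?>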

The next chunk is the curl calls that go get the content. PHP has a few ways to do this, but one of the best things about PHP is that much of the functionality it provides is just a light wrapper around native libraries on Linux. In this case, it's using curl to grab the content of the page - and the options are set so it passes my custom user-agent, follows redirects, sends the referer along on those redirects (in case the site watches for that), and hands me back just the page body as a string without the header stuff, as I don't need that info.
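
If you want to be a little more careful about the fetch, you can check for failures before writing out any items. A sketch of what that might look like, assuming the same $curl handle as above:

Code: Select all

<?php
// Sketch: bail out cleanly if the fetch failed, instead of emitting a half-empty feed.
$html = curl_exec($curl);
$status = curl_getinfo($curl, CURLINFO_HTTP_CODE);

if ($html === false || $status >= 400) {
        error_log('Fetch failed: ' . curl_error($curl) . ' (HTTP ' . $status . ')');
        curl_close($curl);
        echo '</channel></rss>';
        exit;
}
?>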

Side note - another great thing about PHP is that you can test it on the command line easily: just type "php scriptname.php" and it'll run your code, and you can see the output (or any errors) right there. You can also make the error output on the command line more verbose, which helps quite a bit when debugging these quick scripts. So as I write my code, I normally stop and echo things out as I go to check that everything is working as it should up to that point.
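
If you want the noisy version while testing, a couple of lines at the top of the script will do it - just remember to pull them out before pointing a feed reader at the script:

Code: Select all

<?php
// Sketch: turn on full error output while testing on the command line.
error_reporting(E_ALL);
ini_set('display_errors', '1');
?>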

Now, the next step is to pass the HTML I got back from curl to the DOM - which, again, is just a light wrapper around libxml - and which works really well for processing HTML docs as well as XML. No messing with regular expressions here: someone else has already figured out all the hard stuff about parsing markup and recovering from errors, so why reinvent that wheel? One tip though: even though the DOM has various ways to specify the encoding, it seems to be happiest when you just convert whatever you're passing it into HTML entities. So that's what I do using the "multi-byte" string converter, mb_convert_encoding(). I use this rather than the regular converter because it makes sure the odd Japanese or Finnish character doesn't get munged and mess up the works. That said, I still put the @ symbol in front of those calls, which tells PHP to suppress any errors, as something is always slightly off.
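
If you'd rather not use the @ on loadHTML(), libxml can buffer those parse warnings for you instead, so you can peek at them while debugging. A quick sketch of that approach:

Code: Select all

<?php
// Sketch: collect libxml's HTML parse warnings instead of suppressing them with @.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
foreach (libxml_get_errors() as $err) {
        error_log(trim($err->message));   // mostly harmless complaints about real-world markup
}
libxml_clear_errors();
?>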

Okay, once we have the DOM, it's just a matter of iterating through the "nodes" (i.e. tags/elements, etc.) until you find what you're looking for. In this case, John has his links in specially marked anchors, which are usually inside dt's. However, the descriptive text isn't easily grabbable, as it's not in a containing DIV or anything, so I decided to ignore it and go with the stuff that's easy to snag - like the title attribute on the anchor tag. By looking for "a" nodes whose "class" attribute is 'permalink', I can write out the individual items from there without much more logic. And as I iterate through the nodes, I also take a second to grab the h2 tags, which seem to hold the date. RSS items don't actually need a pubDate, but it's nice to have, so if I can grab that info from a tag, I do. Another great thing about PHP? The strtotime() function, which will read just about any date/time written in plain English and convert it into a usable timestamp. It's a *very* convenient function to have. Finally, before I write out the title in my item, I need to make sure I convert the ampersands into &amp; entities (that's what the htmlspecialchars() call is for) and convert the outgoing text back into UTF-8 (since the XML declares UTF-8).
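
As an aside, if walking every node feels clunky, DOMXPath can pull out the same anchors directly. A sketch of the item loop done that way - it assumes the markup really does keep the 'permalink' anchors inside dt's, as described above, and skips the pubDate handling:

Code: Select all

<?php
// Sketch: grab the permalink anchors with an XPath query instead of iterating every node.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//dt/a[@class="permalink"]') as $a) {
        echo '<item>' . "\n";
        echo '<title>' . htmlspecialchars($a->getAttribute('title')) . '</title>' . "\n";
        echo '<link>' . $a->getAttribute('href') . '</link>' . "\n";
        echo '</item>' . "\n";
}
?>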

The end result is a reasonable, valid feed that shouldn't have duplicate updates - "good enough" for your feed reader to poll periodically to get the links. Most sites nowadays have feeds, even full feeds, but every once in a while you'll run across a site that is adamant about making you visit to view its content. Since I'm not re-publishing this feed but just using it privately, it's pretty close to just visiting the site anyway (if you're being very morally ambiguous about it).