How to read RSS feed XML Using PHP cURL

Post by **Neo** » Wed Dec 23, 2009 2:42 am

Using PHP cURL

To use cURL on Windows you only need to uncomment the extension in the php.ini file; on Linux (like always) you need to compile PHP with --with-curl, or install your distribution's PHP cURL package. To connect to an RSS feed with cURL, you first need to initialize a cURL resource handle, which is done with:

Code: Select all

$ch = curl_init("http://localhost/curl/rss.xml");

Here the first parameter is, obviously, the URL of the website (the feed, in our case) that you want to connect to. Next we need to set a few connection options using curl_setopt(), which takes three parameters: the cURL resource we created earlier, the option key, and the option value. For our simple connection we need only two options.

Code: Select all

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);

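The most important one here is CURLOPT_RETURNTRANSFER. By default, curl_exec() prints the response straight to the output and returns only true or false. That is sometimes useful when you just want to send a request and do not need to capture the response; however, in our case we want to parse the XML feed, and that would be quite difficult if we never got it back as a string. With CURLOPT_RETURNTRANSFER set to true, curl_exec() returns the response body instead. Setting CURLOPT_HEADER to 0 keeps the HTTP headers out of that body.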

The next step is to execute the request, wait for a response, and close the handle. It sounds like a lot of work, but it is actually done with only two lines of code:

Code: Select all

$data = curl_exec($ch);
curl_close($ch);
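The three steps above can be put together in one small function. This is only a minimal sketch: fetchFeed() is a name I made up, and the CURLOPT_FOLLOWLOCATION and CURLOPT_TIMEOUT options are my own additions (assumptions about what a typical feed needs), not something the steps above require. The one thing it adds is a check for curl_exec() returning false, which is what happens when the request fails.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the response body as a string.
function fetchFeed(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, false);        // keep HTTP headers out of the body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // assumption: feeds are often redirected
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumption: give up after 10 seconds

    $data = curl_exec($ch);
    if ($data === false) {
        // curl_exec() returns false on failure; curl_error() says why.
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('cURL request failed: ' . $error);
    }

    curl_close($ch);
    return $data;
}

// Usage with the example URL from above; a failure is reported instead of crashing.
try {
    $data = fetchFeed('http://localhost/curl/rss.xml');
    echo 'Fetched ' . strlen($data) . " bytes\n";
} catch (RuntimeException $e) {
    echo $e->getMessage() . "\n";
}
```

Throwing an exception is just one way to report the failure; returning false or null would work too, as long as you check for it before parsing.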

Half of the work is done. If everything went right and the page we connected to contained an RSS feed or some other kind of XML data, then the $data variable contains a string which can now be parsed (if the request failed, curl_exec() returns false instead). A lot of newbies try to parse such a string with regular expressions, or explode the string and pseudo-parse it line by line, removing tags with str_replace(). This is something I do not encourage, not only because it is very unprofessional, but also because since PHP 5.x we have built-in tools for parsing XML data. Using them is not only more pro but also a lot easier than pseudo-parsing line by line.

Working with SimpleXML

Currently the PHP manual describes 13 libraries for handling XML data. That is quite a lot, but most of them are designed to help build XML files rather than parse them, so in our case the best bet is the SimpleXML library, which is not only a good fit for converting a string to an XML object but is also enabled in PHP by default, so there is usually no need to install it. We have our XML string in the $data variable, so all we need to do to parse it is this:

Code: Select all

$doc = new SimpleXMLElement($data, LIBXML_NOCDATA);

Note that we could also pass a third (boolean) parameter to the constructor, which is false by default. If we set it to true, the first argument should be a URL pointing to an XML document instead of the XML data itself, so we would not actually have to use cURL at all.
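One thing worth knowing about the constructor is that it throws an exception when the string is not well-formed XML, which happens easily when a server sends back an HTML error page instead of a feed. Here is a rough sketch of how that could be handled. parseFeedString() is a hypothetical helper name, and the sample feed is made up for illustration.

```php
<?php
// Hypothetical helper: turn an XML string into a SimpleXMLElement,
// or return null when the string is not well-formed XML.
function parseFeedString(string $data): ?SimpleXMLElement
{
    // Collect libxml errors quietly instead of printing PHP warnings.
    libxml_use_internal_errors(true);
    try {
        return new SimpleXMLElement($data, LIBXML_NOCDATA);
    } catch (Exception $e) {
        // The constructor throws when the string cannot be parsed as XML.
        return null;
    }
}

// Made-up sample feed, just for illustration.
$sample = '<rss version="2.0"><channel>'
        . '<title><![CDATA[Demo & Co]]></title>'
        . '</channel></rss>';

$doc = parseFeedString($sample);
if ($doc !== null) {
    echo (string) $doc->channel->title . "\n";
}

// A truncated document is rejected instead of raising a fatal error.
if (parseFeedString('<rss><channel>') === null) {
    echo "Second string is not valid XML\n";
}
```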

Now $doc is an instance of SimpleXMLElement, which basically consists only of fields and arrays: if a node occurs only once in the XML document it is a field, and if it occurs many times it is an array … well, usually. If you want to know what is inside this object, use the old-fashioned:

Code: Select all

print_r($doc);

So far so good, but now comes the hard part. As you know, there are two popular feed standards on the web, RSS and Atom. Each of them has a different structure and, what is worse, different node names. Fortunately, with SimpleXML it is easy to check which one we have.

Code: Select all

if(isset($doc->channel))
{
    parseRSS($doc);
}
if(isset($doc->entry))
{
    parseAtom($doc);
}

All RSS documents have a <channel> node, so if our document contains this node there is a good chance it is an RSS document; on the other hand, if it contains an <entry> node, there is a good chance it is an Atom document. I used if { … } if { … } here instead of if { … } else { … } because the document we parsed may be neither RSS nor Atom. In the code you can also see two functions, parseRSS() and parseAtom(). These are the functions we will use to get data out of SimpleXMLElement objects, and we are going to write them right now.
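One aside first. The check above looks only at child nodes; a slightly stricter variant could also look at the name of the root element, which is <rss> for RSS 2.0 and <feed> for Atom. The sketch below is my own variation, not a replacement for the code above, and detectFeedType() is a hypothetical name. It assumes RSS 2.0 and Atom only, so other formats (RSS 1.0, for example) come back as 'unknown'.

```php
<?php
// Hypothetical helper: guess the feed format from the root element name.
// Assumes RSS 2.0 (<rss><channel>...) and Atom (<feed>...) only.
function detectFeedType(SimpleXMLElement $doc): string
{
    $root = $doc->getName(); // name of the root element, without any prefix

    if ($root === 'rss' && isset($doc->channel)) {
        return 'rss';
    }
    if ($root === 'feed') {
        return 'atom';
    }
    return 'unknown'; // neither format, so the caller should not try to parse it
}

// Made-up minimal documents, just for illustration.
$rss  = new SimpleXMLElement('<rss version="2.0"><channel><title>t</title></channel></rss>');
$atom = new SimpleXMLElement('<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>');
$html = new SimpleXMLElement('<html><body/></html>');

echo detectFeedType($rss) . "\n";
echo detectFeedType($atom) . "\n";
echo detectFeedType($html) . "\n";
```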

Code: Select all

function parseRSS($xml)
{
    echo "<strong>".$xml->channel->title."</strong>";
    $cnt = count($xml->channel->item);
    for($i=0; $i<$cnt; $i++)
    {
        $url   = $xml->channel->item[$i]->link;
        $title = $xml->channel->item[$i]->title;
        $desc  = $xml->channel->item[$i]->description;

        echo '<a href="'.$url.'">'.$title.'</a>'.$desc.'';
    }
}

RSS is much easier to handle than Atom because it does not keep important data in attributes. There is not really much to talk about here: you have access to any node using the simple syntax $xml->node->childNode, and if a node is an array you slightly change the code to $xml->node[$i]->childNode->childChildNode.
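The same node access also works with foreach, which saves the counting and indexing. Below is a small sketch that collects the items into a plain PHP array instead of echoing them. rssItems() is a hypothetical name and the sample feed is made up; the (string) casts are there because every node you read is still a SimpleXMLElement object, not a string.

```php
<?php
// Hypothetical helper: collect the items of an RSS feed into a plain array.
function rssItems(SimpleXMLElement $xml): array
{
    $items = [];
    // Iterating over a repeated node visits every <item> in document order.
    foreach ($xml->channel->item as $item) {
        $items[] = [
            'url'   => (string) $item->link,        // cast each node to a plain string
            'title' => (string) $item->title,
            'desc'  => (string) $item->description,
        ];
    }
    return $items;
}

// Made-up sample feed, just for illustration.
$xml = new SimpleXMLElement(
    '<rss version="2.0"><channel><title>Demo</title>'
    . '<item><title>First</title><link>http://example.com/1</link><description>One</description></item>'
    . '<item><title>Second</title><link>http://example.com/2</link><description>Two</description></item>'
    . '</channel></rss>'
);

foreach (rssItems($xml) as $item) {
    echo $item['title'] . ' -> ' . $item['url'] . "\n";
}
```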

The following example is a bit more complicated, because in order to get the entry URL we need to read an attribute of the <link> node:

Code: Select all

function parseAtom($xml)
{
    echo "<strong>".$xml->author->name."</strong>";
    $cnt = count($xml->entry);
    for($i=0; $i<$cnt; $i++)
    {
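        $entry  = $xml->entry[$i];                // the i-th <entry>, not always the first one
        $urlAtt = $entry->link[0]->attributes();  // attributes of this entry's first <link>
        $url    = $urlAtt['href'];
        $title  = $entry->title;
        $desc   = strip_tags((string) $entry->content);

        echo '<a href="'.$url.'">'.$title.'</a>'.$desc.'';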
    }
}

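Note how we get to the node attributes: $urlAtt = $xml->entry[$i]->link[0]->attributes(). Now $urlAtt can be read like an associative array, where the attribute name is the key and the attribute value is the value for that key. Strictly speaking it is still a SimpleXMLElement, so cast the value to a string if you need a plain string.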
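Atom entries often carry more than one <link> node (for example one with rel="self" and one with rel="alternate"), so the first one is not always the page URL. Here is a rough sketch of picking the right one. atomEntryUrl() is a hypothetical name and the sample feed is made up. A <link> without a rel attribute is treated as "alternate", which is what the Atom specification says to do.

```php
<?php
// Hypothetical helper: return the href of an entry's "alternate" link,
// or an empty string when the entry has no such link.
function atomEntryUrl(SimpleXMLElement $entry): string
{
    foreach ($entry->link as $link) {
        // Attributes can be read with array syntax; a missing one casts to ''.
        $rel = (string) $link['rel'];
        if ($rel === '' || $rel === 'alternate') {
            return (string) $link['href'];
        }
    }
    return '';
}

// Made-up sample feed, just for illustration.
$xml = new SimpleXMLElement(
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>Demo</title>'
    . '<entry><title>First</title>'
    . '<link rel="self" href="http://example.com/feed/1.atom"/>'
    . '<link rel="alternate" href="http://example.com/post/1"/>'
    . '</entry>'
    . '<entry><title>Second</title>'
    . '<link href="http://example.com/post/2"/>'
    . '</entry>'
    . '</feed>'
);

foreach ($xml->entry as $entry) {
    echo (string) $entry->title . ' -> ' . atomEntryUrl($entry) . "\n";
}
```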