How to scrape links on a web page using PHP


Post by Neo » Tue Mar 16, 2010 2:55 am

In this tutorial you will learn how to build a PHP script that scrapes links from any web page.

What You’ll Learn
  1. How to use cURL to get the content from a website (URL)
  2. Call PHP DOM functions to parse the HTML so you can extract links
  3. Use XPath to grab links from specific parts of a page
  4. Store the scraped links in a MySQL database
  5. Put it all together into a link scraper
  6. What else you could use a scraper for
  7. Legal issues associated with scraping content
What You Will Need
  • Basic knowledge of PHP and MySQL
  • A web server running PHP 5
  • The cURL extension for PHP
  • MySQL – if you want to store the links
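Not sure whether your server ticks these boxes? A quick sanity check like the one below will tell you (just a sketch; adjust it to your setup):

Code: Select all

// quick check for the requirements listed above
if (version_compare(PHP_VERSION, '5.0.0', '<')) {
    die('PHP 5 or newer is required.');
}
if (!function_exists('curl_init')) {
    die('The cURL extension for PHP is not installed.');
}
echo 'All good: PHP ' . PHP_VERSION . ' with cURL support.';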


Get The Page Content
cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:

Code: Select all

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number: " . curl_errno($ch);
    echo "<br />cURL error: " . curl_error($ch);
    exit;
}
curl_close($ch);
If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.

Code: Select all

curl_setopt($ch, CURLOPT_URL,$target_url); 
This line determines what URL will be requested. For example, if you wanted to scrape this site you'd set $target_url = "https://robot.lk/";. I won't go into the rest of the options that are set (except for CURLOPT_USERAGENT – see below).

Tip: Fake Your User Agent

Many websites won't play nice with you if you come knocking with the wrong User Agent string. What's a User Agent string? It's part of every request to a web server that tells it what type of agent (browser, spider, etc.) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:

Code: Select all

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); 
This would set cURL’s user agent to mimic Google’s. You can find a comprehensive list of user agents here: User Agents.

Common User Agents
I’ve done a bit of the leg work for you and gathered the most common user agents:

Browser User Agents
  • Firefox (Windows XP) – Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
  • IE 7 – Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
  • IE 6 – Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
  • Safari – Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2
  • Opera – Opera/9.00 (Windows NT 5.1; U; en)
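If you want to keep a few of these handy and switch between them, one simple approach is to drop them into an array (a sketch; the strings are the ones listed above):

Code: Select all

// a few of the user agent strings from the list above
$userAgents = array(
    'firefox' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6',
    'ie7'     => 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)',
    'safari'  => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2',
    'opera'   => 'Opera/9.00 (Windows NT 5.1; U; en)',
);

// pretend to be Firefox for this request
curl_setopt($ch, CURLOPT_USERAGENT, $userAgents['firefox']);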

Using PHP’s DOM Functions To Parse The HTML
PHP provides a really cool tool for working with HTML content: the DOM functions. The DOM functions allow you to parse HTML (or XML) into an object structure (a DOM, or Document Object Model). Let's see how we do it:

Code: Select all

$dom = new DOMDocument();
@$dom->loadHTML($html); 
Wow is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way.

Tip: You may have noticed I put @ in front of loadHTML(); this suppresses the annoying warnings that the HTML parser throws on the many pages that have non-standards-compliant code.
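If the @ feels too blunt, libxml's own error handling (available since PHP 5.1) gives you the same effect and lets you inspect the warnings later if you want to. A minimal sketch:

Code: Select all

// collect parser warnings quietly instead of printing them
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// optionally look at the warnings with libxml_get_errors(), then discard them
libxml_clear_errors();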


XPath Makes Getting The Links You Want Easy
Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you only want links that are within unordered lists. All you have to do is write a query like "/html/body//ul//li//a" and pass it to DOMXPath->evaluate() (there's a snippet for this below). I'm not going to go into all the ways you can use XPath because I'm just learning myself and someone else has already made a great list of examples: XPath Examples. Here's a code snippet that will just get every link on the page using XPath:

Code: Select all

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a"); 
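And to pick up the unordered-list example from earlier, the same evaluate() call with a narrower query will only return links that sit inside <ul> lists (a sketch using the same $dom):

Code: Select all

$xpath = new DOMXPath($dom);

// only links that appear inside unordered lists
$listLinks = $xpath->evaluate("/html/body//ul//li//a");
echo "Found " . $listLinks->length . " links inside unordered lists.";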

Iterate And Store Your Links
Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:

Code: Select all

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url,$target_url);
} 
$hrefs is an object of type DOMNodeList and item() is a function that returns a DOMNode object for the specified index. The index runs from 0 up to $hrefs->length - 1. So we've got a loop that retrieves each link as a DOMNode object.
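As an aside, a DOMNodeList can also be walked with foreach in PHP 5, so the same loop could be written like this if you prefer (equivalent to the for loop above):

Code: Select all

// same iteration as above, just using foreach
foreach ($hrefs as $href) {
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
}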

Code: Select all

$url = $href->getAttribute('href'); 
Each of these nodes is actually a DOMElement, and DOMElement provides the getAttribute() function. getAttribute() returns the value of any attribute on the element (in this case, the href attribute of an <a> tag). Now we've got our URL and we can store it in the database.

We’ll want a database table that looks something like this:

Code: Select all

CREATE TABLE `links` (
  `url` TEXT NOT NULL,
  `gathered_from` TEXT NOT NULL,
  `time_stamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
We'll use a storeLink() function to put the links in the database. I'll assume you know the basics of how to connect to a database (if not, grab a MySQL & PHP tutorial here).

Code: Select all

function storeLink($url, $gathered_from) {
    // escape the values so quotes in a URL don't break the query (or inject SQL)
    $url = mysql_real_escape_string($url);
    $gathered_from = mysql_real_escape_string($gathered_from);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed: ' . mysql_error());
}
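For completeness, here's the kind of connection setup storeLink() assumes. This is just a sketch; the host, user, password, and database name are placeholders you'll need to replace with your own:

Code: Select all

// connect to MySQL before calling storeLink() (placeholder credentials)
$db = mysql_connect('localhost', 'db_user', 'db_password')
    or die('Could not connect: ' . mysql_error());
mysql_select_db('scraper', $db)
    or die('Could not select database: ' . mysql_error());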

Your Completed Link Scraper

Code: Select all

function storeLink($url, $gathered_from) {
    // escape the values so quotes in a URL don't break the query (or inject SQL)
    $url = mysql_real_escape_string($url);
    $gathered_from = mysql_real_escape_string($gathered_from);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed: ' . mysql_error());
}

$target_url = "https://robot.lk/";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number: " . curl_errno($ch);
    echo "<br />cURL error: " . curl_error($ch);
    exit;
}
curl_close($ch);

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url,$target_url);
    echo "<br />Link stored: $url";
} 

What Else Could I Do With This Thing?
The possibilities are limitless. For starters, you might want to store a list of sites that you want scraped in a database and then set up the script so it runs on a regular basis to scrape those sites. You could then compare the link structure over time or maybe republish the links in some sort of directory. Leave a comment below and say what you're using this script for. Here are a few other things people have done with scrapers in the past:
  • Build a search engine from the content you gather. Ex: Google
  • Analyze a site to determine how well it is optimized for keywords (SEO). Ex: SEO Book's Keyword Density Tool.
  • Republish free content dynamically on your website.
  • Create an RSS feed from a website. See Using PHP To Scrape Web Sites As Feeds

Is Scraping Content Legal?
There is no easy answer to this question. Many organizations scrape content from all over the web – Google, Yahoo, Microsoft, and many others. These companies get away with it under fair use and because site owners want to be included in the search results. However, there have been copyright infringement rulings against these companies.

The real answer is that it depends who you scrape and what you do with the content. Basic copyright law gives authors an automatic copyright on everything they create. But the same law permits fair use of copyrighted material. Fair use includes: criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research. But even these uses could be considered copyright infringement in some circumstances. So be careful before you claim “fair use” as your defense!

Here are a couple of sites that have granted you the right to use their content. They do require you to attribute the content to the author or the URL you scraped it from:
  • Wikipedia – GNU Free Documentation License
  • Open Directory Project – Open Directory License
  • Creative Commons – Creative Commons Attribution 3.0

Many sites publish their content under some form of the Creative Commons license. You can search for Creative Commons licensed works here: Creative Commons Search. Remember that it's your responsibility to verify the copyright rules for anything you use, even stuff found using the Creative Commons Search.