Hey all,
I have set up a link scraper in a PHP script (based on some tutorials I found through my trusty friend Google), and the code looks like this:
(I have modified the tutorial code a little so that it only returns links that start with http. Excuse the sloppy code, as I am just messing about with this.)
PHP Code:
<?php
if(isset($_POST['target_url'])){
$target_url = $_POST['target_url'];
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
// parse the html into a DOMDocument
$dom = new DOMDocument();
if (!$html) {
echo '<div class="error"><p align="left">An error occurred. Please make sure you have typed the correct address and try again.</p></div>';
} else {
// @ hides the warnings that malformed markup would otherwise raise
@$dom->loadHTML($html);
}
// grab all the links on the page (none if the request failed)
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
}
?>
<form action="" method="POST" name="frmurl">
<input type="text" name="target_url">
<input type="submit" value="Get links">
</form>
<?php
if(isset($_POST['target_url'])){
$num = 0;
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
//See if the link starts with http (this also matches https)
$tmp = strpos($url, "http");
//If it does then continue
if($tmp === 0){
$fullurl = $url;
//Filter it so only the www.address.com is returned (false or null if the URL is malformed)
$url = parse_url($url, PHP_URL_HOST);
//Trim the host to make sure it has no blanks at either end (verification)
$url = trim((string) $url);
//Check if the host begins with www. - if it does, remove it
$www_exists = strpos($url, "www.");
if($www_exists === 0){
$url = substr($url,4);
}
//Skip links that have no usable host
if($url !== ''){
echo $url . '<br>';
}
}
}
}
?>
The format of the returned sites is just what I want. Using the BBC homepage as an example, it looks like this:
Code:
newsvote.bbc.co.uk
bbc.co.uk
bbc.co.uk
bbc.co.uk
bbc.co.uk
bbc.co.uk
bbc.co.uk
bbc.co.uk
bbc.co.uk
bbc.co.uk
bbc.co.uk
The problem is that, as you can see, 'bbc.co.uk' is returned more than once (obviously there is more than one link to it on the page), and I don't want it to be.
Can anybody help me modify this code so that it does not display duplicates? It should return only:
Code:
newsvote.bbc.co.uk
bbc.co.uk
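To make the goal concrete, here is a minimal sketch of one way it could work, assuming the hosts are collected into an array first and duplicates are dropped with array_unique() before anything is echoed. The function name unique_hosts() and the sample links are hypothetical and not part of the script above:

```php
<?php
// Hypothetical helper (not part of the script above): takes a list of link
// URLs and returns each distinct host once, in order of first appearance.
function unique_hosts(array $urls): array
{
    $hosts = [];
    foreach ($urls as $url) {
        // Keep only absolute links that start with http (also matches https)
        if (strpos($url, 'http') !== 0) {
            continue;
        }
        $host = parse_url($url, PHP_URL_HOST);
        // parse_url() gives false or null when there is no usable host
        if (!is_string($host) || $host === '') {
            continue;
        }
        // Remove "www." only when it is at the very start of the host
        if (strpos($host, 'www.') === 0) {
            $host = substr($host, 4);
        }
        $hosts[] = $host;
    }
    // array_unique() keeps the first occurrence of each value;
    // array_values() renumbers the keys from 0
    return array_values(array_unique($hosts));
}

// Made-up sample links, for illustration only
$links = [
    'http://newsvote.bbc.co.uk/poll',
    'http://www.bbc.co.uk/news',
    'http://www.bbc.co.uk/sport',
    '/relative/link',
    'http://bbc.co.uk/weather',
];

foreach (unique_hosts($links) as $host) {
    echo $host . '<br>';
}
```

In the script above, the same idea would mean pushing each $url into an array inside the for loop instead of echoing it, and then echoing the de-duplicated array after the loop has finished.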
Thanks!