Web Development Thread, Link Scraping in Coding and Web Development
  1. #1

    Hightower's Avatar
    Join Date
    Jun 2008
    Location
    Cloud 9
    Posts
    4,920
    Thank Post
    494
    Thanked 690 Times in 444 Posts
    Rep Power
    242

    Link Scraping

    Hey all,

    I have set up a link scraper (using some tutorials I found from my trusty friend Google) in a PHP script and the code looks like this:

    (I have modified the code a little from tutorials so that it only returns links that start with http - excuse the sloppy code as I am just messing about with this)

    PHP Code:
    <?php
    if (isset($_POST['target_url'])) {
        $target_url = $_POST['target_url'];
        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        // make the cURL request to $target_url
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_URL, $target_url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);
        if (!$html) {
            echo '<div class="error"><p align="left">An error occurred. Please make sure you have typed the correct address and try again.</p></div>';
        }

        // parse the html into a DOMDocument
        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        // grab all the links on the page
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");
    }
    ?>

    <form action="" method="POST" name="frmurl">
        <input type="text" name="target_url">
        <input type="submit" value="Get links">
    </form>

    <?php
    if(isset($_POST['target_url'])){
        $num = 0;
        for ($i = 0; $i < $hrefs->length; $i++) {
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');

            // See if the link starts with http
            $tmp = strpos($url, "http");

            // If it does then continue
            if ($tmp === 0) {
                $fullurl = $url;

                // Filter it so only the www.address.com is returned
                $url = parse_url($url, PHP_URL_HOST);

                // Trim the site to make sure it has no blanks at either end (verification)
                $url = trim($url);

                // Check if site begins with www. - if it does, remove it
                $www_exists = strpos($url, "www.");
                if ($www_exists === 0) {
                    $url = substr($url, 4);
                }

                echo $url . '<br>';
            }
        }
    }
    ?>
    The format of the returned sites is just as I want, and looks like this (using BBC - Homepage as an example):

    Code:
    newsvote.bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    bbc.co.uk
    The problem is that, as you can see, 'bbc.co.uk' is returned more than once (there is obviously more than one link to it on the page), but I don't want it to be.

    Can anybody help me modify this code so it does not display duplicates like this - so it should only be returning:

    Code:
    newsvote.bbc.co.uk
    bbc.co.uk
    Thanks!

  2. #2
    contink's Avatar
    Join Date
    Jul 2006
    Location
    South Yorkshire
    Posts
    3,791
    Thank Post
    303
    Thanked 327 Times in 233 Posts
    Rep Power
    119
    Here you go...

    Find:
    Code:
    <?php
    if(isset($_POST['target_url'])){
        $num = 0;
        for ($i = 0; $i < $hrefs->length; $i++) {
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');
    Replace with:
    Code:
    <?php
    $visited = array();

    if(isset($_POST['target_url'])){
        $num = 0;
        for ($i = 0; $i < $hrefs->length; $i++) {
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');
            // isset() avoids an undefined-index notice the first time a URL is seen
            if(isset($visited[$url])){
                continue;
            } else {
                $visited[$url] = true;
            }

    That's untested but I think it'll work.

  3. #3

    tmcd35's Avatar
    Join Date
    Jul 2005
    Location
    Norfolk
    Posts
    6,069
    Thank Post
    902
    Thanked 1,013 Times in 825 Posts
    Blog Entries
    9
    Rep Power
    350
    Put the URLs into an array and use a loop to check each new one against the previous array entries.

    Code:
    $url = $href->getAttribute('href');
    becomes

    Code:
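    $newurl = $href->getAttribute('href');
    $accepturl = 1;
    // $url must be set to array() and $urlcount to 0 before the outer loop starts
    for ($loop = 1; $loop <= $urlcount; $loop++)
    {
       // reject the new URL if it matches one already stored
       if ($url[$loop] == $newurl) {$accepturl = 0;}
    }
    if ($accepturl != 0)
    {
      $urlcount = $urlcount + 1;
      $url[$urlcount] = $newurl;
    }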
    Then refer to the URL by its array entry in the rest of the code - $url[$urlcount] rather than $url, if that makes sense.

  4. #4

    CESIL's Avatar
    Join Date
    Nov 2006
    Location
    Hampshire
    Posts
    1,405
    Thank Post
    109
    Thanked 267 Times in 198 Posts
    Rep Power
    169
    array_unique() takes an input array and returns a new array without duplicate values.
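For reference, a minimal sketch of how array_unique() could be applied here. The $hosts array is hypothetical sample data standing in for the hostnames the scraper collects; the original script echoes each host directly, so it would first need to gather them into an array like this.

```php
<?php
// Hypothetical sample data: hostnames as they might be collected in the loop.
$hosts = array('newsvote.bbc.co.uk', 'bbc.co.uk', 'bbc.co.uk', 'bbc.co.uk');

// array_unique() keeps the first occurrence of each value, along with its original key.
$unique = array_unique($hosts);

foreach ($unique as $host) {
    echo $host . "<br>\n";
}
```

Collecting first and printing afterwards keeps the de-duplication separate from the link-parsing loop.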

  5. Thanks to CESIL from:

    Hightower (3rd March 2009)

  6. #5

    ZeroHour's Avatar
    Join Date
    Dec 2005
    Location
    Edinburgh, Scotland
    Posts
    5,838
    Thank Post
    974
    Thanked 1,407 Times in 851 Posts
    Blog Entries
    1
    Rep Power
    460
    Could you not use array_unique() at the end or something?
    There are many ways to do this including the in_array function.
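The in_array() route mentioned above might look something like this. It is only a sketch: $hosts is made-up sample data and $seen is an illustrative name, neither comes from the original script.

```php
<?php
// Hypothetical sample data standing in for the scraped hostnames.
$hosts = array('newsvote.bbc.co.uk', 'bbc.co.uk', 'bbc.co.uk');

$seen = array();
foreach ($hosts as $host) {
    // Third argument true makes in_array() use strict (===) comparison.
    if (!in_array($host, $seen, true)) {
        $seen[] = $host;
        echo $host . "<br>\n";
    }
}
```

in_array() scans the whole list on every call, which is fine for the links on a single page but gets slower as the list grows.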

  7. Thanks to ZeroHour from:

    Hightower (3rd March 2009)

  8. #6

    Hightower's Avatar
    Quote, originally posted by contink:
    That's untested but I think it'll work.
    Just tried that one - didn't work
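One likely reason the $visited check made no difference: it is keyed on the full href, and most links on a page are different URLs that merely share a hostname, so every one of them passes the check. A sketch of the same idea keyed on the hostname instead is below. The function name unique_hosts() and the $links input are hypothetical, used only for illustration.

```php
<?php
// Hypothetical helper: returns each distinct hostname once, in first-seen order.
function unique_hosts(array $links)
{
    $visited = array();   // hostnames already seen, used as array keys
    $result  = array();   // hostnames in the order they first appeared

    foreach ($links as $link) {
        // Only consider links that start with http, as in the original script.
        if (strpos($link, 'http') !== 0) {
            continue;
        }
        $host = parse_url($link, PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // parse_url() found no host in this link
        }
        // Strip a leading "www." only when it really is at the start.
        if (strpos($host, 'www.') === 0) {
            $host = substr($host, 4);
        }
        if (isset($visited[$host])) {
            continue;
        }
        $visited[$host] = true;
        $result[] = $host;
    }
    return $result;
}
```

Using the hostname as an array key makes each lookup a single isset() call rather than a scan of everything seen so far.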

  9. #7

    Hightower's Avatar
    Thanks for the help guys - array_unique() worked a treat and I only needed to change a couple of lines.

  10. #8

    Hightower's Avatar
    Ok, I thought it worked a treat, but the two addresses are left at indexes 81 and 141 of the array. How do I get them to move to the start of the array?
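This happens because array_unique() preserves the original keys, so the survivors stay at whatever index they first appeared. One way to renumber them from zero is array_values(). A small sketch with made-up data (the $hosts values are hypothetical):

```php
<?php
// Hypothetical sample data with duplicates scattered through it.
$hosts = array('bbc.co.uk', 'bbc.co.uk', 'newsvote.bbc.co.uk', 'bbc.co.uk');

// Keys of the first occurrences are kept, leaving a gap in the numbering.
$unique = array_unique($hosts);

// array_values() discards the old keys and renumbers from 0.
$reindexed = array_values($unique);

print_r($reindexed);
```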

  11. #9

    Hightower's Avatar
    Doesn't matter - got it by using

    PHP Code:
    array_keys(array_flip($array)); 
    Instead of

    PHP Code:
    array_unique($array); 
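For anyone wondering why the array_flip() version fixes the numbering: flipping turns each value into a key, and since keys must be unique the duplicates collapse into one entry. array_keys() then returns those keys as a fresh list numbered from 0. A short sketch with hypothetical sample values:

```php
<?php
// Hypothetical sample data.
$array = array('bbc.co.uk', 'bbc.co.uk', 'newsvote.bbc.co.uk', 'bbc.co.uk');

// Values become keys; a repeated value overwrites the stored index
// but keeps its original position in the flipped array.
$flipped = array_flip($array);

// The keys of the flipped array are the distinct values, reindexed from 0.
$result = array_keys($flipped);

print_r($result);
```

Note that array_flip() only accepts integer and string values, which is fine for a list of hostnames but would not suit an array holding other types.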


