Creating a Simple Web Crawler in PHP

4/14/2016 10:17:07 PM

In this article I will show you how to create a simple web crawler in PHP.

 

Step 1.

Add an input box and a submit button to the web page, so that a web page address can be entered into the input box and submitted to the crawler script.
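The article does not include the markup for this step. A minimal sketch of such a form might look like the following; the field is named `URL` to match the `$_POST['URL']` lookup used later in Step 6, while the `crawler.php` file name is an assumption for illustration:

```html
<!-- Minimal form that posts the target address to the crawler script.
     "crawler.php" is an assumed file name, not from the article. -->
<form method="post" action="crawler.php">
    <input type="text" name="URL" placeholder="Enter a web page address" />
    <input type="submit" value="Crawl" />
</form>
```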

 

Step 2. 

Regular expressions are needed when extracting data.

 

function preg_substr($start, $end, $str) // extract the text between two regex-delimited markers
{
    $temp = preg_split($start, $str);
    if (!isset($temp[1])) { return false; } // the start pattern was not found
    $content = preg_split($end, $temp[1]);
    return $content[0];
}
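As a quick check, the helper can be exercised on a small HTML snippet; the sample string below is invented for illustration:

```php
<?php
// Extract the substring between two regex-delimited markers.
function preg_substr($start, $end, $str)
{
    $temp = preg_split($start, $str);
    if (!isset($temp[1])) { return false; } // the start pattern was not found
    $content = preg_split($end, $temp[1]);
    return $content[0];
}

// Sample HTML invented for illustration.
$html = '<span id="title">Hello World</span>';
echo preg_substr('/<span id="title">/', '/<\/span>/', $html); // prints "Hello World"
```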

 

Step 3.

String splitting is also useful when extracting data.

 

function str_substr($start, $end, $str) // extract the text between two literal delimiters
{
    $temp = explode($start, $str, 2);
    if (!isset($temp[1])) { return false; } // the start delimiter was not found
    $content = explode($end, $temp[1], 2);
    return $content[0];
}
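This variant is handy when the delimiters are plain strings rather than patterns, as with the `var imageSrc = "` marker used in Step 6. A small check with an invented sample string:

```php
<?php
// Extract the substring between two literal delimiters.
function str_substr($start, $end, $str)
{
    $temp = explode($start, $str, 2);
    if (!isset($temp[1])) { return false; } // the start delimiter was not found
    $content = explode($end, $temp[1], 2);
    return $content[0];
}

// Sample JavaScript snippet invented for illustration.
$js = 'var imageSrc = "http://example.com/a.jpg";';
echo str_substr('var imageSrc = "', '"', $js); // prints "http://example.com/a.jpg"
```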

 

Step 4.

Add a function to save the content of extraction:

 

function writelog($str)
{
    @unlink("log.txt"); // remove the previous log, if any
    $open = fopen("log.txt", "a");
    fwrite($open, $str);
    fclose($open);
}

 

When the extracted content is inconsistent with what is displayed in the browser, it means we have not found the correct regular expression. In that case, open the saved log.txt file and look up the exact string we need to match.

 

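A quick sanity check of the logging helper, writing an invented sample string and reading it back:

```php
<?php
// Overwrite log.txt with the given string; any previous log is deleted first.
function writelog($str)
{
    @unlink("log.txt");
    $open = fopen("log.txt", "a");
    fwrite($open, $str);
    fclose($open);
}

writelog("<html>raw page source</html>");
echo file_get_contents("log.txt"); // prints "<html>raw page source</html>"
```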

 

Step 5.

A function is also needed if you want to capture pictures.

 

function getImage($url, $filename, $dirName, $fileType, $type = 0)
{
    if ($url == '') { return false; }
    // get the default file name from the URL
    $defaultFileName = basename($url);
    // file type
    $suffix = substr(strrchr($url, '.'), 1);
    if (!in_array($suffix, $fileType)) {
        return false;
    }
    // set the file name: fall back to the URL's basename when none is given
    $filename = $filename == '' ? $defaultFileName : $filename;

    // get the remote file resource
    if ($type) {
        $ch = curl_init();
        $timeout = 5;
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $file = curl_exec($ch);
        curl_close($ch);
    } else {
        ob_start();
        readfile($url);
        $file = ob_get_contents();
        ob_end_clean();
    }
    // set the file path: <dirName>/<year>/<month>/<day>/
    $dirName = $dirName . '/' . date('Y') . '/' . date('m') . '/' . date('d') . '/';
    if (!file_exists($dirName)) {
        mkdir($dirName, 0777, true);
    }
    // save the file
    $res = fopen($dirName . $filename, 'w');
    fwrite($res, $file);
    fclose($res);
    return $dirName . $filename;
}
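The function can be exercised without network access, since `readfile()` also accepts a local path; the sketch below writes a fake local "image" first, then saves it through getImage. The sample file name and contents are invented for illustration:

```php
<?php
// getImage from Step 5, reproduced here so the sketch is self-contained.
function getImage($url, $filename, $dirName, $fileType, $type = 0)
{
    if ($url == '') { return false; }
    $defaultFileName = basename($url);
    $suffix = substr(strrchr($url, '.'), 1);
    if (!in_array($suffix, $fileType)) { return false; }
    // fall back to the URL's basename when no file name is given
    $filename = $filename == '' ? $defaultFileName : $filename;
    if ($type) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        $file = curl_exec($ch);
        curl_close($ch);
    } else {
        ob_start();
        readfile($url);
        $file = ob_get_contents();
        ob_end_clean();
    }
    $dirName = $dirName . '/' . date('Y') . '/' . date('m') . '/' . date('d') . '/';
    if (!file_exists($dirName)) { mkdir($dirName, 0777, true); }
    $res = fopen($dirName . $filename, 'w');
    fwrite($res, $file);
    fclose($res);
    return $dirName . $filename;
}

// Demo with a local file instead of a remote URL, so it runs offline.
file_put_contents('sample.jpg', 'fake image bytes');
$saved = getImage('sample.jpg', '', 'img', array('jpg'));
echo $saved; // prints the saved path, e.g. img/<year>/<month>/<day>/sample.jpg
```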

 

Step 6.

We will write the code for extraction. Let’s take a web page from Amazon as an example. Enter a product link. 

 

if ($_POST['URL']) {
    //---------------------example-------------------
    $str = file_get_contents($_POST['URL']);
    $str = mb_convert_encoding($str, 'utf-8', 'iso-8859-1');
    writelog($str);
    //echo $str;
    echo 'Title: ' . preg_substr('/<span id="btAsinTitle"[^>]*>/', '/<\/span>/', $str);
    echo '<br/>';
    $imgurl = str_substr('var imageSrc = "', '"', $str);
    echo '<img src="' . getImage($imgurl, '', 'img', array('jpg')) . '"/>';
}

 

Then we can see the extracted result. Below is the screenshot.

 

 

 

You don't need to code a web crawler anymore if you use an automatic web crawler tool.

 

 

 

 

Author: The Octoparse Team

 

 

 

