
Beginner scraping script with PHP and cURL


Brandon


So this is my first little mini tutorial here. Hope someone will like it/find it useful.

 

Basically, what we are going to do is scrape some data from a remote website using PHP and cURL. cURL is a "client URL transfer library" for making all sorts of remote requests. It is very useful for things like fetching data, logging in automatically, and filling out forms.

 

Let's get cracking!

 

First and foremost, we have to enable the cURL extension, as it is not enabled by default. On a Windows machine, edit your php.ini file, uncomment ;extension=php_curl.dll (remove the leading semicolon), and restart your server. If you are using Ubuntu, run sudo apt-get install php5-curl (the package is called php-curl on newer releases) and restart the server.

I use a WAMP server at home, and it is super easy to enable extensions on it:

Go to the icon tray in the bottom right corner of your screen -> left click the WAMPSERVER icon -> PHP -> PHP extensions -> click on php_curl, then restart the server. Voila!
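
If you are not sure whether the extension actually got enabled, a quick check like the one below should tell you. This is just a minimal sketch; the function name curl_is_available is made up for this example.

```php
<?php
// Returns true when the cURL extension is loaded in the running PHP build.
// (curl_is_available is a hypothetical helper name, not part of PHP.)
function curl_is_available(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

echo curl_is_available()
    ? "cURL is enabled\n"
    : "cURL is missing: check php.ini and restart the server\n";
```

Run it through the same server you restarted, since the command line and the web server can use different php.ini files.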

 

Alright, now we are going to initialize cURL, make a request to another site, and display the HTML with an echo:

 

<?php
$url = "http://www.nytimes.com/";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
?>

 

Now that we have the HTML inside $result, we can extract the data we are after using regular expressions. In this case I took a regex from http://regexlib.com/ for extracting links and modified it a little to make it work. You can just comment out the previous echo $result; and paste these 2 lines in there.

 

preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']*(.*?)[\"\'].*?>([^<]+|.*?)?<\/a>/is", $result, $match, PREG_SET_ORDER);
print_r($match);
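
To see what ends up in $match without hitting a live site, here is a small sketch that runs the same regex on a hard-coded string. The sample markup and the $links array are my own additions for illustration. With PREG_SET_ORDER, each entry of $match is one link: index 1 holds the href and index 2 holds the link text.

```php
<?php
// Sample markup standing in for the downloaded page (made up for this example).
$result = '<p><a href="http://example.com/a">First</a> and <a class="x" href=\'/b\'>Second</a></p>';

preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']*(.*?)[\"\'].*?>([^<]+|.*?)?<\/a>/is", $result, $match, PREG_SET_ORDER);

// Reshape the raw matches into something easier to read.
$links = [];
foreach ($match as $m) {
    $links[] = ['url' => $m[1], 'text' => $m[2]];
}
print_r($links);
```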

 

That is how this stuff works. Pretty easy, basic stuff. You can read more about PHP's cURL support here: http://php.net/manual/en/book.curl.php. I especially recommend the curl_setopt part, where you can do all kinds of cool stuff like setting a user agent, a referrer, and cookies to make your request look like it comes from an actual user.
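
As a rough sketch of those curl_setopt options, something like this could work. The helper name browser_like_options and all the values passed to it are placeholders I picked for the example; the CURLOPT_* constants are the real ones from the manual.

```php
<?php
// Builds cURL options that make a request look more like a browser's.
// (browser_like_options is a hypothetical helper; the values are placeholders.)
function browser_like_options(string $userAgent, string $referer, string $cookie): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent, // sent as the User-Agent header
        CURLOPT_REFERER        => $referer,   // sent as the Referer header
        CURLOPT_COOKIE         => $cookie,    // sent as the Cookie header
    ];
}

$options = browser_like_options(
    'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0',
    'http://www.example.com/',
    'visited=1'
);

// Usage (performs a real request, so it is left commented out):
// $ch = curl_init('http://www.example.com/page');
// curl_setopt_array($ch, $options);
// $html = curl_exec($ch);
// curl_close($ch);
```

Passing everything through curl_setopt_array is just a convenience; separate curl_setopt calls do the same thing.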

 

Any questions or suggestions, just fire away in the thread!

 

More information on cURL: http://curl.haxx.se/

Link to comment
Share on other sites

  • Administrators

I'm literally working on some php screen scraping right now and banging my head against the wall :P

 

Mind giving me a hand?

 

Here is my current code:


<?php
$data = file_get_contents('http://www.facebook.com/pages/The-Quota/312465117876');

$regex = '#<span class="uiNumberGiant fsxxl fwb">(.+?)</span>#';
preg_match($regex,$data,$match);
var_dump($match); 
echo $match[1];
?>

 

It returns: array(2) { [0]=> string(51) "11,822" [1]=> string(6) "11,822" } 11,822

 

I only want to return the # of likes the page has so 11,822

 

What am I doing wrong?

Link to comment
Share on other sites

I'll do my best!

 

I'm sorry but I don't really understand what you are after.

It returns: array(2) { [0]=> string(51) "11,822" [1]=> string(6) "11,822" } 11,822

 

I only want to return the # of likes the page has so 11,822

 

What am I doing wrong?

 

What do you mean by "#"? Because it seems to me that you are getting the right result if 11,822 is the number of likes.

 

I didn't get your code to work for some reason but this regex works for me:

#<span class="uiNumberGiant fsxxl fwb">(.*?)</span#is

Link to comment
Share on other sites

  • Administrators

Well when I do my

 

echo $match[1];

It returns:

array(2) { [0]=> string(51) "11,822" [1]=> string(6) "11,822" } 11,822

 

I don't need all that garbage info it returns. I want to cut that down so it only echoes out "11,822".

Link to comment
Share on other sites

  • Administrators

Ah I see

Edit out or comment out:

var_dump($match);

When I do that I just get a blank screen for results...

 

 

Here is my result set with my code: http://Digitize Design.net/test/likes.php As you can see, all that extra garbage is in there instead of it just returning the number of likes.

 

Your code with that commented out is here: http://Digitize Design.net/test/brandon.php Shows nothing.

Link to comment
Share on other sites

That is just weird because the

array(2) { [0]=> string(51) "11,822" [1]=> string(6) "11,822" }

is from the var_dump and the trailing 11,822 is from the echo statement.
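
Here is a small sketch of why the var_dump shows string(51) for something that looks like "11,822" in the browser. The sample $data string is my assumption about what the matched part of the page source looks like. $match[0] is the whole matched tag, and the browser renders the tag instead of showing it.

```php
<?php
// The span as it presumably appears in the page source (sample data for illustration).
$data  = '<div><span class="uiNumberGiant fsxxl fwb">11,822</span></div>';
$regex = '#<span class="uiNumberGiant fsxxl fwb">(.+?)</span>#';
preg_match($regex, $data, $match);

echo strlen($match[0]), "\n";            // length of the whole matched tag
echo strlen($match[1]), "\n";            // length of the captured number only
echo htmlspecialchars($match[0]), "\n";  // escapes the tag so a browser displays it
```

So $match[0] is the full span element and $match[1] is only the capture group, which is the part you want to echo.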

 

When I try your code, Facebook denies me the page because I don't use a proper user agent. In brandon.php, what do you get if you

echo $data;

?

Link to comment
Share on other sites

  • Administrators

When I try your code, Facebook denies me the page because I don't use a proper user agent. In brandon.php, what do you get if you

echo $data;

?

Give it a try now; I modified it. The problem you are having is because you need to change a setting in your php.ini.

Link to comment
Share on other sites

Very strange, and you are keeping

echo $match[1];

in brandon.php right?

 

If you have cURL enabled, this works for me:

<?php
$url = "http://www.facebook.com/pages/The-Quota/312465117876";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);  
$result = curl_exec($ch);
curl_close($ch);
//echo $result;
$regex = '#<span class="uiNumberGiant fsxxl fwb">(.*?)</span#is';
preg_match($regex,$result,$match);
//var_dump($match);
echo $match[1];
?>

Link to comment
Share on other sites

  • Administrators

Yes I did keep that, but still no luck.

 

So I just tried your code using cURL and that works perfectly!

http://Digitize Design...est/brandon.php

 

I wonder though, how come it doesn't work on some Facebook pages? Such as when I change the URL to:

<?php
$url = "http://www.facebook.com/RockerLips";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
$result = curl_exec($ch);
curl_close($ch);
//echo $result;
$regex = '#<span class="uiNumberGiant fsxxl fwb">(.*?)</span#is';
preg_match($regex,$result,$match);
//var_dump($match);
echo $match[1];
?>

Link to comment
Share on other sites

  • Administrators

Ah that must be the problem, thanks again.

 

So now I will have to come up with a way for this script to login with my credentials to Facebook first I guess. If you know of a way or tutorials for this please let me know :P

Link to comment
Share on other sites

I have never done one myself for Facebook, and I guess it will be tricky. Sites often add some random JavaScript redirect stuff to make it harder to log in, and cracking that can be a daunting task. Finding good tutorials about this can be a real pain, unfortunately. I can see if I can crack it, but I can't really promise anything.

 

Get Firebug and read up on POSTing with cURL in the meantime (you will have to save cookies in a cookies.txt file as well). With Firebug you can use the Net tab to collect and save all the calls going back and forth when loading pages. Then try to mimic those exact request headers with cURL. It is actually pretty fun when you get the hang of it.
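
To give you a starting point, here is a rough sketch of the options for a login POST that keeps its cookies in a file. The helper name login_options and the field names email and pass are pure guesses for illustration, so copy the real field names and URL from what Firebug shows you.

```php
<?php
// Options for a login POST that stores session cookies in a file.
// (login_options and the field names are hypothetical; take the real ones from Firebug.)
function login_options(array $fields, string $cookieFile): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // urlencodes and joins with &
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here on later requests
    ];
}

$options = login_options(
    ['email' => 'me@example.com', 'pass' => 'secret'],
    __DIR__ . '/cookies.txt'
);
echo $options[CURLOPT_POSTFIELDS], "\n";

// Usage (performs a real request, so it is left commented out):
// $ch = curl_init('http://www.example.com/login.php');
// curl_setopt_array($ch, $options);
// curl_exec($ch);
// curl_close($ch);
```

After the login request, reuse the same cookie file on the next request for the page you actually want to scrape.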

Link to comment
Share on other sites

  • 2 years later...

Hi friend,

 

 

I am new to cURL. I can scrape data from a normal website using cURL, but I need to scrape a website that requires a login.

 

For example: https://www.nmddata.com.au

 

This site requires a login. I need to scrape the following URL and save the result as HTML.

 

https://www.nmddata.com.au/members/search_adv_do.php?state=New+South+Wales&region=&bed=&cat=&pricemin=&pricemax=&orderby=date&updown=desc&submit=Sort

 

I have tried the following code:

 

<?php
$name_val     = urlencode('mani');
$password_val = urlencode('roshan');
$region_val = urlencode('Sydney City');
$sector_val = urlencode('Residential');
//$message_val  = urlencode('This is a test & = +');
// Every field must be separated by "&", including Region and Sector.
$str = "Name=".$name_val."&Password=".$password_val."&Region=".$region_val."&Sector=".$sector_val;
//print $str;
 
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,'');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($ch, CURLOPT_USERPWD, "mani:roshan");

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);   // no verify
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $str);
$data = curl_exec($ch);
curl_close($ch);
file_put_contents('testurl.html',$data);
echo"<h3>Data Saved Successfully</h3>";
//echo $data;
?>

 

But I can't get the logged-in data. The saved page only shows a "join now" link.

 

If possible, please suggest a fix.

 

Link to comment
Share on other sites
