Brandon, posted March 7, 2012:

So this is my first little mini tutorial here. I hope someone finds it useful.

Basically, we are going to scrape some data from a remote website using PHP and cURL. cURL is a "client URL" transfer library for making all sorts of remote requests, and it is useful for many things: getting data, logging in automatically, auto-filling forms and so on. Let's get cracking!

First and foremost, we have to enable the cURL extension, because it is not enabled by default.

On a Windows machine, edit your php.ini file, uncomment ;extension=php_curl.dll (remove the leading semicolon) and restart your server.
On Ubuntu, run sudo apt-get install php5-curl and restart the server.
I use a WAMP server at home, and enabling extensions on it is super easy: left-click the WAMPSERVER icon in the bottom-right corner of your screen -> PHP -> PHP extensions -> click php_curl, then restart the server. Voila!

Now we are going to initialise cURL, make a request to another site and display the HTML with an echo:

    <?php
    $url = "http://www.nytimes.com/";
    $ch = curl_init($url);
    // Return the response as a string instead of printing it directly
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    curl_close($ch);
    echo $result;
    ?>

Now that we have the HTML in $result, we can extract the data we are after using regular expressions. In this case I took a regex for extracting links from http://regexlib.com/ and modified it a little to make it work. Just comment out the previous echo $result; and paste these two lines in its place:

    preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']*(.*?)[\"\'].*?>([^<]+|.*?)?<\/a>/is", $result, $match, PREG_SET_ORDER);
    print_r($match);

That is how this works. Pretty easy, basic stuff. You can read more about PHP's cURL support at http://php.net/manual/en/book.curl.php.

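To see what the link regex above puts into the match array without touching the network, here is a minimal sketch that runs it on a hard-coded HTML string. The sample markup and the $links array are made up for illustration; only the pattern comes from the post. With PREG_SET_ORDER, each entry holds the whole match at index 0, the href value at index 1 and the link text at index 2.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<p><a href="http://example.com/a">First</a> and <a class="x" href=\'/b\'>Second</a></p>';

// Same pattern as in the post above.
$pattern = "/<a[\s]+[^>]*?href[\s]?=[\s\"\']*(.*?)[\"\'].*?>([^<]+|.*?)?<\/a>/is";
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Keep only the two capture groups: URL and link text.
$links = [];
foreach ($matches as $m) {
    $links[] = ['href' => $m[1], 'text' => $m[2]];
}

print_r($links);
```

On real pages a regex like this can miss unusual markup, so treat the result as a best-effort list rather than a complete one.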
I especially recommend the curl_setopt page of the manual, where you can do all kinds of cool things such as setting a user agent, a referrer, cookies and a bunch of other options that make your request look like it comes from an actual user.

Any questions or suggestions, just fire away in the thread!

More information on cURL: http://curl.haxx.se/

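As a rough sketch of the curl_setopt options mentioned above, the snippet below sets a user agent, a referrer and a cookie in one go with curl_setopt_array. The function names browserLikeOptions and fetchPage, and every header value, are placeholders invented for this example; none of them come from the thread.

```php
<?php
// Build the option list on its own so it can be inspected without sending anything.
// All header values below are placeholders.
function browserLikeOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // hand the body back as a string
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0',
        CURLOPT_REFERER        => 'http://www.example.com/',
        CURLOPT_COOKIE         => 'lang=en; seen_banner=1', // sent as a Cookie header
        CURLOPT_TIMEOUT        => 15,    // give up after 15 seconds
    ];
}

// Returns the page body as a string, or false if the transfer failed.
function fetchPage(string $url)
{
    $ch = curl_init();
    curl_setopt_array($ch, browserLikeOptions($url));
    $result = curl_exec($ch);
    if ($result === false) {
        error_log('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $result;
}
```

A call such as $html = fetchPage('http://www.nytimes.com/'); would then replace the curl_init ... curl_close lines of the tutorial.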
Nathan (Administrator), posted March 7, 2012:

I'm literally working on some PHP screen scraping right now and banging my head against the wall. Mind giving me a hand? Here is my current code:

    <?php
    $data = file_get_contents('http://www.facebook.com/pages/The-Quota/312465117876');
    $regex = '#<span class="uiNumberGiant fsxxl fwb">(.+?)</span>#';
    preg_match($regex, $data, $match);
    var_dump($match);
    echo $match[1];
    ?>

It returns:

    array(2) { [0]=> string(51) "11,822" [1]=> string(6) "11,822" } 11,822

I only want to return the # of likes the page has, so 11,822. What am I doing wrong?

Brandon, posted March 7, 2012:

I'll do my best! I'm sorry, but I don't really understand what you are after.

> It returns: array(2) { [0]=> string(51) "11,822" [1]=> string(6) "11,822" } 11,822 I only want to return the # of likes the page has, so 11,822. What am I doing wrong?

What do you mean by "#"? It seems to me that you are getting the right result, if 11,822 is the number of likes. I couldn't get your code to work for some reason, but this regex works for me:

    #<span class="uiNumberGiant fsxxl fwb">(.*?)</span#is

Nathan (Administrator), posted March 7, 2012:

Well, when I do my echo $match[1]; it returns:

    array(2) { [0]=> string(51) "11,822" [1]=> string(6) "11,822" } 11,822

I don't need all the extra garbage it returns. I want to cut it down so it only echoes "11,822".

Brandon, posted March 7, 2012:

Ah, I see. Remove or comment out:

    var_dump($match);

Nathan (Administrator), posted March 7, 2012:

> Ah, I see. Remove or comment out: var_dump($match);

When I do that I just get a blank screen for results.

Here is my result set with my code: http://Digitize Design.net/test/likes.php
As you can see, all that extra garbage is in there instead of just the number of likes.

Your code with that line commented out is here: http://Digitize Design.net/test/brandon.php
It shows nothing.

Brandon, posted March 7, 2012:

That is just weird, because the array(2) { [0]=> string(51) "11,822" [1]=> string(6) "11,822" } part comes from the var_dump, and the trailing 11,822 comes from the echo statement.

When I try your code, Facebook denies me the page because I don't send a proper user agent. In brandon.php, what do you get if you echo $data;?

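The difference between the var_dump output and the echo can be reproduced offline. This is a minimal sketch that assumes the page contains the span exactly as the regex expects; the $data string is a made-up stand-in for the real page source. It also shows why the dump reports string(51) for something that looks like "11,822" in a browser: index 0 holds the whole match including the span tags, which the browser renders invisibly, while index 1 holds only the captured text.

```php
<?php
// Hypothetical fragment standing in for the fetched page source.
$data  = '<div><span class="uiNumberGiant fsxxl fwb">11,822</span></div>';
$regex = '#<span class="uiNumberGiant fsxxl fwb">(.+?)</span>#';
preg_match($regex, $data, $match);

// $match[0] is the full match, tags included: 51 characters.
echo strlen($match[0]), "\n";

// Escaping it makes the hidden tags visible when viewed in a browser.
echo htmlspecialchars($match[0]), "\n";

// $match[1] is only the first capture group: the 6-character number.
echo $match[1], "\n";
```

So echo $match[1]; on its own prints just the number; everything else on the screen comes from var_dump($match);.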
Nathan (Administrator), posted March 7, 2012:

> When I try your code, Facebook denies me the page because I don't send a proper user agent. In brandon.php, what do you get if you echo $data;?

Give it a try now, I modified it. The problem you are having is because you need to change a setting in your php.ini.

Brandon, posted March 7, 2012:

Very strange. And you are keeping echo $match[1]; in brandon.php, right? If you have cURL enabled, this works for me:

    <?php
    $url = "http://www.facebook.com/pages/The-Quota/312465117876";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Send a browser-like user agent so the page is not refused
    $useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    $result = curl_exec($ch);
    curl_close($ch);
    //echo $result;
    $regex = '#<span class="uiNumberGiant fsxxl fwb">(.*?)</span#is';
    preg_match($regex, $result, $match);
    //var_dump($match);
    echo $match[1];
    ?>

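A blank screen usually means one of two things went wrong silently: the request failed, or the page loaded but the regex found nothing, so $match[1] was never set. The following is a small sketch of how both cases could be made visible. The helper names extractLikeCount and describeResult are invented for this example; the regex is the one from the post above.

```php
<?php
// Returns the captured like count, or null when the expected span is absent.
function extractLikeCount(string $html): ?string
{
    $regex = '#<span class="uiNumberGiant fsxxl fwb">(.*?)</span#is';
    return preg_match($regex, $html, $match) === 1 ? trim($match[1]) : null;
}

// $result is whatever curl_exec() returned: a string on success, false on failure.
function describeResult($result): string
{
    if ($result === false) {
        return 'Request failed';
    }
    $count = extractLikeCount($result);
    return $count === null
        ? 'Page loaded, but the like counter was not found'
        : $count;
}
```

Replacing the final echo $match[1]; with echo describeResult($result); would print a message instead of nothing when the markup is missing, for example when the site serves a login or redirect page.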
Nathan (Administrator), posted March 7, 2012:

Yes, I did keep that, but still no luck. So I just tried your code using cURL and that works perfectly! http://Digitize Design...est/brandon.php

I wonder, though, why it doesn't work on some Facebook pages. For example, when I change the URL to:

    <?php
    $url = "http://www.facebook.com/RockerLips";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    $result = curl_exec($ch);
    curl_close($ch);
    //echo $result;
    $regex = '#<span class="uiNumberGiant fsxxl fwb">(.*?)</span#is';
    preg_match($regex, $result, $match);
    //var_dump($match);
    echo $match[1];
    ?>

Brandon, posted March 7, 2012:

I'm glad I could be of help! When I try that URL in the browser, it redirects to the front page. Is http://www.facebook.com/RockerLips a public Facebook page, or do you have to be logged in to visit it?

Nathan (Administrator), posted March 7, 2012:

Ah, that must be the problem, thanks again. So now I will have to come up with a way for this script to log in to Facebook with my credentials first, I guess. If you know of a way or of any tutorials for this, please let me know.

Brandon, posted March 7, 2012:

I have never done one for Facebook myself, and I guess it will be tricky. Sites often add some random JavaScript redirects to make automated logins harder, and cracking that can be a daunting task. Finding good tutorials about this can be a real pain, unfortunately. I can see if I can crack it, but I can't promise anything.

In the meantime, get Firebug and read up on POSTing with cURL (you will also have to save cookies in a cookies.txt file). With Firebug you can use the Net tab to collect and save all the calls going back and forth when pages load. Then try to mimic those exact request headers with cURL. It is actually pretty fun once you get the hang of it.

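The general shape of "POST the login form, keep the cookies, then fetch the protected page" might look like the sketch below. It is a generic outline, not a working Facebook login: the function name loginAndFetch, the user agent string and the form field names in the usage line are all assumptions, and a real site may require extra hidden fields or JavaScript-generated tokens that you would have to copy from the requests you captured.

```php
<?php
// Logs in with a form POST, then fetches $targetUrl using the same cookies.
// Returns the page body as a string, or false if either request failed.
function loginAndFetch(string $loginUrl, array $fields, string $targetUrl, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects after login
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from here and enable cookie handling
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0',
    ]);

    // Step 1: submit the login form as a URL-encoded POST body.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    if (curl_exec($ch) === false) {
        curl_close($ch);
        return false;
    }

    // Step 2: switch back to GET and request the protected page on the same handle.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    $page = curl_exec($ch);
    curl_close($ch);
    return $page;
}
```

A hypothetical call would be loginAndFetch('https://example.com/login', ['email' => 'me@example.com', 'pass' => 'secret'], 'https://example.com/members', 'cookies.txt'); with the field names replaced by whatever the real form sends.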
vishnusr, posted May 21, 2014:

Hi friend, I am new to cURL. I can scrape data from a normal website with cURL, but now I need to scrape a site that requires a login.

For example, https://www.nmddata.com.au has a login, and I need to fetch the following URL behind it and save it as HTML:

https://www.nmddata.com.au/members/search_adv_do.php?state=New+South+Wales&region=&bed=&cat=&pricemin=&pricemax=&orderby=date&updown=desc&submit=Sort

I have tried the following code:

    <?php
    $name_val = urlencode('mani');
    $password_val = urlencode('roshan');
    $region_val = urlencode('Sydney City');
    $sector_val = urlencode('Residential');
    //$message_val = urlencode('This is a test & = +');
    // Each field in the POST body has to be separated by "&"
    $str = "Name=".$name_val."&Password=".$password_val."&Region=".$region_val."&Sector=".$sector_val;
    //print $str;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, '');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
    curl_setopt($ch, CURLOPT_USERPWD, "mani:roshan");
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // no verify
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $str);
    $data = curl_exec($ch);
    curl_close($ch);
    file_put_contents('testurl.html', $data);
    echo "<h3>Data Saved Successfully</h3>";
    //echo $data;
    ?>

But I can't get the logged-in data; the saved page only shows the "Join now" link. If possible, please suggest a fix.

vishnusr, posted May 22, 2014:

    curl_setopt($ch, CURLOPT_URL, 'https://www.nmddata.com.au/members/search_adv_do.php?state=New+South+Wales&orderby=suburb&updown=asc');

Nathan (Administrator), posted May 23, 2014:

You got it working then?

vishnusr, posted May 26, 2014:

> You got it working then?

Not yet, I didn't get any output. I can't get past the login page.

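Two general points may be worth checking in a script like the one above. First, CURLOPT_USERPWD sends HTTP authentication credentials, which is a different mechanism from submitting an HTML login form, so a site with a form-based login will typically ignore it. Second, a hand-concatenated POST string is easy to get wrong: a single missing "&" silently merges fields. The sketch below shows the second point with http_build_query. The field names follow the posted code, but whether the site's form really uses them is unknown, and the values are placeholders.

```php
<?php
// Placeholder values; the field names are taken from the posted code and are
// only an assumption about what the site's login form expects.
$fields = [
    'Name'     => 'user',
    'Password' => 'secret',
    'Region'   => 'Sydney City',
    'Sector'   => 'Residential',
];

// http_build_query URL-encodes each value and joins the pairs with "&".
$body = http_build_query($fields);
echo $body, "\n"; // Name=user&Password=secret&Region=Sydney+City&Sector=Residential

// For contrast: a hand-built string that forgets two "&" separators.
$broken = 'Name=user&Password=secretRegion=Sydney+CitySector=Residential';
parse_str($broken, $parsed);
echo count($parsed), "\n"; // 2: Region and Sector were swallowed into Password
```

The resulting $body can be passed directly to CURLOPT_POSTFIELDS.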