So this is my first little mini tutorial here. Hope someone finds it useful. Basically, we are going to scrape some data from a remote website using PHP and cURL. cURL is a "client URL transfer library" for making all sorts of remote requests, and it is very useful for things like getting data, logging in automatically, auto-filling forms, etc. Let's get cracking!

First and foremost, we have to enable the cURL extension, as it is often not enabled by default.

On a Windows machine, edit your php.ini file, uncomment the line ;extension=php_curl.dll (remove the leading semicolon) and restart your server.

If you are using Ubuntu, run sudo apt-get install php5-curl and restart the server. (On newer releases that no longer ship PHP 5, the package is named php-curl.)

I use a WAMP server at home and it is super easy to enable extensions on it: left-click the WAMPSERVER icon in the bottom-right corner of your screen -> PHP -> PHP extensions -> click php_curl, then restart the server. Voila!

Alright, now we are going to initiate cURL, make a request to another site and display the HTML with an echo:

    <?php
    $url = "http://www.nytimes.com/";
    $ch = curl_init($url);
    // Return the response as a string instead of printing it directly.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow redirects (for example from http:// to https://).
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $result = curl_exec($ch);
    if ($result === false) {
        die("cURL error: " . curl_error($ch));
    }
    curl_close($ch);
    echo $result;
    ?>

Now that we have the HTML inside $result, we can extract the data we are after using regular expressions. In this case I took a regex from http://regexlib.com/ to extract links and modded it just a little bit to make it work. You can just comment out the previous echo $result; and paste these 2 lines in there:

    preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']*(.*?)[\"\'].*?>([^<]+|.*?)?<\/a>/is", $result, $match, PREG_SET_ORDER);
    print_r($match);

With PREG_SET_ORDER, each entry of $match is one link: index 0 holds the whole <a> tag, index 1 the href value and index 2 the link text.

That is how this stuff works. Pretty easy, basic stuff. You can read more about PHP's cURL support here: http://php.net/manual/en/book.curl.php

I especially recommend the curl_setopt part, where you can do all kinds of cool stuff like setting a user agent, a referrer, cookies and a bunch of other options to make your request look like it comes from an actual user.
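Going back to the regex step for a moment: here is a minimal sketch of how the matches could be turned into a simple list of links instead of a raw print_r() dump. The extract_links() helper and the sample $html string are my own illustrative names (assumptions, not part of PHP or cURL); the pattern is the same one used in the tutorial.

```php
<?php
// Hypothetical helper (not from the original post): wraps the tutorial's
// regex and returns a flat list of ['href' => ..., 'text' => ...] pairs.
function extract_links(string $html): array
{
    $pattern = "/<a[\s]+[^>]*?href[\s]?=[\s\"\']*(.*?)[\"\'].*?>([^<]+|.*?)?<\/a>/is";
    $links = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = [
                // Capture group 1: the value of the href attribute.
                'href' => $m[1],
                // Capture group 2: the link body; drop nested tags and padding.
                'text' => trim(strip_tags($m[2] ?? '')),
            ];
        }
    }
    return $links;
}

// Sample markup standing in for $result from curl_exec().
$html = '<p><a href="http://example.com/a">First</a> and '
      . '<a class="x" href=\'/b\'>Second <b>bold</b></a></p>';

foreach (extract_links($html) as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

In the real script you would pass $result from curl_exec() instead of the sample string. Keep in mind that a regex like this is fine for quick experiments, but it can trip over unusual markup; an HTML parser such as PHP's DOMDocument is probably the safer choice for anything serious.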
Any questions or suggestions, just fire away in the thread! More information on cURL: http://curl.haxx.se/