Data scraping using cURL in PHP
CodeFire Blog - Technology
Written by Pranjal Srivastava
Friday, 01 July 2011 15:23
(Please ensure that you are a genuine user of the site, and that the site allows this kind of automation and does not treat it as a hacking attempt.)
Let’s assume that the task for the day is to pull some data out of a site (e.g. download a CSV file) programmatically, where the data is behind a login. We shall use PHP/cURL for this automation. I am not going to talk about cURL itself or the basic technical details associated with it, so please do not treat this as a tutorial for learning cURL :). I am only going to talk about the high-level process to follow and some gotchas that could make your life difficult along the way.
First step: You will need to log in to the site using cURL. Inspect the login form using a tool such as Firebug, or view the page source, to see which fields are being sent and what the endpoint of the request is. You will need all of this information to send the login request to the site using cURL. In fact, make a small HTML version of the form (using code from the site) on your local machine and try to log in with that. If it works, part of the job is done.
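The form inspection described above can also be done in code. The following is a minimal sketch, assuming the login page HTML has already been fetched into a string; `extract_form_fields` is a hypothetical helper name, and the sketch only looks at the first form on the page and only at its `input` elements.

```php
<?php
// Hypothetical helper: returns the action URL, method and named input
// fields (including hidden ones) of the first <form> found in $html.
function extract_form_fields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $doc->getElementsByTagName('form')->item(0);
    if ($form === null) {
        return ['action' => null, 'method' => null, 'fields' => []];
    }

    $fields = [];
    foreach ($form->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            // Inputs without a name are never submitted, so skip them.
            $fields[$name] = $input->getAttribute('value');
        }
    }

    return [
        'action' => $form->getAttribute('action'),
        // A form without a method attribute is submitted with GET.
        'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
        'fields' => $fields,
    ];
}
```

Printing the result with `print_r()` gives a quick list of what the login request has to contain. Fields added by JavaScript at submit time will not show up here, so a browser tool is still the more reliable check.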
Another trick that can be very helpful here: if the form uses a POST request, change it to GET in your local copy and send the request to a local URL to see all the parameters that are passed. Sometimes there are hidden variables which are not easy to track down. Now inspect the query string from the GET request and build the POST fields for cURL from it. One important point while writing the login request is not to forget to save the cookie. Set the option CURLOPT_COOKIEJAR to a filename so that cURL writes the cookies it receives to that file, and set CURLOPT_COOKIEFILE to the same filename on later requests so that they are sent back (CURLOPT_COOKIEFILE on its own only reads cookies, it does not save them). You can get the filename using $tmp_fname = tempnam(sys_get_temp_dir(), "COOKIE"); to keep it platform independent (Windows, Linux, Mac).
Every site comes with its own set of rules for login, but broadly, keeping the above points in mind, you should be able to log in to most sites using cURL.
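As a rough illustration of the login request, here is a minimal sketch. The function names, the URL and the field names are placeholders, not taken from any real site. Note that CURLOPT_COOKIEJAR is the option that makes cURL write received cookies to disk, while CURLOPT_COOKIEFILE makes it read them back.

```php
<?php
// Hypothetical helper: builds the cURL options for a form login.
function build_login_options(string $loginUrl, array $fields, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        // Encode the fields the same way a browser submits a form.
        CURLOPT_POSTFIELDS     => http_build_query($fields, '', '&'),
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here
        CURLOPT_FOLLOWLOCATION => true,        // many sites redirect after login
        CURLOPT_RETURNTRANSFER => true,        // return the response instead of printing it
    ];
}

// Hypothetical helper: performs the login and returns the response body.
function login(string $loginUrl, array $fields, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_login_options($loginUrl, $fields, $cookieFile));
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    unset($ch); // releasing the handle lets libcurl flush cookies to the jar file
    if ($body === false) {
        throw new RuntimeException("Login request failed: $error");
    }
    return $body;
}

// Example usage (placeholder URL and field names):
// $cookieFile = tempnam(sys_get_temp_dir(), 'COOKIE');
// $page = login('https://example.com/login.php',
//               ['username' => 'alice', 'password' => 'secret'], $cookieFile);
```

A successful HTTP response does not prove the login worked, so it is worth checking the returned page for something that only appears when logged in.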
Second step: After login, the next step is to get the file and save it to disk. If the file URL is simple (not dynamic), there is no problem: just invoke a simple GET request for that URL with cURL and save the response to disk. However, if the URL for the file is dynamic, you need to fetch the page which has the link, search for the link in the text of that page (knowledge of regular expressions comes in handy here) and extract the dynamic URL. Then invoke cURL again on the dynamic URL to get the data. One point to keep in mind when dealing with a dynamic URL: when you read the URL string from the page source into a PHP variable, any & in it appears in its HTML-encoded form, &amp;, so if you invoke that URL directly to get the data it will not work. Use htmlspecialchars_decode to get the actual URL and you should be able to save the data.
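The second step could look roughly like the sketch below. It assumes the link can be identified by its visible text and that the cookie file from the login step is available. `find_download_url` and `download_file` are hypothetical names, and resolving a relative URL against the site's base URL is left out.

```php
<?php
// Hypothetical helper: finds the href of the <a> tag whose text is
// $linkText and decodes HTML entities such as &amp; in it.
function find_download_url(string $html, string $linkText): ?string
{
    $pattern = '~<a\s[^>]*href=(["\'])([^"\']*)\1[^>]*>\s*'
             . preg_quote($linkText, '~')
             . '\s*</a>~is';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    // In page source "&" is written as "&amp;"; turn it back into "&".
    return htmlspecialchars_decode($m[2]);
}

// Hypothetical helper: streams $url to $savePath using the saved cookies.
function download_file(string $url, string $cookieFile, string $savePath): void
{
    $out = fopen($savePath, 'wb');
    if ($out === false) {
        throw new RuntimeException("Cannot open $savePath for writing");
    }
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_COOKIEFILE     => $cookieFile, // send the session cookies from login
        CURLOPT_FILE           => $out,        // write the response body straight to disk
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $ok    = curl_exec($ch);
    $error = curl_error($ch);
    unset($ch);
    fclose($out);
    if ($ok === false) {
        throw new RuntimeException("Download failed: $error");
    }
}

// Example usage (placeholder values):
// $url = find_download_url($pageHtml, 'Download CSV');
// if ($url !== null) { download_file($url, $cookieFile, 'report.csv'); }
```

A regular expression is enough for a single well-known link. For pages with more complicated markup, an HTML parser such as DOMDocument is usually less fragile.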