Sergey Zhuk has posted the third part of his series covering the use of ReactPHP to scrape content from another source on the web. In this third part of the series he improves on his scripts from before (scraping from the IMDB site) to add in a proxy server.
n the previous article, we have created a scraper to parse movies data from used a simple in-memory queue to avoid sending hundreds or thousands of concurrent requests and thus to avoid being blocked. But what if you are already blocked? The site that you are scraping has already added your IP to its blacklist and you don’t know whether it is a temporal block or a permanent one.
Such issued can be resolved with a proxy server. Using proxies and rotating IP addresses can prevent you from being detected as a scraper.
He then shows how to use the clue/reactphp-buzz package to write an asynchronous HTTP request to
google.com making use of promises rather than normal synchronous request handling. He then installs the clue/reactphp-socks package to make the connection to the proxy server(s) and modifies the Buzz client to use that as a connection. After finding a proxy server to use, he updates the scraper code created previously with the new Buzz+Socks combination and shows it in action scraping data. The post finishes with a look at adding some error handling and how to handle when the proxy requests authentication before use.