Sergey Zhuk has posted the second part of his "fast web scraping" series that makes use of the ReactPHP package to perform the requests. In part one he laid some of the groundwork for the scraper and made a few requests. In this second part he improves on this basic script and how to throttle the requests so as to not overload the end server.
t is very convenient to have a single HTTP client which can be used to send as many HTTP requests as you want concurrently. But at the same time, a bad scraper which performs hundreds of concurrent requests per second can impact the performance of the site being scraped. Since the scrapers don’t drive any human traffic on the site and just affect the performance, some sites don’t like them and try to block their access. The easiest way to prevent being blocked is to crawl nicely with auto throttling the scraping speed (limiting the number of concurrent requests). The faster you scrap, the worse it is for everybody. The scraper should look like a human and perform requests accordingly. A good solution for throttling requests is a simple queue.
He shows how to integrate the clue/mq-react package into the current scraper to interface with a RabbitMQ instance and handle the reading of and writing to the queue. He includes the code needed to update the ReactPHP client. The
mq-react package makes the update simple with the HTTP client reading from the queue instance rather than the array of URLs. One the queue is integrated, he then shows how to create a "parser" that can read in the HTML and extract only the wanted data using the DomCrawler component.