
Sergey Zhuk:
Fast Web Scraping With ReactPHP. Part 2: Throttling Requests
Mar 19, 2018 @ 09:20:55

Sergey Zhuk has posted the second part of his "fast web scraping" series that makes use of the ReactPHP package to perform the requests. In part one he laid some of the groundwork for the scraper and made a few requests. In this second part he improves on that basic script and shows how to throttle the requests so as not to overload the target server.

It is very convenient to have a single HTTP client which can be used to send as many HTTP requests as you want concurrently. But at the same time, a bad scraper which performs hundreds of concurrent requests per second can impact the performance of the site being scraped. Since the scrapers don’t drive any human traffic on the site and just affect the performance, some sites don’t like them and try to block their access. The easiest way to prevent being blocked is to crawl nicely with auto throttling the scraping speed (limiting the number of concurrent requests). The faster you scrape, the worse it is for everybody. The scraper should look like a human and perform requests accordingly. A good solution for throttling requests is a simple queue.

He shows how to integrate the clue/mq-react package into the current scraper. The package provides a lightweight in-memory queue that caps how many requests are in flight at once, so no external message broker is involved. He includes the code needed to update the ReactPHP client, and the mq-react package keeps the change simple: the HTTP requests go through the queue instead of being fired all at once for every entry in the array of URLs. Once the queue is integrated, he then shows how to create a "parser" that can read in the HTML and extract only the wanted data using the DomCrawler component.

tagged: http reactphp client scraping web tutorial throttle request queue imdb

Link: http://sergeyzhuk.me/2018/03/19/fast-webscraping-with-reactphp-limiting-requests/

Sergey Zhuk:
Fast Web Scraping With ReactPHP
Feb 12, 2018 @ 10:55:42

Sergey Zhuk has a new ReactPHP-related post on his site today showing you how to use the library to scrape content from the web quickly, making use of the asynchronous abilities the package provides.

Almost every PHP developer has parsed some data from the Web at some point. Often we need some data, which is available only on some website and we want to pull this data and save it somewhere. It looks like we open a browser, walk through the links and copy data that we need. But the same thing can be automated via script. In this tutorial, I will show you how you can increase the speed of your parser by making requests asynchronously.

In his example he creates a scraper that goes to a movie's page on the IMDB website and extracts the title, description, release date and the list of genres it falls into. Instead of fetching the pages sequentially, one at a time, he uses ReactPHP to speed things up by giving it a list of pages to fetch concurrently. He starts by walking through the setup of the package and the creation of the browser instance. He then includes the code to make the request and crawl the contents of the result for the data. The post ends with the full code for the client and a way to add a timeout so that a slow or hanging request does not hold up the scraper.

tagged: scraping reactphp tutorial imdb movie crawl dom

Link: http://sergeyzhuk.me/2018/02/12/fast-webscraping-with-reactphp/