PHPDeveloper: PHP News, Views and Community

Sergey Zhuk:
Fast Web Scraping With ReactPHP. Part 3: Using Proxy

by Chris Cornutt Jun 26, 2018 @ 17:27:43

Sergey Zhuk has posted the third part of his series covering the use of ReactPHP to scrape content from another source on the web. In this third part of the series he improves on his scripts from before (scraping from the IMDB site) to add in a proxy server.

In the previous article, we have created a scraper to parse movies data from IMDB. We also used a simple in-memory queue to avoid sending hundreds or thousands of concurrent requests and thus to avoid being blocked. But what if you are already blocked? The site that you are scraping has already added your IP to its blacklist and you don’t know whether it is a temporary block or a permanent one.
Such issues can be resolved with a proxy server. Using proxies and rotating IP addresses can prevent you from being detected as a scraper.

He then shows how to use the clue/reactphp-buzz package to make an asynchronous HTTP request to google.com, using promises rather than normal synchronous request handling. Next he installs the clue/reactphp-socks package to make the connection to the proxy server(s) and modifies the Buzz client to use it as a connector. After finding a proxy server to use, he updates the scraper code created previously with the new Buzz+Socks combination and shows it in action scraping data. The post finishes with a look at adding some error handling and at what to do when the proxy requires authentication before use.
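The proxy rotation idea from the excerpt can be illustrated without any extra packages. The following is a minimal sketch using PHP's built-in cURL constants rather than the Buzz and Socks packages the article uses; the `ProxyRotator` class, its method names, and the proxy addresses are all hypothetical and only here for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: cycles through a list of SOCKS5 proxies and builds
 * the cURL options for each request. The proxy addresses are placeholders.
 */
final class ProxyRotator
{
    private int $next = 0;

    /** @param list<string> $proxies "host:port" strings */
    public function __construct(private array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('At least one proxy is required.');
        }
    }

    /** Return cURL options that route $url through the next proxy in the list. */
    public function optionsFor(string $url, ?string $userPwd = null): array
    {
        $proxy = $this->proxies[$this->next];
        // Round-robin: wrap back to the first proxy after the last one.
        $this->next = ($this->next + 1) % count($this->proxies);

        $options = [
            CURLOPT_URL            => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_PROXY          => $proxy,
            CURLOPT_PROXYTYPE      => CURLPROXY_SOCKS5,
            CURLOPT_TIMEOUT        => 10,
        ];
        if ($userPwd !== null) {
            // "user:password" for proxies that require authentication.
            $options[CURLOPT_PROXYUSERPWD] = $userPwd;
        }

        return $options;
    }
}

// Usage sketch (no request is sent here):
$rotator = new ProxyRotator(['127.0.0.1:1080', '127.0.0.1:1081']);
$options = $rotator->optionsFor('https://example.com/');
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $html = curl_exec($ch);
```

Unlike the ReactPHP approach in the article, a plain `curl_exec()` call is blocking, so this only shows the rotation and authentication options, not the asynchronous part.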

Sergey Zhuk:
Fast Web Scraping With ReactPHP. Part 2: Throttling Requests

by Chris Cornutt Mar 19, 2018 @ 14:20:55

Sergey Zhuk has posted the second part of his "fast web scraping" series that makes use of the ReactPHP package to perform the requests. In part one he laid some of the groundwork for the scraper and made a few requests. In this second part he improves on this basic script and shows how to throttle the requests so as to not overload the end server.

It is very convenient to have a single HTTP client which can be used to send as many HTTP requests as you want concurrently. But at the same time, a bad scraper which performs hundreds of concurrent requests per second can impact the performance of the site being scraped. Since the scrapers don’t drive any human traffic on the site and just affect the performance, some sites don’t like them and try to block their access. The easiest way to prevent being blocked is to crawl nicely with auto throttling the scraping speed (limiting the number of concurrent requests). The faster you scrape, the worse it is for everybody. The scraper should look like a human and perform requests accordingly. A good solution for throttling requests is a simple queue.

He shows how to integrate the clue/mq-react package into the current scraper. It provides a lightweight in-memory queue that caps the number of requests running at the same time. He includes the code needed to update the ReactPHP client. The mq-react package makes the update simple, with the scraper sending its requests through the queue instead of calling the HTTP client directly for every URL. Once the queue is integrated, he then shows how to create a "parser" that can read in the HTML and extract only the wanted data using the DomCrawler component.
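The throttling idea from the excerpt (capping the number of concurrent requests with a simple queue) can be sketched in plain PHP. This is not the clue/mq-react API; `ThrottledQueue` and its methods are hypothetical names, and the "requests" are simulated by jobs that hand back a completion callback instead of doing real network I/O.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: runs at most $limit jobs at once, extra jobs wait
 * in FIFO order. Each job receives a $done callback and must call it when
 * its (asynchronous) work has finished.
 */
final class ThrottledQueue
{
    private int $running = 0;

    /** @var list<callable> jobs waiting for a free slot */
    private array $waiting = [];

    public function __construct(private int $limit)
    {
        if ($limit < 1) {
            throw new InvalidArgumentException('Limit must be at least 1.');
        }
    }

    public function push(callable $job): void
    {
        $this->waiting[] = $job;
        $this->drain();
    }

    public function runningCount(): int
    {
        return $this->running;
    }

    public function waitingCount(): int
    {
        return count($this->waiting);
    }

    /** Start waiting jobs until the concurrency limit is reached. */
    private function drain(): void
    {
        while ($this->running < $this->limit && $this->waiting !== []) {
            $this->start(array_shift($this->waiting));
        }
    }

    private function start(callable $job): void
    {
        $this->running++;
        $finished = false;
        $job(function () use (&$finished): void {
            if ($finished) {
                return; // ignore a second completion signal for the same job
            }
            $finished = true;
            $this->running--;
            $this->drain(); // a slot is free, start the next waiting job
        });
    }
}

// Usage sketch: three simulated requests, at most two in flight.
$queue = new ThrottledQueue(2);
$finishers = [];
foreach (['/a', '/b', '/c'] as $path) {
    $queue->push(function (callable $done) use (&$finishers, $path): void {
        // A real scraper would start a non-blocking HTTP request here and
        // call $done() from its completion handler.
        $finishers[$path] = $done;
    });
}
$finishers['/a'](); // "/a" completes, which lets "/c" start
```

In an event-loop setup such as ReactPHP, the completion callback would typically be invoked when the request's promise settles, whether it resolves or rejects, so that failed requests also free their slot.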

tagged: http reactphp client scraping web tutorial throttle request queue imdb

Link: http://sergeyzhuk.me/2018/03/19/fast-webscraping-with-reactphp-limiting-requests/

Sergey Zhuk:
Fast Web Scraping With ReactPHP

by Chris Cornutt Feb 12, 2018 @ 16:55:42

Sergey Zhuk has a new ReactPHP-related post to his site today showing you how to use the library to scrape content from the web quickly, making use of the asynchronous abilities the package provides.

Almost every PHP developer has parsed some data from the Web at some point. Often we need some data, which is available only on some website and we want to pull this data and save it somewhere. It looks like we open a browser, walk through the links and copy data that we need. But the same thing can be automated via script. In this tutorial, I will show you how you can increase the speed of your parser by making requests asynchronously.

In his example he creates a scraper that goes to a movie's page on the IMDB website and extracts the title, description, release date and the list of genres it falls into. Instead of creating a single-threaded process that can only fetch a single page at a time, he uses ReactPHP to speed things up and provide it a list of pages to fetch all at the same time. He starts by walking through the setup of the package and the creation of the browser instance. He then includes the code to make the request and crawl the contents of the result for the data. The post ends with the full code for the client and a way to add in a timeout in case the request fails.

tagged: scraping reactphp tutorial imdb movie crawl dom

Link: http://sergeyzhuk.me/2018/02/12/fast-webscraping-with-reactphp/

Phil Sturgeon:
Benchmarking Codswallop: NodeJS v PHP

by Chris Cornutt Nov 12, 2013 @ 15:21:29

Phil Sturgeon has posted about some Node.js vs PHP benchmarks that someone linked him to concerning web scraping. The article suggests that Node.js "owns" PHP when it comes to this but, as Phil finds out, there's a bit more to the story than that.

Sometimes people link me to articles and ask for my opinions. This one was a real doozy. Oh goody, a framework versus language post. Let's try and chew through this probable linkbait [where] we're benchmarking NodeJS v PHP. Weird, but I'll go along with it. Well, now we're testing cheerio v PhpQuery which is a bit different, but fine, let's go along with it.

Through a little discovery, Phil noticed phpQuery using file_get_contents, a blocking method for fetching the remote pages to scrape. Node.js instead uses a non-blocking method, meaning multiple files can be fetched at the same time. To account for this blocking vs non-blocking difference, he decided to run benchmarks against a few cases: Node.js/Cheerio, PHP/phpQuery and his own, fairer comparison to the Node option, PHP/ReactPHP/phpQuery. He's shared his results, showing a major difference between the straight phpQuery and the React-based version.

It seems likely to me that people just assume PHP can't do this stuff, because by default most people arse around PHP with things like MAMP, or on their shitty web-host where it is hard to install things and as such get used to writing PHP without utilizing many extensions. It is probably exactly this which makes people think PHP just can't do something, when it easily can.

tagged: nodejs reactphp webpage scraping benchmark compare

Link: http://philsturgeon.co.uk/blog/2013/11/benchmarking-codswallop-nodejs-v-php

Gary Sieling:
Scraping Google Maps Search Results with Javascript and PHP

by Chris Cornutt Jul 29, 2013 @ 17:23:21

Gary Sieling has a new post to his site about scraping Google Maps data with a combination of PHP and some simple Javascript. It makes use of callbacks and timers to get the data already returned from their API.

Google Maps provides several useful APIs for accessing data: a geocoding API to convert addresses to latitude and longitude, a search API to provide locations matching a term, and a details API for retrieving location metadata. For many mapping tasks it is valuable to get a large list of locations (restaurants, churches, etc) – since this is valuable, Google places a rate limiter on the information, and encourages caching query results.

He includes the code (both front- and back-end) that you'll need to make the system work. It makes a request to the Google Maps API as usual but then adds a listener with a callback. This takes the latitude/longitude data and runs a "get details" method to get more information. The result is then POSTed to PHP and written out to a file.

tagged: googlemaps google search results scraping api javascript tutorial

Link: http://garysieling.com/blog/scraping-google-maps-search-results-with-javascript-and-php

Robert Basic's Blog:
Book review - Guide to Web Scraping with PHP

by Chris Cornutt Jun 01, 2011 @ 14:28:42

In this new post to his blog Robert Basic has a review of a book from php|architect (by Matthew Turland), "Guide to Web Scraping with PHP".

It took me a while to grab myself a copy of Matthew Turland’s "Guide to Web Scraping with PHP", but a few weeks ago a copy finally arrived and I had the pleasure of reading it. [...] My overall impression of the book is that it was worth the time and I’m really glad that I bought it. Matthew did a great job explaining all the tools we have at our disposal for writing web scrapers and how to use them.

He talks about the content of a few specific chapters (the HTTP protocol, client libraries you can use and how to prepare documents for parsing) and notes that there's not much bad he can think of about the book:

It is a guide, clear and straight-to-the-point, explaining what tools are there, which one to use and how for writing scrapers and that’s exactly what I wanted to know.

tagged: web scraping book review matthewturland


Matthew Turland's Blog:
"Web Scraping with PHP" Now Available in Print!

by Chris Cornutt Sep 20, 2010 @ 17:03:49

If you've been waiting for the print edition of Matthew Turland's "Web Scraping with PHP" book (from php|architect Press) your wait is over. According to a new post on his blog the print version is now available for order.

I know a number of my readers have been waiting for this announcement: my book, Web Scraping with PHP, is now available for sale in hard copy form! That's right, you can now finally order your very own print edition copy. [...] To those who felt forced into buying the PDF edition to get access to the content because a print edition was not available until now, you have my most sincere and profound apologies.

His web scraping book covers topics like understanding HTTP requests on a base level, working with several HTTP clients like cURL, pecl_http, Zend_Http_Client and how to analyze the remote page's information with things like SimpleXML, the DOM functions and the XMLReader extension. If the print version's not your thing, you can still get the PDF from the php|architect store too.

tagged: webscraping scraping book phparchitect matthewturland pdf


Community News:
php|architect Releases "Guide to Web Scraping"

by Chris Cornutt Apr 22, 2010 @ 13:25:36

php|architect has officially released one of their latest guides - this time it's Matthew Turland's "Guide to Web Scraping".

Matthew talks a bit about it in his latest blog entry:

What I’m announcing in this blog post has been in the works since early 2008 when I first pitched the idea. It was rejected by several major publishers who basically said the same thing: the idea was in too small of a niche or simply wasn’t marketable. php|architect Press respectfully disagreed with them and decided to publish what is now a book written by me that you can purchase.

The book covers all things related to pulling content from remote pages including an understanding of HTTP codes, a look at tools you can use (including cURL, pecl_http and Zend_Http_Client) and how to use technologies like DOM, SimpleXML and regular expressions to match content.

tagged: guide web scraping book release matthewturland


Sameer Borate's Blog:
Web scraping tutorial

by Chris Cornutt Mar 09, 2009 @ 12:52:43

In a new tutorial on his blog today, Sameer shows a library that you can use (simplehtmldom) to parse remote sites and pull out just the information you need (aka "web scraping").

There are three ways to access a website's data. One is through a browser, the other is using an API (if the site provides one) and the last by parsing the web pages through code. The last one, also known as Web Scraping, is a technique of extracting information from websites using specially coded programs. In this post we will take a quick look at writing a simple scraper using the simplehtmldom library.

His three (really more) step process guides you through installing the library, installing Firebug and some example code to create your first scraper: an example that pulls some of the "Featured Links" from the Google search results sidebar. The second example illustrates grabbing the table of contents from the most recent issue of Wired.

tagged: web scraping tutorial simplehtmldom google search results wired tableofcontents


Juozas Kaziukenas' Blog:
Web scraping with PHP and XPath

by Chris Cornutt Feb 18, 2009 @ 16:28:08

In this new post to his blog Juozas Kaziukenas takes a look at one method for getting the information out of a remote page - parsing it with PHP and XPath (assuming the page is correctly formatted).

When I was writing about how I use web scraping, I still hadn’t tried using Xpath (shame on me). [...] It turned out, that using Xpath is extremely easy, really. When you master it, you can do everything in seconds. Yes, you need to know how XML works and how to write correct Xpath queries (brief explanation of Xpath syntax is available at W3Schools), but hey - these topics are in 1st year of university.

He includes both some sample code (to fetch titles and prices for cameras from bhphotovideo.com) and a link to an XPath checker you can use to ensure that your query is correctly formatted. It's good that he also includes a quick reminder about the ethical issue with web scraping: it could be considered stealing depending on where the information comes from and who is providing it.
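To give a feel for the technique, here is a minimal sketch of title and price extraction using PHP's built-in DOMDocument and DOMXPath classes. It is not the code from the post: the `extractProducts()` function, the HTML snippet, and the class names in the XPath queries are all made up for illustration, and a real page would need queries matched to its actual markup.

```php
<?php
declare(strict_types=1);

/**
 * Extract product titles and prices from an HTML string with XPath.
 * The markup and class names are invented for this example.
 *
 * @return list<array{title: string, price: string}>
 */
function extractProducts(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid markup; collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $products = [];

    foreach ($xpath->query('//div[@class="product"]') as $node) {
        // Queries starting with "." are evaluated relative to the current product node.
        $products[] = [
            'title' => trim($xpath->evaluate('string(.//h2)', $node)),
            'price' => trim($xpath->evaluate('string(.//span[@class="price"])', $node)),
        ];
    }

    return $products;
}

$html = <<<'HTML'
<html><body>
  <div class="product"><h2>Camera A</h2><span class="price">$499.00</span></div>
  <div class="product"><h2>Camera B</h2><span class="price">$1,299.00</span></div>
</body></html>
HTML;

foreach (extractProducts($html) as $product) {
    echo $product['title'], ': ', $product['price'], PHP_EOL;
}
```

Using `loadHTML()` rather than `loadXML()` relaxes the "correctly formatted" requirement somewhat, since the HTML parser tolerates many markup errors that would stop a strict XML parser.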

tagged: web scraping xpath tutorial price title ethical steal information
