Christian Schaefer's Blog:
Using PHP Web Scraper Goutte in a Console Task in a Silex project
Oct 10, 2011 @ 13:26:24

In a recent post on his blog, Christian Schaefer shows how to use Goutte (a web scraper) to pull information from one site and use it in a Silex-powered application. His tutorial uses a custom service provider for the integration.

Since I discovered the free Facebook App hosting by heroku I keep wanting to make something useful out of it. So I thought about a small service app. Without going into details yet about its nature there was one immediate problem to be solved. How to get hold of the data? So I thought to scrape it off some website. I know this isn't very nice but unfortunately there is no feed I can use.. And how to best scrape a website? Use Goutte!

All you need are two files: goutte.phar and the Silex phar. The code for the service provider is a simple registration of namespaces. With that integrated, it's as simple as creating a client object and making a request with it for a URL.

tagged: silex goutte webscraping tutorial serviceprovider phar

Zend Developer Zone:
php|architect's guide to Web Scraping with PHP - Don't let the title fool you.
Sep 21, 2010 @ 16:20:51

On the Zend Developer Zone there's a recent post about a book from Matthew Turland (recently available in print) - the php|architect's Guide to Web Scraping with PHP - and why you shouldn't judge a book by its cover.

I was really hesitant to commit to reviewing the book because I tend not to review books I don't like and this subject matter just wasn't doing it for me. So with great fear and trepidation, I popped open my review copy. (PDF so I could read it on my iPad) I was ever so surprised and in a very good way.

He talks about the different parts of the book - the foreword from Ben Ramsey ("expert in all things HTTP") and the two halves of the book. The first half deals with accessing the information on remote sites and the second talks about the actual scraping of the information (parsing out the content with things like regular expressions and SimpleXML).

tagged: webscraping matthewturland book review

Matthew Turland's Blog:
"Web Scraping with PHP" Now Available in Print!
Sep 20, 2010 @ 17:03:49

If you've been waiting for the print edition of Matthew Turland's "Web Scraping with PHP" book (from php|architect Press) your wait is over. According to a new post on his blog the print version is now available for order.

I know a number of my readers have been waiting for this announcement: my book, Web Scraping with PHP, is now available for sale in hard copy form! That's right, you can now finally order your very own print edition copy. [...] To those who felt forced into buying the PDF edition to get access to the content because a print edition was not available until now, you have my most sincere and profound apologies.

His web scraping book covers topics like understanding HTTP requests at a basic level, working with several HTTP clients (cURL, pecl_http, Zend_Http_Client) and analyzing the remote page's content with tools like SimpleXML, the DOM functions and the XMLReader extension. If the print version's not your thing, you can still get the PDF from the php|architect store too.
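To give a rough idea of the parsing half of that workflow, here is a minimal sketch using PHP's DOM extension. It is not taken from the book: the `$html` string is hypothetical markup standing in for a page that has already been fetched, and the `$links` array name is our own.

```php
<?php
// Hypothetical markup standing in for a page fetched by an HTTP client.
$html = '<html><body><ul>'
      . '<li><a href="/a">First</a></li>'
      . '<li><a href="/b"> Second </a></li>'
      . '</ul></body></html>';

// Collect libxml warnings internally instead of emitting them;
// real-world HTML is rarely perfectly well-formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Query every anchor that carries an href attribute.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    // Map each link target to its trimmed link text.
    $links[$anchor->getAttribute('href')] = trim($anchor->textContent);
}
```

XPath is used here because it ships with the DOM extension; SimpleXML or XMLReader could be swapped in for well-formed XML documents.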

tagged: webscraping scraping book phparchitect matthewturland pdf

Matthew Turland's Blog:
An Update on "Web Scraping with PHP"
Jul 19, 2010 @ 16:08:54

If you've been looking forward to his web scraping book from the php|architect publishing group and are wondering when a print copy will arrive, Matthew Turland has an update for you.

Several people have asked me the same question recently, so I decided to take a blog post to provide an answer. The question is, "When will 'Web Scraping with PHP' be available in print?" Answering requires a bit of background to paint a full picture of where things are now.

He mentions delays from the publisher, miscommunication regarding a box of the print edition making it to php|tek this year and his struggle to get in contact to find out more about them. He also encourages you, the patient reader, to send them a message about the book to voice your opinion.

tagged: phparchitect tek10 webscraping book update

Developer Tutorials Blog:
Parallel web scraping in PHP: cURL multi functions
Jul 29, 2008 @ 12:57:00

The Developer Tutorials blog has posted a tutorial about scraping information from other websites in parallel (with their permission, of course) with the help of the cURL extension.

For anyone who's ever tried to fetch multiple resources over HTTP in PHP, the logic is trivial, but one key challenge is ever-present: latency delays. While web servers have perfectly good downstream links, latencies can increase script execution time tenfold just by downloading a few external URLs. But there's a simple solution: parallel cURL operations. In this tutorial, I'll show you how to use the "multi" functions in PHP's cURL library to get around this quickly and easily.

He starts with a basic cURL example, grabbing the content from example.com and putting it into a variable. He then extends it to run multiple fetches in parallel, creating more than one cURL handle and using the curl_multi_* functions to manage them.
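The pattern described above can be sketched roughly as follows. This is not the tutorial's own code: the `fetch_all()` function name and its `$timeout` parameter are hypothetical, and the sketch assumes the cURL extension is loaded and that each URL appears only once in the list.

```php
<?php
/**
 * Fetch several URLs in parallel with the curl_multi_* functions.
 * Hypothetical helper: returns response bodies keyed by URL.
 */
function fetch_all(array $urls, int $timeout = 10): array
{
    $multi = curl_multi_init();
    $handles = [];

    // One easy handle per URL, all attached to the same multi handle.
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for activity instead of spinning in a busy loop.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect each body and detach its handle.
    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    curl_multi_close($multi);

    return $bodies;
}
```

Calling `fetch_all(['http://example.com/', 'http://example.org/'])` would start both requests at once, so the total wait is close to that of the slowest response rather than the sum of both.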

tagged: webscraping curl function multi parallel tutorial

PHP Thinktank Blog:
New Discussions (IRC Talks Series)
Jan 22, 2007 @ 13:49:00

The PHP Thinktank Blog has posted two new IRC logs of talks given in their IRC channel on the Freenode network.

Now that all the yearly holiday chaos is out of the way, we bring you new logs of two recent IRC discussions. As usual, they are available on the google group.

The two talks were:

tagged: discussion injection webscraping log file google group

