Jonathan Street's Blog:
When scraping content from the web don't make it obvious
November 07, 2007 @ 11:26:00

Jonathan Street has a tip for developers who have no choice but to scrape content from a remote site: don't make it obvious. He also suggests a way to make your requests less conspicuous.

A couple of hours ago I was playing around scraping some content from a website. All was going well until suddenly I couldn't get my script to fetch meaningful content. [...] The first thing I did was stop visiting the site for 15 minutes or so and then increase the time between requests. It briefly worked again but quickly stopped.

A single change to the user agent string in his php.ini made the problem disappear, which points to user agent filtering on the remote side. His hint covers two ways to change the user agent your scripts send: one in plain PHP and one with cURL. A further refinement might be a rotating array that alternates between four or five user agent strings, so that successive requests look less uniform.
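
The two approaches and the rotation idea could look roughly like the sketch below. It is not taken from Jonathan's post: the function names (`nextUserAgent`, `fetchWithStreams`, `fetchWithCurl`) and the user agent strings are hypothetical placeholders, and the rotation is a simple round-robin rather than a truly random choice. It assumes the `user_agent` ini directive for PHP's HTTP stream wrapper and the `CURLOPT_USERAGENT` option for cURL.

```php
<?php
// Hypothetical pool of user agent strings; the values are placeholders only.
$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0',
    'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) ExampleBrowser/4.0',
];

// Pick a user agent in round-robin order based on the request counter.
function nextUserAgent(array $agents, int $requestNumber): string
{
    return $agents[$requestNumber % count($agents)];
}

// Plain PHP: override the php.ini user_agent setting at runtime.
// The HTTP stream wrapper (file_get_contents, fopen) sends this value.
// Returns the response body, or false on failure.
function fetchWithStreams(string $url, string $userAgent)
{
    ini_set('user_agent', $userAgent);
    return file_get_contents($url);
}

// cURL: set the user agent on the individual handle instead.
// Requires the curl extension. Returns the body, or false on failure.
function fetchWithCurl(string $url, string $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
}
```

In a scraping loop, each request would call something like `fetchWithCurl($url, nextUserAgent($userAgents, $i))` with `$i` as the request counter. Swapping the modulo for `array_rand($userAgents)` would give a random pick instead of a fixed cycle.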

All content copyright, 2014 PHPDeveloper.org :: info@phpdeveloper.org - Powered by the Solar PHP Framework