PHPDeveloper: PHP News, Views and Community

SitePoint PHP Blog:
Diffbot: Crawling with Visual Machine Learning

by Chris Cornutt, Aug 01, 2014 @ 16:37:12

On the SitePoint PHP blog, Bruno Skvorc has posted a tutorial showing how to use the Diffbot service to extract data from any page. He introduces the service and then walks through a simple request made with Guzzle.

Have you ever wondered how social networks do URL previews so well when you share links? How do they know which images to grab, whom to cite as an author, or which tags to attach to the preview? Is it all crawling with complex regexes over source code? Actually, more often than not, it isn’t. [...] If you want to build a URL preview snippet or a news aggregator, there are many automatic crawlers available online, both proprietary and open source, but you seldom find something as niche as visual machine learning. This is exactly what Diffbot is – a “visual learning robot” which renders a URL you request in full and then visually extracts data, helping itself with some metadata from the page source as needed.

He uses a Laravel installation (running on a Homestead instance) and makes the request with Guzzle, authenticating with a token obtained from Diffbot. The service offers a 7-day free trial with a 10k call limit, so you can sign up and grab your token there. He includes code for an example request that fetches a SitePoint page and parses out its tags. He also briefly looks at the custom handling Diffbot allows based on CSS-type rules.
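
To give a rough idea of what such a request involves, here is a minimal sketch. It is not the code from the tutorial. The function names `buildDiffbotUrl` and `extractTagLabels` are hypothetical, and the `v3/article` endpoint and the `objects` / `tags` / `label` response layout are assumptions that should be checked against the current Diffbot documentation. The sample payload is hand-written so the snippet runs offline.

```php
<?php
// Sketch only: endpoint version and response shape are assumptions,
// not taken from the tutorial.

// Build the request URL for a (presumed) Diffbot Article API call.
function buildDiffbotUrl(string $token, string $pageUrl): string
{
    // http_build_query URL-encodes the target page address.
    $query = http_build_query(['token' => $token, 'url' => $pageUrl], '', '&');
    return 'https://api.diffbot.com/v3/article?' . $query;
}

// Collect unique tag labels from a decoded response array.
function extractTagLabels(array $response): array
{
    $labels = [];
    foreach ($response['objects'] ?? [] as $object) {
        foreach ($object['tags'] ?? [] as $tag) {
            if (isset($tag['label'])) {
                $labels[] = $tag['label'];
            }
        }
    }
    return array_values(array_unique($labels));
}

// Hand-written sample payload, not real API output.
$sample = json_decode(
    '{"objects":[{"tags":[{"label":"PHP"},{"label":"Laravel"},{"label":"PHP"}]}]}',
    true
);

$url = buildDiffbotUrl('YOUR_TOKEN', 'https://www.sitepoint.com/');
$labels = extractTagLabels($sample);

echo $url, PHP_EOL;
echo implode(', ', $labels), PHP_EOL;
```

With Guzzle installed, the live call would look something like `(new GuzzleHttp\Client())->get($url)`, passing `json_decode((string) $response->getBody(), true)` to `extractTagLabels`. Keeping the URL building and tag parsing in plain functions means they can be checked without spending any of the trial's call quota.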