1. The DomCrawler Component
The Crawler class provides methods to query and manipulate HTML and XML documents.
An instance of the Crawler represents a set (SplObjectStorage) of DOMElement objects, which are basically nodes that you can traverse easily.
More detail
2. PHPCrawl webcrawler library/framework
PHPCrawl is a framework for crawling/spidering websites written in the programming language PHP, so just call it a webcrawler-library or crawler-engine for PHP
PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files ans so on) for futher processing to users of the library.
It is high configurable and provides several options to specify the behaviour of the crawler like URL- and Content-Type-filters, cookie-handling, robots.txt-handling, limiting options, multiprocessing and much more.
PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files ans so on) for futher processing to users of the library.
It is high configurable and provides several options to specify the behaviour of the crawler like URL- and Content-Type-filters, cookie-handling, robots.txt-handling, limiting options, multiprocessing and much more.
3. PHP Simple HTML DOM Parser
Require PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.
No comments:
Post a Comment