Sunday, 30 August 2015

Some php crawler

1. The DomCrawler Component

The Crawler class provides methods to query and manipulate HTML and XML documents.

An instance of the Crawler represents a set (SplObjectStorage) of DOMElement objects, which are basically nodes that you can traverse easily.
More detail

2. PHPCrawl webcrawler library/framework

PHPCrawl is a framework for crawling/spidering websites written in the programming language PHP, so just call it a webcrawler-library or crawler-engine for PHP

PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files ans so on) for futher processing to users of the library.

It is high configurable and provides several options to specify the behaviour of the crawler like URL- and Content-Type-filters, cookie-handling, robots.txt-handling, limiting options, multiprocessing and much more.

3. PHP Simple HTML DOM Parser

A HTML DOM parser written in PHP5+ let you manipulate HTML in a very easy way!
Require PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.

Wednesday, 10 December 2014

How to Mount S3 Bucket on CentOS/RHEL and Ubuntu using S3FS

S3FS is FUSE (File System in User Space) based solution to mount an Amazon S3 buckets, We can use system commands with this drive just like as another Hard Disk in system. On s3fs mounted files systems we can simply use cp, mv and ls the basic Unix commands similar to run on locally attached disks.
If you like to access S3 buckets without mounting on system, use s3cmd command line utility to manage s3 buckets. s3cmd is also provides faster speed for data upload and download rather than s3fs. To work with s3cmd use next articles to install s3cmd in Linux systems and Windows systems.
This article will help you to install S3FS and Fuse by compiling from source, and also help you to mount S3 bucket on your CentOS/RHEL and Ubuntu systems.

Step 1: Remove Existing Packages

First check if you have any existing s3fs or fuse package installed on your system. If installed it already remove it to avoid any file conflicts.
CentOS/RHEL Users:
 # yum remove fuse fuse-s3fs

Ubuntu Users:
 $ sudo apt-get remove fuse

Step 2: Install Required Packages

After removing above packages. First we will install all dependencies for fuse and s3cmd. Install the required packages to system use following command.
CentOS/RHEL Users:
 # yum install gcc libstdc++-devel gcc-c++ curl-devel libxml2-devel openssl-devel mailcap

Ubuntu Users:
 $ sudo apt-get install build-essential libcurl4-openssl-dev libxml2-dev mime-support

Step 3: Download and Compile Latest Fuse

Download and compile latest version of fuse source code. For this article we are using fuse version 2.9.3. Following set of command will compile fuse and add fuse module in kernel.
# cd /usr/src/
# wget
# tar xzf fuse-2.9.3.tar.gz
# cd fuse-2.9.3
# ./configure --prefix=/usr/local
# make && make install
# export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
# ldconfig
# modprobe fuse

Step 4: Download and Compile Latest S3FS

Download and compile latest version of s3fs source code. For this article we are using s3fs version 1.74. After downloading extract the archive and compile source code in system.
# cd /usr/src/
# wget
# tar xzf s3fs-1.74.tar.gz
# cd s3fs-1.74
# ./configure --prefix=/usr/local
# make && make install

Step 5: Setup Access Key

Also In order to configure s3fs we would required Access Key and Secret Key of your S3 Amazonaccount. Get these security keys from Here.
# chmod 600 ~/.passwd-s3fs
Note: Change AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with your actual key values.

Step 6: Mount S3 Bucket

Finally mount your s3 bucket using following set of commands. For this example, we are using s3 bucket name as mydbbackup and mount point as /s3mnt.
# mkdir /tmp/cache
# mkdir /s3mnt
# chmod 777 /tmp/cache /s3mnt

# s3fs -o use_cache=/tmp/cache mydbbackup /s3mnt
# sudo s3fs insing-profile-test -o use_cache=/tmp/cache -o allow_other /folder_local

copied from techadmin