YiiScraperModule
Overview ¶
YiiScraperModule is a base scraper module for gathering information from the Internet. It parses received HTML pages with the simple_html_dom parser component and extracts URLs from them. The URLs are saved to a DB table and requested later. By default, the HTML content is saved to a DB table as well. You may select target containers for parsed URLs and stored content, or write your own functions for parsing URLs and storing content.
The module is also designed to run from cron, so you can use it for periodic scraping. If the DB table for links is empty, it is filled with seeds. Each URL is then fetched from the table, marked as non-active, and requested. The received HTML content is parsed and new URLs are saved to the same table. The scraper terminates when any one of its limits is exhausted; you can see the respective message in the scraper logs table. After termination, the scraper may start again later and request the first active URL, so its work continues where it stopped. If there is no active URL in the table, the scraper stops. The next time it runs, it clears the DB table for URLs, inserts the seeds, and the process repeats from the very beginning.
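The lifecycle described above can be modelled in a few lines. This is only an in-memory sketch, not the module's code: the function name runScraperPass, the $fetchLinks callable, the array layout (URL => active flag) and the single document-count limit are all assumptions made for illustration, and the real module works against DB tables.

```php
<?php
// In-memory model of the link table: array(url => bool active).
// Hypothetical sketch; names and data layout are assumptions, not the module's API.
function runScraperPass(array &$links, array $seeds, $fetchLinks, $maxDocuments)
{
    // Empty table or no active URL left: clear the table and insert the seeds.
    if (!in_array(true, $links, true)) {
        $links = array_fill_keys($seeds, true);
    }

    $fetched = 0;
    while ($fetched < $maxDocuments) {
        // Take the first active URL, if any.
        $url = array_search(true, $links, true);
        if ($url === false) {
            return 'finished'; // nothing left to request
        }

        $links[$url] = false; // mark as non-active before requesting
        $fetched++;

        // $fetchLinks stands in for "request the page and parse its URLs".
        foreach (call_user_func($fetchLinks, $url) as $found) {
            if (!isset($links[$found])) {
                $links[$found] = true; // new URL, to be requested later
            }
        }
    }

    return 'limit reached'; // a later pass resumes from the first active URL
}
```

Calling the function repeatedly with the same $links array mimics successive cron runs: a pass that hits the limit leaves active URLs behind for the next pass, and a pass that starts with no active URLs reseeds the table.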
Features ¶
- Automated install and uninstall.
- Designed to run periodically, from browser or cron.
- Detects the content charset and converts it to UTF-8.
- You can define limits for the scraping process (duration, received data size, received documents count, max depth to scrape).
- Able to scrape only inside specified URLs, if needed.
- Logging system (when the scraping process started; how many bytes, documents and HTML documents were received; how many new URLs were saved to the DB; what the process status is).
- Stores URL relations in a separate table.
- Uses the simple_html_dom extension to parse received data. You can use CSS selectors to define which URLs are added and which content is stored in the DB table, or you can write your own callback functions to handle this task.
Requirements ¶
- Yii 1.1 or above
- CURL and MBSTRING PHP extensions
Usage ¶
To use the module, copy it to the '.../protected/modules/' folder. Then add the following lines to the 'modules' section of your '.../protected/config/main.php' file:
...
'modules'=>array(
    'gii'=>array( ... ),
    ...
    'yscraper' => array(
        'class' => 'application.modules.YiiScraper.YiiScraperModule',
        'installMode' => true,
        'seeds' => 'http://pravda.com.ua',
        'insideUrlsOnly' => 'pravda.com.ua',
        'maxDuration' => 1000,
        'contentSelector' => 'div#content',
    ),
    ...
),
...
Note that the 'installMode' option must be true. Then install the scraper by opening this link in your browser:
http://your.domainname.com/index.php?r=yscraper/default/install
or, to uninstall:
http://your.domainname.com/index.php?r=yscraper/default/uninstall
Then remove the line
'installMode' => true,
from the config, adjust all other settings, and run the scraper with the command:
Yii::app()->getModule('yscraper')->run();
If you want to use callbacks, you may use static methods from previously imported classes in your config file:
...
'yscraper' => array(
    ...
    'contentCallback' => 'SomeModelClass::someCallbackFunction',
),
...
Or you may set callbacks at runtime in your controller file:
public function actionIndex()
{
    ...
    Yii::app()->getModule('yscraper')->contentCallback = array($this, 'someCallbackFunction');
    Yii::app()->getModule('yscraper')->run();
    ...
}
...
public function someCallbackFunction($currentURL, $content)
{
    // process content here
}
Note that the linkCallback and contentCallback functions receive two arguments: the URL and the content received from that URL. The linkCallback function should return an array of URLs to scrape later.
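As an illustration of the linkCallback contract (URL and content in, array of URLs out), here is a minimal sketch. The class name ScraperCallbacks, the method name extractArticleLinks and the same-host filtering rule are assumptions for this example; it uses PHP's built-in DOMDocument rather than simple_html_dom, and it ignores relative links for brevity.

```php
<?php
// Hypothetical helper class; its name and the filtering rule are assumptions.
class ScraperCallbacks
{
    /**
     * Candidate linkCallback: receives the fetched URL and its HTML content,
     * returns an array of URLs to scrape later.
     */
    public static function extractArticleLinks($currentURL, $content)
    {
        $urls = array();
        if (trim($content) === '') {
            return $urls; // nothing to parse
        }
        $host = parse_url($currentURL, PHP_URL_HOST);

        // Tolerate malformed real-world HTML without emitting warnings.
        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($content);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($doc->getElementsByTagName('a') as $anchor) {
            $href = trim($anchor->getAttribute('href'));
            // Keep only absolute http(s) links that stay on the same host.
            if (preg_match('#^https?://#i', $href)
                && parse_url($href, PHP_URL_HOST) === $host) {
                $urls[$href] = true; // array keys de-duplicate
            }
        }

        return array_keys($urls);
    }
}
```

Following the string form shown above for contentCallback, such a method could presumably be wired in with 'linkCallback' => 'ScraperCallbacks::extractArticleLinks', provided the class has been imported beforehand.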
Feedback is greatly appreciated!
Dear friends!
Please feel free to post suggestions, notes, and any other feedback.
Nice
Nice... will check it... thanks
What can I use this for?
Good work.. Can you please explain to me the practical use of this extension? How could it be of use on a website?
Thanks!
The goal
Hi, Beesho, thanks for the good question.
A scraper can be used to gather information from other sites, from one site, or from part of a site, so that you can process that information and use it for your own purposes.
For example, many portal sites use scrapers. They gather the latest news from other (news) sites and show the news titles, with respective links, to the portal user. You can use it for your own mini search engine. You can use it to gather info that is dispersed across thousands of site pages in certain blocks. Etc...
For example, I developed this scraper because I needed to collect all specimen items from one site. Then I enhanced it, developed it into a module, and shared it.
Very Nice
Very nice!
I will probably be using it sometime.
Thank you for your detailed explanation, vittron!
Install Error
Hi,
Thanks for extension.
You forgot to add prefix in Installer sql query on line 55 and 56.
It's: tbl_yiiscraper_link
Should be: {$prefix}yiiscraper_link
Indeed!
Hi! Yes, you are right, indeed! I will fix that bug. I hope it didn't cause you much inconvenience. Thank you!
Thank you
Thanks for this module.
But can you please explain what these lines mean:
'seeds' => 'http://pravda.com.ua', 'insideUrlsOnly' => 'pravda.com.ua',
Thanks in advance
Seeds
Hello, samilo! 'seeds' is the URL (or URLs) where scraping starts. It is a string, or an array if there are several seeds. If you want to scrape only inside some area (i.e. just one site, no outside links), you need to specify the 'insideUrlsOnly' option. 'pravda.com.ua' is used only as an example. You can change the URL according to your needs.
Please, let me know whether you have any questions. Thank you!
Must also do this: 'tablePrefix'=>'',// DECLARING THE PREFIX
You must also add tablePrefix, set to two single quotes '' with no space between them, to your main.php if you don't use any table prefix. If you don't, you'll get an error and spend like an hour trying to figure it out. :) See the code below for an easy add.
'db'=>array(
    'connectionString' => 'mysql:host=localhost;dbname=yourdatabasenamehere',
    'emulatePrepare' => true,
    'username' => 'yourusername',
    'password' => 'yourpassword',
    'charset' => 'utf8',
    'tablePrefix' => '', // DECLARING THE PREFIX
),
great
Hi,
that is a great extension, I have just been digging into it :)
I will surely have some questions sometime
Laszlo from Hungary