
reliqarts / Laravel Scavenger

License: MIT
The most integrated web scraper package for Laravel.

Projects that are alternatives of or similar to Laravel Scavenger

Avbook
AV movie management system; scrapers for avmoo, javbus, and javlibrary; online AV film library and AV magnet-link database (Japanese Adult Video Library, Adult Video Magnet Links - Japanese Adult Video Database)
Stars: ✭ 8,133 (+8837.36%)
Mutual labels:  scraper, laravel
Novel
A web fiction (novel) site built on Laravel 5.2
Stars: ✭ 172 (+89.01%)
Mutual labels:  scraper, laravel
Laravel Ng Artisan Generators
Laravel artisan AngularJS generators
Stars: ✭ 91 (+0%)
Mutual labels:  laravel
Lambda Phantom Scraper
PhantomJS/Node.js web scraper for AWS Lambda
Stars: ✭ 93 (+2.2%)
Mutual labels:  scraper
Laravel Localize Middleware
Configurable localization middleware for your Laravel >=5.1 application
Stars: ✭ 92 (+1.1%)
Mutual labels:  laravel
Larafast Fastapi
A Fast Laravel package to help you generate CRUD API Controllers and Resources, Model.. etc
Stars: ✭ 91 (+0%)
Mutual labels:  laravel
Laravel Bandwagon
Social proof package for Laravel
Stars: ✭ 93 (+2.2%)
Mutual labels:  laravel
Chip
A drop-in subscription billing UI for Laravel
Stars: ✭ 91 (+0%)
Mutual labels:  laravel
Laravel Newsletter
Manage newsletters in Laravel
Stars: ✭ 1,318 (+1348.35%)
Mutual labels:  laravel
Img
🖼Image hosting powered by laravel
Stars: ✭ 92 (+1.1%)
Mutual labels:  laravel
Hockey Scraper
Python Package for scraping NHL Play-by-Play and Shift data
Stars: ✭ 93 (+2.2%)
Mutual labels:  scraper
Flysystem Upyun
Upyun (又拍云) file storage, upload, and deletion for Laravel.
Stars: ✭ 92 (+1.1%)
Mutual labels:  laravel
Depictr
A middleware for rendering static pages when crawled by search engines
Stars: ✭ 92 (+1.1%)
Mutual labels:  laravel
Officelife
OfficeLife manages everything employees do in a company. From projects to holidays to 1 on 1s to ... 🚀
Stars: ✭ 90 (-1.1%)
Mutual labels:  laravel
Doesangue Core
Online platform that connects people interested in blood donation
Stars: ✭ 91 (+0%)
Mutual labels:  laravel
Awesome Dl
This is a list of repositories and libraries that allow for scripted downloading of online content.
Stars: ✭ 93 (+2.2%)
Mutual labels:  scraper
Laravel Plain Sqs
Custom SQS connector for Laravel (or Lumen) that supports third-party, plain JSON messages
Stars: ✭ 91 (+0%)
Mutual labels:  laravel
Porter
A docker based multi-site setup for local PHP development. Inspired by Laravel Valet, Homestead and Vessel.
Stars: ✭ 92 (+1.1%)
Mutual labels:  laravel
Dropzone Laravel Image Upload
Laravel 5.2 and Dropzone.js auto image uploads with removal links
Stars: ✭ 92 (+1.1%)
Mutual labels:  laravel
Scrapoxy
Scrapoxy hides your scraper behind a cloud. It starts a pool of proxies to send your requests. Now, you can crawl without thinking about blacklisting!
Stars: ✭ 1,322 (+1352.75%)
Mutual labels:  scraper

Laravel Scavenger

The most integrated web scraper package for Laravel.


Top Features

Scavenger provides the following features, and more, out of the box.

  • Ease of use
    • Scavenger is super easy to configure. Simply publish the config file and set your targets.
  • Scrape data from multiple sources at once.
  • Convert scraped data into usable Laravel model objects.
    • e.g. you may scrape an article and have it converted into an object of your choice and saved to your database, immediately available to your viewers.
  • You can easily perform one or more operations on each property of any scraped entity.
    • e.g. you may call a paraphrasing service from a model or package of your choice on data attributes before saving them to your database.
  • Data integrity constraints
    • Scavenger uses a hashing algorithm of your choice to maintain data integrity. This hash is used to ensure that one scrap (source article) is not converted to multiple output objects (model duplicates).
  • Console Command
    • Once Scavenger is configured, a single Artisan command launches the seeker. Because it runs in the console, it is more efficient and timeouts are less likely to occur.
    • Artisan command: php artisan scavenger:seek
  • Schedule ready
    • Scavenger can easily be set to scrape on a schedule, so building a somewhat autonomous website is super easy! (See the scheduler sketch after this list.)
  • SERP
    • Scavenger can be used to flexibly scrape Search Engine Result Pages.
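
A minimal scheduling sketch, assuming Laravel's standard console kernel; the daily() frequency is only an example and should be adjusted to suit your targets:

<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule): void
    {
        // Run the Scavenger seeker once per day.
        $schedule->command('scavenger:seek')->daily();
    }
}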

Installation

  1. Install via Composer; in your console:

    composer require reliqarts/laravel-scavenger
    

    or require in composer.json:

    {
        "require": {
            "reliqarts/laravel-scavenger": "^3.1"
        }
    }
    

    then run composer update in your terminal to pull it in.

  2. (Optional) Publish package resources and configuration:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider"

You may opt to publish only the configuration by using the scavenger-config tag:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-config"

or only the migrations via the scavenger-migrations tag:

php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-migrations"
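
If you published the migrations, run them so the package's tables (e.g. the scraps table, scavenger_scraps by default) are created:

php artisan migrate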

Configuration

Scavenger is highly configurable. Your configuration lives in the published config file and is reused on every subsequent run.

Structure

Below is an example of a typical config file structure, with explanatory comments.

<?php

return [
    // debug mode?
    'debug' => false,

    // whether log file should be written
    'log' => true,

    // How much detail is expected in output, 1 being the lowest, 3 being highest.
    'verbosity' => 1,

    // Set the database config
    'database' => [
        // Scraps table
        'scraps_table' => env('SCAVENGER_SCRAPS_TABLE', 'scavenger_scraps'),
    ],

    // Daemon config - used to build daemon user
    'daemon' => [
        // Model to use for Daemon identification and login
        'model' => 'App\\User',

        // Model property to check for daemon ID
        'id_prop' => 'email',

        // Daemon ID
        'id' => '[email protected]',

        // Any additional information required to create a user:
        // NB. this is only used when creating a daemon user; there is no "safe" way
        // to change the daemon's password once it has been created.
        'info' => [
            'name' => 'Scavenger Daemon',
            'password' => 'pass',
        ],
    ],

    // guzzle settings
    'guzzle_settings' => [
        'timeout' => 60,
    ],

    // hashing algorithm to use
    'hash_algorithm' => 'sha512',

    // storage
    'storage' => [
        // This directory will live inside your application's log directory.
        'log_dir' => env('SCAVENGER_LOG_DIR', 'scavenger'),
    ],

    // different model entities and mapping information
    'targets' => [
        // NB. the "rooms" target shown below is for example purposes only. It explicitly includes all possible keys.
        'rooms' => [
            'example' => true,
            'serp' => false,
            'model' => 'App\\Room',
            'source' => 'http://myroomslistingsite.1demo/section/rooms',
            'search' => [
                // keywords
                'keywords' => ['professional'],
                // form markup
                'form' => [
                    // search form selector (important)
                    'selector' => '#form',
                    // input element name for search term/keyword
                    'keyword_input_name' => 'keyword',
                    'submit_button' => [
                        // text on submit button (optional)
                        'text' => null,
                        // submit element id, use if button doesn't have text (optional)
                        'id' => null,
                    ],
                ],
            ],
            'pager' => [
                // link (a tag) selector
                'selector' => 'div.content #page a.pagingnav',
            ],
            // max. number of pages to scrape (0 is unlimited)
            'pages' => 0,
            // content markup: actual data to be scraped
            'markup' => [
                'title' => 'div.content section > table tr h3',
                // inside: content to be found upon clicking title link
                '__inside' => [
                    'title' => '#ad-title > h1 > a',
                    'body' => 'article .adcontent > p[align="LEFT"]:last-of-type',
                    // focus: focus detail on the following section
                    '__focus' => 'section section > .content #ad-detail > article',
                ],
                // wrapper/item/result: wrapping selector for each item on single page.
                // If the __inside special key is set, this key is ignored (i.e. __inside takes precedence)
                '__result' => null,
            ],
            // split single attributes into multiple based on regex
            'dissect' => [
                'body' => [
                    'email' => '(([eE]mail)*:*\s*\w+\@(\s*\w)*\.(net|com))',
                    'phone' => '((([cC]all|[[tT]el|[Pp][Hh](one)*)[:\d\-,\sDL\/]*\d)|(\d{3}\-?\d{4}))',
                    'beds' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]edroom|b\/r|[Bb]ed)s?)',
                    'baths' => '([\d]+[\d\.\/\s]*[^\w]*([Bb]athroom|bth|[Bb]ath)s?)',
                    // retain:  whether details should be left in source attribute after extraction
                    '__retain' => true,
                ],
            ],
            // modify attributes by calling functions
            'preprocess' => [
                // takes a callable
                // optional third parameter of array if callable method needs an instance
                // e.g. ['App\\Item', 'foo', true] or 'bar'
                'title' => null,
            ],
            // remap entity attributes to model properties (optional)
            'remap' => [
                'title' => null,
                'body' => null,
            ],
            // scraps containing any of these words will be rejected (optional)
            'bad_words' => [
                'office',
            ],
        ],

        // Google SERP example:
        'google' => [
            'example' => true,
            'serp' => true,
            'model' => 'App\\GoogleResult',
            'source' => 'https://www.google.com',
            'search' => [
                'keywords' => ['dog'],
                'form' => [
                    'selector' => 'form[name="f"]',
                    'keyword_input_name' => 'q',
                ],
            ],
            'pages' => 2,
            'pager' => [
                'selector' => '#foot > table > tr > td.b:last-child a',
            ],
            'markup' => [
                '__result' => 'div.g',
                'title' => 'h3 > a',
                'description' => '.st',
                // the 'link' and 'position' attributes make use of some of Scavenger's available properties
                'link' => '__link',
                'position' => '__position',
            ],
        ],

        // Bing SERP example:
        'bing' => [
            'example' => true,
            'serp' => true,
            'model' => 'App\\BingResult',
            'source' => 'https://www.bing.com',
            'search' => [
                'keywords' => ['dog'],
                'form' => [
                    'selector' => 'form#sb_form',
                    'keyword_input_name' => 'q',
                ],
            ],
            'pages' => 3,
            'pager' => [
                'selector' => '.sb_pagN',
            ],
            'markup' => [
                '__result' => '.b_algo',
                'title' => 'h2 a',
                'description' => '.b_caption p',
                'link' => '__link',
                'position' => '__position',
            ],
        ],
    ],
];
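
With your targets configured, launch the seeker with the Artisan command shown earlier:

php artisan scavenger:seek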

Target Breakdown

The targets array contains a list of entities to be scraped, keyed by a unique target identifier. The structure is as follows.

  • model: Laravel DB model to create from target.
  • source: Source URL to scrape.
  • search: Search settings, used when a search must be performed before the target data is shown. (optional)
    • keywords: Array of keywords to search for.
    • form.selector: CSS selector for the search form.
    • form.keyword_input_name: Name of the input element that receives the search term/keyword.
    • form.submit_button: Text or element ID of the form's submit button. (optional)
  • pager: Pagination settings; pager.selector is the CSS selector for the "next" link, used to advance to the next page of results.
  • markup: Array of attributes to scrape from main list. [attributeName => CSS selector]
    • __inside: Sub-markup for the detail page, i.e. the page shown when an item's title link is clicked/opened. (optional)
  • dissect: Split compound attributes into smaller attributes via regex. (optional)
  • preprocess: Array of attributes to preprocess. [attributeName => callable] (optional; see the sketch after this list)
  • remap: Array of attributes which need to be renamed in order to be saved as target objects. [attributeName => newName] (optional)
  • bad_words: Any scraps found containing these words will be discarded. (optional)
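
As an illustration of preprocess, the callable below is a hypothetical class method that cleans the scraped title before the scrap is converted to a model. The App\Services\TitleCleaner class and its method name are placeholders, and the sketch assumes the callable receives the raw attribute value and returns the processed value.

<?php

namespace App\Services;

// Hypothetical preprocessor, referenced from the config as:
// 'preprocess' => ['title' => ['App\\Services\\TitleCleaner', 'clean']]
class TitleCleaner
{
    public static function clean(string $title): string
    {
        // Trim the value and collapse internal runs of whitespace.
        return preg_replace('/\s+/', ' ', trim($title));
    }
}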

Glossary of Terms

The following words may appear in context above.

  • Daemon: User instance to be used by the scavenger service.
  • Scrap: Scraped data before being converted to the target object.
  • Target: Configured source-model mapping for a single entity.
  • Target Object: Eloquent model object to be generated from a scrap.
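
For illustration, the target object for the "rooms" example above could be an ordinary Eloquent model whose fillable attributes match the scraped and dissected keys. This is only a sketch under the assumption that scraped attributes are assigned by name; adapt it to your own markup and dissect configuration.

<?php

namespace App;

use Illuminate\Database\Eloquent\Model;

// Illustrative target model for the example "rooms" target (App\Room in the sample config).
class Room extends Model
{
    // Attribute names mirror the markup and dissect keys from the sample config.
    protected $fillable = [
        'title',
        'body',
        'email',
        'phone',
        'beds',
        'baths',
    ];
}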

Acknowledgements

This library is heavily inspired by and dependent on the Guzzle library, although several concepts may have been adjusted.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner.