vantoozz / Proxy Scraper

Licence: MIT
Library for scraping free proxy lists

Proxy Scraper

A PHP library for scraping lists of free proxies.

Quick start

composer require vantoozz/proxy-scraper:~3 guzzlehttp/guzzle:~7 guzzlehttp/psr7 hanneskod/classtools

<?php declare(strict_types=1);

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

foreach (proxyScraper()->get() as $proxy) {
    echo $proxy . "\n";
}

Older versions

This is version 3 of the library. For version 2, see the v2 branch; for version 1, see the v1 branch.

Upgrade

See the "Upgrade from version 2" section below.

Setup

The library requires a PSR-18 compatible HTTP client. Before using the library, install one of the available implementations, e.g.:

composer require guzzlehttp/guzzle:~7 guzzlehttp/psr7

All available clients are listed on Packagist: https://packagist.org/providers/psr/http-client-implementation.

Then install the proxy-scraper library itself:

composer require vantoozz/proxy-scraper:~3

Usage

Auto-configuration

The simplest way to start using the library is to call the proxyScraper() function, which instantiates and configures all the scrapers.

Please note that, in addition to guzzlehttp/guzzle:~7 and guzzlehttp/psr7, the auto-configuration function requires the hanneskod/classtools dependency.

composer require guzzlehttp/guzzle:~7 guzzlehttp/psr7 hanneskod/classtools

<?php declare(strict_types=1);

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

foreach (proxyScraper()->get() as $proxy) {
    echo $proxy . "\n";
}

HTTP Client

If you are not using auto-configuration, you will need an HTTP client.

The library provides the guzzleHttpClient() function, which creates and configures the client.

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;

use function Vantoozz\ProxyScraper\guzzleHttpClient;
use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = guzzleHttpClient();

$scraper = proxyScraper($httpClient);

try {
    echo $scraper->get()->current()->getIpv4() . "\n";
} catch (ScraperException $e) {
    echo $e->getMessage() . "\n";
}

You can create your own HTTP client by implementing HttpClientInterface:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\HttpClient\HttpClientInterface;

use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = new class implements HttpClientInterface {
    /**
     * @param string $uri
     * @return string
     */
    public function get(string $uri): string
    {
        return "some string";
    }
};

$scraper = proxyScraper($httpClient);

try {
    echo $scraper->get()->current()->getIpv4() . "\n";
} catch (ScraperException $e) {
    echo $e->getMessage() . "\n";
}

Of course, you may manually configure the scraper and underlying HTTP client:

Single scraper

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Scrapers;

use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new Scrapers\UsProxyScraper(guzzleHttpClient());

foreach ($scraper->get() as $proxy) {
    echo $proxy . "\n";
}

Composite scraper

You can easily get data from many scrapers at once:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Scrapers;

use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$httpClient = guzzleHttpClient();

$compositeScraper = new Scrapers\CompositeScraper;

$compositeScraper->addScraper(new Scrapers\FreeProxyListScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\CoolProxyScraper($httpClient));
$compositeScraper->addScraper(new Scrapers\SocksProxyScraper($httpClient));

foreach ($compositeScraper->get() as $proxy) {
    echo $proxy . "\n";
}
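
Because scrapers return generators (see the get(): Generator signatures in the examples on this page), results can be consumed lazily. The following is a minimal sketch that does not depend on the library: uniqueTake() is a hypothetical helper, and fakeProxies() is a stand-in for $compositeScraper->get() that yields plain ip:port strings instead of Proxy objects. It assumes that several sources may list the same proxy, skips such duplicates by comparing string forms, and stops reading after a given number of distinct items.

```php
<?php declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): yields at most $limit
 * distinct items, comparing items by their string form.
 */
function uniqueTake(iterable $items, int $limit): Generator
{
    if ($limit <= 0) {
        return;
    }

    $seen = [];
    foreach ($items as $item) {
        $key = (string) $item;
        if (isset($seen[$key])) {
            continue; // already yielded an equal item
        }
        $seen[$key] = true;
        yield $item;
        if (count($seen) >= $limit) {
            return; // stop without pulling further items from the source
        }
    }
}

/** Stand-in for $compositeScraper->get(); real scrapers yield Proxy objects. */
function fakeProxies(): Generator
{
    yield '192.0.2.10:8080';
    yield '192.0.2.10:8080'; // duplicate, as if listed by two sources
    yield '198.51.100.7:3128';
    yield '203.0.113.5:1080';
}

foreach (uniqueTake(fakeProxies(), 2) as $proxy) {
    echo $proxy . "\n";
}
```

Since the examples above print proxies with echo $proxy, Proxy objects can be cast to strings, so the same helper should work on the output of a real scraper, e.g. uniqueTake($compositeScraper->get(), 10).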

Error handling

Sometimes things go wrong. This example shows how to handle errors while getting data from many scrapers:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;

require_once __DIR__ . '/vendor/autoload.php';

$compositeScraper = new Scrapers\CompositeScraper;

// Set exception handler
$compositeScraper->handleScraperExceptionWith(function (ScraperException $e) {
    echo 'An error occurred: ' . $e->getMessage() . "\n";
});

// Fake scraper throwing an exception
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface {
    public function get(): Generator
    {
        throw new ScraperException('some error');
    }
});

// Fake scraper with no exceptions
$compositeScraper->addScraper(new class implements Scrapers\ScraperInterface {
    public function get(): Generator
    {
        yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
    }
});

//Run composite scraper
foreach ($compositeScraper->get() as $proxy) {
    echo $proxy . "\n";
}

This will output:

An error occurred: some error
192.168.0.1:8888

In the same manner, you may configure exception handling for the scraper created with the proxyScraper() function, as it returns an instance of CompositeScraper:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ScraperException;
use function Vantoozz\ProxyScraper\proxyScraper;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = proxyScraper();

$scraper->handleScraperExceptionWith(function (ScraperException $e) {
    echo 'An error occurred: ' . $e->getMessage() . "\n";
});
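
Instead of printing each error immediately, a handler can collect errors and report them once scraping has finished. The sketch below is one possible approach, not the library's recommended pattern. To stay runnable without the library, the handler is typed against Throwable and is invoked directly with RuntimeException instances; the names $errors and $collectError are hypothetical, and it is an assumption that handleScraperExceptionWith() accepts any callable taking the exception as its only argument.

```php
<?php declare(strict_types=1);

// Collected error messages; the handler appends to this array by reference.
$errors = [];

// Hypothetical handler. With the library installed, the parameter type could
// be ScraperException, as in the examples above.
$collectError = function (Throwable $e) use (&$errors): void {
    $errors[] = get_class($e) . ': ' . $e->getMessage();
};

// With the library installed, the handler would be registered like this:
// $scraper->handleScraperExceptionWith($collectError);

// Simulate two failing scrapers by calling the handler directly.
$collectError(new RuntimeException('connection timed out'));
$collectError(new RuntimeException('unexpected markup'));

echo count($errors) . " scraper(s) failed\n";
foreach ($errors as $message) {
    echo ' - ' . $message . "\n";
}
```

Collecting by reference keeps the proxy output and the error report separate, which may be convenient when the proxy list is piped into another tool.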

Validating proxies

Validation steps may be added:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Exceptions\ValidationException;
use Vantoozz\ProxyScraper\Ipv4;
use Vantoozz\ProxyScraper\Port;
use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;
use Vantoozz\ProxyScraper\Validators;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new class implements Scrapers\ScraperInterface
{
    public function get(): \Generator
    {
        yield new Proxy(new Ipv4('104.202.117.106'), new Port(1234));
        yield new Proxy(new Ipv4('192.168.0.1'), new Port(8888));
    }
};

$validator = new Validators\ValidatorPipeline;
$validator->addStep(new Validators\Ipv4RangeValidator);

foreach ($scraper->get() as $proxy) {
    try {
        $validator->validate($proxy);
        echo '[OK] ' . $proxy . "\n";
    } catch (ValidationException $e) {
        echo '[Error] ' . $e->getMessage() . ': ' . $proxy . "\n";
    }
}

This will output:

[OK] 104.202.117.106:1234
[Error] IPv4 is in private range: 192.168.0.1:8888
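
To illustrate the kind of check involved, a private-range test can be sketched with PHP's built-in filter_var(). This is not the implementation of Ipv4RangeValidator, and its exact rules may differ; isPublicIpv4() is a hypothetical helper that only shows how such a check could look in plain PHP.

```php
<?php declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): returns true when $ip is a
 * syntactically valid IPv4 address outside private and reserved ranges.
 */
function isPublicIpv4(string $ip): bool
{
    // filter_var() returns the filtered string on success and false on failure.
    return false !== filter_var(
        $ip,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_IPV4 | FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    );
}

// The two addresses used in the validation example above.
foreach (['104.202.117.106', '192.168.0.1'] as $ip) {
    echo $ip . ' => ' . (isPublicIpv4($ip) ? 'public' : 'private or reserved') . "\n";
}
```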

Metrics

A Proxy object may have metrics (metadata) associated with it.

By default, a Proxy object has a source metric:

<?php declare(strict_types=1);

use Vantoozz\ProxyScraper\Proxy;
use Vantoozz\ProxyScraper\Scrapers;

use function Vantoozz\ProxyScraper\guzzleHttpClient;

require_once __DIR__ . '/vendor/autoload.php';

$scraper = new Scrapers\UsProxyScraper(guzzleHttpClient());

/** @var Proxy $proxy */
$proxy = $scraper->get()->current();

foreach ($proxy->getMetrics() as $metric) {
    echo $metric->getName() . ': ' . $metric->getValue() . "\n";
}

This will output:

source: Vantoozz\ProxyScraper\Scrapers\UsProxyScraper

Note: the examples use Guzzle as the HTTP client.

Testing

Unit tests

./vendor/bin/phpunit --testsuite=unit

Integration tests

./vendor/bin/phpunit --testsuite=integration

System tests

php ./tests/systemTests.php

Upgrade from version 2

The biggest difference from version 2 is the HTTP client configuration.

Instead of

$httpClient = new \Vantoozz\ProxyScraper\HttpClient\Psr18HttpClient(
    new \Http\Adapter\Guzzle6\Client(new \GuzzleHttp\Client([
        'connect_timeout' => 2,
        'timeout' => 3,
    ])),
    new \Http\Message\MessageFactory\GuzzleMessageFactory
);

the client should be instantiated like this:

$httpClient = \Vantoozz\ProxyScraper\guzzleHttpClient();