zrashwani / Arachnid

License: MIT
Crawls all unique internal links found on a given website and extracts SEO-related information; supports JavaScript-based sites.

Projects that are alternatives of or similar to Arachnid

Sitemap Generator Cli
Creates an XML-Sitemap by crawling a given site.
Stars: ✭ 214 (-4.46%)
Mutual labels:  crawler, seo
Rendora
Dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern JavaScript websites
Stars: ✭ 1,853 (+727.23%)
Mutual labels:  crawler, seo
Webmagic
A scalable web crawler framework for Java.
Stars: ✭ 10,186 (+4447.32%)
Mutual labels:  crawler, scraping
Goose Parser
Universal scraping tool that allows you to extract data using multiple environments
Stars: ✭ 211 (-5.8%)
Mutual labels:  crawler, scraping
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-23.66%)
Mutual labels:  crawler, scraping
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library that writes output via Entity Framework Core and is built on .NET Core. It is designed along the lines of strong crawler libraries such as WebMagic and Scrapy, while remaining extensible for custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-55.36%)
Mutual labels:  crawler, scraping
Prerender Java
Java framework for prerendering
Stars: ✭ 115 (-48.66%)
Mutual labels:  crawler, seo
Newcrawler
Free Web Scraping Tool with Java
Stars: ✭ 589 (+162.95%)
Mutual labels:  crawler, scraping
Sitemap Generator Crawler
Script that generates a sitemap by crawling a given URL
Stars: ✭ 169 (-24.55%)
Mutual labels:  crawler, seo
Serpscrap
SEO Python scraper to extract data from major search engine result pages. Extracts data such as URL, title, snippet, rich snippet, and result type from search results for given keywords. Detects ads and can take automated screenshots. You can also fetch the text content of URLs found in search results or supplied by you. Useful for SEO and business-related research tasks.
Stars: ✭ 153 (-31.7%)
Mutual labels:  scraping, seo
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (+456.25%)
Mutual labels:  crawler, scraping
Googlescraper
A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
Stars: ✭ 2,363 (+954.91%)
Mutual labels:  crawler, scraping
Awesome Python Primer
An index of high-quality Chinese-language resources for self-learning Python, including books, documentation, and videos, covering crawling/scraping, web development, data analysis, and machine learning.
Stars: ✭ 57 (-74.55%)
Mutual labels:  crawler, scraping
D4n155
OWASP D4N155 - Intelligent and dynamic wordlist using OSINT
Stars: ✭ 105 (-53.12%)
Mutual labels:  crawler, scraping
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+252.23%)
Mutual labels:  crawler, scraping
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+18803.13%)
Mutual labels:  crawler, scraping
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+2189.73%)
Mutual labels:  crawler, scraping
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+160.27%)
Mutual labels:  crawler, scraping
Ngmeta
Dynamic meta tags in your AngularJS single page application
Stars: ✭ 152 (-32.14%)
Mutual labels:  crawler, seo
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Stars: ✭ 198 (-11.61%)
Mutual labels:  crawler, scraping

Arachnid Web Crawler

This library will crawl all unique internal links found on a given website up to a specified maximum page depth.

This library uses the symfony/panther and FriendsOfPHP/Goutte libraries to scrape site pages and extract the main SEO-related information, including: title, h1 elements, h2 elements, statusCode, contentType, meta description, meta keywords, and canonicalLink.

This library is based on the original blog post by Zeid Rashwani here:

http://zrashwani.com/simple-web-spider-php-goutte

Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.

How to Install

You can install this library with Composer. Drop this into your composer.json manifest file:

{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}

Then run composer install. Alternatively, composer require zrashwani/arachnid:dev-master adds the entry and installs it in one step.

Getting Started

Basic Usage:

Here's a quick demo to crawl a website:

    <?php
    require 'vendor/autoload.php';

    $url = 'http://www.example.com';
    $linkDepth = 3;
    // Initiate the crawl; by default it uses the HTTP client adapter (GoutteClient)
    $crawler = new \Arachnid\Crawler($url, $linkDepth);
    $crawler->traverse();

    // Get link data
    $links = $crawler->getLinksArray(); // to get links as objects, use the getLinks() method
    print_r($links);
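
As a rough sketch of working with the result: each entry carries the SEO-related fields listed above (title, statusCode, and so on), so broken pages can be reported with a simple loop. The array keying and exact field names below are assumptions for illustration; check the structure returned by your version:

    <?php
    // illustrative sketch: report pages whose crawl produced a non-200 status,
    // assuming the array is keyed by URI and each entry exposes statusCode
    foreach ($links as $uri => $info) {
        if (isset($info['statusCode']) && $info['statusCode'] !== 200) {
            echo $uri.' returned status '.$info['statusCode'].PHP_EOL;
        }
    }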

Enabling Headless Browser mode:

Headless browser mode can be enabled so that the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites.

The enableHeadlessBrowserMode method sets the scraping adapter to PantherChromeAdapter, which is based on the Symfony Panther library:

    $crawler = new \Arachnid\Crawler($url, $linkDepth);
    $crawler->enableHeadlessBrowserMode()
            ->traverse()
            ->getLinksArray();

In order to use this, you need to have chromedriver installed on your machine; for example, install it through your OS package manager, or use Panther's companion dbrekelmans/bdi tool to detect and download a driver matching your Chrome version.

Advanced Usage:

Set additional options on the underlying HTTP client, either by passing an array of options to the constructor or by creating an HTTP client scraper with the desired options:

    <?php
    use \Arachnid\Adapters\CrawlingFactory;

    // the third parameter is an array of options used to configure the http client
    $clientOptions = ['auth_basic' => array('username', 'password')];
    $crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);

    // or create the scrape client explicitly and set it on the crawler
    $options = array(
        'verify_host' => false,
        'verify_peer' => false,
        'timeout' => 30,
    );
    $scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
    $crawler->setScrapClient($scrapperClient);
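
Since these options are forwarded to the underlying HTTP client, other client settings can be passed the same way. For example, a sketch that sets a custom User-Agent header and a proxy (assuming the default Symfony HttpClient adapter, which accepts these option keys; the header value and proxy address are placeholders):

    <?php
    // a sketch: extra Symfony HttpClient options forwarded through the constructor
    $options = [
        'headers' => ['User-Agent' => 'ArachnidCrawler/1.0'], // hypothetical UA string
        'proxy'   => 'http://localhost:8080',                 // hypothetical proxy address
    ];
    $crawler = new \Arachnid\Crawler('http://github.com', 2, $options);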

You can inject a PSR-3 compliant logger object (such as Monolog) to monitor crawler activity:

    <?php
    $crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler

    // set logger for crawler activity (compatible with PSR-3)
    $logger = new \Monolog\Logger('crawler logger');
    $logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
    $crawler->setLogger($logger);

You can set the crawler to visit only pages matching specific criteria by passing a callback closure to the filterLinks method:

    <?php
    // filter links according to a specific callback closure
    $links = $crawler->filterLinks(function($link) {
                        // crawl only links containing /blog
                        return (bool)preg_match('/.*\/blog.*$/u', $link); 
                    })
                    ->traverse()
                    ->getLinks();
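
The same hook can express other criteria. For instance, a sketch that skips any URL carrying a query string:

    <?php
    // a sketch: exclude links with a query string, using the same filterLinks hook
    $links = $crawler->filterLinks(function ($link) {
                        return strpos($link, '?') === false;
                    })
                    ->traverse()
                    ->getLinks();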

You can use the LinksCollection class to get simple statistics about the links, as follows:

    <?php
    $links = $crawler->traverse()
                     ->getLinks();
    $collection = new \Arachnid\LinksCollection($links);

    // getting broken links
    $brokenLinks = $collection->getBrokenLinks();

    // getting links for a specific depth
    $depth2Links = $collection->getByDepth(2);

    // getting external links found in the site
    $externalLinks = $collection->getExternalLinks();
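
These helpers make quick reports straightforward. For example, a sketch that prints each broken link, assuming the collection returned by getBrokenLinks() can be iterated directly:

    <?php
    // a sketch: list the broken links gathered above, assuming the
    // returned collection is iterable and keyed by URI
    foreach ($brokenLinks as $uri => $linkInfo) {
        echo 'broken: '.$uri.PHP_EOL;
    }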

How to Contribute

  1. Fork this repository
  2. Create a new branch for each feature or improvement
  3. Apply your code changes along with corresponding unit test
  4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 standard.

System Requirements

  • PHP 7.1.0+

Authors

  • Zeid Rashwani (http://zrashwani.com)
  • Josh Lockhart

License

MIT Public License
