oltarasenko / Crawly

License: Apache-2.0
Crawly, a high-level web crawling & scraping framework for Elixir.

Programming Languages

elixir
erlang

Projects that are alternatives of or similar to Crawly

Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+3430.68%)
Mutual labels:  crawler, spider, scraper, scraping, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-61.14%)
Mutual labels:  crawler, spider, scraper, scraping, crawling
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+999.32%)
Mutual labels:  crawler, scraper, scraping, crawling
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (-88.18%)
Mutual labels:  scraper, spider, scraping, crawling
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+1065.68%)
Mutual labels:  crawler, scraper, scraping, crawling
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+79.32%)
Mutual labels:  crawler, scraper, scraping, crawling
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (+183.18%)
Mutual labels:  crawler, spider, scraper, scraping
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-37.05%)
Mutual labels:  crawler, spider, scraping, crawling
bots-zoo
No description or website provided.
Stars: ✭ 59 (-86.59%)
Mutual labels:  crawler, scraper, scraping, crawling
Gosint
OSINT Swiss Army Knife
Stars: ✭ 401 (-8.86%)
Mutual labels:  crawler, spider, scraper
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (-88.41%)
Mutual labels:  scraper, scraping, crawling
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+826.59%)
Mutual labels:  crawler, scraper, scraping
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (-87.95%)
Mutual labels:  scraper, scraping, crawling
crawler-chrome-extensions
Chrome extensions commonly used by crawler engineers
Stars: ✭ 53 (-87.95%)
Mutual labels:  scraper, spider, scraping
Goose Parser
Universal scraping tool that allows you to extract data using multiple environments
Stars: ✭ 211 (-52.05%)
Mutual labels:  crawler, scraper, scraping
scrapy facebooker
Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.
Stars: ✭ 22 (-95%)
Mutual labels:  scraper, spider, scraping
Xcrawler
A fast, concise, and powerful PHP crawler framework
Stars: ✭ 344 (-21.82%)
Mutual labels:  crawler, spider, scraper
scrapy-distributed
A series of distributed components for Scrapy. Including RabbitMQ-based components, Kafka-based components, and RedisBloom-based components for Scrapy.
Stars: ✭ 38 (-91.36%)
Mutual labels:  spider, scraping, crawling
Freshonions Torscraper
Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi.onion
Stars: ✭ 348 (-20.91%)
Mutual labels:  crawler, spider, scraper
Webster
A reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (-17.27%)
Mutual labels:  crawler, spider, crawling

Crawly


Overview

Crawly is an application framework for crawling websites and extracting structured data that can be used for a wide range of applications, such as data mining, information processing, or historical archival.

Requirements

  1. Elixir "~> 1.10"
  2. Works on Linux, Windows, OS X and BSD

Quickstart

  1. Add Crawly as a dependency:

    # mix.exs
    defp deps do
        [
          {:crawly, "~> 0.13.0"},
          {:floki, "~> 0.26.0"}
        ]
    end
    
  2. Fetch dependencies: $ mix deps.get

  3. Create a spider (a sketch for trying it interactively follows this list)

    # lib/crawly_example/esl_spider.ex
    defmodule EslSpider do
      use Crawly.Spider
      
      alias Crawly.Utils
    
      @impl Crawly.Spider
      def base_url(), do: "https://www.erlang-solutions.com"
    
      @impl Crawly.Spider
      def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog/"]]
    
      @impl Crawly.Spider
      def parse_item(response) do
        {:ok, document} = Floki.parse_document(response.body)
        hrefs = document |> Floki.find("a.btn-link") |> Floki.attribute("href")
    
        requests =
          Utils.build_absolute_urls(hrefs, base_url())
          |> Utils.requests_from_urls()
    
        title = document |> Floki.find("h1.page-title-sm") |> Floki.text()
    
        %{
          requests: requests,
          items: [%{title: title, url: response.request_url}]
        }
      end
    end
    
  4. Configure Crawly

    • By default, Crawly does not require any configuration, but you will likely want one to fine-tune your crawls (a sketch of a custom pipeline follows this list):
    # in config.exs
    config :crawly,
      closespider_timeout: 10,
      concurrent_requests_per_domain: 8,
      middlewares: [
        Crawly.Middlewares.DomainFilter,
        Crawly.Middlewares.UniqueRequest,
        {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot"]}
      ],
      pipelines: [
        {Crawly.Pipelines.Validate, fields: [:url, :title]},
        {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
        Crawly.Pipelines.JSONEncoder,
        {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
      ]
    
  5. Start the crawl:

    • $ iex -S mix
    • iex(1)> Crawly.Engine.start_spider(EslSpider)
  6. Results can be seen with: $ cat /tmp/EslSpider.jl (a sketch for inspecting this output follows below)
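
While developing a spider (step 3), it can help to exercise parse_item/1 against a single live page from IEx. A minimal sketch using Crawly.fetch/1, the helper Crawly provides for spider development:

    # in an IEx session ($ iex -S mix): fetch one page through the
    # configured fetcher, then run the spider callback on it directly
    response = Crawly.fetch("https://www.erlang-solutions.com/blog/")
    EslSpider.parse_item(response)
    # => %{requests: [...], items: [%{title: "...", url: "..."}]}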
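
The pipelines configured in step 4 all implement the Crawly.Pipeline behaviour, so project-specific processing can be added the same way. A minimal sketch of a custom pipeline (the module name and timestamp field are illustrative, not part of Crawly):

    defmodule MyApp.Pipelines.AddTimestamp do
      @moduledoc "Illustrative pipeline: stamps each item before encoding."
      @behaviour Crawly.Pipeline

      @impl Crawly.Pipeline
      def run(item, state, _opts \\ []) do
        # returning {item, state} passes the item on; {false, state} drops it
        {Map.put(item, :scraped_at, DateTime.to_iso8601(DateTime.utc_now())), state}
      end
    end

Listing the module under pipelines: before Crawly.Pipelines.JSONEncoder would add the field to every encoded item.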
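
Since each line of /tmp/EslSpider.jl is a standalone JSON object (step 6), the results can also be inspected from Elixir. A minimal sketch, assuming a JSON decoder such as Jason is added to the project's dependencies:

    # stream the JSON-lines output, decoding one item per line
    # (Jason is an assumed extra dependency; Crawly does not pull it in)
    "/tmp/EslSpider.jl"
    |> File.stream!()
    |> Stream.map(&Jason.decode!/1)
    |> Enum.take(5)
    |> Enum.each(&IO.inspect/1)

A running crawl can be stopped at any point with Crawly.Engine.stop_spider(EslSpider).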

Need more help?

I have created a public Telegram channel, so it's now possible to stay connected, ask questions, and get answers faster!

Please join me on: https://t.me/crawlyelixir

Browser rendering

Crawly can be configured so that all fetched pages are browser rendered, which can be very useful if you need to extract data from pages that have lots of asynchronous elements (for example, parts loaded by AJAX).

You can read more in the Crawly documentation on HexDocs.
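
For example, Crawly ships with a Splash-based fetcher. A minimal sketch of routing all requests through a local Splash instance (the base_url below assumes Splash's default render endpoint on localhost:8050):

    # in config.exs: fetch every page through Splash so JavaScript-rendered
    # content is already present in response.body when parse_item/1 runs
    config :crawly,
      fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}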

Experimental UI

The CrawlyUI project is an add-on that aims to provide an interface for managing and rapidly developing spiders.

Check out the code from GitHub or try it online at CrawlyUIDemo.

See more at Experimental UI

Roadmap

  1. [x] Pluggable HTTP client
  2. [x] Retries support
  3. [x] Cookies support
  4. [x] XPath support (can be done with Meeseeks; see the sketch after this list)
  5. [ ] Project generators (spiders)
  6. [ ] UI for jobs management
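
While the Quickstart uses Floki's CSS selectors, XPath queries work through the Meeseeks library mentioned above. A minimal sketch, assuming Meeseeks has been added to the project's dependencies (the module name and selector are illustrative):

    defmodule XpathExample do
      import Meeseeks.XPath

      # pull href attributes out of raw HTML with an XPath selector,
      # an alternative to the Floki CSS selectors used in the Quickstart
      def extract_links(html) do
        html
        |> Meeseeks.all(xpath("//a[@class='btn-link']"))
        |> Enum.map(&Meeseeks.attr(&1, "href"))
      end
    end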

Articles

  1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html
  2. Blog post about using Crawly inside a machine learning project with Tensorflow (Tensorflex): https://www.erlang-solutions.com/blog/how-to-build-a-machine-learning-project-in-elixir.html
  3. Web scraping with Crawly and Elixir. Browser rendering: https://medium.com/@oltarasenko/web-scraping-with-elixir-and-crawly-browser-rendering-afcaacf954e8
  4. Web scraping with Elixir and Crawly. Extracting data behind authentication: https://oltarasenko.medium.com/web-scraping-with-elixir-and-crawly-extracting-data-behind-authentication-a52584e9cf13
  5. What is web scraping, and why you might want to use it?
  6. Using Elixir and Crawly for price monitoring

Example projects

  1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
  2. E-commerce websites: https://github.com/oltarasenko/products-advisor
  3. Car shops: https://github.com/oltarasenko/crawly-cars
  4. JavaScript based website (Splash example): https://github.com/oltarasenko/autosites

Contributors

We would gladly accept your contributions!

Documentation

Please find the documentation on HexDocs.

Production usages

Using Crawly in production? Please let us know about your use case!
