
alephdata / Memorious

Licence: MIT
Distributed crawling framework for documents and structured data.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Memorious

Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+77.42%)
Mutual labels:  scraping, crawling
Easy Scraping Tutorial
Simple but useful Python web scraping tutorial code.
Stars: ✭ 583 (+135.08%)
Mutual labels:  scraping, crawling
Dataflowkit
Extract structured data from web sites. Web sites scraping.
Stars: ✭ 456 (+83.87%)
Mutual labels:  scraping, crawling
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Stars: ✭ 198 (-20.16%)
Mutual labels:  scraping, crawling
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+16973.79%)
Mutual labels:  scraping, crawling
Sasila
A flexible and friendly crawler framework.
Stars: ✭ 286 (+15.32%)
Mutual labels:  scraping, crawling
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+1968.15%)
Mutual labels:  scraping, crawling
bots-zoo
No description or website provided.
Stars: ✭ 59 (-76.21%)
Mutual labels:  scraping, crawling
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output, based on .NET Core. This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extensible for your custom requirements. Medium link : https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-59.68%)
Mutual labels:  scraping, crawling
Grawler
Grawler is a tool written in PHP which comes with a web interface that automates the task of using Google dorks, scrapes the results, and stores them in a file.
Stars: ✭ 98 (-60.48%)
Mutual labels:  scraping, crawling
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+11.69%)
Mutual labels:  scraping, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-31.05%)
Mutual labels:  scraping, crawling
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+1171.77%)
Mutual labels:  scraping, crawling
Spidermon
Scrapy Extension for monitoring spiders execution.
Stars: ✭ 309 (+24.6%)
Mutual labels:  scraping, crawling
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (-72.58%)
Mutual labels:  scraping, crawling
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+1850.4%)
Mutual labels:  scraping, crawling
feedsearch-crawler
Crawl sites for RSS, Atom, and JSON feeds.
Stars: ✭ 23 (-90.73%)
Mutual labels:  scraping, crawling
pomp
Screen scraping and web crawling framework
Stars: ✭ 61 (-75.4%)
Mutual labels:  scraping, crawling
Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+218.15%)
Mutual labels:  scraping, crawling
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+596.77%)
Mutual labels:  scraping, crawling

Memorious
=========

The solitary and lucid spectator of a multiform, instantaneous and almost intolerably precise world.

-- `Funes the Memorious <http://users.clas.ufl.edu/burt/spaceshotsairheads/borges-funes.pdf>`_,
Jorge Luis Borges

.. image:: https://github.com/alephdata/memorious/workflows/memorious/badge.svg

memorious is a distributed web scraping toolkit. It is a lightweight tool that schedules, monitors and supports scrapers that collect structured or unstructured data. This includes the following use cases:

  • Maintain an overview of a fleet of crawlers
  • Schedule crawler execution at regular intervals
  • Store execution information and error messages
  • Distribute scraping tasks across multiple machines
  • Make crawlers modular and simple tasks reusable
  • Get out of your way as much as possible

.. image:: docs/memorious-ui.png

Design
------

When writing a scraper, you often need to paginate through an index page, then download an HTML page for each result, and finally parse that page and insert or update a record in a database.

memorious handles this by managing a set of crawlers, each of which can be composed of multiple stages. Each stage is implemented as a Python function, which can be reused across different crawlers.
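
As a rough illustration, a custom stage is just a Python function that receives the crawler context and a data dictionary from the previous stage, does its work, and emits results onward. The sketch below is hypothetical and not part of memorious itself; the ``context.http``, ``result.html`` and ``context.emit`` helpers are assumed here from the pattern described in the memorious documentation.

.. code-block:: python

    # parse_article -- a hypothetical custom stage, for illustration only.
    # A stage receives the crawler context plus the data dict emitted by
    # the previous stage, and forwards its results with context.emit().
    def parse_article(context, data):
        url = data.get("url")
        # Fetch the page; context.http wraps HTTP requests for the
        # crawler (assumed API -- check the memorious docs for details).
        result = context.http.get(url)
        page = result.html  # parsed lxml document (assumed attribute)
        record = {
            "url": url,
            "title": page.findtext(".//title"),
        }
        # Hand the record to whichever stage the crawler's YAML
        # configuration wires up under the "store" rule.
        context.emit(rule="store", data=record)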

The basic steps of writing a Memorious crawler:

  1. Create a YAML crawler configuration file (see the sketch after this list)
  2. Add different stages
  3. Write code for stage operations (optional)
  4. Test, rinse, repeat
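
As a rough sketch of step 1, a configuration file chains stages into a pipeline: each stage names the method to run and, under ``handle``, which stage consumes its output. The layout below follows the pattern in the memorious documentation, but all the specifics (``example_articles``, the seed URL, the storage path) are invented for this example.

.. code-block:: yaml

    # example_articles.yml -- hypothetical crawler configuration
    name: example_articles
    description: Collect articles from an example site
    schedule: weekly
    pipeline:
      init:
        # seed the pipeline with one or more start URLs
        method: seed
        params:
          urls:
            - https://example.com/articles
        handle:
          pass: fetch
      fetch:
        # download each URL emitted by the previous stage
        method: fetch
        handle:
          pass: store
      store:
        # write fetched results to a local directory
        method: directory
        params:
          path: /data/example_articles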

Documentation
-------------

The documentation for Memorious is available at `memorious.readthedocs.io <https://memorious.readthedocs.io/>`_. Feel free to edit the source files in the ``docs`` folder and send pull requests for improvements.

To build the documentation, run ``make html`` inside the ``docs`` folder.

You'll find the resulting HTML files in ``docs/_build/html``.
