heeplr / document-dl

License: Unlicense
Command line program to download documents from web portals.

Programming Languages

  • python
  • shell

Projects that are alternatives to or similar to document-dl

scrapman
Retrieve real (JavaScript-executed) HTML code from a URL, ultra fast, with support for loading multiple pages in parallel
Stars: ✭ 21 (+50%)
Mutual labels:  scraper, scraping, scraping-websites
gochanges
**[ARCHIVED]** website changes tracker 🔍
Stars: ✭ 12 (-14.29%)
Mutual labels:  scraper, scraping, scraping-websites
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+34450%)
Mutual labels:  scraper, scraping, scraping-websites
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+264.29%)
Mutual labels:  scraper, scraping, scraping-websites
Instagram-to-discord
Monitor an Instagram user account and automatically post new images to a Discord channel via a webhook. Working as of 2022!
Stars: ✭ 113 (+707.14%)
Mutual labels:  scraper, scraping, scraping-websites
crawler-chrome-extensions
Chrome extensions commonly used by crawler engineers
Stars: ✭ 53 (+278.57%)
Mutual labels:  scraper, scraping
scrapers
scrapers for building your own image databases
Stars: ✭ 46 (+228.57%)
Mutual labels:  scraper, scraping
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (+278.57%)
Mutual labels:  scraper, scraping
ha-multiscrape
Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.
Stars: ✭ 103 (+635.71%)
Mutual labels:  scraper, scraping
google-scraper
This class can retrieve search results from Google.
Stars: ✭ 33 (+135.71%)
Mutual labels:  scraper, scraping
reason-rust-scraper
🦀 Scraping & crawling websites using Rust, and ReasonML
Stars: ✭ 21 (+50%)
Mutual labels:  scraping, scraping-websites
LeetCode
Currently contains scraped data from around 1500 problems on the site. More to follow.
Stars: ✭ 45 (+221.43%)
Mutual labels:  scraper, scraping-websites
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+271.43%)
Mutual labels:  scraper, scraping
readability-cli
A CLI for Mozilla Readability. Get clean, uncluttered, ready-to-read HTML from any webpage!
Stars: ✭ 41 (+192.86%)
Mutual labels:  scraping, scraping-websites
Pahe.ph-Scraper
Pahe.ph [Pahe.in] Movies Website Scraper
Stars: ✭ 57 (+307.14%)
Mutual labels:  scraper, scraping
copycat
A PHP Scraping Class
Stars: ✭ 70 (+400%)
Mutual labels:  scraper, scraping
scavenger
Scrape and take screenshots of dynamic and static webpages
Stars: ✭ 14 (+0%)
Mutual labels:  scraping, scraping-websites
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+1607.14%)
Mutual labels:  scraper, scraping
TradeTheEvent
Implementation of "Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading." In Findings of ACL2021
Stars: ✭ 64 (+357.14%)
Mutual labels:  scraper, scraping-websites
angel.co-companies-list-scraping
No description or website provided.
Stars: ✭ 54 (+285.71%)
Mutual labels:  scraper, scraping

command line document download made easy


Just as youtube-dl downloads videos from various websites, document-dl downloads documents like invoices, messages, reports, etc.

It can save you from regularly logging into your account to download new documents.

Websites that don't require any form of 2FA can be polled regularly and without interaction, e.g. from a cron job, so new documents are downloaded automatically (see the crontab sketch below).
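
For example, a crontab entry like the following sketch (schedule, paths and credentials are placeholders; documents are saved to the current working directory, so change into the target directory first) would fetch new o2 documents every morning:

0 6 * * * cd "$HOME/documents" && DOCDL_USERNAME='mylogin' DOCDL_PASSWORD='mypass' document-dl --action download o2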


Highlights

  • list available documents in json format or download them
  • filter documents using
    • string matching
    • regular expressions or
    • jq queries
  • display captcha or QR codes for interactive input
  • writing new plugins is easy
  • existing plugins (some of them even work):
    • amazon
    • ing.de
    • handyvertrag.de
    • dkb.de
    • o2.de
    • kabel.vodafone.de
    • conrad.de
    • elster.de
    • strato.de




Installation (for Debian Bullseye)

$ apt install git python3-dev python3-pip python3-selenium chromium-chromedriver
$ pip3 install --user git+https://github.com/heeplr/document-dl.git

or for developers:

$ git clone --recursive https://github.com/heeplr/document-dl
$ cd document-dl
$ pip install --user --editable .



Usage

Display Help:

$ document-dl -h
Usage: document-dl [OPTIONS] COMMAND [ARGS]...

  download documents from web portals

Options:
  -u, --username TEXT             login id  [env var: DOCDL_USERNAME]
  -p, --password TEXT             secret password  [env var: DOCDL_PASSWORD]
  -m, --match <ATTRIBUTE PATTERN>...
                                  only output documents where attribute
                                  contains pattern string  [env var:
                                  DOCDL_STRING_MATCHES]
  -r, --regex <ATTRIBUTE REGEX>...
                                  only output documents where attribute value
                                  matches regex  [env var:
                                  DOCDL_REGEX_MATCHES]
  -j, --jq JQ_EXPRESSION          only output documents if json query matches
                                  document's attributes (see
                                  https://stedolan.github.io/jq/manual/ )
                                  [env var: DOCDL_JQ_MATCHES]
  -H, --headless / --show         show/hide browser window  [env var:
                                  DOCDL_HEADLESS; default: headless]
  -b, --browser [chrome|edge|firefox|ie|opera|safari|webkitgtk]
                                  webdriver to use for selenium based plugins
                                  [env var: DOCDL_BROWSER; default: chrome]
  -t, --timeout INTEGER           seconds to wait for data before terminating
                                  connection  [env var: DOCDL_TIMEOUT;
                                  default: 15]
  -i, --image-loading BOOLEAN     Turn off image loading when False  [env var:
                                  DOCDL_IMAGE_LOADING; default: False]
  -a, --action [download|list]    download or just list documents  [env var:
                                  DOCDL_ACTION; default: list]
  -f, --format [list|dicts]       choose between line buffered output of json
                                  dicts or single json list  [env var:
                                  DOCDL_OUTPUT_FORMAT; default: dicts]
  -h, --help                      Show this message and exit.

Commands:
  amazon        Amazon (invoices)
  conrad        conrad.de (invoices)
  dkb           dkb.de with chipTAN QR (postbox)
  elster        elster.de with path to .pfx certfile as username (postbox)
  handyvertrag  service.handyvertrag.de (invoices, call record)
  ing           banking.ing.de with photoTAN (postbox)
  o2            o2online.de (invoices, call record, postbox)
  strato        strato.de (invoices)
  vodafone      kabel.vodafone.de (postbox, invoices)

Display plugin-specific help (note: there is currently a bug in click that prompts for username and password before displaying the help):

$ document-dl ing --help
Usage: document-dl ing [OPTIONS]

  banking.ing.de with photoTAN (postbox)

Options:
  -k, --diba-key TEXT  DiBa Key  [env var: DOCDL_DIBA_KEY]
  -h, --help           Show this message and exit.



Examples

List all documents from vodafone.de, prompt for username/password:

$ document-dl vodafone

Same, but show browser window this time:

$ document-dl --show vodafone

Download all documents from conrad.de, pass credentials as commandline arguments:

$ document-dl --username mylogin --password mypass --action download conrad

Download all documents from conrad.de, pass credentials as env vars:

$ DOCDL_USERNAME='mylogin' DOCDL_PASSWORD='mypass' document-dl --action download conrad

Download all documents from o2online.de where "doctype" attribute contains "BILL":

$ document-dl --match doctype BILL --action download o2

You can also use regular expressions to filter documents:

$ document-dl --regex date '^(2021-04|2021-05).*$' o2

List all documents from o2online.de where year >= 2019:

$ document-dl --jq 'select(.year >= 2019)' o2

Download document from elster.de with id == 15:

$ document-dl --jq 'contains({id: 15})' --action download elster
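
Since documents are printed as JSON, the output can be post-processed with external tools. For example, assuming the default line-buffered dicts format and a "title" attribute (attribute names vary per plugin), print just the titles of all vodafone documents:

$ document-dl vodafone | jq -r '.title'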



Security

BEWARE that your login credentials will most probably end up in your shell history when you pass them as command line arguments. Use the interactive input prompt to avoid that, or set environment variables securely (see the sketch below).
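
One way to set environment variables securely is to pull the secret from a password manager, so only the lookup command ends up in your history. A sketch, assuming the pass password manager with a hypothetical "conrad" entry:

$ DOCDL_PASSWORD="$(pass conrad)" document-dl --username mylogin --action download conrad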



Writing a plugin

Plugins are click-plugins, which in turn are normal @click.command commands registered in setup.py.

Roughly, you have to:

  • put your plugin into "docdl/plugins/myplugin.py"
  • write your plugin class, e.g. MyPlugin():
    • if you just need python requests, inherit from docdl.WebPortal and use self.session that's initialized for you
    • if you need selenium, inherit from docdl.SeleniumWebPortal and use self.webdriver that's initialized for you
    • add a login(), logout() and documents() method.
  • add click glue code
  • add your plugin to setup.py docdl_plugins registry

requests plugin example

import click

import docdl
import docdl.util

class MyPlugin(docdl.WebPortal):

    URL_LOGIN = "https://myservice.com/login"
    URL_LOGOUT = "https://myservice.com/logout"

    def login(self):
        # maybe load some session cookie
        request = self.session.get(self.URL_LOGIN)
        # authenticate
        request = self.session.post(
            self.URL_LOGIN,
            data={ 'username': self.username, 'password': self.password }
        )
        # return False if login failed, True otherwise
        return request.ok

    def logout(self):
        self.session.get(self.URL_LOGOUT)

    def documents(self):
        # acquire list of documents
        # ...

        # iterate over all available documents
        for count, document in enumerate(all_documents):

            # scrape:
            #  * document attributes
            #    * it's recommended to assign an incremental "id"
            #      attribute to every document
            #    * if you set a "filename" attribute, it will be used to
            #      rename the downloaded file
            #    * dates should be parsed to datetime.datetime objects
            #      docdl.util.parse_date() should parse the most common strings
            #
            # also you must scrape either:
            #  * the download URL
            #
            # or (for SeleniumWebPortal plugins):
            #  * the DOM element that triggers download. It is expected
            #    that the download starts immediately after click() on
            #    the DOM element
            # or implement a custom download() method

            yield docdl.Document(
                url = this_documents_url,
                # download_element = <some selenium element to click>
                attributes = {
                    "id": count,
                    "category": "invoices",
                    "title": this_documents_title,
                    "filename": this_documents_target_filename,
                    "date": docdl.util.parse_date(some_date_string)
                }
            )


    def download(self, document):
        """you shouldn't need this for most web portals"""
        # ... save file to os.getcwd(), remembering its name as "filename" ...
        return self.rename_after_download(document, filename)


@click.command()
@click.pass_context
def myplugin(ctx):
    """plugin description (what, documents, are, scraped)"""
    docdl.cli.run(ctx, MyPlugin)

selenium plugin example

TBD
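
Until this section is written, here is a rough sketch of the expected shape, pieced together from the requests example and the notes above. All element selectors are hypothetical, and the exact docdl.SeleniumWebPortal details may differ:

import docdl
import docdl.util
from selenium.webdriver.common.by import By

class MySeleniumPlugin(docdl.SeleniumWebPortal):

    URL_LOGIN = "https://myservice.com/login"

    def login(self):
        self.webdriver.get(self.URL_LOGIN)
        # fill the login form (element IDs are hypothetical)
        self.webdriver.find_element(By.ID, "username").send_keys(self.username)
        self.webdriver.find_element(By.ID, "password").send_keys(self.password)
        self.webdriver.find_element(By.ID, "submit").click()
        # return False if login failed, True otherwise
        return "login" not in self.webdriver.current_url

    def logout(self):
        self.webdriver.find_element(By.ID, "logout").click()

    def documents(self):
        # iterate over a (hypothetical) list of document rows
        for count, row in enumerate(
            self.webdriver.find_elements(By.CSS_SELECTOR, ".document-row")
        ):
            yield docdl.Document(
                # the download is expected to start immediately
                # after click() on this element
                download_element=row.find_element(By.CSS_SELECTOR, "a.download"),
                attributes={
                    "id": count,
                    "title": row.find_element(By.CSS_SELECTOR, ".title").text,
                    "date": docdl.util.parse_date(
                        row.find_element(By.CSS_SELECTOR, ".date").text
                    )
                }
            )

The click glue code and the setup.py registration work exactly as in the requests example.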

register plugin

...in setup.py:

# ...
setup(
    # ...
    packages=find_packages(),
    # ...
    entry_points={
        'docdl_plugins': [
            # ...
            'myplugin=docdl.plugins.myplugin:myplugin',
            # ...
        ],
        # ...
    }
)
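
After reinstalling the package (e.g. by running pip install --user --editable . again), the new plugin should show up as a document-dl subcommand:

$ document-dl myplugin --help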



Bugs

document-dl is still in a very early state of development and a lot of things don't work yet. In particular, a ton of edge cases still need to be covered. If you find a bug, please open an issue or send a pull request.

  • --browser settings other than chrome probably don't work unless you help to test them
  • some services offer more documents/data than currently scraped



TODO

  • logging
  • better documentation
  • properly parse RFC 6266 (Content-Disposition)
  • delete action