SimFin / pdf-crawler

Licence: MIT
SimFin's open source PDF crawler

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to, or similar to, pdf-crawler

Webdrivermanager
WebDriverManager (Copyright © 2015-2021) is a project created and maintained by Boni Garcia and licensed under the terms of the Apache 2.0 License.
Stars: ✭ 1,808 (+1708%)
Mutual labels:  selenium-webdriver, geckodriver
bots-zoo
No description or website provided.
Stars: ✭ 59 (-41%)
Mutual labels:  crawling, puppeteer
puppet-master
Puppeteer as a service hosted on Saasify.
Stars: ✭ 25 (-75%)
Mutual labels:  crawling, puppeteer
page-modeller
⚙️ Browser DevTools extension for modelling web pages for automation.
Stars: ✭ 66 (-34%)
Mutual labels:  selenium-webdriver, puppeteer
Cdp4j
cdp4j - Chrome DevTools Protocol for Java
Stars: ✭ 232 (+132%)
Mutual labels:  crawling, selenium-webdriver
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+71%)
Mutual labels:  crawling, puppeteer
SlackWebhooksGithubCrawler
Search for Slack Webhooks token publicly exposed on Github
Stars: ✭ 21 (-79%)
Mutual labels:  crawling, puppeteer
Marionette
Selenium alternative for Crystal. Browser manipulation without the Java overhead.
Stars: ✭ 119 (+19%)
Mutual labels:  selenium-webdriver, puppeteer
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+5029%)
Mutual labels:  crawling, puppeteer
Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+264%)
Mutual labels:  crawling, puppeteer
tees
Universal test framework for front-end with WebDriver, Puppeteer and Enzyme
Stars: ✭ 23 (-77%)
Mutual labels:  selenium-webdriver, puppeteer
Squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
Stars: ✭ 125 (+25%)
Mutual labels:  crawling, puppeteer
double-agent
A test suite of common scraper detection techniques. See how detectable your scraper stack is.
Stars: ✭ 123 (+23%)
Mutual labels:  crawling, puppeteer
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+3054%)
Mutual labels:  crawling, puppeteer
Awesome Puppeteer
A curated list of awesome puppeteer resources.
Stars: ✭ 1,728 (+1628%)
Mutual labels:  crawling, puppeteer
Instagram Bot
An Instagram bot developed using the Selenium Framework
Stars: ✭ 138 (+38%)
Mutual labels:  crawling, selenium-webdriver
Nutch
Apache Nutch is an extensible and scalable web crawler
Stars: ✭ 2,277 (+2177%)
Mutual labels:  crawling
web-scraping
Web Scraping using puppeteer
Stars: ✭ 21 (-79%)
Mutual labels:  puppeteer
N2h4
A tool for collecting Naver news articles
Stars: ✭ 177 (+77%)
Mutual labels:  crawling
clusteer
Clusteer is a Puppeteer wrapper written for Laravel, with the super-power of parallelizing pages across multiple browser instances.
Stars: ✭ 81 (-19%)
Mutual labels:  puppeteer

PDF Crawler

This is SimFin's open source PDF crawler. It can be used to crawl all PDFs from a website.

You specify a starting page, and all pages linked from that page are crawled. Links that lead to pages on other domains are ignored, but PDFs linked from a crawled page are still fetched even when they are hosted on a different domain.
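In pseudocode, that link-filtering rule looks roughly like this (should_fetch and is_pdf_link are illustrative names, not the crawler's actual internals):

from urllib.parse import urlparse

def is_pdf_link(url):
    # Simplification: the real crawler may also inspect Content-Type headers.
    return urlparse(url).path.lower().endswith(".pdf")

def should_fetch(link, start_url):
    # Follow ordinary pages only on the starting domain,
    # but download PDFs regardless of where they are hosted.
    same_domain = urlparse(link).netloc == urlparse(start_url).netloc
    return same_domain or is_pdf_link(link)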

It can also crawl files "hidden" behind JavaScript: the crawler renders the page and clicks on all clickable elements so that links added dynamically become visible.
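The idea behind that rendering step, sketched with Selenium and geckodriver (a simplified illustration, not the crawler's actual code; the geckodriver path is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

options = Options()
options.add_argument("-headless")
driver = webdriver.Firefox(service=Service("/path/to/geckodriver"), options=options)
driver.get("https://simfin.com/crawlingtest/")

# Click everything that might reveal new links (the real crawler also
# has to handle navigation, stale elements, and repeated visits).
for element in driver.find_elements(By.CSS_SELECTOR, "a, button"):
    try:
        element.click()
    except Exception:
        pass  # element may be invisible or not clickable

links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
driver.quit()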

Built-in proxy support.

We use this crawler to gather PDFs from company websites to find financial reports that are then uploaded to SimFin, but it can be used for other documents too.

Development

How to install pdf-crawler for development.

$ git clone https://github.com/SimFin/pdf-crawler.git
$ cd pdf-crawler

# Make a virtual environment with the tool of your choice. Please use Python version 3.6+.
# Here is an example based on pyenv:
$ pyenv virtualenv 3.6.6 pdf-crawler
$ pyenv activate pdf-crawler

$ pip install -e .
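To check that the editable install worked, you can try importing the package (it is imported as crawler, as shown in the usage example below):

$ python -c "import crawler; print(crawler.__file__)"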

Usage Example

After having installed pdf-crawler as described in the "Development" section, you can import the crawler module and use it like so:

import crawler

crawler.crawl(url="https://simfin.com/crawlingtest/", output_dir="crawling_test", method="rendered-all")

Parameters

  • url - the URL to crawl
  • output_dir - the directory where the downloaded files should be saved
  • method - the crawling method; one of three values: normal (plain HTML crawling), rendered (renders the HTML page, so that front-end SPA frameworks like Angular, Vue etc. are read properly) and rendered-all (renders the HTML page and clicks on every clickable element, such as buttons, to reveal links that would otherwise stay hidden)
  • depth - how "deep" to crawl: the number of sub-page levels the crawler descends before it stops. Default is 2.
  • gecko_path - if you choose the crawling method rendered-all, you have to install geckodriver, the driver that controls Firefox in headless mode. Specify the path to the downloaded executable here (see the example below).
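A full call combining these parameters might look like this (the depth value and the gecko_path are placeholders; parameter names are taken from the list above):

import crawler

crawler.crawl(
    url="https://simfin.com/crawlingtest/",   # starting page
    output_dir="crawling_test",               # where downloaded PDFs end up
    method="rendered-all",                    # render pages and click clickable elements
    depth=3,                                  # descend three sub-page levels (default: 2)
    gecko_path="/usr/local/bin/geckodriver",  # placeholder path to your geckodriver
)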

License

Available under the MIT license.

Credits

@gwaramadze, @q7v6rhgfzc8tnj3d, @thf24