
seanywang0408 / Crawling-CV-Conference-Papers

Licence: Apache-2.0 License
Crawling CV conference papers with Python.


Crawling-CV-Conference-Papers

News

  • 2021-12-07 - Support ICML-2021 and SIGGRAPH-2021

Set the local download directory in download.ipynb and download_siggraph.ipynb, then run them! Or you can directly download the pdf files from my OneDrive link.

I have also switched from the Chrome driver to the Edge driver and fixed the bugs in download_siggraph.ipynb.

  • 2021-11-25 - Support NeurIPS-2021

Set the local download directory in download_neurips2021.ipynb and run it! Or you can directly download the pdf files from my OneDrive link.

  • 2021-10-13 - Support ICCV-2021

Set the local download directory in download_iccv2021.py and run it! Or you can directly download the pdf files from my OneDrive link.

  • 2021-06-21 - Important! Direct download link available!

To lower the barrier for those who do not want to mess with code and git, a direct OneDrive download link for recent CV/DL conference papers is provided! Click and check! (Downloading papers from older conferences still requires manually running the code.)

  • 2021-06-21 - Support CVPR-2021

Download all CVPR-2021 papers in one click: just set the local download directory in download_cvpr2021.py and run it! Don't forget to have your Chrome driver ready (i.e., the version corresponding to your Chrome browser).

  • 2021-06-20 - Support resuming a download from the point where the program was interrupted (prevents re-downloading from scratch).
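The resume behavior could be sketched like this (a hypothetical helper for illustration, not the project's exact bookkeeping):

```python
import os

def needs_download(title: str, root: str) -> bool:
    """Skip papers whose pdf already exists on disk, so an interrupted
    run can resume without re-downloading everything from scratch."""
    return not os.path.exists(os.path.join(root, title + '.pdf'))
```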

Introduction

Python code to crawl computer vision papers from top CV conferences. Currently it supports CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR, and SIGGRAPH. It leverages Selenium, a website testing framework, to crawl the paper titles and pdf urls from each conference website, then downloads the papers one by one with some simple anti-anti-crawler tricks.
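The download step could be sketched roughly as follows (hypothetical helper names; the browser-like User-Agent header and the random pause between requests stand in for the "simple anti-anti-crawler tricks"):

```python
import os
import random
import re
import time
import urllib.request

def safe_filename(title: str) -> str:
    """Turn a paper title into a filesystem-safe pdf file name."""
    return re.sub(r'[\\/:*?"<>|]+', '_', title).strip() + '.pdf'

def download_papers(titles, pdf_urls, root='./papers'):
    """Download each pdf with a browser-like User-Agent, pausing a
    random interval between requests to avoid trivial crawler blocking."""
    os.makedirs(root, exist_ok=True)
    for title, url in zip(titles, pdf_urls):
        path = os.path.join(root, safe_filename(title))
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(req) as resp, open(path, 'wb') as f:
            f.write(resp.read())
        time.sleep(random.uniform(1.0, 3.0))  # be gentle with the server
```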

Crawling older conferences is not guaranteed to be bug-free, since this project is based on the newest website structures.

It is recommended to use this project together with Mendeley; you will get a juicy academic corpus.

Currently only single-thread downloading is implemented, so downloading thousands of papers is slow (it takes several hours). It is suggested that you run the script before bed, and it will be finished when you get back to work :)

Multi-thread downloading will be coming soon!
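In the meantime, a possible multi-thread extension could look like this (a sketch using only the standard library; `fetch` is a stand-in for whatever per-paper download function you use, and is not part of the project):

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(tasks, fetch, max_workers=4):
    """Download papers concurrently. `tasks` is a list of (title, url)
    pairs; `fetch` is a callable taking (title, url). Failed titles are
    collected and returned instead of aborting the whole run."""
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, t, u): t for t, u in tasks}
        for fut, title in futures.items():
            try:
                fut.result()
            except Exception:
                failures.append(title)
    return failures
```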

Requirements

pip install selenium slugify

Besides, depending on the browser you use (Chrome or Edge), download chromedriver.exe or edgedriver.exe from the link to any local path you favour.
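Once the driver executable is in place, it can be wired up roughly like this (a sketch assuming selenium>=4, where the driver path is passed through a Service object; `make_driver` is a hypothetical helper, not part of the project):

```python
def make_driver(driver_path: str, browser: str = 'chrome'):
    """Build a Selenium WebDriver from a local driver executable.
    Imports are deferred so only the chosen browser's modules load."""
    if browser == 'chrome':
        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service
        return webdriver.Chrome(service=Service(executable_path=driver_path))
    if browser == 'edge':
        from selenium import webdriver
        from selenium.webdriver.edge.service import Service
        return webdriver.Edge(service=Service(executable_path=driver_path))
    raise ValueError(f'unsupported browser: {browser}')
```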

Usage

To execute the crawler, run download.py or download.ipynb (they are basically the same). Before execution, a few paths need to be set up, including:

conference = 'neurips'
conference_url = "https://papers.nips.cc/paper/2019" # the conference url to download papers from
chromedriver_path = '.../chromedriver.exe' # the chromedriver.exe path
root = './NeurIPS-2019-ALL' # file path to save the downloaded papers

Here are some conference url examples:

cvpr: https://openaccess.thecvf.com/CVPR2020 (CVPR 2020)
eccv: https://openaccess.thecvf.com/ECCV2018 (ECCV 2018) (changed in 2020)
eccv: https://www.ecva.net/papers.php (ECCV 2020) 
iccv: https://openaccess.thecvf.com/ICCV2019 (ICCV 2019)
icml: http://proceedings.mlr.press/v119/ (ICML 2020)
neurips: https://papers.nips.cc/paper/2020 (NeurIPS 2020)
iclr: https://openreview.net/group?id=ICLR.cc/2021/Conference (ICLR 2021)
siggraph: https://dl.acm.org/toc/tog/2020/39/4 (SIGGRAPH 2020)

Replace the url and the conference name with your choice.

If you want to crawl papers from another conference website, all you need to do is write a retrieve function, like the ones in retrieve_titles_urls_from_websites.py, that parses the html and collects the paper titles and pdf urls into two lists.
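Such a retrieve function could look roughly like this (a hypothetical example in the style of those in retrieve_titles_urls_from_websites.py; the selector is made up for illustration and must be adapted to the real site's HTML):

```python
def retrieve_titles_urls(driver, page_url):
    """Load the conference page with a Selenium driver and return two
    parallel lists: paper titles and their pdf urls. Here every anchor
    ending in .pdf is treated as a paper link, which is a simplification."""
    driver.get(page_url)
    titles, pdf_urls = [], []
    # 'css selector' is the Selenium 4 By.CSS_SELECTOR string literal
    for link in driver.find_elements('css selector', 'a'):
        href = link.get_attribute('href') or ''
        if href.endswith('.pdf'):
            titles.append(link.text.strip())
            pdf_urls.append(href)
    return titles, pdf_urls
```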

Others

Warning: crawling conference websites reportedly may get your IP banned (this hasn't happened to me so far), so crawl at your own risk.

Warning: this project is for learning purposes only. Do not crawl the same website too frequently, as this burdens the server.

You are welcome to submit a pull request if there are any bugs or if you would like to add support for other conferences!

Maintainer

Xiaoyang Huang

Email: [email protected]
