
soskek / arxiv_leaks

Licence: other
Whisper of the arxiv: read comments in tex of papers

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives to or similar to arxiv_leaks

Arxivscraper
A python module to scrape arxiv.org for specific date range and categories
Stars: ✭ 121 (+450%)
Mutual labels:  scraper, arxiv
web-scraping-engine
A simple web scraping engine supporting concurrent and anonymous scraping
Stars: ✭ 27 (+22.73%)
Mutual labels:  scraper
site-audit-seo
Web service and CLI tool for SEO site audit: crawl site, lighthouse all pages, view public reports in browser. Also output to console, json, csv, xlsx, Google Drive.
Stars: ✭ 91 (+313.64%)
Mutual labels:  scraper
document-dl
Command line program to download documents from web portals.
Stars: ✭ 14 (-36.36%)
Mutual labels:  scraper
stock-market-scraper
Scraps historical stock market data from Yahoo Finance (https://finance.yahoo.com/)
Stars: ✭ 110 (+400%)
Mutual labels:  scraper
aliexscrape
Get Aliexpress product details in JSON
Stars: ✭ 80 (+263.64%)
Mutual labels:  scraper
PDAP-Scrapers
Code relating to scraping public police data.
Stars: ✭ 72 (+227.27%)
Mutual labels:  scraper
newspaperjs
News extraction and scraping. Article Parsing
Stars: ✭ 59 (+168.18%)
Mutual labels:  scraper
crawlkit
A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.
Stars: ✭ 23 (+4.55%)
Mutual labels:  scraper
yt-videos-list
Create and **automatically** update a list of all videos on a YouTube channel (in txt/csv/md form) via YouTube bot with end-to-end web scraping - no API tokens required. Multi-threaded support for YouTube videos list updates.
Stars: ✭ 64 (+190.91%)
Mutual labels:  scraper
scraper
A web scraper starter project
Stars: ✭ 18 (-18.18%)
Mutual labels:  scraper
OLX Scraper
📻 An OLX Scraper using Scrapy + MongoDB. It Scrapes recent ads posted regarding requested product and dumps to NOSQL MONGODB.
Stars: ✭ 15 (-31.82%)
Mutual labels:  scraper
tieba-zhuaqu
A distributed crawler for Baidu Tieba, used for Tieba data mining. Analyzes data from both the forum dimension and the user dimension.
Stars: ✭ 56 (+154.55%)
Mutual labels:  scraper
InstagramLocationScraper
No description or website provided.
Stars: ✭ 13 (-40.91%)
Mutual labels:  scraper
Spydan
A web spider for shodan.io without using the Developer API.
Stars: ✭ 30 (+36.36%)
Mutual labels:  scraper
google-this
🔎 A simple yet powerful module to retrieve organic search results and much more from Google.
Stars: ✭ 88 (+300%)
Mutual labels:  scraper
youtube-unofficial
Access parts of your account unavailable through normal YouTube API access.
Stars: ✭ 33 (+50%)
Mutual labels:  scraper
OpenScraper
An open source webapp for scraping: towards a public service for webscraping
Stars: ✭ 80 (+263.64%)
Mutual labels:  scraper
WaGpScraper
A Python Oriented tool to Scrap WhatsApp Group Link using Google Dork it Scraps Whatsapp Group Links From Google Results And Gives Working Links.
Stars: ✭ 18 (-18.18%)
Mutual labels:  scraper
Mimo-Crawler
A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.
Stars: ✭ 22 (+0%)
Mutual labels:  scraper

ArxivLeaks

Most papers on arXiv come with LaTeX source files, which often contain commented-out text. Dig up the valuable comments!

For example, you can extract a secret comment from "Attention Is All You Need", as shown below:

\paragraph{Symbol Dropout} In the source and target embedding layers, we replace a random subset of the token ids with a sentinel id. For the base model, we use a rate of $symbol\_dropout\_rate=0.1$. Note that this applies only to the auto-regressive use of the target ids - not their use in the cross-entropy loss.

We found "Symbol Dropout", which does not appear in the paper (PDF).
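The core idea is simple: scan each line of the LaTeX source for an unescaped `%` and keep whatever follows it. Here is a minimal illustrative sketch of that idea (`extract_comments` is a hypothetical helper, not part of this project's API):

```python
import re

# Match an unescaped "%" (a "\%" is a literal percent sign, not a comment)
# and capture the rest of the line.
COMMENT_RE = re.compile(r'(?<!\\)%(.*)')

def extract_comments(tex: str) -> list[str]:
    """Return the non-empty %-comments found in a LaTeX source string."""
    comments = []
    for line in tex.splitlines():
        m = COMMENT_RE.search(line)
        if m:
            text = m.group(1).strip()
            if text:  # skip bare "%" line-continuation markers
                comments.append(text)
    return comments

source = r"""We use a rate of $P_{drop}=0.1$.  % TODO: tune this
Accuracy improved by 50\% overall."""
print(extract_comments(source))
```

Real LaTeX has more edge cases (e.g. `%` inside `\verb` or listings environments), so a robust extractor needs more care than this regex.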

Run

Feed a text file or HTML page containing arXiv URLs with the -t option.

python -u run.py -t deepmind.html -s arxiv_dir

To test, run sh test.sh. This pre-downloads a publications page from DeepMind.

You can also read only selected papers with -i, by feeding their arXiv IDs.

python -u run.py -i 1709.04905 1706.03762 -s arxiv_dir
  • -s: Downloaded arXiv pages and files are stored in this directory.
  • -o: Output is printed and saved as a JSON file at this file path. Default is ./comments.json.
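The -o output can be post-processed like any JSON file. A minimal sketch, assuming (the actual layout is not documented here, so inspect your own output file) that comments.json maps each arXiv ID to a list of extracted comment strings:

```python
import json

def summarize(comments: dict) -> list[str]:
    """Return one summary line per paper: '<id>: <n> comments'."""
    return [f"{pid}: {len(cs)} comments" for pid, cs in comments.items()]

# Inline data mimicking the assumed layout; in practice you would use
# json.load(open("comments.json")).
data = json.loads('{"1706.03762": ["Symbol Dropout note", "TODO: rerun"]}')
print(summarize(data))
```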

Requirement

  • requests
  • lxml

For Writers

You can remove %-comments from your file as follows:

perl -pe 's/(^|[^\\])%.*/\1%/' < old.tex > new.tex

This one-line command is provided by arXiv.
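If you prefer Python, the same substitution can be sketched as follows (an illustrative equivalent, not the official arXiv tool). It removes everything after an unescaped %, but keeps a bare "%" so TeX's end-of-line behaviour is unchanged:

```python
import re

def strip_comments(line: str) -> str:
    """Replace an unescaped %-comment with a bare '%', like the perl one-liner."""
    # (^|[^\\]) ensures the % is at line start or not preceded by a backslash,
    # so literal "\%" percent signs are left alone.
    return re.sub(r'(^|[^\\])%.*', r'\g<1>%', line)

print(strip_comments(r"x = 1 % secret note"))  # -> "x = 1 %"
print(strip_comments(r"50\% done"))            # -> "50\% done"
```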

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].