Santosh-Gupta / Arxiv Manatee

Arxiv Sanity with novel paper search


This is the public repo where we will post updates on Arxiv-Manatee, a set of tools for searching through Arxiv-Sanity papers.

If you would like to be involved, feel free to reach out to [email protected]. We could use people with experience in Tensorflow/Keras/Pytorch, NLP, abstractive summarization, and text/data processing, but we're open to considering anyone!

Our previous projects

https://github.com/re-search/DocProduct

https://github.com/re-search/gpt2-estimator

https://github.com/Santosh-Gupta/Research2Vec

https://github.com/Santosh-Gupta/Lit2Vec

Update 07-1-19

I am proud to present a summarization dataset for machine learning concepts.

https://drive.google.com/open?id=1B8qqHQNXZ4OVMKpGHAvrxPxdQlKt2Ia4

The dataset contains all 89,252 machine learning papers (cs.[CV|CL|LG|AI|NE]/stat.ML) from Arxiv (1993-June 2019). The 'summary' is the title of each paper, and the 'source' is its abstract.

I am focused on concept summarization (as opposed to factual summarization), so all special characters and digits have been filtered out. That way, summarizers trained on this data can focus on the concepts described in the text.
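The filtering step described above could look roughly like the sketch below. It is only an illustration: the exact rule used to build the dataset is not given here, so the sketch assumes "special characters and digits" means anything that is not an ASCII letter or whitespace, and the function names `clean_for_concepts` and `make_example` are hypothetical.

```python
import re

# Assumption: "special characters and digits" means anything that is not
# an ASCII letter or whitespace. The filter actually used may differ.
_NON_LETTERS = re.compile(r"[^A-Za-z\s]+")
_WHITESPACE = re.compile(r"\s+")


def clean_for_concepts(text: str) -> str:
    """Replace digits and special characters with spaces, then tidy spacing."""
    letters_only = _NON_LETTERS.sub(" ", text)
    return _WHITESPACE.sub(" ", letters_only).strip()


def make_example(title: str, abstract: str) -> dict:
    """Build one record: the title is the 'summary', the abstract the 'source'."""
    return {
        "summary": clean_for_concepts(title),
        "source": clean_for_concepts(abstract),
    }
```

Replacing with a space rather than deleting keeps words such as "top-1" from being fused with their neighbours.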


Update 6-7-19

Computer science summarization datasets completed

This is a dataset of 5.6 million title/abstract pairs, about 75% of which come from computer science papers. I tried my best to filter out all non-CS papers; perhaps the ones that remain add a bit of a "regularization" effect.

Title/abstract pairs have been used to train biomedical summarizers [https://arxiv.org/pdf/1804.08875.pdf], but I am working on a project on CS/ML papers, so I made my own dataset.

The dataset is basically a filtered version of the Semantic Scholar Corpus https://api.semanticscholar.org/corpus/

It took some effort to produce, though, so I figure it may save time for anyone who wants the same data.

This is a zip file containing 12 parquet files

https://drive.google.com/open?id=1WEdf-_au3vg2EzmWhawmW9xsYaHAE7iV

It is ~2.5 GB zipped and, I think, a little over 6 GB unzipped.

This is the sqlite database version, 1 file

https://drive.google.com/open?id=1IhIaBD98BEseteAUi1S_f_SfIaUI8V4D

It is 2.5 GB zipped and 7.5 GB unzipped.
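The table and column names inside the sqlite file are not documented here, so a reasonable first step after unzipping is to inspect the schema. The sketch below uses only Python's standard `sqlite3` module; the helper name `describe_database` is hypothetical, and the path you pass in is whatever the unzipped file is called on your machine.

```python
import sqlite3


def describe_database(path: str) -> dict:
    """Return {table_name: [column_name, ...]} for every table in the file."""
    conn = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields one row per column; index 1 is the name.
            quoted = '"' + table.replace('"', '""') + '"'
            columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Once the table and column names are known, ordinary `SELECT` queries can pull out title/abstract pairs.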


To do

-Develop dataset (in progress)

-Figure out how to process data for training

https://github.com/yaserkl/RLSeq2Seq/tree/master/src/helper

-Train summarizer using this repo

https://github.com/yaserkl/RLSeq2Seq

-Download Arxiv papers from cs.[CV|CL|LG|AI|NE]/stat.ML

https://arxiv.org/help/bulk_data

Should be around 73,000 papers

The Arxiv-Sanity github repo may help figure out how to do this

https://github.com/karpathy/arxiv-sanity-preserver/blob/master/fetch_papers.py
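The shorthand cs.[CV|CL|LG|AI|NE]/stat.ML is not a valid regular expression as written, because square brackets create a character class. A minimal sketch of a category filter is below. It assumes each paper's categories arrive as one whitespace-separated string (for example "cs.LG stat.ML"); the name `is_ml_paper` is hypothetical.

```python
import re

# The shorthand cs.[CV|CL|LG|AI|NE]/stat.ML, written as a real regex.
# Parentheses are needed: square brackets would create a character class.
ML_CATEGORY = re.compile(r"^(cs\.(CV|CL|LG|AI|NE)|stat\.ML)$")


def is_ml_paper(categories: str) -> bool:
    """True if any whitespace-separated category is in the target set."""
    return any(ML_CATEGORY.match(cat) for cat in categories.split())
```

This checks every listed category, so a paper cross-listed into one of the target categories is also kept. If only the primary category should count, test just the first entry instead.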

-Find the best solution for retrieving on-disk text data using an index or key.

Possible solution 1: sqlite

Possible solution 2: pickledb https://pythonhosted.org/pickleDB/

It looks like this may be what arxiv-sanity uses (https://github.com/karpathy/arxiv-sanity-preserver/blob/master/fetch_papers.py), although the excerpt quoted below calls Python's built-in pickle module rather than pickleDB:

"The script is intended to enrich an existing database pickle (by default db.p)"

    from utils import Config, safe_pickle_dump
    # lets load the existing database to memory
    try:
        db = pickle.load(open(Config.db_path, 'rb'))

Possible solution 3: sqlitedict https://pypi.org/project/sqlitedict/
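To make the sqlite option concrete, here is a rough sketch of a key-to-text store built on the standard `sqlite3` module. It is not a decision on which of the three options to use. The class name `TextStore`, the table name `texts`, and the columns `paper_id` and `body` are all hypothetical.

```python
import sqlite3


class TextStore:
    """Minimal on-disk key -> text store backed by one sqlite table."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        # PRIMARY KEY gives an index, so lookups by key avoid a full scan.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS texts "
            "(paper_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def put(self, paper_id: str, body: str) -> None:
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO texts (paper_id, body) VALUES (?, ?)",
                (paper_id, body),
            )

    def get(self, paper_id: str):
        """Return the stored text, or None if the key is unknown."""
        row = self.conn.execute(
            "SELECT body FROM texts WHERE paper_id = ?", (paper_id,)
        ).fetchone()
        return row[0] if row else None

    def close(self) -> None:
        self.conn.close()
```

Unlike loading a whole pickle into memory, this reads only the requested row from disk, which matters once the stored text grows to several gigabytes.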

-Take the text from extracted Arxiv papers and separate it into sections/paragraphs. Store each section/paragraph as a row in a sqlite database. Each row should also contain the title and arxiv link as separate fields. Possibly the abstract as well.

Possibly helpful resources

https://github.com/karpathy/arxiv-sanity-preserver/blob/master/parse_pdf_to_text.py

https://github.com/arxiv-vanity

https://github.com/arxiv-vanity/engrafo
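A minimal sketch of the paragraph-per-row idea is below. It assumes the extracted text separates paragraphs with blank lines, which will not always hold for PDF-to-text output. The table name `paragraphs`, its columns, and the function names are hypothetical.

```python
import re
import sqlite3


def split_paragraphs(text: str) -> list:
    """Split on blank lines and collapse whitespace inside each paragraph."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def store_paper(conn, title: str, link: str, text: str) -> int:
    """Insert one row per paragraph; return how many rows were written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paragraphs ("
        "id INTEGER PRIMARY KEY, title TEXT, link TEXT, "
        "position INTEGER, body TEXT)"
    )
    rows = [(title, link, i, p) for i, p in enumerate(split_paragraphs(text))]
    with conn:  # one transaction for the whole paper
        conn.executemany(
            "INSERT INTO paragraphs (title, link, position, body) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)
```

Storing `position` keeps the original paragraph order, so a search hit can be shown with its neighbouring paragraphs.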

-Figure out most efficient way to do embedding similarity search for large amounts of data.

Possible solution 1: search through hdf5 data stored on disk

Possible solution 2: use Faiss IVF65536_HNSW32 index https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
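For comparison, a brute-force baseline can be sketched in plain Python. This is neither the hdf5 approach nor Faiss. It is an exact search that streams embeddings in chunks, so only one chunk and the current top k results are in memory at a time. The names `cosine` and `top_k`, and the chunk format (an iterable of lists of (id, vector) pairs), are assumptions made for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=5):
    """Stream (id, vector) chunks and keep only the k best matches in memory."""
    best = []  # min-heap of (score, id); the weakest kept match sits at index 0
    for chunk in chunks:
        for item_id, vector in chunk:
            score = cosine(query, vector)
            if len(best) < k:
                heapq.heappush(best, (score, item_id))
            elif score > best[0][0]:
                heapq.heapreplace(best, (score, item_id))
    # Highest score first.
    return [(item_id, score) for score, item_id in sorted(best, reverse=True)]
```

Every query touches every stored vector, so this scales linearly with the collection size. Its main use here is as a correctness reference when evaluating an approximate index.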
