Santosh-Gupta / Arxiv Manatee

Arxiv Sanity with novel paper search

Labels

jupyter-notebook

Projects that are alternatives of or similar to Arxiv Manatee

Aics Segmentation

AICS Segmentation (One-Way) Mirror

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Adversarial autoencoder

Implementation of Adversarial Autoencoder with Theano

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Stanfordextra

12k labelled instances of dogs in-the-wild with 2D keypoint and segmentations. Dataset released with our ECCV 2020 paper: Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop.

Stars: ✭ 38 (+0%)

Mutual labels: jupyter-notebook

Minerva Training Materials

Learn advanced data science on real-life, curated problems

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Dashboards

[RETIRED] See Voilà as a supported replacement

Stars: ✭ 986 (+2494.74%)

Mutual labels: jupyter-notebook

Python Berkeley

python resources of berkeley curated at a place

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Ipcc sr15 scenario analysis

Scenario analysis notebooks for the IPCC Special Report on Global Warming of 1.5°C

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Fractional differencing gpu

Rapid large-scale fractional differencing with RAPIDS to minimize memory loss while making a time series stationary. 6x-400x speed up over CPU implementation.

Stars: ✭ 38 (+0%)

Mutual labels: jupyter-notebook

Optimus

🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

Stars: ✭ 986 (+2494.74%)

Mutual labels: jupyter-notebook

Moltrans

MolTrans: Molecular Interaction Transformer for Drug Target Interaction Prediction (Bioinformatics)

Stars: ✭ 38 (+0%)

Mutual labels: jupyter-notebook

Dog Breed Classifier

I built a CNN that can classify dog breeds

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Foggle

a script that takes posts from reddit.com/r/explainlikeimfive and enters them as a google search, so as to blur your history search

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Mysql History Graph

History Graphs about MySQL and forks

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Plate scatac Seq

A rapid and robust plate-based single cell ATAC-seq (scATAC-seq) method

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Lstm Autoencoder For Anomaly Detection

AI deep learning neural network for anomaly detection using Python, Keras and TensorFlow

Stars: ✭ 36 (-5.26%)

Mutual labels: jupyter-notebook

Bus number

Up Your Bus Number: A Primer for Reproducible Data Science

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Algotrading

Algorithmic trading platform for multiple assets

Stars: ✭ 37 (-2.63%)

Mutual labels: jupyter-notebook

Depiction

interpret deep learning models in a framework-independent fashion

Stars: ✭ 38 (+0%)

Mutual labels: jupyter-notebook

Blindsr dataset generator

Downscale a set of images by randomly created kernels and save them

Stars: ✭ 38 (+0%)

Mutual labels: jupyter-notebook

Tensorflow1

Code for the TensorFlow 1 class of Machine Learning Yahak (머신러닝야학).

Stars: ✭ 38 (+0%)

Mutual labels: jupyter-notebook


This is the public repo where we will post updates on Arxiv-Manatee, a set of tools for searching through Arxiv-Sanity papers.

If you would like to be involved, feel free to reach out to [email protected]. We could use people with experience in TensorFlow/Keras/PyTorch, NLP, abstractive summarization, and text/data processing, but we're open to considering anyone!

Our previous projects

https://github.com/re-search/DocProduct

https://github.com/re-search/gpt2-estimator

https://github.com/Santosh-Gupta/Research2Vec

https://github.com/Santosh-Gupta/Lit2Vec

Update 07-1-19

I am proud to present a summarization dataset for machine learning concepts.

https://drive.google.com/open?id=1B8qqHQNXZ4OVMKpGHAvrxPxdQlKt2Ia4

The dataset contains all 89252 machine learning papers (cs.[CV|CL|LG|AI|NE]/stat.ML) from Arxiv (1993 to June 2019). The 'summary' is the title of the paper, and the 'source' is the abstract of each paper.

I am focused on concept summarization (as opposed to factual summarization), so all special characters and digits have been filtered out. That way, summarizers trained on this data can focus on the concepts described in the text.
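The filtering step described above could look roughly like the sketch below. The exact rules used to build the dataset are not stated here, so this assumes that "special characters and digits" means everything except ASCII letters and whitespace; the function name `clean_for_concepts` is hypothetical.

```python
import re

# Assumption: anything that is not an ASCII letter or whitespace counts as a
# "special character or digit". The filter actually used may have differed.
_NON_LETTERS = re.compile(r"[^A-Za-z\s]+")
_WHITESPACE = re.compile(r"\s+")


def clean_for_concepts(text: str) -> str:
    """Replace digits and special characters with spaces, then collapse whitespace."""
    letters_only = _NON_LETTERS.sub(" ", text)
    return _WHITESPACE.sub(" ", letters_only).strip()


example = clean_for_concepts("ResNet-50 achieves 76.1% top-1 accuracy (ImageNet)!")
print(example)
```

Replacing the removed characters with a space, rather than deleting them outright, keeps words such as "top-1" from fusing with their neighbours.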

6-7-19

Computer science summarization datasets completed

This is a dataset of 5.6 million title/abstract data points, about 75% of which are from computer science papers. I tried my best to filter out all non-CS papers, though perhaps the remaining non-CS papers add a bit of a "regularization" effect.

Title/abstract pairs have been used to train biomedical summarizers [https://arxiv.org/pdf/1804.08875.pdf], but I am doing a project on CS/ML papers, so I made my own.

The dataset is basically a filtered version of the Semantic Scholar Corpus https://api.semanticscholar.org/corpus/

It took some effort to produce, though, and I figure it may save time for anyone who wants the same.

This is a zip file containing 12 parquet files

https://drive.google.com/open?id=1WEdf-_au3vg2EzmWhawmW9xsYaHAE7iV

It's ~2.5 GB zipped and, I think, a bit over 6 GB unzipped.

This is the sqlite database version, 1 file

https://drive.google.com/open?id=1IhIaBD98BEseteAUi1S_f_SfIaUI8V4D

It's 2.5 GB zipped, 7.5 GB unzipped.

To do

-Develop dataset (in progress)

Should be around ~73000 papers

The Arxiv-Sanity github repo may help figure out how to do this

https://github.com/karpathy/arxiv-sanity-preserver/blob/master/fetch_papers.py

-Find best solution to retrieving on-disk text data using index or key.

Possible solution 1: sqlite

Possible solution 2: pickledb https://pythonhosted.org/pickleDB/

It looks like this is what arxiv-sanity uses: https://github.com/karpathy/arxiv-sanity-preserver/blob/master/fetch_papers.py

"The script is intended to enrich an existing database pickle (by default db.p)"

    from utils import Config, safe_pickle_dump
    # lets load the existing database to memory
    try:
        db = pickle.load(open(Config.db_path, 'rb'))

Possible solution 3: sqlitedict https://pypi.org/project/sqlitedict/
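As a rough illustration of the sqlite option, the sketch below stores text under a key and reads it back using only Python's standard `sqlite3` module. The table name `papers`, its columns, the helper names, and the example id are all hypothetical, not part of this project.

```python
import sqlite3

# Hypothetical schema: one row per paper, keyed by its arXiv id.
# ":memory:" keeps the example self-contained; pass a file path for on-disk storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (arxiv_id TEXT PRIMARY KEY, body TEXT NOT NULL)")


def put_text(db, arxiv_id, body):
    """Insert the text for a key, overwriting any existing entry."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO papers (arxiv_id, body) VALUES (?, ?)",
            (arxiv_id, body),
        )


def get_text(db, arxiv_id):
    """Return the stored text for a key, or None if the key is absent."""
    row = db.execute(
        "SELECT body FROM papers WHERE arxiv_id = ?", (arxiv_id,)
    ).fetchone()
    return row[0] if row else None


put_text(conn, "0000.00001", "placeholder text of a paper")
print(get_text(conn, "0000.00001"))
```

Because `arxiv_id` is the primary key, lookups go through an index rather than scanning the table, which is the property the to-do item asks for.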

-Take text from extracted Arxiv papers and separate it into sections/paragraphs. Store each section/paragraph as a row in a sqlite database. Should also contain the title and arxiv link as separate rows. Possibly the abstract as well.

Possibly helpful resources

https://github.com/karpathy/arxiv-sanity-preserver/blob/master/parse_pdf_to_text.py

https://github.com/arxiv-vanity

https://github.com/arxiv-vanity/engrafo
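One way the paragraph-storage step might look is sketched below. It assumes paragraphs are separated by blank lines in the extracted text, and, as a simplification, it stores the title and arxiv link as columns on every paragraph row rather than as rows of their own. The table name `paragraphs`, the function names, and the example link are hypothetical.

```python
import re
import sqlite3


def split_paragraphs(text):
    """Split extracted text on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def store_paper(db, title, arxiv_link, text):
    """Store one row per paragraph and return the number of rows written."""
    db.execute(
        """CREATE TABLE IF NOT EXISTS paragraphs (
               id INTEGER PRIMARY KEY,
               title TEXT,
               arxiv_link TEXT,
               position INTEGER,  -- order of the paragraph within the paper
               body TEXT
           )"""
    )
    paragraphs = split_paragraphs(text)
    with db:  # one transaction for the whole paper
        db.executemany(
            "INSERT INTO paragraphs (title, arxiv_link, position, body) "
            "VALUES (?, ?, ?, ?)",
            [(title, arxiv_link, i, p) for i, p in enumerate(paragraphs)],
        )
    return len(paragraphs)


conn = sqlite3.connect(":memory:")
count = store_paper(
    conn,
    "Example title",
    "https://arxiv.org/abs/0000.00001",  # placeholder link
    "Intro paragraph.\n\nMethod paragraph.\n\n\nResults paragraph.",
)
print(count)
```

Real PDF-to-text output rarely has clean blank-line boundaries, so the splitting rule is the part most likely to need replacing.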

-Figure out most efficient way to do embedding similarity search for large amounts of data.

Possible solution 1: search through HDF5 data stored on disk

Possible solution 2: use Faiss IVF65536_HNSW32 index https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
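Neither option above is implemented in the sketch below. It shows only an exact, brute-force cosine-similarity search in plain Python, which could serve as a small-scale baseline for checking the results of an approximate index such as the Faiss one. The function names and the toy embeddings are made up for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, embeddings, k=2):
    """Return the k (score, key) pairs most similar to the query, best first."""
    scored = ((cosine(query, vec), key) for key, vec in embeddings.items())
    return heapq.nlargest(k, scored)


# Toy 2-dimensional embeddings; real paper embeddings would be much larger.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
result = top_k([1.0, 0.1], toy, k=2)
print([key for _, key in result])
```

This scan is linear in the number of stored vectors, which is why an on-disk or approximate index becomes attractive for large collections.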


Stars: ✭ 38


-Figure out how to process data for training

-Train summarizer using this repo

-Download Arxiv papers from cs.[CV|CL|LG|AI|NE]/stat.ML
