
EdCo95 / Scientific Paper Summarisation

Machine learning models to automatically summarise scientific papers

Projects that are alternatives of or similar to Scientific Paper Summarisation

Rasa
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Stars: ✭ 13,219 (+9016.55%)
Mutual labels:  natural-language-processing
Practical Machine Learning With Python
Master the essential skills needed to recognize and solve complex real-world problems with Machine Learning and Deep Learning by leveraging the highly popular Python Machine Learning Eco-system.
Stars: ✭ 1,868 (+1188.28%)
Mutual labels:  natural-language-processing
Monkeylearn Python
Official Python client for the MonkeyLearn API. Build and consume machine learning models for language processing from your Python apps.
Stars: ✭ 143 (-1.38%)
Mutual labels:  natural-language-processing
Ncrfpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
Stars: ✭ 1,767 (+1118.62%)
Mutual labels:  natural-language-processing
Learn To Select Data
Code for Learning to select data for transfer learning with Bayesian Optimization
Stars: ✭ 140 (-3.45%)
Mutual labels:  natural-language-processing
Paper Survey
📚Survey of previous research and related works on machine learning (especially Deep Learning) in Japanese
Stars: ✭ 140 (-3.45%)
Mutual labels:  natural-language-processing
Cocoaai
🤖 The Cocoa Artificial Intelligence Lab
Stars: ✭ 134 (-7.59%)
Mutual labels:  natural-language-processing
Awesome Nlp Resources
This repository contains landmark research papers in Natural Language Processing that came out in this century.
Stars: ✭ 145 (+0%)
Mutual labels:  natural-language-processing
Lexpredict Contraxsuite
LexPredict ContraxSuite
Stars: ✭ 140 (-3.45%)
Mutual labels:  natural-language-processing
Neusum
Code for the ACL 2018 paper "Neural Document Summarization by Jointly Learning to Score and Select Sentences"
Stars: ✭ 143 (-1.38%)
Mutual labels:  natural-language-processing
Kaggle Crowdflower
1st Place Solution for CrowdFlower Product Search Results Relevance Competition on Kaggle.
Stars: ✭ 1,708 (+1077.93%)
Mutual labels:  natural-language-processing
Deeplearning.ai
Stars: ✭ 139 (-4.14%)
Mutual labels:  natural-language-processing
Stanza Old
Stanford NLP group's shared Python tools.
Stars: ✭ 142 (-2.07%)
Mutual labels:  natural-language-processing
Sluice Networks
Code for Sluice networks: Learning what to share between loosely related tasks
Stars: ✭ 135 (-6.9%)
Mutual labels:  natural-language-processing
Multihead Siamese Nets
Implementation of Siamese Neural Networks built upon multihead attention mechanism for text semantic similarity task.
Stars: ✭ 144 (-0.69%)
Mutual labels:  natural-language-processing
Mams For Absa
A Multi-Aspect Multi-Sentiment Dataset for aspect-based sentiment analysis.
Stars: ✭ 135 (-6.9%)
Mutual labels:  natural-language-processing
Nlpaug
Data augmentation for NLP
Stars: ✭ 2,761 (+1804.14%)
Mutual labels:  natural-language-processing
Ai Job Info
Interview experience from major Chinese internet companies
Stars: ✭ 145 (+0%)
Mutual labels:  natural-language-processing
Absapapers
Worth-reading papers and related awesome resources on aspect-based sentiment analysis (ABSA).
Stars: ✭ 142 (-2.07%)
Mutual labels:  natural-language-processing
Tod Bert
Pre-Trained Models for ToD-BERT
Stars: ✭ 143 (-1.38%)
Mutual labels:  natural-language-processing

Automatic Summarisation of Scientific Papers

Have you ever had to do a literature review as part of a research project and thought, "I wish there was a quicker way of doing this"? This code aims to provide that quicker way: a supervised, extractive summarisation system for scientific papers.

For more information on the project, please see:

Ed Collins, Isabelle Augenstein, Sebastian Riedel. A Supervised Approach to Extractive Summarisation of Scientific Papers. To appear in Proceedings of CoNLL, July 2017.

Ed Collins. A supervised approach to extractive summarisation of scientific papers. UCL MEng thesis, May 2017.

Code Description

The various code files and folders are described here. Note that the data used is not uploaded here, but the repository is nonetheless over 1 GB in size.

  • Analysis - A folder containing code used to analyse the generated summaries and create various pretty graphs. It is not essential to the functioning of the summarisers and will not work without the data.
  • Data - Where all data should be stored. The Utility_Data folder contains items such as stopword lists, permitted titles, and a count of how many different papers each word occurs in (used for TF-IDF; calculated automatically by DataTools/DataPreprocessing/cspubsumext_creator.py). A sketch of how such counts feed a TF-IDF weighting is given after this list.
  • DataTools - Contains files for manipulating and preprocessing the data. Two files in this folder are particularly important: useful_functions.py, which contains many of the functions used to run the system, and DataPreprocessing/cspubsumext_creator.py, which automatically preprocesses the parsed papers produced by the code in DataDownloader into the form used to train the models in the research.
  • Evaluation - Contains code to evaluate summaries and calculate the ROUGE-L metric (a generic sketch of the metric appears after this list), with thanks to hapribot.
  • Models - Contains the code which constructs and trains each of the supervised learning models that form the core of the summarisation system, all written in TensorFlow.
  • Summarisers - Contains the code which takes the trained models and uses them to actually create summaries of papers.
  • Visualisations - Contains code which visualises summaries by colouring them and saving them as HTML files. This is not essential to run the system.
  • Word2Vec - Contains the code necessary to train the Word2Vec model used for word embeddings. The actual trained Word2Vec model is not uploaded because it is too large.
  • DataDownloader - Contains code to download and parse the original XML paper files into the format currently used by this system, in which each section title is delineated by "@&#". A paper can then be read as a single string and split on this symbol, which is very unlikely to ever occur in the text, to recover its constituent sections. The important file is acquire_data.py.
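
As referenced in the Data entry above, the per-word document counts can back a simple TF-IDF weighting. The following is a minimal sketch only: the count-file name, its assumed word-count-per-line format, and the helper names are illustrative, not the repository's actual schema.

```python
# Illustrative sketch: the count-file format ("word count" per line) and
# these function names are assumptions, not this repository's actual code.
import math
from collections import Counter

def load_document_counts(path):
    """Read assumed 'word count' pairs, one per line."""
    counts = {}
    with open(path) as f:
        for line in f:
            word, count = line.split()
            counts[word] = int(count)
    return counts

def tf_idf(sentence_words, doc_counts, num_docs):
    """Weight each word by term frequency times inverse document frequency."""
    tf = Counter(sentence_words)
    return {w: tf[w] * math.log(num_docs / (1.0 + doc_counts.get(w, 0)))
            for w in tf}
```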

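The Evaluation entry above mentions ROUGE-L; for readers unfamiliar with the metric, here is a generic sketch of its standard longest-common-subsequence definition (Lin, 2004). It is not the repository's Evaluation code, which may differ in tokenisation and smoothing details.

```python
# Generic ROUGE-L sketch, not this repository's implementation.
def lcs_length(a, b):
    """Dynamic-programming length of the longest common subsequence."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    """F-score combining LCS-based recall and precision."""
    lcs = float(lcs_length(candidate, reference))
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(reference), lcs / len(candidate)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
```
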
Running the Code

Before attempting to run this code, you should set up a suitable virtualenv using Python 2.7, then install all of the requirements listed in requirements.txt with pip install -r requirements.txt.

To download the dataset and preprocess it into the form used to train the models in the paper, first run DataDownloader/acquire_data.py. This downloads all of the papers and parses them into the form used, with sections separated by the special symbol "@&#", so that each paper can be read as a string and then split into its sections and titles on this symbol.
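
Reading a parsed paper back in is then a matter of one read and one split. This is a minimal sketch assuming only what is described above; the exact alternation of titles and section bodies in the resulting list is an assumption about the on-disk format.

```python
# Minimal sketch of reading a parsed paper; the title/body ordering of the
# resulting list is an assumption about the on-disk format.
SECTION_DELIMITER = "@&#"

def read_sections(path):
    with open(path) as f:
        contents = f.read()
    return [part.strip() for part in contents.split(SECTION_DELIMITER)]
```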

To turn these downloaded papers into training data, run DataTools/DataPreprocessing/cspubsumext_creator.py. This takes a while to run depending on your machine and number of cores (roughly two hours on a late-2016 MacBook Pro with a dual-core i7), but handles creating all of the files necessary to train the models. By default these are stored in Data/Training_Data/: an individual JSON file for each paper, plus a single file called all_data.json containing a list of all of the individual items of training data. This code now uses the fast ujson (UltraJSON) library, which reads the data much faster than the previous pickle-based version.
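
Loading the resulting training data is then straightforward; this sketch assumes only that all_data.json is a JSON list, as described above.

```python
# Assumes only that all_data.json is a JSON list of training items.
import ujson

with open("Data/Training_Data/all_data.json") as f:
    all_data = ujson.load(f)

print(len(all_data))  # number of training items
```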

All of the models and summarisers should then be usable.

Be sure to check that all of the paths are correctly set! These are in DataDownloader/acquire_data.py for downloading papers, and in DataTools/useful_functions.py otherwise.

NOTE: The code in DataTools/DataPreprocessing/AbstractNetPreprocessor.py is still unpleasantly inefficient and is currently used in the summarisers themselves. The next code update will fix this and streamline the process of running the trained summarisers.

Other Notes

If you have read or are reading the MEng thesis or CoNLL paper corresponding to this code, then SAFNet = SummariserNet, SFNet = SummariserNetV2, SNet = LSTM, SAF+F Ens = EnsembleSummariser, S+F Ens = EnsembleV2Summariser.
