
kariminf / allsummarizer

Licence: Apache-2.0 license
Multilingual automatic text summarizer using statistical approach and extraction

Programming Languages

java
68154 projects - #9 most used programming language
python
139335 projects - #7 most used programming language
perl
6916 projects

Projects that are alternatives of or similar to allsummarizer

TextRank-node
No description or website provided.
Stars: ✭ 21 (-25%)
Mutual labels:  text-summarization, sentence-extraction
ir datasets
Provides a common interface to many IR ranking datasets.
Stars: ✭ 190 (+578.57%)
Mutual labels:  information-retrieval, ir
awesome-semantic-search
A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks.
Stars: ✭ 161 (+475%)
Mutual labels:  information-retrieval, information-retrival
Bidirectiona-LSTM-for-text-summarization-
A bidirectional encoder-decoder LSTM neural network is trained for text summarization on the cnn/dailymail dataset. (MIT808 project)
Stars: ✭ 73 (+160.71%)
Mutual labels:  text-summarization
kex
Kex is a python library for unsupervised keyword extraction from a document, providing an easy interface and benchmarks on 15 public datasets.
Stars: ✭ 46 (+64.29%)
Mutual labels:  information-retrieval
tutorials
A tutorial series by Preferred.AI
Stars: ✭ 136 (+385.71%)
Mutual labels:  information-retrieval
IP-Tracker
Track any ip address with IP-Tracker. IP-Tracker is developed for Linux and Termux. you can retrieve any ip address information using IP-Tracker.
Stars: ✭ 53 (+89.29%)
Mutual labels:  information-retrieval
ps-srum-hunting
PowerShell Script to facilitate the processing of SRUM data for on-the-fly forensics and if needed threat hunting
Stars: ✭ 16 (-42.86%)
Mutual labels:  ir
bookworm
📚 social networks from novels
Stars: ✭ 72 (+157.14%)
Mutual labels:  information-retrieval
ml4ir
Machine Learning for Information Retrieval
Stars: ✭ 75 (+167.86%)
Mutual labels:  information-retrieval
GNN-Recommender-Systems
An index of recommendation algorithms that are based on Graph Neural Networks.
Stars: ✭ 505 (+1703.57%)
Mutual labels:  information-retrieval
HAR
Code for WWW2019 paper "A Hierarchical Attention Retrieval Model for Healthcare Question Answering"
Stars: ✭ 22 (-21.43%)
Mutual labels:  information-retrieval
BM25Transformer
(Python) transform a document-term matrix to an Okapi/BM25 representation
Stars: ✭ 50 (+78.57%)
Mutual labels:  information-retrieval
srqm
An introductory statistics course for social scientists, using Stata
Stars: ✭ 43 (+53.57%)
Mutual labels:  statistical-methods
cs6101
The Web IR / NLP Group (WING)'s public reading group at the National University of Singapore.
Stars: ✭ 17 (-39.29%)
Mutual labels:  information-retrieval
3d model retriever
Experimenting with a newly published deep learning paper and how it can be used for content-based 3D model retrieval. (info retrieval for CAD)
Stars: ✭ 45 (+60.71%)
Mutual labels:  information-retrieval
pytorch-translm
An implementation of transformer-based language model for sentence rewriting tasks such as summarization, simplification, and grammatical error correction.
Stars: ✭ 22 (-21.43%)
Mutual labels:  text-summarization
ml-nlp-services
机器学习、深度学习、自然语言处理
Stars: ✭ 23 (-17.86%)
Mutual labels:  information-retrieval
Azure-Sentinel-4-SecOps
Microsoft Sentinel SOC Operations
Stars: ✭ 140 (+400%)
Mutual labels:  ir
src
tools for fast reading of docs
Stars: ✭ 40 (+42.86%)
Mutual labels:  information-retrieval

AllSummarizer


A research project implementation for automatic text summarization. AllSummarizer uses an extractive method to generate the summary: each sentence is scored against a set of criteria, the sentences are reordered by score, and the top-scoring ones are included in the summary.
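As a rough illustration of the extraction idea (this is not AllSummarizer's actual scoring, which uses clustering and a modified classification algorithm as described in the papers below), a minimal score-rank-select summarizer can be sketched as:

```java
import java.util.*;

public class ExtractiveSketch {
    // Score each sentence by the average document-wide frequency of its words
    // (a simple stand-in for AllSummarizer's clustering-based features).
    static List<String> summarize(List<String> sentences, int k) {
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences)
            for (String w : s.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);

        Integer[] idx = new Integer[sentences.size()];
        double[] score = new double[sentences.size()];
        for (int i = 0; i < sentences.size(); i++) {
            idx[i] = i;
            String[] words = sentences.get(i).toLowerCase().split("\\W+");
            for (String w : words) score[i] += freq.getOrDefault(w, 0);
            score[i] /= Math.max(1, words.length);
        }
        // Rank by score, keep the k best, then restore document order.
        Arrays.sort(idx, (a, b) -> Double.compare(score[b], score[a]));
        List<Integer> keep =
            new ArrayList<>(Arrays.asList(idx).subList(0, Math.min(k, idx.length)));
        Collections.sort(keep);
        List<String> out = new ArrayList<>();
        for (int i : keep) out.add(sentences.get(i));
        return out;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "Text summarization selects the most important sentences.",
            "The weather was nice yesterday.",
            "Extractive summarization scores and ranks sentences.");
        // Keeps the two on-topic sentences, drops the off-topic one.
        System.out.println(summarize(doc, 2));
    }
}
```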

For more documentation, check this.

You can find more about the method in the paper:

@inproceedings{13-aries-al,
	author = {Aries, Abdelkrime and Oufaida, Houda and Nouali, Omar},
	title = {Using clustering and a modified classification algorithm for automatic text summarization},
	series = {Proc. SPIE},
	volume = {8658},
	pages = {865811-865811-9},
	year = {2013},
	doi = {10.1117/12.2004001},
	url = {http://dx.doi.org/10.1117/12.2004001}
}

Also, the system's participation at the MultiLing 2015 workshop, and a follow-up paper at CIIA 2018:

@Inbook{15-aries-al,
  author = {Aries, Abdelkrime
            and Zegour, Eddine Djamel
            and Hidouci, Walid Khaled},
  chapter = {AllSummarizer system at MultiLing 2015: Multilingual single and multi-document summarization},
  title = {Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue},
  year = {2015},
  publisher = {Association for Computational Linguistics},
  pages = {237--244},
  location = {Prague, Czech Republic},
  url = {http://aclweb.org/anthology/W15-4634}
}

@inproceedings{18-aries-al,
	author    = {Abdelkrime Aries and
	Djamel Eddine Zegour and
	Walid{-}Khaled Hidouci},
	title     = {Exploring Graph Bushy Paths to Improve Statistical Multilingual Automatic Text Summarization},
	booktitle = {Computational Intelligence and Its Applications - 6th {IFIP} {TC}
	5 International Conference, {CIIA} 2018, Oran, Algeria, May 8-10,
	2018, Proceedings},
	pages     = {78--89},
	year      = {2018},
	url       = {https://doi.org/10.1007/978-3-319-89743-1\_8},
	doi       = {10.1007/978-3-319-89743-1\_8},
	timestamp = {Sat, 05 May 2018 23:05:32 +0200},
	biburl    = {https://dblp.org/rec/bib/conf/ciia/AriesZH18},
	bibsource = {dblp computer science bibliography, https://dblp.org}
}

Dependencies:

This project depends on the following projects:

  • KToolJa: for file management and plugins
  • LangPi: for text preprocessing (which itself depends on further libraries)

Preprocessing plugins are in the "preProcess" folder. For the Hebrew and Thai preprocessing tools, check the LangPi releases; these two plugins are not Apache-2.0 licensed.

Command line usage

To execute from command line:

  • Jar file: java -jar <jar_name> options
  • Class: java kariminf.as.ui.MonoDoc options

input/output options:

  • -i <input_file>: the input; it must be a folder for multi-document or variant inputs, and a file otherwise.
  • -o <output_file>: the output; it must be a folder when the input is multi-document or when there are multiple output lengths, feature combinations, or thresholds, and a file otherwise.
  • -v: variant inputs; the input is a folder whose files (or folders) are summarized one by one.

summary options:

summary unit:

  • -b: the summary size is specified in bytes.
  • -c: the summary size is specified in characters.
  • -w: the summary size is specified in words.
  • -s: the summary size is specified in sentences.

summary length:

  • -n <number>: the number of units to be extracted.
  • -r <ratio>: a percentage from 1 to 100 defining the proportion of units to be extracted. You can specify more than one length by separating the lengths with semicolons; for example: 5;10
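As an illustration of the ratio option, a sketch of how a percentage could map to a unit count (my own illustration; the project's exact rounding may differ):

```java
public class LengthCalc {
    // Number of units to extract for a ratio given as a percentage (1..100),
    // keeping at least one unit.
    static int unitsForRatio(int totalUnits, int ratioPercent) {
        return Math.max(1, (int) Math.round(totalUnits * ratioPercent / 100.0));
    }

    public static void main(String[] args) {
        // A 40-sentence document at -r 5 keeps 2 sentences.
        System.out.println(unitsForRatio(40, 5)); // → 2
    }
}
```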

summarizer options:

  • -f <features>: the features used to score the sentences, separated by commas; for example: tfu,pos. For multiple combinations, separate them with semicolons; for example: tfu,pos;tfb,len
  • -t <threshold>: a number from 0 to 100 specifying the clustering threshold. For multiple thresholds, separate them with semicolons (for example: 5;50); a range with a step can also be given as start-end:step (for example: 5-15:5, as in the example below).
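The multi-value syntax nests two separators: semicolons split feature combinations, commas split the features within one combination. A sketch of parsing a -f value:

```java
import java.util.*;

public class OptionParsing {
    // Parse a -f value: semicolons separate combinations,
    // commas separate the features inside each combination.
    static List<List<String>> parseFeatures(String arg) {
        List<List<String>> combos = new ArrayList<>();
        for (String combo : arg.split(";"))
            combos.add(Arrays.asList(combo.split(",")));
        return combos;
    }

    public static void main(String[] args) {
        // "tfu,pos;tfb,len" → two combinations: [tfu, pos] and [tfb, len]
        System.out.println(parseFeatures("tfu,pos;tfb,len"));
    }
}
```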

To get help, use -h.

Examples of command line

Suppose we have a folder for inputs called "exp":

exp
├── multi
│   ├── M001
│   │   ├── M0010.english
│   │   ├── M0011.english
│   │   └── M0012.english
│   └── M002
│       ├── M0020.english
│       ├── M0021.english
│       └── M0022.english
└── single
    ├── doc1.txt
    └── doc2.txt

single document examples:

the command:

-i "exp/single" -o "exp/output" -l en -t "5-15:5" -n "100;200" -c -f "tfu,pos;tfb,rleng" -v

gives these files:

doc1.txt_0.05_Pos-TFU_100c.txt    doc1.txt_0.1_Pos-TFU_100c.txt     doc2.txt_0.15_Pos-TFU_100c.txt
doc1.txt_0.05_Pos-TFU_200c.txt    doc1.txt_0.1_Pos-TFU_200c.txt     doc2.txt_0.15_Pos-TFU_200c.txt
doc1.txt_0.05_RLeng-TFB_100c.txt  doc1.txt_0.1_RLeng-TFB_100c.txt   doc2.txt_0.15_RLeng-TFB_100c.txt
doc1.txt_0.05_RLeng-TFB_200c.txt  doc1.txt_0.1_RLeng-TFB_200c.txt   doc2.txt_0.15_RLeng-TFB_200c.txt
doc1.txt_0.15_Pos-TFU_100c.txt    doc2.txt_0.05_Pos-TFU_100c.txt    doc2.txt_0.1_Pos-TFU_100c.txt
doc1.txt_0.15_Pos-TFU_200c.txt    doc2.txt_0.05_Pos-TFU_200c.txt    doc2.txt_0.1_Pos-TFU_200c.txt
doc1.txt_0.15_RLeng-TFB_100c.txt  doc2.txt_0.05_RLeng-TFB_100c.txt  doc2.txt_0.1_RLeng-TFB_100c.txt
doc1.txt_0.15_RLeng-TFB_200c.txt  doc2.txt_0.05_RLeng-TFB_200c.txt  doc2.txt_0.1_RLeng-TFB_200c.txt
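The generated filenames above follow the pattern <input>_<threshold>_<features>_<length><unit>.txt, with the threshold divided by 100 and the feature names sorted and joined with hyphens. A sketch of that assembly, inferred from the listings (not the project's actual code):

```java
import java.util.*;

public class OutputNames {
    // Assemble an output filename from the summarizer settings,
    // following the pattern seen in the example listings.
    static String outputName(String doc, int threshold, List<String> features,
                             int length, char unit) {
        List<String> sorted = new ArrayList<>(features);
        Collections.sort(sorted); // e.g. [TFU, Pos] → Pos-TFU
        return doc + "_" + (threshold / 100.0) + "_"
             + String.join("-", sorted) + "_" + length + unit + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(outputName("doc1.txt", 5, Arrays.asList("TFU", "Pos"), 100, 'c'));
        // → doc1.txt_0.05_Pos-TFU_100c.txt
    }
}
```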

the command:

-i "exp/single/doc1.txt" -o "exp/output" -l en -t 5 -r "5;10" -c -f "tfu,pos"

gives these files:

doc1.txt_0.05_Pos-TFU_10%c.txt  doc1.txt_0.05_Pos-TFU_5%c.txt

multi-document examples:

the command:

-i "exp/multi" -o "exp/output" -l en -t 5 -r "5;10" -c -f "tfu,pos" -v -m

gives these files:

M001_0.05_Pos-TFU_10%c.txt  M001_0.05_Pos-TFU_5%c.txt  
M002_0.05_Pos-TFU_10%c.txt  M002_0.05_Pos-TFU_5%c.txt

License

Copyright (C) 2012-2017 Abdelkrime Aries

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].