akashp1712 / summarize-webpage

Licence: MIT License
A small NLP SaaS project that summarizes a webpage

Summarize-Webpage, Powered by nlp-akash.

A Flask application that extracts and summarizes a webpage using Natural Language Processing.

screen_webpage_1

Motivation

The motivation for this project is to build a production-ready API for a text summarization algorithm and to apply it to real-world use cases. The project implements an NLP algorithm in Python and serves it through a Flask API.

How to start

You need to manually clone or download this repository. The project is ready to run (with some requirements).

Run the app.py file in your development environment.

Open http://127.0.0.1:5000/, customize the project files, and have fun.

Requirements

The suggested way is to use a Python virtual environment. The project was developed using Python 3.7.1.

Included modules support

Python

This project uses a very simple Python web framework called Flask, which is easy to learn and adopt (and even to scale).

NLTK (Natural Language Toolkit) is used to implement the text summarization algorithm.

HTML

The HTML Template used in this project is Stanley - Bootstrap Freelancer Template.

JavaScript

  • Vanilla Javascript

CSS

  • Vanilla CSS

Installation

Install the required Python packages listed in requirements.txt:

$ pip install -r requirements.txt

Implementation

Project Structure

|───config/
|───framework/
|───implementaion/
|───static/
|───templates/
|───app.py
|───wsgi.py

Framework

├──framework
| |──justext
| |──parser

jusText (the original framework) was developed by miso-belica.

  • justext is modified code from jusText, a heuristic-based boilerplate removal tool. The original code was modified to parse some specific tags (i.e., <p>, <li>, <b>, <h1>...<h6>, etc.).

    • Please note that this project uses only the English stopwords from the original project.
  • We use the jusText framework to download the web content and parse it using parser.

    • parser creates a list of Paragraph objects, each of which has the following properties:
1. is_heading -> boolean
   :: returns true if the paragraph is a heading (<h1>...<h6>)

2. is_list_set -> boolean
   :: returns true if the paragraph is a list item (<li>)

3. is_paragraph -> boolean
   :: returns true if the paragraph is a paragraph tag (<p>)

4. is_first_paragraph(self):
   :: returns true if the paragraph is the first paragraph of the content

5. text(self):
   :: returns the text content of the paragraph without any tags
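
To make the properties above more concrete, here is a minimal, standard-library-only sketch. It is not the project's modified jusText parser: the `Paragraph` dataclass, `BlockExtractor`, and `parse_blocks` are hypothetical names used for illustration, and it assumes that "first paragraph" means the first `<p>` block in the page.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p", "li"}


@dataclass
class Paragraph:
    """Hypothetical stand-in for the parser's Paragraph object."""
    tag: str                          # lower-case tag name, e.g. "p", "li", "h2"
    text: str                         # text content without any tags
    is_first_paragraph: bool = False  # assumption: True for the first <p> block

    @property
    def is_heading(self) -> bool:
        return self.tag in HEADING_TAGS

    @property
    def is_list_set(self) -> bool:
        return self.tag == "li"

    @property
    def is_paragraph(self) -> bool:
        return self.tag == "p"


class BlockExtractor(HTMLParser):
    """Collects <p>, <li> and <h1>..<h6> blocks as Paragraph objects."""

    def __init__(self):
        super().__init__()
        self.paragraphs: List[Paragraph] = []
        self._tag = None    # block tag currently open, if any
        self._chunks = []   # text pieces collected for that block

    def handle_starttag(self, tag, attrs):
        # Only the outermost block tag opens a new block.
        if tag in BLOCK_TAGS and self._tag is None:
            self._tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._chunks).split())
            if text:
                first = tag == "p" and not any(
                    p.is_paragraph for p in self.paragraphs
                )
                self.paragraphs.append(Paragraph(tag, text, first))
            self._tag = None


def parse_blocks(html: str) -> List[Paragraph]:
    extractor = BlockExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.paragraphs
```

Calling `parse_blocks(html)` on a downloaded page yields objects whose flags can then drive the weighting described in the next section.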

Summarization Algorithm

├──implementaion
| |──word_frequency_summarize_parser.py

This is the core module of the project: the implementation of the summarization algorithm.

Word_Frequency_Summarization: a summarization implementation based on word frequency.

Important: This project implements a slightly modified version of the algorithm, in which the sentence-scoring method takes properties of the web text into account, such as whether the text is a header or a list item.

i.e., it gives more weight to header or bold text than to normal text.

# Weights for structured documents
# Important: these values are for experimentation only

WEIGHT_FOR_LIST = 5
WEIGHT_FOR_HIGHLIGHTED = 10
WEIGHT_FOR_NUMERICAL = 5
WEIGHT_FIRST_PARAGRAPH = 5
WEIGHT_BASIC = 1

...

for word in words:
    if paragraph.is_list_set:
        weight = WEIGHT_FOR_LIST
    else:
        weight = WEIGHT_BASIC

    if word in highlighted_words:
        weight += WEIGHT_FOR_HIGHLIGHTED

    if word.isnumeric() and len(word) >= 2:
        weight += WEIGHT_FOR_NUMERICAL

    if paragraph.is_first_paragraph:
        weight += WEIGHT_FIRST_PARAGRAPH

    word = ps.stem(word)
    if word in stopWords:
        continue

    if word in freqTable:
        freqTable[word] += weight
    else:
        freqTable[word] = weight

This way, words that are part of headers or lists receive extra weight and therefore more importance.

Idea: Play with the weightage and see the difference in the result!
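
If you want to play with the weights without running the whole project, the following is a self-contained sketch of the same idea. It makes several assumptions that are not taken from the project's code: it uses a simple regex tokenizer and a tiny illustrative stopword set instead of NLTK, it skips stemming, it takes blocks as plain tuples, and it scores a sentence as the sum of its words' table weights (`tokenize`, `build_freq_table`, and `score_sentence` are hypothetical helpers).

```python
import re
from collections import defaultdict

WEIGHT_FOR_LIST = 5
WEIGHT_FOR_HIGHLIGHTED = 10
WEIGHT_FOR_NUMERICAL = 5
WEIGHT_FIRST_PARAGRAPH = 5
WEIGHT_BASIC = 1

# Tiny illustrative stopword set (assumption); the project uses the
# English stopwords that ship with jusText.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_freq_table(blocks, highlighted_words=frozenset()):
    """Build a weighted word-frequency table.

    blocks: iterable of (text, is_list, is_first_paragraph) tuples.
    """
    freq = defaultdict(int)
    for text, is_list, is_first in blocks:
        for word in tokenize(text):
            if word in STOP_WORDS:
                continue
            weight = WEIGHT_FOR_LIST if is_list else WEIGHT_BASIC
            if word in highlighted_words:
                weight += WEIGHT_FOR_HIGHLIGHTED
            if word.isnumeric() and len(word) >= 2:
                weight += WEIGHT_FOR_NUMERICAL
            if is_first:
                weight += WEIGHT_FIRST_PARAGRAPH
            freq[word] += weight
    return dict(freq)


def score_sentence(sentence, freq):
    """Score a sentence as the sum of the table weights of its words."""
    return sum(freq.get(word, 0) for word in tokenize(sentence))
```

Changing one of the `WEIGHT_*` constants and re-scoring the same sentences is a quick way to see how much each structural hint shifts the ranking.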


Flask service

├──app.py

What if we want to expose our algorithm as a servable API? (SaaS startup?) Yes, we can do that! app.py is a Flask module that serves an API for summarizing a webpage.

# `summarize` method takes webpage url as argument and returns the summarized text as json object
@app.route('/v1/summarize', methods=['GET'])
def summarize():
    ...

Usage:

This is a GET API that can be queried easily with curl, Postman, or your favourite browser.

e.g., GET /v1/summarize?url=https://medium.com/@bnoll12/real-freedom-539c8e9499bb

OR via browser

http://localhost:5000/v1/summarize?url=https://medium.com/@bnoll12/real-freedom-539c8e9499bb
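
The same request can also be made from Python. The sketch below uses only the standard library and assumes the service is running locally on port 5000 and returns JSON containing a `summary` field (as read by main.js); `build_request_url` and `fetch_summary` are hypothetical helper names. Percent-encoding the page URL keeps any `&`, `?`, or `#` inside it from being read as part of the API's own query string.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Flask app is running locally on the default port.
BASE_URL = "http://localhost:5000/v1/summarize"


def build_request_url(page_url, base_url=BASE_URL):
    """Return the GET URL with the target page percent-encoded."""
    return base_url + "?" + urlencode({"url": page_url})


def fetch_summary(page_url, base_url=BASE_URL, timeout=30):
    """Call the running service and return the decoded JSON response."""
    request_url = build_request_url(page_url, base_url)
    with urlopen(request_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_summary("https://medium.com/@bnoll12/real-freedom-539c8e9499bb")["summary"]` should give the summarized text, assuming the response shape described above.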


Let's add some UI

├──templates
├ ├──index.html
├──static
├ ├──assets
├ ├ ├──css
├ ├ ├──js
1. Accept the website url from the user

The following interface takes the website URL and calls the API we've developed using Ajax.

screen_webpage_1

2. Ajax request using javascript: main.js
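$.ajax({
    // Encode the page URL so its own query string or fragment survives as a single parameter
    url: baseUrl + "?url=" + encodeURIComponent(mediumURL)
}).then(function(data) {
    processSummary(mediumURL, data.summary);
});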
3. Process API response and display on UI

The API response is displayed on the HTML page using javascript.

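var summary = document.createElement('p');
var label = document.createElement('b');
label.textContent = 'Summary';
// Append the summary as a text node so it is never parsed as HTML
summary.append(label, document.createTextNode(': ' + summaryData));
$('#summary').append(summary);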

screen_webpage_2

Contribution

Feel free to raise an issue for a bug or feature request, and to open a pull request for any kind of improvement.

Ideas

If you find this project interesting, there is plenty more you can do. The following ideas might help:

  • Customize the API by adding more options to control the output, e.g., summary length, ignoring list text, etc.
  • Display a list of sentences instead of a paragraph.
  • Create a Chrome plugin that highlights the sentences.
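
As a starting point for the first two ideas, here is a small sketch of a length option that also returns the summary as a list of sentences. It is not part of the project: `select_summary` and its `max_sentences` parameter are hypothetical, and it assumes you already have one score per sentence.

```python
def select_summary(sentences, scores, max_sentences=3):
    """Return up to max_sentences highest-scoring sentences, in document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Rank sentence indices by score, highest first (ties keep document order).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top ones, then restore their original order for readability.
    chosen = sorted(ranked[:max(max_sentences, 0)])
    return [sentences[i] for i in chosen]
```

The returned list can be joined into a paragraph or rendered one sentence per line, so the same helper would serve both ideas.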

Credits

This application uses Open Source components. You can find the source code of their open source projects along with license information below.

I acknowledge and am grateful to these developers for their contributions to open source.

jusText used in /framework
Project: Heuristic based boilerplate removal tool https://github.com/miso-belica/jusText
Copyright (c) 2011, Jan Pomikalek <[email protected]> Copyright (c) 2013, Michal Belica. All Rights Reserved.
License (2-Clause BSD) https://github.com/miso-belica/jusText/blob/dev/LICENSE.rst
HTML template theme
Project: Stanley - HTML theme by TemplateMag (https://templatemag.com)
Copyrights Stanley. All Rights Reserved.
Licensing information: https://templatemag.com/license/