YaleDHLab / intertext

Licence: other
Detect and visualize text reuse

Intertext

Detect and visualize text reuse within collections of plain text or XML documents.

Intertext uses machine learning and interactive visualizations to identify and display intertextual patterns in text collections. The text processing is based on minhashing vectorized strings and the web viewer is based on interactive React components. [Demo]
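To make the minhashing idea concrete, here is a minimal sketch of how MinHash signatures can estimate the overlap between two passages. It is not Intertext's implementation: the function names, the 5-character shingle size, and the 128 hash functions are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of overlapping k-character windows in the text."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, varied by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=128, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, k)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature("the quick brown fox jumps over the lazy dog")
    b = minhash_signature("the quick brown fox jumped over the lazy dog")
    print(estimated_jaccard(a, b))
```

Passages that share most of their shingles produce signatures that agree in most slots, so candidate matches can be found by comparing short signatures rather than full texts.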

App preview

Installation

To install Intertext, run the steps below:

# optional: install Anaconda and set up conda virtual environment
conda create --name intertext python=3.7
conda activate intertext

# install the package
pip uninstall intertext -y
pip install https://github.com/yaledhlab/intertext/archive/master.zip

Usage

# search for intertextuality in some documents
intertext --infiles "sample_data/texts/*.txt"

# serve output
python -m http.server 8000

Then open a web browser to http://localhost:8000/output and you'll see any intertextualities the engine discovered!

CUDA Acceleration

To enable CUDA acceleration, we recommend the following steps when installing the module:

# set up conda virtual environment
conda create --name intertext python=3.7
conda activate intertext

# set up cuda and cupy
conda install cudatoolkit
conda install -c conda-forge cupy

# install the package
pip uninstall intertext -y
pip install https://github.com/yaledhlab/intertext/archive/master.zip
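After installing, it can be useful to confirm that CuPy is importable and can see a GPU. The check below is a small sketch, not part of Intertext; `cuda_available` is a hypothetical helper name, and the sketch assumes only that CuPy exposes `cupy.cuda.runtime.getDeviceCount()`.

```python
def cuda_available():
    """Return True if CuPy imports and reports at least one CUDA device."""
    try:
        import cupy
    except ImportError:
        # CuPy is not installed, or its CUDA libraries could not be loaded.
        return False
    try:
        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        # The CUDA runtime raised an error, e.g. no driver or no device.
        return False


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

If this prints `False`, the CUDA setup is probably incomplete; how Intertext itself behaves in that case is not covered by this sketch.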

Providing Metadata

To display the author and title of matching texts, pass the path to a metadata file to the intertext command with the --metadata flag, e.g.

intertext --infiles "sample_data/texts/*.txt" --metadata "sample_data/metadata.json"

Metadata files should be JSON files with the following format:

{
  "a.xml": {
    "author": "Author A",
    "title": "Title A",
    "year": 1751,
    "url": "https://google.com?text=a.xml"
  },
  "b.xml": {
    "author": "Author B",
    "title": "Title B",
    "year": 1753,
    "url": "https://google.com?text=b.xml"
  }
}
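Writing this file by hand gets tedious for large collections. The sketch below builds a skeleton with one entry per input file that you can then fill in. It assumes the keys are file base names, as the example above suggests. `build_metadata_skeleton` is a hypothetical helper, and the placeholder values (empty strings and `None`) are meant to be replaced before use; whether Intertext tolerates them as-is is not something this sketch confirms.

```python
import json
from pathlib import Path


def build_metadata_skeleton(text_dir, pattern="*.txt"):
    """Return a dict keyed by file name with placeholder metadata fields."""
    metadata = {}
    for path in sorted(Path(text_dir).glob(pattern)):
        metadata[path.name] = {
            "author": "",        # placeholder: fill in by hand
            "title": path.stem,  # starting guess: file name without extension
            "year": None,        # placeholder: fill in by hand
            "url": "",           # placeholder: optional link to the text
        }
    return metadata


def to_json(metadata):
    """Serialize the skeleton in the same layout as the example above."""
    return json.dumps(metadata, indent=2, ensure_ascii=False)
```

For example, `print(to_json(build_metadata_skeleton("sample_data/texts")))` prints a skeleton that can be saved as a JSON file and edited.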

Deeplinking

If your text documents can be read on another website, you can add a url attribute to each of your files within your metadata JSON file (see example above).

If your documents are XML files and you would like to deeplink to specific pages within a reading environment, you can use the --xml_page_tag flag to designate the tag within which page breaks are identified. Additionally, you should include $PAGE_ID in the url attribute for the given file within your metadata file, e.g.

{
  "a.xml": {
    "author": "Author A",
    "title": "Title A",
    "year": 1751,
    "url": "https://google.com?text=a.xml&page=$PAGE_ID"
  },
  "b.xml": {
    "author": "Author B",
    "title": "Title B",
    "year": 1753,
    "url": "https://google.com?text=b.xml&page=$PAGE_ID"
  }
}

If the page ids are stored in an attribute of the --xml_page_tag tag rather than in its text, name that attribute with the --xml_page_attr flag.
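The sketch below illustrates the result these flags are meant to produce. It is not Intertext's own code: `page_ids` and `deeplink` are hypothetical helpers, and the `pb` tag and `n` attribute are example names only. It collects page ids from an XML document and substitutes each one for $PAGE_ID in a url template.

```python
import xml.etree.ElementTree as ET


def page_ids(xml_text, page_tag, page_attr=None):
    """Collect page ids from every <page_tag> element in document order.

    If page_attr is given, read the id from that attribute;
    otherwise use the element's text content.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for element in root.iter(page_tag):
        if page_attr:
            value = element.get(page_attr)
        else:
            value = (element.text or "").strip()
        if value:
            ids.append(value)
    return ids


def deeplink(url_template, page_id):
    """Replace the $PAGE_ID placeholder with a concrete page id."""
    return url_template.replace("$PAGE_ID", str(page_id))


if __name__ == "__main__":
    xml_text = '<text><pb n="1"/>first page<pb n="2"/>second page</text>'
    template = "https://google.com?text=a.xml&page=$PAGE_ID"
    for pid in page_ids(xml_text, "pb", "n"):
        print(deeplink(template, pid))
```

In this example the page tag is `pb` and the page attribute is `n`, which would correspond to passing `--xml_page_tag pb --xml_page_attr n`.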