ericleasemorgan / reader

Licence: GPL-2.0 license
Distant Reader, a tool for using & understanding a corpus

Distant Reader CORD

The Distant Reader CORD is a high performance computing (HPC) system which: 1) takes an almost arbitrary amount of unstructured data (text) as input and outputs a set of structured data for analysis, and 2) does this work against a specific data set called CORD-19. (Reader CORD is based on a different software suite called Distant Reader Classic which is designed for more generic sets of input.)

To do this work, the Distant Reader CORD first caches the data set. Second, it transforms the content into a set of plain text files. Third, the Reader does text mining and natural language processing against the text files for the purpose of feature extraction: n-grams, parts-of-speech, named-entities, etc. The result of this process is a set of tab-delimited text files. The whole of the tab-delimited text files is then distilled into a relational database. A set of tabular and narrative reports is then generated against the database. The cache, transformed plain text files, tab-delimited files, relational database, and reports are then compressed into a single (zip) file, and returned to the... reader. [1]
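
The middle of that pipeline (feature extraction, tab-delimited output, distillation into a database, and a report) can be illustrated with a minimal sketch. This is not the Reader's actual code: it uses only the Python standard library, substitutes naive word bigrams for the spaCy-based features, and every function, table, and column name below is hypothetical.

```python
import csv
import io
import re
import sqlite3


def bigrams(text):
    """Return the lowercase word bigrams found in a string."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))


def features_to_tsv(documents):
    """Write one (doc_id, bigram) row per feature as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["doc_id", "bigram"])
    for doc_id, text in documents.items():
        for first, second in bigrams(text):
            writer.writerow([doc_id, f"{first} {second}"])
    return buffer.getvalue()


def distill(tsv_text):
    """Load the tab-delimited rows into an in-memory SQLite database."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE bigrams (doc_id TEXT, bigram TEXT)")
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    connection.executemany("INSERT INTO bigrams VALUES (:doc_id, :bigram)", rows)
    connection.commit()
    return connection


def report(connection, limit=3):
    """Return the most frequent bigrams as (bigram, count) tuples."""
    query = (
        "SELECT bigram, COUNT(*) AS n FROM bigrams "
        "GROUP BY bigram ORDER BY n DESC, bigram ASC LIMIT ?"
    )
    return connection.execute(query, (limit,)).fetchall()


# Two made-up documents standing in for the transformed plain text files.
documents = {
    "doc-1": "The virus spreads quickly. The virus mutates.",
    "doc-2": "Researchers study the virus.",
}
tsv = features_to_tsv(documents)
db = distill(tsv)
for bigram, count in report(db):
    print(f"{bigram}\t{count}")
```

Keeping the tab-delimited step separate from the database step mirrors the description above: the flat files remain usable on their own, and the database is simply a distillation of them.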

The returned file is affectionately called a "study carrel". The student, researcher, or scholar is intended to peruse the study carrel for the purpose of supplementing the more traditional reading process. For more detail, links of possible interest include:

As an HPC system, the Distant Reader CORD is not a single computer program but instead a suite of software composed of many individual scripts and applications. Personally, I see the scripts and applications as akin to a collection of poems used to make the output of human expression more cogent. Really. Seriously.

As a collection of scripts and applications, the Distant Reader has only been built by "standing on the shoulders of giants". Listed here in no particular order, and not necessarily completely, they include these below and more:

  • the Perl-based LWP modules - this software is a significant part of the harvesting process
  • Wget - an absolutely wonderful Internet spidering application
  • Tika - a Java-based library which transforms just about any file into plain text
  • spaCy - a Python module which simplifies natural language processing operations
  • Gensim - another Python module for natural language processing
  • Textacy - a Python module building on the good work of spaCy
  • SQLite - a cross-platform, SQL-compliant relational database library/application
  • OpenStack - a tool for building virtual machines
  • Slurm - a tool for instantiating a cluster of computer nodes and what runs on them
  • Airavata - a Web-based suite of software used to monitor computing jobs on a cluster
  • Other Python Libraries - sqlalchemy, pandas, itertools, wordcloud, scipy, sklearn, networkx, textatistic, nltk
  • Other Perl Modules - DBI, JSON, Archive::Zip, WebService::Solr, XML::XPath, CGI, File::Basename, File::Copy, HTML::Entities, HTML::Escape
  • JavaScript Libraries - bootstrap, jquery
  • Other Programs - csvstack

If you have any questions, then please don't hesitate to ask.

"Happy reading!"

[1] Just like GNU, the Distant Reader's definition is rather recursive.


Eric Lease Morgan <[email protected]>
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame
574/631-8604

Created: June 28, 2018
Updated: May 31, 2020

cord-19

This suite of software will prepare a data set called "CORD-19" for processing with the Distant Reader.

CORD-19 is a set of more than 50,000 full text scholarly journal articles surrounding the topic of COVID-19. Each "article" is really a JSON file containing (very) rudimentary bibliographic information, a set of paragraphs, and bibliographic citations. As a pre-processing step for the Distant Reader, the suite processes the CORD-19 metadata and its associated JSON files.

To get this software to work for you, run pip install -r requirements.txt, configure ./bin/cache.sh, and then run ./bin/build.sh. The system will then:

  1. download a zip file and its associated metadata file
  2. uncompress the zip file
  3. move all the JSON files to a single directory
  4. initialize a database
  5. pour the metadata into the database
  6. output a simple narrative report summarizing the content of the metadata file
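
Steps 4 through 6 above might look something like the following sketch. It is an illustration only, not the contents of ./bin/build.sh: the table name, the four columns, and the sample rows are assumptions made up for the example, and the real metadata file carries many more fields.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for the downloaded metadata file.
SAMPLE_METADATA = """cord_uid,title,journal,publish_time
abc123,First article,Journal A,2020-03-01
def456,Second article,Journal A,2020-04-15
ghi789,Third article,Journal B,2019-11-30
"""


def initialize(connection):
    """Step 4: create the (assumed) table that will hold the metadata."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "cord_uid TEXT PRIMARY KEY, title TEXT, journal TEXT, publish_time TEXT)"
    )


def pour(connection, handle):
    """Step 5: read CSV rows from a file handle and insert them."""
    rows = csv.DictReader(handle)
    connection.executemany(
        "INSERT OR REPLACE INTO documents "
        "VALUES (:cord_uid, :title, :journal, :publish_time)",
        rows,
    )
    connection.commit()


def summarize(connection):
    """Step 6: build a short narrative summary of the metadata."""
    total = connection.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
    journals = connection.execute(
        "SELECT journal, COUNT(*) FROM documents "
        "GROUP BY journal ORDER BY COUNT(*) DESC, journal"
    ).fetchall()
    lines = [f"The metadata file describes {total} articles."]
    for journal, count in journals:
        lines.append(f"  {journal}: {count}")
    return "\n".join(lines)


connection = sqlite3.connect(":memory:")
initialize(connection)
pour(connection, io.StringIO(SAMPLE_METADATA))
summary = summarize(connection)
print(summary)
```

To run the same functions against a real file, open it with open(path, newline="") and pass the handle to pour() in place of the io.StringIO object.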

Depending on the network connection, the build process takes less than 7 minutes.

The next steps are the creation of two scripts:

  1. Given an SQL SELECT statement, return a list of keys, and use them to initialize a Distant Reader study carrel
  2. Given a JSON file, output a more human-readable version of the same
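
Since neither script exists yet, the following is only one possible shape for them. The sketch assumes a documents table keyed by cord_uid and a JSON layout with a metadata.title string and a body_text list of paragraphs, each with a text field; all function names and sample values are hypothetical.

```python
import json
import sqlite3
import textwrap


def select_keys(connection, select_statement):
    """Run a SELECT statement and return the first column of every row."""
    return [row[0] for row in connection.execute(select_statement)]


def json_to_text(document, width=72):
    """Render a parsed JSON article as wrapped, human-readable plain text."""
    title = document.get("metadata", {}).get("title") or "(untitled)"
    lines = [title, "=" * len(title), ""]
    for paragraph in document.get("body_text", []):
        lines.append(textwrap.fill(paragraph.get("text", ""), width=width))
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Script 1: a made-up database and query yielding keys for a study carrel.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (cord_uid TEXT, journal TEXT)")
connection.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("abc123", "Journal A"), ("def456", "Journal A"), ("ghi789", "Journal B")],
)
keys = select_keys(
    connection,
    "SELECT cord_uid FROM documents WHERE journal = 'Journal A' ORDER BY cord_uid",
)
print(keys)

# Script 2: a made-up article in the assumed JSON layout.
raw = json.dumps(
    {
        "metadata": {"title": "A sample title"},
        "body_text": [
            {"text": "First paragraph."},
            {"text": "Second paragraph."},
        ],
    }
)
readable = json_to_text(json.loads(raw))
print(readable)
```

The list of keys returned by select_keys() is what would then be handed to the Distant Reader to initialize a study carrel; that hand-off is not shown here.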

Wish us luck.


Eric Lease Morgan <[email protected]>
May 14, 2020