
word-fish / wordfish-python

License: MIT
Extract relationships between standardized terms from a corpus of interest using deep learning 🐟

Programming Languages

python, javascript, CSS

Projects that are alternatives of or similar to wordfish-python

Nlp Journey
Documents, papers and code related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+6689.47%)
Mutual labels:  word2vec, gensim, lda
NLP-paper
🎨🎨 NLP (natural language processing) tutorials 🎨🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (+21.05%)
Mutual labels:  word2vec, lda
doc2vec-api
document embedding and machine learning script for beginners
Stars: ✭ 92 (+384.21%)
Mutual labels:  word2vec, gensim
pydataberlin-2017
Repo for my talk at the PyData Berlin 2017 conference
Stars: ✭ 63 (+231.58%)
Mutual labels:  gensim, lda
Aravec
AraVec is a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models.
Stars: ✭ 239 (+1157.89%)
Mutual labels:  word2vec, gensim
word-embeddings-from-scratch
Creating word embeddings from scratch and visualize them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (+15.79%)
Mutual labels:  word2vec, gensim
biovec
ProtVec can be used in protein interaction predictions, structure prediction, and protein data visualization.
Stars: ✭ 23 (+21.05%)
Mutual labels:  word2vec, gensim
Splitter
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).
Stars: ✭ 177 (+831.58%)
Mutual labels:  word2vec, gensim
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (+173.68%)
Mutual labels:  word2vec, gensim
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (+42.11%)
Mutual labels:  word2vec, lda
word2vec-pt-br
Implementation and model generated by training (trigram) on the Portuguese (pt-br) Wikipedia
Stars: ✭ 34 (+78.95%)
Mutual labels:  word2vec, gensim
Gemsec
The TensorFlow reference implementation of 'GEMSEC: Graph Embedding with Self Clustering' (ASONAM 2019).
Stars: ✭ 210 (+1005.26%)
Mutual labels:  word2vec, gensim
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+931.58%)
Mutual labels:  word2vec, gensim
Word2VecAndTsne
Scripts demo-ing how to train a Word2Vec model and reduce its vector space
Stars: ✭ 45 (+136.84%)
Mutual labels:  word2vec, gensim
Germanwordembeddings
Toolkit to obtain and preprocess german corpora, train models using word2vec (gensim) and evaluate them with generated testsets
Stars: ✭ 189 (+894.74%)
Mutual labels:  word2vec, gensim
text-classification-cn
Chinese text classification practice based on the Sogou news corpus, using traditional machine learning methods as well as pretrained models
Stars: ✭ 81 (+326.32%)
Mutual labels:  word2vec, corpus
NMFADMM
A sparsity aware implementation of "Alternating Direction Method of Multipliers for Non-Negative Matrix Factorization with the Beta-Divergence" (ICASSP 2014).
Stars: ✭ 39 (+105.26%)
Mutual labels:  word2vec, lda
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+67073.68%)
Mutual labels:  word2vec, gensim
Log Anomaly Detector
Log Anomaly Detection - Machine learning to detect abnormal event logs
Stars: ✭ 169 (+789.47%)
Mutual labels:  word2vec, gensim
walklets
A lightweight implementation of Walklets from "Don't Walk Skip! Online Learning of Multi-scale Network Embeddings" (ASONAM 2017).
Stars: ✭ 94 (+394.74%)
Mutual labels:  word2vec, gensim

wordfish-python

DOI

If pulling a thread of meaning from woven text
is that which your heart does wish.
Not so absurd or seemingly complex,
if you befriend a tiny word fish.


Choose your input corpus, terminologies, and deployment environment, and an application will be generated that uses deep learning to extract features for text; entities can then be mapped onto those features to discover relationships and classify new texty things. Custom plugins allow for dynamic generation of corpus and terminologies from the data structures and standards of your choice, via wordfish-plugins. You can have experience with coding (and use the functions in the module as you wish), or no experience at all, and let the interactive web interface walk you through generating your application. This will ideally be able to generate single instances of analysis applications, as well as an instance that we can deploy on the cloud (and integrate into a collaborative, cloud-based tool for many researchers to use).
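
The feature-extraction idea can be sketched with gensim's word2vec implementation (word2vec and gensim both appear in the related-project labels above). This is a minimal, hypothetical sketch, not the wordfish API: the toy sentences and term list stand in for what the corpus and terminology plugins would actually provide.

  # Minimal sketch: learn an embedding space for a corpus, then map
  # standardized terms onto it to score candidate relationships.
  # Requires gensim >= 4.0; sentences and terms are toy placeholders.
  from gensim.models import Word2Vec

  sentences = [
      ["anxiety", "is", "related", "to", "fear"],
      ["memory", "supports", "learning"],
  ]

  # Learn a feature (embedding) space for the corpus vocabulary
  model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

  # Map standardized terms onto the space and score pairwise relationships
  terms = ["anxiety", "fear", "memory"]
  for a in terms:
      for b in terms:
          if a != b and a in model.wv and b in model.wv:
              print(a, b, model.wv.similarity(a, b))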

A demo will eventually be here.

**Under development! Not ready for use!**

0. Install the tool

  pip install git+https://github.com/word-fish/wordfish-python.git

1. Generate your application

Call the tool to configure your application:

wordfish

Select your terminologies and corpus.

A custom application is generated for you!

This will produce a folder for you to drop in your cluster environment.

2. Install

Drop the folder into the home directory of your cluster environment. Run the install script to install the package itself and generate the output folder structure. The only argument you need to supply is the base of your output directory:

  WORK=/scratch/users/vsochat/wordfish
  bash install.sh $WORK

All scripts for you to run are in the scripts folder here:

  cd $WORK/scripts

Each of these files corresponds to a step in the pipeline, and is simply a list of commands to be run in parallel. You can use launch, or submit each command to a SLURM cluster. There will eventually be scripts provided for easily running with your preferred method.

Current Project Status

Plugins are being developed and pipelines tested. When this is finished, the functionality will be integrated into the application generation. It is not yet decided whether a database will be used for the initial processing. For deployment, it likely makes sense to deploy the module folder to a cluster environment and then perhaps deploy an application with Docker, but I have not yet decided. I have also not yet implemented the inference, but I have a good idea of how I'm going to do it.

3. Running the Pipeline

After your custom application is generated, the install script simply runs run.py, which creates all output folders and run scripts. This script used to do some preprocessing as well, but all of those steps have been moved into the files in the scripts folder. This means that you have a few options for running:

  • submit the commands in serial, locally. You can run a job file with bash, e.g. bash run_extraction_relationships.job
  • submit the commands to a launch cluster, something like launch -s run_extraction_relationships.job
  • submit the commands individually to a SLURM cluster. This means reading in the file and submitting each command with a line like sbatch -p normal -J myjob --wrap "[command line here]" (see the sketch after this list)
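
If you go the SLURM route, a small driver script can read a job file and submit each line as its own job. Here is a sketch, assuming sbatch is on your PATH (the job names are arbitrary):

  # Submit each line of a wordfish job file as an independent SLURM job.
  import subprocess

  with open("run_extraction_relationships.job") as f:
      commands = [line.strip() for line in f if line.strip()]

  for i, cmd in enumerate(commands):
      # --wrap submits a single shell command without a separate script file
      subprocess.run(
          ["sbatch", "-p", "normal", "-J", "wordfish-%s" % i, "--wrap", cmd],
          check=True,
      )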

4. Infrastructure

The jobs will generate output to fill in the following file structure in your project base folder (files that will eventually be produced are shown):

  WORK
    corpus
      corpus1
        12345_sentences.txt
        12346_sentences.txt
      corpus2
        12345_sentences.txt
        12346_sentences.txt
    terms
      terms1_terms.txt
      terms2_relationships.txt
    scripts
      run_extraction_corpus.job
      run_extraction_relationships.job
      run_extraction_terms.job

The folders are generated dynamically by the run.py script for each corpus and terms plugin, based on the "tag" variable in the plugin's config. Relationships, by way of being associated with terms, are stored in the equivalent folder; the process is only separate because not all terms plugins can have relationships defined. The corpora are kept separate at this step because the output has not yet been parsed into the wordfish standard that allows integration across corpora and terminologies, at which point wordfish unique IDs will be assigned.
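
To make the tag-to-folder mapping concrete, here is a hypothetical sketch of that generation step; the real run.py may differ, and the plugin configs shown are placeholders:

  # Hypothetical sketch of folder generation from plugin "tag" variables.
  import os

  corpus_plugins = [{"tag": "corpus1"}, {"tag": "corpus2"}]  # placeholder configs
  work = os.environ.get("WORK", "/scratch/users/vsochat/wordfish")

  # One output folder per corpus plugin, named by its tag
  for plugin in corpus_plugins:
      os.makedirs(os.path.join(work, "corpus", plugin["tag"]), exist_ok=True)

  # Terms (and relationships) files share one folder; scripts get their own
  for subdir in ("terms", "scripts"):
      os.makedirs(os.path.join(work, subdir), exist_ok=True)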

Deployment Options

Right now the only option (that works) is to generate a folder and install on a cluster. However, if there is interest or need, I can generate a version to install and run on a virtual machine, either with Vagrant (local) or Vagrant with Amazon Web Services, and I am also thinking about Docker for easy cloud deployment.

Plugins

Plugins will be resources from which to derive a corpus and/or a terminology (terms). Optionally, a terminology can also have relationships. To see an initial list of available plugins, see the wordfish-plugins repo.
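
As a rough illustration of the idea (the actual wordfish-plugins format may differ; every field here is hypothetical), a plugin config might declare what it can provide:

  # Hypothetical plugin config; field names are illustrative only.
  plugin_config = {
      "name": "My Source",
      "tag": "mysource",       # used to name output folders (see above)
      "corpus": True,          # plugin can extract a corpus
      "terms": True,           # plugin can extract a terminology
      "relationships": False,  # not all terms plugins define relationships
  }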

Data

We will eventually want to relate these analyses to data, such as brain imaging data. For example, NeuroVault is a database of whole-brain statistical maps annotated with terms from the Cognitive Atlas, which means we can link brain imaging data to terms from the Cognitive Atlas ontology and to either of the fsl or fma_nif plugins, which both define brain regions. Toward this aim, a separate wordfish-data repo has been added. Nothing has been developed there yet, but it's in the queue.
