
tao-pr / vor-knowledge-graph

Licence: other
🎓 Open knowledge mining and graph builder


Project vör : Open Knowledge modeling




Synopsis

The project started as a quick hack for crawling and modelling the large volume of open knowledge available on Wikipedia. The result is a "nearly" complete graph of that knowledge, together with the ability to traverse the relations between knowledge topics.


Infrastructure / Prerequisites

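To build and run the knowledge graph engine with vör, you need the following software: Python 3.x, MongoDB (holds the crawled pages), OrientDB (holds the graph and the topic index), and, for the optional visualisation, Node.js.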


Setup

Install the Python 3.x requirements with:

  $ pip3 install -U -r requirements.txt

Install the Node.js modules required by the graph visualiser. You may skip this step if you are not interested in visualisation.

  $ npm install

In addition to the packages from the npm registry, you also need to install Sigma.js for visualisation. It is not bundled with this repository.


1) Download (crawl) Wikipedia pages

Execute:

  $ python3 crawl_wiki.py --verbose 

The script crawls knowledge topics from Wikipedia endlessly, starting from the seed page. You may change the initial topic within the script to whatever suits you best. To stop the process, simply terminate it. It does not leave anything in a dirty state, so you can re-execute the script at any time.

[NOTE] The script keeps crawling and downloading related knowledge through link traversal. It never ends unless you terminate it.
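
As a rough illustration of link traversal that can be stopped and re-run safely, here is a minimal sketch. It is not the code of crawl_wiki.py: the function `crawl`, the `fetch_links` callback (a stand-in for downloading a Wikipedia page and extracting its links), the `store` dict (a stand-in for the MongoDB collection) and the `max_pages` parameter are all hypothetical names introduced for this example.

```python
from collections import deque


def crawl(seed, fetch_links, store, max_pages=None):
    """Breadth-first link traversal starting from `seed`.

    fetch_links(topic) -> iterable of linked topics (hypothetical stand-in
    for downloading a page and extracting its links).
    store: dict used as the page store (hypothetical stand-in for MongoDB).
    Topics already in `store` are not fetched again, so the crawl can be
    terminated and restarted at any time.
    """
    queue = deque([seed])
    seen = {seed}
    while queue:
        # Optional stop condition; the real script runs until terminated.
        if max_pages is not None and len(store) >= max_pages:
            break
        topic = queue.popleft()
        if topic in store:
            links = store[topic]          # already downloaded: reuse
        else:
            links = list(fetch_links(topic))
            store[topic] = links          # persist before moving on
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store
```

Because each page is stored as soon as it is fetched, interrupting the loop loses at most the page in flight, and a second run picks up from what is already stored.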


2) Build the knowledge graph

Execute:

  $ python3 build_knowledge.py --verbose --root {PASSWORD} --limit {NUM}

Here {PASSWORD} is the root password of your OrientDB instance, and {NUM} is the number of Wikipedia topics to process.

The script imports the raw text knowledge from MongoDB into OrientDB as one big graph. The output graph in OrientDB is built from the following components:

  • [1] Vertices : Represent topic / keyword
  • [2] Edges : Represent relations between topic-keyword or keyword-keyword.

[NOTE] The script processes the data in the collection all the way to the end. This takes a long time if the collection is large.
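
To make the vertex and edge structure concrete, here is a small in-memory sketch. It does not use OrientDB and is not taken from build_knowledge.py. The function `build_graph`, the edge labels, and the rule that consecutive keywords are linked to each other are assumptions made only for illustration; the README does not say how keyword-keyword relations are derived.

```python
def build_graph(topics):
    """Build a toy graph from {topic: [keywords, ...]}.

    Returns (vertices, edges), where vertices is a set of names and
    edges is a set of (source, target, kind) tuples.
    """
    vertices = set()
    edges = set()
    for topic, keywords in topics.items():
        vertices.add(topic)
        for keyword in keywords:
            vertices.add(keyword)
            edges.add((topic, keyword, "topic-keyword"))
        # Assumption: keywords that appear next to each other are related.
        for left, right in zip(keywords, keywords[1:]):
            edges.add((left, right, "keyword-keyword"))
    return vertices, edges


if __name__ == "__main__":
    sample = {"Graph theory": ["vertex", "edge"], "Database": ["edge", "index"]}
    vertices, edges = build_graph(sample)
    print(sorted(vertices))
    print(sorted(edges))
```

A keyword shared by two topics (here "edge") becomes a single vertex, which is what lets a traversal move from one topic to another.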


3) Visualise the knowledge graph

Execute:

  $ node visualise {PASSWORD}

Here {PASSWORD} is your OrientDB root password. The script downloads the graph data from OrientDB and renders it. Once it has finished, you can view the graphs as follows.

  • [1] Universe of topics graph [html/graph-universe.html].
  • [2] Index graph [html/graph-index.html].

4) Build Word2Vec model over the crawled data

Execute:

  $ python3 build_wordvec.py --limit {LIMIT} --out {PATH_TO_MODEL}

MongoDB should already contain a sufficient amount of downloaded Wikipedia content, which is produced by running crawl_wiki.py. The output is a binary file.


5) Create topic index

Execute:

  $ python3 build_index.py --limit {LIMIT} --root {PASSWORD}

The script generates another OrientDB collection, vorindex, which contains the inverted index of the topics and their corresponding keywords. The weight of each edge is calculated from how frequently the word appears in each of the topics.
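
The idea of a frequency-weighted inverted index can be sketched in a few lines. This is an illustration only, not the code of build_index.py: the function `build_inverted_index`, the whitespace tokenisation, and the use of raw occurrence counts as weights are assumptions. The actual script may tokenise, filter, and normalise differently.

```python
from collections import Counter, defaultdict


def build_inverted_index(topics):
    """Map each word to the topics it occurs in, weighted by frequency.

    topics: dict of {topic title: topic text}.
    Returns {word: {topic: count}}, where count is the number of times
    the word occurs in that topic's text (assumed weighting).
    """
    index = defaultdict(dict)
    for topic, text in topics.items():
        # Assumption: lower-cased whitespace tokens stand in for keywords.
        counts = Counter(text.lower().split())
        for word, count in counts.items():
            index[word][topic] = count
    return dict(index)


if __name__ == "__main__":
    sample = {"Cat": "cat purr cat", "Dog": "dog bark cat"}
    for word, postings in sorted(build_inverted_index(sample).items()):
        print(word, postings)
```

Looking up a word then returns every topic that mentions it, and the weights allow those topics to be ranked.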



Licence

The project is licenced under GNU 3 public licence. All third party libraries are redistributed under their own licences.

