j-min / Easy Namuwiki Extractor
Easy Namuwiki Extractor
A simple NamuWiki extractor, built as an extension of Namu Wiki Extractor.
This module strips the namu mark (NamuWiki's markup syntax) from a NamuWiki document and extracts only its plain text.
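To make the idea of "stripping namu mark" concrete, here is a minimal sketch that removes a handful of common constructs with regular expressions. It is a toy illustration, not this module's implementation: the function name `strip_namu_mark` is hypothetical, and the patterns cover only links, bold/italic quotes, strikethrough markers, and headings.

```python
import re


def strip_namu_mark(text):
    """Toy illustration only: remove a few common namu mark constructs."""
    # [[target|label]] -> label, and [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # '''bold''' and ''italic'' -> inner text (drop runs of 2+ apostrophes)
    text = re.sub(r"'{2,}", "", text)
    # ~~strikethrough~~ -> inner text
    text = text.replace("~~", "")
    # == Heading == -> Heading (applied line by line)
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    return text


print(strip_namu_mark("'''Namu''' is a [[wiki|wiki site]]."))
```

Real namu mark has many more constructs (tables, macros, nested braces), so a production extractor needs considerably more than these four rules.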
Environment
- Python 2, 3
- tqdm
Usage
- Clone this repo:
  git clone https://github.com/j-min/Easy-Namuwiki-Extractor
- Download the NamuWiki JSON dump into the repo directory:
  wget http://file2.unofficialnis.ga/namuwiki_161031.json
- You can find the latest dumps here
- Run the extractor:
  python Run_extractor.py -i input_json_file -o outputfile_name
- Options:
  --input (-i): input filename
  --output (-o): output filename
  --multiprocess (-m): run with the multiprocessing module
  --title (-t): include document titles in the extracted output
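The options above map naturally onto Python's `argparse`. The following is a rough sketch of how such a command line could be declared; it is not taken from `Run_extractor.py`. It assumes that `-m` and `-t` are boolean switches and that `-i` and `-o` are required, and the output filename `namuwiki.txt` is just an example.

```python
import argparse


def build_parser():
    # Flag names follow the list above; types and required-ness are assumptions.
    parser = argparse.ArgumentParser(
        description="Extract plain text from a NamuWiki JSON dump")
    parser.add_argument("-i", "--input", required=True,
                        help="input filename")
    parser.add_argument("-o", "--output", required=True,
                        help="output filename")
    parser.add_argument("-m", "--multiprocess", action="store_true",
                        help="run with the multiprocessing module")
    parser.add_argument("-t", "--title", action="store_true",
                        help="include document titles in the output")
    return parser


# Parse an example argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["-i", "namuwiki_161031.json", "-o", "namuwiki.txt", "-t"])
print(args.input, args.output, args.multiprocess, args.title)
```

Under these assumptions, the equivalent shell invocation would be `python Run_extractor.py -i namuwiki_161031.json -o namuwiki.txt -t`.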
What the NamuWiki JSON looks like
- Screenshot from a web JSON viewer
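If no JSON viewer is at hand, the dump's structure can also be inspected from Python. The sketch below assumes Python 3 and that the file is a JSON array of objects. The helper `summarize_dump` is hypothetical, and the sample records with their `title` and `text` fields are made up for illustration, so check the real dump for its actual field names.

```python
import json
import os
import tempfile


def summarize_dump(path):
    """Return (top-level type name, number of entries, keys of first record)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # loads the whole file into memory
    first = data[0] if isinstance(data, list) and data else data
    keys = sorted(first.keys()) if isinstance(first, dict) else []
    return type(data).__name__, len(data), keys


# Hypothetical two-record sample standing in for the real dump.
sample = [
    {"title": "A", "text": "'''A''' is a letter."},
    {"title": "B", "text": "See [[A]]."},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

kind, count, keys = summarize_dump(sample_path)
print(kind, count, keys)
```

Because `json.load` reads everything at once, running this on a full dump needs enough memory to hold the entire file.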
Sample Output