newca12 / Dictionary Builder
Dictionary builder
About
This project allows you to build dictionaries based on Wiktionary entries.
Dictionary builder used to be a demonstration of advanced JAXB techniques for unmarshalling very large XML documents with a very low memory footprint.
The Java/JAXB implementation has been archived in the java-jaxb branch.
Then it was rewritten in Scala with Akka Streams.
The Scala/Akka Streams implementation has been archived in the scala-akka-streams branch.
It has now been rewritten in Rust.
The resulting dictionary is exactly the same with all three implementations. None of these implementations was designed to be used as a benchmark, but the Rust results are nevertheless breathtaking. See below.
dictionary-builder is an EDLA project.
The purpose of edla.org is to promote the state of the art in various domains.
How to use it
- Rust needs to be installed to generate an executable.
- Get a fresh Wiktionary backup.
  Choose your favorite language and download the dump containing the current versions of article content here.
  Example for the English dump: http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles-multistream.xml.bz2
- Uncompress the freshly downloaded dump somewhere (be aware that you need up to 6 GB of free disk space).
- Build the executable: cargo build --release
- Edit Setings.toml to indicate the language you chose, where the dump is located and, last but not least, where the dictionary should be generated.
  (Be aware that you need some free disk space to store your dictionary.)
- Launch the program: ./target/release/dictionary-builder
- Some results:
  From the English dump, 746879 entries are generated in less than 2 minutes, and 3 GB of disk space are required for the dictionary.
That's it.
Limitations
The Rust version was not tested on Windows systems.
Performance comparison
Tests were run on a modest i7-4600U CPU @ 2.10GHz with an SSD.
The results sound like a joke:
| | Rust | Scala/Akka Streams | Java/JAXB |
|---|---|---|---|
| without definitions | 37s | 4min 47s | 7min 36s |
| with definitions | 1min 53s | 5min 46s | 9min 1s |
The Rust implementation outperforms the other implementations by far and, as the icing on the cake, Rust uses ten times less memory. 🚀
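
To put the table in perspective, the timings can be turned into speedup factors. The snippet below is only an illustrative sketch and not part of dictionary-builder: the helper names `to_seconds` and `speedup` are made up for this example, and the timings are copied by hand from the table above.

```rust
/// Convert a "minutes + seconds" duration into total seconds.
/// (Hypothetical helper, introduced only for this example.)
fn to_seconds(minutes: u32, seconds: u32) -> u32 {
    minutes * 60 + seconds
}

/// How many times longer `other_secs` is than `baseline_secs`.
/// (Hypothetical helper, introduced only for this example.)
fn speedup(other_secs: u32, baseline_secs: u32) -> f64 {
    f64::from(other_secs) / f64::from(baseline_secs)
}

fn main() {
    // Timings copied from the table above:
    // (mode, Rust, Scala/Akka Streams, Java/JAXB), all in seconds.
    let rows = [
        ("without definitions", to_seconds(0, 37), to_seconds(4, 47), to_seconds(7, 36)),
        ("with definitions", to_seconds(1, 53), to_seconds(5, 46), to_seconds(9, 1)),
    ];

    for &(mode, rust, scala, java) in rows.iter() {
        println!(
            "{}: Rust is {:.1}x faster than Scala/Akka Streams and {:.1}x faster than Java/JAXB",
            mode,
            speedup(scala, rust),
            speedup(java, rust)
        );
    }
}
```

With these numbers, Rust comes out roughly 8 and 12 times faster without definitions, and roughly 3 and 5 times faster with definitions. As noted above, none of the implementations was designed as a benchmark, so these ratios are indicative only.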
License
© 2009-2020 Olivier ROLAND. Distributed under the GPLv3 License.