scriptin / Topokanji

Topologically ordered lists of kanji for effective learning

TopoKanji

A 30-second explanation for people who want to learn kanji:

It is best to learn kanji starting from simple characters and then learning complex ones as compositions of "parts", which are called "radicals" or "components". For example:

一 → 二 → 三

丨 → 凵 → 山 → 出

言 → 五 → 口 → 語

It is also smart to learn more common kanji first.

This project is based on those two ideas and provides properly ordered lists of kanji to make your learning process as fast, simple, and effective as possible.

Motivation for this project initially came from reading this article: The 5 Biggest Mistakes People Make When Learning Kanji.

First 100 kanji from lists/aozora.txt (formatted for convenience):

人一丨口日目儿見凵山
出十八木未丶来大亅了
子心土冂田思二丁彳行
寸寺時卜上丿刀分厶禾
私中彐尹事可亻何自乂
又皮彼亠方生月門間扌
手言女本乙气気干年三
耂者刂前勹勿豕冖宀家
今下白勺的云牛物立小
文矢知入乍作聿書学合

These lists can be found in the lists directory. They differ only in the order of kanji. Each file contains a list of kanji, ordered as described in the following sections. There are a few options (see Used data for details):

aozora.(json|txt) - ordered by kanji frequency in Japanese fiction and non-fiction books; I recommend this list if you're starting to learn kanji
news.(json|txt) - ordered by kanji frequency in online news
twitter.(json|txt) - ordered by kanji frequency in Twitter messages
wikipedia.(json|txt) - ordered by kanji frequency in Wikipedia articles
all.(json|txt) - combined "average" version of all previous; this one is experimental, I don't recommend using it

You can use these lists to build an Anki deck or just as guidance. If you're looking for "names" or meanings of kanji, you might want to check my kanji-keys project.

What is a properly ordered list of kanji?

If you look at a kanji like 語, you can see it consists of at least three distinct parts: 言, 五, 口. Those are kanji by themselves too. The idea behind this project is to find an order of about 2000-2500 common kanji in which no kanji appears before its parts, so you only learn a new kanji when you already know its components.

Properties of properly ordered lists

No kanji appears before its parts (components). In fact, if you treat kanji as nodes in a graph structure and connect them with directed edges, where each edge means "kanji A includes kanji B as a component", it all forms a directed acyclic graph (DAG). For any DAG, it is possible to build a topological order, which is basically what "no kanji appears before its parts" means.
More common kanji come first. That way you learn useful characters as soon as possible.
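The first property can be checked mechanically. The following is a minimal sketch, not code from this repository: the function name, the `order` array and the `components` map are assumptions made for illustration, and the decomposition of 語 is the one used as an example in this README.

```javascript
// Hypothetical helper: reports every (kanji, part) pair that breaks the rule
// "no kanji appears before its parts".
// `order` is an array of characters; `components` maps a character to the
// characters it is built from. Both shapes are assumptions for this sketch.
function findOrderViolations(order, components) {
  const position = new Map(order.map((ch, i) => [ch, i]));
  const violations = [];
  for (const [kanji, parts] of Object.entries(components)) {
    if (!position.has(kanji)) continue; // kanji not in this list, nothing to check
    for (const part of parts) {
      // A part that is missing from the list, or placed after the kanji,
      // is treated as a violation here.
      if (!position.has(part) || position.get(part) > position.get(kanji)) {
        violations.push({ kanji, part });
      }
    }
  }
  return violations;
}

const components = { 語: ['言', '五', '口'] };
console.log(findOrderViolations(['言', '五', '口', '語'], components).length); // 0
console.log(findOrderViolations(['語', '言', '五', '口'], components).length); // 3
```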

Algorithm

Topological sorting is done using a modified version of Kahn's (1962) algorithm with an intermediate sorting step that deals with the second property above. This intermediate sorting uses the "weight" of each character: common kanji (lighter) tend to appear before rare kanji (heavier). See source code for details.
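To illustrate the general idea only, here is a sketch of Kahn's algorithm in which the set of "ready" characters is re-sorted by weight before each pick. It is one possible reading of the description above and is not taken from build.js; the function name, the `deps` and `weight` structures, and the sample weights are all assumptions.

```javascript
// Sketch: Kahn's algorithm with a weight-sorted "ready" set.
// `deps` maps each character to its components; `weight` maps a character to
// a number where lower means more common. Both are assumed shapes.
function weightedTopoSort(deps, weight) {
  const remaining = new Map();  // character -> components not yet placed
  const dependents = new Map(); // component -> characters that contain it
  for (const [ch, parts] of Object.entries(deps)) {
    remaining.set(ch, parts.length);
    for (const p of parts) {
      if (!remaining.has(p)) remaining.set(p, 0); // component without its own entry
      if (!dependents.has(p)) dependents.set(p, []);
      dependents.get(p).push(ch);
    }
  }
  const w = (ch) => (ch in weight ? weight[ch] : Number.MAX_SAFE_INTEGER);
  const ready = [...remaining.keys()].filter((ch) => remaining.get(ch) === 0);
  const result = [];
  while (ready.length > 0) {
    ready.sort((a, b) => w(a) - w(b)); // intermediate sorting step: lightest first
    const next = ready.shift();
    result.push(next);
    for (const d of dependents.get(next) || []) {
      remaining.set(d, remaining.get(d) - 1);
      if (remaining.get(d) === 0) ready.push(d); // all components placed
    }
  }
  if (result.length !== remaining.size) throw new Error('cycle detected');
  return result;
}

// Illustrative data only: decompositions and weights are made up for the example.
const deps = { 口: [], 言: [], 五: [], 山: [], 出: ['山'], 語: ['言', '五', '口'] };
const weight = { 口: 1, 出: 2, 山: 3, 言: 4, 語: 5, 五: 6 };
console.log(weightedTopoSort(deps, weight).join('')); // 口山出言五語
```

In this example 出 is lighter than 山 but still comes after it, because a character only becomes eligible once all of its components have been placed.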

Used data

The initial unsorted list contains only kanji which are present in the KanjiVG project, so for each character there is data on its shape and stroke order.

Characters are split into components using the CJK Decompositions Data project, along with "fixes" to simplify the final lists and avoid characters which are not present in the initial list.

Statistical data on kanji usage frequencies was collected by processing raw textual data from various sources. See the kanji-frequency repository for details.

Which kanji are (not) included?

The kanji list covers about 95-99% of kanji found in various Japanese texts. Generally, the goal is to provide something similar to Jōyō kanji, but based on actual data. Radicals are also included, but only those which are parts of some kanji in the list.

Kanji/radical must NOT appear in this list if it is:

not included in KanjiVG character set
primarily used in names (people, places, etc.) or in some specific terms (religion, mythology, etc.)
mostly used because of its shape, e.g. a part of text emoticons/kaomoji like ( ^ω^)个
a part of a currently popular meme, manga/anime/dorama/movie title, #hashtag, etc., and otherwise not commonly used

Files and formats

`lists` directory

Files in lists directory are final lists.

*.txt files contain lists as plain text, one character per line; those files can be interpreted as CSV/TSV files with a single column
*.json files contain lists as JSON arrays

All files are encoded in UTF-8, without byte order mark (BOM), and have unix-style line endings, LF.
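As a small sketch of how the two formats relate, the helpers below turn the text of either kind of file into an array of characters. The function names are hypothetical, and the sample strings stand in for file contents, which in Node.js could be read with `fs.readFileSync(path, 'utf8')`.

```javascript
// Hypothetical helpers for the two list formats described above.

// *.txt: one character per line, LF line endings.
function parseTxtList(text) {
  return text.split('\n').filter((line) => line.length > 0);
}

// *.json: a JSON array of characters.
function parseJsonList(text) {
  const list = JSON.parse(text);
  if (!Array.isArray(list)) throw new TypeError('expected a JSON array');
  return list;
}

// Sample contents (the first four characters of the excerpt shown earlier).
const fromTxt = parseTxtList('人\n一\n丨\n口\n');
const fromJson = parseJsonList('["人","一","丨","口"]');
console.log(fromTxt.join('') === fromJson.join('')); // true
```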

`dependencies` directory

Files in dependencies directory are "flat" equivalents of CJK-decompositions (see below). "Dependency" here roughly means "a component of the visual decomposition" for kanji.

1-to-1.txt has a format compatible with tsort command line utility; first character in each line is "target" kanji, second character is target's dependency or 0
1-to-1.json contains a JSON array with the same data as in 1-to-1.txt
1-to-N.txt is similar, but lists all "dependencies" at once
1-to-N.json contains a JSON object with the same data as in 1-to-N.txt

All files are encoded in UTF-8, without byte order mark (BOM), and have unix-style line endings, LF.
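The relationship between the 1-to-1 and 1-to-N forms can be sketched as follows. This assumes that each line of 1-to-1.txt holds a target and a dependency separated by whitespace (the input shape tsort accepts); the function name and the sample lines are made up for illustration and are not copied from the repository.

```javascript
// Hypothetical converter: groups "target dependency" pairs into a
// target -> [dependencies] object. The marker 0 means "no dependency".
function pairsToMap(text) {
  const map = {};
  for (const line of text.split('\n')) {
    const [target, dep] = line.trim().split(/\s+/);
    if (!target || !dep) continue;          // skip blank or incomplete lines
    if (!(target in map)) map[target] = [];
    if (dep !== '0') map[target].push(dep); // 0 is a marker, not a character
  }
  return map;
}

// Sample input in the assumed format.
const sample = '一 0\n語 言\n語 五\n語 口\n';
console.log(pairsToMap(sample));
```

For the sample above, 一 maps to an empty array and 語 maps to its three components in the order they were listed.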

`data` directory

kanji.json - data for kanji included in final ordered lists, including radicals
kanjivg.txt - list of kanji from KanjiVG
cjk-decomp-{VERSION}.txt - data from CJK Decompositions Data, without any modifications
cjk-decomp-override.txt - data to override some CJK's decompositions
kanji-frequency/*.json - kanji frequency tables

All files are encoded in UTF-8, without byte order mark (BOM). All files, except for cjk-decomp-{VERSION}.txt, have unix-style line endings, LF.

`data/kanji.json`

Contains table with data for kanji, including radicals. Columns are:

Character itself
Stroke count
Frequency flag:
- true if it is a common kanji
- false if it is primarily used as a radical/component and unlikely to be seen within top 3000 in kanji usage frequency tables. In this case character is only listed because it's useful for decomposition, not as a standalone kanji

Restrictions:

No duplicates
Each character must be listed in kanjivg.txt
Each character must be listed on the left hand side in exactly one line in cjk-decomp-{VERSION}.txt
Each character may be listed on the left hand side in exactly one line in cjk-decomp-override.txt
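The first two restrictions lend themselves to an automated check. The sketch below assumes each row of the table is an array of the form `[character, strokeCount, isCommon]`, which is a guess based on the column list above rather than a documented layout; the function name and sample rows are hypothetical.

```javascript
// Hypothetical validator for the "no duplicates" and "listed in kanjivg.txt"
// restrictions. `rows` is assumed to be an array of [char, strokes, isCommon].
function checkKanjiTable(rows, kanjivgChars) {
  const known = new Set(kanjivgChars);
  const seen = new Set();
  const errors = [];
  for (const [ch] of rows) {
    if (seen.has(ch)) errors.push(`duplicate: ${ch}`);
    seen.add(ch);
    if (!known.has(ch)) errors.push(`not in kanjivg.txt: ${ch}`);
  }
  return errors;
}

// Illustrative rows only.
const rows = [['一', 1, true], ['丨', 1, false], ['一', 1, true]];
console.log(checkKanjiTable(rows, ['一']));
// [ 'not in kanjivg.txt: 丨', 'duplicate: 一' ]
```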

`data/kanjivg.txt`

A simple list of characters which are present in the KanjiVG project. Those are taken from the list of *.svg files in KanjiVG's GitHub repository.

`data/cjk-decomp-{VERSION}.txt`

Data file from the CJK Decompositions Data project, see the description of its format.

`data/cjk-decomp-override.txt`

Same format as cjk-decomp-{VERSION}.txt, except:

comments starting with # allowed
purpose of each record in this file is to override the one from cjk-decomp-{VERSION}.txt
type of decomposition is always fix, which just means "fix a record for the same character from original file"

Special character 0 is used to distinguish invalid decompositions (which lead to characters with no graphical representation) from those which just can't be decomposed further into something meaningful. For example, 一:fix(0) means that this kanji can't be further decomposed, since it's just a single stroke.

NOTE: Strictly speaking, records in this file are not always "visual decompositions" (but most of them are). Instead, it's just an attempt to provide meaningful recommendations of kanji learning order.
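A record such as 一:fix(0) could be parsed along these lines. This is a sketch under assumptions: it takes the record shape to be CHARACTER:TYPE(COMPONENT,COMPONENT,...) with comma-separated components, and the function name, the returned field names and the second sample record are invented for illustration.

```javascript
// Hypothetical parser for one line of cjk-decomp-override.txt.
// Returns null for blank lines and comment-only lines.
function parseOverrideLine(line) {
  const text = line.replace(/#.*$/, '').trim(); // comments start with #
  if (text === '') return null;
  const match = /^(.+?):([^()]+)\((.*)\)$/u.exec(text);
  if (!match) throw new SyntaxError(`unrecognized record: ${line}`);
  const [, character, type, inner] = match;
  const parts = inner === '' ? [] : inner.split(',');
  return {
    character,
    type,
    // 0 is a marker, not a component, so it is filtered out here
    components: parts.filter((p) => p !== '0'),
    // true when the record says the character cannot be decomposed further
    atomic: parts.length === 1 && parts[0] === '0',
  };
}

console.log(parseOverrideLine('一:fix(0)'));
console.log(parseOverrideLine('語:fix(言,五,口) # made-up record for illustration'));
```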

`data/kanji-frequency/*.json`

See kanji-frequency repository for details.

Usage

You must have Node.js and Git installed

git clone https://github.com/THIS/REPO.git
npm install
node build.js + commands and arguments described below

Command-line commands and arguments

show - only display sorted list without writing into files
- (optional) --per-line=NUM - explicitly tell how many characters per line to display. 50 by default. Applicable only to (no arguments)
- (optional) --freq-table=TABLE_NAME - use only one frequency table. Table names are file names from data/kanji-frequency directory, without .json extension, e.g. all ("combined" list), aozora, etc. When omitted, all frequency tables are used
coverage - show tables coverage, i.e. which fraction of characters from each frequency table is included into kanji list
suggest-add - suggest kanji to add in a list, based on coverage within kanji usage frequency tables
- (required) --num=NUM - how many kanji to suggest
- (optional) --mean-type=MEAN_TYPE - sort by the given mean type: arithmetic (most "extreme"), geometric, harmonic (default, most "conservative"). See Pythagorean means for details
suggest-remove - suggest kanji to remove from a list, reverse of suggest-add
- (required) --num=NUM - see above
- (optional) --mean-type=MEAN_TYPE - see above
save - update files with final lists
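The three mean types accepted by --mean-type are the classical Pythagorean means. The sketch below shows how they differ on the same input. It does not reproduce how build.js applies them: the function name and the sample values are assumptions, and only the mean definitions themselves are standard.

```javascript
// Pythagorean means of a list of positive numbers.
// The default mirrors the documented default of --mean-type (harmonic).
function mean(values, type = 'harmonic') {
  const n = values.length;
  if (n === 0) throw new RangeError('no values');
  switch (type) {
    case 'arithmetic':
      return values.reduce((sum, v) => sum + v, 0) / n;
    case 'geometric':
      // exp of the average logarithm, equal to the n-th root of the product
      return Math.exp(values.reduce((sum, v) => sum + Math.log(v), 0) / n);
    case 'harmonic':
      return n / values.reduce((sum, v) => sum + 1 / v, 0);
    default:
      throw new RangeError(`unknown mean type: ${type}`);
  }
}

// Hypothetical per-table scores for one character.
const scores = [2, 8];
console.log(mean(scores, 'arithmetic')); // 5
console.log(mean(scores, 'geometric'));  // approximately 4
console.log(mean(scores, 'harmonic'));   // 3.2
```

For positive inputs the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean. This matches the description of harmonic as the most "conservative" option: a low value in one table pulls the result down the most.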

License

This is a multi-license project. Choose any license from this list:

Apache-2.0 or any later version
CC-BY-4.0 or any later version
EPL-1.0 or any later version
LGPL-3.0 or any later version
MIT
