scriptin / kanji-frequency

Kanji usage frequency data collected from various sources

Kanji frequency

Kanji usage frequency statistics were collected by processing text from various sources. The resulting files are in the data directory:

File             Total # of kanji   Description                                        Date
aozora.json      ~51.5M             Fiction and non-fiction books from Aozora Bunko    May 2015
news.json        ~10.3M             Online news articles from various sources          June 2015
twitter.json     ~10.0M             Twitter messages collected by a bot                June 2015
wikipedia.json   ~784.6M            Japanese Wikipedia dump                            May 2015

See detailed descriptions below.

Format

Each file contains an array of arrays (rows). Each row contains three fields:

  1. (string) The kanji itself. "all" is a special case in the first row.
  2. (integer) How many times the kanji was found in the analyzed data set. For "all" it is the total number of kanji, including repetitions.
  3. (float) Fraction of the total kanji count that this character represents. For "all" it is 1 (i.e. 100%).
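
As an illustration of the row layout, here is a minimal Python sketch that parses a data set in this format and checks that the counts and fractions agree. The sample rows and their counts are made up for the example, and the helper names (`load_rows`, `check_consistency`) and the tolerance are assumptions, not part of the project.

```python
import json

# Hypothetical miniature data set in the documented layout; the counts are
# invented for illustration and do not come from the real files.
SAMPLE = '[["all", 10, 1], ["日", 6, 0.6], ["本", 4, 0.4]]'


def load_rows(text):
    """Split the parsed array into the "all" summary row and the kanji rows."""
    rows = json.loads(text)
    total_row, kanji_rows = rows[0], rows[1:]
    if total_row[0] != "all":
        raise ValueError('expected the first row to be the "all" summary')
    return total_row, kanji_rows


def check_consistency(total_row, kanji_rows, tolerance=1e-6):
    """Return True if counts sum to the total and each fraction is count/total.

    The tolerance is an assumption to allow for rounding in stored fractions.
    """
    total = total_row[1]
    if sum(count for _, count, _ in kanji_rows) != total:
        return False
    return all(
        abs(count / total - fraction) <= tolerance
        for _, count, fraction in kanji_rows
    )


total_row, kanji_rows = load_rows(SAMPLE)
print(total_row[1], check_consistency(total_row, kanji_rows))
```

To try this on a real file, replace `SAMPLE` with the contents of, for example, data/aozora.json read as UTF-8 text.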

Aozora

  • Sources: Aozora Bunko
  • Result: aozora.json
  • Total # of scanned texts: 12905
  • Total # of kanji collected: ~51.5M
  • Date collected: May 2015
  • Processing method: Pages were scanned as plain text, ignoring HTML structure, since they contain very little extra content.

Known issues

http://vtrm.net/japanese/kanji-frequency/en points out:

Some kanji radicals or elements which are usually not used on their own gathered relatively high rankings. One would expect such elements not to occur at all, or nearly so. For example, in Shpika’s list, 廴, a radical not used on its own, is stated to occur 1595 times and is ranked 2294th most common kanji. The explanation is simple: when a kanji outside the JIS X 0208 set appears in a text, the Aozora Bunko policy is to break it out into simpler parts. By instance, 𢌞 (it may not be displayed correctly if you don’t have a suitable font installed) is written ※[#「廴+囘」、第4水準2-12-11], where 廴+囘 is the kanji decomposition and 第4水準2-12-11 is the JIS X 0213 code point.
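
One possible mitigation, sketched below in Python, is to strip these annotations before counting. This is not how the published data was produced. The annotation pattern is inferred from the single example quoted above, and approximating "kanji" by the CJK Unified Ideographs block is an assumption made for the example.

```python
import re
from collections import Counter

# Assumption: annotations look like the quoted example, ※[#...], written
# with either ASCII or full-width brackets and number sign.
GAIJI_NOTE = re.compile(r"※[\[［][#＃][^\]］]*[\]］]")

# Assumption: "kanji" is approximated by the CJK Unified Ideographs block.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def count_kanji(text, strip_notes=True):
    """Count kanji in text, optionally removing gaiji annotations first."""
    if strip_notes:
        text = GAIJI_NOTE.sub("", text)
    return Counter(KANJI.findall(text))


sample = "道を※[#「廴+囘」、第4水準2-12-11]る"
print(count_kanji(sample))                     # annotation removed
print(count_kanji(sample, strip_notes=False))  # components are counted too
```

Stripping discards the annotated character altogether rather than counting the kanji it stands for. Mapping each annotation back to its intended character would need the JIS X 0213 code point, which this sketch does not attempt.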

News

  • Sources: asahi, mainichi, saga-s, yomiuri
  • Result: news.json
  • Total # of kanji collected: ~10.3M
  • Total # of scanned texts:
    • asahi - 19392
    • mainichi - 6449
    • saga-s - 61671
    • yomiuri - 1978
  • Date collected: June 2015
  • Processing method: Samples include articles published between June 2014 and June 2015, with more samples from 2015. Only article titles, subtitles, main text body and image captions were scanned. Everything else was ignored: menus, publication dates, comments, ads, links to related articles, etc. Weather forecasts and area-specific news were not included.

Twitter

  • Sources: Twitter via Streaming API
  • Result: twitter.json
  • Total # of scanned texts: unknown
  • Total # of kanji collected: ~10.0M
  • Date collected: June 2015
  • Processing method: Messages were collected over about one week from Twitter's Streaming API v1 using a bot. Only message text bodies were scanned; authors' names and other data were ignored. Specific details are in the bot's source code at commit e82cf7c (the exact version used to collect the data).

Known issues

The Twitter dataset contains many kanji used primarily in text emoticons (kaomoji):

  • ( ^ω^)个 (umbrella/flower?)
  • U^皿^U (grin/teeth, mustache?)
  • ( ’ω’)旦~~ (cup)
  • (╯°益°)╯ (rage face)
  • (oT-T)尸 (flag)
  • and probably many more

Also, the "笑" character is ranked #1 simply because it is used as a generic "smiley face". Technically it is not an emoticon, however: it is used for its meaning, whereas the examples above are used only for their shapes.
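
If these characters distort an analysis, one option is to drop them and recompute the fractions. The Python sketch below assumes the row format described under Format; the exclusion set contains only the five characters listed above, the sample counts are invented, and `drop_and_renormalize` is a hypothetical helper, not part of the project.

```python
# The five kanji from the emoticon examples above; extend as needed.
EMOTICON_KANJI = {"个", "皿", "旦", "益", "尸"}


def drop_and_renormalize(rows, excluded):
    """Remove excluded kanji and rebuild the "all" row and the fractions."""
    kept = [(kanji, count) for kanji, count, _ in rows[1:] if kanji not in excluded]
    total = sum(count for _, count in kept)
    result = [["all", total, 1]]
    for kanji, count in kept:
        # Guard against an empty result to avoid dividing by zero.
        fraction = count / total if total else 0.0
        result.append([kanji, count, fraction])
    return result


# Hypothetical rows with invented counts, for illustration only.
rows = [["all", 10, 1], ["笑", 5, 0.5], ["皿", 3, 0.3], ["日", 2, 0.2]]
filtered = drop_and_renormalize(rows, EMOTICON_KANJI)
print(filtered)
```

This also removes any genuine uses of the excluded characters, so it trades one kind of noise for another.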

Wikipedia

  • Sources: Japanese Wikipedia via Wikipedia dump (see jawiki bot)
  • Result: wikipedia.json
  • Total # of scanned texts: unknown
  • Total # of kanji collected: ~784.6M
  • Date collected: May 2015 (dump date is 2015-05-12)
  • Processing method: Dump included only current versions of pages and articles, without previous revisions or any other history of editing. Dump was scanned as plain text, ignoring XML and wiki markup structure.

Known issues

Because XML structure and wiki markup were ignored, this dataset is potentially noisy and needs further investigation. Proper parsing was not implemented because it is too difficult: it would require parsing both XML and wiki markup.
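
For the XML half of the problem, a streaming parser can at least separate page text from the dump's structural elements. The Python sketch below uses the standard library's `xml.etree.ElementTree.iterparse`. The inline sample, including its namespace URI and element layout, is a hypothetical stand-in for a real dump, and wiki markup is left untouched.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a Wikipedia XML dump.
SAMPLE_DUMP = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>日本</title>
    <revision>
      <text>'''日本'''は[[東アジア]]の国。</text>
    </revision>
  </page>
</mediawiki>"""


def iter_page_texts(stream):
    """Yield the content of each <text> element, ignoring the XML namespace."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Tags look like "{namespace}text"; keep only the local name.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield elem.text or ""
            elem.clear()  # release the element's content once it is consumed


texts = list(iter_page_texts(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
print(texts)
```

Titles, XML tags and attribute values no longer reach the counter this way, but template names, link targets and other wiki markup inside the text still do.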

Alternative datasets

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
