name2nat: a Python package for nationality prediction from a name
name2nat is a Python package that predicts the nationality of any name written in Roman letters.
For example, it returns the correct output Korean
for my name `Kyubyong Park'.
Needless to say, it is not possible to guess somebody's nationality 100% right from their name.
After all, nationality can change, you know.
However, it is also true that there is a tendency between names and nationality.
So it turns out statistical classifiers for this task works to some extent.
Details are explained below.
Disclaimer
I am aware that this topic may be viewed from a political perspective. That is absolutely AGAINST my motivation.
NaNa Dataset
Construction
I constructed a new dataset for this project because I failed to find any available dataset that is big and comprehensive enough.
- STEP 1. Downloaded and extracted the 20200601 English wiki dump (enwiki-20200601-pages-articles.xml).
- STEP 2. Iterated all pages and collected the title and the nationality. I regarded the title as a person if the Category section at the bottom of each page included ... births (green rectangle), and identified their nationality from the most frequent nationality word in the section (red rectangles).
Stats
Nationality | # Samples | Train | Dev | Test |
---|---|---|---|---|
Total | 1,112,902 | 890,248 | 111,286 | 111,368 |
Afghan | 973 | 778 | 97 | 98 |
Albanian | 2,742 | 2,193 | 274 | 275 |
Algerian | 1,991 | 1,592 | 199 | 200 |
American | 302,215 | 241,772 | 30,221 | 30,222 |
Andorran | 236 | 188 | 24 | 24 |
Angolan | 630 | 504 | 63 | 63 |
Argentine | 11,158 | 8,926 | 1,116 | 1,116 |
Armenian | 2,001 | 1,600 | 200 | 201 |
Aruban | 117 | 93 | 12 | 12 |
Australian | 50,670 | 40,536 | 5,067 | 5,067 |
Austrian | 11,490 | 9,192 | 1,149 | 1,149 |
Azerbaijani | 1,664 | 1,331 | 166 | 167 |
Bahamian | 292 | 233 | 29 | 30 |
Bahraini | 297 | 237 | 30 | 30 |
Bangladeshi | 2,045 | 1,636 | 204 | 205 |
Barbadian | 466 | 372 | 47 | 47 |
Basque | 1,202 | 961 | 120 | 121 |
Belarusian | 2,923 | 2,338 | 292 | 293 |
Belgian | 9,884 | 7,907 | 988 | 989 |
Belizean | 186 | 148 | 19 | 19 |
Beninese | 249 | 199 | 25 | 25 |
Bermudian | 338 | 270 | 34 | 34 |
Bhutanese | 180 | 144 | 18 | 18 |
Bolivian | 822 | 657 | 82 | 83 |
Bosniak | 102 | 81 | 10 | 11 |
Botswana | 315 | 252 | 31 | 32 |
Brazilian | 14,043 | 11,234 | 1,404 | 1,405 |
Breton | 148 | 118 | 15 | 15 |
British | 57,403 | 45,922 | 5,740 | 5,741 |
Bruneian | 144 | 115 | 14 | 15 |
Bulgarian | 4,908 | 3,926 | 491 | 491 |
Burkinabé | 362 | 289 | 36 | 37 |
Burmese | 1,180 | 944 | 118 | 118 |
Burundian | 175 | 140 | 17 | 18 |
Cambodian | 451 | 360 | 45 | 46 |
Cameroonian | 1,286 | 1,028 | 129 | 129 |
Canadian | 42,691 | 34,152 | 4,269 | 4,270 |
Catalan | 2,147 | 1,717 | 215 | 215 |
Chadian | 174 | 139 | 17 | 18 |
Chilean | 3,548 | 2,838 | 355 | 355 |
Chinese | 11,868 | 9,494 | 1,187 | 1,187 |
Colombian | 3,276 | 2,620 | 328 | 328 |
Comorian | 68 | 54 | 7 | 7 |
Congolese | 44 | 35 | 4 | 5 |
Cuban | 2,423 | 1,938 | 242 | 243 |
Cypriot | 1,271 | 1,016 | 127 | 128 |
Czech | 9,056 | 7,244 | 906 | 906 |
Dane | 41 | 32 | 4 | 5 |
Djiboutian | 68 | 54 | 7 | 7 |
Dominican | 1,976 | 1,580 | 198 | 198 |
Dutch | 18,645 | 14,916 | 1,864 | 1,865 |
Ecuadorian | 1,093 | 874 | 109 | 110 |
Egyptian | 3,471 | 2,776 | 347 | 348 |
Emirati | 777 | 621 | 78 | 78 |
English | 96,449 | 77,159 | 9,645 | 9,645 |
Equatoguinean | 242 | 193 | 24 | 25 |
Eritrean | 167 | 133 | 17 | 17 |
Estonian | 2,536 | 2,028 | 254 | 254 |
Ethiopian | 917 | 733 | 92 | 92 |
Faroese | 355 | 284 | 35 | 36 |
Filipino | 4,910 | 3,928 | 491 | 491 |
Finn | 85 | 68 | 8 | 9 |
French | 51,052 | 40,841 | 5,105 | 5,106 |
Gabonese | 226 | 180 | 23 | 23 |
Gambian | 276 | 220 | 28 | 28 |
Georgian | 328 | 262 | 33 | 33 |
German | 52,986 | 42,388 | 5,299 | 5,299 |
Ghanaian | 2,546 | 2,036 | 255 | 255 |
Gibraltarian | 123 | 98 | 12 | 13 |
Greek | 7,469 | 5,975 | 747 | 747 |
Grenadian | 174 | 139 | 17 | 18 |
Guatemalan | 704 | 563 | 70 | 71 |
Guinean | 731 | 584 | 73 | 74 |
Guyanese | 448 | 358 | 45 | 45 |
Haitian | 702 | 561 | 70 | 71 |
Honduran | 626 | 500 | 63 | 63 |
Hungarian | 9,026 | 7,220 | 903 | 903 |
I-Kiribati | 51 | 40 | 5 | 6 |
Indian | 28,365 | 22,692 | 2,836 | 2,837 |
Indonesian | 3,525 | 2,820 | 352 | 353 |
Iranian | 6,263 | 5,010 | 626 | 627 |
Iraqi | 1,566 | 1,252 | 157 | 157 |
Irish | 14,806 | 11,844 | 1,481 | 1,481 |
Israeli | 6,437 | 5,149 | 644 | 644 |
Italian | 36,671 | 29,336 | 3,667 | 3,668 |
Jamaican | 1,778 | 1,422 | 178 | 178 |
Japanese | 26,520 | 21,216 | 2,652 | 2,652 |
Jordanian | 613 | 490 | 61 | 62 |
Kazakh | 31 | 24 | 3 | 4 |
Kenyan | 2,012 | 1,609 | 201 | 202 |
Korean | 9,871 | 7,896 | 987 | 988 |
Kuwaiti | 496 | 396 | 50 | 50 |
Kyrgyz | 20 | 16 | 2 | 2 |
Lao | 33 | 26 | 3 | 4 |
Latvian | 2,117 | 1,693 | 212 | 212 |
Lebanese | 1,558 | 1,246 | 156 | 156 |
Liberian | 368 | 294 | 37 | 37 |
Libyan | 339 | 271 | 34 | 34 |
Lithuanian | 2,474 | 1,979 | 247 | 248 |
Macedonian | 1,374 | 1,099 | 137 | 138 |
Malagasy | 290 | 232 | 29 | 29 |
Malawian | 274 | 219 | 27 | 28 |
Malaysian | 3,228 | 2,582 | 323 | 323 |
Maldivian | 191 | 152 | 19 | 20 |
Malian | 482 | 385 | 48 | 49 |
Maltese | 829 | 663 | 83 | 83 |
Manx | 188 | 150 | 19 | 19 |
Marshallese | 40 | 32 | 4 | 4 |
Mauritanian | 120 | 96 | 12 | 12 |
Mauritian | 329 | 263 | 33 | 33 |
Mexican | 10,810 | 8,648 | 1,081 | 1,081 |
Moldovan | 1,250 | 1,000 | 125 | 125 |
Mongolian | 631 | 504 | 63 | 64 |
Montenegrin | 1,194 | 955 | 119 | 120 |
Moroccan | 1,822 | 1,457 | 182 | 183 |
Mozambican | 263 | 210 | 26 | 27 |
Namibian | 736 | 588 | 74 | 74 |
Nauruan | 40 | 32 | 4 | 4 |
Nepalese | 967 | 773 | 97 | 97 |
Nicaraguan | 357 | 285 | 36 | 36 |
Nigerian | 5,075 | 4,060 | 507 | 508 |
Nigerien | 179 | 143 | 18 | 18 |
Norwegian | 16,891 | 13,512 | 1,689 | 1,690 |
Omani | 247 | 197 | 25 | 25 |
Pakistani | 4,703 | 3,762 | 470 | 471 |
Palauan | 44 | 35 | 4 | 5 |
Palestinian | 660 | 528 | 66 | 66 |
Panamanian | 593 | 474 | 59 | 60 |
Paraguayan | 1,266 | 1,012 | 127 | 127 |
Peruvian | 1,902 | 1,521 | 190 | 191 |
Portuguese | 5,918 | 4,734 | 592 | 592 |
Qatari | 685 | 548 | 68 | 69 |
Romanian | 8,189 | 6,551 | 819 | 819 |
Russian | 26,593 | 21,274 | 2,659 | 2,660 |
Rwandan | 337 | 269 | 34 | 34 |
Salvadoran | 634 | 507 | 63 | 64 |
Sammarinese | 248 | 198 | 25 | 25 |
Samoan | 746 | 596 | 75 | 75 |
Saudi | 1,871 | 1,496 | 187 | 188 |
Senegalese | 1,029 | 823 | 103 | 103 |
Serb | 56 | 44 | 6 | 6 |
Singaporean | 1,646 | 1,316 | 165 | 165 |
Slovak | 3,584 | 2,867 | 358 | 359 |
Slovene | 111 | 88 | 11 | 12 |
Somali | 145 | 116 | 14 | 15 |
Sotho | 62 | 49 | 6 | 7 |
Sudanese | 436 | 348 | 44 | 44 |
Surinamese | 250 | 200 | 25 | 25 |
Swazi | 143 | 114 | 14 | 15 |
Syriac | 98 | 78 | 10 | 10 |
Syrian | 1,309 | 1,047 | 131 | 131 |
Taiwanese | 2,433 | 1,946 | 243 | 244 |
Tajik | 77 | 61 | 8 | 8 |
Tamil | 1,749 | 1,399 | 175 | 175 |
Tanzanian | 784 | 627 | 78 | 79 |
Thai | 3,434 | 2,747 | 343 | 344 |
Tibetan | 332 | 265 | 33 | 34 |
Togolese | 264 | 211 | 26 | 27 |
Tongan | 570 | 456 | 57 | 57 |
Tunisian | 1,340 | 1,072 | 134 | 134 |
Turk | 99 | 79 | 10 | 10 |
Tuvaluan | 83 | 66 | 8 | 9 |
Ugandan | 1,316 | 1,052 | 132 | 132 |
Ukrainian | 7,748 | 6,198 | 775 | 775 |
Uruguayan | 2,834 | 2,267 | 283 | 284 |
Uzbek | 78 | 62 | 8 | 8 |
Vanuatuan | 146 | 116 | 15 | 15 |
Venezuelan | 2,422 | 1,937 | 242 | 243 |
Vietnamese | 1,572 | 1,257 | 157 | 158 |
Vincentian | 10 | 8 | 1 | 1 |
Welsh | 6,588 | 5,270 | 659 | 659 |
Yemeni | 403 | 322 | 40 | 41 |
Zambian | 638 | 510 | 64 | 64 |
Downloadable Link
- You can download the dataset here.
name2nat
Installation
pip install name2nat
Usage
>>> from name2nat import Name2nat
>>> my_nanat = Name2nat()
>>> names = ["Donald Trump", # American
"Moon Jae-in", # Korean
"Shinzo Abe", # Japanese
"Xi Jinping", # Chinese
"Joko Widodo", # Indonesian
"Angela Merkel", # German
"Emmanuel Macron", # French
"Kyubyong Park", # Korean
"Yamamoto Yu", # Japanese
"Jing Xu"] # Chinese
>>> result = my_nanat(names, top_n=3)
>>> print(result)
# (name, [(nationality, prob), ...])
# Note that prob of 1.0 indicates the name exists
# in Wikipedia.
[
('Donald Trump', [('American', 1.0)])
('Moon Jae-in', [('Korean', 1.0)])
('Shinzo Abe', [('Japanese', 1.0)])
('Xi Jinping', [('Chinese', 1.0)])
('Joko Widodo', [('Indonesian', 1.0)])
('Angela Merkel', [('German', 1.0)])
('Emmanuel Macron', [('French', 1.0)])
('Kyubyong Park', [('Korean', 0.9985014200210571), ('American', 0.000289416522718966), ('Bhutanese', 0.00025851925602182746)])
('Yamamoto Yu', [('Japanese', 0.7050493359565735), ('Taiwanese', 0.12779785692691803), ('Chinese', 0.04263153299689293)])
('Jing Xu', [('Chinese', 0.8626819252967834), ('Taiwanese', 0.09901007264852524), ('American', 0.022995812818408012)])
]
Training
I use a powerful NLP library Flair to train a text classifier model. A bidirectional GRU layer is employed.
python train.py
Evaluation
python predict.py;
python eval.py --gt nana/test.tgt --pred test.pred
Results
K | Precision@K |
---|---|
1 | 61310/111368=55.1 |
2 | 77480/111368=69.6 |
3 | 86703/111368=77.9 |
4 | 92491/111368=83.0 |
5 | 96697/111368=86.8 |
Applications
Let's predict the nationalities of the first authors of the recent machine learning conferences.
- Check
conferences.py
andconferences/lrec2020.md
- Contributions (PRs) are welcome!
References
If you use this code for research, please cite:
@misc{park2018name2nat,
author = {Park, Kyubyong},
title = {name2nat: a Python package for nationality prediction from a name},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Kyubyong/name2nat}}
}