
bsolomon1124 / Demoji

Licence: apache-2.0
Accurately find/replace/remove emojis in text strings



demoji

Accurately find or remove emojis from a blob of text.



Basic Usage

demoji requires an initial data download from the Unicode Consortium's emoji code repository.

On first use of the package, call download_codes():

>>> import demoji
>>> demoji.download_codes()
Downloading emoji data ...
... OK (Got response in 0.14 seconds)
Writing emoji data to /Users/brad/.demoji/codes.json ...
... OK

This will store the Unicode hex-notated symbols at ~/.demoji/codes.json for future use.

demoji exports several text-related functions for find-and-replace functionality with emojis:

>>> tweet = """\
... #startspreadingthenews yankees win great start by 🎅🏾 going 5strong innings with 5k’s🔥 🐂
... solo homerun 🌋🌋 with 2 solo homeruns and👹 3run homerun… 🤡 🚣🏼 👨🏽‍⚖️ with rbi’s … 🔥🔥
... 🇲🇽 and 🇳🇮 to close the game🔥🔥!!!….
... WHAT A GAME!!..
... """
>>> demoji.findall(tweet)
{
    "🔥": "fire",
    "🌋": "volcano",
    "👨🏽\u200d⚖️": "man judge: medium skin tone",
    "🎅🏾": "Santa Claus: medium-dark skin tone",
    "🇲🇽": "flag: Mexico",
    "👹": "ogre",
    "🤡": "clown face",
    "🇳🇮": "flag: Nicaragua",
    "🚣🏼": "person rowing boat: medium-light skin tone",
    "🐂": "ox",
}

See below for function API.

demoji requires a download, rather than coming pre-packaged with Unicode emoji data, because the emoji list itself is frequently updated and changed. You can refresh the local cache periodically by calling demoji.download_codes() again.

To check when the codes were last downloaded, use the last_downloaded_timestamp() helper:

>>> demoji.last_downloaded_timestamp()
datetime.datetime(2019, 2, 9, 7, 42, 24, 433776, tzinfo=<demoji.UTC object at 0x101b9ecf8>)

The result will be None if codes have not previously been downloaded.
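The periodic refresh described above can be driven by that timestamp. The following is a minimal sketch, not part of demoji: the helper name needs_refresh and the 30-day default are assumptions chosen for illustration. It takes the value returned by last_downloaded_timestamp() (a timezone-aware datetime, or None) and reports whether a new download looks due.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_refresh(
    last_downloaded: Optional[datetime],
    max_age: timedelta = timedelta(days=30),
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the cache is missing or older than max_age.

    Hypothetical helper: neither this function nor the 30-day default
    comes from demoji itself.
    """
    if last_downloaded is None:
        # Codes have never been downloaded.
        return True
    if now is None:
        # Use an aware "now" so it can be compared with an aware timestamp.
        now = datetime.now(timezone.utc)
    return now - last_downloaded > max_age


# Intended use (requires demoji to be installed):
#   if needs_refresh(demoji.last_downloaded_timestamp()):
#       demoji.download_codes()
```

Passing now explicitly keeps the check deterministic in tests; in normal use it can be left out.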

Reference

Note: Text refers to typing.Text, an alias for str in Python 3 or unicode in Python 2.

download_codes() -> None

Download emoji data to ~/.demoji/codes.json. Required at first module usage, and can be used to periodically update data.

findall(string: Text) -> Dict[Text, Text]

Find emojis within string. Return a mapping of {emoji: description}.

findall_list(string: Text, desc: bool = True) -> List[Text]

Find emojis within string. Return a list (with possible duplicates).

If desc is True, the list contains description codes. If desc is False, the list contains emojis.

replace(string: Text, repl: Text = "") -> Text

Replace emojis in string with repl.

replace_with_desc(string: Text, sep: Text = ":") -> Text

Replace emojis in string with their description codes. The codes are surrounded by sep.
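To make the roles of repl and sep concrete, here is a toy re-implementation of the two replace functions. It is a sketch under stated assumptions, not demoji's code: TOY_CODES is a hand-written two-entry table standing in for the downloaded data, and toy_replace and toy_replace_with_desc are hypothetical names.

```python
import re

# Hand-written stand-in for the downloaded emoji data (assumption).
TOY_CODES = {
    "\U0001F525": "fire",  # U+1F525
    "\U0001F402": "ox",    # U+1F402
}
TOY_PATTERN = re.compile("|".join(re.escape(e) for e in TOY_CODES))


def toy_replace(string, repl=""):
    """Replace every known emoji with repl (removes them by default)."""
    # A callable replacement keeps repl literal (no backslash processing).
    return TOY_PATTERN.sub(lambda m: repl, string)


def toy_replace_with_desc(string, sep=":"):
    """Replace every known emoji with its description wrapped in sep."""
    return TOY_PATTERN.sub(lambda m: sep + TOY_CODES[m.group(0)] + sep, string)


print(toy_replace("close the game\U0001F525!"))
print(toy_replace_with_desc("close the game\U0001F525!"))
print(toy_replace_with_desc("close the game\U0001F525!", sep="|"))
```

With the real package, the equivalent calls would be demoji.replace(string, repl) and demoji.replace_with_desc(string, sep), using the signatures listed above.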

last_downloaded_timestamp() -> Optional[datetime.datetime]

Return the timestamp of the last download made by download_codes(), or None if codes have not been downloaded yet.

Footnote: Emoji Sequences

Numerous emojis that look like single Unicode characters are actually multi-character sequences. Examples:

  • The keycap 2️⃣ is actually 3 characters: U+0032 (the ASCII digit 2), U+FE0F (variation selector), and U+20E3 (combining enclosing keycap).
  • The flag of Scotland is made up of 7 component characters, b'\\U0001f3f4\\U000e0067\\U000e0062\\U000e0073\\U000e0063\\U000e0074\\U000e007f' in full escaped notation.

(You can see any of these through s.encode("unicode-escape").)
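The two examples above can be checked with the standard library alone. This short sketch (Python 3) spells both sequences out with escapes so that it does not depend on how the emoji render; code_points is a hypothetical helper name used only here.

```python
import unicodedata


def code_points(s):
    """Return the 'U+XXXX' label of each code point in s."""
    return ["U+{:04X}".format(ord(ch)) for ch in s]


# Keycap 2: digit, variation selector, combining enclosing keycap.
keycap_two = "2\ufe0f\u20e3"
# Flag of Scotland: a base flag followed by six tag characters.
scotland = "\U0001f3f4\U000e0067\U000e0062\U000e0073\U000e0063\U000e0074\U000e007f"

for ch in keycap_two:
    # Fall back to a placeholder if this Python build has no name for ch.
    print("U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))

print(code_points(keycap_two))
print(len(scotland))
print(scotland.encode("unicode-escape"))
```

What looks like one glyph on screen is a str of length 3 or 7 here, which is why a character-by-character search can split an emoji into its components.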

demoji is careful to handle this and should find the full sequences rather than their incomplete subcomponents.

It does this by sorting emoji codes by length and then compiling a single concatenated regular expression that tries longer emojis first, falling back to shorter ones if they do not match. This is by no means an optimized way of searching, as it has O(N^2) properties, but the focus is on accuracy and completeness.

>>> from pprint import pprint
>>> seq = """\
... I bet you didn't know that 🙋, 🙋‍♂️, and 🙋‍♀️ are three different emojis.
... """
>>> pprint(seq.encode('unicode-escape'))  # Python 3
(b"I bet you didn't know that \\U0001f64b, \\U0001f64b\\u200d\\u2642\\ufe0f,"
 b' and \\U0001f64b\\u200d\\u2640\\ufe0f are three different emojis.\\n')
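The longest-first matching described in this footnote can be sketched in a few lines. This illustrates the general technique and is not demoji's actual source: EMOJI is a three-entry toy table in place of the downloaded data, and build_pattern and find_emojis are hypothetical names. Python's regex alternation tries alternatives from left to right, so placing longer sequences first prevents a shorter emoji from matching the prefix of a longer one.

```python
import re

# Toy table (assumption); the real data comes from the downloaded emoji list.
EMOJI = {
    "\U0001F64B": "person raising hand",
    "\U0001F64B\u200d\u2642\ufe0f": "man raising hand",
    "\U0001F64B\u200d\u2640\ufe0f": "woman raising hand",
}


def build_pattern(codes):
    """Compile one alternation with the longest sequences first."""
    ordered = sorted(codes, key=len, reverse=True)
    return re.compile("|".join(re.escape(c) for c in ordered))


PATTERN = build_pattern(EMOJI)


def find_emojis(text):
    """Return {emoji: description} for every known emoji found in text."""
    return {match: EMOJI[match] for match in PATTERN.findall(text)}


seq = (
    "I bet you didn't know that \U0001F64B, \U0001F64B\u200d\u2642\ufe0f, "
    "and \U0001F64B\u200d\u2640\ufe0f are three different emojis.\n"
)

# Shortest-first ordering for comparison: the one-character emoji always wins.
naive = re.compile("|".join(re.escape(c) for c in sorted(EMOJI, key=len)))

print(len(find_emojis(seq)))     # distinct emojis with longest-first ordering
print(len(set(naive.findall(seq))))  # distinct emojis with shortest-first ordering
```

With the shortest-first pattern, all three matches collapse to the bare U+1F64B character and the gender variants are lost; the longest-first pattern keeps them apart.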