
Why Python String Cut Returns 11 Symbols When 12 Is Requested?

I use Python 2.7 on OS X 10.9 and would like to cut a unicode string (05. Чайка.mp3) to 12 symbols, so I use mp3file[:12]. But the resulting string, 05. Чайка.m, shows only 11 symbols.

Solution 1:

You have unicode text with a combining character:

u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m'

U+0306 is the COMBINING BREVE codepoint, ̆. It combines with the preceding и (CYRILLIC SMALL LETTER I) into a single visible character, so the slice contains 12 codepoints but displays as 11 symbols:

>>> print u'\u0438'
и
>>> print u'\u0438\u0306'
й

You can normalize that to the combined form, U+0439 CYRILLIC SMALL LETTER SHORT I instead:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0438\u0306')
u'\u0439'

This uses the unicodedata.normalize() function to produce a composed normal form.
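As a minimal sketch of how that applies to the question, the snippet below normalizes the file name before slicing it. The full file name is taken from the question; the variable names `composed`, `raw_cut` and `nfc_cut` are only illustrative.

```python
import unicodedata

# File name from the question, with a decomposed short i:
# U+0438 followed by U+0306 COMBINING BREVE.
mp3file = u'05. \u0427\u0430\u0438\u0306\u043a\u0430.mp3'

# NFC composes U+0438 U+0306 into the single codepoint U+0439.
composed = unicodedata.normalize('NFC', mp3file)

raw_cut = mp3file[:12]    # slices the decomposed string
nfc_cut = composed[:12]   # slices the composed string

print(repr(raw_cut))
print(repr(nfc_cut))
```

Both slices hold 12 codepoints, but `raw_cut` spends two of them on one visible character and ends at `.m`, while `nfc_cut` ends at `.mp`. This only helps when a precomposed codepoint exists for the sequence, as it does here.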

Solution 2:

A user-perceived character (grapheme cluster) such as й may be built from several Unicode codepoints, and each codepoint may in turn be encoded using several bytes, depending on the character encoding.

Therefore the number of characters that you see may be smaller than the length of the Unicode string or bytestring that encodes them. You can truncate inside a Unicode codepoint if you slice a bytestring, or inside a user-perceived character if you slice a Unicode string, even if it is in NFC normalization form. Neither is desirable.
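To make the three levels concrete, here is a small sketch that measures the decomposed й from the question as codepoints and as UTF-8 bytes, then slices the bytestring at an offset that falls inside a codepoint. UTF-8 is an assumption chosen for illustration, and the variable names are hypothetical.

```python
text = u'\u0438\u0306'        # one visible character, two codepoints
utf8 = text.encode('utf-8')   # each of these codepoints takes 2 bytes in UTF-8

codepoint_count = len(text)
byte_count = len(utf8)

# Cutting after 3 bytes leaves U+0306 half-encoded.
broken = utf8[:3]
try:
    broken.decode('utf-8')
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

print(codepoint_count, byte_count, decodes_cleanly)
```

One visible character is two codepoints and four bytes here, and the three-byte slice can no longer be decoded.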

To count characters properly, you could use the \X regex, which matches an eXtended grapheme cluster (a language-independent "visual character"):

import regex as re # $ pip install regex

characters = re.findall(u'\\X', u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m')
print(characters)
# -> [u'0', u'5', u'.', u' ', u'\u0427', u'\u0430',
#     u'\u0438\u0306', u'\u043a', u'\u0430', u'.', u'm']

Notice that even without normalization, u'\u0438\u0306' is matched as a single character, 'й'. Conversely, NFC normalization cannot always merge a grapheme cluster into one codepoint:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0646\u200D ')  # 3 Unicode codepoints
u'\u0646\u200d '  # still 3 codepoints, NFC hasn't combined them
>>> import regex as re
>>> re.findall(u'\\X', u'\u0646\u200D ') # same 3 codepoints
[u'\u0646\u200d', u' '] # 2 grapheme clusters
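For the original task of cutting a name to 12 visible symbols, one possible approach is to split the text into clusters first and slice the list of clusters rather than the string. The sketch below is a stdlib-only approximation and not the \X rule itself: it uses `unicodedata.combining()` to attach combining marks to the preceding codepoint, so it would not group the zero-width-joiner sequence shown above. The helper names `split_clusters` and `truncate_visible` are hypothetical.

```python
import unicodedata

def split_clusters(text):
    """Group each codepoint with the combining marks that follow it.

    Approximation only: it covers codepoints with a nonzero canonical
    combining class, not ZWJ sequences or other extended cluster rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch    # attach the mark to the previous cluster
        else:
            clusters.append(ch)   # start a new cluster
    return clusters

def truncate_visible(text, limit):
    """Keep at most `limit` clusters, never splitting one."""
    return u''.join(split_clusters(text)[:limit])

mp3file = u'05. \u0427\u0430\u0438\u0306\u043a\u0430.mp3'
cut = truncate_visible(mp3file, 12)
print(repr(cut))
```

The result keeps the decomposed й intact and contains 13 codepoints for 12 visible symbols. With the regex module installed, `regex.findall(u'\\X', text)` could replace `split_clusters` to get full extended grapheme cluster handling.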

See also: In Python, how do I most efficiently chunk a UTF-8 string for REST delivery?
