unicode - Calculating the length of a Japanese multibyte string with half-width kana in PHP


So I have a UTF-8 encoded string that can contain full-width kanji, full-width kana, half-width kana, romaji, numbers or kawaii Japanese symbols like ★ or ♥.

If I want the length, I use mb_strlen(), which counts each of these as 1 in length. That is fine for most purposes.

But I've been asked (by a Japanese client) to count half-width kana as 0.5 (for the purpose of the maxlength of a text field), because apparently that's how Japanese websites do it. I can do this using mb_strwidth(), which counts full-width as 2 and half-width as 1; then I just divide by 2.

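However, this method also counts romaji characters as 1, so something like chocｱｲｽ would count as 7. Then I'd divide by 2 to account for the kanji, and I'd get 3.5. What I actually want is 5.5 (4 for the romaji + 1.5 for the 3 half-width kana).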
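The numbers above can be reproduced with a minimal sketch like the following. The variable names and the second sample string (日本語) are only illustrative, and the encoding is passed explicitly on the assumption that the input is UTF-8.

```php
<?php
// Sample strings; the half-width kana are written as literal UTF-8.
$mixed = 'chocｱｲｽ';  // 4 ASCII letters + 3 half-width katakana
$kanji = '日本語';    // 3 full-width characters

// mb_strlen() counts characters, regardless of display width.
$mixedLen = mb_strlen($mixed, 'UTF-8');  // 7
$kanjiLen = mb_strlen($kanji, 'UTF-8');  // 3

// mb_strwidth() counts full-width as 2 and half-width as 1,
// so dividing by 2 gives 1 per full-width and 0.5 per half-width character.
$mixedHalf = mb_strwidth($mixed, 'UTF-8') / 2.0;  // 3.5, not the 5.5 I'm after
$kanjiHalf = mb_strwidth($kanji, 'UTF-8') / 2.0;  // 3.0
```

Dividing by 2.0 rather than 2 keeps the result a float even when the width is even.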

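// Edit: More info: any character (even non-kana) that has both a full-width and a half-width form should count as 1 for the full-width form and 0.5 for the half-width form. For example, the characters ￥、３＠（ should each be 1, but the characters ¥,3@( should each be 0.5.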

// Edit: Symbols like ☆ and ♥ should be 1, but the mb_strwidth/2 method returns them as 0.5.

Is there a standard way that Japanese systems count string length? Or does everyone just loop through their strings and count the characters that don't match the standard width rules?
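For reference, the loop approach could look roughly like the sketch below. weightedLength() is a hypothetical helper, and the set of "narrow" code point ranges is my own assumption rather than any established standard: printable ASCII, the half-width yen sign, and the half-width katakana block count as 0.5, while everything else (including ☆ and ♥) counts as 1. It relies on mb_str_split(), so it assumes PHP 7.4 or later.

```php
<?php
/**
 * Hypothetical helper: counts narrow characters as 0.5 and everything else as 1.
 * The "narrow" ranges below are an assumption, not an established standard.
 */
function weightedLength(string $text): float
{
    $total = 0.0;
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $code = mb_ord($char, 'UTF-8');
        $isNarrow = ($code >= 0x20 && $code <= 0x7E)   // printable ASCII
            || $code === 0xA5                          // half-width yen sign
            || ($code >= 0xFF61 && $code <= 0xFF9F);   // half-width katakana and punctuation
        $total += $isNarrow ? 0.5 : 1.0;
    }
    return $total;
}
```

Under these rules chocｱｲｽ comes out as 3.5 (following the rule in the first edit, not the 5.5 from earlier), and ☆♥ comes out as 2.0.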

One way would be to convert the half-width katakana to full-width and subtract half the difference in width from the original length:

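    $raw  = 'chocｱｲｽ';                   // 4 romaji + 3 half-width katakana
    $full = mb_convert_kana($raw, 'K');  // 'K': half-width katakana to full-width
    // Each converted kana widens by 1, so half the width difference is the discount.
    $len  = mb_strlen($raw) - (mb_strwidth($full) - mb_strwidth($raw)) / 2;
    assert($len === 5.5);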

However, are you sure you should be treating the basic Latin characters as full-width? There exist full-width varieties of the basic Latin characters too; that is, should ｃｈｏｃ be considered the same as choc?

Usually, characters like "a" and "ｱ" have a width of 1, while "Ａ" and "ア" have a width of 2 (which is what mb_strwidth does). I'd be cautious about having to hack around that.


Given your edit, mb_strwidth (or rather mb_strwidth/2) does what you want.
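As a quick check against the examples from the edit, here is a small sketch. widthUnits() is just a hypothetical wrapper name, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical wrapper: full-width counts as 1, half-width as 0.5.
function widthUnits(string $text): float
{
    // Divide by 2.0 so the result is always a float.
    return mb_strwidth($text, 'UTF-8') / 2.0;
}

$fullForms = widthUnits('￥、３＠（');  // five full-width characters
$halfForms = widthUnits('¥,3@(');      // the five half-width counterparts
```

Here $fullForms is 5.0 and $halfForms is 2.5, matching the rule in the edit. Symbols such as ☆ and ♥ are not covered by this check, so it is worth testing how mb_strwidth() treats them on the PHP version in use.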

