 

How to split text into Unicode words with a regular expression in PHP

Tags: regex, php, unicode

I have a website module that collects tweets from Twitter and splits them into words to store in a database. However, because the tweets usually contain Turkish characters [ıöüğşçİÖÜĞŞÇ], my module does not split the words correctly.

For example, the phrase Aynı labda çalıştığım is split into Ayn, labda and alıştığım, but it should be split into Aynı, labda and çalıştığım.

Here is the code that does the splitting:

preg_match_all('/(\A|\b)[A-Z\Ç\Ö\Ş\İ\Ğ\Ü]?[a-z\ç\ö\ş\ı\ğ\ü]+(\Z|\b)/u', $text,$a);

What do you think is wrong here?

Important note: splitting on the space character alone is not enough. I need to match letters only, and I don't want any digits or special characters such as [,.!@#$^&*123456780].

I need a regular expression that will split this: kısa isimleri ile "Vic" ve "Wick" vardı.

into this:

kısa
isimleri
ile
Vic
ve
Wick
vardı

More examples:

We're @test would be

We
re
test

Föö bär, we're @test to0 ÅÄÖ - 123 ok? kthxbai? is currently split into this:

b
r
we
re
test
ok
kthxbai

but I want it to be:

Föö
bär
we
re
test
ÅÄÖ
ok
kthxbai
Yunus Eren Güzel asked May 08 '26 23:05


2 Answers

I would take a look at mb_split().

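// Make the mb_* regex functions interpret the input as UTF-8.
mb_regex_encoding('UTF-8');
$str = 'We\'re @test Aynı labda çalıştığım';
var_dump(\mb_split('\s', $str)); // split on each whitespace character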

Gives me:

array
  0 => string 'We're' (length=5)
  1 => string '@test' (length=5)
  2 => string 'Aynı' (length=5)
  3 => string 'labda' (length=5)
  4 => string 'çalıştığım' (length=16)
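
Note that in the output above the apostrophe and the @ sign stay attached to their tokens. The following is a minimal sketch, not part of the original answer, of one way to get closer to the word list requested in the question: split on every run of characters that are neither letters nor digits, then drop any token that contains a digit. It swaps mb_split() for preg_split() with the /u modifier. The function name letters_only_words is hypothetical, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper, for illustration only.
function letters_only_words(string $text): array
{
    // Split on every run of characters that are neither letters (\pL) nor digits (\pN).
    $tokens = preg_split('/[^\pL\pN]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return []; // preg_split failed, e.g. the input was not valid UTF-8
    }
    // Drop tokens that contain any digit, such as "to0" or "123".
    $words = array_filter($tokens, function ($token) {
        return preg_match('/\pN/u', $token) !== 1;
    });
    return array_values($words); // reindex from 0
}

print_r(letters_only_words("Föö bär, we're @test to0 ÅÄÖ - 123 ok? kthxbai?"));
```

For the sample sentence from the question this should yield Föö, bär, we, re, test, ÅÄÖ, ok and kthxbai.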
Charles Sprayberry answered May 10 '26 14:05


This expression would give you the desired result (according to your examples):

/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u

\pL matches any Unicode letter and \pN matches any Unicode number. The lookarounds ensure that a match is neither preceded nor followed by a letter or a number, which completely excludes words containing any numbers (such as to0).

Example:

$str = "Aynı, labda - çalıştığım? \"quote\". Föö bär, we're @test to0 ÅÄÖ - 123 ok? kthxbai?";
preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $str, $m);
print_r($m);

Output:

Array
(
    [0] => Array
        (
            [0] => Aynı
            [1] => labda
            [2] => çalıştığım
            [3] => quote
            [4] => Föö
            [5] => bär
            [6] => we
            [7] => re
            [8] => test
            [9] => ÅÄÖ
            [10] => ok
            [11] => kthxbai
        )

)
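
The same pattern can be wrapped in a small reusable function. This is a minimal sketch under stated assumptions: the name split_unicode_words is hypothetical, the type declarations assume PHP 7 or later, and the input is assumed to be UTF-8 (with the /u modifier, preg_match_all() returns false on invalid UTF-8, which the sketch maps to an empty array).

```php
<?php
// Hypothetical helper name; the pattern is the one shown in this answer.
function split_unicode_words(string $text): array
{
    // \pL+ matches a run of Unicode letters; the lookarounds reject runs
    // that are directly preceded or followed by a letter or a number.
    $ok = preg_match_all('/(?<!\pL|\pN)\pL+(?!\pL|\pN)/u', $text, $matches);
    // preg_match_all() returns false on failure, e.g. invalid UTF-8 input.
    return $ok === false ? [] : $matches[0];
}

print_r(split_unicode_words('kısa isimleri ile "Vic" ve "Wick" vardı.'));
```

Applied to the first sample sentence from the question, this should return kısa, isimleri, ile, Vic, ve, Wick and vardı.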
Qtax answered May 10 '26 13:05


