Splitting string containing letters and numbers not separated by any particular delimiter in PHP

I am currently developing a web application that fetches a Twitter stream, and I am trying to build my own natural language processing on top of it.

Since my data comes from Twitter (limited to 140 characters), many words are shortened or, in this case, written without spaces.

For example:

"Hi, my name is Bob. I m 19yo and 170cm tall"

Should be tokenized to:

- hi
- my
- name
- bob
- i
- 19
- yo
- 170
- cm
- tall

Notice that 19 and yo in 19yo have no space between them. I use it mostly for extracting numbers with their units.

Simply put, what I need is a way to 'explode' each token that contains a number into chunks of digits and letters, even though there is no delimiter between them.

'123abc' will be ['123', 'abc']

'abc123' will be ['abc', '123']

'abc123xyz' will be ['abc', '123', 'xyz']

and so on.

What is the best way to achieve it in PHP?


I found something close to it, but it's in C# and specifically for day/month splitting: How do I split a string in C# based on letters and numbers

akhy asked Apr 16 '12 19:04


2 Answers

You can use preg_split

$string = "Hi, my name is Bob. I m 19yo and 170cm tall";
// Split on whitespace (with an optional leading comma) and on zero-width letter/digit boundaries.
$parts = preg_split('/(,?\s+)|((?<=[a-z])(?=\d))|((?<=\d)(?=[a-z]))/i', $string);
var_dump($parts);

To split at a digit-letter boundary, the regular expression match must be zero-width: the characters on either side must not be part of the match, otherwise preg_split would consume them as delimiters. Zero-width lookarounds are useful for exactly this.

http://codepad.org/i4Y6r6VS
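
To isolate the lookaround part, here is a minimal sketch that splits a single token only at letter/digit boundaries, which covers the '123abc', 'abc123' and 'abc123xyz' cases from the question. The function name splitAlnum is hypothetical, and the PREG_SPLIT_NO_EMPTY flag is an assumption added here to drop empty pieces; neither comes from the answer above.

```php
<?php
// Hypothetical helper (not from the original answer): split one token
// wherever a letter is followed by a digit or a digit by a letter.
function splitAlnum(string $token): array
{
    // Both alternatives are zero-width lookarounds, so no character is consumed.
    $pattern = '/(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])/i';
    return preg_split($pattern, $token, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitAlnum('123abc'));    // 123, abc
print_r(splitAlnum('abc123xyz')); // abc, 123, xyz
```

A token without any digit, such as 'tall', comes back as a one-element array, so the helper could be applied to every whitespace-separated token without special-casing.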

d_inevitable answered Sep 27 '22 21:09


How about this:

You extract the numbers from the string using regexps and store them in an array, then replace each number in the string with some kind of special placeholder that 'holds' its position. After parsing the string, which now consists only of your placeholders and normal characters, you feed the numbers from the array back into their reserved places.

Just an idea, but IMHO it might work for you.

EDIT: Try running this short code; hopefully you will see my point in the output. (This code doesn't work on codepad, I don't know why.)

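<?php
$str = "Hi, my name is Bob. I m 19yo and 170cm tall";
// Collect every run of digits, in order of appearance.
preg_match_all('/\d+/', $str, $matches);
// Replace each run with a placeholder that marks its position.
$str = preg_replace('/\d+/', '#SPEC#', $str);

print_r($matches[0]); // the extracted numbers: 19 and 170
print $str;           // Hi, my name is Bob. I m #SPEC#yo and #SPEC#cm tall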
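
The snippet above stops once the numbers are pulled out. One possible way to finish the idea, as a rough sketch only: put each stored number back in place of its placeholder, padded with spaces so it becomes a separate token. The name restoreNumbers and the space-padding step are assumptions made for illustration, not part of the original answer.

```php
<?php
// Sketch only: finishes the placeholder idea by putting the numbers back.
// restoreNumbers() is a hypothetical name; '#SPEC#' follows the snippet above.
function restoreNumbers(string $text, array $numbers, string $marker = '#SPEC#'): string
{
    $i = 0;
    // Each marker is replaced by the next stored number, padded with spaces
    // so that the number ends up as its own token. If the array runs out,
    // the marker itself is kept.
    return preg_replace_callback(
        '/' . preg_quote($marker, '/') . '/',
        function (array $m) use (&$i, $numbers): string {
            return ' ' . ($numbers[$i++] ?? $m[0]) . ' ';
        },
        $text
    );
}

$str = "I m 19yo and 170cm tall";
preg_match_all('/\d+/', $str, $matches);
$masked = preg_replace('/\d+/', '#SPEC#', $str);
$restored = restoreNumbers($masked, $matches[0]);
// Split on runs of whitespace to get the final tokens.
$tokens = preg_split('/\s+/', trim($restored));
print_r($tokens);
```

For this input the tokens come out as I, m, 19, yo, and, 170, cm, tall. Any other parsing would happen on the masked string, between the masking and restoring steps.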

xholicka answered Sep 27 '22 22:09