How to explode different section from a textfile into an array using php (and no regex)?

Tags:

php

This question is almost a duplicate of How to transform structured textfiles into PHP multidimensional array, but I have posted it again because I was unable to understand the regular-expression-based solutions given there. It seems better to solve this using plain PHP so that I can actually learn from it (regex is too hard for me to understand at this point).

Assume the following text file:

HD Alcoa Earnings Soar; Outlook Stays Upbeat 
BY By James R. Hagerty and Matthew Day 
PD 12 July 2011
LP 

Alcoa Inc.'s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts' forecasts.

However, profits were less than expected

TD
Licence this article via our website:

http://example.com

I read this text file with PHP and need a robust way to put the file contents into an array, like this:

array(
  [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat,
  [BY] => By James R. Hagerty and Matthew Day,
  [PD] => 12 July 2011,
  [LP] => Alcoa Inc.'s profit...than expected,
  [TD] => Licence this article via our website: http://example.com
)

The words HD BY PD LP TD are keys to identify a new section in the file. In the array, all newlines may be stripped from the values. Ideally I would be able to do this without regular expressions. I believe exploding on all keys could be one way of doing it, but it would be very dirty:

$fields = array('HD', 'BY', 'PD', 'LP', 'TD');
$parts = explode("\nHD ", $text); // explode() takes the delimiter first, then the string
$HD = $parts[0];

Does anybody have a cleaner idea for looping through the text, perhaps only once, and dividing it up into the array shown above?

asked Aug 21 '13 14:08 by Pr0no



1 Answer

This is another, even shorter approach without using regular expressions.

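/**
 * @param  array  array of stopwords, e.g. array('HD', 'BY', ...)
 * @param  string Text to search in
 * @param  string End Of Line symbol
 * @return array  [ stopword => string, ... ]
 */
function extract_parts(array $parts, $str, $eol=PHP_EOL) {
  // Start every section as an empty string so isset() doubles as the key lookup.
  $ret=array_fill_keys($parts, '');
  $current=null;
  foreach(explode($eol, $str) AS $line) {
    // A line opens a new section when its first two characters are a known key.
    $substr = substr($line, 0, 2);
    if (isset($ret[$substr])) {
      $current = $substr;
      $line = substr($line, 2);
    }
    $line = trim($line);
    // Append non-empty lines to the current section, separated by a single space.
    if ($current !== null && $line !== '') {
      $ret[$current] .= ($ret[$current] === '' ? '' : ' ') . $line;
    }
  }
  return $ret;
}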

$ret = extract_parts(array('HD', 'BY', 'PD', 'LP', 'TD'), $str);
var_dump($ret);
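
For readers who want to try the idea on the sample from the question, here is a minimal, self-contained sketch of the same line-by-line approach. The function name parse_sections, the $sample variable, and the choices to normalise line endings with str_replace and to join lines with a single space are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper: same line-by-line idea as extract_parts(), but it
// normalises line endings first so it does not depend on the platform's PHP_EOL.
function parse_sections(array $keys, string $text): array {
    $sections = array_fill_keys($keys, '');
    $current  = null;
    // Treat "\r\n" and "\r" as "\n" so files from any platform split the same way.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    foreach (explode("\n", $text) as $line) {
        $prefix = substr($line, 0, 2);
        if (isset($sections[$prefix])) {
            $current = $prefix;          // this line starts a new section
            $line    = substr($line, 2); // drop the two-letter key
        }
        $line = trim($line);
        if ($current === null || $line === '') {
            continue; // skip text before the first key and blank lines
        }
        // Join continuation lines with a single space instead of a newline.
        $sections[$current] .= ($sections[$current] === '' ? '' : ' ') . $line;
    }
    return $sections;
}

$sample = "HD Alcoa Earnings Soar; Outlook Stays Upbeat \n"
        . "BY By James R. Hagerty and Matthew Day \n"
        . "PD 12 July 2011\n"
        . "LP \n\n"
        . "Alcoa Inc.'s profit more than doubled in the second quarter.\n"
        . "The giant aluminum producer managed to meet analysts' forecasts.\n\n"
        . "However, profits were less than expected\n\n"
        . "TD\n"
        . "Licence this article via our website:\n\n"
        . "http://example.com\n";

$result = parse_sections(['HD', 'BY', 'PD', 'LP', 'TD'], $sample);
print_r($result);
```

One caveat of any two-character prefix check: a body line that happens to begin with one of the keys (for example a sentence starting with "HDTV") would be treated as the start of a new section.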

Why not use regular expressions?

The PHP documentation, particularly for the preg_* functions, recommends not using regular expressions unless they are really needed, so I wondered which of the answers to this question performs best.

The result surprised me:

Answer 1 by: hek2mgl     2.698 seconds (regexp)
Answer 2 by: Emo Mosley  2.38  seconds
Answer 3 by: anubhava    3.131 seconds (regexp)
Answer 4 by: jgb         1.448 seconds

I would have expected that the regexp variants would be the fastest.

So avoiding regular expressions is not necessarily a bad thing. In other words, regular expressions are not always the best solution; you have to decide what works best case by case.

You may repeat the measurement with this script.
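
A comparison of this kind can be approximated with a small timing loop. The following is a minimal sketch and not the script used for the numbers above: the helper name time_callable, the iteration count, and the two toy candidates are assumptions for illustration. It relies on hrtime(), which requires PHP 7.3 or later, and absolute numbers will differ by machine and PHP version.

```php
<?php
// Hypothetical helper: call $fn($arg) $iterations times and return elapsed seconds.
function time_callable(callable $fn, string $arg, int $iterations = 10000): float {
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $fn($arg);
    }
    return (hrtime(true) - $start) / 1e9;
}

// Two toy candidates that split a string into lines: one plain, one regex-based.
$candidates = [
    'explode'    => function (string $s): array { return explode("\n", $s); },
    'preg_split' => function (string $s): array { return preg_split('/\n/', $s); },
];

$input = str_repeat("HD some headline\nLP some paragraph\n", 50);

$timings = [];
foreach ($candidates as $name => $fn) {
    $timings[$name] = time_callable($fn, $input);
    printf("%-10s %.4f seconds\n", $name, $timings[$name]);
}
```

To compare the answers themselves, the toy closures would be replaced by the functions under test and $input by the sample text.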


Edit

Here is a short, more optimized example using a regexp pattern. It is still not as fast as my example above, but it is faster than the other regexp-based examples.

The output format could still be tidied up (whitespace / line breaks).

function extract_parts_regexp($str) {
  $a=array();
  preg_match_all('/(?<k>[A-Z]{2})(?<v>.*?)(?=\n[A-Z]{2}|$)/Ds', $str, $a);
  return array_combine($a['k'], $a['v']);
}
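
One possible way to tidy such values afterwards, without further regular expressions, is to collapse line breaks and surrounding whitespace per value. This is a sketch under assumptions: the helper name tidy_value is hypothetical, and the $raw array only imitates the shape of values a regexp match might return for the sample text.

```php
<?php
// Hypothetical helper: collapse a multi-line value into one line without regex.
function tidy_value(string $value): string {
    // Split into lines and trim each one (this also removes stray "\r").
    $lines = array_map('trim', explode("\n", $value));
    // Drop lines that are empty after trimming.
    $lines = array_filter($lines, function (string $line): bool {
        return $line !== '';
    });
    return implode(' ', $lines);
}

// Illustrative input only: values with leading spaces and embedded line breaks.
$raw = [
    'LP' => " \n\nAlcoa Inc.'s profit more than doubled.\nHowever, profits were less than expected\n",
    'TD' => "\nLicence this article via our website:\n\nhttp://example.com",
];

// array_map() with a single array keeps the string keys.
$clean = array_map('tidy_value', $raw);
print_r($clean);
```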
answered Nov 04 '22 13:11 by jgb