
How can you parse Excel CSV data that contains linebreaks in the data?

I'm attempting to parse a set of CSV data using PHP, but I'm having a major issue. One of the fields is a long description field, which itself contains linebreaks within the enclosures.

My primary issue is writing a piece of code that can split the data line by line, but also recognize when linebreaks within the data should not be used. The linebreaks within this field are not properly escaped, making them hard to distinguish from legitimate linebreaks.

I've tried to come up with a regular expression that can properly handle it, but have had no luck so far. Any ideas?

CSV format:

"####","text data here", "text data \n with linebreaks \n here"\n
"####","more text data", "more data \n with \n linebreaks \n here"\n

omgitsfletch asked Jul 19 '10 03:07




2 Answers

According to aleske, a commenter in the documentation for PHP's fgetcsv function:

The PHP's CSV handling stuff is non-standard and contradicts with RFC4180, thus fgetcsv() cannot properly deal with files [that contain line breaks] ...

And he offered up the following function to get around this limitation:

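function csvstring_to_array(&$string, $CSV_SEPARATOR = ';', $CSV_ENCLOSURE = '"', $CSV_LINEBREAK = "\n") {
  // Parses the first record of $string; linebreaks inside enclosures stay in the field.
  $o = array('');   // start with one empty field so .= never hits an undefined index

  $cnt = strlen($string);
  $esc = false;     // true while inside an enclosure
  $escesc = false;  // true right after a closing enclosure (possible doubled enclosure)
  $num = 0;         // index of the current field
  $i = 0;
  while ($i < $cnt) {
    $s = $string[$i];

    if ($s == $CSV_LINEBREAK) {
      if ($esc) {
        $o[$num] .= $s;
      } else {
        // an unenclosed linebreak ends the record
        $i++;
        break;
      }
    } elseif ($s == $CSV_SEPARATOR) {
      if ($esc) {
        $o[$num] .= $s;
      } else {
        $num++;
        $o[$num] = '';  // initialize the next field, so empty fields are kept
        $esc = false;
        $escesc = false;
      }
    } elseif ($s == $CSV_ENCLOSURE) {
      if ($escesc) {
        // doubled enclosure: emit one literal enclosure character
        $o[$num] .= $CSV_ENCLOSURE;
        $escesc = false;
      }

      if ($esc) {
        $esc = false;
        $escesc = true;
      } else {
        $esc = true;
        $escesc = false;
      }
    } else {
      if ($escesc) {
        $o[$num] .= $CSV_ENCLOSURE;
        $escesc = false;
      }

      $o[$num] .= $s;
    }

    $i++;
  }

  // Uncommenting the next line would strip the parsed record from $string
  // (it is passed by reference), so repeated calls walk through the records.
  //  $string = substr($string, $i);

  return $o;
}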

That looks like it will do the trick.
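
Before reaching for a hand-written parser, it may be worth checking how the built-in fgetcsv() behaves on your own PHP version, since it reads from a stream and can therefore continue a quoted field across physical lines. The following is a minimal sketch of such a check, not part of the original answer. The sample data is hypothetical (modeled on the question's format, without the space after the separator), and the string is fed through a php://memory stream only so that the snippet is self-contained.

```php
<?php
// Hypothetical sample: two records, each ending in a quoted multi-line field.
$csv = "\"1001\",\"text data here\",\"text data\nwith linebreaks\nhere\"\n"
     . "\"1002\",\"more text data\",\"more data\nwith\nlinebreaks\nhere\"\n";

// Write the string to an in-memory stream so fgetcsv() can read it record by record.
$fh = fopen('php://memory', 'r+');
fwrite($fh, $csv);
rewind($fh);

$rows = array();
// Length 0 means "no line length limit"; delimiter, enclosure and escape are passed explicitly.
while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($fh);

// If quoted linebreaks are handled, each record is one row of three fields.
echo count($rows), " rows\n";
echo $rows[0][2], "\n";
```

If the row count matches the number of records rather than the number of physical lines, the built-in parser is probably enough for that file, and a custom function is only needed for input it cannot read.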

Stephen answered Sep 19 '22 13:09


I found that you can use a normal CSV parser after you convert the CSV to Unix line endings.

Here is a function that did the trick for me.

function dos2unix($s) {
    $s = str_replace("\r\n", "\n", $s);
    $s = str_replace("\r", "\n", $s);
    $s = preg_replace("/\n{2,}/", "\n\n", $s);
    return $s;
}
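
To see what this normalization actually does, here is a small sketch with the helper repeated so the snippet runs on its own. The input string is a made-up example mixing Windows (\r\n), old Mac (\r) and Unix (\n) line endings.

```php
<?php
// Copy of the dos2unix() helper above, repeated so this snippet is self-contained.
function dos2unix($s) {
    $s = str_replace("\r\n", "\n", $s);        // Windows endings -> "\n"
    $s = str_replace("\r", "\n", $s);          // remaining bare "\r" -> "\n"
    $s = preg_replace("/\n{2,}/", "\n\n", $s); // runs of 2+ newlines -> exactly 2
    return $s;
}

// Hypothetical input with mixed line endings and a run of four newlines.
$mixed = "a\r\nb\rc\n\n\n\nd";
$normalized = dos2unix($mixed);

// json_encode() makes the control characters visible.
echo json_encode($normalized), "\n"; // prints "a\nb\nc\n\nd"
```

Note that the last replacement also applies inside quoted fields, so a description containing three or more consecutive linebreaks comes out with only two. If that matters for your data, dropping the preg_replace() line is one option.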

And a parsing function:

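function csvstring_to_array($string, $separatorChar = ',', $enclosureChar = '"', $newlineChar = "\n") {
    // @author: Klemen Nagode
    // The default newline is "\n" (not PHP_EOL) because dos2unix() below normalizes
    // all line endings to "\n" and the loop compares one character at a time.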
    $string = dos2unix($string);
    $array = array();
    $size = strlen($string);
    $columnIndex = 0;
    $rowIndex = 0;
    $fieldValue="";
    $isEnclosured = false;
    for($i=0; $i<$size;$i++) {

        $char = $string[$i]; // square brackets: the {} offset syntax was removed in PHP 8
        $addChar = "";

        if($isEnclosured) {
            if($char==$enclosureChar) {

                if($i+1<$size && $string[$i+1]==$enclosureChar){
                    // escaped char
                    $addChar=$char;
                    $i++; // dont check next char
                }else{
                    $isEnclosured = false;
                }
            }else {
                $addChar=$char;
            }
        }else {
            if($char==$enclosureChar) {
                $isEnclosured = true;
            }else {

                if($char==$separatorChar) {

                    $array[$rowIndex][$columnIndex] = $fieldValue;
                    $fieldValue="";

                    $columnIndex++;
                }elseif($char==$newlineChar) {
                    $array[$rowIndex][$columnIndex] = $fieldValue;
                    $fieldValue="";
                    $columnIndex=0;
                    $rowIndex++;
                }else {
                    $addChar=$char;
                }
            }
        }
        if($addChar!=""){
            $fieldValue.=$addChar;

        }
    }

    if($fieldValue !== "") { // save last field (strict check, so a final "0" is not dropped)
        $array[$rowIndex][$columnIndex] = $fieldValue;
    }
    return $array;
}

Adam F answered Sep 21 '22 13:09