PHP: Searching through a CSV file the OOP way

Tags:

object

oop

loops

php

csv

I need to write a script that searches through a CSV file and performs certain search functions on it:

1. find duplicate entries in a column
2. find matches to a list of banned entries in another column
3. find entries through regular expression matching on a specified column

Now, I have no problem at all coding this procedurally, but as I am now moving on to object-oriented programming, I would like to use classes and instances of objects instead.

However, thinking in OOP doesn't come naturally to me yet, so I'm not entirely sure which way to go. I'm not looking for specific code, but rather suggestions on how I could design the script.

My current thinking is this:

1. Create a file class. This will handle import/export of data.
2. Create a search class, a child class of file. This will contain the various search methods.

How it would function in index.php:

1. get an array from the CSV in the file object in index.php
2. create a loop to iterate through the values of the array
3. call the methods in the loop from a search object and echo them out

The problems I see with this approach are these:

- I will want to point at different elements in my array to look at particular "columns". I could just put my loop in a function and pass the column as a parameter, but this kind of defeats the point of OOP, I feel.
- My search methods will work in different ways. Searching for duplicate entries is fairly straightforward with nested loops, but I do not need a nested loop to do a simple word or regular expression search.

Should I instead go like this?

1. Create a file class. This will handle import/export of data.
2. Create a loop class, a child class of file. This will contain methods that deal with iterating through the array.
3. Create a search class, a child class of loop. This will contain the various search methods.

My main issue with this is that it appears I may need multiple search objects and would have to iterate through them within my loop class.

Any help would be much appreciated. I'm very new to OOP, and while I understand the individual parts, I'm not yet able to see the bigger picture. I may be overcomplicating what it is I'm trying to do, or there may be a much simpler way that I can't see yet.


asked Nov 06 '12 10:11

Martyn Shutt

2 Answers

PHP already offers a way to read a CSV file in an OO manner with SplFileObject:

$file = new SplFileObject("data.csv");

// tell object that it is reading a CSV file
$file->setFlags(SplFileObject::READ_CSV);
$file->setCsvControl(',', '"', '\\');

// iterate over the data
foreach ($file as $row) {
    list ($fruit, $quantity) = $row;
    // Do something with values
}
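One practical detail when reading a real file this way: with READ_CSV alone, a trailing newline at the end of the file is usually yielded as an extra, empty row. The following is a minimal sketch of one common way around that, combining additional SplFileObject flags. The temporary file and its contents are made up so the example can run on its own.

```php
<?php
// Write a small throwaway CSV file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "apple,3\npear,5\n");

$file = new SplFileObject($path);

// READ_AHEAD, SKIP_EMPTY and DROP_NEW_LINE are added to READ_CSV so that
// the blank line after the last record is not returned as a row.
$file->setFlags(
    SplFileObject::READ_CSV
    | SplFileObject::READ_AHEAD
    | SplFileObject::SKIP_EMPTY
    | SplFileObject::DROP_NEW_LINE
);
$file->setCsvControl(',', '"', '\\');

// Collect the parsed rows; each row is an array of column values.
$rows = array();
foreach ($file as $row) {
    $rows[] = $row;
}

$file = null;   // release the file handle before deleting the file
unlink($path);

print_r($rows);
```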

Since SplFileObject streams over the CSV data, memory consumption is quite low and you can handle large CSV files efficiently, but since it is file I/O, it is not the fastest. However, SplFileObject implements the Iterator interface, so you can wrap that $file instance in other iterators to modify the iteration. For instance, to limit file I/O, you could wrap it in a CachingIterator:

$cachedFile = new CachingIterator($file, CachingIterator::FULL_CACHE);

To fill the cache, iterate over $cachedFile once:

foreach ($cachedFile as $row) {
    // nothing to do here; iterating is what fills the cache
}

To iterate over the cached rows afterwards, use getCache():

foreach ($cachedFile->getCache() as $row) {
    // work with the cached rows
}

The tradeoff is increased memory obviously.
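As a rough sketch of the fill-then-reuse pattern described above, the following runs both passes end to end. To stay self-contained it wraps an in-memory ArrayIterator with invented rows instead of an SplFileObject; the variable names are illustrative only.

```php
<?php
// Hypothetical sample rows standing in for the parsed CSV data.
$rows = new ArrayIterator(array(
    array('apple', '3'),
    array('pear', '5'),
));

// FULL_CACHE tells the iterator to remember every element it has seen.
$cached = new CachingIterator($rows, CachingIterator::FULL_CACHE);

// First pass: iterating is what fills the cache.
foreach ($cached as $row) {
    // nothing to do here
}

// Later passes read from memory instead of the underlying source.
$cache = $cached->getCache();
foreach ($cache as $index => $row) {
    echo $index, ': ', implode(', ', $row), PHP_EOL;
}
```

Keep in mind that rewinding the CachingIterator (for example with a second foreach over $cached itself) starts a fresh iteration of the inner iterator, so reuse should go through getCache().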

Now, to do your queries, you could wrap that CachingIterator or the SplFileObject in a FilterIterator, which limits the output when iterating over the CSV data:

class BannedEntriesFilter extends FilterIterator
{
    private $bannedEntries = array();

    public function setBannedEntries(array $bannedEntries)
    {
        $this->bannedEntries = $bannedEntries;
    }

    public function accept(): bool
    {
        // reject the row as soon as any column holds a banned entry
        foreach ($this->current() as $key => $val) {
            if ($this->isBannedEntryInColumn($val, $key)) {
                return false;
            }
        }

        return true;
    }

    public function isBannedEntryInColumn($entry, $column)
    {
        return isset($this->bannedEntries[$column])
            && in_array($entry, $this->bannedEntries[$column]);
    }
}

A FilterIterator omits all entries from the inner Iterator that do not satisfy the test in the FilterIterator's accept method. Above, we check each value of the current row from the CSV file against an array of banned entries for its column, and if any of them matches, the row is not included in the iteration. You use it like this:

$filteredCachedFile = new BannedEntriesFilter(
    new ArrayIterator($cachedFile->getCache())
);

Since the cached results are always an Array, we need to wrap that Array into an ArrayIterator before we can wrap it into our FilterIterator. Note that to use the cache, you also need to iterate the CachingIterator at least once. We just assume you already did that above. The next step is to configure the banned entries

$filteredCachedFile->setBannedEntries(
    array(
        // banned entries for column 0
        array('foo', 'bar'),
        // banned entries for column 1
        array( …
    )
);

I guess that's rather straightforward. You have a multidimensional array with one entry for each column in the CSV data holding the banned entries. You then simply iterate over the instance and it will give you only the rows not having banned entries

foreach ($filteredCachedFile as $row) {
    // do something with filtered rows
}

or, if you just want to get the results into an array:

$results = iterator_to_array($filteredCachedFile);

You can stack multiple FilterIterators to further limit the results. If you don't feel like writing a class for each filter, have a look at the CallbackFilterIterator (available since PHP 5.4), which allows passing the accept logic at runtime:

$filteredCachedFile = new CallbackFilterIterator(
    new ArrayIterator($cachedFile->getCache()),
    function(array $row) {
        static $bannedEntries = array(
            array('foo', 'bar'),
            …
        );
        foreach ($row as $key => $val) {
            // logic from above returning boolean if match is found
        }
    }
);
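Because the snippet above leaves the banned list and the matching logic elided, here is one possible way to complete the idea as a self-contained sketch. The sample rows, the banned values and the $bannedEntries variable are made up for illustration, and the closure captures the list with `use` rather than a static variable.

```php
<?php
// Hypothetical rows; in the answer these come from $cachedFile->getCache().
$rows = array(
    array('apple', 'alice'),
    array('foo',   'bob'),
    array('pear',  'mallory'),
);

// Banned values per zero-based column index (illustrative data).
$bannedEntries = array(
    0 => array('foo', 'bar'),
    1 => array('mallory'),
);

$filtered = new CallbackFilterIterator(
    new ArrayIterator($rows),
    function (array $row) use ($bannedEntries) {
        foreach ($row as $column => $value) {
            if (isset($bannedEntries[$column])
                && in_array($value, $bannedEntries[$column], true)) {
                return false; // reject rows containing a banned value
            }
        }
        return true; // keep rows with no banned values
    }
);

// Pass false so the original row keys are dropped and the result is a plain list.
$results = iterator_to_array($filtered, false);
print_r($results);
```

Returning true from the callback keeps a row, so inverting the two return values would list the offending rows instead of filtering them out.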


answered Nov 13 '22 08:11

Gordon

I'm going to illustrate a reasonable approach to designing OOP code that serves your stated needs. While I firmly believe that the ideas presented below are sound, please be aware that:

the design can be improved -- the aim here is to show the approach, not the final product
the implementation is only meant as an example -- if it (barely) works, it's good enough

How to go about doing this

A highly engineered solution would start by trying to define the interface to the data. That is, think about what would be a representation of the data that allows you to perform all your query operations. Here's one that would work:

A dataset is a finite collection of rows. Each row can be accessed given its zero-based index.
A row is a finite collection of values. Each value is a string and can be accessed given its zero-based index (i.e. column index). All rows in a dataset have exactly the same number of values.

This definition is enough to implement all three types of queries you mention by looping over the rows and performing some type of test on the values of a particular column.

The next move is to define an interface that describes the above in code. A not particularly nice but still adequate approach would be:

interface IDataSet {
    public function getRowCount();
    public function getValueAt($row, $column);
}

Now that this part is done, you can go and define a concrete class that implements this interface and can be used in your situation:

class InMemoryDataSet implements IDataSet {
    private $_data = array();

    public function __construct(array $data) {
        $this->_data = $data;
    }

    public function getRowCount() {
        return count($this->_data);
    }

    public function getValueAt($row, $column) {
        if ($row >= $this->getRowCount()) {
            throw new OutOfRangeException();
        }

        return isset($this->_data[$row][$column])
            ? $this->_data[$row][$column]
            : null;
    }
}

The next step is to go and write some code that converts your input data to some kind of IDataSet:

function CSVToDataSet($file) {
    return new InMemoryDataSet(array_map('str_getcsv', file($file)));
}

Now you can trivially create an IDataSet from a CSV file, and you know that you can perform your queries on it because IDataSet was explicitly designed for that purpose. You're almost there.

The only thing missing is creating a reusable class that can perform your queries on an IDataSet. Here is one of them:

class DataQuery {
    private $_dataSet;

    public function __construct(IDataSet $dataSet) {
        $this->_dataSet = $dataSet;
    }

    public function getRowsWithDuplicates($columnIndex) {
        $values = array();
        for ($i = 0; $i < $this->_dataSet->getRowCount(); ++$i) {
            // group row indexes by the value found in the column
            $values[$this->_dataSet->getValueAt($i, $columnIndex)][] = $i;
        }
        }

        return array_filter($values, function($row) { return count($row) > 1; });
    }
}

This code will return an array where the keys are values in your CSV data and the values are arrays with the zero-based indexes of the rows where each value appears. Since only duplicate values are returned, each array will have at least two elements.
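To make the shape of that return value concrete, this standalone sketch applies the same grouping and filtering steps to a small hard-coded column. The sample values are invented, and the loop works on a plain array rather than an IDataSet.

```php
<?php
// Invented values of one column, indexed by row number.
$column = array('alice', 'bob', 'alice', 'carol', 'bob', 'alice');

// Group row indexes by value, as getRowsWithDuplicates() does.
$values = array();
foreach ($column as $rowIndex => $value) {
    $values[$value][] = $rowIndex;
}

// Keep only the values seen on more than one row.
$dupes = array_filter($values, function ($rowIndexes) {
    return count($rowIndexes) > 1;
});

// 'alice' maps to rows 0, 2 and 5; 'bob' maps to rows 1 and 4;
// 'carol' appears once and is filtered out.
print_r($dupes);
```

One caveat: PHP converts numeric-string array keys such as "42" to integers, so the keys of the result are not guaranteed to be strings.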

So at this point you are ready to go:

$dataSet = CSVToDataSet("data.csv");
$query = new DataQuery($dataSet);
$dupes = $query->getRowsWithDuplicates(0);

What you gain by doing this

Clean, maintainable code that supports being modified in the future without requiring edits all over your application.

If you want to add more query operations, add them to DataQuery and you can instantly use them on all concrete types of data sets. The data set and any other external code will not need any modifications.
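As a sketch of what adding more query operations might look like, the following covers the other two searches from the question: banned entries and regular expression matching. The class name ExtraDataQuery, the method names and the ArrayDataSet helper are hypothetical, and the interface is repeated only so the example runs on its own.

```php
<?php
// Repeated from the answer so this sketch runs on its own.
interface IDataSet {
    public function getRowCount();
    public function getValueAt($row, $column);
}

// Minimal array-backed data set, for demonstration only.
class ArrayDataSet implements IDataSet {
    private $_data;

    public function __construct(array $data) {
        $this->_data = $data;
    }

    public function getRowCount() {
        return count($this->_data);
    }

    public function getValueAt($row, $column) {
        return isset($this->_data[$row][$column])
            ? $this->_data[$row][$column]
            : null;
    }
}

// Hypothetical extra queries; the method names are not from the answer.
class ExtraDataQuery {
    private $_dataSet;

    public function __construct(IDataSet $dataSet) {
        $this->_dataSet = $dataSet;
    }

    // Indexes of rows whose value in $columnIndex is on the banned list.
    public function getRowsWithBannedValues($columnIndex, array $banned) {
        $matches = array();
        for ($i = 0; $i < $this->_dataSet->getRowCount(); ++$i) {
            $value = $this->_dataSet->getValueAt($i, $columnIndex);
            if (in_array($value, $banned, true)) {
                $matches[] = $i;
            }
        }
        return $matches;
    }

    // Indexes of rows whose value in $columnIndex matches a PCRE pattern.
    public function getRowsMatching($columnIndex, $pattern) {
        $matches = array();
        for ($i = 0; $i < $this->_dataSet->getRowCount(); ++$i) {
            $value = $this->_dataSet->getValueAt($i, $columnIndex);
            if ($value !== null && preg_match($pattern, $value) === 1) {
                $matches[] = $i;
            }
        }
        return $matches;
    }
}

// Invented sample data: a name column and an address column.
$query = new ExtraDataQuery(new ArrayDataSet(array(
    array('alice', 'alice@example.com'),
    array('spam',  'not-an-address'),
    array('bob',   'bob@example.org'),
)));

$banned = $query->getRowsWithBannedValues(0, array('spam', 'eggs'));
$emails = $query->getRowsMatching(1, '/@example\.(com|org)$/');
```

Both methods depend only on getRowCount() and getValueAt(), so they should work unchanged with any other IDataSet implementation.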

If you want to change the internal representation of the data, modify InMemoryDataSet accordingly or create another class that implements IDataSet and use that one instead from CSVToDataSet. The query class and any other external code will not need any modifications.

If you need to change the definition of the data set (perhaps to allow more types of queries to be performed efficiently) then you have to modify IDataSet, which also brings all the concrete data set classes into the picture and probably DataQuery as well. While this won't be the end of the world, it's exactly the kind of thing you would want to avoid.

And this is precisely why I suggested starting with the definition of the data set: if you come up with a good one, everything else will just fall into place.

answered Nov 13 '22 08:11

Jon
