 

Random access to CSV file content

I'm looking for a way to access a CSV file's cells in a random fashion. If I use Python's csv module, I can only iterate through all the lines, which is rather slow. I should also add that the file is pretty large (>100 MB) and that I'm looking for short response times.

I could preprocess the file into a different data format for faster row/column access. Perhaps someone has done this before and can share their experiences.

Background:

I'd like to show an extract of the CSV on screen, served by a web server (depending on scroll position). Keeping the whole file in memory is not an option.

asked Feb 15 '26 by orange


1 Answer

I have found SQLite good for this sort of thing. It is easy to set up and you can store the data locally, but you get much easier control over what you select than with CSV files, and you get the facility to add indexes etc.

There is also a built-in facility for loading CSV files into a table: http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles.

Let me know if you want any further details on the SQLite route i.e. how to create the table, load the data in or query it from Python.

SQLite Instructions to load .csv file to table

To create a database file, just add the required filename as an argument when opening SQLite. Navigate to the directory containing the CSV file from the command line (I am assuming here that you want the SQLite .db file to live in the same directory). If you are using Windows, add SQLite to your PATH environment variable if you have not already done so (instructions here if you need them), then open SQLite with the name that you want to give your database file as an argument, e.g.:

sqlite3 example.db

Check the database file has been created by entering:

.databases

Create a table to hold the data. I am using an example of a simple customer table here. If data types are inconsistent for any columns, use text:

create table customers (
    ID integer,
    Title text,
    Forename text,
    Surname text,
    Postcode text,
    Addr_Line1 text,
    Addr_Line2 text,
    Town text,
    County text,
    Home_Phone text,
    Mobile text,
    Comments text
);

Specify the separator to be used:

.separator ","
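
Note that a plain comma separator will not cope with quoted fields that themselves contain commas. If your build of SQLite is recent enough (an assumption about your version), you can switch to CSV mode instead, which handles quoting:

.mode csv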

Issue the command to import the data; the syntax takes the form .import filename.ext table_name, e.g.:

.import cust.csv customers

Check that the data has loaded in:

select count(*) from customers;

Add an index for columns that you are likely to filter on (full syntax described here) e.g.:

create index cust_surname on customers(surname);

You should now have fast access to the data when filtering on any of the indexed columns. To leave SQLite use .exit; to get a list of other helpful non-SQL commands, use .help.
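
If you would rather query the table from Python than from the SQLite shell, a minimal sketch using the standard library's sqlite3 module could look like this (the file, table, and column names match the example above; the 'Smith' value and the limit/offset figures are just illustrations):

import sqlite3

# Connect to the database file created above
conn = sqlite3.connect('example.db')
cur = conn.cursor()

# The index on surname makes this lookup fast
cur.execute("select * from customers where surname = ?", ('Smith',))
for row in cur.fetchall():
    print(row)

# Paging through rows, e.g. to serve the on-screen extract
# for a given scroll position
cur.execute("select * from customers limit 20 offset 100")
rows = cur.fetchall()

conn.close()

Using the ? placeholder rather than building the SQL string by hand keeps the query safe if the filter value comes from user input.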

Python Alternative

Alternatively, if you want to stick with pure Python and pre-process the file, you could load the data into a dictionary, which would allow much faster access: the dictionary keys behave like an index, meaning that you can get to the values associated with a key quickly without going through the records one by one. I would need further details of your input data and which fields the lookups would be based on to say more about how to implement this.
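
As a rough sketch of what this could look like, assuming a CSV like the customer example above with a unique ID in the first column (that layout is just an assumption about your data):

import csv

# Pre-process the CSV into a dictionary keyed on the ID column
lookup = {}
with open('cust.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)              # skip the header row, if there is one
    for row in reader:
        lookup[row[0]] = row  # key on the first column (ID)

# Individual records can now be fetched without scanning the file
record = lookup.get('42')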

However, unless you know in advance when the data will be required (so that you can pre-process the file before the request for data arrives), you would still have the overhead of loading the file from disk into memory every time you run this. Depending on your exact usage, this may make the database solution more appropriate.

answered Feb 17 '26 by ChrisProsser


