 

Fastest way to extract only certain fields from comma separated string in Python

Say I have a string containing data from a DB or spreadsheet in comma separated format.

For example:

data = "hello,how,are,you,232.3354,good morning"

Assume that there are maybe 200 fields in these "records".

I am interested in looking at just certain fields of this record. What is the fastest way in Python to get at them?

The simplest way would be something like:

fields = data.split(",")
result = [fields[4], fields[12], fields[123]]

Is there a faster way to do this, making use of the fact that:

  1. You only need to allocate a list with 3 elements and 3 string objects for the result.
  2. You can stop scanning the data string once you reach field 123.
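Point 2 can be approximated with plain `str.split` via its `maxsplit` argument: splitting stops after that many commas, so the tail of the record past the last field you need is never scanned for separators. A minimal sketch (whether it actually wins depends on how far down the string the last wanted field sits):

```python
# Sketch: str.split with maxsplit stops scanning after that many commas.
data = "hello,how,are,you,232.3354,good morning"

# To read fields up to index 4 cleanly, allow index + 1 splits so the
# unsplit remainder lands at index 5, not inside the field we want.
fields = data.split(",", 5)
print(fields[4])  # 232.3354
```

For the 200-field case with a highest wanted index of 123, `data.split(",", 124)` gives clean fields at indices 0 through 123, with the unscanned remainder at index 124.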

I have tried writing code that uses repeated calls to find to skip past commas, but if the last field is too far down the string this becomes slower than the basic split solution.
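For reference, the find-based approach described above can be sketched as follows (the helper name `pick_fields` is ours, not from the question). It walks the string with `str.find`, keeps only the wanted indices, and stops scanning once the last wanted field is collected; as noted, it can still lose to `str.split` when the last index is large, because the per-field Python-level loop is slower than split's C loop:

```python
# Sketch of the find-based skipping approach: scan comma by comma,
# slice out only the wanted (sorted) field indices, stop early.
def pick_fields(line, wanted, sep=","):
    wanted = sorted(wanted)
    out = []
    start = 0          # start offset of the current field
    idx = 0            # index of the current field
    it = iter(wanted)
    nxt = next(it)     # next wanted field index
    last = wanted[-1]
    while True:
        end = line.find(sep, start)
        if idx == nxt:
            out.append(line[start:] if end == -1 else line[start:end])
            if idx == last:
                return out   # early exit: rest of the line never scanned
            nxt = next(it)
        if end == -1:
            return out       # ran out of fields before finding them all
        start = end + 1
        idx += 1

data = "hello,how,are,you,232.3354,good morning"
print(pick_fields(data, [0, 4]))  # ['hello', '232.3354']
```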

I am processing several million records so any speedup would be welcome.

Asked May 18 '13 by Jeroen Dirks

1 Answer

You're not going to do much better than loading everything into memory and then dropping the parts you don't need. My recommendation is compression and a better library.

As it happens, I have a couple of reasonably sized CSVs lying around (this one is 500k lines).

import gzip
import pandas as pd
%timeit pd.read_csv(gzip.open('file.csv.gz'))
1 loops, best of 3: 545 ms per loop

Dropping the columns is also pretty fast; I'm not sure where the major cost is.

%timeit csv[['col1', 'col2']]
100 loops, best of 3: 5.5 ms per loop
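If only a handful of columns are needed, pandas can also skip the rest at parse time via the `usecols` argument of `read_csv`, which avoids materializing the unwanted columns at all. A small self-contained sketch (the inline CSV here is illustrative, not the answerer's file):

```python
# Sketch: usecols tells read_csv to keep only the named columns
# while parsing, instead of loading everything and dropping later.
import io
import pandas as pd

csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"
df = pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"])
print(list(df.columns))  # ['a', 'c']
```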
Answered Oct 19 '22 by U2EF1