 

Fastest way to extract only certain fields from comma separated string in Python

Say I have a string containing data from a DB or spreadsheet in comma separated format.

For example:

data = "hello,how,are,you,232.3354,good morning"

Assume that there are maybe 200 fields in these "records".

I am interested in looking at just certain fields of this record. What is the fastest way in Python to get at them?

The simplest way would be something like:

fields = data.split(",")
result = [fields[4], fields[12], fields[123]]

Is there a faster way to do this, making use of the fact that:

  1. You only need to allocate a list with 3 elements and 3 string objects for the result.
  2. You can stop scanning the data string once you reach field 123.
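Point 2 can be approximated with plain `str.split` via its `maxsplit` argument: splitting stops after that many commas, so the tail of the record past the last field you need is never scanned for separators. A minimal sketch (whether it actually wins depends on how far down the string the last wanted field sits):

```python
# Sketch: str.split with maxsplit stops scanning after that many commas.
data = "hello,how,are,you,232.3354,good morning"

# To read fields up to index 4 cleanly, allow index + 1 splits so the
# unsplit remainder lands at index 5, not inside the field we want.
fields = data.split(",", 5)
print(fields[4])  # 232.3354
```

For the 200-field case with a highest wanted index of 123, `data.split(",", 124)` gives clean fields at indices 0 through 123, with the unscanned remainder at index 124.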

I have tried writing code that uses repeated calls to find to skip past commas, but if the last field is too far down the string this becomes slower than the basic split solution.
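For reference, the find-based approach described above can be sketched as follows (the helper name `pick_fields` is ours, not from the question). It walks the string with `str.find`, keeps only the wanted indices, and stops scanning once the last wanted field is collected; as noted, it can still lose to `str.split` when the last index is large, because the per-field Python-level loop is slower than split's C loop:

```python
# Sketch of the find-based skipping approach: scan comma by comma,
# slice out only the wanted (sorted) field indices, stop early.
def pick_fields(line, wanted, sep=","):
    wanted = sorted(wanted)
    out = []
    start = 0          # start offset of the current field
    idx = 0            # index of the current field
    it = iter(wanted)
    nxt = next(it)     # next wanted field index
    last = wanted[-1]
    while True:
        end = line.find(sep, start)
        if idx == nxt:
            out.append(line[start:] if end == -1 else line[start:end])
            if idx == last:
                return out   # early exit: rest of the line never scanned
            nxt = next(it)
        if end == -1:
            return out       # ran out of fields before finding them all
        start = end + 1
        idx += 1

data = "hello,how,are,you,232.3354,good morning"
print(pick_fields(data, [0, 4]))  # ['hello', '232.3354']
```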

I am processing several million records so any speedup would be welcome.

Asked May 18 '13 by Jeroen Dirks

1 Answer

You're not going to do much better than loading everything into memory and then dropping the parts you don't need. My recommendation is compression and a better library.

As it happens, I have a couple of reasonably sized CSVs lying around (this one is 500k lines).

import gzip
import pandas as pd
%timeit pd.read_csv(gzip.open('file.csv.gz'))
1 loops, best of 3: 545 ms per loop

Dropping the columns is also pretty fast; I'm not sure where the major cost is.

%timeit csv[['col1', 'col2']]
100 loops, best of 3: 5.5 ms per loop
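If only a handful of columns are needed, pandas can also skip the rest at parse time via the `usecols` argument of `read_csv`, which avoids materializing the unwanted columns at all. A small self-contained sketch (the inline CSV here is illustrative, not the answerer's file):

```python
# Sketch: usecols tells read_csv to keep only the named columns
# while parsing, instead of loading everything and dropping later.
import io
import pandas as pd

csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"
df = pd.read_csv(io.StringIO(csv_text), usecols=["a", "c"])
print(list(df.columns))  # ['a', 'c']
```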
Answered Oct 19 '22 by U2EF1