Say I have a string containing data from a DB or spreadsheet in comma-separated format.
For example:
data = "hello,how,are,you,232.3354,good morning"
Assume that there are maybe 200 fields in these "records".
I am interested in looking at just certain fields of this record. What is the fastest way in Python to get at them?
The simplest way would be something like:
fields = data.split(",")
result = [fields[4], fields[12], fields[123]]
Is there a faster way to do this, making use of the fact that I only need a few of the fields?
I have tried writing some code that uses repeated calls to find to skip past commas (see the sketch below), but if the last wanted field is far along the string this becomes slower than the basic split solution.
I am processing several million records, so any speedup would be welcome.
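Roughly, the find-based code I tried looked something like this (just a sketch; extract_fields is an illustrative name, and it assumes the wanted indices are sorted and valid):

def extract_fields(data, wanted):
    # Walk the string with str.find, skipping past commas until each
    # wanted field index is reached ("wanted" must be sorted ascending).
    results = []
    start = 0
    current = 0
    for target in wanted:
        while current < target:
            start = data.find(",", start) + 1
            current += 1
        end = data.find(",", start)
        if end == -1:                      # last field in the record
            results.append(data[start:])
            break
        results.append(data[start:end])
        start = end + 1
        current += 1
    return results

extract_fields("hello,how,are,you,232.3354,good morning", [0, 3, 4])
# ['hello', 'you', '232.3354']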
You're not going to do too much better than loading everything into memory and then dropping the parts that you need. My recommendation is compression and a better library.
As it happens, I have a couple of reasonably sized CSVs lying around (this one is 500k lines).
> import gzip
> import pandas as pd
> %timeit pd.read_csv(gzip.open('file.csv.gz'))
1 loops, best of 3: 545 ms per loop
Selecting just the columns you need is also pretty fast; I'm not sure what the major cost is.
> df = pd.read_csv(gzip.open('file.csv.gz'))
> %timeit df[['col1', 'col2']]
100 loops, best of 3: 5.5 ms per loop
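If you know up front which fields you need, another option (not timed here) is to have read_csv skip the rest at parse time with usecols; a minimal sketch, where the file name and column positions are placeholders taken from the question:
> df = pd.read_csv(gzip.open('file.csv.gz'), header=None, usecols=[4, 12, 123])
> df.head()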