Consider a long list of column names (the first line) read from a large CSV file (80 MB), where blank columns may be interspersed among the named ones:
name_line = ['a','','b','','c', ... ,'','cb','cc']
I am reading the remainder of the data in line by line, and I only need to process data that has a corresponding name. A data line might look like:
data_line = ['10','','.5','','10289', ... ,'','16.7','0']
I tried it two ways. One is popping the empty columns from each line of the read
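blnk_cols = [1,3, ... ,97]
while data:
    ...
    # pop from the highest index down; popping in ascending order shifts the remaining indices
    for index in reversed(blnk_cols): data_line.pop(index)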
The other is building a new list from only the items whose column has a name in the first line:
good_cols = [0,2,4, ... ,98,99]
while data:
    ...
    data_line = [data_line[index] for index in good_cols]
In the data I am using there will definitely be more good lines than bad lines, although it might be as high as half and half.
I used the cProfile and pstats packages to find the slowest parts of my code, and they suggested the pop was currently the slowest item. I switched to the list comprehension and the time almost doubled.
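For readers who want to reproduce this kind of measurement, here is a minimal sketch of profiling with cProfile and pstats. The functions `pop_blanks` and `run` and the synthetic 100-column rows are hypothetical stand-ins, not the code from the question.

```python
import cProfile
import io
import pstats


def pop_blanks(row, blank_cols):
    """Remove blank columns in place, highest index first."""
    for index in reversed(blank_cols):
        row.pop(index)
    return row


def run(n_rows=2000):
    # Hypothetical stand-in for the real read loop: 100 columns, odd ones blank.
    blank_cols = list(range(1, 100, 2))
    for _ in range(n_rows):
        pop_blanks([str(i) for i in range(100)], blank_cols)


profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Collect the report in a string so it can be printed or inspected.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

Sorting by cumulative time puts the functions that dominate the run at the top of the report.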
I imagine one fast way would be to slice the array retrieving only good data, but this would be complicated for files with alternating blank and good data.
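The slicing idea could be sketched roughly as follows: collapse the good indices into contiguous runs once per file, then reuse the resulting slices on every row. The helper `index_runs_to_slices` is a hypothetical name, and the example assumes the good indices are sorted. It pays off only when the runs are long; for strictly alternating columns each slice holds a single item.

```python
def index_runs_to_slices(indices):
    """Collapse sorted indices into slices, one per contiguous run."""
    slices = []
    start = prev = None
    for i in indices:
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            slices.append(slice(start, prev + 1))
            start = prev = i
    if start is not None:
        slices.append(slice(start, prev + 1))
    return slices


good_cols = [0, 1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 18, 19, 20]
good_slices = index_runs_to_slices(good_cols)  # computed once per file

# Made-up data row with the same blank layout as the sample header.
row = "1,2,3,4,5,6,7,,,,,8,9,10,11,,,,12,13,14,\n".split(",")
kept = []
for s in good_slices:
    kept.extend(row[s])
```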
What I really need is to be able to do:
data_line = data_line[good_cols]
effectively passing a list of indices into a list to get back those items. Right now my program runs in about 2.3 seconds for a 10 MB file, and the pop accounts for about 0.3 seconds.
Is there a faster way to access certain locations in a list? In C it would just be dereferencing an array of pointers to the correct indices in the array.
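The closest standard-library equivalent to indexing a list with a list of indices is probably `operator.itemgetter`. The sketch below uses a shortened, made-up row. Whether it beats the pop loop on the real file is an assumption that would need to be profiled.

```python
from operator import itemgetter

good_cols = [0, 2, 4]
pick_good = itemgetter(*good_cols)  # build once, outside the read loop

data_line = ['10', '', '.5', '', '10289']
# With two or more indices, itemgetter returns a tuple; convert if a list is needed.
data_line = list(pick_good(data_line))
```

One caveat: with a single index, `itemgetter` returns the bare item rather than a one-element tuple, so that case needs separate handling.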
Additions: name_line in file before read
a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,
name_line after read and split(",")
['a','b','c','d','e','f','g','','','','','h','i','j','k','','','','l','m','n','\n']
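Given a header like this, the list of good column indices could be derived once from the header itself instead of being written by hand. This is a minimal sketch, assuming a column counts as good when its name is non-empty after stripping whitespace, which also discards the trailing '\n' entry.

```python
name_line = "a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,\n".split(",")

# Keep the index of every column whose name is not blank.
good_cols = [i for i, name in enumerate(name_line) if name.strip()]
names = [name_line[i] for i in good_cols]
```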
So, you can grab any list element you want by using its index. To access an item, first include the name of the list and then in square brackets include the integer that corresponds to the index for the item you want to access.
The syntax for accessing the elements of a list is the same as the syntax for accessing the characters of a string. We use the index operator ( [] – not to be confused with an empty list). The expression inside the brackets specifies the index.
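A short illustration of the index operator on a list and on a string (the values are made up for the example):

```python
name_line = ['a', 'b', 'c']
first = name_line[0]    # indices start at 0
last = name_line[-1]    # negative indices count from the end

word = "csv"
second_char = word[1]   # strings use the same syntax
```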
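Try a generator expression, bound to a new name (the expression reads data_line lazily, so rebinding data_line to the generator itself would fail with a TypeError on the first iteration):
good_data = (data_line[i] for i in good_cols)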
Also read here about Generator Expressions vs. List Comprehension
as the top answer tells you: 'Basically, use a generator expression if all you're doing is iterating once'.
So you should benefit from this.
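To make the trade-off concrete, here is a small sketch on a shortened, made-up row. A generator expression produces items on demand and can be consumed only once. A list comprehension builds everything up front but supports indexing, `len`, and repeated passes.

```python
data_line = ['10', '', '.5', '', '10289']
good_cols = [0, 2, 4]

lazy = (data_line[i] for i in good_cols)
first_pass = list(lazy)    # consumes the generator
second_pass = list(lazy)   # already exhausted, so this is empty

eager = [data_line[i] for i in good_cols]
middle = eager[1]          # lists can be indexed; generators cannot
```

If each row is iterated exactly once, the generator avoids building an intermediate list; otherwise the list comprehension is the safer choice.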