Sorry for the vague title, but it's hard to explain concisely.
Basically, imagine I have a list (in Python) that looks like this:
['a', 'b', 'c\nd', 'e', 'f\ng', 'h', 'i']
From that, I want to get this:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
One way I was thinking of doing this was using reduce, like so:
reduce(lambda x, y: x + y.split('\n'), lst, [])
But I don't think this is very efficient, since it doesn't take advantage of the fact that we know every nth element has the separator in it. Any suggestions?
Edit: here is more background on how the list was constructed, which may be the real problem.
I have text in the form:
Ignorable line
Field name 1|Field name 2|Field name 3|Field name 4
Value 1|Value 2|Value 3|Value 4
Value 1|Value 2|Value 3|Value 4
...
Here we can have an arbitrary number of field names, and there will always be as many values on each row as there are field names. Note that the values can contain newlines. We only know that they will be separated by a '|'. So we could have
Value 1|This is an long
value that extends over multiple
lines|Value 3|Value 4
How I currently do this is by calling s.split('\n', 2) so that we get the field names in their own string and the values in their own string. Then, when splitting the values by '|', we get a list of the form I originally mentioned.
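To make the setup concrete, here is a minimal sketch of the parsing steps described above. The sample text and the variable names (`text`, `header`, `body`, `values`) are assumptions made up for illustration; only the `split('\n', 2)` and `split('|')` calls come from the description.

```python
# Hypothetical sample input in the format described above.
text = (
    "Ignorable line\n"
    "Field name 1|Field name 2|Field name 3|Field name 4\n"
    "Value 1|Value 2|Value 3|Value 4\n"
    "Value 1|Value 2|Value 3|Value 4"
)

# Split off the ignorable line and the header; everything else is the body.
_, header, body = text.split("\n", 2)
field_names = header.split("|")

# Splitting the body on '|' leaves the row boundary inside one element:
# the last value of a row and the first value of the next row stay joined
# by a newline.
values = body.split("|")
print(values)
```

With four fields, the element at index 3 comes out as 'Value 4\nValue 1', which is the "every nth element has the separator" pattern from the original list.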
You can just do ('\n'.join(lst)).split() to get the second list.
In [17]:
%timeit reduce(lambda x, y: x + y.split('\n'), lst, [])
100000 loops, best of 3: 9.64 µs per loop
In [18]:
%timeit ('\n'.join(lst)).split()
1000000 loops, best of 3: 1.09 µs per loop
Thanks to @Joran Beasley for suggesting split() over split('\n')!
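One caveat worth keeping in mind: split() with no argument splits on any run of whitespace, not just newlines. That is fine for the single-letter example list, but it would break values that contain spaces. The sketch below compares the two calls; the `spaced` list is a made-up example, not data from the question.

```python
lst = ['a', 'b', 'c\nd', 'e', 'f\ng', 'h', 'i']

# For the example list, both forms give the same flattened result.
flat = '\n'.join(lst).split()
flat_newline_only = '\n'.join(lst).split('\n')

# Hypothetical values that contain spaces, as in the edited question.
spaced = ['Value 3', 'Value 4\nValue 1', 'Value 2']

# split() also breaks on the spaces inside each value...
by_whitespace = '\n'.join(spaced).split()
# ...while split('\n') only breaks on the newlines.
by_newline = '\n'.join(spaced).split('\n')

print(by_whitespace)
print(by_newline)
```

So if the values can contain spaces, split('\n') is probably the safer choice, at the cost of whatever speed difference split() gives.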
Now that I see your updated question, I think we can avoid getting into this situation in the first place (using re):
In [71]:
L = re.findall(r'([^|]+)\|',
''.join(['|' + item + '|' if item.count('|') == 3 else item for item in S.split('\n')[1:]]) + '|')
In [72]:
zip(*[L[i::4] for i in range(4)])  # 4 being the number of fields; S is the raw input text
Out[72]:
[('Field name 1', 'Field name 2', 'Field name 3', 'Field name 4'),
('Value 1', 'Value 2', 'Value 3', 'Value 4'),
('Value 1',
'This is an longvalue that extends over multiplelines',
'Value 3',
'Value 4')]
Looks like a dataset for SAS initially, am I right?
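The session above looks like Python 2, where zip returns a list. A rough Python 3 sketch of the same idea, wrapped in a function, might look like the following. The name `parse_rows`, its parameters, and the sample text are all assumptions for illustration. Like the original, it treats any line with exactly n_fields - 1 pipes as a complete row, drops the newlines inside multi-line values, and skips empty values (because of the `+` in the pattern).

```python
import re

def parse_rows(text, n_fields):
    """Split pipe-delimited text into rows of n_fields cells (hypothetical helper)."""
    lines = text.split("\n")[1:]  # drop the ignorable first line
    # Wrap complete rows in pipes so row boundaries become cell boundaries;
    # partial lines of a multi-line value are concatenated as they are.
    wrapped = "".join(
        "|" + line + "|" if line.count("|") == n_fields - 1 else line
        for line in lines
    ) + "|"
    cells = re.findall(r"([^|]+)\|", wrapped)
    # Group the flat cell list into consecutive rows of n_fields.
    return [tuple(cells[i:i + n_fields]) for i in range(0, len(cells), n_fields)]

# Hypothetical sample input in the format from the question.
S = (
    "Ignorable line\n"
    "Field name 1|Field name 2|Field name 3|Field name 4\n"
    "Value 1|Value 2|Value 3|Value 4\n"
    "Value 1|This is an long\n"
    "value that extends over multiple\n"
    "lines|Value 3|Value 4"
)
rows = parse_rows(S, 4)
for row in rows:
    print(row)
```

If the newlines inside values need to be preserved, or a partial line can itself contain exactly n_fields - 1 pipes, this heuristic would need adjusting.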
Premature optimization is the root of all evil.
If you are actually experiencing performance issues because of this code, that's one thing, but I doubt you are.
When you optimize, you often sacrifice readability. What I would do, if it were me:
list(itertools.chain(*[item.split() for item in lst]))
which makes it very clear what you're doing.
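As a small variation on the same idea, itertools.chain.from_iterable accepts a generator directly, so the intermediate list and the star-unpacking are not needed. This is only a sketch using the example list from the question; it splits on '\n' explicitly, on the assumption that values may contain spaces.

```python
from itertools import chain

lst = ['a', 'b', 'c\nd', 'e', 'f\ng', 'h', 'i']

# Lazily split each item on newlines and flatten the pieces into one list.
flat = list(chain.from_iterable(item.split('\n') for item in lst))
print(flat)
```

Unlike the reduce version, this does not build a new list with x + ... at every step, so it should scale better on long inputs.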