
Best way to split every nth string element and merge into array?

Tags: python, list

Sorry for the vague title, but it's hard to explain concisely.

Basically, imagine I have a list (in Python) that looks like this:

['a', 'b', 'c\nd', 'e', 'f\ng', 'h', 'i']

From that, I want to get this:

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

One way I was thinking of doing this was using reduce like so:

reduce(lambda x, y: x + y.split('\n'), lst, [])

But I don't think this is very efficient, since it doesn't take advantage of the fact that we know every nth element has the separator in it. Any suggestions?

Edit: more background on how the list was constructed, which may be the real problem.

I have text in the form:

Ignorable line
Field name 1|Field name 2|Field name 3|Field name 4
Value 1|Value 2|Value 3|Value 4
Value 1|Value 2|Value 3|Value 4
...

There can be an arbitrary number of field names, and each record always has as many values as there are field names. Note that values can contain newlines. We only know that they will be separated by a '|'. So we could have

Value 1|This is an long
value that extends over multiple
lines|Value 3|Value 4

Currently I call s.split('\n', 2) so that the field names end up in their own string and the values in their own string. Then, splitting the values by '|' gives a list of the form I originally mentioned.
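For illustration, here is a minimal sketch of the steps just described. The sample text and the names `s`, `header`, `body`, `field_names` and `values` are assumptions made up for the example, not the real data.

```python
# Sample text in the format described above (contents invented for illustration).
s = (
    "Ignorable line\n"
    "Field name 1|Field name 2|Field name 3|Field name 4\n"
    "Value 1|Value 2|Value 3|Value 4\n"
    "Value 1|This is a long\nvalue|Value 3|Value 4"
)

# maxsplit=2 yields three pieces: the ignorable line, the header line, and the rest.
_, header, body = s.split("\n", 2)

field_names = header.split("|")
values = body.split("|")

# The last value of one row and the first value of the next row end up
# glued together in a single element, joined by the newline between rows.
print(values[3])
```

Here `values[3]` is 'Value 4\nValue 1'. A value that itself contains a newline (`values[4]` in this sample) looks the same to a plain split on '\n', which is why the position of the element matters.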

mp94 asked Apr 06 '14 01:04




2 Answers

You can simply use '\n'.join(lst).split() to get the second list.

In [17]:

%timeit reduce(lambda x, y: x + y.split('\n'), lst, [])
100000 loops, best of 3: 9.64 µs per loop
In [18]:

%timeit ('\n'.join(lst)).split() 
1000000 loops, best of 3: 1.09 µs per loop

Thanks to @Joran Beasley for suggesting split() over split('\n')!
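One thing worth keeping in mind: split() with no argument splits on any run of whitespace, not only on newlines. For the single-letter list in the question the two calls agree, but they differ once an element contains a space. A small sketch (the `with_spaces` list is a made-up example):

```python
lst = ['a', 'b', 'c\nd', 'e', 'f\ng', 'h', 'i']
joined = '\n'.join(lst)

# No spaces in the data, so both variants give the same result here.
same_for_letters = joined.split() == joined.split('\n')

# Hypothetical data with spaces inside the values.
with_spaces = ['Value 1', 'Value 2\nValue 3']
joined_spaces = '\n'.join(with_spaces)

by_whitespace = joined_spaces.split()     # breaks 'Value 1' into 'Value' and '1'
by_newline = joined_spaces.split('\n')    # keeps 'Value 1' intact

print(by_whitespace)
print(by_newline)
```

So if the real values can contain spaces, split('\n') is probably the safer choice, at some cost in speed.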

Edit

Now that I see your updated question, I think we can avoid getting into this situation in the first place by parsing the text directly (using re):

In [71]:

import re
L = re.findall(r'([^|]+)\|',
               ''.join(['|' + item + '|' if item.count('|') == 3 else item
                        for item in S.split('\n')[1:]]) + '|')
In [72]:

zip(*[L[i::4] for i in range(4)]) #4 being the number of fields.
Out[72]:
[('Field name 1', 'Field name 2', 'Field name 3', 'Field name 4'),
 ('Value 1', 'Value 2', 'Value 3', 'Value 4'),
 ('Value 1',
  'This is an longvalue that extends over multiplelines',
  'Value 3',
  'Value 4')]
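For comparison with the regex session above, here is a rough sketch that uses the known field count instead: after splitting on '|', every element that closes a row holds the last value of that row, a newline, and the first value of the next row. The function name `split_rows` is hypothetical, and the sketch assumes at least two fields per row and that the first value of a row never contains a newline (otherwise the row boundary is ambiguous). Unlike the ''.join approach, it keeps the newlines inside multi-line values.

```python
def split_rows(body, n_fields):
    """Split '|'-separated text into rows of n_fields values each.

    Assumes n_fields >= 2 and that the first value of every row has no
    newline, so the LAST newline in a boundary piece marks the row break.
    """
    pieces = body.rstrip("\n").split("|")
    rows, current = [], []
    for i, piece in enumerate(pieces):
        # A boundary piece is the one that would complete the current row,
        # unless it is the very last piece of the whole text.
        is_boundary = len(current) == n_fields - 1 and i < len(pieces) - 1
        if is_boundary:
            last_value, first_value = piece.rsplit("\n", 1)
            rows.append(current + [last_value])
            current = [first_value]
        else:
            current.append(piece)
    if current:
        rows.append(current)
    return rows


# Made-up sample: two rows of four values, one value spanning two lines.
body = "Value 1|Value 2|Value 3|Value 4\nValue 1|This is a long\nvalue|Value 3|Value 4"
rows = split_rows(body, 4)
for row in rows:
    print(row)
```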

Looks like a dataset for SAS initially, am I right?

CT Zhu answered Oct 18 '22 00:10


Premature optimization is the root of all evil.

If you are actually experiencing performance issues because of this code, that's one thing, but I doubt you are.

When you optimize, you are often sacrificing readability. What I would do, if it were me:

list(itertools.chain(*[item.split() for item in lst]))

which makes it very clear what you're doing.
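A closely related spelling, sketched here for Python 3, uses itertools.chain.from_iterable with a generator so no intermediate list has to be unpacked. It splits on '\n' explicitly, which is an assumption on my part in case the real values contain spaces:

```python
from itertools import chain

lst = ['a', 'b', 'c\nd', 'e', 'f\ng', 'h', 'i']

# Split each element on newlines and flatten the resulting pieces lazily.
flat = list(chain.from_iterable(item.split('\n') for item in lst))

print(flat)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```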

Joran Beasley answered Oct 17 '22 23:10