
Python - Splitting a large string by number of delimiter occurrences

I'm still learning Python, and I have a question I haven't been able to solve. I have a very long string (millions of lines long) which I would like to cut down to a shorter string based on a specified number of occurrences of a delimiter.

For instance:

ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//

In this case I would want to split on "//" and return a string of all lines before the nth occurrence of the delimiter.

So splitting the string by // with a count of 1 would return:

ABCDEF

Splitting the string by // with a count of 2 would return:

ABCDEF
//
GHIJKLMN

Splitting the string by // with a count of 3 would return:

ABCDEF
//
GHIJKLMN
//
OPQ

And so on. However, the length of the original 2-million-line string appeared to be a problem when I simply tried to split the entire string by "//" and work with the individual indexes: I got a memory error. Perhaps Python can't handle so many lines in one split? So I can't do that.

I'm looking for a way to avoid splitting the entire string into a hundred thousand pieces when I may only need 100. Instead, I'd like to start from the beginning, stop at a certain point, and return everything before it, which I assume may also be faster. I hope my question is clear.

Is there a simple or elegant way to achieve this? Thanks!

Indie asked Jun 04 '15 14:06



1 Answer

If you want to work with files instead of strings in memory, here is another answer.

This version is written as a function that reads the file line by line and prints each line immediately, stopping once the specified number of delimiters has been found. No extra memory is needed to hold the entire string.

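def file_split(file_name, delimiter, n=1):
    # Print lines until the n-th delimiter line is reached, then stop reading.
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()    # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return          # n-th delimiter found: stop without printing it
            print(line)             # print() with parentheses also runs on Python 3

file_split('data.txt', '//', 3)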

You can use this to write the output to a new file like this:

python split.py > newfile.txt
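If the text is already held in a Python string rather than a file, the same early-exit idea can be sketched with str.find, which scans forward from a given offset and so never builds a list of all the pieces. The function name head_until_nth is made up for this illustration. Note one assumption: it matches the delimiter anywhere in the text, not only on a line of its own, so pass something like "\n//\n" if that distinction matters for your data.

```python
def head_until_nth(text, delimiter, n=1):
    """Return the part of text that precedes the n-th occurrence of delimiter.

    Hypothetical helper for illustration; it matches the delimiter anywhere
    in the text, not only when it sits alone on a line.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    pos = -len(delimiter)  # so that the first search starts at offset 0
    for _ in range(n):
        # Resume the search just past the previous match.
        pos = text.find(delimiter, pos + len(delimiter))
        if pos == -1:
            return text  # fewer than n delimiters: return everything
    return text[:pos]


sample = "ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\n"
print(head_until_nth(sample, "//", 2))
```

Only the returned slice is copied, so the memory cost depends on how much text comes before the n-th delimiter, not on the total number of delimiters.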

With a little extra work, you can use argparse to pass parameters to the program.
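A minimal sketch of what that argparse wrapper might look like is below. The argument names (file_name, delimiter, -n/--count) and the main function are choices made for this example, not part of the original script; file_split is repeated, using .rstrip("\n"), so the block runs on its own.

```python
import argparse


def file_split(file_name, delimiter, n=1):
    # Same logic as above: print lines until the n-th delimiter line.
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)


def main(argv=None):
    # argv=None makes argparse read sys.argv[1:], the usual command-line case.
    parser = argparse.ArgumentParser(
        description="Print a file up to the n-th delimiter line.")
    parser.add_argument("file_name", help="path of the input file")
    parser.add_argument("delimiter", help="delimiter line, e.g. //")
    parser.add_argument("-n", "--count", type=int, default=1,
                        help="stop at this occurrence of the delimiter (default: 1)")
    args = parser.parse_args(argv)
    file_split(args.file_name, args.delimiter, args.count)
```

Calling main() under the usual if __name__ == "__main__": guard in split.py would then allow something like python split.py data.txt // -n 3 > newfile.txt.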

Brent Washburne answered Oct 11 '22 10:10