
Split one file into multiple files based on pattern (cut can occur within lines)

Many solutions exist for splitting a file on a pattern, but my case is specific: I need to be able to split within a line, and the cut should occur just before the pattern. For example:

Infile:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla><?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla><?xml 2><blabla><blabla>

With the pattern <?xml, this should become:

Outfile1:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla>

Outfile2:

<?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla>

Outfile3:

<?xml 2><blabla><blabla>

The Perl script in the accepted answer here works fine for my small example, but it fails on my actual files, which are much bigger (about 6GB). The error is:

panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.

I don't have permission to comment on that answer, which is why I started a new post. A Python solution would be even more appreciated, as I understand Python better.

LostInTranslation asked Oct 03 '12 21:10



3 Answers

This performs the split without reading everything into RAM:

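def files():
    # Yield a fresh output file for each chunk: 1.part, 2.part, ...
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')


pat = '<?xml'
fs = files()
outfile = next(fs)

# filename: path to the input file (define it before running)
with open(filename) as infile:
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            items = line.split(pat)
            # Text before the first match still belongs to the current file.
            # If the input starts with the pattern, 1.part stays empty.
            outfile.write(items[0])
            for item in items[1:]:
                # Each match starts a new file; close the previous one first.
                outfile.close()
                outfile = next(fs)
                outfile.write(pat + item)

outfile.close()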

A word of warning: this doesn't work if your pattern spreads across multiple lines (that is, contains "\n"). Consider the mmap solution if this is the case.
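If the pattern may straddle a line break, or if the file has very long lines (or no line breaks at all), one alternative to iterating by line is to read fixed-size chunks and hold back the last len(pat) - 1 bytes of each chunk until the next one arrives. The following is only a sketch under those assumptions: the name split_before, its parameters, and the N.part naming are made up for illustration, and it works on bytes rather than text.

```python
import os


def split_before(infile, pat, out_dir, chunk_size=1 << 20):
    """Split the binary stream `infile` just before each occurrence of `pat`.

    Hypothetical helper: writes 1.part, 2.part, ... into `out_dir` and
    returns the list of paths. Only about one chunk is held in memory.
    """
    if not pat:
        raise ValueError('pat must not be empty')
    paths = []
    out = None
    keep = len(pat) - 1  # tail held back in case pat straddles two chunks
    buf = b''

    def open_part():
        path = os.path.join(out_dir, '%d.part' % (len(paths) + 1))
        paths.append(path)
        return open(path, 'wb')

    while True:
        chunk = infile.read(chunk_size)
        buf += chunk
        start = 0  # offset of the first byte in buf not yet written
        pos = buf.find(pat)
        while pos != -1:
            if pos > start:
                # Data before the match belongs to the current part.
                if out is None:
                    out = open_part()
                out.write(buf[start:pos])
            if out is not None:
                out.close()
            out = open_part()  # the match itself starts a new part
            start = pos
            pos = buf.find(pat, pos + 1)
        # At EOF flush everything; otherwise keep the tail for the next round.
        end = len(buf) if not chunk else max(start, len(buf) - keep)
        if end > start:
            if out is None:
                out = open_part()
            out.write(buf[start:end])
        buf = buf[end:]
        if not chunk:
            break
    if out is not None:
        out.close()
    return paths
```

It would be called with a file opened in binary mode, for example split_before(open(filename, 'rb'), b'<?xml', '/output/dir'). Unlike the line-based version, it does not create an empty first part when the input starts with the pattern.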

georg answered Oct 26 '22 15:10


Perl can parse large files line by line instead of slurping the whole file into memory. Here is a short script (with explanation):

perl -n -E 'if (/(.*)(<\?xml.*)/s ) {
   print $fh $1 if $1;
   open $fh, ">output." . ++$i;
   print $fh $2;
} else { print $fh $_ }'  in.txt

perl -n : The -n flag loops over your file line by line (setting the contents to $_).

-E : Execute the following text (Perl expects a filename by default).

if (/(.*)(<\?xml.*)/s ) If a line matches <?xml, split that line (using regex captures) into $1 and $2. The /s modifier lets . match the trailing newline, so $2 keeps its line ending.

print $fh $1 if $1 Print the start of the line to the old file.

open $fh, ">output." . ++$i; Create a new file handle for writing.

print $fh $2 Print the rest of the line to the new file.

} else { print $fh $_ } If the line didn't match <?xml, just print it to the current file handle.

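Note: this script assumes your input file starts with <?xml (otherwise the first print has no open file handle). It also assumes that no line contains more than one <?xml, because the greedy (.*) splits only at the last occurrence on a line.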

CoffeeMonster answered Oct 26 '22 14:10


For files of that size, you'll probably want to use the mmap module, so you don't have to handle chunking up the file yourself. From the docs there:

Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.

Here's a quick example that shows you how to find each occurrence of <?xml #> in the file. You can write the chunks to new files as you go, but I haven't written that part.

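import mmap
import re

# a regex to match the "xml" nodes (a bytes pattern, because in Python 3
# a memory-mapped file exposes bytes)
r = re.compile(rb'<\?xml\s\d+>')

with open('so.txt', 'rb') as f:
    # map the whole file read-only; the OS pages it in on demand
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mp:
        for m in r.finditer(mp):
            # here you can start collecting the starting positions and
            # writing chunks to new files
            print(m.start())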
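The part left unwritten above could look roughly like this. It is a sketch, not a tested solution for 6GB inputs: the function name split_with_mmap, its parameters, the N.part naming, and the 1 MiB copy block are all assumptions, and the default pattern is the plain <?xml from the question rather than the stricter regex above. Note that mmap refuses to map an empty file, and that a 32-bit Python may be unable to map a file this large.

```python
import mmap
import os
import re


def split_with_mmap(path, out_dir, pattern=rb'<\?xml', block=1 << 20):
    """Write one file per pattern occurrence; return the list of part paths.

    Hypothetical helper: parts are named 1.part, 2.part, ... in `out_dir`.
    """
    paths = []
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mp:
            # Collect the offset of every match; only integers are stored.
            starts = [m.start() for m in re.finditer(pattern, mp)]
            # Data before the first match, if any, becomes a leading part.
            if not starts or starts[0] != 0:
                starts.insert(0, 0)
            ends = starts[1:] + [len(mp)]
            for n, (begin, end) in enumerate(zip(starts, ends), 1):
                part = os.path.join(out_dir, '%d.part' % n)
                with open(part, 'wb') as out:
                    # Copy in bounded blocks so a huge part is never
                    # materialised as a single bytes object.
                    for off in range(begin, end, block):
                        out.write(mp[off:min(off + block, end)])
                paths.append(part)
    return paths
```

For the sample in the question, split_with_mmap('so.txt', '/output/dir') should produce three parts, each beginning with <?xml.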
John Vinyard answered Oct 26 '22 16:10