Split one file into multiple files based on pattern (cut can occur within lines)

Tags: python, split, awk, gnu, perl

Many solutions exist for splitting a file on a pattern, but my case is specific: I need to be able to split within a line, and the cut should occur just before the pattern. Example:

Infile:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla><?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla><?xml 2><blabla><blabla>

With the pattern <?xml, this should become:

Outfile1:

<?xml 1><blabla1>
<blabla><blabla2><blabla>
<blabla><blabla>
<blabla><blabla3><blabla><blabla>
<blabla><blabla><blabla>

Outfile2:

<?xml 4>
<blabla>
<blabla><blabla><blabla>
<blabla>

Outfile3:

<?xml 2><blabla><blabla>

The Perl script in the accepted answer here works fine for my small example, but it fails on my actual, much bigger files (about 6 GB). The error is:

panic: sv_setpvn called with negative strlen at /home/.../split.pl line 7, <> chunk 1.

I don't have permission to comment on that answer, which is why I started a new post. Finally, a Python solution would be even more appreciated, as I understand Python better.

209

asked Oct 03 '12 21:10

LostInTranslation

3 Answers

This performs the split without reading everything into RAM:

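def files():
    # Generate output files named 1.part, 2.part, ... on demand.
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')


filename = 'in.txt'  # placeholder: path to the input file
pat = '<?xml'
fs = files()
outfile = next(fs)

with open(filename) as infile:
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            items = line.split(pat)
            # Text before the first match belongs to the current file.
            outfile.write(items[0])
            for item in items[1:]:
                # Each match starts a new file; close the previous one first.
                outfile.close()
                outfile = next(fs)
                outfile.write(pat + item)

outfile.close()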

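A word of warning: this doesn't work if your pattern spans multiple lines (that is, contains "\n"). Consider the mmap solution if this is the case. Also note that if the input starts with the pattern, the first output file (1.part) is left empty, because a new file is opened at every match.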
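If the pattern may straddle a line break, or the file contains very long lines, one alternative to iterating by line is to read fixed-size chunks and hold back the last len(pat) - 1 bytes so a match split across two reads is still found. The following is only a sketch under assumptions: the function name split_stream, the out_prefix naming scheme and the 1 MiB default chunk size are all hypothetical choices, and the pattern is treated as a literal byte string rather than a regex.

```python
def split_stream(in_path, out_prefix, pat=b"<?xml", chunk_size=1 << 20):
    """Split in_path just before every occurrence of pat, reading in chunks.

    Output files are named out_prefix + "1.part", "2.part", ... (a
    hypothetical naming scheme). Returns the number of files written.
    """
    keep = len(pat) - 1      # bytes held back in case pat straddles two reads
    n = 1
    out = open("%s%d.part" % (out_prefix, n), "wb")
    cur_empty = True         # nothing written to the current file yet
    buf = b""
    with open(in_path, "rb") as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                out.write(buf)   # leftover tail is shorter than pat
                break
            buf += chunk
            start = 0            # first byte of buf not yet written
            hit = buf.find(pat)
            while hit != -1:
                if hit > start:
                    out.write(buf[start:hit])
                    cur_empty = False
                if not cur_empty:
                    # Cut here: everything from the match on goes to a new file.
                    out.close()
                    n += 1
                    out = open("%s%d.part" % (out_prefix, n), "wb")
                    cur_empty = True
                start = hit
                hit = buf.find(pat, hit + 1)
            # Write all but the last `keep` bytes, which may begin a match.
            safe_end = max(start, len(buf) - keep)
            if safe_end > start:
                out.write(buf[start:safe_end])
                cur_empty = False
            buf = buf[safe_end:]
    out.close()
    return n
```

Because it works on bytes and never relies on newlines, memory use stays near chunk_size regardless of line length. Unlike the line-based loop, it does not leave an empty first file when the input begins with the pattern.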

110

answered Oct 26 '22 15:10

georg

Perl can parse large files line by line instead of slurping the whole file into memory. Here is a short script (with explanation):

perl -n -E 'if (/(.*)(<\?xml.*)/s) {
   print $fh $1 if $1;
   open $fh, ">output." . ++$i;
   print $fh $2;
} else { print $fh $_ }'  in.txt

perl -n : The -n flag loops over your file line by line, setting $_ to the current line.

-E : Execute the following text as code (Perl expects a script filename by default). Unlike -e, it also enables optional features such as say.

if (/(.*)(<\?xml.*)/s) : If a line matches <?xml, split that line (using regex captures) into $1 and $2. The /s modifier lets . match the trailing newline, so $2 keeps its line ending.

print $fh $1 if $1 : Print the start of the line to the old file.

open $fh, ">output." . ++$i; : Create a new file handle for writing.

print $fh $2 : Print the rest of the line to the new file.

} else { print $fh $_ } : If the line didn't match <?xml, just print it to the current file handle.

Note: this script assumes your input file starts with <?xml; otherwise $fh is not yet open when the first line is printed. Because (.*) is greedy, a line containing several <?xml occurrences is split only at the last one.

answered Oct 26 '22 14:10

CoffeeMonster

For files of that size, you'll probably want to use the mmap module, so you don't have to handle chunking up the file yourself. From the docs there:

Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.

Here's a quick example that shows you how to find each occurrence of <?xml #> in the file. You can write the chunks to new files as you go, but I haven't written that part.

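import mmap
import re

# a bytes regex to match the "xml" nodes (in Python 3, mmap objects hold bytes)
r = re.compile(rb'<\?xml\s\d+>')

with open('so.txt', 'rb') as f:
    # map the whole file read-only; a length of 0 means "the entire file"
    mp = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for m in r.finditer(mp):
        # here you can start collecting the starting positions and
        # writing chunks to new files
        print(m.start())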
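The part left unwritten above could look roughly like the following. This is a minimal sketch, not the answer author's code: the function name split_before_pattern, the 1.part, 2.part, ... output names and the 1 MiB copy block are assumptions made for illustration, and the default pattern is the plain <?xml prefix from the question rather than the stricter regex above.

```python
import mmap
import os
import re


def split_before_pattern(in_path, out_dir, pattern=rb"<\?xml"):
    """Write one file per chunk, cutting just before each match of pattern.

    out_dir must already exist. Output names (1.part, 2.part, ...) are a
    hypothetical convention. Returns the list of output paths.
    """
    regex = re.compile(pattern)
    out_paths = []
    block = 1 << 20  # copy in 1 MiB slices to avoid building huge bytes objects
    if os.path.getsize(in_path) == 0:
        return out_paths  # mmap cannot map an empty file
    with open(in_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mp:
            # Cut points: every match start, plus 0 if text precedes the first match.
            starts = [m.start() for m in regex.finditer(mp)]
            if not starts or starts[0] != 0:
                starts.insert(0, 0)
            ends = starts[1:] + [len(mp)]
            for n, (start, end) in enumerate(zip(starts, ends), 1):
                out_path = os.path.join(out_dir, "%d.part" % n)
                with open(out_path, "wb") as out:
                    for pos in range(start, end, block):
                        out.write(mp[pos:min(pos + block, end)])
                out_paths.append(out_path)
    return out_paths
```

For example, split_before_pattern('so.txt', 'parts') would write parts/1.part, parts/2.part and so on, assuming the parts directory exists. Since the file is mapped rather than read, the cut positions do not depend on where line breaks fall.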

answered Oct 26 '22 16:10

John Vinyard
