
Using python to separate a long text file into multiple files based on hyphen line separators?

Tags: python, txt

I am trying to split a single long text file into multiple files. Each section that needs its own file is separated by a line of hyphens that looks something like this:

     This is section of some sample text
        that says something.
        
        2---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        
        This says something else
        
        3---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    Maybe this says something else
    
    4---------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------

I have started an attempt in Python without much success. I considered using the split function, but most examples I found for split revolve around len rather than regex-style patterns. The code below only generates one large file.

with open ('someName.txt','r') as fo:

    start=1
    cntr=0
    for x in fo.read().split("\n"):
        if x=='---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------':
            start = 1
            cntr += 1
            continue
        with open (str(cntr)+'.txt','a+') as opf:
            if not start:
                x = '\n'+x
            opf.write(x)
            start = 0
DataMiner_NLP asked Mar 05 '26 17:03


1 Answer

You might get better results by switching the condition from == to in. That way, a line that has extra leading characters (such as the digit at the start of each separator in your sample) still passes the test. In the example below I changed x == '-----...' to '-----...' in x; the change is at the very end of the long string of hyphens.

with open('someName.txt', 'r') as fo:
    start = 1  # 1 until the first line of the current section is written
    cntr = 0   # index of the current output file
    for x in fo.read().split("\n"):
        if ('-----------------------------------------------------'
            '-----------------------------------------------------'
            '-----------------------------------------------------'
            '------------------------------------------------') in x:
            start = 1
            cntr += 1
            continue
        with open (str(cntr)+'.txt','a+') as opf:
            if not start:
                x = '\n'+x
            opf.write(x)
            start = 0
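
To see why the equality test never fires on the sample data, here is a minimal sketch comparing the two checks. The 40-hyphen `RULE` and the two sample lines are hypothetical stand-ins chosen for illustration; the real file uses much longer runs.

```python
# Hypothetical stand-ins for the separator lines in the question.
RULE = '-' * 40            # the hyphen run the code tests against
numbered = '2' + RULE      # separator with a leading digit, as in the sample
shorter = '4' + '-' * 20   # a shorter separator, like the last ones in the sample

exact_match = (numbered == RULE)      # equality fails because of the leading "2"
substring_match = (RULE in numbered)  # the substring test still finds the run
short_match = (RULE in shorter)       # a shorter run does not contain the rule

print(exact_match, substring_match, short_match)  # prints: False True False
```

The last check is worth noting: the substring test only matches separators at least as long as the hyphen string in the condition, so the shorter separators at the end of the sample would still be missed.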

An alternative is to split the text with a regular expression. For example:

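import re

with open('someName.txt', 'rt') as fo:
    # Matches any run of two or more hyphens. Note that a "--" inside
    # ordinary text also splits, and a leading digit such as "2" stays
    # at the end of the preceding section.
    pattern = re.compile(r'--+')
    # pattern.split cuts the whole file at every match of the pattern
    for counter, group in enumerate(pattern.split(fo.read())):
        # 'a+' appends, so remove old output files before rerunning
        with open(str(counter) + '.txt', 'a+') as opf:
            opf.write(group)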
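
The pattern r'--+' splits on any run of two or more hyphens and leaves the digit that precedes each separator in the sample attached to the previous section. A stricter variant is sketched below, assuming a separator is a line containing only optional indentation, optional digits, and at least five hyphens. The names `SEPARATOR`, `split_sections`, and `write_sections` are hypothetical, and the five-hyphen threshold is a guess that may need adjusting for the real file.

```python
import re

# Assumed separator shape: optional indentation, optional leading digits,
# then at least five hyphens and nothing else on the line.
SEPARATOR = re.compile(r'^[ \t]*\d*-{5,}[ \t]*$', re.MULTILINE)

def split_sections(text):
    """Return the non-empty sections of text, stripped of surrounding whitespace."""
    parts = SEPARATOR.split(text)
    return [part.strip() for part in parts if part.strip()]

def write_sections(text, prefix=''):
    """Write each section to <prefix><n>.txt, overwriting any existing file."""
    sections = split_sections(text)
    for number, section in enumerate(sections):
        with open(f'{prefix}{number}.txt', 'w') as out:
            out.write(section + '\n')
    return len(sections)

# Shortened version of the sample from the question.
sample = (
    'This is section of some sample text\n'
    '   that says something.\n'
    '\n'
    '   2----------------------\n'
    '\n'
    'This says something else\n'
    '\n'
    '3----------\n'
    'Maybe this says something else\n'
    '4--------\n'
    '\n'
    '------------\n'
)
sections = split_sections(sample)
print(len(sections))  # prints: 3
```

To produce the files, read the input once and pass it along, for instance `write_sections(open('someName.txt').read())`. Opening the outputs with 'w' rather than 'a+' means a rerun replaces the earlier files instead of appending to them, and empty sections between back-to-back separators are skipped.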
Alexander answered Mar 07 '26 06:03