Parsing a text file with random spacing and repeating text?

Tags:

python

I'm trying to parse a large text file that has inconsistent spacing and repeating lines. I don't need most of the text in the file, but from a given line I may need 6 items, some separated by commas and some by spaces.

Example Line: 1 23456 John,Doe 366: F.7

What I want (in CSV): 1 23456 John Doe 366 F.7 (each in its own cell)

Ultimately I'm trying to get the output into CSV. So far I've tried going line by line through the file, separating the components I want to extract by their specific positions, but I feel like there's a better way.

import csv

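def is_page_header(line):
    # A page header starts with '1' and is not the "RUN DATE:" line
    return line[0] == '1' and "RUN DATE:" not in line

def read_header(inFile):
    # Skip lines up to and including the row of asterisks that ends the header
    while True:
        line = inFile.readline()
        # Also stop at end of file, where readline() returns ''
        if line == '' or '************************' in line:
            break

def is_rec_start(line):
    # A record starts with an integer in the first six columns
    try:
        int(line[0:6])
        return True
    except ValueError:
        return False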

filename = r"TEXT_TEST.txt"

# Open the file in a with block so it is closed automatically
with open(filename) as inFile:
    while True:
        line = inFile.readline()

        if line == "\n":
            continue  # skip blank lines
        elif line == "":
            break  # readline() returns "" at end of file
        elif is_page_header(line):
            read_header(inFile)
        elif is_rec_start(line):
            docketno = int(line[0:6])
            fileno = line[8:20]
        elif 'FINGERPRINTED' in line:
            fingerprinted = True
        else:
            print(line)
asked Dec 02 '19 15:12 by Bruce Wayne



1 Answer

You can use a regular expression:

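import re
import csv

# Raw string so the regex escapes such as \d reach the re module unchanged
pattern = re.compile(r"(\d+)\s+(\d+)\s*(\w+)\s*\,\s*(\w+)\s*(\d+)\s*\:\s*([\w\.]+)")
# newline="" lets the csv module control line endings itself
with open("TEXT_TEST.txt") as txt_file, open("CSV_TEST.csv", "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    for line in txt_file:
        # findall returns one tuple of captured groups per match in the line
        g = pattern.findall(line)
        if g:
            csv_writer.writerows(g)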
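
To check the pattern without touching any files, the same loop can be run against in-memory text. This is only a sketch: the io.StringIO objects stand in for TEXT_TEST.txt and CSV_TEST.csv, the first input line is the sample from the question, and the second input line is made up to show that non-matching lines are skipped.

```python
import csv
import io
import re

# Same pattern as in the answer, written as a raw string
pattern = re.compile(r"(\d+)\s+(\d+)\s*(\w+)\s*\,\s*(\w+)\s*(\d+)\s*\:\s*([\w\.]+)")

# Stand-ins for the real files: the sample line from the question plus
# a made-up line that contains no record and should be ignored.
txt_file = io.StringIO("1 23456 John,Doe 366: F.7\nRUN DATE: not a record\n")
csv_file = io.StringIO()

csv_writer = csv.writer(csv_file)
for line in txt_file:
    rows = pattern.findall(line)  # list of 6-tuples, empty if no match
    if rows:
        csv_writer.writerows(rows)

result = csv_file.getvalue()
print(result)
```

With the sample line, the writer emits a single row with the six fields separated by commas; the second line produces nothing.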

(\d+): \d matches any digit from 0 to 9, and the + after it means "one or more". The parentheses () capture the matched text so it can be extracted for further processing.

\s+: \s matches a whitespace character, and + means one or more.

\s*: the * after \s means zero or more whitespace characters.

\w: matches a word character: A-Z, a-z, 0-9, or the underscore. For str patterns in Python 3 it also matches Unicode letters and digits.

[] matches any one of the characters listed inside it, e.g. [abc] matches a single a, b, or c and nothing else, so [\w\.] matches a word character or a literal dot. The \ before . escapes a character that has special meaning in a regular expression; inside [] the dot is already literal, so the escape is optional there.
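
The pieces described above can also be laid out one per line with re.VERBOSE, which ignores whitespace in the pattern and allows comments. The following is a sketch, not part of the original answer: the group names (seq, number, first, last, code, tag) are hypothetical labels, and the optional escapes before the comma, colon, and dot are dropped.

```python
import re

# The answer's pattern, spread out with re.VERBOSE so each piece can
# carry a comment. Group names are hypothetical labels for illustration.
RECORD = re.compile(r"""
    (?P<seq>\d+)     \s+       # one or more digits, then required whitespace
    (?P<number>\d+)  \s*       # second run of digits, optional whitespace
    (?P<first>\w+)   \s*,\s*   # word characters, then a comma
    (?P<last>\w+)    \s*       # word characters after the comma
    (?P<code>\d+)    \s*:\s*   # digits followed by a colon
    (?P<tag>[\w.]+)            # word characters or literal dots
""", re.VERBOSE)

match = RECORD.search("1 23456 John,Doe 366: F.7")
fields = match.groupdict() if match else {}
print(fields)

# \w accepts the underscore as well as letters and digits,
# but not the hyphen, so "c-d" is split in two.
words = re.findall(r"\w+", "a_b c-d")
print(words)
```

Named groups make it possible to refer to a field as fields["last"] instead of by position, which may be easier to maintain if the line layout changes.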


answered Nov 15 '22 03:11 by scicyb
