I'm trying to parse a large text file that has inconsistent spacing and repeating lines. I don't need most of the text in the file, but from a single line I may need, for example, 6 items, some separated by commas and some separated by spaces.
Example Line: 1 23456 John,Doe 366: F.7
What I want (in CSV): 1 23456 John Doe 366 F.7 (each in its own cell)
Ultimately I'm trying to get the output to CSV. So far I've tried going line by line through the file, separating the components I'm trying to extract by their specific positions, but I feel like there's a better way.
import csv

def is_page_header(line):
    return (line[0] == '1') and ("RUN DATE:" not in line)

def read_header(inFile):
    while True:
        line = inFile.readline()
        if '************************' in line:
            break

def is_rec_start(line):
    try:
        x = int(line[0:6])
        return True
    except:
        return False

filename = r"TEXT_TEST.txt"
inFile = open(filename)
while True:
    line = inFile.readline()
    if line == "\n":
        continue
    elif line == "":
        break
    elif is_page_header(line):
        read_header(inFile)
    elif is_rec_start(line):
        docketno = int(line[0:6])
        fileno = line[8:20]
    elif 'FINGERPRINTED' in line:
        fingerprinted = True
    else:
        print(line)
You can use a regular expression:
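import re
import csv

# Raw string so the backslashes reach the regex engine unchanged.
pattern = re.compile(r"(\d+)\s+(\d+)\s*(\w+)\s*\,\s*(\w+)\s*(\d+)\s*\:\s*([\w\.]+)")

# newline="" lets the csv module control line endings (avoids blank rows on Windows).
with open("TEXT_TEST.txt") as txt_file, open("CSV_TEST.csv", "w", newline="") as csv_file:
    csv_writer = csv.writer(csv_file)
    for line in txt_file:
        # findall returns one tuple of captured groups per match on the line.
        g = pattern.findall(line)
        if g:
            csv_writer.writerows(g)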
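To see what the pattern does without touching any files, here is a minimal sketch that runs it against the example line from the question. The in-memory `io.StringIO` buffer and the names `sample`, `rows`, `skipped`, and `csv_text` are stand-ins chosen for illustration; they are not part of the original answer.

```python
import csv
import io
import re

# Same pattern as above, written as a raw string.
pattern = re.compile(r"(\d+)\s+(\d+)\s*(\w+)\s*\,\s*(\w+)\s*(\d+)\s*\:\s*([\w\.]+)")

sample = "1 23456 John,Doe 366: F.7"  # example line from the question

# findall returns a list with one tuple of captured groups per match.
rows = pattern.findall(sample)

# A line that does not fit the pattern yields an empty list and is skipped.
skipped = pattern.findall("************************")

# Write to an in-memory buffer instead of a real file (illustration only).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

print(rows)      # [('1', '23456', 'John', 'Doe', '366', 'F.7')]
print(skipped)   # []
print(csv_text, end="")  # 1,23456,John,Doe,366,F.7
```

Because `findall` already returns a list of tuples, it can be passed straight to `writerows`, which is why the answer does not need to unpack the groups by hand.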
(\d+): \d matches any digit from 0 to 9, + after it means match one or more, and () captures the match so it can be extracted for further processing.
\s+: \s matches whitespace, and + means one or more.
\s*: * after \s matches zero or more whitespace characters.
\w: matches word characters, that is A-Z, a-z, 0-9, and the underscore _.
[]: matches one of the listed characters, e.g. [abc] will only match a single a, b, or c and nothing else, so [\w\.] matches A-Z, a-z, 0-9, _, or a literal dot. The \ before . escapes a character that has special meaning inside a regular expression (inside [] the dot is already literal, so the escape is optional here).
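The individual tokens can be tried one at a time on the example line. The following is a small sketch; the variable names are made up for illustration, and the last two lines use short strings of my own to show the underscore case and a capture group.

```python
import re

sample = "1 23456 John,Doe 366: F.7"  # example line from the question

# \d+ : runs of one or more digits
digits = re.findall(r"\d+", sample)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)

# [\w.]+ : runs of word characters or a literal dot
dotted = re.findall(r"[\w.]+", sample)

# The underscore counts as a word character.
underscore = re.findall(r"\w+", "a_b c")

# () : with groups present, findall returns the captured parts only.
names = re.findall(r"(\w+)\s*,\s*(\w+)", sample)

print(digits)      # ['1', '23456', '366', '7']
print(words)       # ['1', '23456', 'John', 'Doe', '366', 'F', '7']
print(dotted)      # ['1', '23456', 'John', 'Doe', '366', 'F.7']
print(underscore)  # ['a_b', 'c']
print(names)       # [('John', 'Doe')]
```

Comparing `words` with `dotted` shows why the last group in the answer uses a character class: plain `\w+` would split F.7 at the dot.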
Reference for the tokens and function used: \d, \w, \s, *, +, ., [], (), re.findall