I am trying to parse a series of text files and save them as CSV files using Python (2.7.3). All text files have a 4 line long header which needs to be stripped out. The data lines have various delimiters including " (quote), - (dash), : column, and blank space. I found it a pain to code it in C++ with all these different delimiters, so I decided to try it in Python hearing it is relatively easier to do compared to C/C++.
I wrote a piece of code to test it for a single line of data and it works, however, I could not manage to make it work for the actual file. For parsing a single line I was using the text object and "replace" method. It looks like my current implementation reads the text file as a list, and there is no replace method for the list object.
Being a novice in Python, I got stuck at this point. Any input would be appreciated!
Thanks!
# function for parsing the data
def data_parser(text, dic):
for i, j in dic.iteritems():
text = text.replace(i,j)
return text
# open input/output files
inputfile = open('test.dat')
outputfile = open('test.csv', 'w')
my_text = inputfile.readlines()[4:] #reads to whole text file, skipping first 4 lines
# sample text string, just for demonstration to let you know how the data looks like
# my_text = '"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636'
# dictionary definition 0-, 1- etc. are there to parse the date block delimited with dashes, and make sure the negative numbers are not effected
reps = {'"NAN"':'NAN', '"':'', '0-':'0,','1-':'1,','2-':'2,','3-':'3,','4-':'4,','5-':'5,','6-':'6,','7-':'7,','8-':'8,','9-':'9,', ' ':',', ':':',' }
txt = data_parser(my_text, reps)
outputfile.writelines(txt)
inputfile.close()
outputfile.close()
Computer programmers often use parsing programs to convert text into formats that other applications can use. Parsers split items in a text string into separate fields. If, for example, you have a business database application that reads comma-delimited input files, a parser can help you create a comma-delimited file.
File Parsing and Data Analysis in Python Part I (Interactive Parsing and Data Visualisation) 1) File Parsing Definition: Parse essentially means to ''resolve (a sentence) into its component parts and describe their syntactic roles''. In computing, parsing is 'an act of parsing a string or a text'.
We can read data from a text file using read_table() in pandas. This function reads a general delimited file to a DataFrame object. This function is essentially the same as the read_csv() function but with the delimiter = '\t', instead of a comma by default.
I would use a for
loop to iterate over the lines in the text file:
for line in my_text:
outputfile.writelines(data_parser(line, reps))
If you want to read the file line-by-line instead of loading the whole thing at the start of the script you could do something like this:
inputfile = open('test.dat')
outputfile = open('test.csv', 'w')
# sample text string, just for demonstration to let you know how the data looks like
# my_text = '"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636'
# dictionary definition 0-, 1- etc. are there to parse the date block delimited with dashes, and make sure the negative numbers are not effected
reps = {'"NAN"':'NAN', '"':'', '0-':'0,','1-':'1,','2-':'2,','3-':'3,','4-':'4,','5-':'5,','6-':'6,','7-':'7,','8-':'8,','9-':'9,', ' ':',', ':':',' }
for i in range(4): inputfile.next() # skip first four lines
for line in inputfile:
outputfile.writelines(data_parser(line, reps))
inputfile.close()
outputfile.close()
From the accepted answer, it looks like your desired behaviour is to turn
skip 0
skip 1
skip 2
skip 3
"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636
into
2012,06,23,03,09,13.23,4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,NAN,-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636
If that's right, then I think something like
import csv
with open("test.dat", "rb") as infile, open("test.csv", "wb") as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile, quoting=False)
for i, line in enumerate(reader):
if i < 4: continue
date = line[0].split()
day = date[0].split('-')
time = date[1].split(':')
newline = day + time + line[1:]
writer.writerow(newline)
would be a little simpler than the reps
stuff.
There are a few ways to go about this. One option would be to use inputfile.read()
instead of inputfile.readlines()
- you'd need to write separate code to strip the first four lines, but if you want the final output as a single string anyway, this might make the most sense.
A second, simpler option would be to rejoin the strings after striping the first four lines with my_text = ''.join(my_text)
. This is a little inefficient, but if speed isn't a major concern, the code will be simplest.
Finally, if you actually want the output as a list of strings instead of a single string, you can just modify your data parser to iterate over the list. That might looks something like this:
def data_parser(lines, dic):
for i, j in dic.iteritems():
for (k, line) in enumerate(lines):
lines[k] = line.replace(i, j)
return lines
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With