I'm working on text file processing using Python. I have a text file (ctl_Files.txt) with content like the following:
------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM
Comment:
Initial add, all objects.
Items:
add $/Systems/DB/Expences/Loader
add $/Systems/DB/Expences/Loader/AAA.txt
add $/Systems/DB/Expences/Loader/BBB.txt
add $/Systems/DB/Expences/Loader/CCC.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM
Comment:
edited objects.
Items:
edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
Changeset: 147
User: Sarfaraz
Date: Wednesday, April 06, 2011 5:34:54 PM
Comment:
Initial add, all objects.
Items:
delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
rename $/Systems/DB/Expences/Loader/AAC.txt.
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
To process this file I wrote the following code:
#Tags - used for splitting the information
tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'
#opening and reading the input file
#In path to input file use '\' as escape character
with open ("C:\\Users\\md_sarfaraz\\Desktop\\ctl_Files.txt", "r") as myfile:
    val=myfile.read().replace('\n', ' ')
#counting the occurence of any one of the above tag
#As count will be same for all the tags
occurence = val.count(tag1)
#initializing row variable
row=""
#passing the count - occurence to the loop
for count in range(1, occurence+1):
    row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
        + (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
        + (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
        + (val.split(tag4)[count].split(tag5)[0]).strip() + '|' \
        + (val.split(tag5)[count].split(tag6)[0]).strip() + '\n')
#opening and writing the output file
#In path to output file use '\' as escape character
file = open("C:\\Users\\md_sarfaraz\\Desktop\\processed_ctl_Files.txt", "w+")
file.write(row)
file.close()
and got the following result (processed_ctl_Files.txt):
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader add $/Systems/DB/Expences/Loader/AAA.txt add $/Systems/DB/Expences/Loader/BBB.txt add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader edit $/Systems/DB/Expences/Loader/AAA.txt edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892 rename $/Systems/DB/Rascal/Expences/AAC.txt.
But I want the result to look like this:
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader
add $/Systems/DB/Expences/Loader/AAA.txt
add $/Systems/DB/Expences/Loader/BBB.txt
add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892
rename $/Systems/DB/Rascal/Expences/AAC.txt.
Or, even better, like this:
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/AAA.txt
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/BBB.txt
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAA.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|rename $/Systems/DB/Rascal/Expences/AAC.txt.
How can I do this? I'm very new to Python, so please excuse any lousy or redundant code, and help me improve it.
This solution is not as short, and probably not as efficient, as the answer that uses regular expressions, but it should be quite easy to understand. It also makes the parsed data easier to use, because each section's data is stored in a dictionary.
ctl_file = "ctl_Files.txt"  # path of source file
processed_ctl_file = "processed_ctl_Files.txt"  # path of destination file

# Tags - used for splitting the information
changeset_tag = 'Changeset:'
user_tag = 'User:'
date_tag = 'Date:'
comment_tag = 'Comment:'
items_tag = 'Items:'
checkin_tag = 'Check-in Notes:'
section_separator = "------------------------"

changesets = []

# open and read the input file
with open(ctl_file, 'r') as read_file:
    first_section = True
    changeset_dict = {}
    items = []
    comment_stage = False
    items_stage = False

    # Read one line at a time
    for line in read_file:
        # Check which tag matches the current line and store the data to matching key in the dictionary
        if changeset_tag in line:
            changeset = line.split(":", 1)[1].strip()
            changeset_dict[changeset_tag] = changeset
        elif user_tag in line:
            user = line.split(":", 1)[1].strip()
            changeset_dict[user_tag] = user
        elif date_tag in line:
            # The time part contains colons too, so split at the first colon only
            date = line.split(":", 1)[1].strip()
            changeset_dict[date_tag] = date
        elif comment_tag in line:
            comment_stage = True
        elif items_tag in line:
            comment_stage = False  # the comment ends where the items begin
            items_stage = True
        elif checkin_tag in line:
            # The item list ends here. The check-in notes themselves are not
            # parsed because the example file does not contain any data for them
            changeset_dict[items_tag] = items
            items_stage = False
        elif section_separator in line:  # new section
            if first_section:
                first_section = False
                continue
            changesets.append(changeset_dict)
            changeset_dict = {}
            items = []
            # Set stages to false just in case
            items_stage = False
            comment_stage = False
        elif not line.strip():  # empty line
            if items_stage:
                changeset_dict[items_tag] = items
            items_stage = False
            comment_stage = False
        else:
            if comment_stage:
                changeset_dict[comment_tag] = line.strip()  # Only works for one line comment
            elif items_stage:
                items.append(line.strip())

# open and write to the output file
with open(processed_ctl_file, 'w') as write_file:
    for changeset in changesets:
        row = "{0}|{1}|{2}|{3}|".format(changeset[changeset_tag], changeset[user_tag], changeset[date_tag], changeset[comment_tag])
        # Indent the following items so they line up under the first one
        distance = len(row)
        items = changeset[items_tag]
        join_string = "\n" + distance * " "
        items_part = join_string.join(items)
        row += items_part + "\n"
        write_file.write(row)
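The code above writes the first of the two layouts asked for in the question. The second layout, where every item gets a full row of its own, can be built from the same list of dictionaries. The following is only a minimal sketch: the helper name rows_per_item and the inline sample data are made up for illustration, and the dictionary keys are assumed to be the tag strings used above.

```python
def rows_per_item(changesets):
    """Build one pipe-delimited row per item, repeating the changeset fields."""
    rows = []
    for changeset in changesets:
        # Fields shared by every item of this changeset
        prefix = "|".join([
            changeset['Changeset:'],
            changeset['User:'],
            changeset['Date:'],
            changeset['Comment:'],
        ])
        for item in changeset['Items:']:
            rows.append(prefix + "|" + item)
    return rows


# Hypothetical sample in the shape the parser above is meant to build
sample = [{
    'Changeset:': '145',
    'User:': 'Sarfaraz',
    'Date:': 'Thursday, April 07, 2011 5:34:54 PM',
    'Comment:': 'edited objects.',
    'Items:': [
        'edit $/Systems/DB/Expences/Loader',
        'edit $/Systems/DB/Expences/Loader/AAA.txt',
    ],
}]

for row in rows_per_item(sample):
    print(row)
```

Writing "\n".join(rows_per_item(changesets)) to the output file, instead of the aligned rows, should then give the second layout.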
Also, try to use variable names that describe their content. Names like tag1, tag2, etc. do not say much about what the variable holds. This makes code difficult to read, especially as scripts get longer. Readability might seem unimportant at first, but when revisiting old code it takes much longer to understand what the code does if the variable names are not descriptive.
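For comparison, the regular-expression approach mentioned at the start of this answer might look roughly like the sketch below, written with descriptive names. It is not the code from that other answer: the pattern, the function name parse_changesets and the field names are assumptions made for illustration. It also assumes that every section follows the layout shown in the question (Changeset, User, Date, Comment, Items, then Check-in Notes).

```python
import re

# One changeset section: the header fields, the comment, then the item
# lines up to the "Check-in Notes:" marker. Assumes the question's layout.
CHANGESET_PATTERN = re.compile(
    r"Changeset:[ \t]*(?P<changeset>\d+)\s*"
    r"User:[ \t]*(?P<user>[^\n]*?)\s*"
    r"Date:[ \t]*(?P<date>[^\n]*?)\s*"
    r"Comment:\s*(?P<comment>.*?)\s*"
    r"Items:\s*(?P<items>.*?)\s*"
    r"Check-in Notes:",
    re.DOTALL,
)


def parse_changesets(text):
    """Return a list of dicts, one per changeset section found in text."""
    changesets = []
    for match in CHANGESET_PATTERN.finditer(text):
        fields = match.groupdict()
        # Split the item block into one entry per non-empty line
        fields["items"] = [
            line.strip() for line in fields["items"].splitlines() if line.strip()
        ]
        changesets.append(fields)
    return changesets


sample_text = """------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM
Comment:
edited objects.
Items:
edit $/Systems/DB/Expences/Loader
edit $/Systems/DB/Expences/Loader/AAA.txt
edit $/Systems/DB/Expences/Loader/AAB.txt
Check-in Notes:
Code Reviewer:
Performance Reviewer:
Reviewer:
Security Reviewer:
------------------------
"""

# Print one pipe-delimited row per item
for changeset in parse_changesets(sample_text):
    for item in changeset["items"]:
        print("|".join([changeset["changeset"], changeset["user"],
                        changeset["date"], changeset["comment"], item]))
```

The pattern is shorter than the line-by-line loop, but it is also less forgiving: a section that is missing one of the tags is silently skipped instead of being parsed partially.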