File processing in Python

Question

I'm working on text file processing in Python. I have a text file (ctl_Files.txt) with content like the following:

------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM

Comment:
  Initial add, all objects.

Items:
  add $/Systems/DB/Expences/Loader
  add $/Systems/DB/Expences/Loader/AAA.txt
  add $/Systems/DB/Expences/Loader/BBB.txt
  add $/Systems/DB/Expences/Loader/CCC.txt  

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------
Changeset: 145
User: Sarfaraz
Date: Thursday, April 07, 2011 5:34:54 PM

Comment:
  edited objects.

Items:
  edit $/Systems/DB/Expences/Loader
  edit $/Systems/DB/Expences/Loader/AAA.txt
  edit $/Systems/DB/Expences/Loader/AAB.txt  

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------
Changeset: 147
User: Sarfaraz
Date: Wednesday, April 06, 2011 5:34:54 PM

Comment:
  Initial add, all objects.

Items:
  delete, source rename $/Systems/DB/Expences/Loader/AAA.txt;X34892
  rename                $/Systems/DB/Expences/Loader/AAC.txt.

Check-in Notes:
  Code Reviewer:
  Performance Reviewer:
  Reviewer:
  Security Reviewer:
------------------------

To process this file I wrote the following code:

#Tags - used for splitting the information

tag1 = 'Changeset:'
tag2 = 'User:'
tag3 = 'Date:'
tag4 = 'Comment:'
tag5 = 'Items:'
tag6 = 'Check-in Notes:'

#opening and reading the input file
#Raw string (r"...") so the backslashes in the Windows path are not treated as escape characters
with open(r"C:\Users\md_sarfaraz\Desktop\ctl_Files.txt", "r") as myfile:
    val = myfile.read().replace('\n', ' ')

#counting the occurrences of one of the tags above
#(the count is the same for every tag)
occurence = val.count(tag1)

#initializing row variable
row=""

#passing the count - occurence to the loop
for count in  range(1, occurence+1):
   row += ( (val.split(tag1)[count].split(tag2)[0]).strip() + '|' \
    + (val.split(tag2)[count].split(tag3)[0]).strip() + '|' \
    + (val.split(tag3)[count].split(tag4)[0]).strip() + '|' \
    + (val.split(tag4)[count].split(tag5)[0]).strip() + '|' \
    + (val.split(tag5)[count].split(tag6)[0]).strip() + '\n')

#opening and writing the output file
#Raw string again, so the backslashes in the path are kept literally
with open(r"C:\Users\md_sarfaraz\Desktop\processed_ctl_Files.txt", "w") as outfile:
    outfile.write(row)

This produced the following output file (processed_ctl_Files.txt):

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   add $/Systems/DB/Expences/Loader/AAA.txt   add $/Systems/DB/Expences/Loader/BBB.txt   add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   edit $/Systems/DB/Expences/Loader/AAA.txt   edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892   rename                $/Systems/DB/Rascal/Expences/AAC.txt.

But I want the result to look like this:

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   
                                                                          add $/Systems/DB/Expences/Loader/AAA.txt   
                                                                          add $/Systems/DB/Expences/Loader/BBB.txt   
                                                                          add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   
                                                                 edit $/Systems/DB/Expences/Loader/AAA.txt   
                                                                 edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892   
                                                                            rename                $/Systems/DB/Rascal/Expences/AAC.txt.

Or, even better, like this:

143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/AAA.txt   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/BBB.txt   
143|Sarfaraz|Tuesday, April 05, 2011 5:34:54 PM|Initial add, all objects.|add $/Systems/DB/Expences/Loader/CCC.txt
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader   
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAA.txt   
145|Sarfaraz|Thursday, April 07, 2011 5:34:54 PM|edited objects.|edit $/Systems/DB/Expences/Loader/AAB.txt
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|delete, source rename $/Systems/DB/Rascal/Expences/AAA.txt;X34892   
147|Sarfaraz|Wednesday, April 06, 2011 5:34:54 PM|Initial add, all objects.|rename                $/Systems/DB/Rascal/Expences/AAC.txt.

How can I do this? I'm very new to Python, so please excuse any clumsy or redundant code, and let me know how I can improve it.

Spacy · Accepted Answer

This solution is not as short, and probably not as efficient, as the answer using regular expressions, but it should be quite easy to understand. It also makes the parsed data easier to use, because the data of each section is stored in a dictionary.

    ctl_file = "ctl_Files.txt" # path of source file
    processed_ctl_file = "processed_ctl_Files.txt" # path of destination file

    #Tags - used for splitting the information
    changeset_tag = 'Changeset:'
    user_tag = 'User:'
    date_tag = 'Date:'
    comment_tag = 'Comment:'
    items_tag = 'Items:'
    checkin_tag = 'Check-in Notes:'

    section_separator = "------------------------"
    changesets = []

    #open and read the input file
    with open(ctl_file, 'r') as read_file:
        first_section = True
        changeset_dict = {}
        items = []
        comment_stage = False
        items_stage = False
        checkin_dict = {}
        # Read one line at a time
        for line in read_file:
            # Check which tag matches the current line and store the data to matching key in the dictionary
            if changeset_tag in line:
                # Split on the first colon only; values such as the date contain colons themselves
                changeset = line.split(":", 1)[1].strip()
                changeset_dict[changeset_tag] = changeset
            elif user_tag in line:
                user = line.split(":", 1)[1].strip()
                changeset_dict[user_tag] = user
            elif date_tag in line:
                date = line.split(":", 1)[1].strip()
                changeset_dict[date_tag] = date
            elif comment_tag in line:
                comment_stage = True
            elif items_tag in line:
                items_stage = True
            elif checkin_tag in line:
                pass                        # not implemented, because the example file contains no check-in data
            elif section_separator in line: # new section
                if first_section:
                    first_section = False
                    continue
                tmp = changeset_dict
                changesets.append(tmp)          
                changeset_dict = {}
                items = []
                # Set stages to false just in case
                items_stage = False
                comment_stage = False
            elif not line.strip():  # empty line
                if items_stage:
                    changeset_dict[items_tag] = items
                    items_stage = False
                comment_stage = False
            else:
                if comment_stage:
                    changeset_dict[comment_tag] = line.strip()  # Only works for one line comment  
                elif items_stage:
                    items.append(line.strip())

    #open and write to the output file
    with open(processed_ctl_file, 'w') as write_file:
        for changeset in changesets:        
            row = "{0}|{1}|{2}|{3}|".format(changeset[changeset_tag], changeset[user_tag], changeset[date_tag], changeset[comment_tag])
            distance = len(row)
            items = changeset[items_tag]
            join_string = "\n" + distance * " "
            items_part = join_string.join(items)
            row += items_part + "\n"
            write_file.write(row)
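
The second format requested in the question (one full row per item) needs only a small change to the output step. Below is a minimal sketch, assuming each changeset is a dictionary keyed by the tag strings as in the code above. The helper name `rows_per_item` and the sample data are hypothetical and only there for illustration:

```python
# Hypothetical helper: builds one output row per item, repeating the
# changeset fields on every row. Assumes each changeset is a dict keyed
# by the tag strings ('Changeset:', 'User:', ...) used by the parser.
def rows_per_item(changesets):
    rows = []
    for changeset in changesets:
        prefix = "|".join([
            changeset['Changeset:'],
            changeset['User:'],
            changeset['Date:'],
            changeset.get('Comment:', ''),   # tolerate a missing comment
        ])
        # A changeset without items still yields one row with an empty last field
        for item in changeset.get('Items:', []) or ['']:
            rows.append(prefix + "|" + item)
    return rows


# Illustrative data shaped like the parser's output
sample = [{
    'Changeset:': '145',
    'User:': 'Sarfaraz',
    'Date:': 'Thursday, April 07, 2011 5:34:54 PM',
    'Comment:': 'edited objects.',
    'Items:': ['edit $/Systems/DB/Expences/Loader',
               'edit $/Systems/DB/Expences/Loader/AAA.txt'],
}]

for row in rows_per_item(sample):
    print(row)
```

Writing the rows to the output file is then a matter of joining them with newlines and calling `write` once.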

Also, try to use variable names that describe their content. Names like tag1, tag2, etc. say little about what the variable holds. This makes code difficult to read, especially as scripts get longer. Readability might seem unimportant at first, but when you revisit old code, non-descriptive names make it take much longer to understand what the code does.
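
For comparison, the regular-expression approach mentioned at the start of this answer could look roughly like the sketch below. It is not the code from that other answer. The pattern, the function name `parse_changesets` and the lowercase field names are assumptions made for illustration, and it relies on every section containing all six tags in the order shown in the question:

```python
import re

# Shortened sample in the same layout as ctl_Files.txt
SAMPLE = """------------------------
Changeset: 143
User: Sarfaraz
Date: Tuesday, April 05, 2011 5:34:54 PM

Comment:
  Initial add, all objects.

Items:
  add $/Systems/DB/Expences/Loader
  add $/Systems/DB/Expences/Loader/AAA.txt

Check-in Notes:
  Code Reviewer:
------------------------
"""

# One pattern per section; re.DOTALL lets '.' span the multi-line
# Comment and Items bodies. Each group is lazy and stops at the next tag.
SECTION = re.compile(
    r"Changeset:\s*(?P<changeset>.*?)\s*"
    r"User:\s*(?P<user>.*?)\s*"
    r"Date:\s*(?P<date>.*?)\s*"
    r"Comment:\s*(?P<comment>.*?)\s*"
    r"Items:\s*(?P<items>.*?)\s*"
    r"Check-in Notes:",
    re.DOTALL,
)


def parse_changesets(text):
    """Return one dict per changeset section found in text."""
    result = []
    for match in SECTION.finditer(text):
        fields = match.groupdict()
        # Turn the Items body into a list of stripped, non-empty lines
        fields["items"] = [line.strip()
                           for line in fields["items"].splitlines()
                           if line.strip()]
        # Collapse a multi-line comment into a single line
        fields["comment"] = " ".join(fields["comment"].split())
        result.append(fields)
    return result


for changeset in parse_changesets(SAMPLE):
    print(changeset["changeset"], changeset["items"])
```

One caveat of this sketch: a comment that itself contains one of the tag strings, for example the text "Items:", would confuse the pattern, which the line-by-line version handles more predictably.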

Tags: python, file-io, text-processing

MD SARFARAZ
