(Python) Breaking an output text file into tokens

Question

Short story: I have an output file that comes from a system. Every field in it is wrapped in pipes and the fields are separated by commas (|value|,|value|). I need to get the content between the pipes "|" and write it to another file.

This is what the output file looks like:

|Operation_ID|,|Operation_Name|,|business_group_name|,|business_unit_name|,|Program_ID|,|Program_Name|,|Project_ID|,|Project_Name|,|Program_Type_Name|,|Program_Cost_Type_Name|,|Start_date|,|Estimated_End_Date|,|End_Date|,|SQA_Name|,|CMA_Name|,|SSE_Name|,|PMs|,|TLs|,|PortfolioManager|,|Finished|,|Research|,|SQA_ID|,|CMA_ID|,|SSE_ID|
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2163|,|QQQ|,||,||,|15/12/2008|,||,|22/01/2009|,||,||,||,|EEE EEE |,||,||,|True|,||,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2165|,|QQQ|,||,||,|01/01/2009|,||,|09/04/2010|,||,||,||,|EEE EEE EEE|,||,||,|True|,|False|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|10|,|WWW|,|2164|,|QQQ|,|Development|,|Direct|,|15/12/2008|,||,|26/02/2010|,||,||,||,|EEE |,|EEE EEE ; EEE EEE ; EEE EEE |,||,|True|,|False|,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2166|,|QQQ|,||,||,|15/12/2008|,||,|31/05/2010|,||,||,||,||,||,||,|True|,|False|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|10|,|WWW|,|2168|,|QQQ|,|Development|,|Direct|,|05/01/2009|,||,|20/05/2009|,||,||,||,|EEE EEE EEE|,|EEE EEE |,||,|True|,||,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2169|,|QQQ|,||,||,|13/01/2009|,||,|22/05/2009|,||,||,||,|EEE EEE EEE|,|EEE EEE EEE EEE|,||,|True|,||,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|2|,|WWW|,|2174|,|QQQ|,||,||,|08/01/2009|,||,|20/04/2009|,||,||,||,|EEE EEE |,|EEE EEE|,||,|True|,||,||,||,||
|23|,|XXX|,|YYY|,|ZZZ|,|47|,|WWW|,|2176|,|QQQ|,|Internal|,|Indirect|,|21/01/2009|,||,|17/12/2010|,||,||,||,|EEE EEE; EEE EEE|,||,||,|True|,|True|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2142|,|QQQ|,||,||,|21/10/2008|,||,|13/05/2009|,||,||,||,|EEE EEE |,||,||,|True|,||,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2147|,|QQQ|,||,||,|07/11/2008|,||,|26/11/2008|,||,||,||,|EEE EEE EEE EEE |,|EEE EEE |,||,|True|,||,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2148|,|QQQ|,||,||,|07/11/2008|,||,|09/04/2009|,||,||,||,||,||,||,|True|,||,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2149|,|QQQ|,||,||,|01/11/2008|,|31/01/2011|,|01/12/2010|,||,||,||,|EEE EEE ; EEE EEE|,|EEE EEE; EEE EEE|,||,|True|,|False|,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|20|,|WWW|,|2150|,|QQQ|,|Development|,||,|31/10/2008|,|31/10/2010|,|29/10/2010|,||,||,||,|EEE EEE |,|EEE EEE |,||,|True|,|False|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2152|,|QQQ|,||,||,|26/11/2008|,||,|03/07/2009|,||,||,||,|EEE EEE EEE ; EEE EEE EEE EEE |,|EEE EEE |,||,|True|,||,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2151|,|QQQ|,||,||,|01/11/2008|,||,|29/01/2009|,||,||,||,||,||,||,|True|,||,||,||,||
|23|,|XXX|,|YYY|,|ZZZ|,|47|,|WWW|,|2187|,|QQQ|,|Internal|,|Indirect|,|21/01/2009|,||,|03/12/2009|,||,||,||,|EEE EEE|,|EEE EEE EEE|,||,|True|,|True|,||,||,||
|23|,|XXX|,|YYY|,|ZZZ|,|47|,|WWW|,|2192|,|QQQ|,|Internal|,|Indirect|,|21/01/2009|,||,|11/01/2011|,||,||,||,|EEE EEE EEE; EEE EEE|,||,||,|True|,|True|,||,||,||
|20|,|XXX|,|YYY|,|ZZZ|,|1|,|WWW|,|2196|,|QQQ|,||,||,|23/01/2009|,||,|24/03/2010|,||,||,||,|EEE EEE |,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2231|,|QQQ|,|Research|,||,|21/05/2009|,||,|01/12/2009|,||,||,||,||,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2230|,|QQQ|,|Research|,||,|21/05/2009|,||,|30/11/2009|,||,||,||,||,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2232|,|QQQ|,|Research|,||,|21/05/2009|,||,|09/07/2010|,||,||,||,||,|EEE EEE EEE|,||,|True|,|True|,||,||,||
|24|,|XXX|,|YYY|,|ZZZ|,|44|,|WWW|,|2237|,|QQQ|,|Research|,|Indirect|,|21/05/2009|,||,|22/01/2010|,||,||,||,||,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2238|,|QQQ|,|Research|,||,|21/05/2009|,||,|25/02/2010|,||,||,||,||,||,||,|True|,|False|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2239|,|QQQ|,|Research|,||,|21/05/2009|,||,|04/01/2011|,||,||,||,||,||,||,|True|,|True|,||,||,||
|21|,|XXX|,|YYY|,|ZZZ|,|41|,|WWW|,|2240|,|QQQ|,|Research|,||,|21/05/2009|,||,|05/01/2011|,||,||,||,||,||,||,|True|,|True|,||,||,||
|26|,|XXX|,|YYY|,|ZZZ|,|50|,|WWW|,|2242|,|QQQ|,|Internal|,|Indirect|,|21/05/2009|,||,|14/10/2010|,||,||,||,||,||,||,|True|,|True|,||,||,||
|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2273|,|QQQ|,||,||,|25/05/2009|,||,|29/01/2010|,||,||,||,||,|EEE EEE|,||,|True|,|False|,||,||,||

I'm new to python/programming in general, so I tried writing the following algorithm:

# => Reads the file test.txt;
# => Scans character by character for '|' character;
# => If character '|' is found, skips to next character and add subsequent
# characters to a 'token' array, until next character is '|' again;
# => When next character is '|', add 'token' array to 'array_of_tokens';
# => Once END OF FILE arrives, writes 'array_of_tokens' to 'test_output.txt'
# file;


test_file = 'test.txt'
test_output = 'test_output.txt'
token = []
array_of_tokens = []
index = 0

# => Reads the file test.txt;
with open(test_file) as file:
    while True:
        # => Scans character by character for '|' character;
        character = file.read(1)
        # => If character '|' is found,
        if character == "|"
            # skips to next character
            character = next(character),
            # until next character is '|' again;
            while not character == '|'
                # add subsequent characters to a 'token' array
                token(index) = character
                index ++
                character = next(character)
            # => When next character is '|', add 'token' array to 'array_of_tokens';
            if next(character) == '|'
                array_of_tokens = token

        else if not character:
            break
        print "Read a character: ", character

# => Once END OF FILE arrives, writes 'array_of_tokens' to 'test_output.txt'
# file;
test_output.write(str(array_of_tokens))

And it obviously did not work. The thing is, I'm not entirely sure what I'm supposed to do now. I know the result I need (written in the comments), but I'm not sure how to make my code work. Can anybody help, please? Also, if there are any tips on advice or resources I could look into to become a better programmer, a real one, I'd deeply appreciate it!

Thanks in advance!

Padraic Cunningham · Accepted Answer

In Python 2, just use str.translate to remove the "|" characters, split on "," and filter out the empty strings:

In [9]: s="|22|,|XXX|,|YYY|,|ZZZ|,|3|,|WWW|,|2273|,|QQQ|,||,||,|25/05/2009|,||,|29/01/2010|,||,||,||,||,|EEE EEE|,||,|True|,|False|,||,||,||"



In [10]: print(filter(None,s.translate(None,"|").split(",")))
['22', 'XXX', 'YYY', 'ZZZ', '3', 'WWW', '2273', 'QQQ', '25/05/2009', '29/01/2010', 'EEE EEE', 'True', 'False']

If you need the data to line up with the columns, don't filter.
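As a rough illustration of why the empty fields matter, here is a minimal Python 3 sketch that keeps them so each value can be paired with its column name. The helper name split_fields and the shortened header and row are made up for this example, and it uses str.replace rather than str.translate to drop the pipes:

```python
def split_fields(line):
    """Split one '|value|,|value|' line into a list, keeping empty fields."""
    # Drop the trailing newline, remove every pipe, then split on commas.
    return line.rstrip("\n").replace("|", "").split(",")

# Shortened, hypothetical header and row in the same layout as the real file.
header = "|Operation_ID|,|Operation_Name|,|Program_Type_Name|,|Finished|"
row = "|20|,|XXX|,||,|True|"

# Because nothing was filtered, positions still match the header.
record = dict(zip(split_fields(header), split_fields(row)))
print(record)
```

With filtering, the empty Program_Type_Name would disappear and 'True' would shift into its place.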

So using your input, all you need is something like the following, depending on how you want to write the data to the output file:

import csv

with open("test.txt") as f, open('test_output.txt',"w") as out:
    wr = csv.writer(out, delimiter=",")
    for line in f:
        wr.writerow(filter(None, line.rstrip().translate(None, "|").split(",")))

Your output will be:

Operation_ID,Operation_Name,business_group_name,business_unit_name,Program_ID,Program_Name,Project_ID,Project_Name,Program_Type_Name,Program_Cost_Type_Name,Start_date,Estimated_End_Date,End_Date,SQA_Name,CMA_Name,SSE_Name,PMs,TLs,PortfolioManager,Finished,Research,SQA_ID,CMA_ID,SSE_ID
20,XXX,YYY,ZZZ,1,WWW,2163,QQQ,15/12/2008,22/01/2009,EEE EEE ,True
22,XXX,YYY,ZZZ,3,WWW,2165,QQQ,01/01/2009,09/04/2010,EEE EEE EEE,True,False
20,XXX,YYY,ZZZ,10,WWW,2164,QQQ,Development,Direct,15/12/2008,26/02/2010,EEE ,EEE EEE ; EEE EEE ; EEE EEE ,True,False
22,XXX,YYY,ZZZ,3,WWW,2166,QQQ,15/12/2008,31/05/2010,True,False
20,XXX,YYY,ZZZ,10,WWW,2168,QQQ,Development,Direct,05/01/2009,20/05/2009,EEE EEE EEE,EEE EEE ,True
20,XXX,YYY,ZZZ,1,WWW,2169,QQQ,13/01/2009,22/05/2009,EEE EEE EEE,EEE EEE EEE EEE,True
etc.

As tdelaney mentioned in a comment, this does presume you don't have any pipes inside the pipe-delimited fields.
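If the fields might contain commas, one possible alternative (not part of the approach above) is to let the csv module treat "|" as the quote character, so it does the tokenizing itself. This is only a sketch in Python 3: the function name parse_pipe_quoted and the sample line, including the made-up 'a,b' field, are assumptions for illustration. A literal pipe inside a field would still only parse cleanly if the exporting system escapes it somehow, which I can't tell from the sample data.

```python
import csv
import io

def parse_pipe_quoted(text):
    """Parse text whose fields are wrapped in pipes and separated by commas."""
    # quotechar="|" tells csv that pipes surround each field, so commas
    # inside a field are kept and empty fields (||) become empty strings.
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar="|")
    return [row for row in reader]

# Hypothetical line; the last field contains a comma on purpose.
sample = "|20|,|XXX|,||,|EEE EEE ; EEE EEE |,|a,b|\n"
rows = parse_pipe_quoted(sample)
print(rows)
```

For a real file you would pass the open file object to csv.reader instead of an io.StringIO wrapper.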

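For Python 3 we need to do a bit more work, as str.translate is slightly different. We need to use str.maketrans to create a table that deletes the "|" characters:

import csv

# newline="" is what the csv docs recommend for files passed to csv.writer
with open("test.txt") as f, open('test_output.txt', "w", newline="") as out:
    wr = csv.writer(out, delimiter=",")
    # Three-argument form: no character mappings, delete every "|"
    table = str.maketrans("", "", "|")
    for line in f:
        wr.writerow(list(filter(None, line.rstrip().translate(table).split(","))))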

Another approach would be to just split on "|" and filter out the commas and empty strings:

import csv

with open("test.txt") as f, open('test_output.txt', "w") as out:
    wr = csv.writer(out, delimiter=",")
    for line in f:
        wr.writerow(filter(lambda x: x not in {",", ""}, line.rstrip().split("|")))
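To see why filtering out "," and "" works, it may help to look at what split("|") actually returns. The short line below is a made-up sample in the same layout, and the slicing trick at the end is an extra suggestion that assumes no field contains a pipe:

```python
line = "|22|,|XXX|,||,|True|"

# Splitting on "|" alternates between separators and field contents.
pieces = line.split("|")
print(pieces)

# Dropping "," and "" leaves only the non-empty field values.
tokens = [p for p in pieces if p not in {",", ""}]
print(tokens)

# Field contents sit at the odd positions, so slicing keeps empty fields too.
aligned = pieces[1::2]
print(aligned)
```

The slice version keeps one entry per column, which is useful if the output has to line up with the header.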

Tags:

python

io

token

Filipe Gorges Reuwsaat
