Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Removing white space from txt with python

I have a .txt file (scraped as pre-formatted text from a website) where the data looks like this:

B, NICKOLAS                       CT144531X       D1026    JUDGE ANNIE WHITE JOHNSON  
ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        

I'd like to remove all extra spaces (they're actually different number of spaces, not tabs) in between the columns. I'd also then like to replace it with some delimiter (tab or pipe since there's commas within the data), like so:

ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS

Looked around and found that the best options are using regex or shlex to split. Two similar scenarios:

  • Python Regular expression must strip whitespace except between quotes,
  • Remove white spaces from dict : Python.
like image 922
aysha Avatar asked Apr 30 '16 17:04

aysha


1 Answers

You can apply the regex '\s{2,}' (two or more whitespace characters) to each line and substitute the matches with a single '|' character.

>>> import re
>>> line = 'ANDREWS VS BALL                   JA-15-0050      D0015    JUDGE EDWARD A ROBERTS        '
>>> re.sub('\s{2,}', '|', line.strip())
'ANDREWS VS BALL|JA-15-0050|D0015|JUDGE EDWARD A ROBERTS'

Stripping any leading and trailing whitespace from the line before applying re.sub ensures that you won't get '|' characters at the start and end of the line.

Your actual code should look similar to this:

import re
with open(filename) as f:
    for line in f:
        subbed = re.sub('\s{2,}', '|', line.strip())
        # do something here
like image 50
timgeb Avatar answered Sep 23 '22 01:09

timgeb