Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Excel-like text import in Python: automatically parsing fixed width columns

In Excel, if you import whitespace delineated text in which the columns do not line up perfectly and data may be missing, like

  pH             pKa/Em  n(slope) 1000*chi2      vdw0  
CYS-I0014_        >14.0                          0.00  
LYS+I0013_       11.827     0.781     0.440      0.18

you are given the option to treat it as fixed width columns, and Excel can figure out the column widths automatically, usually with pretty good results. Is there a library in Python which can break up poorly formated fixed-width text in a similarly automatic fashion?

EDIT: This is what the fixed width text import looks like in Excel. In step one, you just check the 'fixed-width' radio button, and then here in step two Excel has already added in column breaks automatically. The only time it fails to do so properly is when there is not at least one whitespace character overlapping in each column break in each line.

fixed width text importing in excel

like image 386
tel Avatar asked Jun 14 '11 19:06

tel


1 Answers

First off, Excel (2003, at home) isn't quite so smart. If your column 1000*chi2 contains spaces, e.g. 1000 * chi2, excel will guess wrong.

Trivial case: if your data was originally separated by tabs (not spaces), and multiple tabs were used to indicate empty columns, then, at least in TCL, it's easy to split each line by tab content, and I guess trivial too in Python.

But I'm guessing your problem is that they've used space characters only. The biggest clue I see to solving this one was to paste your text into notepad and choose a fixed size font. Everything lines up neatly, and you can use the number of characters in each line as a measure of "length".

So, IF you can rely on this feature of your input, then you can use a "sieve" approach to identifying where the column breaks are automatically. As you munch through the lines in a first pass, note the "positions" along the line that are occupied by non white-space, eliminating a position from your list if it's EVER occupied by non white space. As you go, you'll quickly arrive at a set of positions that are NEVER occupied by non white space. These, then, are your column dividers. In your example, your "sieve" would end up with positions 10-16, 23-24,32, 42-47 never occupied by non whitespace (assuming I can count). Thus the complement of that set is your set of column positions in which your data must lie. So, foreach line, each block of nonwhitespace will fit into exacly one of your columns from the set of positions (i.e the complement set) identified above. I've never coded in Python, so attached is a TCL script that will identify,using the sieve approach, where the column breaks are in the text, and emit a new text file with exactly those space characters replaced by a single tab - ie. 10-16 replaced by one tab, 23-24 by another, etc. The resulting file is tab-separated, i.e. the trivial case. I confess I've only tried it on YOUR little case data, copied into a text file called ex.txt; output goes to ex_.txt. I suspect it also might have problems if headings contain spaces.

Hope this helps!

set fh [open ex.txt]
set contents [read $fh];#ok for small-to-medium files.
close $fh

#first pass
set occupied {}
set lines [split $contents \n];#split contents at line breaks.
foreach line $lines {
  set chrs [split $line {}];#split each line into chars.
  set pos 0
  foreach chr $chrs {
    if {$chr ne " "} {
      lappend occupied $pos
    }
    incr pos
  }
}

#drop out with long list of occupied "positions": sort to create
#our sieve.
set datacols [lsort -unique -integer $occupied]
puts "occupied: $datacols"

#identify column boundaries.
set colset {}
set start [lindex $datacols 0];#first occupied pos might be > 0??
foreach index $datacols {
  if {$start < $index} {
    set end $index;incr end -1
    lappend colset [list $start $end]
    puts "col break starts at $start, ends at $end";#some instro!
    set start $index
  }
  incr start
}

#Now convert input file to trivial case output file, replacing
#sieved space chars with tab characters.
set tesloc [lreverse $colset];#reverse the column list!
set fh [open ex_.txt w]
foreach line $lines {
  foreach ele $tesloc {
    set line [string replace $line [lindex $ele 0] [lindex $ele 1] "\t" ]
  }
  puts "newline is $line"
  puts $fh $line
}
close $fh
like image 172
Stephen Bissell Avatar answered Sep 27 '22 21:09

Stephen Bissell