I have a text file where the columns are separated by variable amounts of whitespace. Is it possible to load this file directly as a pandas dataframe without pre-processing the file? The delimiter section of the pandas documentation says I can use an 's*' construct, but I couldn't get this to work.
## sample data
head sample.txt
# --- full sequence --- -------------- this domain ------------- hmm coord ali coord env coord
# target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
ABC_membrane PF00664.18 275 AAF67494.2_AF170880 - 615 8e-29 100.7 11.4 1 1 3e-32 1e-28 100.4 7.9 3 273 42 313 40 315 0.95 ABC transporter transmembrane region
ABC_tran PF00005.22 118 AAF67494.2_AF170880 - 615 2.6e-20 72.8 0.0 1 1 1.9e-23 6.4e-20 71.5 0.0 1 118 402 527 402 527 0.93 ABC transporter
SMC_N PF02463.14 220 AAF67494.2_AF170880 - 615 3.8e-08 32.7 0.2 1 2 0.0036 12 4.9 0.0 27 40 391 404 383 408 0.86 RecF/RecN/SMC N terminal domain
SMC_N PF02463.14 220 AAF67494.2_AF170880 - 615 3.8e-08 32.7 0.2 2 2 1.8e-09 6.1e-06 25.4 0.0 116 210 461 568 428 575 0.85 RecF/RecN/SMC N terminal domain
AAA_16 PF13191.1 166 AAF67494.2_AF170880 - 615 3.1e-06 27.5 0.3 1 1 2e-09 7e-06 26.4 0.2 20 158 386 544 376 556 0.72 AAA ATPase domain
YceG PF02618.11 297 AAF67495.1_AF170880 - 284 3.4e-64 216.6 0.0 1 1 2.9e-68 4e-64 216.3 0.0 68 296 53 274 29 275 0.85 YceG-like family
Pyr_redox_3 PF13738.1 203 AAF67496.2_AF170880 - 352 2.9e-28 99.1 0.0 1 2 2.8e-30 4.8e-27 95.2 0.0 1 201 4 198 4 200 0.85 Pyridine nucleotide-disulphide oxidoreductase
#load data
from pandas import *
data = read_table('sample.txt', skiprows=3, header=None, sep=" ")
ValueError: Expecting 83 columns, got 91 in row 4
#load data part 2
data = read_table('sample.txt', skiprows=3, header=None, sep="'s*' ")
#this mushes some of the columns into the first column and drops the rest.
X.1
1 ABC_tran PF00005.22 118 AAF67494.2_
2 SMC_N PF02463.14 220 AAF67494.2_
3 SMC_N PF02463.14 220 AAF67494.2_
4 AAA_16 PF13191.1 166 AAF67494.2_
5 YceG PF02618.11 297 AAF67495.1_
6 Pyr_redox_3 PF13738.1 203 AAF67496.2_
7 Pyr_redox_3 PF13738.1 203 AAF67496.2_
8 FMO-like PF00743.14 532 AAF67496.2_
9 FMO-like PF00743.14 532 AAF67496.2_
While I can preprocess the files to change the whitespace to commas or tabs, it would be nice to load them directly.
(FYI this is the *.hmmdomtblout output from the hmmscan program)
You should be able to just do this, which @DSM just taught me in another thread:
data = read_table('sample.txt', skiprows=3, header=None, delim_whitespace=True)
Documentation
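A sketch of the equivalent call with a regex separator, which is what newer pandas versions steer you towards (delim_whitespace has since been deprecated in favour of sep=r"\s+"):
import pandas as pd

# split on any run of whitespace; same effect as delim_whitespace=True
data = pd.read_csv('sample.txt', skiprows=3, header=None, sep=r"\s+")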
I think there's just a missing \ in the docs (maybe because it was interpreted as an escape marker at some point?). It's a regexp, after all:
In [68]: data = read_table('sample.txt', skiprows=3, header=None, sep=r"\s*")
In [69]: data
Out[69]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns:
X.1 7 non-null values
X.2 7 non-null values
X.3 7 non-null values
X.4 7 non-null values
X.5 7 non-null values
X.6 7 non-null values
[...]
X.23 7 non-null values
X.24 7 non-null values
X.25 5 non-null values
X.26 3 non-null values
dtypes: float64(8), int64(10), object(8)
Because of the delimiter problem noted by @MRAB, it has some trouble with the last few columns:
In [73]: data.ix[:,20:]
Out[73]:
X.21 X.22 X.23 X.24 X.25 X.26
0 315 0.95 ABC transporter transmembrane region
1 527 0.93 ABC transporter None None
2 408 0.86 RecF/RecN/SMC N terminal domain
3 575 0.85 RecF/RecN/SMC N terminal domain
4 556 0.72 AAA ATPase domain None
5 275 0.85 YceG-like family None None
6 200 0.85 Pyridine nucleotide-disulphide oxidoreductase None
but that can be patched up at the end.
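For example, one rough way to patch it up (a sketch, assuming the 22 fixed columns of the hmmscan domain table shown above, with everything after them belonging to the free-text description):
import pandas as pd

# split on runs of whitespace
data = pd.read_csv('sample.txt', skiprows=3, header=None, sep=r"\s+")

# the first 22 fields are fixed; the description was split on its own spaces,
# so glue the overflow columns back together into one text column
fixed = data.iloc[:, :22].copy()
fixed['description'] = (data.iloc[:, 22:]
                        .apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1))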
None of the given answers works in a case like this:
Block  Col name with spaces  col3
   1  6.141754e+003  2.998903e+000
2048  6.154461e+003  6.010216e+000
that is, two or more spaces are used as separators, but the column names can themselves contain one space.
In such a case, we need a regular expression for two or more spaces. This will work:
sep=r"[ ]{2,}"
But again, the drawback is that it triggers the Python parsing engine.
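Roughly like this, as a self-contained sketch (the data is inlined here just so it runs on its own; engine='python' makes the regex-separator fallback explicit):
import io
import pandas as pd

text = ("Block  Col name with spaces  col3\n"
        "1  6.141754e+003  2.998903e+000\n"
        "2048  6.154461e+003  6.010216e+000\n")

# two or more spaces act as the separator; single spaces inside names survive
df = pd.read_csv(io.StringIO(text), sep=r"[ ]{2,}", engine="python")
# df.columns -> Index(['Block', 'Col name with spaces', 'col3'], dtype='object')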