I have a text file where the columns are separated by variable amounts of whitespace. Is it possible to load this file directly as a pandas dataframe without pre-processing the file? The delimiter section of the pandas documentation says I can use an 's*' construct, but I couldn't get this to work.
## sample data
head sample.txt
# --- full sequence --- -------------- this domain ------------- hmm coord ali coord env coord
# target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
ABC_membrane PF00664.18 275 AAF67494.2_AF170880 - 615 8e-29 100.7 11.4 1 1 3e-32 1e-28 100.4 7.9 3 273 42 313 40 315 0.95 ABC transporter transmembrane region
ABC_tran PF00005.22 118 AAF67494.2_AF170880 - 615 2.6e-20 72.8 0.0 1 1 1.9e-23 6.4e-20 71.5 0.0 1 118 402 527 402 527 0.93 ABC transporter
SMC_N PF02463.14 220 AAF67494.2_AF170880 - 615 3.8e-08 32.7 0.2 1 2 0.0036 12 4.9 0.0 27 40 391 404 383 408 0.86 RecF/RecN/SMC N terminal domain
SMC_N PF02463.14 220 AAF67494.2_AF170880 - 615 3.8e-08 32.7 0.2 2 2 1.8e-09 6.1e-06 25.4 0.0 116 210 461 568 428 575 0.85 RecF/RecN/SMC N terminal domain
AAA_16 PF13191.1 166 AAF67494.2_AF170880 - 615 3.1e-06 27.5 0.3 1 1 2e-09 7e-06 26.4 0.2 20 158 386 544 376 556 0.72 AAA ATPase domain
YceG PF02618.11 297 AAF67495.1_AF170880 - 284 3.4e-64 216.6 0.0 1 1 2.9e-68 4e-64 216.3 0.0 68 296 53 274 29 275 0.85 YceG-like family
Pyr_redox_3 PF13738.1 203 AAF67496.2_AF170880 - 352 2.9e-28 99.1 0.0 1 2 2.8e-30 4.8e-27 95.2 0.0 1 201 4 198 4 200 0.85 Pyridine nucleotide-disulphide oxidoreductase
#load data
from pandas import *
data = read_table('sample.txt', skiprows=3, header=None, sep=" ")
ValueError: Expecting 83 columns, got 91 in row 4
#load data part 2
data = read_table('sample.txt', skiprows=3, header=None, sep="'s*' ")
#this mushes some of the columns into the first column and drops the rest.
X.1
1 ABC_tran PF00005.22 118 AAF67494.2_
2 SMC_N PF02463.14 220 AAF67494.2_
3 SMC_N PF02463.14 220 AAF67494.2_
4 AAA_16 PF13191.1 166 AAF67494.2_
5 YceG PF02618.11 297 AAF67495.1_
6 Pyr_redox_3 PF13738.1 203 AAF67496.2_
7 Pyr_redox_3 PF13738.1 203 AAF67496.2_
8 FMO-like PF00743.14 532 AAF67496.2_
9 FMO-like PF00743.14 532 AAF67496.2_
While I can preprocess the files to change the whitespace to commas or tabs, it would be nice to load them directly.
(FYI this is the *.hmmdomtblout output from the hmmscan program)
You should be able to just do this, which @DSM just taught me in another thread:
data = read_table('sample.txt', skiprows=3, header=None, delim_whitespace=True)
Documentation
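A sketch of the equivalent call with a regex separator, which is what newer pandas versions steer you towards (delim_whitespace has since been deprecated in favour of sep=r"\s+"):
import pandas as pd

# split on any run of whitespace; same effect as delim_whitespace=True
data = pd.read_csv('sample.txt', skiprows=3, header=None, sep=r"\s+")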
I think there's just a missing \ in the docs (maybe because it was interpreted as an escape marker at some point?). It's a regexp, after all:
In [68]: data = read_table('sample.txt', skiprows=3, header=None, sep=r"\s*")
In [69]: data
Out[69]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns:
X.1 7 non-null values
X.2 7 non-null values
X.3 7 non-null values
X.4 7 non-null values
X.5 7 non-null values
X.6 7 non-null values
[...]
X.23 7 non-null values
X.24 7 non-null values
X.25 5 non-null values
X.26 3 non-null values
dtypes: float64(8), int64(10), object(8)
Because of the delimiter problem noted by @MRAB, it has some trouble with the last few columns:
In [73]: data.ix[:,20:]
Out[73]:
X.21 X.22 X.23 X.24 X.25 X.26
0 315 0.95 ABC transporter transmembrane region
1 527 0.93 ABC transporter None None
2 408 0.86 RecF/RecN/SMC N terminal domain
3 575 0.85 RecF/RecN/SMC N terminal domain
4 556 0.72 AAA ATPase domain None
5 275 0.85 YceG-like family None None
6 200 0.85 Pyridine nucleotide-disulphide oxidoreductase None
but that can be patched up at the end.
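For example, one rough way to patch it up (a sketch, assuming the 22 fixed columns of the hmmscan domain table shown above, with everything after them belonging to the free-text description):
import pandas as pd

# split on runs of whitespace
data = pd.read_csv('sample.txt', skiprows=3, header=None, sep=r"\s+")

# the first 22 fields are fixed; the description was split on its own spaces,
# so glue the overflow columns back together into one text column
fixed = data.iloc[:, :22].copy()
fixed['description'] = (data.iloc[:, 22:]
                        .apply(lambda row: ' '.join(row.dropna().astype(str)), axis=1))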
None of the given answers works in a case like this:
Block  Col name with spaces  col3
   1  6.141754e+003  2.998903e+000
2048  6.154461e+003  6.010216e+000
that is, two or more spaces are used as separators, but the column names can themselves contain one space.
In such a case, we need a regular expression for two or more spaces. This will work:
sep=r"[ ]{2,}"
But again, the drawback is that it triggers the Python parsing engine.
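Roughly like this, as a self-contained sketch (the data is inlined here just so it runs on its own; engine='python' makes the regex-separator fallback explicit):
import io
import pandas as pd

text = ("Block  Col name with spaces  col3\n"
        "1  6.141754e+003  2.998903e+000\n"
        "2048  6.154461e+003  6.010216e+000\n")

# two or more spaces act as the separator; single spaces inside names survive
df = pd.read_csv(io.StringIO(text), sep=r"[ ]{2,}", engine="python")
# df.columns -> Index(['Block', 'Col name with spaces', 'col3'], dtype='object')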