How do I read the following two-column data (from a .dat file) with Pandas?
TIME                      XGSM
2004 006 01 00 01 37 600     1
2004 006 01 00 02 32 800     5
2004 006 01 00 03 28 000     8
2004 006 01 00 04 23 200    11
2004 006 01 00 05 18 400    17
Column separator is (at least) 2 spaces.
I tried
df = pd.read_table("test.dat", sep=r"\s+", usecols=['TIME', 'XGSM'])
print(df)
But it prints
TIME XGSM
2004 6
2004 6
2004 6
2004 6
2004 6
We can read data from a text file using read_table() in pandas. This function reads a general delimited file into a DataFrame. It is essentially the same as read_csv(), except that the default delimiter is a tab ('\t') instead of a comma.
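As a quick check of that relationship, here is a minimal sketch comparing read_table() with read_csv(sep="\t"). The tab-separated sample is made up for illustration and is not the data from the question.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated sample, not the data from the question.
raw = "name\tvalue\na\t1\nb\t2\n"

via_table = pd.read_table(StringIO(raw))        # tab is the default separator
via_csv = pd.read_csv(StringIO(raw), sep="\t")  # same parser, explicit separator

# Both calls should produce the same two-column DataFrame.
print(via_table.equals(via_csv))
```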
Most DAT files contain text, so you can open them with text editors, like Notepad, Notepad++, VS Code, and so on. If you are sure the information contained in the DAT file is a video or audio, then your media player can open it. If it's a PDF, then Adobe Reader can open it, and so on.
You can use the usecols parameter with the positions of the columns:
import pandas as pd
from io import StringIO
temp = u"""TIME                      XGSM
2004 006 01 00 01 37 600     1
2004 006 01 00 02 32 800     5
2004 006 01 00 03 28 000     8
2004 006 01 00 04 23 200    11
2004 006 01 00 05 18 400    17"""
# after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",           # split on any run of whitespace
                 skiprows=1,           # skip the original header line
                 usecols=[0, 7],       # keep the first and last whitespace-separated fields
                 names=['TIME', 'XGSM'])
print(df)
   TIME  XGSM
0  2004     1
1  2004     5
2  2004     8
3  2004    11
4  2004    17
Edit:
You can use a regex separator that matches two or more spaces, and add engine='python' to avoid this warning:
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
import pandas as pd
from io import StringIO
temp = u"""TIME                      XGSM
2004 006 01 00 01 37 600     1
2004 006 01 00 02 32 800     5
2004 006 01 00 03 28 000     8
2004 006 01 00 04 23 200    11
2004 006 01 00 05 18 400    17"""
# after testing, replace StringIO(temp) with the filename
# \s{2,} splits only on runs of two or more whitespace characters
df = pd.read_csv(StringIO(temp), sep=r'\s{2,}', engine='python')
print(df)
                       TIME  XGSM
0  2004 006 01 00 01 37 600     1
1  2004 006 01 00 02 32 800     5
2  2004 006 01 00 03 28 000     8
3  2004 006 01 00 04 23 200    11
4  2004 006 01 00 05 18 400    17
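To see why the choice of separator matters, here is a small sketch that uses only the standard re module (not pandas itself) on a single data row. It assumes, as the question states, that the value column is separated from the time fields by at least two spaces.

```python
import re

# One data row: the TIME fields are separated by single spaces,
# the XGSM value by two or more (spacing assumed from the question).
line = "2004 006 01 00 01 37 600   1"

# \s+ splits on every run of whitespace, so the row falls apart into 8 fields.
fields_any = re.split(r"\s+", line)
print(len(fields_any), fields_any)

# \s{2,} splits only on runs of two or more whitespace characters, giving 2 fields.
fields_wide = re.split(r"\s{2,}", line)
print(len(fields_wide), fields_wide)
```

This is the same distinction pandas makes between sep=r"\s+" and sep=r"\s{2,}": the first sees eight columns per row, the second sees two.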
You could also try pd.read_fwf(), which reads a table of fixed-width formatted lines into a DataFrame:
import pandas as pd
from io import StringIO
pd.read_fwf(StringIO("""TIME XGSM
2004 006 01 00 01 37 600 1
2004 006 01 00 02 32 800 5
2004 006 01 00 03 28 000 8
2004 006 01 00 04 23 200 11
2004 006 01 00 05 18 400 17"""), usecols = ["TIME", "XGSM"])
# TIME XGSM
#0 2004 1
#1 2004 5
#2 2004 8
#3 2004 11
#4 2004 17
I also ran into this problem when importing a file that contains a lot of whitespace. I solved it by using
pd.read_fwf(file_name)
If you want to import a fixed-width text file, read_fwf may be the solution, and it can read directly from a file name without needing StringIO.
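As a rough sketch of reading straight from a path, the example below writes a small fixed-width file to a temporary location and reads it back with read_fwf. The column widths (24 characters for TIME, 6 for XGSM), the temporary file, and the explicit colspecs argument are assumptions made for illustration. Passing colspecs is one way to keep the space-separated TIME fields together rather than relying on the inferred column boundaries.

```python
import os
import tempfile

import pandas as pd

# Build a small fixed-width file: TIME left-aligned in 24 characters,
# XGSM right-aligned in 6. These widths are illustrative assumptions.
rows = [
    ("TIME", "XGSM"),
    ("2004 006 01 00 01 37 600", "1"),
    ("2004 006 01 00 02 32 800", "5"),
    ("2004 006 01 00 03 28 000", "8"),
]
text = "\n".join(f"{t:<24}{x:>6}" for t, x in rows) + "\n"

with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as fh:
    fh.write(text)
    path = fh.name

try:
    # colspecs are half-open (start, end) character intervals, one per column.
    df = pd.read_fwf(path, colspecs=[(0, 24), (24, 30)])
finally:
    os.remove(path)  # clean up the temporary file

print(df)
```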