Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: read_csv indicating 'space-delimited'

Tags:

python

io

pandas

I have the following file.txt (abridged):

SICcode        Catcode        Category                              SICname        MultSIC
0111        A1500        Wheat, corn, soybeans and cash grain        Wheat        X
0112        A1600        Other commodities (incl rice, peanuts)      Rice        X
0115        A1500        Wheat, corn, soybeans and cash grain        Corn        X
0116        A1500        Wheat, corn, soybeans and cash grain        Soybeans        X
0119        A1500        Wheat, corn, soybeans and cash grain        Cash grains, NEC        X
0131        A1100        Cotton        Cotton        X
0132        A1300        Tobacco & Tobacco products                  Tobacco        X

I'm having some problems reading it into a pandas df. I tried pd.read_csvwith the following specifications engine='python', sep='Tab'but it returned the file in one column:

    SICcode Catcode Category SICname MultSIC
0   0111 A1500 Wheat, corn, soybeans...
1   0112 A1600 Other commodities (in...
2   0115 A1500 Wheat, corn, soybeans...
3   0116 A1500 Wheat, corn, soybeans...

Then I tried to put it into a gnumeric file using 'tab' as a delimiter, but it read the file as one column. Does anyone have an idea on this?

like image 608
Collective Action Avatar asked Mar 13 '23 12:03

Collective Action


2 Answers

If df = pd.read_csv('file.txt', sep='\t') returns a DataFrame with one column, then apparently file.txt is not using tabs as separators. Your data might simply have spaces as separators. In that case you could try

df = pd.read_csv('data', sep=r'\s{2,}')

which uses the regex pattern \s{2,} as the separator. This regex matches 2-or-more whitespace characters.

In [8]: df
Out[8]: 
   SICcode Catcode                                Category           SICname  \
0      111   A1500    Wheat, corn, soybeans and cash grain             Wheat   
1      112   A1600  Other commodities (incl rice, peanuts)              Rice   
2      115   A1500    Wheat, corn, soybeans and cash grain              Corn   
3      116   A1500    Wheat, corn, soybeans and cash grain          Soybeans   
4      119   A1500    Wheat, corn, soybeans and cash grain  Cash grains, NEC   
5      131   A1100                                  Cotton            Cotton   
6      132   A1300              Tobacco & Tobacco products           Tobacco   

  MultSIC  
0       X  
1       X  
2       X  
3       X  
4       X  
5       X  
6       X  

If this does not work, please post print(repr(open(file.txt, 'rb').read(100)). This will show us an unambiguous representation of the first 100 bytes of file.txt.

like image 62
unutbu Avatar answered Mar 16 '23 00:03

unutbu


I think you can try add sep="\t" to read_csv if data in csv are separated by Tabulator.

import pandas as pd

df = pd.read_csv('test/a.csv', sep="\t")
print df
   SICcode Catcode                               Category           SICname  \
0      111   A1500   Wheat, corn, soybeans and cash grain             Wheat   
1      112   A1600  ther commodities (incl rice, peanuts)              Rice   
2      115   A1500   Wheat, corn, soybeans and cash grain              Corn   
3      116   A1500   Wheat, corn, soybeans and cash grain          Soybeans   
4      119   A1500   Wheat, corn, soybeans and cash grain  Cash grains, NEC   
5      131   A1100                                 Cotton            Cotton   
6      132   A1300             Tobacco & Tobacco products           Tobacco   

  MultSIC  
0       X  
1       X  
2       X  
3       X  
4       X  
5       X  
6       X  
like image 30
jezrael Avatar answered Mar 15 '23 23:03

jezrael