
Pandas pd.read_csv does not work for simple sep=','

Tags: python, pandas, csv

Good afternoon, everybody.

I know this is a fairly easy question, but I simply do not understand why it does not work the way I expected.

The task is as follows:

I have a file data.csv in this format:

id,"feature_1","feature_2","feature_3"
00100429,"PROTO","Proprietary","Phone"
00100429,"PROTO","Proprietary","Phone"

The goal is to import this data using pandas. I know that pandas read_csv uses a comma separator by default, so I just imported it as follows:

data = pd.read_csv('data.csv')

The result I got is the same as what I showed at the beginning, with no change at all: a single column that contains everything.

I tried many other separators using regex, and the only one that brought any improvement was:

data = pd.read_csv('data.csv', sep=r"\,", engine='python')

On the one hand, it finally separated all the columns; on the other hand, the resulting data is not convenient to use. In particular:

"id         ""feature_1""   ""feature_2""   ""feature_3"""
"00100429   ""PROTO""       ""Proprietary"" ""Phone"""

Therefore, I think there must be a mistake somewhere, because the data itself seems to be fine.

So the question is: how do I import a CSV file with separated columns and without the tripled quote characters?

Thank you.

Kakalukia asked Sep 12 '25 16:09


1 Answer

Here's a quick solution to your problem:

import numpy as np
import pandas as pd

### Read the file, treating the header as an ordinary first row, then remove all the double quotes
### (a regex separator needs the python engine; regex=False makes the replacement a literal one)
df = pd.read_csv('data.csv', sep=r'\,', header=None, engine='python').apply(lambda x: x.str.replace('"', '', regex=False))
df

          0          1            2          3
0        id  feature_1    feature_2  feature_3
1  00100429      PROTO  Proprietary      Phone
2  00100429      PROTO  Proprietary      Phone

### Put the column names back and drop the first row
df.columns = df.iloc[0]
df.drop(index=0, inplace=True)
df

         id  feature_1    feature_2  feature_3
1  00100429      PROTO  Proprietary      Phone
2  00100429      PROTO  Proprietary      Phone

### The index now starts at 1; you can reset it if needed
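
Resetting the index, as mentioned in the comment above, could look roughly like this. This is only a sketch: the small frame built here is a hypothetical stand-in for the `df` produced by the previous step, so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the frame above: the index starts at 1
# because row 0 (the old header row) was dropped.
df = pd.DataFrame(
    {
        "id": ["00100429", "00100429"],
        "feature_1": ["PROTO", "PROTO"],
        "feature_2": ["Proprietary", "Proprietary"],
        "feature_3": ["Phone", "Phone"],
    },
    index=[1, 2],
)

# drop=True discards the old index instead of keeping it as a new column.
df = df.reset_index(drop=True)
print(df)
```

After this the rows are labelled from 0 again, which is usually what you want before any positional work.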

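### Convert the `id` column back to an integer dtype (change according to your needs)
### Note that this drops the leading zeros; `np.int` was removed from recent NumPy, so use `np.int64`

df['id'] = df['id'].astype(np.int64)
df['id'].dtype

dtype('int64')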
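
As an alternative to the regex separator, you could tell pandas to ignore quoting altogether. This is a different technique from the one above, and it rests on an assumption: the output in the question suggests that every line of the real file is wrapped in an outer pair of quotes with the inner quotes doubled, so the `raw` string below is a hypothetical reconstruction of that file, not its actual content. Reading with `dtype=str` also keeps the leading zeros in `id`.

```python
import csv
import io

import pandas as pd

# Hypothetical raw content (an assumption): each line is wrapped in outer
# quotes and the inner quotes are doubled.
raw = (
    '"id,""feature_1"",""feature_2"",""feature_3"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
    '"00100429,""PROTO"",""Proprietary"",""Phone"""\n'
)

# QUOTE_NONE makes the parser treat quote characters as ordinary text,
# so the lines are split on every comma. dtype=str keeps ids as strings.
df = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE, dtype=str)

# Strip the leftover quote characters from the header and from every cell.
df.columns = df.columns.str.replace('"', '', regex=False)
df = df.apply(lambda col: col.str.replace('"', '', regex=False))
print(df)
```

With a real file you would pass the path instead of `io.StringIO(raw)`. If the values themselves can contain commas this approach will split them too, so check a few raw lines first.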
sync11 answered Sep 15 '25 21:09


