Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading a csv file into pandas dataframe with quotation in some entries

Tags:

python

pandas

csv

I have a csv file like

id,body,category,subcategory,number,smstype,smsflag

50043,"Dear Customer,Thank you for registering",,,DM-YEBA,inbox,0

30082,Congrats! Your account has been activated.,,,DM-SBAW,inbox,0

when i'm using pd.read_csv() then the whole first observation is included in the id column and is not separated among other columns due to the double quotes used for the message body, while in second observation the line is properly separated among the columns.

What should I do such that the first observation is seperated among all columns like in this image enter image description here

see what actually pd.read_csv is doing. It's taking the whole observation in id column vhjn

when i am opening the csv file in notepad it's adding extra quotation marks to the whole row which is eventually causing the fiasco and the quotation mark already in the file are escaped with another ' " ' as shown below.

id,body,category,subcategory,number,smstype,smsflag

"50043,""Dear Customer,Thank you for registering"",,,DM-YEBA,inbox,0"

30082,Congrats! Your account has been activated.,,,DM-SBAW,inbox,0

like image 998
Enji Avatar asked Sep 26 '18 12:09

Enji


1 Answers

The main problem lies in the way csv file of microsoft excel is actually saved. When the same csv file is opened in notepad it adds extra quotation marks in the lines which have quotes.

1) It adds quote at the starting and ending of the line.

2) It escapes the existing quotes with one more quote. Hence, when we import our csv file in pandas it takes the whole line as one string and thus it ends up all in the first column.

To tackle this-

I imported the csv file and corrected the csv by applying regex substitution and saved it as text file. Then i imported this text file as pandas dataframe. Problem Solved.

with open('csvdata.csv','r+') as csv_file:
for line in csv_file:
    # removing starting and ending quotes of a line
    pattern1 = re.compile(r'^"|"$',re.MULTILINE)
    line = re.sub(r'^"|"$',"",line)
    # substituting escaped quote with a single quote
    pattern2 = re.compile(r'""')
    line = re.sub(r'""','"',line)

    corrected_csv = open("new_csv.txt",'a')
    corrected_csv.write(line)
    corrected_csv.close()
like image 56
Enji Avatar answered Sep 30 '22 15:09

Enji