 

pandas read_csv fix columns to read data with newline characters in data

Using pandas to read in a large tab-delimited file:

df = pd.read_csv(file_path, sep='\t', encoding='latin-1', dtype=str, keep_default_na=False, na_values='')

The problem is that there are 200 columns and the 3rd column is text with occasional newline characters. The text is not delimited with any special characters. These lines get chopped into multiple lines with data going into the wrong columns.

There are a fixed number of tabs in each line - that is all I have to go on.
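To make the failure mode concrete, here is a minimal sketch (with made-up three-column data) of how an unquoted newline inside a field splits one logical row into two short rows:

```python
import io

import pandas as pd

# Hypothetical sample: 3 columns, and the "text" field of the single
# data row contains a raw, unquoted newline.
raw = "id\ttext\tval\n1\tbroken\nline\t9\n"

df = pd.read_csv(io.StringIO(raw), sep='\t')

# The one logical row is read as two physical rows; the missing trailing
# fields are filled with NaN, so data lands in the wrong columns.
print(df)
```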

Vlad asked Aug 02 '17 06:08


2 Answers

The idea is to use a regular expression to find each run of fields separated by the expected number of tabs and ending in a newline (or the end of the string), then build a dataframe from those matches.

import pandas as pd
import re

def wonky_parser(fn):
    with open(fn) as fh:
        txt = fh.read()
    # The {8} is where the number of tabs per row is specified
    preparse = re.findall(r'(([^\t]*\t[^\t]*){8}(\n|\Z))', txt)
    # t[0] is the full match; strip the trailing newline before splitting
    parsed = [t[0].rstrip('\n').split('\t') for t in preparse]
    return pd.DataFrame(parsed)

Pass a filename to the function and get your dataframe back.
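As a quick sanity check, here is a sketch of the same approach on a made-up three-column file (so {2} tabs instead of the {8} above, with the tab count pulled out into a parameter):

```python
import re
import tempfile

import pandas as pd

def wonky_parser(fn, ntabs):
    """Variant of the answer's parser with the tab count as a parameter."""
    with open(fn) as fh:
        txt = fh.read()
    pattern = r'(([^\t]*\t[^\t]*){%d}(\n|\Z))' % ntabs
    preparse = re.findall(pattern, txt)
    # t[0] is the full match; drop the trailing newline before splitting.
    return pd.DataFrame([t[0].rstrip('\n').split('\t') for t in preparse])

with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as f:
    # The second field of the first row contains an embedded newline.
    f.write("1\thello\nworld\t9\n2\tfoo\t8\n")
    name = f.name

df = wonky_parser(name, 2)
# The multi-line field survives in a single cell of a single row.
print(df)
```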

piRSquared answered Nov 14 '22 23:11


Name your third column:

df.columns.values[2] = "some_name"

and use converters to apply your cleanup function:

pd.read_csv("foo.csv", sep='\t', encoding='latin-1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})

You can substitute any cleanup function that works for you inside the lambda.
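A minimal sketch of the converters mechanics, with hypothetical data. Note that converters run per cell after the rows have already been split, so this only helps when the embedded newlines sit inside quoted fields that pandas keeps in one cell:

```python
import io

import pandas as pd

# Hypothetical sample where the multi-line field is quoted, so pandas
# keeps it in one cell; the converter then flattens the embedded newline.
raw = 'id\tsome_name\tval\n1\t"broken\nline"\t9\n'

df = pd.read_csv(io.StringIO(raw), sep='\t',
                 converters={'some_name': lambda x: x.replace('\n', ' ')})
print(df.loc[0, 'some_name'])
```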

Prakash Palnati answered Nov 14 '22 22:11