 

pandas read_csv fix columns to read data with newline characters in data

Using pandas to read in a large tab-delimited file:

df = pd.read_csv(file_path, sep='\t', encoding='latin-1', dtype=str, keep_default_na=False, na_values='')

The problem is that there are 200 columns and the 3rd column is text with occasional newline characters. The text is not delimited with any special characters. These lines get chopped into multiple lines with data going into the wrong columns.

There are a fixed number of tabs in each line - that is all I have to go on.
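To make the failure mode concrete, here is a minimal sketch (with made-up three-column data) of how an unquoted newline inside a field splits one logical row into two short rows:

```python
import io

import pandas as pd

# Hypothetical sample: 3 columns, and the "text" field of the single
# data row contains a raw, unquoted newline.
raw = "id\ttext\tval\n1\tbroken\nline\t9\n"

df = pd.read_csv(io.StringIO(raw), sep='\t')

# The one logical row is read as two physical rows; the missing trailing
# fields are filled with NaN, so data lands in the wrong columns.
print(df)
```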

Vlad asked Aug 02 '17 06:08


2 Answers

The idea is to use a regular expression to find each run of fields separated by the expected number of tabs and ending in a newline (or the end of the string), then build a dataframe from those matches.

import pandas as pd
import re

def wonky_parser(fn):
    with open(fn) as fh:
        txt = fh.read()
    # The {8} is where the number of tabs per row is specified
    preparse = re.findall(r'(([^\t]*\t[^\t]*){8}(\n|\Z))', txt)
    # t[0] is the full match; strip the trailing newline before splitting
    parsed = [t[0].rstrip('\n').split('\t') for t in preparse]
    return pd.DataFrame(parsed)

Pass a filename to the function and get your dataframe back.
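As a quick sanity check, here is a sketch of the same approach on a made-up three-column file (so {2} tabs instead of the {8} above, with the tab count pulled out into a parameter):

```python
import re
import tempfile

import pandas as pd

def wonky_parser(fn, ntabs):
    """Variant of the answer's parser with the tab count as a parameter."""
    with open(fn) as fh:
        txt = fh.read()
    pattern = r'(([^\t]*\t[^\t]*){%d}(\n|\Z))' % ntabs
    preparse = re.findall(pattern, txt)
    # t[0] is the full match; drop the trailing newline before splitting.
    return pd.DataFrame([t[0].rstrip('\n').split('\t') for t in preparse])

with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as f:
    # The second field of the first row contains an embedded newline.
    f.write("1\thello\nworld\t9\n2\tfoo\t8\n")
    name = f.name

df = wonky_parser(name, 2)
# The multi-line field survives in a single cell of a single row.
print(df)
```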

piRSquared answered Nov 14 '22 23:11


Name your third column:

df.columns.values[2] = "some_name"

and use converters to apply your cleanup function:

pd.read_csv("foo.csv", sep='\t', encoding='latin-1', dtype=str, keep_default_na=False, converters={'some_name': lambda x: x.replace('\n', '')})

You can substitute any cleanup function that works for you inside the lambda.
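A minimal sketch of the converters mechanics, with hypothetical data. Note that converters run per cell after the rows have already been split, so this only helps when the embedded newlines sit inside quoted fields that pandas keeps in one cell:

```python
import io

import pandas as pd

# Hypothetical sample where the multi-line field is quoted, so pandas
# keeps it in one cell; the converter then flattens the embedded newline.
raw = 'id\tsome_name\tval\n1\t"broken\nline"\t9\n'

df = pd.read_csv(io.StringIO(raw), sep='\t',
                 converters={'some_name': lambda x: x.replace('\n', ' ')})
print(df.loc[0, 'some_name'])
```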

Prakash Palnati answered Nov 14 '22 22:11