Using pandas to read in a large tab-delimited file
df = pd.read_csv(file_path, sep='\t', encoding='latin-1', dtype=str, keep_default_na=False, na_values='')
The problem is that there are 200 columns, and the 3rd column is text with occasional newline characters. The text is not quoted or delimited by any special characters, so these records get split across multiple lines and data ends up in the wrong columns.
There is a fixed number of tabs in each record - that is all I have to go on.
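To make this concrete, here is a minimal sketch with hypothetical data: one 4-column record (3 tabs) whose third field contains a newline, which read_csv splits into two ragged rows:

import io
import pandas as pd

# One logical record: 4 columns / 3 tabs; column 3 holds an embedded newline.
raw = "id1\tA\tfirst half\nsecond half\tZ\n"

# read_csv treats the embedded newline as a row break and returns two
# ragged rows instead of one record:
#              0  1           2
# 0          id1  A  first half
# 1  second half  Z         NaN
print(pd.read_csv(io.StringIO(raw), sep='\t', header=None))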
The idea is to use a regex to find every run of text separated by the expected number of tabs and ending in a newline (or end of file), then build a DataFrame from those matches.
import pandas as pd
import re

def wonky_parser(fn):
    # Read the whole file at once; records can span lines, so we
    # cannot iterate line by line.
    with open(fn, encoding='latin-1') as f:
        txt = f.read()
    # A record is any run of text containing exactly 8 tabs, ending at a
    # newline or at end of file ({8} is where I specified 8 tabs).
    preparse = re.findall(r'(([^\t]*\t[^\t]*){8}(\n|\Z))', txt)
    # t[0] is the full match for each record; split it on tabs.
    parsed = [t[0].split('\t') for t in preparse]
    return pd.DataFrame(parsed)
Pass a filename to the function and get your dataframe back.
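A minimal usage sketch (the filename is hypothetical; change the 8 in the regex to match your file's tab count):

df = wonky_parser('big_export.tsv')   # hypothetical path to the wonky file
print(df.shape)                       # every row now has 9 columns (8 tabs)
print(df.iloc[0, 2])                  # the text column, embedded newlines intact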
Name your third column:

df.rename(columns={df.columns[2]: "some_name"}, inplace=True)

and use converters to pass your function:
pd.read_csv("foo.csv", sep='\t', encoding='latin 1', dtype = str, keep_default_na=False, converters={'some_name':lambda x:x.replace('/n','')})
You can use any manipulation function that works for you in place of the lambda.
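For instance, a named function can stand in for the lambda once the cleanup grows beyond one expression; the column name 'some_name' and the cleanup steps below are illustrative:

import pandas as pd

def clean_text(value):
    # Collapse embedded newlines to spaces and trim stray whitespace.
    return value.replace('\n', ' ').strip()

df = pd.read_csv("foo.csv", sep='\t', encoding='latin-1', dtype=str,
                 keep_default_na=False, converters={'some_name': clean_text})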