
Import CSV with a different number of columns per row using Pandas

What is the best approach for importing a CSV that has a different number of columns in each row into a Pandas DataFrame, using either Pandas or the csv module?

    "H","BBB","D","Ajxxx Dxxxs"
    "R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"

Using this code:

    import pandas as pd
    data = pd.read_csv("smallsample.txt", header=None)

the following error is generated:

Error tokenizing data. C error: Expected 4 fields in line 2, saw 8 
Erich asked Nov 19 '14 15:11



2 Answers

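Supplying a list of column names to read_csv() through the names parameter should do the trick. The list needs at least as many entries as the widest row has fields, so the sample in the question would need eight names; the five below only illustrate the syntax.

ex: names=['a', 'b', 'c', 'd', 'e']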
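A minimal sketch of this approach, assuming the two-line sample from the question (inlined here through io.StringIO so it runs without a file). The eight single-letter names are placeholders, not part of the original answer:

```python
import io

import pandas as pd

# Two-line sample from the question, inlined so the sketch is self-contained
sample = (
    '"H","BBB","D","Ajxxx Dxxxs"\n'
    '"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
)

# Placeholder column names; eight are used because the widest row has eight fields
names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

# With names supplied, shorter rows are padded with missing values
df = pd.read_csv(io.StringIO(sample), header=None, names=names)
print(df.shape)
```

The first row has only four fields, so its columns 'e' through 'h' should come back as missing values.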

https://github.com/pydata/pandas/issues/2981

Edit: if you don't want to supply column names then do what Nicholas suggested

Bob Haffner answered Oct 25 '22 07:10


You can dynamically generate column names as simple counters (0, 1, 2, etc).

Dynamically generate column names

    import csv
    import pandas as pd

    # Input
    data_file = "smallsample.txt"

    # Delimiter
    data_file_delimiter = ','

    # The max column count a line in the file could have
    largest_column_count = 0

    # Loop over the data lines; csv.reader respects quoting, so a
    # delimiter inside a quoted field is not counted as a separator
    with open(data_file, 'r', newline='') as temp_f:
        for row in csv.reader(temp_f, delimiter=data_file_delimiter):
            # Keep the largest column count seen so far
            largest_column_count = max(largest_column_count, len(row))

    # Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
    column_names = list(range(largest_column_count))

    # Read csv
    df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
    # print(df)

Columns for which a CSV line has no value will be filled with missing values (NaN).
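The column-counting step can be checked on the sample from the question. This is only a sketch, with the sample inlined as a string: it compares a plain split on the delimiter with the standard csv module, since the second row contains a comma inside a quoted field.

```python
import csv
import io

# Two-line sample from the question, inlined so the sketch is self-contained
sample = (
    '"H","BBB","D","Ajxxx Dxxxs"\n'
    '"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
)

# Naive count: splits on every comma, including the one inside quotes
naive_counts = [len(line.split(',')) for line in sample.splitlines()]

# Quote-aware count: csv.reader keeps a quoted field together
parsed_counts = [len(row) for row in csv.reader(io.StringIO(sample))]

print(naive_counts)   # [4, 9]
print(parsed_counts)  # [4, 8]
```

An overcount is harmless in the sense that read_csv still succeeds, but it leaves extra all-NaN columns at the end of the DataFrame.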

P-S answered Oct 25 '22 08:10