What is the best approach for importing a CSV that has a different number of columns in each row into a Pandas DataFrame, using Pandas or the csv module?
"H","BBB","D","Ajxxx Dxxxs"
"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"
Using this code:
import pandas as pd
data = pd.read_csv("smallsample.txt", header=None)
the following error is generated:
Error tokenizing data. C error: Expected 4 fields in line 2, saw 8
Supplying a list of column names through the names argument of read_csv() should do the trick.
ex: names=['a', 'b', 'c', 'd', 'e']
https://github.com/pydata/pandas/issues/2981
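As a minimal sketch of this suggestion, assuming the widest row has eight fields (as in the sample from the question), passing eight names lets the parser accept the shorter first row. The in-memory `io.StringIO` buffer stands in for `smallsample.txt`, and the letter names are arbitrary placeholders, not anything the file defines.

```python
import io

import pandas as pd

# Sample rows from the question, held in memory instead of smallsample.txt
sample = (
    '"H","BBB","D","Ajxxx Dxxxs"\n'
    '"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
)

# One name per column of the widest row (8 here); the names are illustrative
names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

df = pd.read_csv(io.StringIO(sample), header=None, names=names)

# The 4-field first row is padded with NaN in columns e through h
print(df)
```

The list has to be at least as long as the widest row, so this works best when the maximum field count is known in advance.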
Edit: if you don't want to supply column names then do what Nicholas suggested
You can dynamically generate column names as simple counters (0, 1, 2, etc).
Dynamically generate column names
import csv

import pandas as pd

# Input
data_file = "smallsample.txt"

# Delimiter
data_file_delimiter = ','

# The max column count a line in the file could have
largest_column_count = 0

# Loop over the data lines; csv.reader is used instead of str.split so that
# a delimiter inside a quoted field is not counted as a column break
with open(data_file, 'r', newline='') as temp_f:
    for row in csv.reader(temp_f, delimiter=data_file_delimiter):
        # Keep the largest column count seen so far
        largest_column_count = max(largest_column_count, len(row))

# Generate column names (will be 0, 1, 2, ..., largest_column_count - 1)
column_names = list(range(largest_column_count))

# Read csv
df = pd.read_csv(data_file, header=None, delimiter=data_file_delimiter, names=column_names)
# print(df)
Missing values will be assigned to the columns which your CSV lines don't have a value for.
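Since the question also mentions the csv module, here is a hedged alternative sketch that is not taken from the answers above: parse the rows with `csv.reader` and hand the resulting list of lists to the `DataFrame` constructor, which pads the shorter rows with missing values. The `io.StringIO` buffer again stands in for the real file.

```python
import csv
import io

import pandas as pd

# Sample rows from the question, held in memory instead of smallsample.txt
sample = (
    '"H","BBB","D","Ajxxx Dxxxs"\n'
    '"R","1","QH","DTR"," "," ","spxxt rixxls, raxxxd","1"\n'
)

# csv.reader honours quoting, so the comma inside "spxxt rixxls, raxxxd"
# stays within a single field
rows = list(csv.reader(io.StringIO(sample)))

# Rows of unequal length are padded with missing values; columns are 0..7
df = pd.DataFrame(rows)
print(df)
```

With this route every value stays a string, because no type inference is applied; columns can be converted afterwards, for example with `pd.to_numeric`, where that is wanted.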