 

Length mismatch error when assigning new column labels in pandas dataframe

Tags: python, pandas

The tab file I'm working with is missing the final column name. When I attempt to repair the header by appending the missing value, I get a mismatch error. Here's an example to illustrate the problem:

toy example

There should be a '' as the last element of the first list:

missingcol = [['gene', 'cell_1', '', 'cell_2'],
              ['MYC', 5.0, 'P', 4.0, 'A'],
              ['AKT', 3.0, 'A', 1.0, 'P']]

To fix this, I read the first line, appended a '', loaded missingcol into a pandas dataframe with header=None and skipping the first row, and redefined the column names with the modified header, like so:

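import pandas as pd

# list.append mutates missingcol[0] in place and returns None,
# so fullheader is None after this line...
fullheader = missingcol[0].append('')
# ...and is rebound here to the header list itself
fullheader = missingcol[0]

missingcol_dropheader = missingcol[1:]

df = pd.DataFrame(missingcol_dropheader, columns=fullheader)
df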

Which gives me the error:

AssertionError: 4 columns passed, passed data had 5 columns

Last I checked, the new fullheader does, in fact, have 5 elements to match the five elements in the data frame. What is causing this continued mismatch and how do I fix it?

real example

I get a similar error when I repeat the same steps on my actual data using read_csv. Here I ignore the header at line 0 and the three blank lines at lines 1-3, and drop an unwanted first column, but otherwise the approach is the same:

with open('CCLE_Expression_Entrez_2012-10-18.res', 'r') as f:
    header = f.readline().strip().split('\t')
header.append('') # missing empty colname over last A/P col

rnadf = pd.read_csv('CCLE_Expression_Entrez_2012-10-18.res', delimiter='\t', index_col=0, header=None, skiprows=[0,1,2,3])  
rnadf.columns = header
rnadf.drop([], axis=1, inplace=True)
rnadf.columns = header

ValueError: Length mismatch: Expected axis has 2073 elements, new values have 2074 elements

This error is very similar to the one from the toy example. What makes it different from the toy example, and how do I fix it?

asked Apr 13 '16 19:04 by Thomas Matthew


1 Answer

The problem was that the argument index_col=0 turned the first column (the gene names) into the row index, so it no longer counted as a data column.

With header=None, the remaining columns were labeled 1 through 2073, which is 2073 elements: one fewer than my repaired header. This generated the following error:

ValueError: Length mismatch: Expected axis has 2073 elements, new values have 2074 elements
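The effect of index_col on the column count can be reproduced on a tiny stand-in for the file. This is only a sketch: the two-row tab-separated string below is made up for illustration and is not the CCLE data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the .res file body: one gene column plus four data columns.
data = "MYC\t5.0\tP\t4.0\tA\nAKT\t3.0\tA\t1.0\tP\n"

# index_col=0 moves the first column into the row index.
with_index = pd.read_csv(io.StringIO(data), delimiter="\t", header=None, index_col=0)

# index_col=None keeps the first column as ordinary data.
without_index = pd.read_csv(io.StringIO(data), delimiter="\t", header=None, index_col=None)

print(with_index.shape[1], with_index.columns.tolist())
print(without_index.shape[1], without_index.columns.tolist())
```

The first frame reports one column fewer than the second, which is the same off-by-one that appears in the error message.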

The same read_csv command with index_col=None instead assigns a separate numerical index and puts the gene names back into the dataframe as an ordinary column, rather than leaving them as row labels.

That dataframe's columns are labeled 0 through 2073, which is 2074 elements with zero-based indexing: the same length as my repaired header! Problem solved.
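A minimal sketch of the whole repair, assuming a miniature file whose header line lacks the final empty column name. The string contents and the final set_index step are illustrative assumptions, not part of the original workflow.

```python
import io
import pandas as pd

# Hypothetical miniature file: the header has 4 names, the data rows have 5 fields.
raw = "gene\tcell_1\t\tcell_2\nMYC\t5.0\tP\t4.0\tA\nAKT\t3.0\tA\t1.0\tP\n"

# Repair the header by appending the missing empty column name.
header = raw.splitlines()[0].split("\t")
header.append("")

# Read the data rows with index_col=None so every field stays a data column.
df = pd.read_csv(io.StringIO(raw), delimiter="\t", header=None,
                 index_col=None, skiprows=[0])

# Check the lengths before assigning, then label the columns.
assert len(header) == df.shape[1]
df.columns = header

# Optionally move the gene names into the index once the labels are in place.
df = df.set_index("gene")
print(df)
```

Assigning the labels first and setting the index afterwards keeps the header and the data the same length at the moment of assignment.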

answered Nov 07 '22 15:11 by Thomas Matthew