Linear regression with pandas dataframe

I have a dataframe in pandas that I'm using to produce a scatterplot, and want to include a regression line for the plot. Right now I'm trying to do this with polyfit.

Here's my code:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from numpy import *

table1 = pd.DataFrame.from_csv('upregulated_genes.txt', sep='\t', header=0, index_col=0)
table2 = pd.DataFrame.from_csv('misson_genes.txt', sep='\t', header=0, index_col=0)
table1 = table1.join(table2, how='outer')

table1 = table1.dropna(how='any')
table1 = table1.replace('#DIV/0!', 0)

# scatterplot
plt.scatter(table1['log2 fold change misson'], table1['log2 fold change'])
plt.ylabel('log2 expression fold change')
plt.xlabel('log2 expression fold change Misson et al. 2005')
plt.title('Root Early Upregulated Genes')
plt.axis([0,12,-5,12])

# this is the part I'm unsure about
regres = polyfit(table1['log2 fold change misson'], table1['log2 fold change'], 1)

plt.show()

But I get the following error:

TypeError: cannot concatenate 'str' and 'float' objects

Does anyone know where I'm going wrong here? I'm also unsure how to add the regression line to my plot. Any other general comments on my code would be hugely appreciated; I'm still a beginner.

asked Oct 15 '13 10:10

TimStuart

1 Answer

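Instead of replacing '#DIV/0!' by hand, force the data to be numeric. Replacing '#DIV/0!' with 0 most likely leaves the other values in that column as strings, which is why polyfit raises the TypeError. Forcing the data to be numeric does two things at once: it ensures that the result has a numeric dtype (not str), and it substitutes NaN for any entries that cannot be parsed as a number. Example: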

In [5]: pd.to_numeric(pd.Series([1, 2, 'blah', '#DIV/0!']), errors='coerce')
Out[5]: 
0    1.0
1    2.0
2    NaN
3    NaN
dtype: float64
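
Applied to a whole DataFrame, the same idea might look like the sketch below. The column names are taken from the question, but the sample values are made up for illustration, and the table is built inline rather than read from the asker's files. Note that the rows with missing values are dropped after the conversion, not before it.

```python
import pandas as pd

# Hypothetical stand-in for the joined table in the question; values are made up.
table1 = pd.DataFrame({
    'log2 fold change misson': ['1.5', '2.0', '#DIV/0!', '4.0'],
    'log2 fold change': [1.2, 2.1, 3.3, None],
})

# Coerce every column to numeric; unparseable entries such as '#DIV/0!' become NaN.
table1 = table1.apply(pd.to_numeric, errors='coerce')

# Drop rows where either value is missing, after the coercion rather than before.
table1 = table1.dropna(how='any')

print(table1.dtypes)
print(len(table1))
```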

This should fix your error. But, on the general subject of fitting a line to data, I keep handy two ways of doing this that I like better than polyfit. The second of the two is more robust (and can potentially return much more detailed information about the statistics) but it requires statsmodels.

import pandas as pd
from scipy.stats import linregress
def fit_line1(x, y):
    """Return slope, intercept of best fit line."""
    # Remove entries where either x or y is NaN.
    clean_data = pd.concat([x, y], axis=1).dropna(axis=0) # row-wise
    (_, x), (_, y) = clean_data.items()
    slope, intercept, r, p, stderr = linregress(x, y)
    return slope, intercept # could also return stderr

import statsmodels.api as sm
def fit_line2(x, y):
    """Return slope, intercept of best fit line."""
    X = sm.add_constant(x) # prepends a constant column for the intercept
    model = sm.OLS(y, X, missing='drop') # ignores entries where x or y is NaN
    fit = model.fit()
    intercept, slope = fit.params # constant term comes first
    return slope, intercept # could also return stderr in each via fit.bse

To plot it, do something like

import numpy as np
# x and y are the two numeric columns being compared
m, b = fit_line2(x, y)
N = 100 # could be just 2 if you are only drawing a straight line...
points = np.linspace(x.min(), x.max(), N)
plt.plot(points, m*points + b)
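
If you would rather stay with polyfit, as in the question, it should also work once the columns are numeric and the NaN rows are masked out. The following is a minimal sketch on made-up data (points lying exactly on y = 2x + 1, plus one row with a missing value), not the asker's actual table:

```python
import numpy as np
import pandas as pd

# Made-up data lying exactly on y = 2x + 1, plus one row with a missing x value.
x = pd.Series([0.0, 1.0, 2.0, 3.0, np.nan])
y = pd.Series([1.0, 3.0, 5.0, 7.0, 9.0])

# polyfit does not skip NaN, so keep only rows where both values are present.
mask = x.notna() & y.notna()
m, b = np.polyfit(x[mask], y[mask], 1)  # degree 1 returns [slope, intercept]

# Two points are enough to draw a straight line across the x range.
points = np.linspace(x.min(), x.max(), 2)
line = m * points + b
print(m, b)
```

Passing points and line to plt.plot then draws the fitted line on top of the existing scatter plot.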

answered Sep 29 '22 22:09

Dan Allan

Tags: python, pandas, matplotlib, numpy, regression
