How do I calculate r-squared using Python and Numpy?

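I'm using Python and Numpy to calculate a best-fit polynomial of arbitrary degree. I pass a list of x values, a list of y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (the coefficient of correlation) and r-squared (the coefficient of determination). I am comparing my results with Excel's best-fit trendline capability and the r-squared value it calculates. From that comparison, I know I am calculating r-squared correctly for a linear best fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here's my function: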

    import numpy

    # Polynomial Regression
    def polyfit(x, y, degree):
        results = {}

        coeffs = numpy.polyfit(x, y, degree)

        # Polynomial Coefficients
        results['polynomial'] = coeffs.tolist()

        correlation = numpy.corrcoef(x, y)[0,1]

        # r
        results['correlation'] = correlation
        # r-squared
        results['determination'] = correlation**2

        return results

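According to the numpy.polyfit documentation, it fits a linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

    E(y|x) = p_d * x**d + p_{d-1} * x**(d-1) + ... + p_1 * x + p_0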
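As a quick illustration of that mean function, the sketch below fits made-up data generated exactly from y = 2*x**2 + 3*x + 1. The data values are assumptions for the example; the point is only that numpy.polyfit returns the coefficients highest power first (p_d, ..., p_1, p_0), and that numpy.poly1d turns them into a callable polynomial.

```python
import numpy as np

# Hypothetical data generated exactly from y = 2*x**2 + 3*x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x**2 + 3 * x + 1

coeffs = np.polyfit(x, y, 2)   # ordered [p_2, p_1, p_0], highest power first
p = np.poly1d(coeffs)          # callable polynomial built from those coefficients

print(np.round(coeffs, 6))
print(p(4.0))                  # evaluates 2*16 + 3*4 + 1
```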

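So you just need to calculate the R-squared for that fit. The Wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being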

    SST = Sum(i=1..n) (y_i - y_bar)^2
    SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
    Rsquared = SSReg/SST

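Here 'y_bar' is the mean of the y's, and 'y_ihat' is the fitted value for each point.

I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct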

    import numpy

    # Polynomial Regression
    def polyfit(x, y, degree):
        results = {}

        # Convert to arrays so that plain Python lists also work below
        x = numpy.asarray(x, dtype=float)
        y = numpy.asarray(y, dtype=float)

        coeffs = numpy.polyfit(x, y, degree)

        # Polynomial Coefficients
        results['polynomial'] = coeffs.tolist()

        # r-squared
        p = numpy.poly1d(coeffs)
        # fit values, and mean
        yhat = p(x)                         # or [p(z) for z in x]
        ybar = numpy.sum(y)/len(y)          # or sum(y)/len(y)
        ssreg = numpy.sum((yhat-ybar)**2)   # or sum([ (yihat - ybar)**2 for yihat in yhat])
        sstot = numpy.sum((y - ybar)**2)    # or sum([ (yi - ybar)**2 for yi in y])
        results['determination'] = ssreg / sstot

        return results
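To see how the SSReg/SST calculation behaves as the degree changes, here is a minimal sketch on made-up data (a parabola plus a small deterministic wiggle). The helper name `poly_r2` and the data are assumptions for illustration, not part of the answer above; the formula is the same one.

```python
import numpy as np

def poly_r2(x, y, degree):
    """R^2 = SSReg / SST for a least-squares polynomial fit (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    yhat = np.polyval(np.polyfit(x, y, degree), x)   # fitted values
    ybar = y.mean()
    return np.sum((yhat - ybar) ** 2) / np.sum((y - ybar) ** 2)

# Made-up data: a parabola plus a small deterministic wiggle
x = np.linspace(-3, 3, 25)
y = x**2 + 0.5 * np.sin(5 * x)

r2_linear = poly_r2(x, y, 1)
r2_quadratic = poly_r2(x, y, 2)
print(r2_linear, r2_quadratic)
```

On data like this, a straight line explains very little of the variation while the quadratic explains most of it, which is the kind of difference the Excel trendline comparison in the question should also show.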

A very late reply, but just in case someone needs a ready-made function for this:

scipy.stats.linregress

i.e.

    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)

as in @Adam Marples's answer.

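From yanl (yet another library): sklearn.metrics has an r2_score function;

    from sklearn.metrics import r2_score

    # p is the fitted polynomial, e.g. numpy.poly1d(numpy.polyfit(x, y, degree))
    coefficient_of_determination = r2_score(y, p(x))

I have been using this successfully, where x and y are array-like.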

    import scipy.stats

    def rsquared(x, y):
        """ Return R^2 where x and y are array-like."""

        slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
        return r_value**2

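I originally posted the benchmarks below to recommend numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher-order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which, while off-topic, may be useful to someone.

statsmodels can calculate the r^2 of a polynomial fit directly; here are 2 methods...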

    import numpy as np
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Construct the columns for the different powers of x
    def get_r2_statsmodels(x, y, k=1):
        xpoly = np.column_stack([x**i for i in range(k+1)])
        return sm.OLS(y, xpoly).fit().rsquared

    # Use the formula API and construct a formula describing the polynomial
    def get_r2_statsmodels_formula(x, y, k=1):
        formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
        data = {'x': x, 'y': y}
        return smf.ols(formula, data).fit().rsquared

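To take further advantage of statsmodels, you should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in a Jupyter/IPython notebook. Besides rsquared, the results object provides many other useful statistical metrics.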

    model = sm.OLS(y, xpoly)
    results = model.fit()
    results.summary()
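The column-stacking idea behind xpoly can also be checked without statsmodels. The sketch below is an assumption-laden illustration (made-up data, plain numpy.linalg.lstsq in place of sm.OLS): it builds the same design matrix, solves the least-squares problem, and confirms that the coefficients match numpy.polyfit once the order is reversed.

```python
import numpy as np

# Made-up sample data for illustration
x = np.linspace(0, 5, 30)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * np.cos(3 * x)
k = 2

# Columns are x**0, x**1, ..., x**k (same layout as xpoly above)
xpoly = np.column_stack([x**i for i in range(k + 1)])

# Ordinary least squares on that design matrix
beta, *_ = np.linalg.lstsq(xpoly, y, rcond=None)

yhat = xpoly @ beta
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# np.polyfit returns the highest power first, so reverse it to compare
same_coeffs = np.allclose(beta, np.polyfit(x, y, k)[::-1])
print(r2, same_coeffs)
```

Because the x**0 column acts as the intercept, this is the same model that the polynomial fit estimates, so the r^2 values should agree as well.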

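Below is my original answer, where I benchmarked various linear regression r^2 methods...

The corrcoef function used in the question calculates the correlation coefficient, r, only for a single linear regression, so it doesn't address the question of r^2 for higher-order polynomial fits. However, for what it's worth, I've found that for linear regression it is indeed the fastest and most direct method of calculating r.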

    def get_r2_numpy_corrcoef(x, y):
        return np.corrcoef(x, y)[0, 1]**2
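One related point, offered as a side note rather than as part of the original benchmark: for a least-squares fit that includes an intercept, the squared correlation between the observed y and the fitted values equals R^2. So corrcoef can still be used for a polynomial fit if it is applied to (y, yhat) instead of (x, y). A minimal sketch on made-up data:

```python
import numpy as np

# Made-up data with a clear curved trend
x = np.linspace(0, 4, 40)
y = 3 - 2 * x + 0.8 * x**2 + 0.4 * np.sin(7 * x)

yhat = np.polyval(np.polyfit(x, y, 2), x)   # fitted values of the quadratic

r2_from_sums = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
r2_from_corr = np.corrcoef(y, yhat)[0, 1] ** 2   # squared correlation of y with yhat

print(r2_from_sums, r2_from_corr)
```

The two printed values should agree up to floating-point rounding.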

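These were my timeit results from comparing a bunch of methods on 1000 random (x, y) points:

Pure Python (direct r calculation)
- 1000 loops, best of 3: 1.59 ms per loop
Numpy polyfit (applicable to n-th degree polynomial fits)
- 1000 loops, best of 3: 326 µs per loop
Numpy Manual (direct r calculation)
- 10000 loops, best of 3: 62.1 µs per loop
Numpy corrcoef (direct r calculation)
- 10000 loops, best of 3: 56.6 µs per loop
Scipy (linear regression with r as an output)
- 1000 loops, best of 3: 676 µs per loop
Statsmodels (can do n-th degree polynomial and many other fits)
- 1000 loops, best of 3: 422 µs per loop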

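The corrcoef method narrowly beats calculating r^2 'manually' using numpy methods. It is more than 5X faster than the polyfit method and about 12X faster than scipy.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure Python. I'm not well versed in things like numba and pypy, so someone else will have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.

Here's my benchmarking code. I copy-pasted it from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.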

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    import math

    n=1000
    x = np.random.rand(1000)*10
    x.sort()
    y = 10 * x + (5+np.random.randn(1000)*10-5)

    x_list = list(x)
    y_list = list(y)

    def get_r2_numpy(x, y):
        slope, intercept = np.polyfit(x, y, 1)
        r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
        return r_squared

    def get_r2_scipy(x, y):
        _, _, r_value, _, _ = stats.linregress(x, y)
        return r_value**2

    def get_r2_statsmodels(x, y):
        return sm.OLS(y, sm.add_constant(x)).fit().rsquared

    def get_r2_python(x_list, y_list):
        n = len(x_list)  # use the argument, not the global x
        x_bar = sum(x_list)/n
        y_bar = sum(y_list)/n
        x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
        y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
        zx = [(xi-x_bar)/x_std for xi in x_list]
        zy = [(yi-y_bar)/y_std for yi in y_list]
        r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
        return r**2

    def get_r2_numpy_manual(x, y):
        zx = (x-np.mean(x))/np.std(x, ddof=1)
        zy = (y-np.mean(y))/np.std(y, ddof=1)
        r = np.sum(zx*zy)/(len(x)-1)
        return r**2

    def get_r2_numpy_corrcoef(x, y):
        return np.corrcoef(x, y)[0, 1]**2

    print('Python')
    %timeit get_r2_python(x_list, y_list)
    print('Numpy polyfit')
    %timeit get_r2_numpy(x, y)
    print('Numpy Manual')
    %timeit get_r2_numpy_manual(x, y)
    print('Numpy corrcoef')
    %timeit get_r2_numpy_corrcoef(x, y)
    print('Scipy')
    %timeit get_r2_scipy(x, y)
    print('Statsmodels')
    %timeit get_r2_statsmodels(x, y)
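For anyone running outside IPython, a rough equivalent of one %timeit line can be sketched with the standard-library timeit module. This is an assumption about how one might reproduce the measurement, not the code that produced the numbers above; the seed and loop counts are arbitrary choices, and absolute timings will differ by machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)          # seeded so reruns use the same data
x = np.sort(rng.random(1000) * 10)
y = 10 * x + rng.standard_normal(1000) * 10

def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2

# Three repeats of 1000 calls each; keep the best repeat,
# mirroring the "best of 3" figures reported above
runs = timeit.repeat(lambda: get_r2_numpy_corrcoef(x, y), number=1000, repeat=3)
best_per_call = min(runs) / 1000
print('numpy corrcoef: {:.1f} us per call'.format(best_per_call * 1e6))
```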

The Wikipedia article on r-squared suggests that it may be used for general model fitting rather than just linear regression.

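R-squared is a statistic that only applies to linear regression.

Essentially, it measures how much of the variation in your data can be explained by the linear regression.

So, you calculate the 'total sum of squares', which is the total squared deviation of each of your outcome variables from their mean...

    \sum_{i} (y_{i} - y_bar)^2

where y_bar is the mean of the y's.

Then, you calculate the 'regression sum of squares', which is how much your FITTED values differ from the mean

    \sum_{i} (yHat_{i} - y_bar)^2

and find the ratio of those two.

Now, all you would have to do for a polynomial fit is plug in the y_hat's from that model, but it's not accurate to call that r-squared.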
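A small numerical check of the two sums of squares described above may help here. The sketch assumes made-up data and two hypothetical helper names (`r2_ratio`, `r2_residual`). When y_hat comes from a least-squares fit with an intercept (a polynomial fit qualifies, since it is linear in its coefficients), the ratio SSReg/SST equals 1 - SSE/SST. For arbitrary predictions the two can disagree, and the ratio can even exceed 1.

```python
import numpy as np

def r2_ratio(y, yhat):
    """SSReg / SST."""
    return np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

def r2_residual(y, yhat):
    """1 - SSE / SST."""
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Case 1: yhat comes from a least-squares polynomial fit (made-up data)
x = np.linspace(-2, 2, 21)
y = x**3 - x + 0.3 * np.cos(4 * x)
yhat = np.polyval(np.polyfit(x, y, 3), x)
print(r2_ratio(y, yhat), r2_residual(y, yhat))      # agree up to rounding

# Case 2: arbitrary predictions that are not a least-squares fit
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(r2_ratio(y_true, y_pred), r2_residual(y_true, y_pred))   # no longer agree
```

This is one reason the 1 - SSE/SST form is the safer one to use when the predictions may come from something other than an ordinary least-squares fit.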

Here's a link I found that speaks to it a little.

Here is a function to compute the weighted r-squared with Python and Numpy (most of the code comes from sklearn):

    from __future__ import division
    import numpy as np

    def compute_r2_weighted(y_true, y_pred, weight):
        sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
        tse = (weight * (y_true - np.average(
            y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse

Example:

    from __future__ import print_function, division
    import numpy as np
    import sklearn.metrics

    def compute_r2_weighted(y_true, y_pred, weight):
        sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
        tse = (weight * (y_true - np.average(
            y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse

    def compute_r2(y_true, y_predicted):
        sse = sum((y_true - y_predicted)**2)
        tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse

    def main():
        '''
        Demonstrate the use of compute_r2_weighted() and checks the results against sklearn
        '''
        y_true = [3, -0.5, 2, 7]
        y_pred = [2.5, 0.0, 2, 8]
        weight = [1, 5, 1, 2]
        r2_score = sklearn.metrics.r2_score(y_true, y_pred)
        print('r2_score: {0}'.format(r2_score))
        r2_score,_,_ = compute_r2(np.array(y_true), np.array(y_pred))
        print('r2_score: {0}'.format(r2_score))
        # sample_weight is passed by keyword
        r2_score = sklearn.metrics.r2_score(y_true, y_pred, sample_weight=weight)
        print('r2_score weighted: {0}'.format(r2_score))
        r2_score,_,_ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
        print('r2_score weighted: {0}'.format(r2_score))

    if __name__ == "__main__":
        main()
        #cProfile.run('main()') # if you want to do some profiling

Output:

    r2_score: 0.9486081370449679
    r2_score: 0.9486081370449679
    r2_score weighted: 0.9573170731707317
    r2_score weighted: 0.9573170731707317

This corresponds to the formula (mirror):

    R^2 = 1 - SSE/SST = 1 - \sum_{i} w_i (y_i - f_i)^2 / \sum_{i} w_i (y_i - y_{av})^2

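where f_i is the predicted value from the fit, y_{av} is the mean of the observed data (the weighted mean when weights are used, as in the code above), and y_i is the observed data value. w_i is the weighting applied to each data point, usually w_i = 1. SSE is the sum of squares due to error, and SST is the total sum of squares.

If interested, the code in R: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f (mirror)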
