scikit-learn LinearRegression 1.1.1: Ordinary Least Squares


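The ordinary linear regression model is:

$\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p$

In this formula, $w = (w_1,...,w_p)$ are the weights (some books and articles call them parameters or coefficients), and $w_0$ is the intercept, often written b (the bias). Linear regression uses an optimization procedure to find the best-fitting w and b, which are then used for prediction.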
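To make the formula concrete, here is a minimal sketch that evaluates $\hat{y}$ with NumPy. The weights and samples are hypothetical values chosen only for illustration; they are not fitted from any data.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration.
w0 = 1.0                     # intercept w_0 (the bias b)
w = np.array([2.0, -0.5])    # weights w_1, w_2

# Two samples with p = 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# y_hat = w_0 + w_1 * x_1 + ... + w_p * x_p, evaluated for every row of X.
y_hat = w0 + X @ w
print(y_hat)  # [2. 5.]
```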

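Using scikit-learn:

LinearRegression fits a linear model with coefficients $w = (w_1,...,w_p)$ so as to minimize the sum of squared differences between the observed targets in the dataset and the values predicted by the model. This is the core idea of the least squares method. Mathematically:

$\min_{w} \|Xw - y\|_2^2$ (where $Xw$ is the vector of predicted values and $y$ the vector of true values)

LinearRegression takes the arrays X and y in its fit method (X is the input, y the output) and stores the coefficients $w$ of the fitted linear model in its coef_ attribute.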

>>> from sklearn import linear_model
>>> clf = linear_model.LinearRegression()
>>> clf.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> clf.coef_
array([ 0.5,  0.5])
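In this toy dataset the two feature columns are identical, so every w with w_1 + w_2 = 1 fits the data exactly. The sketch below is an illustration, not a description of scikit-learn's internals: it solves the same centered least-squares problem with np.linalg.lstsq, which returns the minimum-norm solution, and compares the result with the fitted estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])

# Center X and y so that the intercept drops out of the least-squares problem.
X_mean, y_mean = X.mean(axis=0), y.mean()
w, *_ = np.linalg.lstsq(X - X_mean, y - y_mean, rcond=None)

# Recover the intercept from the means.
b = y_mean - X_mean @ w

model = LinearRegression().fit(X, y)
print("lstsq:  ", w, b)
print("sklearn:", model.coef_, model.intercept_)
```

Both approaches should report coefficients of 0.5 and 0.5 with an intercept of (numerically) zero.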

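However, the coefficient estimates of ordinary least squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix $X$ are approximately linearly dependent, the design matrix becomes close to singular. The least-squares estimate then becomes highly sensitive to random errors in the observed response and has a large variance. This situation, called multicollinearity, can arise in practice, for example when data are collected without an experimental design.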
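The effect can be observed numerically. The following sketch uses synthetic data (all sizes, noise levels and the helper name coef_spread are assumptions made for illustration). It refits the model on repeated noisy targets and compares how much the coefficients vary when the second feature is independent of the first versus when it is almost a copy of it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + 1e-3 * rng.normal(size=n)   # almost a copy of x1


def coef_spread(x2, n_trials=200):
    """Return cond(X) and the std of fitted coefficients over noisy refits."""
    X = np.column_stack([x1, x2])
    coefs = []
    for _ in range(n_trials):
        # Same true model every time; only the random error changes.
        y = x1 + x2 + rng.normal(scale=0.1, size=n)
        coefs.append(LinearRegression().fit(X, y).coef_)
    return np.linalg.cond(X), np.std(coefs, axis=0)


cond_ind, spread_ind = coef_spread(x2_independent)
cond_col, spread_col = coef_spread(x2_collinear)

print("independent columns: cond = %.1f, coef std = %s" % (cond_ind, spread_ind))
print("collinear columns:   cond = %.1f, coef std = %s" % (cond_col, spread_col))
```

With nearly collinear columns the condition number of X and the spread of the fitted coefficients should both be orders of magnitude larger, even though the noise level is identical in the two cases.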

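A complete LinearRegression example:

This example uses a single feature of the diabetes dataset bundled with sklearn (column index 2 in the code below) together with the target values. It trains a model on that feature and draws a two-dimensional scatter plot with the fitted straight line, chosen so that the sum of squared vertical distances (along the y axis, toward the label) between the points and the line is minimal. It also prints the coefficient, the mean squared error on the test set and the R^2 score.

[Figure: plot_ols_001.png, test samples as black points with the fitted regression line in blue]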

Script output:

Coefficients:
 [ 938.23786125]
Residual sum of squares: 2548.07
Variance score: 0.47

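Here Coefficients is the coefficient w fitted by least squares. Note that the value printed as "Residual sum of squares" is computed with np.mean in the script, so it is actually the mean squared error on the test set, and "Variance score" is the R^2 value returned by score().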

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature (column index 2), kept as a 2-D array of shape (n_samples, 1)
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets (the last 20 samples are held out)
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets, matching the split of X
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model by passing the training samples and their targets to fit()
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients: the fitted value of the model parameter w
print('Coefficients: \n', regr.coef_)

# The mean squared error on the test set; predict() returns the predicted values.
# (The printed label says "Residual sum of squares", but np.mean makes it an average.)
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))

# R^2 score (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs with matplotlib: test points and the fitted line
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
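Because the script labels a mean as a "sum of squares", it may help to compute the three quantities separately. The sketch below repeats the same train/test split and checks the hand-computed values against sklearn.metrics. The variable names are my own, not part of the original example.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:, [2]]          # same column as the script, kept 2-D
y = diabetes.target
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

residuals = y_test - y_pred
rss = np.sum(residuals ** 2)                   # residual sum of squares
mse = rss / len(y_test)                        # mean squared error
tss = np.sum((y_test - y_test.mean()) ** 2)    # total sum of squares
r2 = 1.0 - rss / tss                           # coefficient of determination

print("RSS: %.2f  MSE: %.2f  R^2: %.2f" % (rss, mse, r2))
print("sklearn MSE: %.2f  R^2: %.2f"
      % (mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred)))
```

The MSE here corresponds to the number the script prints under the "Residual sum of squares" label, while the true residual sum of squares is 20 times larger because the test set has 20 samples.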

Summary of selected attributes and methods:

Attributes:
coef_ : array, shape (n_features,) or (n_targets, n_features). The estimated coefficients of the linear regression problem, i.e. the w that appears in the equations. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.
residues_ : array, shape (n_targets,) or (1,) or empty. Sum of residuals: the squared Euclidean 2-norm for each target passed during the fit. If the linear regression problem is under-determined (the number of linearly independent rows of the training matrix is less than its number of linearly independent columns), this is an empty array. If the target vector passed during the fit is 1-dimensional, this is a (1,) shape array. New in version 0.18.
intercept_ : the intercept, i.e. the value of b.
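The shapes described above can be checked directly. This is a small sketch on synthetic, noise-free data (the true weights 1, 2, 3 and intercept 4 are arbitrary choices for illustration) that fits once with a 1D target and once with a 2D target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # 50 samples, 3 features

y_single = X @ np.array([1.0, 2.0, 3.0]) + 4.0     # one target, shape (50,)
y_multi = np.column_stack([y_single, -y_single])   # two targets, shape (50, 2)

single = LinearRegression().fit(X, y_single)
multi = LinearRegression().fit(X, y_multi)

# One target: coef_ is 1D with length n_features, intercept_ is a scalar.
print(single.coef_.shape, np.shape(single.intercept_))

# Several targets: coef_ has shape (n_targets, n_features),
# intercept_ has shape (n_targets,).
print(multi.coef_.shape, multi.intercept_.shape)
```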

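Methods:

__init__(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
(signature as documented for Release 0.18.1; later releases may differ)

fit(X, y, sample_weight=None)

Purpose:
Fit the linear model.
Parameters:
X : training data (independent variables), numpy array of shape [n_samples, n_features]
y : target values (dependent variable), numpy array of shape [n_samples, n_targets]
sample_weight : individual weight for each sample, shape [n_samples]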
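One way to build intuition for sample_weight: in weighted least squares, giving a sample an integer weight k has the same effect on the objective as listing that sample k times. The sketch below, on made-up numbers, checks this for fit().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up 1-feature data, for illustration only.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.5, 1.8, 3.5])

# Give the last sample a weight of 2 ...
weights = np.array([1.0, 1.0, 1.0, 2.0])
weighted = LinearRegression().fit(X, y, sample_weight=weights)

# ... and compare with a fit where that sample simply appears twice.
X_dup = np.vstack([X, X[-1:]])
y_dup = np.append(y, y[-1])
duplicated = LinearRegression().fit(X_dup, y_dup)

print(weighted.coef_, weighted.intercept_)
print(duplicated.coef_, duplicated.intercept_)
```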

get_params(deep=True)

Get parameters for this estimator.
Parameters:
deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:
params : mapping of string to any
Parameter names mapped to their values.

predict(X)

Purpose: predict using the linear model.
Parameters:
X : samples to predict, shape (n_samples, n_features)
Returns:
array of predicted values, shape (n_samples,)
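For a fitted model, predict(X) amounts to evaluating the linear formula with the stored coef_ and intercept_. A short sketch on noise-free toy data (the true relationship y = 1*x1 + 2*x2 + 3 is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 2.0], [2.0, 3.0]])
y_train = X_train @ np.array([1.0, 2.0]) + 3.0   # y = 1*x1 + 2*x2 + 3

model = LinearRegression().fit(X_train, y_train)

X_new = np.array([[3.0, 5.0], [0.0, 0.0]])
y_pred = model.predict(X_new)                    # shape (n_samples,)

# The same values computed by hand from the fitted attributes.
manual = X_new @ model.coef_ + model.intercept_

print(y_pred)
print(manual)
```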

score(X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Parameters:
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True values for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns:
score : float
R^2 of self.predict(X) wrt. y.
(Source: scikit-learn user guide, Release 0.18.1, section 29.18, sklearn.linear_model: Generalized Linear Models)

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns:
self
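A brief sketch of get_params and set_params, first on a plain estimator and then on a nested object, where the component name and the parameter name are joined by a double underscore. The pipeline step names "scale" and "reg" are arbitrary labels chosen for this example.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simple estimator: read the parameters, then change one of them.
reg = LinearRegression()
params = reg.get_params()
print(params["fit_intercept"])            # True, the default

returned = reg.set_params(fit_intercept=False)
print(returned is reg)                    # set_params returns the estimator itself

# Nested object: address a step's parameter as <step name>__<parameter>.
pipe = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
pipe.set_params(reg__fit_intercept=False)
print(pipe.get_params()["reg__fit_intercept"])   # False
```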

This article draws on a translation shared on GitHub.


Article link: https://blog.csdn.net/u010016927/article/details/74909130
