scikit-learn linearRegression 1.1.1 普通最小二乘法_sklearn linearregression 最小二乘-程序员宅基地

技术标签: 博客  机器学习  

普通线性回归公式:

\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p

在这个公式中,w = (w_1,...,w_p)为权值,有些书籍和文章也称为参数和权重,再线性回归中,通过优化算法求出最佳拟合的w和b(偏值),来进行预测



sklaern实例应用:

LinearRegression 用系数 :math:w = (w_1,...,w_p) 来拟合一个线性模型, 使得数据集实际观测数据和预测数据(估计值)之间误差平方和最小,这也是最小二乘法的核心思想。数学形式可表达为:

\underset{w}{min\,} {|| X w - y||_2}^2   (Xw:为预测值,y为真实值)

../_images/plot_ols_0011.png

LinearRegression 模型会调用 fit 方法来拟合X,y(X为输入,y为输出把拟).并且会合的线性模型的系数 w 存储到成员变量 coef_ 中

>>> from sklearn import linear_model
>>> clf = linear_model.LinearRegression()
>>> clf.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> clf.coef_
array([ 0.5,  0.5])
然而,对于普通最小二乘问题,其系数估计依赖模型各项相互独立。当各项是相关的,设计矩阵(Design Matrix)  x  的各列近似线性相关, 那么,设计矩阵会趋向于奇异矩阵,这会导致最小二乘估计对于随机误差非常敏感,会产生很大的方差。这种  多重共线性(multicollinearity)  的情况可能真的会出现,比如未经实验设计收集的数据.


LinearRegression具体使用实例:

本例子利用了sklearn本身自带的数据集datasets中的糖尿病患者的第一个特征,并结合label,训练和绘制出简单的二维图像,散点图,并拟合出一条直线,二维图像点到直线的距离之和最小(y轴距离label),同时还计算了方差  偏差等值

../../_images/plot_ols_001.png

Script output:

Coefficients:
 [ 938.23786125]
Residual sum of squares: 2548.07
Variance score: 0.47
这里的Coefficient是系数w,通过最小二乘法拟合出来的数据


import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset 加载糖尿病患者数据
diabetes = datasets.load_diabetes()


# Use only one feature 使用一个特征
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets 划分X的训练集和测试集
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets 划分目标label训练集和测试集,数量与x划分对应
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object 创建一个线性回归模型对象
regr = linear_model.LinearRegression()

# Train the model using the training sets 讲训练集和对应训练集的label放到fit()中训练模型
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients  得出训练模型参数W的值
print('Coefficients: \n', regr.coef_)
# The mean square error 预测值与真实值之间的误差平方和 predict输出预测值
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction 
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs  运用matplotlib绘制出图像
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

部分参数和函数使用方法汇总:

属性: 
coef_ : array类型,形状为 (n_features, ) 或者 (n_targets, n_features),这个是表示的线性回归求出来的系数。即方程里面经常见到的w。 
Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features. 
residues_ : array, shape (n_targets,) or (1,) or empty 
Sum of residuals. Squared Euclidean 2-norm for each target passed during the fit. If the linear regression problem is under-determined (the number of linearly independent rows of the training matrix is less than its number of linearly independent columns), this 
is an empty array. If the target vector passed during the fit is 1-dimensional, this is a (1,) shape array. 
New in version 0.18. 
intercept_ : 截距,b值

方法:

__init__(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

fit(X, y, sample_weight=None)

作用: 
拟合线性模型 
参数: 
X : 训练集(自变量),numpy array类型,且形状为[n_samples,n_features] 
y : 标签(因变量)numpy array类型,形状为 [n_samples, n_targets] 
sample_weight : 每个样本的权重,形状为 [n_samples]

get_params(deep=True)

Get parameters for this estimator. 
Parametersdeep : boolean, optional 
If True, will return the parameters for this estimator and contained subobjects that are estimators. 
Returnsparams : mapping of string to any 
Parameter names mapped to their values.

predict(X)

作用:利用这个线性模型来做预测 
参数: 
X :预测的数据,形状为 (n_samples, n_features) 
返回: 
array类型,形状为 (n_samples,)

score(X, y, sample_weight=None) 
Returns the coefficient of determination R^2 of the prediction. 
The coefficient R^2 is defined as (1 - u/v), where u is the regression sum of squares ((y_true - y_pred) ** 
2).sum() and v is the residual sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score 
is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always 
predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0. 
ParametersX : array-like, shape = (n_samples, n_features) 
Test samples. 
y : array-like, shape = (n_samples) or (n_samples, n_outputs) 
True values for X. 
sample_weight : array-like, shape = [n_samples], optional 
Sample weights. 
Returnsscore : float 
R^2 of self.predict(X) wrt. y. 
29.18. sklearn.linear_model: Generalized Linear Models 1531scikit-learn user guide, Release 0.18.1 
set_params(**params) 
Set the parameters of this estimator. 
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have 
parameters of the form __ so that it’s possible to update each component 
of a nested object. 
Returnsself :




本文参考了 github上这位大神带来的翻译
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
版权声明:本文为博主原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原文出处链接和本声明。
本文链接:https://blog.csdn.net/u010016927/article/details/74909130

智能推荐

AVFrame&AVPacket_天天av-程序员宅基地

文章浏览阅读1.5w次。AVFrame:( This structure describes decoded (raw) audio or video data. AVFrame must be allocated using av_frame_alloc(). Note that this only allocates the AVFrame itself, the buffers for the data mus_天天av

Java经典例题07:用100元人民币兑换10元、5元、1元的纸币_编程把100元换成1元5元10元-程序员宅基地

文章浏览阅读3.5k次,点赞2次,收藏12次。解题思路分析:1.100元兑换10元纸币,可以兑换10张,但每种纸币都要有,所以最多只能兑换9张,最少兑换1张。则初始值为1;循环条件小于10或者小于等于9。2.100元兑换5元纸币,可以兑换20,但每种纸币都要有,所以最多只能兑换19张,最少兑换1张。初始值为1;循环条件小于20或者小于等于19。3.100元兑换1元纸币,可以兑换100张,但每种纸币都要有,所以最多只能兑换99张,最少兑换1张。则初始值为1;循环条件小于100或者小于等于99。_编程把100元换成1元5元10元

猜三次年龄_找人猜三次年龄-程序员宅基地

文章浏览阅读450次。1、允许用户最多尝试三次2、每尝试三次后,如果还没猜对,就问用户是否继续玩,如果回答Y,y,就继续猜三次,以此往复,如果回答N,n,就直接退出times=0count=3while times<=3:age=int(input(‘请输入年龄:’))if age == 18:print(‘猜对了’)breakelif age > 18:print(‘猜大了’)else:print(‘猜小了’)times+=1if times3:choose = input(‘继续猜Y_找人猜三次年龄

SDOI2017 Round2 详细题解-程序员宅基地

文章浏览阅读152次。这套题实在是太神仙了。。做了我好久。。。好多题都是去搜题解才会的 TAT。剩的那道题先咕着,如果省选没有退役就来填吧。「SDOI2017」龙与地下城题意丢 \(Y\) 次骰子,骰子有 \(X\) 面,每一面的概率均等,取值为 \([0, X)\) ,问最后取值在 \([a, b]\) 之间的概率。一个浮点数,绝对误差不超过 \(0.013579\) 为正确。数据范围每组数据有 \...

嵌入式数据库-Sqlite3-程序员宅基地

文章浏览阅读1.1k次,点赞36次,收藏25次。阅读引言: 本文将会从环境sqlite3的安装、数据库的基础知识、sqlite3命令、以及sqlite的sql语句最后还有一个完整的代码实例, 相信仔细学习完这篇内容之后大家一定能有所收获。

C++ Builder编写WinForm从Web服务器下载文件-程序员宅基地

文章浏览阅读51次。UnicodeString templateSavePath = ChangeFileExt(ExtractFilePath(Application->ExeName),"tmp.doc");IdAntiFreeze1->OnlyWhenIdle = false;//设置使程序有反应.TMemoryStream *templateStream ;templateStre..._c++webserver下载文件

随便推点

JAVA小项目潜艇大战_java潜艇大战-程序员宅基地

文章浏览阅读8.3k次,点赞10次,收藏41次。一、第一天1、创建战舰、侦察潜艇、鱼雷潜艇、水雷潜艇、水雷、深水炸弹类完整代码:package day01;//战舰public class Battleship { int width; int height; int x; int y; int speed; int life; void move(){ System.out.println("战舰移动"); }}package day01;//侦察潜艇_java潜艇大战

02表单校验的基本步骤-程序员宅基地

文章浏览阅读940次。表单校验的基本步骤_表单校验

libOpenBlas.dll缺失依赖解决办法-程序员宅基地

文章浏览阅读4.5k次。libOpenBlas.dll缺失依赖解决办法 intellij idea 1.dll文件缺失依赖,报错:“找不到指定模块”2.下载depends查看dll缺失文件3.下载缺失依赖libopenblas.dll出错起因由于java web项目需要调用openBlas库来进行运算,就下载了预编译的libopenblas文件进行调用,首先遇到路径出错问题、之后又是dll文件缺失依赖问题,以下是解决..._libopenblas.dll

Swoole 实践篇之结合 WebSocket 实现心跳检测机制-程序员宅基地

文章浏览阅读251次,点赞3次,收藏10次。这里实现的心跳检测机制是一个基础版的,心跳包的主要作用是用于检测用户端是否存活,有助于我们及时判断用户端是否存在断线的问题。在我之前开发过的项目中,有一个基于物联网在线直播抓娃娃的项目,其中就有需要实时监控设备在线状态的需求,该需求就是使用心跳包来实现的。实际上心跳检测技术,应用更广泛的是实时通信、或设备管理的场景偏多。

Maven dependency scope_maven dependent scope-程序员宅基地

文章浏览阅读714次。Dependency scope is used to limit the transitivity of a dependency, and also to affect the classpath used for various build tasks.There are 6 scopes available:compileThis is the default scop_maven dependent scope

TCP头部结构信息_tcp头部包含哪些信息-程序员宅基地

文章浏览阅读3.6k次。TCP 头部结构信息_tcp头部包含哪些信息