NLP Notes (3): Machine-Learning-Based Models, a Simple Gradient Descent Exercise

Preface

Machine Learning: Gradient Descent

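  What machine learning is depends on whom you ask. My understanding is that it uses algorithms to let a machine learn from data and arrive at a model that does better than a hand-designed one at tasks such as classification and prediction.
  Here we work through the Boston house-price prediction problem as a simple hands-on exercise in machine learning.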

# load_boston was removed in scikit-learn 1.2, so this snippet needs an older release
from sklearn.datasets import load_boston
data = load_boston()
X, y = data['data'], data['target']
X[1]
# array([2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
#        6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
#        1.7800e+01, 3.9690e+02, 9.1400e+00])
len(y)
# 506
len(X[:, 0])
# 506
X_rm = X[:, 5]  # sixth column of X, one value per sample

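Points to note in the code above:

  1. y holds the house prices, and X holds the housing variables, such as size and crime rate. As the output shows, we use 506 samples in total.
  2. To keep things simple, we study only the relationship between the sixth variable in X and the house price, so we pull that variable's values for every sample out into X_rm (column index 5, because indexing starts at 0).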
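
The column selection in note 2 can be tried without the dataset. This is a minimal sketch on a made-up array; X_demo and X_rm_demo are hypothetical names used only for illustration, and the random values stand in for the real housing data.

```python
import numpy as np

# Hypothetical stand-in for the housing matrix: 4 samples, 13 variables each.
rng = np.random.default_rng(0)
X_demo = rng.random((4, 13))

# Column index 5 is the sixth variable; the slice keeps one value per sample.
X_rm_demo = X_demo[:, 5]

print(X_demo.shape)     # (4, 13)
print(X_rm_demo.shape)  # (4,)
```

The slice X[:, 5] reads as "every row, column 5", which is why X_rm has the same length as y.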

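  We assume a linear relationship between the independent variable and the dependent variable, that is, y = kx + b, where k and b are unknown parameters. We define a price() function that computes y for a given x and given parameter values. Our task is to find suitable values of k and b so that, for any given x, the gap between the value predicted by this formula and the true value is as small as possible. If we can find reasonably good values of k and b, we have a chance of getting fairly accurate predictions.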
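  So how do we measure the gap between the predicted values and the true values? We use the following definition, the mean squared error, where \hat{y}_i is the predicted value for the i-th sample and n is the number of samples:

loss = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2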

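def price(rm, k, b):
    """f(x) = k * x + b"""
    return k * rm + b

def loss(y, y_hat):  # to evaluate the performance
    """Mean squared error between the true values y and the predictions y_hat."""
    return sum((y_i - y_hat_i)**2 for y_i, y_hat_i in zip(list(y), list(y_hat))) / len(list(y))

# The loss function can also be defined more simply with numpy
import numpy as np
def loss(y, y_hat):
    e = np.array(y) - np.array(y_hat)  # vector of prediction errors
    return (e @ e.T) / len(y)          # sum of squared errors divided by n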

Points to note in the code above:

  1. The Python 3 zip() function: https://www.runoob.com/python3/python3-func-zip.html
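
As a quick sanity check, the two ways of writing the loss can be compared on a tiny hand-computable case. This is only a sketch: mse_loop and mse_numpy are hypothetical names that mirror the two loss() definitions, and the three data points are made up.

```python
import numpy as np

def mse_loop(y, y_hat):
    """Mean squared error written as an explicit Python sum over zip()."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mse_numpy(y, y_hat):
    """The same quantity computed as a dot product of the error vector."""
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(e @ e) / len(e)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Errors are 1, 0, -2; their squares are 1, 0, 4; the mean is 5/3.
print(mse_loop(y_true, y_pred))
print(mse_numpy(y_true, y_pred))
```

Both functions should agree up to floating-point rounding, since e @ e is just the sum of the squared errors.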

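  Our task, then, is to find suitable values of k and b that make the loss as small as possible. Following the machine-learning approach, we first generate k and b at random and then let the program adjust them automatically from the data, until a set number of iterations has run or the loss falls below some threshold.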

Gradient Descent

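  Since x and y are fixed values, the loss is really a function of the variables k and b. With \hat{y}_i = k x_i + b, the partial derivatives of the loss with respect to k and b are given below, followed by the corresponding code:

\frac{\partial loss}{\partial k} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)x_i
\frac{\partial loss}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)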

def partial_k(x, y, y_hat):
    n = len(y)
    gradient = 0
    
    for x_i, y_i, y_hat_i in zip(list(x), list(y), list(y_hat)):
        gradient += (y_i - y_hat_i) * x_i
    
    return -2 / n * gradient
def partial_b(x, y, y_hat):
    n = len(y)
    gradient = 0
    
    for y_i, y_hat_i in zip(list(y), list(y_hat)):
        gradient += (y_i - y_hat_i)
    
    return -2 / n * gradient
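
One way to gain confidence in hand-derived partial derivatives is to compare them with a numerical finite-difference estimate. The following is a minimal sketch, not part of the original workflow: mse and grad are hypothetical vectorized helpers that implement the same formulas as partial_k and partial_b, and the four data points are made up.

```python
import numpy as np

def mse(x, y, k, b):
    """Mean squared error of the line k * x + b on the data (x, y)."""
    return float(np.mean((y - (k * x + b)) ** 2))

def grad(x, y, k, b):
    """Analytic partial derivatives of the MSE with respect to k and b."""
    residual = y - (k * x + b)
    return -2.0 * np.mean(residual * x), -2.0 * np.mean(residual)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
k, b, eps = 0.5, -1.0, 1e-6

dk, db = grad(x, y, k, b)

# Central differences: nudge one parameter at a time and measure the slope.
dk_num = (mse(x, y, k + eps, b) - mse(x, y, k - eps, b)) / (2 * eps)
db_num = (mse(x, y, k, b + eps) - mse(x, y, k, b - eps)) / (2 * eps)

print(dk, dk_num)
print(db, db_num)
```

If the analytic and numerical values disagree by more than rounding error, the derivative formula or its code most likely contains a sign or scaling mistake.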

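  After picking k and b at random, we compute the loss and its partial derivatives with respect to k and b. A random k and b will usually give a fairly large loss, so how should we change k and b to make the loss keep decreasing? The partial derivatives give us the direction: stepping against them reduces the loss. We define a positive learning rate \alpha, and after computing the partial derivatives we update k and b as follows:
k = k -\alpha\times \frac{\partial loss}{\partial k}
b = b -\alpha\times \frac{\partial loss}{\partial b}
  With the new k and b we go back and compute the loss again. If the new loss is smaller than the previous one, it becomes the smallest loss so far, and the new k and b are a better fit than the old ones. We then repeat this process until a set number of iterations has run or the loss falls below some threshold. Note that k and b must be updated simultaneously: do not update k first and then use the updated k to compute the partial derivative with respect to b. The code is as follows: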

import random

trying_times = 2000
min_loss = float('inf')
best_k, best_b = None, None
current_k = random.random() * 200 - 100  # random start in [-100, 100)
current_b = random.random() * 200 - 100
learning_rate = 1e-04

for i in range(trying_times):
    price_by_k_and_b = [price(r, current_k, current_b) for r in X_rm]
    current_loss = loss(y, price_by_k_and_b)

    if current_loss < min_loss:  # performance became better
        min_loss = current_loss
        # remember the parameters that produced the lowest loss so far
        best_k, best_b = current_k, current_b

        if i % 50 == 0:
            print('When time is : {}, get best_k: {} best_b: {}, and the loss is: {}'.format(i, best_k, best_b, min_loss))

    # both gradients use the same predictions, so k and b are updated simultaneously
    k_gradient = partial_k(X_rm, y, price_by_k_and_b)
    b_gradient = partial_b(X_rm, y, price_by_k_and_b)

    current_k = current_k + (-1 * k_gradient) * learning_rate
    current_b = current_b + (-1 * b_gradient) * learning_rate

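Points to note in the code above:

  1. In Python, positive and negative infinity can be written as float("inf") and float("-inf"). Adding a finite number to inf, or multiplying inf by a positive number, still gives inf. Dividing any finite number by inf gives 0.
  2. The Python random() function: https://www.runoob.com/python/func-number-random.html. Be careful to distinguish random in the random module from random in the numpy module.
  3. The Python format() string-formatting function: https://www.runoob.com/python/att-string-format.html
  4. 1e-04 means 1\times10^{-4}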
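
For readers who cannot load the housing data, the same procedure can be tried end to end on synthetic data. This is a sketch under stated assumptions rather than the author's code: fit_line, loss_target and the noise-free line y = 3x + 1 are all made up for illustration, and the learning rate of 0.01 was picked for this particular data.

```python
import numpy as np

# Synthetic, noise-free data generated from a known line (k = 3, b = 1).
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 5.0, size=200)
y = 3.0 * x + 1.0

def fit_line(x, y, learning_rate=0.01, max_iters=20000, loss_target=1e-6):
    """Gradient descent on the MSE of k * x + b.

    Stops after max_iters updates or once the loss drops below loss_target.
    """
    k, b = 0.0, 0.0
    for step in range(max_iters):
        residual = y - (k * x + b)
        current_loss = float(np.mean(residual ** 2))
        if current_loss < loss_target:
            break
        grad_k = -2.0 * np.mean(residual * x)
        grad_b = -2.0 * np.mean(residual)
        # Tuple assignment updates k and b together from the same gradients.
        k, b = k - learning_rate * grad_k, b - learning_rate * grad_b
    return k, b, current_loss, step

k_fit, b_fit, final_loss, steps = fit_line(x, y)
print(k_fit, b_fit, final_loss, steps)
```

Because both gradients are computed from the same residual before either parameter changes, the update is simultaneous in the sense described earlier. The recovered k_fit and b_fit should land close to 3 and 1.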

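  The final output is shown below. Note that best_k and best_b are identical on every line of this log, which indicates that in the run that produced it those two variables were not reassigned inside the loop; only the loss column reflects the training progress.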

When time is : 0, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 575.5349822522099
When time is : 50, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 277.9378161169662
When time is : 100, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 147.24895628021088
When time is : 150, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 89.8572545975801
When time is : 200, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 64.65372567052019
When time is : 250, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 53.58551239815359
When time is : 300, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 48.72477014152337
When time is : 350, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 46.59001559478237
When time is : 400, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 45.65236839246802
When time is : 450, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 45.24042644341104
When time is : 500, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 45.059346031766644
When time is : 550, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.97964764306714
When time is : 600, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.94447083305862
When time is : 650, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.928845550418174
When time is : 700, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.921806290539294
When time is : 750, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.918537593098634
When time is : 800, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91692476670531
When time is : 850, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91603915253814
When time is : 900, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91547293354079
When time is : 950, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91504701836891
When time is : 1000, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.914682759718445
When time is : 1050, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91434561990997
When time is : 1100, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.9140204318406
When time is : 1150, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91370053492356
When time is : 1200, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91338300417686
When time is : 1250, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91306655509527
When time is : 1300, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.912750623583214
When time is : 1350, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.912434961909526
When time is : 1400, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91211946127419
When time is : 1450, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91180407388745
When time is : 1500, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.9114887787528
When time is : 1550, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.9111735666393
When time is : 1600, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91085843348287
When time is : 1650, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.91054337748873
When time is : 1700, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.910228397858496
When time is : 1750, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.9099134942312
When time is : 1800, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.909598666438264
When time is : 1850, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.90928391439542
When time is : 1900, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.90896923805536
When time is : 1950, get best_k: 11.431551629413757 best_b: -49.52403584539048, and the loss is: 44.908654637387244

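  That completes a simple machine-learning model trained with gradient descent. Of course, many questions remain, such as how to choose the initial values and how to choose the learning rate. We will explore these in later notes.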

Finally, you are welcome to visit my GitHub for more code: https://github.com/LiuPineapple
You are also welcome to visit my Jianshu homepage for more articles: https://www.jianshu.com/u/31e8349bd083

https://www.jianshu.com/p/05126890e53b
