项目背景简介:
英国注册的非商店在线零售的所有交易,该公司主要销售独特的全时礼品。公司的许多客户都是批发商。
数据来源:
数据集来自UCI加州大学欧文分校机器学习库:archive.ics.uci.edu/ml/datasets/online+retail#
数据集为xlsx格式,有2个表,一个Year 2009-2010,另一个表Year 2010-2011。数据共计8个字段,541908条。具体字段如下:
InvoiceNo: 发票号码;6位整数,每次交易都有一个不重复的发票号码,若以字母’C’开头,则表示该订单被取消。
StockCode: 产品代码;5为整数,每个单品都有一个唯一的商品代码
Description: 商品名称;例如:CARD I LOVE LONDON
Quantity: 每次交易中每个商品的数量
UnitPrice: 每单位商品价格,以英镑表示: £45.23
InvoiceDate: 每次交易发生的日期和时间
CustomerID: 顾客ID,每个顾客有一个唯一的5位整数
Country: 顾客所在国家/地区名称
“`
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use(‘ggplot’)
plt.rcParams[“font.sans-serif”]=’SimHei’#解决中文乱码
plt.rcParams[‘axes.unicode_minus’] = False#解决负号无法正常显示的问题
# 数据读取
df = pd.read_excel(r”C:\Users\wxw\Downloads\Online Retail.xlsx”,sheet_name=’Online Retail’,dtype=str)
df.head()
“`
“`
# 查看数据情况
df.info()
“`
“`
# 重复值处理
a1 = df.shape[0]
df.drop_duplicates(inplace =True)
a2 = df.shape[0]
print(‘删除后记录量:’,a2,’ 删除记录量:’,a1-a2)
“`
“`
# 重置索引
df.reset_index(drop = True,inplace =True)
# 缺失值处理
df.isnull().sum()
“`
“`
# 一致化处理
# 调整数据类型
df[‘InvoiceDate’] = pd.to_datetime(df[‘InvoiceDate’],errors = ‘coerce’)
df[‘Quantity’] = df[‘Quantity’].astype(‘int32’)
df[‘UnitPrice’] = df[‘UnitPrice’].astype(‘float’)
# 新增时间维度字段
df[‘Date’] = pd.to_datetime(df[‘InvoiceDate’].dt.date,errors =’coerce’)
df[‘Month’] = df[‘InvoiceDate’] .astype(‘datetime64[M]’)
# 新增销售金额字段
df[‘Sales_volume’] = df[‘Quantity’]*df[‘UnitPrice’]
# 异常值处理
df.describe()
“`
“`
# 删除Quantity & UnitPrice 为负值的数据
df =df[(df[‘Quantity’] > 0) & (df[‘UnitPrice’] > 0)]
df.describe()
“`
“`
# 客户的生命周期
max_date = df.groupby([‘CustomerID’])[[‘Date’]].max()
min_date = df.groupby([‘CustomerID’])[[‘Date’]].min()
life_time = max_date – min_date
life_time.describe()
“`
“`
life_time.head()
“`
“`
life_time[‘life_time’] = life_time[‘Date’].dt.days
life_time[‘life_time’].hist( bins=30,figsize =(15,8), color = ‘r’)
plt.xlabel(‘ 生命周期(天)’)
plt.ylabel(‘用户量’)
plt.title(‘用户生命周期图’)
“`
“`
# 去掉消费一次的客户
life_time[life_time[‘life_time’]> 0].life_time.hist(bins =200,figsize =(15,8),color = ‘r’)
plt.xlabel(‘生命周期(天)’)
plt.ylabel(‘用户量’)
plt.title(‘用户生命周期(不含一次消费者)’)
“`
“`
#计算生命周期在300天以上的人数
len(life_time[life_time.life_time > 300])
customer_retention = pd.merge(left = df,right=min_date,on = ‘CustomerID’,how = ‘inner’,suffixes = (”,’_min’))
customer_retention.head()
“`
“`
customer_retention[‘DateDiff’] = ((customer_retention.Date – customer_retention.Date_min)).dt.days
customer_retention.head()
“`
“`
date_bins = [0, 3, 7, 30, 60, 90, 180]
customer_retention[‘DateDiffB’] = pd.cut(customer_retention.DateDiff, bins = date_bins)
customer_retention[‘DateDiffB’].value_counts()
“`
“`
retention_pivot = customer_retention.pivot_table(index = [‘CustomerID’], columns = [‘DateDiffB’], values = [‘Sales_volume’], aggfunc= np.sum)
retention_pivot.head(5)
“`
“`
retention_pivot_trans = retention_pivot.fillna(0).applymap(lambda x:1 if x > 0 else 0)
retention_pivot_trans.head()
“`
“`
(retention_pivot_trans.sum()/ retention_pivot_trans.count()).plot.bar(width=0.5,align=”center”,figsize =(15,8))
plt.xlabel(‘时间跨度(天)’)
plt.ylabel(‘百分数(%)’)
plt.title(‘各个时间段的用户留存率’)
“`
“`
df_cycle = customer_retention.drop_duplicates(subset=[‘CustomerID’, ‘Date’])
df_cycle.sort_values(by = ‘Date’)
“`
“`
def diff(g):
d = g.DateDiff – g.DateDiff.shift()
return d
last_diff = df_cycle.groupby(‘CustomerID’).apply(diff)
last_diff.hist(bins = 70, figsize = (15, 8), color = ‘r’)
plt.xlabel(‘消费周期’)
plt.ylabel(‘频数’)
plt.title(‘用户消费周期分布图’)
“`
“`
last_diff_customer = last_diff.groupby(‘CustomerID’).mean()
last_diff_customer.hist(bins = 70, figsize = (15, 6), color = ‘r’)
plt.xlabel(‘消费周期’)
plt.ylabel(‘用户量’)
plt.title(‘用户消费周期分布图’)
“`
“`
df_order = df.groupby([‘CustomerID’,’Date’]).sum()
df_order.reset_index(inplace = True)
df_order[‘month’] = df_order.Date.astype(‘datetime64[M]’)
pivoted_counts = df_order.pivot_table(index = ‘CustomerID’ ,columns = ‘month’,
values = ‘Date’ , aggfunc = ‘count’).fillna(0)
columns_month = df_order.month.sort_values().unique()
pivoted_counts.columns = columns_month
pivoted_counts.head()
“`
“`
#进行数据转化吧消费2次以上的记为1,消费一次的记为0,没有消费的记为NaN
pivoted_counts_t = pivoted_counts.applymap(lambda x: 1 if x > 1
else np.NaN if x == 0 else 0)
pivoted_counts_t.head()
“`
“`
ax = (pivoted_counts_t.sum() / pivoted_counts_t.count()).plot(figsize = (15,8))
plt.xlabel(‘月份’)
plt.ylabel(‘复购率’)
plt.title(‘用户月度复购率情况’)
“`
https://www.jianshu.com/p/5badb271426f