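GoDaddy is an ongoing time-series forecasting competition. This article walks through baseline approaches to the GoDaddy task, covering data splitting, model validation, and several simple baseline models.
Competition page: https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting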
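Step 1: Splitting the time-series data
Following the logic of the competition's own data split, we divide the dataset into a training set and a validation set. As the competition progresses, the data used for validation will change as well.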
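    HOLDOUT_SIZE = 4

    # The last HOLDOUT_SIZE months form the holdout period.
    test_set_dates = sorted(train["first_day_of_month"].unique())[-HOLDOUT_SIZE:]
    df_train = train[~train["first_day_of_month"].isin(test_set_dates)].copy()
    df_test = train[train["first_day_of_month"].isin(test_set_dates)].copy()

    # The first holdout month is the "public" part, the remaining months the "private" part.
    # unique() keeps order of appearance, so this relies on rows being in chronological order.
    df_test_public = df_test.loc[
        df_test["first_day_of_month"] == df_test["first_day_of_month"].unique()[0]
    ].copy()
    if HOLDOUT_SIZE > 1:
        df_test_private = df_test.loc[
            df_test["first_day_of_month"].isin(df_test["first_day_of_month"].unique()[1:])
        ].copy()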
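To make the split easier to check, here is a minimal sketch on a made-up frame. The toy values and the helper name `split_holdout` are assumptions for illustration; only the column names come from the competition data.

```python
import pandas as pd

# Hypothetical toy data: 2 counties x 6 months (values are made up).
months = pd.date_range("2022-01-01", periods=6, freq="MS")
toy = pd.DataFrame({
    "cfips": [1001] * 6 + [1003] * 6,
    "first_day_of_month": list(months) * 2,
    "microbusiness_density": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5,
                              2.0, 2.0, 2.1, 2.1, 2.2, 2.2],
})


def split_holdout(df, holdout_size):
    """Split by date: all but the last `holdout_size` months are training data.

    The first holdout month is returned as the public part,
    the remaining holdout months as the private part.
    """
    dates = sorted(df["first_day_of_month"].unique())
    holdout = dates[-holdout_size:]
    is_holdout = df["first_day_of_month"].isin(holdout)
    train_part = df[~is_holdout].copy()
    public = df[df["first_day_of_month"] == holdout[0]].copy()
    private = df[df["first_day_of_month"].isin(holdout[1:])].copy()
    return train_part, public, private


toy_train, toy_public, toy_private = split_holdout(toy, holdout_size=3)
print(len(toy_train), len(toy_public), len(toy_private))
```

Because the split is by date rather than by random rows, every training row is strictly earlier than every holdout row, which is what a forecasting validation needs.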
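Step 2: Defining the evaluation metric
The competition is scored with SMAPE (symmetric mean absolute percentage error), so we define the same metric for local validation.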
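    import numpy as np

    def SMAPE(y_true, y_pred):
        # Dividing by 200 puts the result on a 0-200 scale.
        # y_true is used without abs(), so it is assumed to be non-negative.
        denominator = (y_true + np.abs(y_pred)) / 200.0
        diff = np.abs(y_true - y_pred) / denominator
        # Count the error as 0 where the denominator is 0 (target and prediction both 0).
        diff[denominator == 0] = 0.0
        return np.mean(diff)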
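A small worked example may help when reading the scores later in the article. The sketch below is a standalone variant, not the notebook's function: the name `smape` and the input values are assumptions, and it takes the absolute value of the target as well, following the usual SMAPE definition.

```python
import numpy as np


def smape(y_true, y_pred):
    """SMAPE on a 0-200 scale; a term counts as 0 when target and prediction are both 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    ratio = np.zeros_like(denominator)
    nonzero = denominator != 0
    # Only divide where the denominator is non-zero to avoid 0/0.
    ratio[nonzero] = np.abs(y_true - y_pred)[nonzero] / denominator[nonzero]
    return 100.0 * ratio.mean()


perfect = smape([1.0, 2.0], [1.0, 2.0])   # identical arrays
off = smape([100.0, 0.0], [110.0, 0.0])   # one 10-unit miss, one 0/0 term
worst = smape([1.0], [0.0])               # predicting 0 for a positive target
print(perfect, round(off, 4), worst)
```

The second call averages 10 / 105 and 0 over two points, and the third shows the upper bound of 200 that the metric reaches when a positive target is predicted as zero.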
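Step 3: Forecasting with the historical mean
When forecasting future values, simply using each county's historical mean is a reasonable starting point.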
    y_pred_public = list()
    y_pred_private = list()
    for county in train['cfips'].unique():
        # Mean of this county's training history.
        county_avg = df_train.loc[df_train['cfips'] == county, 'microbusiness_density'].mean()
        # Repeat the mean once per holdout row of this county.
        public_preds = [county_avg] * len(df_test_public[df_test_public['cfips'] == county])
        if HOLDOUT_SIZE > 1:
            private_preds = [county_avg] * len(df_test_private[df_test_private['cfips'] == county])
        y_pred_public += public_preds
        if HOLDOUT_SIZE > 1:
            y_pred_private += private_preds
Validation results:
Leaderboard estimate Public: 8.555 Private: 8.696
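The per-county mean can also be computed without an explicit loop. The following is a sketch on made-up numbers, assuming only the `cfips` and `microbusiness_density` column names; the frames `history` and `future` are hypothetical stand-ins for the training and holdout sets.

```python
import pandas as pd

# Hypothetical history for two counties (values are made up).
history = pd.DataFrame({
    "cfips": [1001, 1001, 1001, 1003, 1003, 1003],
    "microbusiness_density": [1.0, 2.0, 3.0, 4.0, 4.0, 7.0],
})
# Rows we want predictions for.
future = pd.DataFrame({"cfips": [1001, 1003, 1003]})

# One mean per county, then look it up for every future row.
county_mean = history.groupby("cfips")["microbusiness_density"].mean()
future["prediction"] = future["cfips"].map(county_mean)
print(future)
```

Mapping by `cfips` keeps each prediction aligned with its own row, so the result does not depend on the order in which counties are visited.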
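Step 4: Using the last observation as the forecast
In time-series tasks, the most recent observations carry the most information, so we can use each county's last observed value directly as the forecast.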
    y_pred_public = list()
    y_pred_private = list()
    for county in train['cfips'].unique():
        # Last training value of this county (relies on rows being in chronological order).
        county_naive = df_train.loc[(df_train['cfips'] == county), 'microbusiness_density'].values[-1]
        public_preds = [county_naive] * len(df_test_public[df_test_public['cfips'] == county])
        if HOLDOUT_SIZE > 1:
            private_preds = [county_naive] * len(df_test_private[df_test_private['cfips'] == county])
        y_pred_public += public_preds
        if HOLDOUT_SIZE > 1:
            y_pred_private += private_preds
Validation results:
Leaderboard estimate Public: 2.355 Private: 2.809
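The last-value baseline depends on which row counts as "last", so sorting by date first is a cheap safeguard. A minimal sketch, with made-up values and deliberately shuffled rows (the frames `history` and `future` are hypothetical):

```python
import pandas as pd

# Hypothetical history with rows deliberately out of date order.
history = pd.DataFrame({
    "cfips": [1001, 1001, 1001, 1003, 1003, 1003],
    "first_day_of_month": pd.to_datetime([
        "2022-01-01", "2022-03-01", "2022-02-01",
        "2022-03-01", "2022-01-01", "2022-02-01",
    ]),
    "microbusiness_density": [1.0, 3.0, 2.0, 7.0, 4.0, 4.5],
})
future = pd.DataFrame({"cfips": [1003, 1001]})

# Sort chronologically, then take the final value within each county.
last_value = (
    history.sort_values("first_day_of_month")
    .groupby("cfips")["microbusiness_density"]
    .last()
)
future["prediction"] = future["cfips"].map(last_value)
print(future)
```

Without the sort, `.last()` would return the February values (2.0 and 4.5) instead of the March ones.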
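Step 5: Seasonal forecasting
Forecasts should also account for seasonality, so we can take the value from the same month of the previous year and use it as the prediction.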
    y_pred_public = list()
    y_pred_private = list()
    # Assumes each holdout slice lies within a single calendar year.
    for county in train['cfips'].unique():
        # Same month as the public holdout, taken from the previous year.
        county_public_naive_seasonal = df_train.loc[
            (df_train['cfips'] == county)
            & (df_train['first_day_of_month'].dt.year == df_test_public['first_day_of_month'].dt.year.unique()[0] - 1)
            & (df_train['first_day_of_month'].dt.month.isin(df_test_public['first_day_of_month'].dt.month)),
            'microbusiness_density'].values
        y_pred_public += list(county_public_naive_seasonal)
        # Private predictions exist only when the holdout spans more than one month.
        if HOLDOUT_SIZE > 1:
            county_private_naive_seasonal = df_train.loc[
                (df_train['cfips'] == county)
                & (df_train['first_day_of_month'].dt.year == df_test_private['first_day_of_month'].dt.year.unique()[0] - 1)
                & (df_train['first_day_of_month'].dt.month.isin(df_test_private['first_day_of_month'].dt.month)),
                'microbusiness_density'].values
            y_pred_private += list(county_private_naive_seasonal)
Validation results:
Leaderboard estimate Public: 7.331 Private: 7.006
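An alternative way to express "same month, previous year" is to shift each target date back by one year and join on it. This is a sketch under assumptions: the frames, values, and the helper columns `source_month` and `prediction` are hypothetical, and it is not the notebook's implementation.

```python
import pandas as pd

# Hypothetical history for one county (values are made up).
history = pd.DataFrame({
    "cfips": [1001, 1001, 1001],
    "first_day_of_month": pd.to_datetime(["2021-07-01", "2021-08-01", "2022-06-01"]),
    "microbusiness_density": [1.5, 1.6, 2.0],
})
# Months to forecast.
future = pd.DataFrame({
    "cfips": [1001, 1001],
    "first_day_of_month": pd.to_datetime(["2022-07-01", "2022-08-01"]),
})

# For every target month, compute the month one year earlier.
lookup = future.assign(
    source_month=future["first_day_of_month"] - pd.DateOffset(years=1)
)
# Join the historical value of that earlier month onto each target row.
merged = lookup.merge(
    history.rename(columns={
        "first_day_of_month": "source_month",
        "microbusiness_density": "prediction",
    }),
    on=["cfips", "source_month"],
    how="left",
)
print(merged[["cfips", "first_day_of_month", "prediction"]])
```

Shifting per row means the lookup still works when the forecast window crosses a year boundary; a missing source month simply yields `NaN` because of the left join.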
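Step 6: Drift forecasting
This method lets the forecast increase or decrease over time. It assumes that the change per unit of time (the "drift") equals the average change observed in the historical data.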
    y_pred_public = list()
    y_pred_private = list()
    # Number of monthly steps between the first and last training month.
    change_in_time = df_train['first_day_of_month'].nunique() - 1
    for county in train['cfips'].unique():
        first_month = df_train['first_day_of_month'].unique()[0]
        last_month = df_train['first_day_of_month'].unique()[-1]
        first_value = df_train.loc[(df_train['cfips'] == county) & (df_train['first_day_of_month'] == first_month), 'microbusiness_density'].values[0]
        last_value = df_train.loc[(df_train['cfips'] == county) & (df_train['first_day_of_month'] == last_month), 'microbusiness_density'].values[0]
        # Average change per month over the whole training history.
        slope = (last_value - first_value) / change_in_time
        # One step ahead for the public month.
        public_preds = last_value + slope
        y_pred_public.append(public_preds)
        if HOLDOUT_SIZE > 1:
            # Two or more steps ahead for the private months.
            for step in range(2, len(df_test_private['first_day_of_month'].unique()) + 2):
                private_preds = last_value + step * slope
                y_pred_private.append(private_preds)
Validation results:
Leaderboard estimate Public: 2.536 Private: 3.323
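The drift rule amounts to drawing a straight line through the first and last observations and extending it. A minimal sketch for a single series, where the function name `drift_forecast` and the input values are assumptions for illustration:

```python
import numpy as np


def drift_forecast(series, horizon):
    """Extend the line through the first and last points by `horizon` steps.

    Requires at least two observations.
    """
    series = np.asarray(series, dtype=float)
    # Average change per step over the whole history.
    slope = (series[-1] - series[0]) / (len(series) - 1)
    steps = np.arange(1, horizon + 1)
    return series[-1] + steps * slope


# Hypothetical history rising from 2.0 to 2.9 over three steps: slope is 0.3.
forecast = drift_forecast([2.0, 2.1, 2.6, 2.9], horizon=3)
print(np.round(forecast, 2))
```

Only the two endpoints influence the slope, so a single unusual first or last month changes every forecast. That sensitivity may be part of why this method scores slightly worse than the plain last-value baseline on the private months.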
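Step 7: Linear regression forecasting
Treat the historical series as training data and build a linear regression model with time as the feature (the preprocessing code encodes each month as a day offset). It is advisable to train on the data from all regions together.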
    def simple_lr_preprocess(df):
        days_conversion = list()
        for date in df["first_day_of_month"]:
            # Day offset of each month relative to today (negative for past dates).
            days_conversion.append((date - pd.to_datetime(date.today())).days)
        # Features: county code and day offset; target: microbusiness density.
        X = pd.concat([df['cfips'].reset_index(drop=True),
                       pd.Series(days_conversion),
                       df['microbusiness_density'].reset_index(drop=True)], axis=1)
        X.columns = ['cfips', 'days_since', 'microbusiness_density']
        y = X.pop('microbusiness_density')
        return X, y
Validation results:
Leaderboard estimate Public: 6.279 Private: 7.212
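The article shows only the preprocessing step, so the sketch below illustrates what fitting one pooled model on `cfips` and `days_since` could look like. It uses ordinary least squares via `numpy.linalg.lstsq` as a stand-in for whichever estimator the original notebook uses, and all numbers are made up.

```python
import numpy as np

# Hypothetical pooled training data: columns are (cfips, days_since).
X_train = np.array([
    [1001, 0], [1001, 31], [1001, 59],
    [1003, 0], [1003, 31], [1003, 59],
], dtype=float)
# Made-up targets that follow an exact linear rule, so the fit is easy to check.
y_train = np.array([1.0, 1.31, 1.59, 2.0, 2.31, 2.59])


def add_intercept(X):
    """Append a column of ones so the model can learn an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


# Ordinary least squares on all counties at once.
coef, _, _, _ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Predict day 90 for both counties.
X_new = np.array([[1001, 90], [1003, 90]], dtype=float)
predictions = add_intercept(X_new) @ coef
print(np.round(predictions, 3))
```

Feeding `cfips` in as a plain number forces a single trend and a linear relationship across county codes, which may be one reason the pooled model trails the last-value baseline in the results above.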
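A sample of the predictions shows that the model's forecasts differ little from the validation data. Because the dataset is not complex, it is worth trying other simple models as well.
Source code for this article: https://www.kaggle.com/code/michaelbryantds/godaddy-simple-forecasting-methods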