Model Evaluation and Selection

A Stellar Hiker

2018-07-19

Comparing Model Families

No Free Lunch Theorem
There is no method that outperforms all others on all tasks.

Decision Boundaries

The most powerful methods are Gradient Boosted Decision Trees and Neural Networks.
But you shouldn’t underestimate the others.

Linear models split space into 2 subspaces.
Tree-based methods split space into boxes.
k-NN methods rely heavily on how the “closeness” of points is measured.
Feed-forward NNs produce a smooth non-linear decision boundary.


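Empirical Error and Overfitting

The error on the training set is the empirical error; the error on new samples is the generalization error. We want a learner with a small generalization error.
Overfitting / Underfitting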

Underfitting refers to not capturing enough patterns in the data.
Generally, overfitting refers to:
a. capturing noise.
b. capturing patterns which do not generalize to test data.

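Evaluation Methods

Only the generalization error is considered here; real-world tasks often also weigh factors such as time cost, storage cost, and interpretability.
Model selection and parameter tuning are carried out on a validation set.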

Methods

Holdout
Cross Validation
- K-fold
- Leave-one-out
Bootstrapping
Parameter Tuning

Causes of validation problems:

Too little data.
Too diverse and inconsistent data.
Incorrect train/test split.
Different distributions in train and test.
Overfitting.

Holdout

Holdout: ngroups = 1
sklearn.model_selection.ShuffleSplit
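
A minimal sketch of a holdout split with `ShuffleSplit`, assuming a toy dataset of 10 rows and a 20% validation share (both are illustrative choices, not values from the notes):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical toy data: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# n_splits=1 gives a single holdout split; 20% of the rows go to validation.
splitter = ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(splitter.split(X))

X_train, X_valid = X[train_idx], X[valid_idx]
y_train, y_valid = y[train_idx], y[valid_idx]
print(len(train_idx), len(valid_idx))
```

Fixing `random_state` keeps the split reproducible, which makes validation scores comparable between runs.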

There are several ways to split the dataset:

Random, rowwise
Timewise - Moving window
By id
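
The timewise and by-id strategies can be sketched as follows. The `timestamps` and `user_ids` arrays are hypothetical, and the timewise part shows a single cut rather than a full moving window; `GroupShuffleSplit` is one possible way to split by id.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 8 rows, each with a timestamp and a user id.
timestamps = np.array([3, 1, 7, 5, 2, 8, 6, 4])
user_ids = np.array([10, 10, 11, 11, 12, 12, 13, 13])
X = np.arange(16).reshape(8, 2)

# Timewise: train on the earliest 75% of rows, validate on the most recent 25%.
order = np.argsort(timestamps)
cut = int(len(order) * 0.75)
time_train_idx, time_valid_idx = order[:cut], order[cut:]

# By id: all rows of a given user land on the same side of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
id_train_idx, id_valid_idx = next(gss.split(X, groups=user_ids))
```

The validation split should mimic how the train/test split was made; a random rowwise split on time-ordered data tends to give overly optimistic scores.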

Cross Validation

K-fold

K-fold: ngroups = k
sklearn.model_selection.KFold
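
A short sketch of K-fold splitting with `KFold`, assuming k = 5 and a toy dataset of 10 rows (illustrative values only):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # hypothetical toy data

kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kf.split(X))

# Every row is used for validation exactly once across the 5 folds.
for fold, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {fold}: train={len(train_idx)} valid={len(valid_idx)}")
```

The final score is usually the average of the per-fold validation scores.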

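Note that the training set and the test set should keep the data distribution as consistent as possible. If the split is viewed as a sampling process, sampling that preserves the class proportions is usually called stratified sampling.
Stratification preserves the same target distribution over different folds.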

Stratification is useful for:

Small datasets.
Unbalanced datasets.
Multiclass classification.
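
As an illustration, `StratifiedKFold` keeps the class ratio in every fold. The label vector below is a hypothetical unbalanced example (80% / 20%):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical unbalanced labels: 16 samples of class 0, 4 of class 1.
y = np.array([0] * 16 + [1] * 4)
X = np.zeros((len(y), 1))  # feature values do not affect the split itself

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Share of class 1 in each validation fold.
positive_rates = [float(y[valid_idx].mean()) for _, valid_idx in skf.split(X, y)]
print(positive_rates)
```

With a plain shuffled K-fold on such a small set, some folds could end up with no positive sample at all.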

Leave-one-out

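A special case in which the test set contains a single sample.
Pros: the training set is almost the same as the one used in practice, so the estimate may be more accurate than with other methods.
Cons: very expensive when the dataset is large.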
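
A small sketch with `LeaveOneOut` that makes the cost visible: the number of models to train equals the number of samples. The 6-row dataset is hypothetical.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(12).reshape(6, 2)  # hypothetical toy data with 6 samples

loo = LeaveOneOut()
splits = list(loo.split(X))
n_models = loo.get_n_splits(X)  # one model per sample

print(n_models)
```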

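Bootstrapping

Bootstrapping aims to reduce the effect of differing training-set sizes while still allowing a reasonably efficient experimental estimate.
It is based on bootstrap sampling: each time, pick one sample at random from the initial dataset, copy it into the training set, and put it back so that it may be drawn again in later draws; repeat m times.
As a simple estimate, the probability that a given sample is never drawn in m draws is $(1- \frac{1}{m})^m$, which tends to about 0.368 in the limit, so roughly 36.8% of the samples do not appear in the sampled dataset.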

$\lim_{m \rightarrow \infty} (1- \frac{1}{m})^m = \frac{1}{e} \approx 0.368$

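Bootstrapping is useful when the dataset is small and hard to split effectively into training and test sets, and it also benefits ensemble learning. However, it changes the distribution of the initial dataset, which may introduce estimation error, so it is best avoided when enough data is available.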
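
The 36.8% figure can be checked with a quick simulation. This is a sketch using NumPy; the dataset size `m` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000  # size of the initial dataset (arbitrary)

# Draw m indices with replacement: one bootstrap sample.
sample_idx = rng.integers(0, m, size=m)

# Samples that were never drawn can serve as the test set.
in_bag = np.unique(sample_idx)
oob_fraction = 1 - len(in_bag) / m

print(round(oob_fraction, 3), round(float(np.exp(-1)), 3))
```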

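Parameter Tuning

For each parameter, choose a range and a step size, then train with values taken from the resulting candidates. With many candidates, the tuning workload becomes very large.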
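
One way to sketch this is a grid search with `GridSearchCV`. The model (k-NN), the dataset (iris), and the range 1 to 15 with step 2 are illustrative assumptions, not choices made in the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Range 1..15 with step 2 for n_neighbors: 8 candidate values.
param_grid = {"n_neighbors": list(range(1, 16, 2))}

# Each candidate is scored with 5-fold cross validation: 8 x 5 = 40 fits.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The number of fits grows multiplicatively with every extra parameter, which is why the ranges and step sizes have to be chosen with care.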

Performance Metrics

The chosen metric determines the optimal decision boundary.

Regression:
MSE, RMSE, R-squared
MAE
(R)MSPE, MAPE
(R)MSLE
Classification:
Accuracy, LogLoss, AUC, Confusion Matrix, Precision, Recall, F1 Score
Cohen’s (Quadratic weighted) Kappa
Cost-sensitive error rate and cost curves

Note the difference between a loss and a metric:

The target metric is what we want to optimize.
The optimization loss is what the model actually optimizes.

Regression metrics

MSE: Mean Square Error

$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2$

Best constant for $\hat y_i$ is target mean.

RMSE: Root Mean Square Error

$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2} = \sqrt{MSE}$

R-squared

$R^2 = 1- \frac{\frac{1}{N} \sum_{i=1}^N (y_i - \hat y_i)^2}{\frac{1}{N} \sum_{i=1}^N (y_i - \overline y)^2} = 1 - \frac{MSE}{\frac{1}{N} \sum_{i=1}^N (y_i - \overline y)^2}$

$\overline y = \frac{1}{N} \sum_{i=1}^{N} y_i$
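
The three formulas above translate directly into NumPy. This is a minimal sketch; the function names and the toy targets are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    return float(np.sqrt(mse(y_true, y_pred)))

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    # MSE of the constant baseline that always predicts the target mean.
    baseline_mse = np.mean((y_true - y_true.mean()) ** 2)
    return float(1.0 - mse(y_true, y_pred) / baseline_mse)

y = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical targets
print(mse(y, y + 2), rmse(y, y + 2), r_squared(y, y + 2))
```

R-squared is 1 for a perfect prediction and 0 for the constant prediction $\overline y$; since RMSE and R-squared are both monotonic functions of MSE, a model that improves one improves the others.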

MAE: Mean Absolute Error

$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat y_i|$

Best constant for $\hat y_i$ is target median.
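
The best constants for MSE (mean) and MAE (median) can be checked numerically by scanning a grid of constant predictions. The targets below are hypothetical and include one outlier on purpose.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical targets, one outlier
candidates = np.linspace(0, 100, 10001)     # constant predictions, step 0.01

# Error of every candidate constant against all targets.
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best_for_mse = float(candidates[mse.argmin()])
best_for_mae = float(candidates[mae.argmin()])

print(best_for_mse, y.mean())       # both close to 22.0
print(best_for_mae, np.median(y))   # both close to 3.0
```

The outlier drags the MSE-optimal constant far away from most of the data, while the MAE-optimal constant stays with the bulk of the targets.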
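Derivatives: the gradient of MAE with respect to $\hat y_i$ is $-\frac{1}{N} \operatorname{sign}(y_i - \hat y_i)$, a step function that is undefined at $\hat y_i = y_i$; the second derivative is zero everywhere else.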

From MSE and MAE to MSPE and MAPE


MSPE, MAPE

$MSPE = \frac{100\%}{N} \sum_{i=1}^N (\frac{y_i - \hat y_i}{y_i})^2$

Best constant: weighted target mean.

$MAPE = \frac{100\%}{N} \sum_{i=1}^N |\frac{y_i - \hat y_i}{y_i}|$

Best constant: weighted target median.
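
As a sketch of where the weights come from: dividing each error by $y_i$ is the same as weighting sample $i$ by $1 / y_i^2$ in MSE and by $1 / |y_i|$ in MAE. The best constant is then a weighted mean for MSPE and a weighted median for MAPE. The helper names and targets below are hypothetical, and positive targets are assumed.

```python
import numpy as np

def best_constant_mspe(y):
    """Weighted mean of the targets with weights 1 / y_i^2."""
    w = 1.0 / y ** 2
    return float(np.sum(w * y) / np.sum(w))

def best_constant_mape(y):
    """Weighted median of the targets with weights 1 / |y_i|."""
    order = np.argsort(y)
    y_sorted = y[order]
    w_sorted = (1.0 / np.abs(y))[order]
    cum = np.cumsum(w_sorted)
    # First target at which the cumulative weight reaches half of the total.
    return float(y_sorted[np.searchsorted(cum, 0.5 * cum[-1])])

y = np.array([1.0, 2.0, 5.0, 10.0, 50.0])  # hypothetical positive targets
print(best_constant_mspe(y), best_constant_mape(y))
```

Both constants sit near the small targets, because relative errors penalize mistakes on small values the most.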

(R)MSLE: Root Mean Square Logarithmic Error

The best constant is the mean of the target in log space; exponentiate it to get the answer.
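
A minimal sketch, assuming the common definition with a +1 offset inside the logarithm, $RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\log(y_i + 1) - \log(\hat y_i + 1))^2}$. The offset convention, the function names, and the targets are assumptions made for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so this is RMSE computed in log space.
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

def best_constant_rmsle(y_true):
    # Mean in log space, mapped back with expm1, the inverse of log1p.
    return float(np.expm1(np.mean(np.log1p(y_true))))

y = np.array([1.0, 10.0, 100.0, 1000.0])  # hypothetical non-negative targets
c = best_constant_rmsle(y)
print(c, rmsle(y, np.full_like(y, c)), rmsle(y, np.full_like(y, y.mean())))
```

The constant found this way is much smaller than the plain target mean, which reflects that RMSLE cares about relative rather than absolute differences.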

Comparison

MSE, RMSE, R-squared
They are the same from an optimization perspective.
MAE
Robust to outliers.
(R)MSPE
Weighted version of MSE.
MAPE
Weighted version of MAE.
(R)MSLE
MSE in log space.

MAE vs. MSE

Do you have outliers in the data?
Use MAE.
Are you sure they are outliers?
Use MAE.
Or they are just unexpected values we should still care about?
Use MSE.

To Optimize


Classification Metrics

Accuracy Score

How frequently our class prediction is correct.

$Accuracy = \frac{1}{N} \sum_{i=1}^{N} [\hat y_i = y_i]$

Best constant: predict the most frequent class.

Confusion Matrix, Precision, Recall, F1 Score

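Precision: of all samples predicted as positive, the fraction that are actually positive.
Recall: of all samples that are actually positive, the fraction that were predicted as positive.
Precision and recall are in tension with each other; the F1 score is introduced to balance them.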

$\frac{1}{F1} = \frac{1}{2} \cdot (\frac{1}{P} + \frac{1}{R})$

That is,

$F1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times TP}{All + TP - TN}$

where $All$ is the total number of samples.
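
The two forms of F1 can be verified on a small example. This sketch counts the confusion-matrix cells by hand; the labels are hypothetical and at least one predicted and one actual positive are assumed, so no division by zero occurs.

```python
def precision_recall_f1(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Equivalent count-based form: All + TP - TN = 2*TP + FP + FN.
    f1_from_counts = 2 * tp / (len(pairs) + tp - tn)
    return precision, recall, f1, f1_from_counts

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1, f1_from_counts = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1, f1_from_counts)
```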

Logarithmic Loss (logloss)

Binary

$LogLoss = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat y_i) + (1 - y_i) \log (1 - \hat y_i) \right]$

$y_i \in \{0, 1\}, \hat y_i \in [0, 1]$

Multiclass

$LogLoss = - \frac{1}{N} \sum_{i=1}^{N} \sum_{l=1}^L y_{il} \log(\hat y_{il})$

$y_i \in \{0, 1\}^L$ (one-hot), $\hat y_i \in [0, 1]^L$

In practice

$LogLoss = - \frac{1}{N} \sum_{i=1}^{N} \sum_{l=1}^L y_{il} \log(\min(\max(\hat y_{il}, 10^{-15}), 1 - 10^{-15}))$

Logloss strongly penalizes completely wrong answers.

Best constant: set $\alpha_i$, the constant predicted probability for class $i$, to the frequency of the $i$-th class.
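
A sketch of the clipped multiclass logloss in NumPy. The function name `multiclass_logloss` and the labels are hypothetical; the clipping value $10^{-15}$ follows the formula above.

```python
import numpy as np

def multiclass_logloss(y_onehot, y_prob, eps=1e-15):
    y_onehot = np.asarray(y_onehot, dtype=float)
    # Clip probabilities away from 0 and 1 so that log() stays finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(np.sum(y_onehot * np.log(y_prob), axis=1)))

# Hypothetical 3-class labels, one-hot encoded.
labels = np.array([0, 0, 0, 1, 1, 2])
y_onehot = np.eye(3)[labels]

class_freq = y_onehot.mean(axis=0)  # about [0.5, 0.333, 0.167]
constant_pred = np.tile(class_freq, (len(labels), 1))
uniform_pred = np.full((len(labels), 3), 1 / 3)

print(multiclass_logloss(y_onehot, constant_pred))
print(multiclass_logloss(y_onehot, uniform_pred))
print(multiclass_logloss(y_onehot, 1 - y_onehot))  # completely wrong answers
```

Predicting the class frequencies beats the uniform constant, and the completely wrong prediction gets a very large but finite loss thanks to the clipping.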

Area Under Curve (AUC ROC)

The ROC (Receiver Operating Characteristic) curve and AUC are often used to evaluate a binary classifier. The horizontal axis is the false positive rate (FPR) and the vertical axis is the true positive rate (TPR).

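Next, consider four points and one line in the ROC plot. The first point, (0,1), means FPR=0 and TPR=1, so FN (false negatives) = 0 and FP (false positives) = 0: this is a perfect classifier that classifies every sample correctly. The second point, (1,0), means FPR=1 and TPR=0; by the same reasoning this is the worst possible classifier, since it manages to avoid every correct answer. The third point, (0,0), means FPR=TPR=0, that is FP = TP = 0, so the classifier predicts every sample as negative. Similarly, at the fourth point, (1,1), the classifier predicts every sample as positive. The line is the diagonal from (0,0) to (1,1), which corresponds to random guessing. From this analysis, the closer the ROC curve is to the top-left corner, the better the classifier.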

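How the ROC curve is generated: sort the test samples by their score, the predicted probability of being positive, in descending order. Then, going from high to low, take each score in turn as the threshold: a test sample is classified as positive when its probability of being positive is greater than or equal to the threshold, and as negative otherwise. Each threshold yields one (FPR, TPR) pair, which is one point on the ROC curve. Repeat for every threshold.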
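
The procedure above can be sketched in a few lines of NumPy. The labels and scores are hypothetical, and the area is computed with the trapezoidal rule, which is one common choice.

```python
import numpy as np

def roc_points(y_true, scores):
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)

    # Start at (0, 0): a threshold above every score predicts no positives.
    points = [(0.0, 0.0)]
    for threshold in np.unique(scores)[::-1]:  # distinct scores, high to low
        predicted_pos = scores >= threshold
        tpr = np.sum(predicted_pos & (y_true == 1)) / n_pos
        fpr = np.sum(predicted_pos & (y_true == 0)) / n_neg
        points.append((float(fpr), float(tpr)))
    return points

def area_under_curve(points):
    # Trapezoidal rule over consecutive (FPR, TPR) points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points)
print(area_under_curve(points))
```

For this example 8 of the 9 positive-negative pairs are ranked correctly, and the area comes out as 8/9, which matches the pairwise-ranking reading of AUC.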

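AUC (Area Under Curve) is the area under the ROC curve. AUC is used as an evaluation criterion because the ROC curves themselves often do not show clearly which classifier is better, whereas AUC is a single number and a classifier with a larger AUC performs better.
Random predictions lead to AUC = 0.5.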

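An advantage of ROC: the ROC curve stays the same when the distribution of positive and negative samples in the test set changes.
[Figure: ROC curves (a, c) and Precision-Recall curves (b, d)]
In the figure above, (a) and (c) are ROC curves, and (b) and (d) are Precision-Recall curves. (a) and (b) show the classifier's results on the original test set (balanced positives and negatives); (c) and (d) show the results after the number of negative samples in the test set is increased to 10 times the original. The ROC curve is essentially unchanged, while the Precision-Recall curve changes considerably.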
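
This behaviour can be reproduced with a small simulation. The sketch below uses synthetic Gaussian scores (an assumption), repeats the negatives 10 times, and summarizes the Precision-Recall curve with `average_precision_score` as a stand-in for looking at the curve itself.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical balanced test set: positives tend to score higher than negatives.
pos_scores = rng.normal(1.0, 1.0, size=500)
neg_scores = rng.normal(0.0, 1.0, size=500)

def evaluate(neg_copies):
    # Repeat the negative samples to change the class balance only.
    negs = np.tile(neg_scores, neg_copies)
    y_true = np.concatenate([np.ones(len(pos_scores), dtype=int),
                             np.zeros(len(negs), dtype=int)])
    scores = np.concatenate([pos_scores, negs])
    return roc_auc_score(y_true, scores), average_precision_score(y_true, scores)

auc_balanced, ap_balanced = evaluate(neg_copies=1)
auc_skewed, ap_skewed = evaluate(neg_copies=10)

print(auc_balanced, auc_skewed)  # ROC AUC stays the same
print(ap_balanced, ap_skewed)    # average precision drops
```

TPR and FPR are each computed within one class, so scaling the number of negatives leaves them unchanged; precision mixes both classes and therefore falls as negatives are added.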