1. Introduction
  1.1. Welcome
  1.2. What is machine learning
  1.3. Supervised Learning
  1.4. Unsupervised Learning
2. Linear Regression with One Variable
  2.1. Model representation
  2.2. Cost function
  2.3. Cost function intuition
  2.4. Gradient descent
  2.5. Gradient descent for linear regression
3. Linear Regression with Multiple Variables
  3.1. Multiple features
  3.2. Gradient descent for multiple variables
  3.3. Gradient descent in practice I: Feature Scaling
  3.4. Gradient descent in practice II: Learning rate
  3.5. Features and polynomial regression
  3.6. Normal equation
  3.7. Normal equation and non-invertibility
4. Logistic Regression
  4.1. Classification
  4.2. Hypothesis Representation
  4.3. Decision boundary
  4.4. Cost function
  4.5. Simplified cost function and gradient descent
  4.6. Advanced optimization
  4.7. Multi-class classification: One-vs-all
5. Regularization
  5.1. The problem of overfitting
  5.2. Cost function
  5.3. Regularized linear regression
  5.4. Regularized logistic regression
6. Neural Networks: Representation
  6.1. Non-linear hypotheses
  6.2. Neurons and the brain
  6.3. Model representation
  6.4. Examples and intuitions
  6.5. Multi-class classification
7. Neural Networks: Learning
  7.1. Cost function
  7.2. Backpropagation algorithm
  7.3. Backpropagation intuition
  7.4. Implementation note: Unrolling parameters
  7.5. Gradient checking
  7.6. Random initialization
  7.7. Putting it together
  7.8. Backpropagation example: Autonomous driving
8. Advice for Applying Machine Learning
  8.1. Deciding what to try next
  8.2. Evaluating a hypothesis
  8.3. Model selection and training/validation/test sets
  8.4. Diagnosing bias vs. variance
  8.5. Regularization and bias/variance
  8.6. Learning curves
  8.7. Deciding what to try next (revisited)
9. Machine Learning System Design
  9.1. Prioritizing what to work on: Spam classification example
  9.2. Error analysis
  9.3. Error metrics for skewed classes
  9.4. Trading off precision and recall
  9.5. Data for machine learning
10. Support Vector Machines
  10.1. Optimization objective
  10.2. Large Margin Intuition
  10.3. The mathematics behind large margin classification
  10.4. Kernels
  10.5. Using an SVM
11. Clustering
  11.1. Unsupervised learning introduction
  11.2. K-means algorithm
  11.3. Optimization Objective
  11.4. Random initialization
  11.5. Choosing the number of clusters
12. Dimensionality Reduction
  12.1. Motivation I: Data Compression
  12.2. Motivation II: Data Visualization
  12.3. Principal Component Analysis problem formulation
  12.4. Principal Component Analysis algorithm
  12.5. Reconstruction from compressed representation
  12.6. Choosing the Number of Principal Components
  12.7. Advice for applying PCA
13. Anomaly Detection
  13.1. Problem motivation
  13.2. Gaussian distribution
  13.3. Algorithm
  13.4. Developing and evaluating an anomaly detection system
  13.5. Anomaly detection vs. supervised learning
  13.6. Choosing what features to use
  13.7. Multivariate Gaussian distribution
  13.8. Anomaly detection using the multivariate Gaussian distribution
14. Recommender Systems
  14.1. Problem formulation
  14.2. Content-based recommendations
  14.3. Collaborative filtering
  14.4. Collaborative filtering algorithm
  14.5. Vectorization: Low rank matrix factorization
  14.6. Finding related movies
  14.7. Implementational detail: Mean normalization
15. Large Scale Machine Learning
  15.1. Learning with large datasets
  15.2. Stochastic gradient descent
  15.3. Mini-batch gradient descent
  15.4. Stochastic gradient descent convergence
  15.5. Online learning
  15.6. Map-reduce and data parallelism
16. Application Example: Photo OCR
  16.1. Problem description and pipeline
  16.2. Sliding windows
  16.3. Getting lots of data: Artificial data synthesis
  16.4. Ceiling analysis: What part of the pipeline to work on next
17. Conclusion
  17.1. Summary and Thank you

Machine Learning Course Notes

Notes on the open machine learning course on Coursera, which is said to be a lighter version of the class as actually taught.
Image Loading


Introduction

Welcome

Machine Learning

  • Grew out of work in AI
  • New capability for computers

Examples:

  • Database mining
    Large datasets from growth of automation/web.
    E.g., Web click data, medical records, biology, engineering.
  • Applications we can’t program by hand
    E.g., autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), Computer Vision.
  • Self-customizing programs
    E.g., Amazon, Netflix product recommendations.
  • Understanding human learning (brain, real AI).

What is machine learning

Machine Learning definition

  • Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
  • Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Machine learning algorithms:

  • Supervised learning
  • Unsupervised learning

Others: Reinforcement learning, recommender systems.
Also talk about: Practical advice for applying learning algorithms.

Supervised Learning

“Right answers” given.
Regression: Predict continuous-valued output. (E.g., housing price prediction.)
Classification: Discrete-valued output (0 or 1). (E.g., breast cancer: malignant or benign.)

Unsupervised Learning

E.g., related news, genes, organizing computing clusters, social network analysis, market segmentation, astronomical data analysis, the cocktail party problem (separating the individual voices in a mixed audio source).


Linear Regression with One Variable

Model representation

Supervised Learning: Given the “right answer” for each example in the data.
Regression Problem: Predict real-valued output.
Notation:
m = Number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable

Cost function

Hypothesis: h_{\Theta}(x) = {\Theta}_0 + {\Theta}_1 x
{\Theta}_i's: Parameters

Image Loading

Idea:
Choose {\Theta}_0, {\Theta}_1 so that h_{\Theta}(x) is close to y for our training examples (x, y).
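For reference, the squared-error cost function the course minimizes is

J({\Theta}_0, {\Theta}_1) = \cfrac{1}{2m}\sum_{i=1}^{m}\left(h_{\Theta}(x^{(i)}) - y^{(i)}\right)^2

and the goal is {min}_{ {\Theta}_0, {\Theta}_1 } J({\Theta}_0, {\Theta}_1).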

Cost function intuition

Image Loading
Compute the error for many different pairs of parameter values and plot them; the trend becomes visible.
A 3-D representation of the error plot
The same information can be drawn as a contour figure: every point on one curve has the same error value.
Image Loading
Image Loading

Gradient descent

Have some function J({\Theta}_0, {\Theta}_1)

Want {min}_{ {\Theta}_0, {\Theta}_1 } J({\Theta}_0, {\Theta}_1)

Outline:

  • Start with some {\Theta}_0, {\Theta}_1
  • Keep changing {\Theta}_0, {\Theta}_1 to reduce J({\Theta}_0, {\Theta}_1) until we hopefully end up at a minimum

A slightly different starting point can lead to a different result: the local optima problem.
Image Loading
Image Loading

Algorithm:
Each parameter is nudged a little at a time until the values converge.
\alpha is the learning rate, a tuning parameter. Note: first compute all the update terms simultaneously, and only then update all the parameters simultaneously.
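The update rule from the lecture, applied simultaneously for j = 0 and j = 1:

\theta_j := \theta_j - \alpha\cfrac{\partial}{\partial\theta_j}J({\theta}_0, {\theta}_1)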
Image Loading
Image Loading

Gradient descent can converge to a local minimum even with the learning rate \alpha fixed: as the error shrinks, the size of each adjustment shrinks too, even though \alpha stays constant.
As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease \alpha over time.

Gradient descent for linear regression

Image Loading
Image Loading
Image Loading

“Batch” Gradient Descent: Each step of gradient descent uses all the training examples.
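A minimal Octave sketch of batch gradient descent for linear regression (function and variable names are illustrative; X is assumed to be the m x (n+1) design matrix with a leading column of ones):

    function theta = gradientDescent(X, y, theta, alpha, num_iters)
      m = length(y);
      for iter = 1:num_iters
        % compute all partial derivatives first, then update every theta_j simultaneously
        theta = theta - (alpha / m) * (X' * (X * theta - y));
      end
    end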


Linear Regression with Multiple Variables

Multiple features

Previously x had only one dimension; now consider the case where x has multiple dimensions.
Image Loading
Also called multivariate linear regression.

Gradient descent for multiple variables

(Note how the formulas below are vectorized. It pays to treat everything as vectors and matrices; the computation becomes much simpler.)
Image Loading
The one-variable case is just a special case of the multivariate one.
Image Loading

Gradient descent in practice I: Feature Scaling

Feature Scaling: Get every feature into approximately a -1 \leqslant x_i \leqslant 1 range.
E.g.:
With the original value ranges, the contour figure of J is stretched into a long, thin shape, and convergence takes a long time to reach the central minimum.
x_1 = size (0-2000 {feet}^2), x_2 = number of bedrooms (1-5)
Normalize the ranges:
x_1 = \cfrac{size({feet}^2)}{2000} and x_2 = \cfrac{number of bedrooms}{5}

Mean normalization: Replace x_i with x_i - {\mu}_i to make features have approximately zero mean (do not apply to x_0 = 1).
E.g.:
x_1 = \cfrac{size-1000}{2000} gives the range -0.5 \leqslant x_1 \leqslant 0.5
x_2 = \cfrac{bedrooms-2}{5} gives the range -0.5 \leqslant x_2 \leqslant 0.5
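A small Octave sketch combining mean normalization with range scaling (featureNormalize is an assumed helper name; apply it before adding the x_0 = 1 column):

    function [X_norm, mu, span] = featureNormalize(X)
      mu = mean(X);               % per-feature means (row vector)
      span = max(X) - min(X);     % per-feature ranges
      X_norm = (X - mu) ./ span;  % roughly -0.5 <= x <= 0.5 for each feature
    end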

Gradient descent in practice II: Learning rate

Debugging: Making sure gradient descent is working correctly.
Confirm that J(\theta) decreases on every iteration. One could stop iterating once J(\theta) falls below some threshold such as 10^{-3}, but that threshold is hard to pick; in practice it is usually better to just look at the plot.

How to choose the learning rate \alpha:
If \alpha is too small: slow convergence.
If \alpha is too large: J(\theta) may not decrease on every iteration; may not converge. (Slow convergence is also possible.)
To choose \alpha, try: …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …

Features and polynomial regression

Polynomial regression
A house's price may depend on several factors, such as frontage and depth.
The plot shows that the relationship between size and price is not linear. A quadratic fit is also unlikely, since its curve would eventually turn downward; a cubic relationship is more plausible.
Image Loading
The value range of {size}^3 is far larger than that of size, so it is not the best choice. Instead, we can take a square root.
So we can use h_{\theta}(x) = {\theta}_0 + {\theta}_1(size) + {\theta}_2\sqrt{size}
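In Octave, such features can be built by hand before training; a sketch (data is an assumed matrix whose first column is the house size):

    sz = data(:, 1);                          % house sizes
    X  = [ones(length(sz), 1), sz, sqrt(sz)]; % features [x_0, size, sqrt(size)]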

Normal equation

Besides gradient descent, which converges to the answer step by step, there is also a way to solve for \theta with a closed-form formula.
Normal equation: Method to solve for \theta analytically.
Since the goal is the \theta that minimizes J(\theta), the derivative (slope) of J(\theta) there is 0, and \theta can be solved for from that condition.
Image Loading

Example:
Image Loading

The formula is: \theta = {(X^TX)}^{-1}X^Ty
In Octave this is written pinv(X'*X)*X'*y

Comparison with gradient descent:
Image Loading

Normal equation and non-invertibility

If (X^TX) is non-invertible (singular/degenerate), Octave's pinv(X'*X)*X'*y still computes an answer.
But in that case it is worth checking the data. Possible causes:

  • Redundant features (linearly dependent). E.g., x_1 is the size in square feet and x_2 is the size in square meters.
  • Too many features. E.g., m \leqslant n; delete some features, or use regularization.

Logistic Regression

Classification

例子:

  • Email: Spam / Not Spam?
  • Online Transactions: Fraudulent(Yes / No)?
  • Tumor: Malignant / Benign?

y \in \{ 0,1 \}
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class” (e.g., malignant tumor)

Linear regression is not a good fit here; logistic regression should be used instead.

Hypothesis Representation

Logistic Regression Model
Want 0 \leqslant h_{\theta}(x) \leqslant 1

Sigmoid function/Logistic function: h_{\theta}(x) = g({\theta}^Tx) = \cfrac{1}{1+e^{ -{\theta}^Tx} }

Its curve looks like this:
Image Loading
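A one-line Octave sketch of the sigmoid:

    function g = sigmoid(z)
      g = 1 ./ (1 + exp(-z));  % element-wise, so z may be a scalar, vector, or matrix
    end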
Interpretation of Hypothesis Output
h_{\theta}(x) = estimated probability that y = 1 on input x

Example:
If x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ tumorSize \end{bmatrix}

and h_{\theta}(x) = 0.7,
tell the patient there is a 70% chance of the tumor being malignant.

"probability that y = 1, given x, parameterized by θ\theta"

h_{\theta}(x) = P(y=1|x; \theta)
P(y=0|x; \theta) = 1 - P(y=1|x; \theta)

Decision boundary

Logistic regression
h_{\theta}(x) = g({\theta}^Tx) and g(z) = \cfrac{1}{1+e^{-z} }.

Image Loading

Suppose we predict “y = 1” if h_{\theta}(x) \geqslant 0.5. Now g(z) \geqslant 0.5 exactly when z \geqslant 0; substituting, this is when {\theta}^Tx \geqslant 0.

Likewise, predict “y = 0” if h_{\theta}(x) < 0.5, i.e., when {\theta}^Tx < 0.

Decision Boundary
Image Loading
With {\theta}_0, {\theta}_1, {\theta}_2 set to -3, 1, 1, plot x_1 + x_2 = 3; that line is the decision boundary.

Non-linear decision boundaries
Image Loading
The same works for more complex formulas: with the values -1, 0, 0, 1, 1, the plotted curve is the decision boundary.

Cost function

Image Loading

Logistic regression cost function
Image Loading
The x-axis is h_{\theta}(x); the y-axis is the cost.
The y = 1 case
The y = 0 case

Simplified cost function and gradient descent

Logistic regression cost function
Image Loading
For comparison, linear regression's J(\theta) looks like this:
Linear regression J
Logistic regression's J(\theta) can be simplified to:
Image Loading
The goal is to find the \theta that minimizes J(\theta).

Gradient Descent
Image Loading

Image Loading
Algorithm looks identical to linear regression!
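A compact Octave sketch of the unregularized logistic regression cost and gradient (assuming a sigmoid helper like the one above):

    function [J, grad] = costFunction(theta, X, y)
      m = length(y);
      h = sigmoid(X * theta);                               % predictions in (0, 1)
      J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)); % simplified cost
      grad = (1 / m) * (X' * (h - y));                      % same form as linear regression
    end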

Advanced optimization

Optimization algorithm
Image Loading
These advanced functions are already implemented in Octave.
An example, in Octave:
Image Loading
Image Loading
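The typical fminunc call looks like this (using the costFunction sketch above; the 100-iteration limit and the parameter count are just examples):

    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(3, 1);   % one entry per parameter; 3 is illustrative
    [optTheta, functionVal, exitFlag] = ...
        fminunc(@(t) costFunction(t, X, y), initialTheta, options);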

Multi-class classification: One-vs-all

Multiclass classification
Examples:
Email foldering/tagging: Work, Friends, Family, Hobby
Medical diagnoses: Not ill, Cold, Flu
Weather: Sunny, Cloudy, Rain, Snow
Image Loading
Handling multiple classes calls for the one-vs-all method.

One-vs-all (one-vs-rest):
Each classifier separates one class from all the others; the figure below uses 3 classifiers.
Image Loading
Train a logistic regression classifier h_{\theta}^{(i)}(x) for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes h_{\theta}^{(i)}(x).
Image Loading
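A prediction sketch in Octave, assuming all_theta stores one trained classifier per row:

    [~, p] = max(sigmoid(X * all_theta'), [], 2);  % p(i) = class with the highest h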


Regularization

The problem of overfitting

Overfitting: If we have too many features, the learned hypothesis may fit the training set very well (J(\theta) \approx 0) but fail to generalize to new examples (e.g., predicting prices on new examples).

Example: Linear regression (housing prices)
Image Loading
Example: Logistic regression
Image Loading

Addressing overfitting:
With few features we can plot the hypothesis and inspect it. With too many features, there are two options:

  1. Reduce number of features.
    Manually select which features to keep.
    Model selection algorithm (later in course).
  2. Regularization.
    Keep all the features, but reduce the magnitude/values of the parameters {\theta}_j.
    Works well when we have a lot of features, each of which contributes a bit to predicting y.

Cost function

Intuition
Image Loading
To make this expression small, {\theta}_3 and {\theta}_4 must be small.

Regularization
Small values for parameters {\theta}_0, {\theta}_1, ..., {\theta}_n

  • “Simpler” hypothesis
  • Less prone to overfitting

Housing:

  • Features: x_1, x_2, ..., x_{100}
  • Parameters: {\theta}_0, {\theta}_1, {\theta}_2, ..., {\theta}_{100}

Since we don't know which \theta's should be shrunk, we shrink all of them.
Image Loading
If \lambda is very large, {\theta}_1 through {\theta}_n are all driven to nearly 0, so h_{\theta}(x) = {\theta}_0, a horizontal line. That doesn't cure overfitting; it causes underfitting instead. The algorithm itself still runs, but gradient descent may fail to converge.
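For reference, the regularized linear regression cost from the lecture:

J(\theta) = \cfrac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}{\theta}_j^2\right]

Note that {\theta}_0 is not regularized.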

Regularized linear regression

Regularized linear regression
Image Loading

Gradient descent
Image Loading
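Written out, the regularized update for j \geqslant 1 ({\theta}_0 keeps the unregularized update):

\theta_j := \theta_j\left(1 - \alpha\cfrac{\lambda}{m}\right) - \alpha\cfrac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}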

Normal equation
Image Loading

Non-invertibility (optional/advanced)
Image Loading

Regularized logistic regression

Regularized logistic regression
Image Loading

Gradient descent
Image Loading

Advanced optimization
Image Loading


Neural Networks: Representation

Non-linear hypotheses

With many features, the computation becomes hopelessly expensive.
Image Loading

一个Computer Vision例子:
Image Loading
Image Loading
Pick two fixed pixel locations in car and non-car images and plot their values on a pair of axes; after many images you get the plot on the lower left. But an image has far too many pixels for pairwise comparisons like this to be computable.
Image Loading

Neurons and the brain

Neural Networks
Origins: Algorithms that try to mimic the brain.
Was very widely used in 80s and early 90s; popularity diminished in late 90s.
Recent resurgence: State-of-the-art technique for many applications.
It is popular again now because computers have much more computing power.

The “one learning algorithm” hypothesis
If the nerve from the ear to the auditory cortex is cut and the nerve from the eye is rerouted there, the auditory cortex learns to see. Likewise, if the nerve from the hand to the somatosensory cortex is cut and the eye is wired in, the touch center also learns to see.

Sensor representations in the brain
Image Loading

Model representation

Neuron in the brain
Image Loading
Image Loading

Neuron model: Logistic unit
Image Loading
bias unit / input / output / weights = parameters

Neural Network
Image Loading
input layer / hidden layer / output layer
Image Loading

Forward propagation: Vectorized implementation
Image Loading
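An Octave sketch for a three-layer network (Theta1 and Theta2 include the bias columns; x is one example as a column vector):

    a1 = [1; x];            % input plus bias unit
    z2 = Theta1 * a1;
    a2 = [1; sigmoid(z2)];  % hidden-layer activations plus bias
    z3 = Theta2 * a2;
    h  = sigmoid(z3);       % h_Theta(x) = a3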

Neural Network learning its own features
Image Loading

Notice that the network's h_{\Theta}(x) is very similar to logistic regression's h_{\theta}(x); only the parameters \Theta differ.

Other network architectures
Image Loading

Examples and intuitions

Non-linear classification example: XOR/XNOR
Image Loading

Simple example: AND
Image Loading

Example: OR function
Image Loading

Negation:
Image Loading

Putting it together: x_1 XNOR x_2
Image Loading

Neural Network intuition
Image Loading

Handwritten digit classification
Image Loading

Multi-class classification

Multiple output units: One-vs-all
Image Loading
Image Loading


Neural Networks: Learning

Cost function

Neural Network (Classification)
Image Loading

Cost function
Image Loading

Backpropagation algorithm

Gradient computation
Image Loading
Image Loading

Gradient computation: Backpropagation algorithm
Image Loading

Backpropagation algorithm
Image Loading
\Delta denotes the total accumulated over all training examples.

Backpropagation intuition

Forward Propagation
Image Loading

What is backpropagation doing?
Simplify the network and think it through.
Image Loading
Image Loading

Implementation note: Unrolling parameters

Advanced optimization
Image Loading

Example
Image Loading

Learning Algorithm
Image Loading

Advantage of matrices: more convenient during forward propagation and backpropagation.
Advantage of vectors: required by the advanced optimization algorithms.
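The unrolling recipe from the lecture, for the example dimensions Theta1: 10x11, Theta2: 10x11, Theta3: 1x11:

    thetaVec = [Theta1(:); Theta2(:); Theta3(:)];   % matrices -> one long vector
    Theta1 = reshape(thetaVec(1:110),   10, 11);    % and back again
    Theta2 = reshape(thetaVec(111:220), 10, 11);
    Theta3 = reshape(thetaVec(221:231), 1, 11);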

Gradient checking

This can rule out almost every kind of bug.
Numerical estimation of gradients
Assume \theta is a real number.
Image Loading
\varepsilon must not be too small either, or numerical problems appear in the computation; {10}^{-4} is the usual choice.
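The two-sided estimate in Octave (a sketch; J is assumed to be a function handle that evaluates the cost, and n is the number of parameters):

    EPSILON = 1e-4;
    for i = 1:n
      thetaPlus  = theta;  thetaPlus(i)  = thetaPlus(i)  + EPSILON;
      thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - EPSILON;
      gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
    end
    % then check that gradApprox is approximately equal to the backprop gradient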

Parameter vector θ\theta
Image Loading
Image Loading
Image Loading

Random initialization

Initial value of \Theta
For gradient descent and advanced optimization methods, we need an initial value for \Theta.
optTheta = fminunc(@costFunction, initialTheta, options)

Zero initialization
Image Loading
This doesn't work for neural networks: all the units in a layer would compute identical values, wasting the layer's many nodes.

Random initialization: Symmetry breaking
Image Loading
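In Octave (matrix sizes follow the unrolling example above; INIT_EPSILON is a small assumed constant such as 0.01):

    INIT_EPSILON = 0.01;
    Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;  % uniform in [-eps, eps]
    Theta2 = rand(1, 11)  * (2 * INIT_EPSILON) - INIT_EPSILON;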

Putting it together

Training a neural network
1. Pick a network architecture (connectivity pattern between neurons)
Image Loading
2. Start training.
Image Loading
Image Loading

Backpropagation example: Autonomous driving

Dean Pomerleau
The lower left is the view from the car. In the upper left, the first bar is the steering direction chosen by the human driver; the second is the direction chosen by the algorithm.
Image Loading

Advice for Applying Machine Learning

Deciding what to try next

Debugging a learning algorithm:
Image Loading
Don't blindly try the plausible fixes below; implement a diagnostic first.

Machine learning diagnostic:
Diagnostic: A test that you can run to gain insight into what is/isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.
Diagnostics can take time to implement, but doing so can be a very good use of your time.

Evaluating a hypothesis

Evaluating your hypothesis
Even with a small training error, the hypothesis may be overfitting.
With many features, plotting is no longer possible. In that case, split the data roughly 70/30 into a training set and a test set. (As it turns out later, this is not the best practice; a three-way split is better.)

Training/testing procedure for linear regression
Image Loading

Training/testing procedure for logistic regression
Image Loading

Model selection and training/validation/test sets

Overfitting example
Once parameters {\theta}_0, {\theta}_1, …, {\theta}_4 were fit to some set of data (training set), the error of the parameters as measured on that data (the training error J(\theta)) is likely to be lower than the actual generalization error.

Model selection
Image Loading
d = degree of polynomial
Train each model to get a parameter vector \theta, then evaluate it on the test data and compute J; pick the model whose J is smallest. The problem: d ends up fit to the test data, so the measured error rate is overly optimistic. Not a good approach.

Evaluating your hypothesis
Image Loading
Split the data into three parts, roughly 60/20/20: the Training Set, Cross Validation Set, and Test Set.
m_{cv} denotes the number of cross validation examples.

Training/validation/test error
Image Loading

Model selection
Image Loading
After fitting \theta, test on the cross validation set and compute J; take the d with the smallest J.
Then run on the test set and compute the error rate.

Diagnosing bias vs. variance

A poorly performing model is either underfitting or overfitting.

Bias/variance
Image Loading
Image Loading
As d grows, the training error keeps shrinking, while the cross validation error has a minimum: to its left is underfitting, to its right overfitting.

Diagnosing bias vs. variance
Image Loading
To tell underfitting from overfitting, check whether both errors are high, or one is low while the other is high.

Regularization and bias/variance

Linear regression with regularization
Image Loading

Choosing the regularization parameter \lambda
Image Loading
Again, split the data into three sets.
Image Loading
Double \lambda each time, fit \theta, and compute J on the cross validation set; choose the \lambda with the smallest J. Then use that \lambda on the test set to compute the test error.

Bias/variance as a function of the regularization parameter \lambda
Image Loading
Small \lambda tends toward overfitting; large \lambda tends toward underfitting.
Real curves are a bit noisier.

Learning curves

Learning curves
Image Loading
The horizontal axis m is the training set size. J_{train} gradually increases and J_{cv} keeps decreasing; that is the healthy pattern.

High bias
Image Loading

High variance
Image Loading
J_{train} is too small and J_{cv} too large, with a big gap between them; both keep changing rather than flattening out.

Deciding what to try next(revisited)

Debugging a learning algorithm

  • Get more training examples - fixes high variance; learning curves help show whether this will pay off.
  • Try smaller sets of features - fixes high variance.
  • Try getting additional features - fixes high bias.
  • Try adding polynomial features - fixes high bias.
  • Try decreasing \lambda - fixes high bias.
  • Try increasing \lambda - fixes high variance.

Neural networks and overfitting
Image Loading

Machine Learning System Design

Prioritizing what to work on: Spam classification example

Building a spam classifier
Example.
Image Loading
Supervised learning.
Image Loading
How to spend your time to make it have low error?

  • Collect lots of data. E.g. “honeypot” project.
  • Develop sophisticated features based on email routing information (from email header).
  • Develop sophisticated features for message body, e.g. should “discount” and “discounts” be treated as the same word? How about “deal” and “Dealer”? Features about punctuation?
  • Develop sophisticated algorithm to detect misspellings (e.g. m0rtgage, med1cine, w4tches.)

Error analysis

Recommended approach

  • Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
  • Plot learning curves to decide if more data, more features, etc. are likely to help.
  • Error analysis: Manually examine the examples (in cross validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.

Error Analysis
Image Loading

The importance of numerical evaluation
Should discount/discounts/discounted/discounting be treated as the same word?
Can use “stemming” software (E.g. “Porter stemmer”)
universe/university (a case where stemming goes wrong.)

Error analysis may not be helpful for deciding if this is likely to improve performance. Only solution is to try it and see if it works.
Need numerical evaluation (e.g., cross validation error) of algorithm’s performance with and without stemming.
Without stemming: 5% error
With stemming: 3% error
Distinguish upper vs. lower case (Mom/mom): 3.2%

Error metrics for skewed classes

Cancer classification example
Image Loading
Logistic regression achieves 1% error. But code that simply predicts y = 0 (no cancer) for every input gets only 0.5% error.
A situation like this, where one class appears far more often than the other, is called skewed classes (here, far more negative examples than positive ones).
To measure performance on skewed classes we use precision/recall.
Precision/Recall
Image Loading
Higher is better for both values.

Trading off precision and recall

Trading off precision and recall
Image Loading
Adjust the threshold to trade precision against recall as the application requires.
A threshold of 0.9 corresponds to the first case, 0.3 to the second.
The curve can take many shapes.

F_1 Score (F score)
Image Loading
Taking a plain average is a bad idea: as in the earlier always-predict-y=1 case, recall can be very high and distort the verdict.
Compute the F_1 score instead. It effectively down-weights both numbers: if either one is small, the whole score is small.
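For reference, with precision P and recall R:

F_1 = 2\,\cfrac{PR}{P + R}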

Data for machine learning

Designing a high accuracy learning system
Image Loading
The plot shows that with little data the algorithms perform very differently, but with lots of data they all do about equally well.

Large data rationale
Image Loading
To check this conclusion, think about the counter-example and the useful test in this figure.
Image Loading
Many parameters and much data mean both bias and variance are low, so the result is good.

If (1) a human expert could make the prediction from this data, and (2) lots of data is available, a good result is all but assured.

Support Vector Machines


Optimization objective

Alternative view of logistic regression
Image Loading
Image Loading
The original curve becomes a piecewise-linear one. The x-axis is z; the y-axis is the cost.

Support vector machine
Image Loading
Cost Function
C can be thought of as 1/\lambda.

SVM hypothesis
Image Loading

Large Margin Intuition

Support Vector Machine
Image Loading
The x-axis is z; the y-axis is the cost.

SVM Decision Boundary
Image Loading
When C is very large, we push the bracketed part toward 0.

SVM Decision Boundary: Linearly separable case
Image Loading
The black decision boundary has a larger distance. That distance is called the margin of the support vector machine. Larger minimum distance from any of the training examples.

Large margin classifier in presence of outliers
Image Loading
When there are outliers: if C is very large, the boundary becomes the magenta line; with a more reasonable C, it stays the black line. C plays much the same role as \lambda did earlier.

The mathematics behind large margin classification

Vector Inner Product
Image Loading

SVM Decision Boundary
Image Loading
Image Loading
The margin is p. If we don't assume {\theta}_0 = 0, the green SVM decision boundary simply doesn't pass through the origin.

Kernels

Non-linear Decision Boundary
Image Loading

Kernel
Image Loading
Similarity functions; the Gaussian kernel.

Kernels and Similarity
Image Loading
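A small Octave sketch of the Gaussian kernel as a similarity function (x and the landmark l are column vectors):

    function f = gaussianKernel(x, l, sigma)
      f = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));  % 1 when x == l, tends to 0 far away
    end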

Example
Image Loading
Image Loading
The magenta point is close to l^{(1)}, so f_1 \approx 1; it is far from the other landmarks, so the other f's \approx 0.
The cyan point is far from everything, so all its f's \approx 0.
Since {\theta}_3 = 0, the boundary encloses the points close to l^{(1)} and l^{(2)}, as shown.

Choosing the landmarks
Image Loading
The l's are the landmarks; we simply place them at the training examples.

SVM with Kernels
Image Loading
The f on the right is the feature vector.
Image Loading

SVM parameters
Image Loading

Using an SVM

Image Loading

Kernel(similarity) functions:
Image Loading

Other choices of kernel
Image Loading

Multi-class classification
Image Loading

Logistic regression vs. SVMs
Image Loading

Clustering

Unsupervised learning starts here.

Unsupervised learning introduction

Supervised learning
Image Loading
Unsupervised learning
Image Loading
The difference is whether the data is labeled; here the algorithm finds structure by itself.
Applications of clustering
Image Loading
Some application examples of clustering algorithms.

K-means algorithm

Image Loading
Image Loading
Image Loading
Image Loading
K-means is an iterative algorithm:

  1. Cluster assignment step: assign every point to the group of its nearest centroid. (The initial centroids are random.)
  2. Move centroid step: move each centroid to the mean of the points in its group.
  3. Repeat the two steps above until the assignment stabilizes.

K-means algorithm
Image Loading
Image Loading
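A compact Octave sketch of the two alternating steps (X is m x n, centroids is K x n; names and loop bounds are illustrative):

    m = size(X, 1);
    for iter = 1:max_iters
      for i = 1:m                                 % cluster assignment step
        d = sum((centroids - X(i, :)) .^ 2, 2);   % squared distance to each centroid
        [~, idx(i)] = min(d);
      end
      for k = 1:K                                 % move centroid step
        centroids(k, :) = mean(X(idx == k, :), 1);
      end
    end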

K-means for non-separated clusters
Image Loading

Optimization Objective

K-means optimization objective
Knowing the objective helps check that K-means is working correctly, and helps find better clusters and avoid local optima.
Image Loading
The two steps of K-means amount to: first minimize the objective over the assignments c with \mu fixed, then minimize over \mu with c fixed.

Random initialization

Random initialization
Should have K < m.
Randomly pick K training examples.
Set {\mu}_1, ..., {\mu}_K equal to these K examples.
This can work very well or very badly.
Image Loading
Local optima can occur.
Image Loading
So K-means should be run many times, with different random initializations.
Image Loading

Choosing the number of clusters

The most common approach is still manual, e.g., looking at a plot.

One method is called the elbow method:
Run K-means with many different values of K, compute J, plot it, and look for the elbow. Sometimes the elbow is not obvious, as in the plot on the right.
Image Loading

Or, if the K-means clusters will be handed to someone downstream, judge K by how well that downstream use performs.
Sometimes, you're running K-means to get clusters to use for some later/downstream purpose. Evaluate K-means based on a metric for how well it performs for that later purpose.
Image Loading

Dimensionality Reduction

Motivation I: Data Compression

Image Loading
Image Loading
projection.

Motivation II: Data Visualization

Image Loading
Image Loading
Image Loading

Principal Component Analysis problem formulation

Principal Component Analysis(PCA) problem formulation
Image Loading
The projection error is the distance from a point's original position to its projected position.
PCA is not linear regression
Image Loading
Linear regression tries to shrink the vertical (y-direction) distance between points and the line, because it strives to make the expression in x best match y.
PCA tries to shrink the perpendicular distance between points and the line; there is no y, and all the x's are on an equal footing.

Principal Component Analysis algorithm

Data preprocessing
Image Loading
Principal Component Analysis(PCA) algorithm
Image Loading
Image Loading
Principal Component Analysis(PCA) algorithm summary
Image Loading
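The whole recipe in Octave (course version; X is the mean-normalized m x n data matrix, k the target dimension):

    Sigma = (1 / m) * (X' * X);  % covariance matrix
    [U, S, V] = svd(Sigma);
    Ureduce = U(:, 1:k);         % first k principal components
    Z = X * Ureduce;             % compressed representation, m x k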

Reconstruction from compressed representation

Image Loading
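In Octave, using Ureduce and Z from the sketch above, the approximate reconstruction is:

    X_approx = Z * Ureduce';  % back to n dimensions, m x n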

Choosing the Number of Principal Components

Image Loading
Image Loading
Image Loading
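The course's rule of thumb, using the diagonal matrix S returned by svd: pick the smallest k such that

\cfrac{\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}} \geqslant 0.99

i.e., 99% of the variance is retained.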

Advice for applying PCA

Supervised learning speedup
Image Loading
Run PCA only on the training data; after computing U_{reduce}, apply the same mapping to the other two sets.

Application of PCA
Compression

  • Reduce memory/disk needed to store data.
  • Speed up learning algorithm.

Visualization

Bad use of PCA: To prevent overfitting
Image Loading
Because PCA still loses some information.

PCA is sometimes used where it shouldn’t be
Compute with the original data first; only if that performs poorly and dimensionality reduction is genuinely needed should PCA be used.
Image Loading

Anomaly Detection

Problem motivation

Anomaly detection example
Image Loading
Image Loading
Image Loading
Extract features, build a model, score the test data with the model, and check whether the values fall within the normal range.

Gaussian distribution

Gaussian (Normal) distribution
Image Loading

Gaussian distribution example
Image Loading
The shaded area always integrates to 1.

Parameter estimation
Image Loading
Assuming the data was generated by a Gaussian distribution, estimate \mu and \sigma.
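The estimates are:

\mu_j = \cfrac{1}{m}\sum_{i=1}^{m}x_j^{(i)} \qquad \sigma_j^2 = \cfrac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2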

Algorithm

Density estimation
Image Loading

Anomaly detection algorithm
Image Loading
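In symbols: model p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2), and flag x as an anomaly if p(x) < \varepsilon.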

Anomaly detection example
Image Loading

Developing and evaluating an anomaly detection system

The importance of real-number evaluation
Image Loading

Aircraft engines motivating example
Image Loading

Algorithm evaluation
Image Loading

Anomaly detection vs. supervised learning

Image Loading
Image Loading

Choosing what features to use

Non-gaussian features
Image Loading
Check whether the histogram looks like a Gaussian distribution. If it does, use the feature as-is; if not, apply some transformation to the data to make it look Gaussian.
Image Loading

Error analysis for anomaly detection
Image Loading
For such a case, try to derive a new feature, such as x_2 here, from the data.

Monitoring computers in a data center
Choose features that might take on unusually large or small values in the event of an anomaly. The last two features below become very large under failure: if the CPU is busy while network traffic is low, something must be wrong.

  • x_1 = memory use of computer
  • x_2 = number of disk accesses/sec
  • x_3 = CPU load
  • x_4 = network traffic
  • x_5 = CPU load / network traffic
  • x_6 = {(CPU load)}^2 / network traffic

Multivariate Gaussian distribution

Motivating example: Monitoring machines in a data center
Image Loading
In this situation, with the ordinary Gaussian model the green X, which ought to be flagged as an error, still falls inside the normal range.

Multivariate Gaussian (Normal) distribution
Image Loading
In Octave: det(Sigma).
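For reference, the density is

p(x; \mu, \Sigma) = \cfrac{1}{ {(2\pi)}^{n/2}{|\Sigma|}^{1/2} }\exp\left(-\cfrac{1}{2}{(x-\mu)}^T{\Sigma}^{-1}(x-\mu)\right)

where |\Sigma| is the determinant computed by det(Sigma).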

Multivariate Gaussian(Normal) examples
Image Loading
Image Loading
Image Loading
Image Loading
Image Loading
Image Loading

Anomaly detection using the multivariate Gaussian distribution

Multivariate Gaussian(Normal) distribution
Image Loading

Anomaly detection with the multivariate Gaussian
Image Loading

Relationship to original model
Image Loading
The original model corresponds to a multivariate Gaussian whose \Sigma has nonzero entries only on the main diagonal.
Image Loading

Recommender Systems

Problem formulation

Example: Predicting movie ratings
Image Loading
Features matter a great deal.

Content-based recommendations

The name comes from knowing the features of the content, e.g., how much romance and how much action a movie contains.
Content-based recommender systems
Image Loading
n_u is the number of users, n_m the number of movies. As usual, we add the feature x_0 = 1. The example below predicts Alice's rating for Cute puppies of love.

Problem formulation
Image Loading
This is a linear regression problem.

Optimization objective:
Image Loading

Optimization algorithm:
Image Loading

Collaborative filtering

This variant does not need to know the features of the content; it does feature learning itself. Every user helps the algorithm perform a little better, hence “collaborative”.
Problem motivation
Image Loading
Each user tells us her preferences across the movies, i.e., her \theta; all the {\theta}_0's are 0.

Optimization algorithm
Image Loading

Collaborative filtering
Image Loading

Collaborative filtering algorithm

Collaborative filtering optimization objective
Image Loading

Collaborative filtering algorithm
Image Loading

Vectorization: Low rank matrix factorization

Image Loading
Image Loading

Image Loading

Implementational detail: Mean normalization

Users who have not rated any movies
Image Loading
Only the last (regularization) part takes effect, and minimizing it makes \theta all zeros, so every prediction is 0 as well. Not great. Better to recommend titles with high average ratings, using mean normalization.

Mean Normalization:
Image Loading

Large Scale Machine Learning

Learning with large datasets

Machine learning and data
We saw this example before. It’s not who has the best algorithm that wins. It’s who has the most data.

Learning with large datasets
Image Loading
A single step like this requires a sum over all m examples.

Stochastic gradient descent

Linear regression with gradient descent
Image Loading
A quick review. With large data, this flavor of gradient descent is called batch gradient descent.
Linear regression is the running example, but the idea also applies to other algorithms based on gradient descent over a training set, such as logistic regression, neural networks, etc.

Image Loading
No more summation: shuffle the examples randomly and look at just one example per update.

Stochastic gradient descent
Image Loading
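One pass of the inner loop in Octave (a sketch; the lecture repeats such passes 1-10 times over the shuffled data):

    for i = randperm(m)                               % visit examples in random order
      h = X(i, :) * theta;                            % prediction from one example
      theta = theta - alpha * (h - y(i)) * X(i, :)';  % update using that example alone
    end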

Mini-batch gradient descent

Mini-batch gradient descent
Batch gradient descent: Use all m examples in each iteration.
Stochastic gradient descent: Use 1 example in each iteration.
Mini-batch gradient descent: Use b examples in each iteration.
Image Loading
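A sketch of the mini-batch loop in Octave, with b = 10 as in the lecture:

    b = 10;
    for i = 1:b:(m - b + 1)
      batch = i:(i + b - 1);
      grad = (1 / b) * X(batch, :)' * (X(batch, :) * theta - y(batch));
      theta = theta - alpha * grad;
    end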

Stochastic gradient descent convergence

Checking for convergence
Image Loading
Image Loading
A smaller \alpha gives less jitter and a smoother curve.
Plotting every 5000 examples instead of every 1000 also smooths the curve.
If the cost curve is rising, use a smaller \alpha.

Stochastic gradient descent
Image Loading

Online learning

Online learning
Image Loading
x holds features such as the origin/destination and the price we offer; y is 1 or 0 according to whether the customer chooses our shipping service. We look at one customer at a time.

Other online learning example:
Image Loading

Map-reduce and data parallelism

Map-reduce
Image Loading
Image Loading

Map-reduce and summation over the training set
Image Loading

Multi-core machines
Image Loading

Application Example: Photo OCR

Problem description and pipeline

The Photo OCR problem
Image Loading
Photo Optical Character Recognition.

Photo OCR pipeline
Image Loading
Image Loading
Each stage has 1-5 engineers on it.

Sliding windows

Image Loading
Because text regions in a photo can have varying aspect ratios, pedestrian detection is somewhat easier than text detection.

Supervised learning for pedestrian detection
Image Loading

Sliding window detection
Image Loading
Define a window and slide it across the image a little at a time, looking for pedestrians. The distance moved each time is the step size (stride).
Image Loading
Then gradually use larger windows; each window is rescaled to 82x36 for processing.
Image Loading
Continue until all pedestrians are found.

Text detection
Image Loading
Image Loading
The white regions are where the algorithm believes text is; the gray ones are borderline. Then apply an expansion step and discard regions with implausible aspect ratios, e.g., tall thin vertical strips.

1D Sliding window for character segmentation
Image Loading
Ask whether there is a blank gap in the middle of the patch.

Getting lots of data: Artificial data synthesis

Character recognition
Image Loading

Artificial data synthesis for photo OCR
Image Loading
Image Loading
To get a much larger training set, take font libraries and paste the characters onto random backgrounds; this method creates data from nothing.

Synthesizing data by introducing distortions
Image Loading
Distort the existing characters; this method grows a large sample from a small one.

Or an audio example:
Image Loading

Image Loading

Discussion on getting more data
Image Loading

First do a sanity check with learning curves to confirm that having more data would help. Then ask yourself seriously what it would take to get ten times as much data as you currently have; not always, but sometimes, you may be surprised by how easy that turns out to be.

Ceiling analysis: What part of the pipeline to work on next

Don't waste time: estimate each stage's potential payoff in advance and budget your time accordingly.
Estimating the errors due to each component (ceiling analysis)
Image Loading
For the stage being measured, manually supply the correct answers for every preceding stage, then see how the overall system accuracy changes; this shows roughly how much work each stage deserves.

Another ceiling analysis example
Image Loading
Image Loading

Conclusion

Summary and Thank you

Image Loading


(¦3[▓▓]  So happy!

Image Loading