Natural Language Processing Course Notes
Notes from the Natural Language Processing course on Coursera.
Michael Collins, Columbia University
PS. The instructor's slides are very tastefully done: simple, effective, elegantly restrained. His voice is pleasant too!
I paused for three months partway through and have now resumed studying.
The course outline is:
Topics covered include:
- Language modeling.
- Hidden Markov models, and tagging problems.
- Probabilistic context-free grammars, and the parsing problem.
- Statistical approaches to machine translation.
- Log-linear models, and their application to NLP problems.
- Unsupervised and semi-supervised learning in NLP.
Introduction to Natural Language Processing
What is Natural Language Processing(NLP)?
Key Applications
- Machine Translation: e.g., Google Translation from Arabic
- Information Extraction:
- Text Summarization:
- Dialogue Systems
Basic NLP Problems
- Tagging
N is noun, V is verb, and so on.
NA is not an entity, SC is the start of a company name, CC is a continuation of a company name, SL is the start of a location, CL is a continuation of a location.
- Parsing
Why is NLP Hard?
[example from L.Lee]
“At last, a computer that understands you like your mother”
Ambiguity
- (*) It understands you as well as your mother understands you
- It understands (that) you like your mother
- It understands you as well as it understands your mother
1 and 3: Does this mean well, or poorly?
Ambiguity at Many Levels
At the acoustic level (speech recognition):
- “…a computer that understands you like your mother”
- “…a computer that understands you lie cured mother”
At the syntactic level:
Different structures lead to different interpretations.
At the semantic (meaning) level:
Two definitions of “mother”
- a woman who has given birth to a child
- a stringy slimy substance consisting of yeast cells and bacteria; is added to cider or wine to produce vinegar
This is an instance of word sense ambiguity
- They put money in the bank = buried in mud?
- I saw her duck with a telescope
At the discourse (multi-clause) level:
- Alice says they’ve built a computer that understands you like your mother
- But she … (two possible continuations)
… doesn’t know any details
… doesn’t understand me at all
This is an instance of anaphora, where “she” co-refers to some other discourse entity.
What will this course be about?
Course Coverage
- NLP sub-problems: part-of-speech tagging, parsing, word-sense disambiguation, etc.
- Machine learning techniques: probabilistic context-free grammars, hidden markov models, estimation/smoothing techniques, the EM algorithm, log-linear models, etc.
- Applications: information extraction, machine translation, natural language interfaces…
A Syllabus
- Language modeling, smoothed estimation
- Tagging, hidden Markov models
- Statistical parsing
- Machine translation
- Log-linear models, discriminative methods
- Semi-supervised and unsupervised learning for NLP
Prerequisites
- Basic linear algebra, probability, algorithms
- Programming skills (this course requires Python or Java)
Books
Comprehensive notes for the course will be provided at
http://www.cs.columbia.edu/~mcollins
Additional useful background:
Jurafsky and Martin:
Speech and Language Processing (2nd Edition)
The Language Modeling Problem
Introduction to the language modeling problem
Why on earth would we want to do this?
- Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)
For example, a spoken utterance might be recognized as either "recognize speech" or "wreck a nice beach". With a language model, we know which of the two is more likely.
- The estimation techniques developed for this problem will be VERY useful for other problems in NLP.
A Naive Method
The simplest method: count how many times each sentence appears in the corpus and divide by the total number of sentences to get its probability.
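A minimal sketch of this naive estimator in Python (the toy corpus is my own; the point is that any sentence not seen in training gets probability zero):

```python
from collections import Counter

def naive_lm(corpus_sentences):
    """Estimate p(sentence) = count(sentence) / total number of sentences."""
    counts = Counter(corpus_sentences)
    total = len(corpus_sentences)
    return {s: c / total for s, c in counts.items()}

corpus = ["the dog barks", "the dog barks", "the cat meows"]
p = naive_lm(corpus)
print(p["the dog barks"])            # 2/3
print(p.get("the dog laughs", 0.0))  # 0.0: any unseen sentence gets probability zero
```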
Markov Processes
Markov Processes
Trigram models are built on Markov processes.
- Consider a sequence of random variables X1, X2, …, Xn. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100).
- Our goal: model P(X1 = x1, X2 = x2, …, Xn = xn)
First-Order Markov Processes
The first step is exact (the chain rule): P(X1 = x1, …, Xn = xn) = P(X1 = x1) x ∏ P(Xi = xi | X1 = x1, …, Xi-1 = xi-1).
The second step is the Markov assumption: P(Xi = xi | X1 = x1, …, Xi-1 = xi-1) = P(Xi = xi | Xi-1 = xi-1).
Second-Order Markov Processes
Modeling Variable Length Sequences
Trigram Language Models
They work well and are easy to implement.
Trigram Language Models
An Example
The Trigram Estimation Problem
The problem is sparse data: there are far too many parameters to estimate reliably.
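A sketch of the maximum-likelihood trigram estimate q(w|u,v) = count(u,v,w) / count(u,v); the `*`/`STOP` padding follows the lecture's conventions, while the toy corpus is made up:

```python
from collections import defaultdict

def train_trigram_ml(sentences):
    """Maximum-likelihood trigram estimates q(w | u, v) = count(u,v,w) / count(u,v)."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for s in sentences:
        words = ["*", "*"] + s.split() + ["STOP"]
        for i in range(2, len(words)):
            u, v, w = words[i - 2], words[i - 1], words[i]
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    def q(w, u, v):
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] > 0 else 0.0
    return q

q = train_trigram_ml(["the dog barks", "the dog laughs"])
print(q("barks", "the", "dog"))  # 0.5
print(q("runs", "the", "dog"))   # 0.0: sparse data, unseen trigrams get zero
```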
Evaluating language models: perplexity
Perplexity
Some Intuition about Perplexity
Perplexity is roughly the effective number of choices per word (the average branching factor). It measures how good the model is; the smaller the value, the better.
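A small sketch of computing perplexity from a model's log probabilities; the uniform 1/120 example only mirrors the "effective vocabulary size" intuition above, the numbers are illustrative:

```python
import math

def perplexity(log2_probs, num_words):
    """Perplexity = 2 ** (-(1/M) * sum of log2 sentence probabilities),
    where M is the total number of words in the test set."""
    avg_log2 = sum(log2_probs) / num_words
    return 2 ** (-avg_log2)

# If a model assigns probability 1/120 to every word of a 100-word test set,
# perplexity comes out as 120, the "effective vocabulary size".
log2_probs = [100 * math.log2(1 / 120)]   # one 100-word "sentence"
print(perplexity(log2_probs, 100))        # ~120 (up to floating point)
```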
Typical Values of Perplexity
Some History: “Syntactic Structures”
Parameter Estimation in Language Models
Linear Interpolation
The Bias-Variance Trade-Off
Linear Interpolation
Proof:
How to estimate the values?
Allowing the λ's to vary
Use different λ values in different situations (for example, depending on how frequent the conditioning context is).
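A sketch of the linearly interpolated estimate with fixed λ's; in the course the λ's are estimated on held-out data (and, as noted above, may vary by bucket), so the values here are placeholders:

```python
def ml(count_num, count_den):
    """q_ML = numerator count / denominator count (0 if the denominator is 0)."""
    return count_num / count_den if count_den > 0 else 0.0

def interpolated_q(w, u, v, tri, bi, uni, total_words, lambdas=(0.5, 0.3, 0.2)):
    """Linearly interpolated trigram estimate:
         q(w|u,v) = l1*q_ML(w|u,v) + l2*q_ML(w|v) + l3*q_ML(w)
    tri, bi, uni are trigram/bigram/unigram count dicts; the lambdas must be
    non-negative and sum to 1 (fixed placeholders here)."""
    l1, l2, l3 = lambdas
    return (l1 * ml(tri.get((u, v, w), 0), bi.get((u, v), 0))
            + l2 * ml(bi.get((v, w), 0), uni.get(v, 0))
            + l3 * ml(uni.get(w, 0), total_words))
```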
Discounting Methods
Discounting Methods
Katz Back-Off Models (Bigrams)
Katz Back-Off Models (Trigrams)
Summary
Tagging Problems, and Hidden Markov Models
The Tagging Problem
Part-of-Speech Tagging
The main problem is still ambiguity.
Named Entity Recognition
Named Entity Extraction as Tagging
Our Goal
Two Types of Constraints
Generative models, and the noisy-channel model, for supervised learning
Supervised Learning Problems - Conditional models
In this example, x and y are roughly:
x(i) = the dog laughs, y(i) = DT NN VB
Conditional models are also called discriminative models.
Generative Models
The line marked “Note” is just Bayes' rule.
Generative models are more flexible and general-purpose than discriminative models.
Decoding with Generative Models
Since p(x) does not depend on y, it can be dropped from the argmax.
Hidden Markov Model (HMM) taggers
Basic definitions
Hidden Markov Models
Trigram Hidden Markov Models(Trigram HMMs)
q denotes the trigram parameters, and e denotes the emission parameters.
An Example
Why the Name?
That is:
Parameter estimation
Smoothed Estimation
Recall the linear interpolation method from earlier.
q combines the trigram, bigram, and unigram ML estimates.
The problem with this approach: e(x|y) = 0 for all y if x is never seen in the training data. If even one word in the sentence was never seen, the whole probability p becomes 0.
Dealing with Low-Frequency Words
Example:
Profits soared at Boeing Co. , easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
The word Mulally may never have appeared in the training set. In that case
e(Mulally|y) = 0 for all tags y, so p(x1, …, xn, y1, …, yn+1) = 0 for all tag sequences y1, …, yn+1.
Solution:
Example:
This gives emission parameters such as e(firstWord|NA) or e(initCap|SC). The drawback is that the word classes have to be designed by hand.
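A rough sketch of the rare-word mapping; the class names echo the lecture's examples (firstWord, initCap, allCaps, …), but the exact tests are my own simplified guesses:

```python
import re

def map_rare_word(word, is_first_word, known_words):
    """Map a low-frequency word to a pseudo-word class before training/decoding.
    Class names roughly follow the lecture's examples; the tests are simplified guesses."""
    if word in known_words:
        return word                     # frequent word: keep as-is
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"            # e.g. "90"
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNum"           # e.g. "1990"
    if re.fullmatch(r"\d+", word):
        return "othernum"
    if word.isupper():
        return "allCaps"                # e.g. "BBN"
    if is_first_word:
        return "firstWord"              # first word of the sentence
    if word[0].isupper():
        return "initCap"                # e.g. "Mulally"
    return "lowercase"

known = {"profits", "soared", "at", "easily", "topping"}
print(map_rare_word("Mulally", False, known))   # -> initCap
print(map_rare_word("1991", False, known))      # -> fourDigitNum
```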
The Viterbi algorithm
The Problem
Brute Force Search is Hopelessly Inefficient
The Viterbi Algorithm: Definition
Example:
k ranges from 1 to n; u ranges over S(k-1); v ranges over S(k).
A Recursive Definition
Dynamic programming.
Justification for the Recursive Definition:
The Viterbi Algorithm
The Viterbi Algorithm with Backpointers
This is just the concrete dynamic programming implementation, with backpointers recording the chosen tags.
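A sketch of the Viterbi algorithm with backpointers for a trigram HMM; the calling conventions for q and e (and the `*`/`STOP` symbols) follow the definitions above, everything else is my own scaffolding:

```python
def viterbi(sentence, tags, q, e):
    """Viterbi decoding for a trigram HMM.
    sentence:   list of words x_1..x_n
    tags:       the tag set S
    q(v, w, u): trigram parameter, probability of tag v given previous tags w, u
    e(x, v):    emission parameter e(x | v)
    Returns the highest-probability tag sequence."""
    n = len(sentence)

    def S(k):                      # allowed tags at position k (S_0 = S_-1 = {'*'})
        return ["*"] if k <= 0 else tags

    pi = {(0, "*", "*"): 1.0}      # pi(k, u, v): max probability of a sequence
    bp = {}                        # ending with tags u, v at positions k-1, k
    for k in range(1, n + 1):
        for u in S(k - 1):
            for v in S(k):
                best_score, best_w = -1.0, None
                for w in S(k - 2):
                    score = pi[(k - 1, w, u)] * q(v, w, u) * e(sentence[k - 1], v)
                    if score > best_score:
                        best_score, best_w = score, w
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_w

    # Pick the best final tag pair, including the STOP transition.
    best, (u, v) = -1.0, (None, None)
    for uu in S(n - 1):
        for vv in S(n):
            score = pi[(n, uu, vv)] * q("STOP", uu, vv)
            if score > best:
                best, (u, v) = score, (uu, vv)

    # Follow backpointers to recover the full tag sequence.
    y = [None] * (n + 1)
    y[n], y[n - 1] = v, u
    for k in range(n - 2, 0, -1):
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:]
```

The three nested loops over tags at each position give the running time discussed in the next slide.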
The Viterbi Algorithm: Running Time
Summary
Pros and Cons
Parsing, and Context-Free Grammars
An Introduction to the Parsing Problem
Parsing(Syntactic Structure)
Syntactic Formalisms
- Work in formal syntax goes back to Chomsky’s PhD thesis in the 1950s.
Book: “Syntactic Structures”
- Examples of current formalisms: minimalism, lexical functional grammar (LFG), head-driven phrase-structure grammar (HPSG), tree adjoining grammars (TAG), categorial grammars.
The context-free grammars covered today are the foundation of these modern formalisms.
Data for Parsing Experiments
This is supervised learning; the training data is annotated by hand (the Penn WSJ Treebank).
The Information Conveyed by Parse Trees
1.Part of speech for each word
2.Phrases
3.Useful Relationships
An Example Application: Machine Translation
With syntactic structure available, the task becomes much easier.
Context Free Grammars
Context-Free Grammars Definition
A Context-Free Grammar for English
Left-Most Derivations
An Example:
Properties of CFGs
CFG - Context Free Grammars
An Example of Ambiguity
The Problem with Parsing: Ambiguity
There are 14 possible interpretations.
A Brief(!) Sketch of the Syntax of English
A Brief Overview of English Syntax
A Fragment of a Noun Phrase Grammar
You can try combining the rules to generate many different trees.
Prepositions, and Prepositional Phrases
An Extended Grammar
Verbs, Verb Phrases, and Sentences
PPs Modifying Verb Phrases
Complementizers, and SBARs
More Verbs
Coordination
We’ve Only Scratched the Surface…
Examples of Ambiguous Structures
Sources of Ambiguity - Part-of-Speech ambiguity
Sources of Ambiguity - Prepositional Phrase Attachment
Mentioned earlier.
Another example: two analyses for “John was believed to have been shot by Bill”: “by Bill” modifies either “shot” or “believed”.
People tend to prefer attaching prepositions to the most recent verb in the sentence.
Sources of Ambiguity - Noun Premodifiers
In the left tree, “fast” modifies “car mechanic”; in the right tree, “fast car” modifies “mechanic”.
Probabilistic Context-Free Grammars (PCFGs)
A probability is attached to each rule, so different parse trees receive different probabilities.
The basic model is old and simple and does not perform well on its own, but it is the foundation for other methods; with a few refinements it performs very well.
Probabilistic Context-Free Grammars (PCFGs)
A Probabilistic Context-Free Grammar (PCFG)
An Example:
Properties of PCFGs
Deriving a PCFG from a Treebank
What a treebank is:
Computing a PCFG from a treebank
PCFGs
Booth and Thompson (1973) showed that a CFG with rule probabilities correctly defines a distribution over the set of derivations provided that:
- The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.
- A technical condition on the rule probabilities ensuring that the probability of the derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern.)
The CKY Algorithm for parsing with PCFGs
Parsing with a PCFG
Brute-force search is still completely infeasible.
Chomsky Normal Form
The line defining R means that every rule either rewrites a non-terminal into exactly two non-terminals or into a single terminal. If a rule has more than two symbols on its right-hand side, introduce new variables and split it into several rules, each with exactly two symbols.
With this normal form we can use dynamic programming.
A Dynamic Programming Algorithm
π(i, j, X) is the maximum probability over all ways that words i through j can be derived from the non-terminal X.
s is the split point.
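A sketch of the CKY dynamic program for a PCFG in Chomsky normal form; π(i, j, X) and the split point s are as defined above, while the toy grammar at the bottom is made up for illustration:

```python
from collections import defaultdict

def cky(sentence, binary_rules, unary_rules, start="S"):
    """CKY parsing for a PCFG in Chomsky normal form.
    binary_rules: dict X -> list of (Y, Z, prob) for rules X -> Y Z
    unary_rules:  dict (X, word) -> prob for rules X -> word
    Returns (max probability, parse tree) for the start symbol."""
    n = len(sentence)
    pi = defaultdict(float)   # pi[(i, j, X)] = max prob of X spanning words i..j
    bp = {}

    # Base case: spans of length 1.
    for i in range(1, n + 1):
        for (X, word), p in unary_rules.items():
            if word == sentence[i - 1]:
                pi[(i, i, X)] = p
                bp[(i, i, X)] = (word,)

    # Recursive case: longer spans, choosing the best rule and split point s.
    for length in range(1, n):
        for i in range(1, n - length + 1):
            j = i + length
            for X, rules in binary_rules.items():
                for (Y, Z, p) in rules:
                    for s in range(i, j):
                        score = p * pi[(i, s, Y)] * pi[(s + 1, j, Z)]
                        if score > pi[(i, j, X)]:
                            pi[(i, j, X)] = score
                            bp[(i, j, X)] = (Y, Z, s)

    def build(i, j, X):
        entry = bp.get((i, j, X))
        if entry is None:
            return None
        if len(entry) == 1:
            return (X, entry[0])
        Y, Z, s = entry
        return (X, build(i, s, Y), build(s + 1, j, Z))

    return pi[(1, n, start)], build(1, n, start)

# Tiny toy grammar (assumed, not from the slides).
binary = {"S": [("NP", "VP", 1.0)], "NP": [("DT", "NN", 1.0)]}
unary = {("DT", "the"): 1.0, ("NN", "dog"): 1.0, ("VP", "barks"): 1.0}
print(cky(["the", "dog", "barks"], binary, unary))
```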
The Full Dynamic Programming Algorithm
A Dynamic Programming Algorithm for the Sum
Summary
- PCFGs augment CFGs by including a probability for each rule in the grammar.
- The probability for a parse tree is the product of probabilities for the rules in the tree.
- To build a PCFG-based parser:
1.Learn a PCFG from a treebank.
2.Given a test data sentence, use the CKY algorithm to compute the highest probability tree for the sentence under the PCFG.
Weaknesses of PCFGs
PCFGs = Probabilistic Context-Free Grammars
Plain PCFGs achieve only about 72% accuracy, whereas modern parsers reach roughly 92%.
- Lack of sensitivity to lexical information
- Lack of sensitivity to structural frequencies
Lack of Sensitivity to Lexical Information
Every rule decision is assumed to be independent, and the rule probabilities are completely insensitive to the actual words.
PP Attachment Ambiguity
In this example, the single bolded rule by itself decides which tree wins.
Coordination Ambiguity
In this example the two trees have exactly the same probability and cannot be distinguished, because they use the same rules, just arranged differently.
Lack of Sensitivity to Structural Frequencies
Structural Preferences: Close Attachment
“in Africa” can modify either “president” or “company”; people prefer the closer attachment.
Previous example: John was believed to have been shot by Bill
Here the low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.
Lexicalized PCFGs
Lexicalization of a treebank
Heads in Context-Free Rules
Rules which Recover Heads
The original treebank contains no head annotations; these rules are used to recover them.
An Example for NPs:
An Example for VPs:
Adding Headwords to Trees
Lexicalized probabilistic context-free grammars
Chomsky Normal Form
Covered earlier.
Lexicalized Context-Free Grammars in Chomsky Normal Form
The rules now also indicate which child the head comes from.
Example:
Parameters in a Lexicalized PCFG
The number of parameters grows considerably.
Parsing with Lexicalized CFGs
Parameter estimation in lexicalized probabilistic context-free grammars
An Example:
A Model from Charniak (1997)
ML = maximum likelihood
qML = count(A) / count(B)
Each group of parameters sums to 1.
The estimates are smoothed by backing off step by step to more general distributions.
Other Important Details
Binarize.
Accuracy of lexicalized probabilistic context-free grammars
Method 1: Constituents/Precision and Recall
Evaluation: Representing Trees as Constituents
Precision and Recall
Results
Method 2: Dependency Accuracies
Example:
The three columns are: head word (h), modifier word (w), and rule.
Dependency Accuracies:
Strengths and Weaknesses of Modern Parsers:
Summary
- Key weakness of PCFGs: lack of sensitivity to lexical information.
- Lexicalized PCFGs:
Lexicalize a treebank using head rules.
Estimate the parameters of a lexicalized PCFG using smoothed estimation.
- Accuracy of lexicalized PCFGs: around 88% in recovering constituents or dependencies.
Introduction to Machine Translation(MT)
Challenges in machine translation
Challenges: Lexical Ambiguity
When translating from English to Spanish, a single English word with several senses must be translated differently depending on which sense is intended.
Challenges: Differing Word Orders
The classic example: English and Japanese put words in different orders.
Syntactic Structure is not Preserved Across Translations
Example from Dorr et al., 1999
The verb “float” is no longer the main verb in the translation: here “floated” corresponds to “floating” and “into” corresponds to “entered”.
Syntactic Ambiguity Causes Problems
Example from Dorr et al., 1999
Did John hit the dog with a stick, or does the dog have the stick? That is, does “with the stick” modify “hit” or “dog”?
Pronoun Resolution
Example from Dorr et al., 1999
“it” can refer to the computer or to the data.
Classical machine translation
These classical approaches are all rule-based.
Direct Machine Translation
- Translation is word-by-word.
- Very little analysis of the source text(e.g., no syntactic or semantic analysis).
- Relies on a large bilingual dictionary. For each word in the source language, the dictionary specifies a set of rules for translating that word.
- After the words are translated, simple reordering rules are applied (e.g., move adjectives after nouns when translating from English to French).
An Example:
Some Problems with Direct Machine Translation:
Transfer-Based Approaches
Three phases in translation:
Analysis: Analyze the source language sentence; for example, build a syntactic analysis of the source language sentence.
Transfer: Convert the source-language parse tree to a target-language parse tree.
Generation: Convert the target-language parse tree to an output sentence.
- The “parse trees” involved can vary from shallow analyses to much deeper analyses (even semantic representations).
- The transfer rules might look quite similar to the rules for direct translation systems. But they can now operate on syntactic structures.
- It’s easier with these approaches to handle long-distance reorderings.
- The Systran systems are a classic example of this approach.
Interlingua-Based Translation
Two phases in translation:
- Analysis: Analyze the source language sentence into a (language-independent) representation of its meaning.
- Generation: Convert the meaning representation into an output sentence.
One advantage: if we want to build a translation system that translates between N languages, we need to develop N analysis and N generation systems. With a transfer-based system, we'd need to develop O(N^2) sets of translation rules.
Disadvantage: What would a language-independent representation look like?
- How do we represent different concepts in an interlingua?
- Different languages break down concepts in quite different ways:
German has two words for wall: one for an internal wall, one for a wall that is outside.
Japanese has two words for brother: one for an elder brother, one for a younger brother.
Spanish has two words for leg: pierna for a human's leg, pata for an animal's leg or the leg of a table.
- An interlingua might end up simply being an intersection of these different ways of breaking down concepts, but that doesn't seem very satisfactory…
A Pyramid Representing These Three Approaches
A brief introduction to statistical MT
- Parallel corpora are available in several language pairs.
- Basic idea: use a parallel corpus as a training set of translation examples.
- Classic example: IBM work on French-English translation, using the Canadian Hansards. (1.7 million sentences of 30 words or less in length). (Early 1990s)
- This is the approach Google Translate uses.
- Idea goes back to Warren Weaver (1949): suggested applying statistical and cryptanalytic techniques to translation.
Warren Weaver, 1949, in a letter to Norbert Wiener:
"… one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”
The Noisy Channel Model
e is an English sentence; f is a French sentence.
p(e): a probability to any sentence in English.
p(f|e): the conditional probability of any French sentence given an English sentence.
Example from Koehn and Knight tutorial:
The IBM Translation Models
First Generation Translation Systems.
IBM Models 1 through 5 were built; Model 2 already works quite well.
IBM Model 1
IBM Model 1: Alignments - Introduction
- How do we model p(f|e)?
- The English sentence e has l words e1, …, el; the French sentence f has m words f1, …, fm.
- An alignment a identifies which English word each French word originated from.
- Formally, an alignment is a1, …, am, where each aj ∈ {0, 1, …, l}.
- There are (l+1)^m possible alignments.
- e.g., l = 6, m = 7,
e = And the program has been implemented
f = Le programme a ete mis en application
One alignment is {2, 3, 4, 5, 6, 6, 6}
Another (bad!) alignment is {1, 1, 1, 1, 1, 1, 1}
Alignments in the IBM Models
For example: p(le chien aboie, {1,2,3} | the dog barks, 3)
A By-Product: Most Likely Alignments
The IBM models are now used mainly not for translation itself but to find the most likely alignments.
An Example Alignment
IBM Model 1: Alignments
Step 1:
IBM Model 1: Translation Probabilities
Step 2:
For example:
e = NULL the dog barks, so l = 3
a = {1, 2, 3}
f = le chien aboie, so m = 3
p(f|a,e,m) = t(le|the) x t(chien|dog) x t(aboie|barks)
Another example:
IBM Model 1: The Generative Process
Generation:
An Example Lexical Entry
IBM Model 2
IBM Model 2 - Introduction
Only the alignment part (the model for a) differs from Model 1.
An Example
IBM Model 2: The Generative Process
Recovering Alignments
EM Training of Models 1 and 2
Parameter estimation in models 1 and 2.
The Parameter Estimation Problem
Parameter Estimation if the Alignments are Observed
If the alignments a are observed, we only need to take counts and divide.
Parameter Estimation with the EM Algorithm
Iterative improvement.
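A minimal sketch of EM training for IBM Model 1 (the t(f|e) parameters only); the uniform initialization and the toy sentence pairs are my own assumptions, not the exact setup from the lecture:

```python
from collections import defaultdict

def ibm_model1_em(sentence_pairs, iterations=10):
    """EM training of IBM Model 1 translation parameters t(f|e).
    sentence_pairs: list of (english_words, foreign_words); a NULL word is
    prepended to each English sentence."""
    # Initialize t(f|e) uniformly over all co-occurring word pairs.
    t = defaultdict(float)
    f_vocab = {f for _, fs in sentence_pairs for f in fs}
    init = 1.0 / len(f_vocab)
    for es, fs in sentence_pairs:
        for e in ["NULL"] + es:
            for f in fs:
                t[(f, e)] = init

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e, f)
        total = defaultdict(float)   # expected counts c(e)
        # E-step: collect expected counts using the current t(f|e).
        for es, fs in sentence_pairs:
            es = ["NULL"] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(e, f)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) = c(e, f) / c(e).
        for (e, f) in count:
            t[(f, e)] = count[(e, f)] / total[e]
    return t

pairs = [(["the", "dog"], ["le", "chien"]), (["the", "cat"], ["le", "chat"])]
t = ibm_model1_em(pairs)
print(round(t[("le", "the")], 3))   # should end up relatively high after EM
```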
Justification for the Algorithm
Summary
Phrase-based Translation Models
Second Generation Translation Systems.
Late 1990s.
Google Translate.
Learning Phrases From Alignments
Phrase-Based Models
An Example (from tutorial by Koehn and Knight):
Calculate the phrase-based (PB) lexicon pairs using IBM Model 2.
Representation as Alignment Matrix
Finding Alignment Matrices
Problems:
- Alignments are often “noisy”.
- “Many to One” instead of “Many to Many”.
Solution:
- Step 1: train IBM model 2 for p(f|e), and come up with most likely alignment for each (e,f) pair.
- Step 2: train IBM model 2 for p(e|f) and come up with most likely alignment for each (e,f) pair.
- We now have two alignments:
take intersection of the two alignments as a starting point.
Heuristics for Growing Alignments
- Only explore alignment in union of p(f|e) and p(e|f) alignments.
- Add one alignment point at a time.
- Only add alignment points which align a word that currently has no alignment.
- At first, restrict ourselves to alignment points that are “neighbors” (adjacent or diagonal) of current alignment points.
- Later, consider other alignment points. (A sketch of this intersect-then-grow idea follows below.)
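A simplified sketch of the intersect-then-grow idea; it is not the exact heuristic from the lecture, just the intersection of the two directional alignments followed by greedy addition of neighboring points from the union that align a currently unaligned word:

```python
def grow_alignments(align_fe, align_ef):
    """Symmetrize two directional alignments (sets of (e_index, f_index) pairs):
    start from their intersection and greedily add neighboring points from the
    union that align a word with no alignment yet."""
    inter = align_fe & align_ef
    union = align_fe | align_ef
    alignment = set(inter)
    neighbors = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]

    added = True
    while added:
        added = False
        for (i, j) in sorted(union - alignment):
            aligned_e = {a for (a, _) in alignment}
            aligned_f = {b for (_, b) in alignment}
            # Only add points adjacent/diagonal to an existing point that
            # align a currently unaligned word.
            is_neighbor = any((i + di, j + dj) in alignment for di, dj in neighbors)
            if is_neighbor and (i not in aligned_e or j not in aligned_f):
                alignment.add((i, j))
                added = True
    return alignment

fe = {(1, 1), (2, 2), (3, 3)}
ef = {(1, 1), (2, 2), (2, 3)}
print(sorted(grow_alignments(fe, ef)))   # starts from {(1,1),(2,2)} and grows
```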
Extracting Phrase Pairs from the Alignment Matrix
Probabilities for Phrase Pairs
An Example Phrase Translation Table
A Phrase-Based Model
A Sketch
The 6 comes from skipping over six words; it contributes a negative value, a penalty that discourages the phrase order from jumping around too much between the two languages.
Decoding of Phrase-based Translation Models
Definition of the Decoding Problem
Phrase-based Translation
Phrase-based Models: Definitions
Phrase-based Translation: Definitions
s: start point.
t: end point.
e: a sequence of one or more English words.
Definitions
Valid Derivations
The third condition is the distortion limit: a hard constraint on how far phrases can move.
The distance between two consecutive (s, t, e) triples, measured from the end of one to the start of the next, cannot be too large.
This hard constraint is rather rigid and has since been superseded.
The number of derivations |Y(x)| grows exponentially with the sentence length.
Scoring Derivations
An Example:
We must also take this critic seriously.
- Language Model: how likely this English sentence is.
log q(we|*,*) + log q(must|*,we) + log q(also|we,must) + log q(the|must,also) + …
- Phrase Scores: how well this English sentence matches the German source.
g(1,3,we must also) + g(7,7,take) + g(4,5,the critic) + …
- Distortion Scores: a penalty when the ordering diverges too much from the source.
|1-1| x ζ + |3+1-7| x ζ + |7+1-4| x ζ + |5+1-6| x ζ
The Decoding Algorithm
The decoding problem above is NP-hard, so we need a heuristic algorithm.
Decoding Algorithm: Definitions
For example: (we, must, 1100000, 2, -2.5)
The bit string in the middle records which source-language words have been translated so far.
States, and the Search Space
The search space can be drawn as a tree like this.
Transitions
ph(q) is the set of phrases that can be applied next from state q.
The next function
We define next(q,p)
to be the state formed by combining state q with phrase p.
The Equality Function
Checks whether two states are equal.
The Decoding Algorithm
Definition of Add(Q, q’, q, p)
If two derivations reach the same state, keep the one with the higher score.
Definition of beam(Q)
The “arg” in the slide is a mistake and should not be there.
States whose scores are too far below the current best score are simply discarded.
Log-linear Models
Two Example Problems
The Language Modeling Problem
Trigram Models:
Trigram models use too little of the available information.
A Naive Approach:
A smoothed model that weights each source of information; the number of parameters quickly grows out of control.
A Second Example: Part-of-Speech Tagging
So the question is: what should the tag ?? be?
Again, a lot of potentially useful information goes unused.
Features in Log-Linear Models
For Problem 1
The General Problem
- We have some input domain X
- Have a finite label set Y
- Aim is to provide a conditional probability p(y|x) for any x, y where x ∈ X and y ∈ Y
Language Modeling
- x is a “history” w1, w2, …, wi-1, e.g.,
Third, the notion “grammatical in English” cannot be identified in any way with the notion “high order of statistical approximation to English”. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical
- y is an “outcome” wi
Feature Vector Representations
Language Modeling
f1 is effectively a unigram feature, f2 a bigram feature, f3 a trigram feature, and f4 a skip-bigram feature.
Defining Features in Practice
Only include trigram features for trigrams seen in the training data.
Including all possible trigrams would lead to:
Too many features;
Many not worth including in the model.
For Problem 2
The POS-Tagging Example
i: the position being tagged.
The Full Set of Features in Ratnaparkhi, 1996
The Final Result
In the figure, green is x and red is y.
The feature vectors are typically sparse: far more 0's than 1's.
Definition of Log-linear Models
Parameter Vectors
Features were introduced above; now we introduce the parameter vector.
The score is the inner (dot) product of the two vectors.
Language Modeling Example
p(model|x;v) = e^5.6 / (e^5.6 + e^1.5 + e^4.5 + e^-3.2 + …)
p(the|x;v) = e^-3.2 / (e^5.6 + e^1.5 + e^4.5 + e^-3.2 + …)
Log-Linear Models
Summary of the definition:
After this normalization, the probabilities satisfy the required conditions: every p is greater than 0, and they sum to 1.
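A sketch of the log-linear probability computed exactly as defined above (dot product, exponentiate, normalize); the dict-based feature/weight encoding and the toy label set are my own choices, and the mapping of the 1.5 and 4.5 scores to particular words is invented:

```python
import math

def log_linear_prob(x, y, labels, f, v):
    """p(y | x; v) = exp(v . f(x, y)) / sum over y' of exp(v . f(x, y')).
    f(x, y) returns a dict {feature: value}; v is a dict of weights."""
    def score(label):
        feats = f(x, label)
        return sum(v.get(k, 0.0) * val for k, val in feats.items())
    scores = {label: score(label) for label in labels}
    m = max(scores.values())                      # subtract max for numerical stability
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    Z = sum(exps.values())
    return exps[y] / Z

# Usage mirroring the example above: v.f scores of 5.6, -3.2, 1.5, 4.5.
labels = ["model", "the", "is", "of"]
scores = {"model": 5.6, "the": -3.2, "is": 1.5, "of": 4.5}
f = lambda x, y: {("score", y): 1.0}
v = {("score", y): s for y, s in scores.items()}
print(log_linear_prob("some history", "model", labels, f, v))
```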
Why the name?
Parameter Estimation in Log-linear Models
Estimate the parameter vector v.
Maximum-Likelihood Estimation
Choose the v that maximizes the total log-likelihood of the training examples; this is the naive statement of the objective.
L(v) is an upside-down-bell-shaped (concave) function, so it is relatively easy to optimize, for example with gradient ascent.
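A sketch of the gradient of the log-likelihood (empirical feature counts minus expected feature counts), which is what gradient ascent climbs; the dict-based representation is my own, and real toolkits use the conjugate-gradient or L-BFGS methods mentioned below:

```python
import math

def log_likelihood_gradient(data, labels, f, v):
    """Gradient of L(v) for a log-linear model:
       dL/dv_k = sum_i f_k(x_i, y_i) - sum_i sum_y p(y|x_i; v) f_k(x_i, y)
    data is a list of (x, y) pairs; f(x, y) returns {feature: value}."""
    grad = {}
    for x, y in data:
        # Empirical counts.
        for k, val in f(x, y).items():
            grad[k] = grad.get(k, 0.0) + val
        # Expected counts under the current model.
        scores = {lab: sum(v.get(k, 0.0) * val for k, val in f(x, lab).items())
                  for lab in labels}
        m = max(scores.values())
        exps = {lab: math.exp(s - m) for lab, s in scores.items()}
        Z = sum(exps.values())
        for lab in labels:
            p = exps[lab] / Z
            for k, val in f(x, lab).items():
                grad[k] = grad.get(k, 0.0) - p * val
    return grad
```

One gradient-ascent step is then v[k] += learning_rate * grad[k] for every feature k; with the Gaussian-prior regularization of the next section, v[k]/σ² is additionally subtracted from each component.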
Calculating the Maximum-Likelihood Estimates
Gradient Ascent Methods
Conjugate Gradient Methods
In practice, standard toolkits already implement these optimization algorithms.
Smoothing/Regularization in Log-linear Models
Smoothing in Log-Linear Models
Regularization
Experiments with Regularization
Experiments with Gaussian Priors
Log-linear Models for Tagging(MEMMs)
Applications for log-linear models.
MEMMs: Maximum-Entropy Markov Models
Recap: The Tagging Problem
There are several kinds:
Part-of-Speech Tagging
Named Entity Recognition
Named Entity Extraction as Tagging
Our Goal
Methods we have already covered:
Hidden Markov Models and the Viterbi algorithm.
Log-Linear Taggers
Independence Assumptions in Log-linear Taggers
Log-Linear Models for Tagging
How to model p(t[1:n]|w[1:n])?
Example: the dog barks, with tag sequence D N V (padded with the start symbols * *):
P(D N V|the dog barks) = P(D|the dog barks, * *) x P(N|the dog barks, * D) x P(V|the dog barks, D N)
Features in Log-Linear Taggers
Representation: Histories
Recap: Feature Vector Representations in Log-Linear Models
An Example
A feature like f2 would be very hard to incorporate into an HMM.
The Full Set of Features in [(Ratnaparkhi, 96)]
Parameters in Log-linear Models
Recap: Log-Linear Models
Training the Log-Linear Model
The Viterbi Algorithm for Log-linear Taggers
The Viterbi Algorithm
A Recursive Definition
The only difference from the HMM version is in the q term.
The Viterbi Algorithm with Backpointers
An Example Application
FAQ Segmentation: McCallum et al.
- McCallum et al. compared HMM and log-linear taggers on a FAQ segmentation task
- Main point: in an HMM, modeling p(word | tag) is difficult in this domain
The task is to automatically tag which lines of the text are head, question, or answer.
<head>X-NNTP-POSTER: NewsHound v1.33
FAQ Segmentation: Line Features
begins-with-number
FAQ Segmentation: The Log-Linear Tagger
This yields feature strings like:
“tag=question;prev=head;begins-with-number”
FAQ Segmentation: An HMM Tagger
For example, for the line:
<question>2.6) What configuration of serial cable should I use
we get:
- First solution: break each line into its individual words, i.e., have a language model for each tag.
- Second solution
FAQ Segmentation: Results
Method | Precision | Recall |
---|---|---|
ME-Stateless(Maximum Entropy Stateless) | 0.038 | 0.362 |
TokenHMM | 0.276 | 0.140 |
FeatureHMM | 0.413 | 0.529 |
MEMM | 0.867 | 0.681 |
- Precision and recall results are for recovering segments
- ME-stateless is a log-linear model that treats every sentence separately (no dependence between adjacent tags)
- TokenHMM is an HMM with the first solution we've just seen
- FeatureHMM is an HMM with the second solution we've just seen
- MEMM is a log-linear trigram tagger (MEMM stands for “Maximum-Entropy Markov Model”)
Summary
Log-Linear Models for History-based Parsing
This approach has a much wider range of applications.
Conditional History-based Models
Recap: Log-Linear Taggers
A General Approach: (Conditional) History-Based Models
- We’ve shown how to define p(t[1:n]|w[1:n]) where t[1:n] is a tag sequence.
- How do we define p(T|S) if T is a parse tree (or another structure)? (We use the notation S = w[1:n])
No Markov assumption is made here, so dynamic programming can no longer be used.
Representing Trees as Decision Sequences (Step 1)
Ratnaparkhi’s Parser: Three Layers of Structure
- Part-of-speech tags
- Chunks
- Remaining structure
Layer 1: Part-of-Speech Tags
Layer 2: Chunks
QP: numeric phrases.
Layer 3: Remaining Structure
After the first two layers we arrive here:
Pick the leftmost constituent:
The sentence is not finished yet, so Check = NO:
Continue (several steps omitted):
The current constituent is complete, so Check = YES:
Then pick the leftmost item that has not yet received a Join or Start decision:
Check is again YES:
Continue with the leftmost item:
Check = YES:
The Final Sequence of decisions
Features (Step 3)
Applying a Log-Linear Model
Layer 3: Join or Start
- Looks at head word, constituent (or POS) label, and start/join annotation of n’th tree relative to the decision, where n = −2, −1
- Looks at head word, constituent (or POS) label of n’th tree relative to the decision, where n = 0, 1, 2
- Looks at bigram features of the above for (-1,0) and (0,1)
- Looks at trigram features of the above for (-2,-1,0), (-1,0,1) and (0, 1, 2)
- The above features with all combinations of head words excluded
- Various punctuation features
Layer 3: Check=NO or Check=YES
- A variety of questions concerning the proposed constituent
Beam Search (Step 4)
The Search Problem
As before, dynamic programming cannot be used.
Beam search yields many candidate parse trees, each with its own probability.
Unsupervised Learning: Brown Clustering
The first unsupervised learning algorithm covered in this course.
Word Cluster Representations
The Brown Clustering Algorithm
- Input: a (large) corpus of words
- Output 1: a partition of words into word clusters
- Output 2 (a generalization of 1): a hierarchical word clustering
Example Clusters (from Brown et al, 1992)
Cluster results reported in the paper; they look very good!
Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
A Sample Hierarchy (from Miller et al., NAACL 2004)
The hierarchy reported in the paper.
It is a binary (Huffman-style) tree, with left branches labeled 1 and right branches labeled 0; the bit strings encode the hierarchy.
lawyer 1000001101000
The Brown Clustering Algorithm
The Intuition
- Similar words appear in similar contexts
- More precisely: similar words have similar distributions of words to their immediate left and right
The Formulation
The Brown Clustering Model
Measuring the Quality of C
For example (the first row shows each word's cluster id):
1 | 2 | 3 | 1 | 2 |
---|---|---|---|---|
the | dog | saw | the | cat |
n(1,2) = 2
n(2) = 2
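A sketch of the mutual-information term of the Brown clustering objective, using the n(c,c') and n(c) counts illustrated above; the constant entropy term is omitted and the exact normalization is my own simplification:

```python
import math
from collections import defaultdict

def brown_quality(cluster_sequence):
    """Mutual-information part of the Brown clustering objective:
       Quality(C) ~ sum over (c,c') of p(c,c') * log2( p(c,c') / (p(c) p(c')) )  (+ const)
    p(c,c') is estimated from adjacent cluster pairs, p(c) from cluster counts."""
    n_pair = defaultdict(int)
    n_single = defaultdict(int)
    for c in cluster_sequence:
        n_single[c] += 1
    for c, c2 in zip(cluster_sequence, cluster_sequence[1:]):
        n_pair[(c, c2)] += 1
    total_pairs = sum(n_pair.values())
    total_single = sum(n_single.values())
    quality = 0.0
    for (c, c2), n in n_pair.items():
        p_cc = n / total_pairs
        p_c = n_single[c] / total_single
        p_c2 = n_single[c2] / total_single
        quality += p_cc * math.log(p_cc / (p_c * p_c2), 2)
    return quality

# The example above: "the dog saw the cat" with clusters 1 2 3 1 2.
print(brown_quality([1, 2, 3, 1, 2]))
```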
A First Algorithm
A Second Algorithm
Clusters in NE Recognition
Miller et al, NAACL 2004
"Name Tagging with Word Clusters and Discriminative Training"
Scott Miller, Jethran Guinness, Alex Zamanian
At a recent meeting, we presented name-tagging technology to a potential user. The technology had performed well in formal evaluations, had been applied successfully by several research groups, and required only annotated training examples to configure for new name classes. Nevertheless, it did not meet the user’s needs.
To achieve reasonable performance, the HMM-based technology we presented required roughly 150,000 words of annotated examples, and over a million words to achieve peak accuracy. Given a typical annotation rate of 5,000 words per hour, we estimated that setting up a name finder for a new problem would take four person days of annotation work – a period we considered reasonable. However, this user’s problems were too dynamic for that much setup time. To be useful, the system would have to be trainable in minutes or hours, not days or weeks.
The kinds of features used:
1. Tag + PrevTag
Evaluation of the results:
Active Learning -> Choose to label the most “useful” examples at each stage.
Global Linear Models(GLMs)
Techniques
- So far:
Smoothed estimation
Probabilistic context-free grammars
Log-linear models
Hidden markov models
The EM Algorithm
History-based models
- Today:
Global linear models
A Brief Review of History-based Methods
Supervised Learning in Natural Language
The Models so far
The first category includes HMMs and PCFGs; the second includes log-linear tagging.
Example 1: PCFGs
S
S->NP VP
q(S->NP VP) = p(S->NP VP|S)
Example 2: Log-linear Taggers
A New Set of Techniques: Global Linear Models
A New Framework: Global Linear Models
Intuition
- We’ll move away from history-based models
No idea of a “derivation”, or attaching probabilities to “decisions”
- Instead, we'll have feature vectors over entire structures: “global features”
- First piece of motivation:
Freedom in defining features
A Need for Flexible Features
Three Components of Global Linear Models
For example, x is a sentence and y is a parse tree; the candidates are a set of alternative parse trees.
Component 1:
Features:
Feature Vectors:
Feature design is very important, since the features are what will be fed into the learning algorithm.
Component 2:
- GEN enumerates a set of candidates GEN(x) for an input x
- Some examples of how GEN(x) can be defined:
Parsing: GEN(x) is the set of parses for x under a grammar
Any task: GEN(x) is the top N most probable parses under a history-based model
Tagging: GEN(x) is the set of all possible tag sequences with the same length as x
Translation: GEN(x) is the set of all possible English translations for the French sentence x
Component 3:
Putting it all Together
Parsing Problems in This Framework: Reranking Problems
Reranking Approaches to Parsing
The baseline parser is, for example, a lexicalized PCFG.
The Representation
Some Features in the Paper:
From [Collins and Koo, 2005]:
The following types of features were included in the model. We will use the rule VP -> PP VBD NP NP SBAR with head VBD as an example. Note that our baseline parser produces syntactic trees with headword annotations.
Rules These include all context-free rules in the tree, for example VP -> PP VBD NP NP SBAR
Bigrams These are adjacent pairs of non-terminals to the left and right of the head. As shown, the example rule would contribute the bigrams (Right, VP, NP, NP), (Right, VP, NP, SBAR), and (Right, VP, SBAR, STOP) to the right of the head, and (Left, VP, PP, STOP) to the left of the head.
Grandparent Rules Same as Rules, but also including the non-terminal above the rule.
Two-level Rules Same as Rules, but also including the entire rule above the rule.
Parameter Estimation Method 1: A Variant of the Perceptron Algorithm
A Variant of the Perceptron Algorithm
T: the number of iterations; typically small, e.g., 5, 10, or 20.
Any features seen in the true candidate structure have their parameter values increased, whereas any features seen in the incorrectly proposed structure have their parameter values decreased. It is a kind of reinforcing step.
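A sketch of this perceptron variant for global linear models; GEN, f, and the gold structures follow the framework above, while the dict-based feature vectors are my own representation:

```python
def perceptron_train(examples, GEN, f, T=5):
    """Perceptron training for a global linear model.
    examples: list of (x, y) pairs with y the gold structure
    GEN(x):   returns the candidate structures for x
    f(x, y):  returns a dict {feature: value} (the global feature vector)"""
    v = {}
    def score(x, y):
        return sum(v.get(k, 0.0) * val for k, val in f(x, y).items())
    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=lambda cand: score(x, cand))  # current best guess
            if z != y:
                # Boost features of the gold structure, penalize the wrong guess.
                for k, val in f(x, y).items():
                    v[k] = v.get(k, 0.0) + val
                for k, val in f(x, z).items():
                    v[k] = v.get(k, 0.0) - val
    return v
```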
Perceptron Experiments: Parse Reranking
Parsing the Wall Street Journal Treebank
Training set = 40,000 sentences, test = 2,416 sentences
Generative model(Collins 1999): 88.2% F-measure
Reranked model: 89.5% F-measure(11% relative error reduction)
- Results from Charniak and Johnson, 2005
Improvement from 89.7% (baseline generative model) to 91.0% accuracy
Gains from improved n-best lists, better features, better baseline model
Summary
- A new framework: global linear models
GEN, f, v
- There are several ways to train the parameters v:
Perceptron
Boosting
Log-linear models (maximum-likelihood)
- Applications:
Parsing
Generation
Machine translation
Tagging problems
Speech recognition
GLMs for Tagging
Global Linear Models Part 2: The Perceptron Algorithm for Tagging
For tagging, we have already covered HMMs and log-linear taggers; now we study a third approach: GLMs trained with the perceptron.
Tagging using Global Linear Models
For example:
n = 3
w = the dog barks
T = {D, N ,V}
GEN(the dog barks) = {DDD, DDN, DND, DNN, …}, 3^3 = 27 sequences in total
f(the dog barks, D N V) = (1, 0, 0, 1)
Representation: Histories
Local Feature-Vector Representations
A tagged sentence with n words has n history/tag pairs.
global feature vector = the sum of local feature vectors
Global and Local Features
Putting it all Together
Recap: A Variant of the Perceptron Algorithm
Training a Tagger Using the Perceptron Algorithm
An Example
Experiments
- Wall Street Journal part-of-speech tagging data
Perceptron = 2.89% error, Log-linear tagger = 3.28% error
- [Ramshaw and Marcus, 1995] NP chunking data
Perceptron = 93.63% accuracy, Log-linear tagger = 93.29% accuracy
GLMs for Dependency Parsing
Global Linear Models Part 3: The Perceptron Algorithm for Dependency Parsing
Dependency Parsing
Unlabeled Dependency Parses
The labeled version additionally labels each arc with the relation it represents.
All Dependency Parses for John saw Mary
The correct parse is the second one in the left column.
Conditions on Dependency Structures
- The dependency arcs form a directed tree, with the root symbol at the root of the tree.
(Definition: A directed tree rooted at root is a tree where, for every word w other than the root, there is a directed path from root to w.)
- There are no “crossing dependencies”.
Dependency structures with no crossing dependencies are sometimes referred to as projective structures.
So: no cycles and no crossings.
Structures with crossing dependencies are called non-projective structures; in this course they are not allowed.
Dependency Parsing Resources
- CoNLL 2006 conference had a “shared task” with dependency parsing of 12 languages (Arabic, Chinese, Czech, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish). 19 different groups developed dependency parsing systems. (See also CoNLL 2007).
- PhD thesis on the topic: Ryan McDonald, Discriminative Training and Spanning Tree Algorithms for Dependency Parsing, University of Pennsylvania.
- For some languages, e.g., Czech, there are “dependency banks” available which contain training data in the form of sentences paired with dependency structures.
- For other languages, we can extract dependency structures from treebanks.
The S in the bottom-right of the figure should be S(was, Vt).
The head word is found by descending the tree level by level; this is exactly the lexicalization process described earlier.
Efficiency of Dependency Parsing
- PCFG parsing is O(n^3 G^3), where n is the length of the sentence and G is the number of non-terminals in the grammar.
- Lexicalized PCFG parsing is O(n^5 G^3), where n is the length of the sentence and G is the number of non-terminals in the grammar.
- Unlabeled dependency parsing is O(n^3), thanks to dynamic programming; see Jason Eisner's paper.
Dependency parsing is very efficient, and the representations are very useful.
GLMs for Dependency Parsing
GLMs for Dependency parsing
The size of GEN is exponential in n.
x is a sentence; y is a dependency structure.
Definition of Local Feature Vectors
Example: for the earlier sentence “John saw a movie”, we can define features such as the following.
g1(x,h,m) = 1 if xh=saw, and xm=movie; 0 otherwise
g2(x,h,m) = 1 if POS(h)=VBD, and POS(m)=NN; 0 otherwise
g3(x,h,m) = 1 if |h-m|>10; 0 otherwise
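A small sketch showing how a candidate dependency structure is scored as the sum of v · g(x, h, m) over its arcs; the g1-g3 features mirror the examples above, and the sentence indices and POS tags below are assumed for illustration:

```python
def dependency_score(x, arcs, local_features, v):
    """Score of a dependency structure y (a set of (head, modifier) index pairs):
       score = sum over arcs of v . g(x, h, m),
    i.e. the global feature vector is the sum of the local arc feature vectors."""
    total = 0.0
    for h, m in arcs:
        for k, val in local_features(x, h, m).items():
            total += v.get(k, 0.0) * val
    return total

# Local features mirroring g1-g3 above for "John saw a movie"
# (1-based word indices, 0 = root); POS tags are assumed for illustration.
sentence = ["*root*", "John", "saw", "a", "movie"]
pos = ["*", "NNP", "VBD", "DT", "NN"]

def g(x, h, m):
    feats = {}
    if x[h] == "saw" and x[m] == "movie":
        feats["g1"] = 1.0
    if pos[h] == "VBD" and pos[m] == "NN":
        feats["g2"] = 1.0
    if abs(h - m) > 10:
        feats["g3"] = 1.0
    return feats

v = {"g1": 2.0, "g2": 1.0, "g3": -1.0}
print(dependency_score(sentence, {(2, 1), (2, 4), (4, 3), (0, 2)}, g, v))  # 3.0
```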
Results from McDonald (2005)
G = number of non-terminals
Extensions
- 2nd-order and 3rd-order dependency parsing
These no longer look at each dependency arc in isolation and reach 93-94% accuracy.
- Non-projective dependency structures
Handled with spanning tree algorithms.
Summary
(「・ω・)「
Happily finished the day before spring break!