1. 1. Introduction to Natural Language Processing
    1. 1.1. What is Natural Language Processing(NLP)?
    2. 1.2. Why is NLP Hard?
    3. 1.3. What will this course be about?
  2. 2. The Language Modeling Problem
    1. 2.1. Introduction to the language modeling problem
    2. 2.2. Markov Processes
    3. 2.3. Trigram Language Models
    4. 2.4. Evaluating language models: perplexity
    5. 2.5. Parameter Estimation in Language Models
      1. 2.5.1. Linear Interpolation
      2. 2.5.2. Discounting Methods
    6. 2.6. Summary
  3. 3. Tagging Problems, and Hidden Markov Models
    1. 3.1. The Tagging Problem
    2. 3.2. Generative models, and the noisy-channel model, for supervised learning
    3. 3.3. Hidden Markov Model (HMM) taggers
      1. 3.3.1. Basic definitions
      2. 3.3.2. Parameter estimation
      3. 3.3.3. The Viterbi algorithm
    4. 3.4. Summary
  4. 4. Parsing, and Context-Free Grammars
    1. 4.1. An Introduction to the Parsing Problem
    2. 4.2. Context Free Grammars
    3. 4.3. A Brief(!) Sketch of the Syntax of English
    4. 4.4. Examples of Ambiguous Structures
  5. 5. Probabilistic Context-Free Grammars (PCFGs)
    1. 5.1. Probabilistic Context-Free Grammars (PCFGs)
    2. 5.2. The CKY Algorithm for parsing with PCFGs
    3. 5.3. Summary
  6. 6. Weaknesses of PCFGs
    1. 6.1. Lack of Sensitivity to Lexical Information
    2. 6.2. Lack of Sensitivity to Structural Frequencies
  7. 7. Lexicalized PCFGs
    1. 7.1. Lexicalization of a treebank
    2. 7.2. Lexicalized probabilistic context-free grammars
    3. 7.3. Parameter estimation in lexicalized probabilistic context-free grammars
    4. 7.4. Accuracy of lexicalized probabilistic context-free grammars
    5. 7.5. Summary
  8. 8. Introduction to Machine Translation(MT)
    1. 8.1. Challenges in machine translation
    2. 8.2. Classical machine translation
    3. 8.3. A brief introduction to statistical MT
  9. 9. The IBM Translation Models
    1. 9.1. IBM Model 1
    2. 9.2. IBM Model 2
    3. 9.3. EM Training of Models 1 and 2
    4. 9.4. Summary
  10. 10. Phrase-based Translation Models
    1. 10.1. Learning Phrases From Alignments
    2. 10.2. A Phrase-Based Model
    3. 10.3. Decoding of Phrase-based Translation Models
      1. 10.3.1. Definition of the Decoding Problem
      2. 10.3.2. The Decoding Algorithm
  11. 11. Log-linear Models
    1. 11.1. Two Example Problems
    2. 11.2. Features in Log-Linear Models
      1. 11.2.1. For Problem 1
      2. 11.2.2. For Problem 2
    3. 11.3. Definition of Log-linear Models
    4. 11.4. Parameter Estimation in Log-linear Models
    5. 11.5. Smoothing/Regularization in Log-linear Models
  12. 12. Log-linear Models for Tagging(MEMMs)
    1. 12.1. Recap: The Tagging Problem
    2. 12.2. Log-Linear Taggers
      1. 12.2.1. Independence Assumptions in Log-linear Taggers
      2. 12.2.2. Features in Log-Linear Taggers
      3. 12.2.3. Parameters in Log-linear Models
      4. 12.2.4. The Viterbi Algorithm for Log-linear Taggers
      5. 12.2.5. An Example Application
    3. 12.3. Summary
  13. 13. Log-Linear Models for History-based Parsing
    1. 13.1. Conditional History-based Models
    2. 13.2. Representing Trees as Decision Sequences (Step 1)
    3. 13.3. Features (Step 3)
    4. 13.4. Beam Search (Step 4)
  14. 14. Unsupervised Learning: Brown Clustering
    1. 14.1. Word Cluster Representations
    2. 14.2. The Brown Clustering Algorithm
    3. 14.3. Clusters in NE Recognition
  15. 15. Global Linear Models(GLMs)
    1. 15.1. A Brief Review of History-based Methods
    2. 15.2. A New Set of Techniques: Global Linear Models
      1. 15.2.1. A New Framework: Global Linear Models
        1. 15.2.1.1. Intuition
        2. 15.2.1.2. Three Components of Global Linear Models
      2. 15.2.2. Parsing Problems in This Framework: Reranking Problems
      3. 15.2.3. Parameter Estimation Method 1: A Variant of the Perceptron Algorithm
    3. 15.3. Summary
  16. 16. GLMs for Tagging
  17. 17. GLMs for Dependency Parsing
    1. 17.1. Dependency Parsing
    2. 17.2. GLMs for Dependency Parsing
    3. 17.3. Results from McDonald(2005)
    4. 17.4. Summary

Natural Language Processing Course Notes

Notes on the Natural Language Processing course on Coursera.
Michael Collins, Columbia University
PS. The instructor's slides are tastefully made: simple, effective, and elegantly restrained. His voice is pleasant to listen to, too!

I paused for three months in the middle and have now resumed studying.

The course syllabus:
Topics covered include:

  1. Language modeling.
  2. Hidden Markov models, and tagging problems.
  3. Probabilistic context-free grammars, and the parsing problem.
  4. Statistical approaches to machine translation.
  5. Log-linear models, and their application to NLP problems.
  6. Unsupervised and semi-supervised learning in NLP.

Introduction to Natural Language Processing

What is Natural Language Processing(NLP)?

Image Loading

Key Applications

  • Machine Translation: e.g., Google Translation from Arabic
  • Information Extraction:
    Image Loading
    Image Loading
  • Text Summarization:
    Image Loading
  • Dialogue Systems
    Image Loading

Basic NLP Problems

  • Tagging
    Image Loading
    N is noun, V is verb, and so on.
    NA means not an entity; SC is the start of a company name, CC a continuation of a company name, SL the start of a location, and CL a continuation of a location.
  • Parsing
    Image Loading

Why is NLP Hard?

[example from L.Lee]
“At last, a computer that understands you like your mother”

Ambiguity

  1. (*) It understands you as well as your mother understands you
  2. It understands (that) you like your mother
  3. It understands you as well as it understands your mother

1 and 3: Does this mean well, or poorly?

Ambiguity at Many Levels
At the acoustic level (speech recognition):

  1. “…a computer that understands you like your mother”
  2. “…a computer that understands you lie cured mother”

At the syntactic level:
Image Loading
Different structures lead to different interpretations.
More Syntactic Ambiguity

At the semantic (meaning) level:
Two definitions of “mother”

  • a woman who has given birth to a child
  • a stringy slimy substance consisting of yeast cells and bacteria; is added to cider or wine to produce vinegar

This is an instance of word sense ambiguity

  • They put money in the bank = buried in mud?
  • I saw her duck with a telescope

At the discourse (multi-clause) level:

  • Alice says they’ve built a computer that understands you like your mother
  • But she … (two possible continuations)
    … doesn’t know any details
    … doesn’t understand me at all

This is an instance of anaphora, where “she” co-refers with some other discourse entity.

What will this course be about?

Course Coverage

  • NLP sub-problems: part-of-speech tagging, parsing, word-sense disambiguation, etc.
  • Machine learning techniques: probabilistic context-free grammars, hidden markov models, estimation/smoothing techniques, the EM algorithm, log-linear models, etc.
  • Applications: information extraction, machine translation, natural language interfaces…

A Syllabus

  • Language modeling, smoothed estimation
  • Tagging, hidden Markov models
  • Statistical parsing
  • Machine translation
  • Log-linear models, discriminative methods
  • Semi-supervised and unsupervised learning for NLP

Prerequisites

  • Basic linear algebra, probability, algorithms
  • Programming skills; this course requires Python or Java

Books
Comprehensive notes for the course will be provided at
http://www.cs.columbia.edu/~mcollins

Additional useful background:
Jurafsky and Martin:
Speech and Language Processing (2nd Edition)


The Language Modeling Problem

Introduction to the language modeling problem

Image Loading
Image Loading

Why on earth would we want to do this?

  • Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)
    For example, the recognizer might produce two candidates, “recognize speech” and “wreck a nice beach”. With a language model we know which one is more likely.
  • The estimation techniques developed for this problem will be VERY useful for other problems in NLP.

A Naive Method
Image Loading
The simplest method: count how often each sentence occurs in the corpus and divide to get a probability.

Markov Processes

Markov Processes
Trigram models are built on Markov processes.

  • Consider a sequence of random variables X_1, X_2, ..., X_n. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100).
  • Our goal: model
    P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)

First-Order Markov Processes
Image Loading
The first step is exact (the chain rule), e.g. P(A,B,C) = P(A) × P(B|A) × P(C|A,B).
The second step applies the Markov assumption.

Second-Order Markov Processes
Image Loading

Modeling Variable Length Sequences
Image Loading

Trigram Language Models

Trigram models work well and are easy to implement.
Trigram Language Models
Image Loading

An Example
Image Loading

The Trigram Estimation Problem
Image Loading
Their problem is sparse data: there are far too many parameters to estimate reliably.
Image Loading

Evaluating language models: perplexity

Perplexity
Image Loading

Some Intuition about Perplexity
Image Loading
Perplexity is roughly the effective vocabulary size: how many words the model is effectively choosing among at each step. It indicates how good the model is; lower is better.
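For concreteness, a minimal sketch of the perplexity computation (assuming a hypothetical `model(sentence)` function that returns the probability the language model assigns to a tokenized sentence):

```python
import math

def perplexity(model, sentences):
    """Perplexity = 2 ** (-l), where l is the average log2-probability per word.

    `model(sentence)` is assumed to return p(x1, ..., xn, STOP) for a
    tokenized sentence; M counts every word plus one STOP per sentence.
    """
    total_log_prob = 0.0
    M = 0
    for sentence in sentences:
        total_log_prob += math.log2(model(sentence))
        M += len(sentence) + 1  # +1 for the STOP symbol
    return 2 ** (-total_log_prob / M)
```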

Typical Values of Perplexity
Image Loading

Some History: “Syntactic Structures”

Parameter Estimation in Language Models

Linear Interpolation

The Bias-Variance Trade-Off
Image Loading
Linear Interpolation
Image Loading
Derivation:
Image Loading

How to estimate the λ values?
Image Loading

Allowing the λ's to vary
Image Loading
Different λ values are used in different situations.
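A minimal sketch of the basic interpolated estimate with fixed λ values (the count dictionaries and the λ's here are illustrative assumptions; the bucketed variant above would pick the λ's based on the context count):

```python
def q_interpolated(w, u, v, trigram, bigram, unigram, total, lambdas=(0.5, 0.3, 0.2)):
    """Linearly interpolated trigram estimate:
    q(w | u, v) = l1 * q_ML(w | u, v) + l2 * q_ML(w | v) + l3 * q_ML(w).

    `trigram`, `bigram`, `unigram` map tuples/words to training counts and
    `total` is the number of training tokens; the lambdas are non-negative
    and sum to 1.
    """
    l1, l2, l3 = lambdas
    q_tri = trigram.get((u, v, w), 0) / bigram[(u, v)] if bigram.get((u, v), 0) > 0 else 0.0
    q_bi = bigram.get((v, w), 0) / unigram[v] if unigram.get(v, 0) > 0 else 0.0
    q_uni = unigram.get(w, 0) / total
    return l1 * q_tri + l2 * q_bi + l3 * q_uni
```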

Discounting Methods

Discounting Methods
Image Loading
Image Loading
Image Loading

Katz Back-Off Models (Bigrams)
Image Loading

Katz Back-Off Models (Trigrams)
Image Loading

Summary

Image Loading


Tagging Problems, and Hidden Markov Models

The Tagging Problem

Part-of-Speech Tagging
Image Loading
The main difficulty is again ambiguity.

Named Entity Recognition
Image Loading

Named Entity Extraction as Tagging
Image Loading

Our Goal
Image Loading

Two Types of Constraints
Image Loading

Generative models, and the noisy-channel model, for supervised learning

Supervised Learning Problems - Conditional models
Image Loading
In this example, x and y look like:
x(i) = the dog laughs, y(i) = DT NN VB
Conditional models are also called discriminative models.

Generative Models
Image Loading
The line marked “Note” is just Bayes' rule.
Generative models are more flexible and general than discriminative models.

Decoding with Generative Models
Image Loading
Since p(x) does not depend on y, it can be dropped from the argmax.

Hidden Markov Model (HMM) taggers

Basic definitions

Hidden Markov Models
Image Loading

Trigram Hidden Markov Models(Trigram HMMs)
Image Loading
q denotes the trigram (transition) parameters and e the emission parameters.

An Example
Image Loading

Why the Name?
Image Loading
p(x,y) = p(y) × p(x|y)

Parameter estimation

Smoothed Estimation
Image Loading
Recall the linear interpolation method from earlier.
q is a smoothed combination of the trigram, bigram, and unigram ML estimates.
The remaining problem: e(x|y) = 0 for all y if x is never seen in the training data, so a single unseen word makes the entire probability p zero.

Dealing with Low-Frequency Words
Example:

Profits soared at Boeing Co. , easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

The word “Mulally” may never have appeared in the training set. Then
e(Mulally|y) = 0 for all tags y, so p(x1, …, xn, y1, …, yn+1) = 0 for all tag sequences y1, …, yn+1.

Solution:
Image Loading

Example:
Image Loading
Image Loading
This gives emission parameters such as e(firstword|NA) or e(initCap|SC). The drawback is that the word classes must be designed by hand.
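A minimal sketch of such a mapping (the class names below are illustrative; the lecture's exact classes may differ):

```python
import re

def map_low_frequency_word(word):
    """Map a rare or unseen word to a pseudo-word class (illustrative classes)."""
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNum"
    if re.fullmatch(r"\d+", word):
        return "otherNum"
    if word.isupper():
        return "allCaps"
    if word[:1].isupper():
        return "initCap"
    return "lowercase"

def preprocess(sentence, frequent_words):
    """Replace words outside the frequent-word vocabulary by their pseudo-word."""
    return [w if w in frequent_words else map_low_frequency_word(w) for w in sentence]
```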

The Viterbi algorithm

The Problem
Image Loading
Brute Force Search is Hopelessly Inefficient

The Viterbi Algorithm Defination
Image Loading
Example:
Image Loading
k ranges from 1 to n; u ranges over S_{k−1}; v ranges over S_k.

A Recursive Definition
Image Loading
Dynamic programming.
Justification for the Recursive Definition:
Image Loading

The Viterbi Algorithm
Image Loading

The Viterbi Algorithm with Backpointers
Image Loading
This is just the concrete implementation of the dynamic program, additionally recording backpointers so the tags can be recovered.

The Viterbi Algorithm: Running Time
Image Loading
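For reference, a minimal sketch of the trigram-HMM Viterbi algorithm with backpointers, assuming a tag set `tag_set` and functions `q(v, w, u)` for q(v | w, u) and `e(x, v)` for e(x | v):

```python
def viterbi(sentence, tag_set, q, e):
    """Return the highest-probability tag sequence for a trigram HMM."""
    n = len(sentence)
    def S(k):                           # allowed tags at position k
        return {"*"} if k <= 0 else tag_set

    pi = {(0, "*", "*"): 1.0}           # pi(k, u, v): best probability ending in tags (u, v)
    bp = {}                             # backpointers
    for k in range(1, n + 1):
        x = sentence[k - 1]
        for u in S(k - 1):
            for v in S(k):
                best_w, best_score = None, -1.0
                for w in S(k - 2):
                    score = pi[(k - 1, w, u)] * q(v, w, u) * e(x, v)
                    if score > best_score:
                        best_w, best_score = w, score
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_w

    # Best final tag pair, including the STOP transition.
    u, v = max(((u, v) for u in S(n - 1) for v in S(n)),
               key=lambda uv: pi[(n, uv[0], uv[1])] * q("STOP", uv[0], uv[1]))
    tags = [u, v]
    for k in range(n - 2, 0, -1):       # follow backpointers: y_k = bp(k+2, y_{k+1}, y_{k+2})
        tags.insert(0, bp[(k + 2, tags[0], tags[1])])
    return tags[-n:]                    # drop the leading "*" when n == 1
```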

Summary

Pros and Cons
Image Loading


Parsing, and Context-Free Grammars

An Introduction to the Parsing Problem

Parsing(Syntactic Structure)
Image Loading

Syntactic Formalisms

  • Work in formal syntax goes back to Chomsky’s PhD thesis in the 1950s.
    Book “Syntactic Structures”
  • Examples of current formalisms: minimalism, lexical functional grammar (LFG), head-driven phrase-structure grammar (HPSG), tree adjoining grammars (TAG), categorial grammars.
    The context-free grammars covered today are the foundation of these modern formalisms.

Data for Parsing Experiments
Image Loading
This is supervised learning; the training data (the Penn WSJ Treebank) is annotated by hand.

The Information Conveyed by Parse Trees
1. Part of speech for each word
Image Loading
2. Phrases
Image Loading
3. Useful Relationships
Image Loading

An Example Application: Machine Translation
Image Loading
Image Loading
With the syntactic structure available, the task becomes much easier.

Context Free Grammars

Context-Free Grammars Definition
Image Loading

A Context-Free Grammar for English
Image Loading

Left-Most Derivations
Image Loading
An Example:
Image Loading

Properties of CFGs
Image Loading
CFG - Context Free Grammars

An Example of Ambiguity
“in the car” can modify “drove”, or
“in the car” can modify “the street”; the meanings differ.

The Problem with Parsing: Ambiguity
Image Loading
The sentence has 14 possible parses.

A Brief(!) Sketch of the Syntax of English

A Brief Overview of English Syntax
Image Loading

A Fragment of a Noun Phrase Grammar
Image Loading
You can try combining these rules to build many different trees.

Prepositions, and Prepositional Phrases
Image Loading

An Extended Grammar
Image Loading

Verbs, Verb Phrases, and Sentences
Image Loading

PPs Modifying Verb Phrases
Image Loading

Complementizers, and SBARs
Image Loading

More Verbs
Image Loading

Coordination
Image Loading

We’ve Only Scratched the Surface…
Image Loading

Examples of Ambiguous Structures

Sources of Ambiguity - Part-of-Speech ambiguity
Image Loading

Sources of Ambiguity - Prepositional Phrase Attachment
Image Loading
Image Loading
Mentioned earlier.
Another example: two analyses for “John was believed to have been shot by Bill”, where “by Bill” modifies either “shot” or “believed”.
People tend to prefer attaching a preposition to the most recent verb in the sentence.

Sources of Ambiguity - Noun Premodifiers
Image Loading
In the left tree, “fast” modifies “car mechanic” (a mechanic who is fast); in the right tree, “fast car” modifies “mechanic” (a mechanic for fast cars).


Probabilistic Context-Free Grammars (PCFGs)

A probability is attached to each rule, so different parse trees receive different probabilities.
PCFGs are old and simple and do not perform well on their own, but they are the foundation of later methods; with a few refinements the performance becomes very good.

Probabilistic Context-Free Grammars (PCFGs)

A Probabilistic Context-Free Grammar (PCFG)
Image Loading
An Example:
Image Loading

Properties of PCFGs
Image Loading

Deriving a PCFG from a Treebank
What a treebank is:
Image Loading
Estimating a PCFG from a treebank:
Image Loading
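A minimal sketch of the maximum-likelihood estimates q_ML(α → β) = Count(α → β) / Count(α), assuming each tree is given as a nested (label, children) pair with words as plain strings (an assumed representation):

```python
from collections import defaultdict

def count_rules(tree, rule_counts, nonterminal_counts):
    """Recursively count the rules in one tree, e.g. ("S", [("NP", [...]), ("VP", [...])])."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    nonterminal_counts[label] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, rule_counts, nonterminal_counts)

def estimate_pcfg(treebank):
    """Return {(alpha, beta): q_ML(alpha -> beta)} estimated from a list of trees."""
    rule_counts = defaultdict(int)
    nonterminal_counts = defaultdict(int)
    for tree in treebank:
        count_rules(tree, rule_counts, nonterminal_counts)
    return {rule: c / nonterminal_counts[rule[0]] for rule, c in rule_counts.items()}
```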

PCFGs
Booth and Thompson (1973) showed that a CFG with rule probabilities correctly defines a distribution over the set of derivations provided that:

  1. The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.
  2. A technical condition on the rule probabilities ensuring that the probability of the derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern.)

The CKY Algorithm for parsing with PCFGs

Parsing with a PCFG
Image Loading
Brute-force enumeration is still completely infeasible.

Chomsky Normal Form
Image Loading
The definition of R means every rule either rewrites to exactly two non-terminals or to a single terminal. A rule with more than two symbols on the right-hand side is split into several binary rules by introducing new non-terminals.
With this normal form, dynamic programming can be applied.

A Dynamic Programming Algorithm
Image Loading
π(i, j, X) is the highest probability of any way of building an X from words i through j.
Image Loading
s is the split point.

The Full Dynamic Programming Algorithm
Image Loading
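A minimal sketch of the CKY recursion for a PCFG in Chomsky normal form, assuming dictionaries `q_unary[(X, word)] = q(X → word)`, `q_binary[(X, Y, Z)] = q(X → Y Z)`, and `binary_rules[X]` listing the (Y, Z) pairs for each X (all assumed inputs):

```python
def cky(sentence, nonterminals, binary_rules, q_binary, q_unary, start="S"):
    """pi[(i, j, X)] = max probability of any parse of words i..j rooted in X.

    Returns the probability of the best parse rooted in the start symbol,
    plus the backpointers (rule and split point) needed to recover the tree.
    """
    n = len(sentence)
    pi, bp = {}, {}
    for i in range(1, n + 1):                           # spans of a single word
        for X in nonterminals:
            pi[(i, i, X)] = q_unary.get((X, sentence[i - 1]), 0.0)
    for span in range(1, n):                            # longer spans, bottom up
        for i in range(1, n - span + 1):
            j = i + span
            for X in nonterminals:
                best, best_bp = 0.0, None
                for (Y, Z) in binary_rules.get(X, []):
                    for s in range(i, j):               # split point
                        score = q_binary[(X, Y, Z)] * pi[(i, s, Y)] * pi[(s + 1, j, Z)]
                        if score > best:
                            best, best_bp = score, (Y, Z, s)
                pi[(i, j, X)] = best
                bp[(i, j, X)] = best_bp
    return pi[(1, n, start)], bp
```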

A Dynamic Programming Algorithm for the Sum
Image Loading

Summary

  • PCFGs augment CFGs by including a probability for each rule in the grammar.
  • The probability of a parse tree is the product of the probabilities of the rules in the tree.
  • To build a PCFG-based parser:
    1. Learn a PCFG from a treebank.
    2. Given a test sentence, use the CKY algorithm to compute the highest-probability tree for the sentence under the PCFG.

Weaknesses of PCFGs

PCFGs = Probabilistic Context-Free Grammars
Plain PCFGs reach only about 72% accuracy, whereas modern parsers reach about 92%.

  • Lack of sensitivity to lexical information
  • Lack of sensitivity to structural frequencies

Lack of Sensitivity to Lexical Information

Every rule decision is assumed independent, and the probabilities are completely insensitive to the actual words.

PP Attachment Ambiguity
Image Loading
Image Loading
In this example, the single rule shown in bold decides which tree wins.

Coordination Ambiguity
Image Loading
Image Loading
In this example the two trees receive exactly the same probability: they contain the same rules and differ only in ordering.

Lack of Sensitivity to Structural Frequencies

Structural Preferences: Close Attachment
Image Loading
“in Africa” can modify either “president” or “company”; people prefer the closest (most recent) attachment.

Previous example: John was believed to have been shot by Bill
Here the low-attachment analysis (Bill does the shooting) contains the same rules as the high-attachment analysis (Bill does the believing), so the two analyses receive the same probability.


Lexicalized PCFGs

Lexicalization of a treebank

Heads in Context-Free Rules
Image Loading
Image Loading

Rules which Recover Heads
The original treebank contains no head information; these rules are used to recover it.
An Example for NPs:
Image Loading
An Example for VPs:
Image Loading

Adding Headwords to Trees
Image Loading

Lexicalized probabilistic context-free grammars

Chomsky Normal Form
Covered earlier.
Image Loading

Lexicalized Context-Free Grammars in Chomsky Normal Form
Image Loading
The rules are annotated to indicate which child the head comes from.
Example:
Image Loading

Parameters in a Lexicalized PCFG
Image Loading
The number of parameters increases substantially.

Parsing with Lexicalized CFGs
Image Loading

Parameter estimation in lexicalized probabilistic context-free grammars

An Example:
Image Loading

A Model from Charniak (1997)
Image Loading
Image Loading
ML = maximum likelihood
q_ML = count(A) / count(B)
Each set of λ values sums to 1.
The estimates back off step by step to more general distributions, which smooths the model.

Other Important Details
Image Loading
Binarize the rules.

Accuracy of lexicalized probabilistic context-free grammars

Method 1: Constituents/Precision and Recall
Evaluation: Representing Trees as Constituents
Image Loading
Precision and Recall
Image Loading
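A minimal sketch of the computation, assuming each parse is represented as a set of (label, start, end) constituents:

```python
def precision_recall(gold_constituents, predicted_constituents):
    """Constituents are (label, start, end) triples; returns (precision, recall, F1)."""
    correct = len(gold_constituents & predicted_constituents)
    precision = correct / len(predicted_constituents)
    recall = correct / len(gold_constituents)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1
```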
Results
Image Loading

Method 2: Dependency Accuracies
Example:
Image Loading
The three columns are: head word (h), modifier word (w), and rule.
Dependency Accuracies:
Image Loading
Strengths and Weaknesses of Modern Parsers:
Image Loading

Summary

  • Key weakness of PCFGs: lack of sensitivity to lexical information.
  • Lexicalized PCFGs:
    Lexicalize a treebank using head rules.
    Estimate the parameters of a lexicalized PCFG using smoothed estimation.
  • Accuracy of lexicalized PCFGs: around 88% in recovering constituents or dependencies.

Introduction to Machine Translation(MT)

Image Loading

Challenges in machine translation

Challenges: Lexical Ambiguity
Image Loading
Translating from English to Spanish: an English word with several senses requires different translation choices.

Challenges: Differing Word Orders
Image Loading
A classic example: English and Japanese order words differently.

Syntactic Structure is not Preserved Across Translations
Example from Dorr et al., 1999
Image Loading
The verb “float” does not remain the main verb in the translation; here “floated” corresponds to “floating” and “into” corresponds to “entered”.

Syntactic Ambiguity Causes Problems
Example from Dorr et al., 1999
Image Loading
Did John hit the dog with the stick, or does the dog have the stick? “with the stick” modifies either “hit” or “dog”.

Pronoun Resolution
Example from Dorr et al., 1999
Image Loading
“it” could refer to the computer or to the data.

Classical machine translation

All of these classical approaches are rule-based.
Direct Machine Translation

  • Translation is word-by-word.
  • Very little analysis of the source text(e.g., no syntactic or semantic analysis).
  • Relies on a large bilingual dictionary. For each word in the source language, the dictionary specifies a set of rules for translating that word.
  • After the words are translated, simple reordering rules are applied (e.g., move adjectives after nouns when translating from English to French).

An Example:
Image Loading

Some Problems with Direct Machine Translation:
Image Loading

Transfer-Based Approaches
Three phases in translation:

Analysis: Analyze the source language sentence; for example, build a syntactic analysis of the source language sentence.
Transfer: Convert the source-language parse tree to a target-language parse tree.
Generation: Convert the target-language parse tree to an output sentence.

  • The “parse trees” involved can vary from shallow analyses to much deeper analyses (even semantic representations).
  • The transfer rules might look quite similar to the rules for direct translation systems. But they can now operate on syntactic structures.
  • It’s easier with these approaches to handle long-distance reorderings.
  • The Systran systems are a classic example of this approach.

Image Loading
Image Loading

Interlingua-Based Translation
Two phases in translation:

  • Analysis: Analyze the source language sentence into a (language-independent) representation of its meaning.
  • Generation: Convert the meaning representation into an output sentence.

One Advantage: If we want to build a translation system that translates between n languages, we only need to develop n analysis and generation systems. With a transfer-based system, we'd need to develop O(n^2) sets of translation rules.
Disadvantage: What would a language-independent representation look like?

  • How to represent different concepts in an interlingua?
  • Different languages break down concepts in quite different ways:
    German has two words for wall: one for an internal wall, one for a wall that is outside.
    Japanese has two words for brother: one for an elder brother, one for a younger brother.
    Spanish has two words for leg: pierna for a human’s leg, pata for an animal’s leg, or the leg of a table.
  • An interlingua might end up simply being an intersection of these different ways of breaking down concepts, but that doesn’t seem very satisfactory…

A Pyramid Representing These Three Approaches
Image Loading

A brief introduction to statistical MT

  • Parallel corpora are available in several language pairs.
  • Basic idea: use a parallel corpus as a training set of translation examples.
  • Classic example: IBM work on French-English translation, using the Canadian Hansards. (1.7 million sentences of 30 words or less in length). (Early 1990s)
  • This is the approach used by Google Translate.
  • Idea goes back to Warren Weaver (1949): suggested applying statistical and cryptanalytic techniques to translation.
    Warren Weaver, 1949, in a letter to Norbert Wiener:
    "… one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

The Noisy Channel Model
Image Loading
e is an English sentence; f is a French sentence.
p(e): a probability to any sentence in English.
p(f|e): the conditional probability of any French sentence given an English sentence.
Image Loading
Example from Koehn and Knight tutorial:
Image Loading
Image Loading


The IBM Translation Models

First Generation Translation Systems.
There are IBM Models 1 through 5; Model 2 already works quite well.

IBM Model 1

IBM Model 1: Alignments - Introduction

  • How do we model p(f|e)?
  • The English sentence e has l words e_1 … e_l; the French sentence f has m words f_1 … f_m.
  • An alignment a identifies which English word each French word originated from.
  • Formally, an alignment a is {a_1, …, a_m}, where each a_j ∈ {0 … l}.
  • There are (l+1)^m possible alignments.
  • e.g., l = 6, m = 7
    e = And the program has been implemented
    f = Le programme a ete mis en application
    One alignment is {2,3,4,5,6,6,6}
    Another (bad!) alignment is {1,1,1,1,1,1,1}

Alignments in the IBM Models
Image Loading
For example: p(le chien aboie, {1,2,3} | the dog barks, 3)

A By-Product: Most Likely Alignments
Image Loading
The IBM Models are now used less for translation itself and mainly to find the most likely alignments.

An Example Alignment
Image Loading

IBM Model 1: Alignments
Step 1:
Image Loading

IBM Model 1: Translation Probabilities
Step 2:
Image Loading
For example:
e = NULL the dog barks, so l = 3
a = {1, 2, 3}
f = le chien aboie, so m = 3
p(f|a,e,m) = t(le|the) × t(chien|dog) × t(aboie|barks)

Another example:
Image Loading

IBM Model 1: The Generative Process
Generation:
Image Loading

An Example Lexical Entry
Image Loading

IBM Model 2

IBM Model 2 - Introduction
Image Loading
Only the alignment part of the model differs from Model 1.

An Example
Image Loading
Image Loading

IBM Model 2: The Generative Process
Image Loading

Recovering Alignments
Image Loading

EM Training of Models 1 and 2

Parameter estimation in models 1 and 2.
The Parameter Estimation Problem
Image Loading

Parameter Estimation if the Alignments are Observed
Image Loading
When the alignments a are observed, estimation is just counting and dividing.
Image Loading

Parameter Estimation with the EM Algorithm
Image Loading
The estimates are improved iteratively.
Image Loading
Image Loading
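A minimal sketch of EM estimation of the IBM Model 1 translation parameters t(f | e); the corpus format is an assumption, and Model 2 would additionally re-estimate the alignment parameters q(j | i, l, m):

```python
from collections import defaultdict

def em_ibm_model1(corpus, num_iterations=10):
    """`corpus` is a list of (e_words, f_words) pairs, with e_words[0] = NULL.

    Returns the translation table t as a dict mapping (f, e) to t(f | e).
    """
    # Initialize t(f | e) uniformly over the French words seen with each English word.
    candidates = defaultdict(set)
    for e_words, f_words in corpus:
        for e in e_words:
            candidates[e].update(f_words)
    t = {(f, e): 1.0 / len(fs) for e, fs in candidates.items() for f in fs}

    for _ in range(num_iterations):
        count_fe = defaultdict(float)                   # expected count(e, f)
        count_e = defaultdict(float)                    # expected count(e)
        for e_words, f_words in corpus:
            for f in f_words:
                norm = sum(t[(f, e)] for e in e_words)  # sum_i t(f | e_i)
                for e in e_words:
                    delta = t[(f, e)] / norm            # posterior of aligning f to e
                    count_fe[(f, e)] += delta
                    count_e[e] += delta
        t = {(f, e): c / count_e[e] for (f, e), c in count_fe.items()}
    return t
```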

Justification for the Algorithm
Image Loading

Summary

Image Loading


Phrase-based Translation Models

Second Generation Translation Systems.
Late 1990s.
Google Translate.

Learning Phrases From Alignments

Phrase-Based Models
Image Loading

An Example (from tutorial by Koehn and Knight):
Calculate PB Lexicon Pairs using IBM Model 2.
Image Loading
Image Loading

Representation as Alignment Matrix
Image Loading

Finding Alignment Matrices
Problems:

  • Alignments are often “noisy”.
  • “Many to One” instead of “Many to Many”.

Solution:

  • Step 1: train IBM model 2 for p(f|e), and come up with most likely alignment for each (e,f) pair.
  • Step 2: train IBM model 2 for p(e|f) and come up with most likely alignment for each (e,f) pair.
  • We now have two alignments:
    take intersection of the two alignments as a starting point.

Image Loading
Image Loading
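A minimal sketch of this starting point, assuming each alignment is given as a set of (English position, French position) pairs:

```python
def alignment_starting_point(alignment_f_given_e, alignment_e_given_f):
    """Both arguments are sets of (e_position, f_position) pairs.

    The intersection is the high-precision starting point; the union is the
    pool of candidate points the growing heuristics may add later.
    """
    intersection = alignment_f_given_e & alignment_e_given_f
    union = alignment_f_given_e | alignment_e_given_f
    return intersection, union
```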

Heuristics for Growing Alignments

  • Only explore alignment in union of p(f|e) and p(e|f) alignments.
  • Add one alignment point at a time.
  • Only add alignment points which align a word that currently has no alignment.
  • At first, restrict ourselves to alignment points that are “neighbors” (adjacent or diagonal) of current alignment points.
  • Later, consider other alignment points.

Image Loading

Extracting Phrase Pairs from the Alignment Matrix
Image Loading

Probabilities for Phrase Pairs
Image Loading

An Example Phrase Translation Table
Image Loading

A Phrase-Based Model

A Sketch
Image Loading
Image Loading
Image Loading
The 6 appears because six words were skipped; the term is negative, a penalty that keeps the two languages' sentence structures from diverging too much.
Image Loading
Image Loading

Decoding of Phrase-based Translation Models

Definition of the Decoding Problem

Phrase-based Translation
Image Loading

Phrase-based Models: Definitions
Image Loading

Phrased-based Translation: Definitions
Image Loading
s: start point.
t: end point.
e: a sequence of one or more English words.

Definitions
Image Loading

Valid Derivations
Image Loading
The third condition is the distortion limit: a hard constraint on how far phrases can move.
The distance between consecutive (s, t, e) triples (from the end of one phrase to the start of the next) cannot be too large.
This constraint is rather rigid and has since been superseded.
The number of derivations in Y(x) grows exponentially with sentence length.

Scoring Derivations
Image Loading
An Example:
Image Loading
We must also take this critic seriously.

  1. Language model: how likely the English sentence is.
    log q(we|*, *) + log q(must|*, we) + log q(also|we, must) + log q(the|must, also) + …
  2. Phrase scores: how well the English phrases match the German source.
    g(1, 3, we must also) + g(7, 7, take) + g(4, 5, the critic) + …
  3. Distortion scores: a penalty when the phrase order diverges too much from the source (see the sketch after this list).
    |1 − 1| × ζ + |3 + 1 − 7| × ζ + |7 + 1 − 4| × ζ + |5 + 1 − 6| × ζ
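A minimal sketch of this scoring, assuming a trigram log-probability function `log_q(w, u, v)`, a phrase-score table `g` keyed by (s, t, e), and a negative distortion parameter `zeta` (all assumed inputs):

```python
def score_derivation(phrases, log_q, g, zeta):
    """Score a derivation y = (p_1, ..., p_L), where each phrase p_k = (s, t, e_words)."""
    # 1. Language-model score of the concatenated English output (plus STOP).
    english = [w for (_, _, e_words) in phrases for w in e_words]
    lm_score, context = 0.0, ["*", "*"]
    for w in english + ["STOP"]:
        lm_score += log_q(w, context[-2], context[-1])
        context.append(w)

    # 2. Phrase scores g(s, t, e).
    phrase_score = sum(g[(s, t, tuple(e_words))] for (s, t, e_words) in phrases)

    # 3. Distortion penalties: zeta * |t(p_{k-1}) + 1 - s(p_k)|, with zeta < 0.
    distortion, prev_end = 0.0, 0
    for (s, t, _) in phrases:
        distortion += zeta * abs(prev_end + 1 - s)
        prev_end = t
    return lm_score + phrase_score + distortion
```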

The Decoding Algorithm

The decoding problem above is NP-hard, so a heuristic algorithm is needed.

Decoding Algorithm: Definitions
Image Loading
For example: (we, must, 1100000, 2, −2.5).
The bit string in the middle acts as a placeholder recording which source words have been translated so far.

States, and the Search Space
Image Loading
The search space can be drawn as a tree like this.

Transitions
Image Loading
ph(q) is the set of phrases that can legally extend state q.

The next function
We define next(q,p) to be the state formed by combining state q with phrase p.
Image Loading

The Equality Function
Image Loading
Tests whether two states are equal.

The Decoding Algorithm
Image Loading

Definition of Add(Q, q’, q, p)
Image Loading
If two derivations reach the same state, keep the one with the higher score.

Definition of beam(Q)
Image Loading
(The “arg” in the slide is a mistake and should not be there.)
States whose scores fall too far below the current best are simply discarded.


Log-linear Models

Two Example Problems

The Language Modeling Problem
Image Loading

Trigram Models:
Image Loading
Image Loading
Trigram models use too little of the available information.

A Naive Approach:
Image Loading
A smoothed model that weights each information source; the number of λ parameters quickly grows out of control.

A Second Example: Part-of-Speech Tagging
Image Loading
So the question is: what should ?? be?
Image Loading
Image Loading
Again, a lot of available information goes unused.

Features in Log-Linear Models

For Problem 1

The General Problem

  • We have some input domain X
  • Have a finite label set Y
  • Aim is to provide a conditional probability p(y|x) for any x, y where x ∈ X, y ∈ Y

Language Modeling

  • x is a “history” w_1, w_2, …, w_{i−1}, e.g.,
    Third, the notion “grammatical in English” cannot be identified in any way with the notion “high order of statistical approximation to English”. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical
  • y is an “outcome” w_i

Feature Vector Representations
Image Loading

Language Modeling
Image Loading
Image Loading
f1 is essentially a unigram feature, f2 a bigram feature, f3 a trigram feature, and f4 a skip-bigram feature.

Defining Features in Practice
Image Loading
Only trigrams seen in the training data are included; including all possible trigrams would lead to too many features, many of them not worth including in the model.

For Problem 2

The POS-Tagging Example
Image Loading
i: the position being tagged.

The Full Set of Features in Ratnaparkhi, 1996
Image Loading
Image Loading

The Final Result
Image Loading
Green marks x, red marks y.
The feature vector is usually sparse: far more 0's than 1's.

Definition of Log-linear Models

Parameter Vectors
Features were introduced above; now we introduce the parameter vector.
Image Loading
The score is the inner product of the two vectors.

Language Modeling Example
Image Loading
p(model|x;v) = e^5.6 / (e^5.6 + e^1.5 + e^4.5 + e^-3.2 + …)
p(the|x;v) = e^-3.2 / (e^5.6 + e^1.5 + e^4.5 + e^-3.2 + …)

Log-Linear Models
The definition, in summary:
Image Loading
After this normalization, all of the probabilities are positive and sum to 1.
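A minimal sketch of the definition, with the feature vector represented as a sparse dictionary (an assumed representation):

```python
import math

def dot(v, features):
    """Inner product v . f(x, y) for a sparse feature dict."""
    return sum(v.get(name, 0.0) * value for name, value in features.items())

def log_linear_probability(x, y, v, f, labels):
    """p(y | x; v) = exp(v . f(x, y)) / sum over y' of exp(v . f(x, y'))."""
    scores = {label: dot(v, f(x, label)) for label in labels}
    m = max(scores.values())                      # subtract the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z
```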

Why the name?
Image Loading

Parameter Estimation in Log-linear Models

Now we need to estimate v.
Maximum-Likelihood Estimation
Image Loading
Choose the v that maximizes the total log-likelihood over the training examples; this is the naive statement of the objective.
L(v) is an upside-down bell-shaped (concave) function, so it is relatively easy to optimize, e.g. with gradient ascent.

Calculating the Maximum-Likelihood Estimates
Image Loading

Gradient Ascent Methods
Image Loading
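A minimal sketch of batch gradient ascent, using the gradient dL/dv_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; v) f_k(x_i, y); the sparse-dictionary feature representation is an assumption:

```python
import math

def gradient_ascent(examples, f, labels, step_size=0.1, iterations=100):
    """Batch gradient ascent on the log-likelihood of a log-linear model.

    `examples` is a list of (x, y) pairs and `f(x, y)` returns a sparse feature dict.
    """
    v = {}
    dot = lambda feats: sum(v.get(k, 0.0) * val for k, val in feats.items())
    for _ in range(iterations):
        gradient = {}
        for x, y in examples:
            # Empirical feature counts for the observed label.
            for k, val in f(x, y).items():
                gradient[k] = gradient.get(k, 0.0) + val
            # Minus the expected feature counts under the current model p(y | x; v).
            scores = {label: dot(f(x, label)) for label in labels}
            m = max(scores.values())
            z = sum(math.exp(s - m) for s in scores.values())
            for label in labels:
                p = math.exp(scores[label] - m) / z
                for k, val in f(x, label).items():
                    gradient[k] = gradient.get(k, 0.0) - p * val
        for k, grad in gradient.items():
            v[k] = v.get(k, 0.0) + step_size * grad
    return v
```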

Conjugate Gradient Methods
Image Loading
In practice, standard toolkits already implement this algorithm.

Smoothing/Regularization in Log-linear Models

Smoothing in Log-Linear Models
Image Loading

Regularization
Image Loading

Experiments with Regularization
Image Loading

Experiments with Gaussian Priors
Image Loading


Log-linear Models for Tagging(MEMMs)

An application of log-linear models.
MEMMs: Maximum-Entropy Markov Models

Recap: The Tagging Problem

There are several variants of the tagging problem:
Part-of-Speech Tagging
Image Loading

Named Entity Recognition
Image Loading

Named Entity Extraction as Tagging
Image Loading

Our Goal
Image Loading
Methods covered so far:
Hidden Markov Models and the Viterbi algorithm.

Log-Linear Taggers

Independence Assumptions in Log-linear Taggers

Log-Linear Models for Tagging
Image Loading

How to model p(t[1:n]|w[1:n])?
Image Loading
Example: the dog barks → D N V
P(D N V | the dog barks) = P(D | the dog barks, *, *) × P(N | the dog barks, *, D) × P(V | the dog barks, D, N)

Features in Log-Linear Taggers

Representation: Histories
Image Loading

Recap: Feature Vector Representations in Log-Linear Models
Image Loading

An Example
Image Loading
Image Loading
A feature like f2 would be very hard to incorporate into an HMM.

The Full Set of Features in [(Ratnaparkhi, 96)]
Image Loading
Image Loading

Parameters in Log-linear Models

Recap: Log-Linear Models
Image Loading

Training the Log-Linear Model
Image Loading

The Viterbi Algorithm for Log-linear Taggers

The Viterbi Algorithm
Image Loading
Image Loading

A Recursive Definition
Image Loading
The only difference from the HMM version is in the q term.

The Viterbi Algorithm with Backpointers
Image Loading

An Example Application

FAQ Segmentation: McCallum et al.

  • McCallum et al. compared HMM and log-linear taggers on a FAQ segmentation task
  • Main point: in an HMM, modeling p(word|tag) is difficult in this domain

The task: automatically tag which lines of a document are header, question, or answer lines.

<head>X-NNTP-POSTER: NewsHound v1.33
<head>
<head>Archive name: acorn/faq/part2
<head>Frequency: monthly
<head>
<question>2.6) What configuration of serial cable should I use
<answer>
<answer> Here follows a diagram of the necessary connections
<answer>programs to work properly. They are as far as I know t
<answer>agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer>is to avoid the well known serial port chip bugs. The

FAQ Segmentation: Line Features

begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
ends-with-question-mark
first-alpha-is-capitalized
indented-1-to-4

FAQ Segmentation: The Log-Linear Tagger
These line features yield tagger features such as:

“tag=question;prev=head;begins-with-number”
“tag=question;prev=head;contains-alphanum”
“tag=question;prev=head;contains-nonspace”
“tag=question;prev=head;contains-number”
“tag=question;prev=head;prev-is-blank”

FAQ Segmentation: An HMM Tagger
For the line:

<question>2.6) What configuration of serial cable should I use

we get:

  • First solution
    Split each line into its individual words.
    Image Loading
    i.e. have a language model for each tag
  • Second solution
    Image Loading

FAQ Segmentation: Results

Method                                     Precision   Recall
ME-Stateless (Maximum Entropy Stateless)   0.038       0.362
TokenHMM                                   0.276       0.140
FeatureHMM                                 0.413       0.529
MEMM                                       0.867       0.681
  • Precision and recall results are for recovering segments
  • ME-stateless is a log-linear model that treats every sentence separately (no dependence between adjacent tags)
  • TokenHMM is an HMM with first solution we’ve just seen
  • FeatureHMM is an HMM with second solution we’ve just seen
  • MEMM is a log-linear trigram tagger (MEMM stands for “Maximum-Entropy Markov Model”)

Summary

Image Loading


Log-Linear Models for History-based Parsing

This approach applies to a much broader range of problems.

Conditional History-based Models

Recap: Log-Linear Taggers
Image Loading

A General Approach: (Conditional) History-Based Models

  • We’ve shown how to define p(t[1:n]|w[1:n]) where t[1:n] is a tag sequence.
  • How do we define p(T|S) if T is a parse tree (or another structure)? (We use the notation S = w[1:n])

Image Loading
No Markov assumption is made, so dynamic programming can no longer be used.

Representing Trees as Decision Sequences (Step 1)

An Example Tree

Ratnaparkhi’s Parser: Three Layers of Structure

  1. Part-of-speech tags
  2. Chunks
  3. Remaining structure

Layer 1: Part-of-Speech Tags
Image Loading

Layer 2: Chunks
Image Loading
QP: numeric phrases.
Image Loading

Layer 3: Remaining Structure
Image Loading
After the first two layers we arrive here:
Image Loading
Pick the leftmost constituent:
Image Loading
Since the sentence is not finished, Check = NO:
Image Loading
Continuing (several steps omitted):
Image Loading
The current constituent is complete, so Check = YES:
Image Loading
Then pick the leftmost item that has no Join or Start decision yet:
Image Loading
Check = YES again:
Image Loading
Again pick the leftmost:
Image Loading
Check = YES:
Image Loading

The Final Sequence of decisions
Image Loading

Features (Step 3)

Applying a Log-Linear Model
Image Loading
Image Loading

Layer 3: Join or Start

  • Looks at head word, constituent (or POS) label, and start/join annotation of n’th tree relative to the decision, where n = −2, −1
  • Looks at head word, constituent (or POS) label of n’th tree relative to the decision, where n = 0, 1, 2
  • Looks at bigram features of the above for (-1,0) and (0,1)
  • Looks at trigram features of the above for (-2,-1,0), (-1,0,1) and (0, 1, 2)
  • The above features with all combinations of head words excluded
  • Various punctuation features

Layer 3: Check=NO or Check=YES

  • A variety of questions concerning the proposed constituent

Beam Search (Step 4)

The Search Problem
Image Loading
As before, dynamic programming cannot be used.
Beam search keeps many candidate parse trees, each with its own probability.


Unsupervised Learning: Brown Clustering

The first unsupervised learning algorithm covered in this course.

Word Cluster Representations

The Brown Clustering Algorithm

  • Input: a (large) corpus of words
  • Output 1: a partition of words into word clusters
  • Output 2 (generalization of 1): a hierarchical word clustering

Example Clusters (from Brown et al, 1992)
The clusters reported in the paper; the results are impressive!

Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
people guys folks fellows CEOs chaps doubters commies unfortunates blokes
down backwards ashore sideways southward northward overboard aloft downwards adrift
water gas coal liquid acid sand carbon steam shale iron
great big vast sudden mere sheer gigantic lifelong scant colossal
man woman boy girl lawyer doctor guy farmer teacher citizen
American Indian European Japanese German African Catholic Israeli Italian Arab
pressure temperature permeability density porosity stress velocity viscosity gravity tension
mother wife father son husband brother daughter sister boss uncle
machine device controller processor CPU printer spindle subsystem compiler plotter
John George James Bob Robert Paul William Jim David Mike
anyone someone anybody somebody
feet miles pounds degrees inches barrels tons acres meters bytes
director chief professor commissioner commander treasurer founder superintendent dean custodian

A Sample Hierarchy (from Miller et al., NAACL 2004)
The hierarchy reported in the paper.
It is a Huffman-style binary tree, with 1 for the left branch and 0 for the right; the bit strings encode the hierarchy.

lawyer 1000001101000
newspaperman 100000110100100
stewardess 100000110100101
toxicologist 10000011010011
slang 1000001101010
babysitter 100000110101100
conspirator 1000001101011010
womanizer 1000001101011011
mailman 10000011010111
salesman 100000110110000
bookkeeper 1000001101100010
troubleshooter 10000011011000110
bouncer 10000011011000111
technician 1000001101100100
janitor 1000001101100101
saleswoman 1000001101100110
...
Nike 1011011100100101011100
Maytag 10110111001001010111010
Generali 10110111001001010111011
Gap 1011011100100101011110
Harley-Davidson 10110111001001010111110
Enfield 101101110010010101111110
genus 101101110010010101111111
Microsoft 10110111001001011000
Ventritex 101101110010010110010
Tractebel 1011011100100101100110
Synopsys 1011011100100101100111
WordPerfect 1011011100100101101000
....
John 101110010000000000
Consuelo 101110010000000001
Jeffrey 101110010000000010
Kenneth 10111001000000001100
Phillip 101110010000000011010
WILLIAM 101110010000000011011
Timothy 10111001000000001110
Terrence 101110010000000011110
Jerald 101110010000000011111
Harold 101110010000000100
Frederic 101110010000000101
Wendell 10111001000000011

The Brown Clustering Algorithm

The Intuition

  • Similar words appear in similar contexts
  • More precisely: similar words have similar distributions of words to their immediate left and right

The Formulation
Image Loading
An Example

The Brown Clustering Model
Image Loading

Measuring the Quality of C
Image Loading
For example, with cluster assignments:

1 2 3 1 2
the dog saw the cat

n(1,2) = 2 (cluster 1 is followed by cluster 2 twice)
n(2) = 2 (cluster 2 occurs twice)
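A minimal sketch of the mutual-information part of Quality(C) computed from these counts; the constant entropy term G is omitted since it does not depend on C:

```python
import math
from collections import defaultdict

def clustering_quality(corpus, C):
    """Sum over cluster bigrams (c, c') of p(c, c') * log( p(c, c') / (p(c) p(c')) ).

    `corpus` is a list of words and `C` maps each word to its cluster id;
    n(c, c') counts how often cluster c' follows cluster c.
    """
    bigram_counts = defaultdict(int)    # n(c, c')
    unigram_counts = defaultdict(int)   # n(c)
    for w in corpus:
        unigram_counts[C[w]] += 1
    for w1, w2 in zip(corpus, corpus[1:]):
        bigram_counts[(C[w1], C[w2])] += 1

    total_bigrams = sum(bigram_counts.values())
    total_unigrams = sum(unigram_counts.values())
    quality = 0.0
    for (c1, c2), n in bigram_counts.items():
        p_bigram = n / total_bigrams
        p1 = unigram_counts[c1] / total_unigrams
        p2 = unigram_counts[c2] / total_unigrams
        quality += p_bigram * math.log(p_bigram / (p1 * p2))
    return quality
```

For the toy corpus above, `clustering_quality("the dog saw the cat".split(), {"the": 1, "dog": 2, "cat": 2, "saw": 3})` uses n(1,2) = 2 and n(2) = 2, exactly as computed by hand.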

A First Algorithm
Image Loading

A Second Algorithm
Image Loading

Clusters in NE Recognition

Miller et al, NAACL 2004
"Name Tagging with Word Clusters and Discriminative Training"
Scott Miller, Jethran Guinness, Alex Zamanian

At a recent meeting, we presented name-tagging technology to a potential user. The technology had performed well in formal evaluations, had been applied successfully by several research groups, and required only annotated training examples to configure for new name classes. Nevertheless, it did not meet the user’s needs.

To achieve reasonable performance, the HMM-based technology we presented required roughly 150,000 words of annotated examples, and over a million words to achieve peak accuracy. Given a typical annotation rate of 5,000 words per hour, we estimated that setting up a name finder for a new problem would take four person days of annotation work – a period we considered reasonable. However, this user’s problems were too dynamic for that much setup time. To be useful, the system would have to be trainable in minutes or hours, not days or weeks.

The feature types used:

1. Tag + PrevTag
2. Tag + CurWord
3. Tag + CapAndNumFeatureOfCurWord
4. ReducedTag + CurWord
//collapse start and continue tags
5. Tag + PrevWord
6. Tag + NextWord
7. Tag + DownCaseCurWord
8. Tag + Pref8ofCurrWord // an 8-bit prefix of the current word's cluster bit string
9. Tag + Pref12ofCurrWord
10. Tag + Pref16ofCurrWord
11. Tag + Pref20ofCurrWord
12. Tag + Pref8ofPrevWord
13. Tag + Pref12ofPrevWord
14. Tag + Pref16ofPrevWord
15. Tag + Pref20ofPrevWord
16. Tag + Pref8ofNextWord
17. Tag + Pref12ofNextWord
18. Tag + Pref16ofNextWord
19. Tag + Pref20ofNextWord

Evaluation of the results:
Image Loading
Image Loading
Active Learning -> Choose to label the most “useful” examples at each stage.


Global Linear Models(GLMs)

Techniques

  • So far:
    Smoothed estimation
    Probabilistic context-free grammars
    Log-linear models
    Hidden markov models
    The EM Algorithm
    History-based models
  • Today
    Global linear models

A Brief Review of History-based Methods

Supervised Learning in Natural Language
Image Loading

The Models so far
Image Loading
Examples of the first type are HMMs and PCFGs; an example of the second is log-linear tagging.

Example 1: PCFGs
Image Loading
S
S->NP VP
q(S->NP VP) = p(S->NP VP|S)

Example 2: Log-linear Taggers
Image Loading

A New Set of Techniques: Global Linear Models

A New Framework: Global Linear Models

Intuition
  • We’ll move away from history-based models
    No idea of a “derivation”, or attaching probabilities to “decisions”
  • Instead, we’ll have feature vectors over entire structures
    “Global features”
  • First piece of motivation:
    Freedom in defining features

A Need for Flexible Features
Image Loading
Image Loading

Three Components of Global Linear Models

Image Loading
For example, x is a sentence and y is a parse tree; the candidates are a set of parse trees.

Component 1: f
Image Loading
Features:
Image Loading
Feature Vectors:
Image Loading
Feature design is important, since the features are what get fed into the learning algorithm.

Component 2: GEN
Image Loading

  • GEN enumerates a set of candidates for an input x
  • Some examples of how GEN(x) can be defined:
    Parsing: GEN(x) is the set of parses for x under a grammar
    Any task: GEN(x) is the top N most probable parses under a history-based model
    Tagging: GEN(x) is the set of all possible tag sequences with the same length as x
    Translation: GEN(x) is the set of all possible English translations for the French sentence x

Component 3: v
Image Loading

Putting it all Together
Image Loading
Image Loading

Parsing Problems in This Framework: Reranking Problems

Reranking Approaches to Parsing
Image Loading
The baseline parser is, for example, a lexicalized PCFG.

The Representation f
Image Loading
Some Features in the Paper:
From [Collins and Koo, 2005]:
The following types of features were included in the model. We will use the rule VP -> PP VBD NP NP SBAR with head VBD as an example. Note that the output of our baseline parser produces syntactic trees with headword annotations.

Rules These include all context-free rules in the tree, for example VP -> PP VBD NP NP SBAR
Image Loading
Bigrams These are adjacent pairs of non-terminals to the left and right of the head. As shown, the example rule would contribute the bigrams (Right, VP, NP, NP), (Right, VP, NP, SBAR), (Right, VP, SBAR, STOP), and (Left, VP, PP, STOP) to the left of the head.
Image Loading
Grandparent Rules Same as Rules, but also including the non-terminal above the rule.
Image Loading
Two-level Rules Same as Rules, but also including the entire rule above the rule.
Image Loading

Parameter Estimation Method 1: A Variant of the Perceptron Algorithm

A Variant of the Perceptron Algorithm
Image Loading
T: the number of iterations, typically small (5, 10, or 20).
Any features seen in the true candidate structure have their parameter values increased, whereas any features seen in the incorrectly proposed structure have their parameter values decreased. It is a kind of reinforcement step.
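A minimal sketch of this perceptron variant, assuming `GEN(x)` returns the candidate set and `f(x, y)` returns a sparse feature dictionary:

```python
def perceptron_train(examples, GEN, f, T=10):
    """Variant of the perceptron algorithm for global linear models.

    For each example, decode with the current parameters; if the best candidate
    is not the correct structure, add f(x, y) and subtract f(x, z).
    """
    v = {}
    def score(x, y):
        return sum(v.get(name, 0.0) * value for name, value in f(x, y).items())

    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=lambda candidate: score(x, candidate))
            if z != y:
                for name, value in f(x, y).items():
                    v[name] = v.get(name, 0.0) + value
                for name, value in f(x, z).items():
                    v[name] = v.get(name, 0.0) - value
    return v
```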

Perceptron Experiments: Parse Reranking
Parsing the Wall Street Journal Treebank
Training set = 40,000 sentences, test = 2,416 sentences
Generative model(Collins 1999): 88.2% F-measure
Reranked model: 89.5% F-measure(11% relative error reduction)

  • Results from Charniak and Johnson, 2005
    Improvement from 89.7% (baseline generative model) to 91.0% accuracy
    Gains from improved n-best lists, better features, better baseline model

Summary

  • A new framework: global linear models
    GEN, f, v
  • There are several ways to train the parameters v:
    Perceptron
    Boosting
    Log-linear models (maximum-likelihood)
  • Applications:
    Parsing
    Generation
    Machine translation
    Tagging problems
    Speech recognition

GLMs for Tagging

Global Linear Models Part 2: The Perceptron Algorithm for Tagging
For tagging we have already covered HMMs and log-linear taggers; this is a third approach, GLMs trained with the perceptron.
Tagging using Global Linear Models
Image Loading
For example:
n = 3
w = the dog barks
T = {D, N, V}
GEN(the dog barks) = {DDD, DDN, DND, DNN, …}, 3^3 = 27 sequences in total
f(the dog barks, D N V) = (1, 0, 0, 1)

Representation: Histories
Image Loading

Local Feature-Vector Representations
Image Loading
A tagged sentence with n words has n history/tag pairs.
Image Loading
global feature vector = the sum of local feature vectors

Global and Local Features
Image Loading

Putting it all Together
Image Loading

Recap: A Variant of the Perceptron Algorithm
Image Loading

Training a Tagger Using the Perceptron Algorithm
Image Loading

An Example
Image Loading

Experiments

  • Wall Street Journal part-of-speech tagging data
    Perceptron = 2.89% error, Log-linear tagger = 3.28% error
  • [Ramshaw and Marcus, 1995] NP chunking data
    Perceptron = 93.63% accuracy, Log-linear tagger = 93.29% accuracy

GLMs for Dependency Parsing

Global Linear Models Part 3: The Perceptron Algorithm for Dependency Parsing

Dependency Parsing

Unlabeled Dependency Parses
Image Loading
In labeled dependency parsing, each arc is additionally labeled with the relation it represents.

All Dependency Parses for John saw Mary
Image Loading
The correct parse is the second one in the left column.

Conditions on Dependency Structures
Image Loading

  • The dependency arcs form a directed tree, with the root symbol at the root of the tree.
    (Definition: A directed tree rooted at root is a tree, where for every word w other than the root, there is a directed path from root to w.)
  • There are no “crossing dependencies”.
    Dependency structures with no crossing dependencies are sometimes referred to as projective structures.

No cycles, and no crossing arcs.
Structures with crossing arcs are called non-projective; in this course they are considered disallowed.
Image Loading

Dependency Parsing Resources

  • CoNLL 2006 conference had a “shared task” with dependency parsing of 12 languages (Arabic, Chinese, Czech, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish). 19 different groups developed dependency parsing systems. (See also CoNLL 2007).
  • PhD thesis on the topic: Ryan McDonald, Discriminative Training and Spanning Tree Algorithms for Dependency Parsing, University of Pennsylvania.
  • For some languages, e.g., Czech, there are “dependency banks” available which contain training data in the form of sentences paired with dependency structures.
  • For other languages, we can extract dependency structures from treebanks.
    (Figure: extracting a dependency structure from a treebank; the S at the bottom right should be S(was, Vt).)
    Head words are traced down level by level; this is exactly the lexicalization process described earlier.

Efficiency of Dependency Parsing

  • PCFG parsing is O(n^3 G^3), where n is the length of the sentence and G is the number of non-terminals in the grammar.
  • Lexicalized PCFG parsing is O(n^5 G^3), where n is the length of the sentence and G is the number of non-terminals in the grammar.
  • Unlabeled dependency parsing is O(n^3), thanks to dynamic programming; see Jason Eisner's papers.

Dependency parsing is very efficient, and the representations are very useful.

GLMs for Dependency Parsing

GLMs for Dependency parsing
Image Loading
The size of GEN is exponential in n.
Image Loading
x is a sentence; y is a dependency structure.

Definition of Local Feature Vectors
Image Loading

Example: for the earlier sentence “John saw a movie”, features such as the following could be used:
g1(x,h,m) = 1 if xh=saw, and xm=movie; 0 otherwise
g2(x,h,m) = 1 if POS(h)=VBD, and POS(m)=NN; 0 otherwise
g3(x,h,m) = 1 if |h-m|>10; 0 otherwise
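A minimal sketch of scoring a dependency structure with such local features; the feature functions and the representation below are illustrative, mirroring the hypothetical g1–g3 above:

```python
def local_features(x, pos_tags, h, m):
    """Sparse local feature dict g(x, h, m) for an arc from head h to modifier m
    (h and m are word positions in the sentence x). Illustrative features only."""
    features = {
        "head_word=%s,mod_word=%s" % (x[h], x[m]): 1.0,
        "head_pos=%s,mod_pos=%s" % (pos_tags[h], pos_tags[m]): 1.0,
    }
    if abs(h - m) > 10:
        features["long_distance_arc"] = 1.0
    return features

def score_dependency_structure(x, pos_tags, dependencies, v):
    """v . f(x, y), where f(x, y) is the sum of g(x, h, m) over all arcs (h, m) in y."""
    total = 0.0
    for h, m in dependencies:
        for name, value in local_features(x, pos_tags, h, m).items():
            total += v.get(name, 0.0) * value
    return total
```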

Results from McDonald(2005)

Image Loading
G = Num of non-terminals

Extensions

  • 2nd-order dependency parsing, 3rd-order dependency parsing
    These no longer look at each dependency arc in isolation; they reach 93-94% accuracy.
  • Non-projective dependency structures
    spanning tree algorithms

Summary

Image Loading


(「・ω・)「
Finished the day before spring break!