Natural Language Processing with Python 笔记

A Stellar Hiker

2016-02-05

Book, Natural Language Processing, Python

网上能找到的pdf格式中英文版均基于Python2.5，NLTK现已更新支持Python3.0，最新书籍英文版参看这里。还有Errata。
Edward Loper, Ewan Klein, and Steven Bird, Stanford, July 2007

Preface

Software Requirements

Python
The material presented in this book assumes that you are using Python version 3.2 or later. (Note that NLTK 3.0 also works with Python 2.6 and 2.7.)
NLTK
The code examples in this book use NLTK version 3.0. Subsequent releases of NLTK will be backward-compatible with NLTK 3.0.
NLTK-Data
This contains the linguistic corpora that are analyzed and processed in the book.
NumPy
(recommended) This is a scientific computing library with support for multidimensional arrays and linear algebra, required for certain probability, tagging, clustering, and classification tasks.
Matplotlib
(recommended) This is a 2D plotting library for data visualization, and is used in some of the book’s code samples that produce line graphs and bar charts.
Stanford NLP Tools:
(recommended) NLTK includes interfaces to the Stanford NLP Tools which are useful for large scale language processing (see http://nlp.stanford.edu/software/).
NetworkX
(optional) This is a library for storing and manipulating network structures consisting of nodes and edges. For visualizing semantic networks, also install the Graphviz library.
Prover9
(optional) This is an automated theorem prover for first-order and equational logic, used to support inference in language processing.

Natural Language Toolkit (NLTK)

Language processing task	NLTK modules	Functionality
Accessing corpora	corpus	standardized interfaces to corpora and lexicons
String processing	tokenize, stem	tokenizers, sentence tokenizers, stemmers
Collocation discovery	collocations	t-test, chi-squared, point-wise mutual information
Part-of-speech tagging	tag	n-gram, backoff, Brill, HMM, TnT
Machine learning	classify, cluster, tbl	decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking	chunk	regular expression, n-gram, named-entity
Parsing	parse, ccg	chart, feature-based, unification, probabilistic, dependency
Semantic interpretation	sem, inference	lambda calculus, first-order logic, model checking
Evaluation metrics	metrics	precision, recall, agreement coefficients
Probability and estimation	probability	frequency distributions, smoothed probability distributions
Applications	app, chat	graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork	toolbox	manipulate data in SIL Toolbox format

Language Processing and Python

本章要解决的问题：

What can we achieve by combining simple programming techniques with large quantities of text?
How can we automatically extract key words and phrases that sum up the style and content of a text?
What tools and techniques does the Python programming language provide for such work?
What are some of the interesting challenges of natural language processing?

在Ubuntu里设置时可能的坑

为了用新例子，更改Ubuntu的默认Python版本为3.2以上的版本。

# 安装pyenv
git clone git://github.com/yyuu/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
exec $SHELL -l
# 查看可安装的python版本
pyenv install --list
# 安装python所需依赖包
sudo apt-get install libc6-dev gcc
sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm
# 安装python
pyenv install 3.5.1 -v
# 更新数据库，用pip安装后也需更新
pyenv rehash
# 查看已安装的python版本
pyenv versions
# 设置全局的python版本
pyenv global 3.5.1

此时nltk等包需安装在Python指定版本下。用python -m pip install nltk。
如果不能正常使用IDLE，提示"IDLE can’t import Tkinter. Your python may not be configured for Tk."。使用以下方法：

First uninstall Python 3.5.1 : pyenv uninstall 3.5.1, then
Run sudo apt-get install tk-dev
And reinstall Python 3.5.1 : pyenv install 3.5.1

Computing with Language: Texts and Words

Token: technical name for a sequence of characters that we want to treat as a group.
Word Type: the form or spelling of the word independently of its specific occurrences in a text. i.e. the word considered as a unique item of vocabulary.

# import nltk module
import nltk
# import everything in book data
from nltk.book import *
# show name of the text source
text1
# searching text
text1.concordance("monstrous")
# what other words appear in a similar range of context
text1.similar("monstrous")
# the contexts shared by two or more words
text2.common_contexts(["monstrous", "very"])
# location of a word in the text: how many words from the beginning it appears
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
# generate random text. 3.0版本没有。
text3.generate()
# finding out the length of a text from start to finish
len(text3)
# a sorted list of the set of unique tokens(word type) in text
sorted(set(text3))
# length of the unique tokens set
len(set(text3))
# calculate lexical richness of the text
len(set(text3)) / len(text3)
# count how often a word occures in a text
text3.count("smote")
# compute what percentage of the text is taken up by a specific word
100 * text4.count('a') / len(text4)
# define python functions
def lexical_diversity(text)
	return len(set(text)) / len(text)
lexical_diversity(text3)

A Closer Look at Python: Texts as Lists of Words

基础Python3语法复习。
官方Python文档。
注意这里index从0开始。

Computing with Language: Simple Statistics

Hapaxes: words that occur only once.
Collocation: a sequence of words that occur together unusually often.
Bigrams: a list of word pairs.

Frequency Distributions

# find frequency distribution in text
fdist1 = FreqDist(text1)
print(fdist1)
fdist1.most_common(50)	# most common 50 words
fdist1["whale"]
fdist1.plot(50, cumulative=True)
fdist1.hapaxes()	# words that occur only once

Fine-grained Selection of Words

1
2
3

# find the words that are longer than seven characters that occur more than 7 times
fdist5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)

Collocations and Bigrams

# find bigrams
list(bigrams(['more', 'is', 'said', 'than', 'done']))
# find bigrams that occur more often than we would expect based on the frequency of the individual words
text4.collocations()

Counting Other Things

# the distribution of word lengths in a text
fdist = FreqDist(len(w) for w in text1)
fdist.most_common()
fdist.max()
fdist[3]
fdist.freq(3)

Functions Defined for NLTK’s Frequency Distributions

Example	Description
fdist = FreqDist(samples)	create a frequency distribution containing the given samples
fdist[sample] += 1	increment the count for this sample
fdist[‘monstrous’]	count of the number of times a given sample occurred
fdist.freq(‘monstrous’)	frequency of a given sample
fdist.N()	total number of samples
fdist.most_common(n)	the n most common samples and their frequencies
for sample in fdist:	iterate over the samples
fdist.max()	sample with the greatest count
fdist.tabulate()	tabulate the frequency distribution
fdist.plot()	graphical plot of the frequency distribution
fdist.plot(cumulative=True)	cumulative plot of the frequency distribution
fdist1 ¦= fdist2	update fdist1 with counts from fdist2
fdist1 < fdist2	test if samples in fdist1 occur less frequently than in fdist2

Back to Python: Making Decisions and Taking Control

Some Word Comparison Operators

Function	Meaning
s.startswith(t)	test if s starts with t
s.endswith(t)	test if s ends with t
t in s	test if t is a substring of s
s.islower()	test if s contains cased characters and all are lowercase
s.isupper()	test if s contains cased characters and all are uppercase
s.isalpha()	test if s is non-empty and all characters in s are alphabetic
s.isalnum()	test if s is non-empty and all characters in s are alphanumeric
s.isdigit()	test if s is non-empty and all characters in s are digits
s.istitle()	test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)

1
2
3

sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)
sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)
len(set(word.lower() for word in text1 if word.isalpha()))

Automatic Natural Language Understanding

Word Sense Disambiguation

Nearby words have closely related meanings.
Example:
a. The lost children were found by the searchers (agentive — Chesterton was the author of the book)
b. The lost children were found by the mountain (locative — the stove is where the cup is)
c. The lost children were found by the afternoon (temporal — Friday is the time of the submitting)

Pronoun Resolution

Who did what to whom.
Example:
a. The thieves stole the paintings. They were subsequently sold.
b. The thieves stole the paintings. They were subsequently caught.
c. The thieves stole the paintings. They were subsequently found.
They到底指代谁。
Anaphora Resolution: identifying what a pronoun or noun phrase refers to.
Semantic Role Labeling: identifying how a noun phrase relates to the verb.

Generating Language Output

Question Answering and Machine Translation
a. Text: … The thieves stole the paintings. They were subsequently sold. …
b. Human: Who or what was sold?
c. Machine: The paintings.
翻译成法语，指thief时用男性，指painting时用女性。
a. The thieves stole the paintings. They were subsequently found.
b. Les voleurs ont volé les peintures. Ils ont été trouvés plus tard. (the thieves)
c. Les voleurs ont volé les peintures. Elles ont été trouvées plus tard. (the paintings)

Machine Translation

把一个句子翻译过去又翻译回来就可以看到翻译系统的不足。
0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?
Text Alignment: pair up the sentences of different languages.

Spoken Dialog Systems

Turing Test: can a dialogue system, responding to a user’s text input, perform so natually that we cannot distinguish it from a human-generated response?

nltk.chat.chatbots()来感受一下分分钟起无名火的聊天系统。

Textual Entailment

Recognizing Textual Entailment (RTE)分析某段文字能否为某论点提供论证。
a. Text: David Golinkin is the editor or author of eighteen books, and over 150 responsa, articles, sermons and books
b. Hypothesis: Golinkin has written eighteen books
需要做如下论证：
(i) if someone is an author of a book, then he/she has written that book; (ii) if someone is an editor of a book, then he/she has not written (all of) that book; (iii) if someone is editor or author of eighteen books, then one cannot conclude that he/she is author of eighteen books.

Limitations of NLP

An important goal of NLP research has been to make progress on the difficult task of building technologies that “understand language,” using superficial yet powerful techniques instead of unrestricted knowledge and reasoning capabilities.

Accessing Text Corpora and Lexical Resources

本章要解决的问题：

What are some useful text corpora and lexical resources, and how can we access them with Python?
Which Python constructs are most helpful for this work?
How do we avoid repeating ourselves when writing Python code?

Copora: large bodied of linguistic data.

Accessing Text Corpora

Corporas

Gutenberg Corpus
contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/.

from nltk.corpus import gutenberg
gutenberg.fileids()
emma = nltk.Text(gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")

'''
average word length, average sentence length, the number of times each vocabulary item 
appears in the text on average(our lexical diversity score) for each text in gutenberg
'''
for fileid in gutenberg.fileids():
	num_chars = len(gutenberg.raw(fileid)) [1]
	num_words = len(gutenberg.words(fileid))
	num_sents = len(gutenberg.sents(fileid))
	num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
	print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

# sents() divides the text up into its sentences
macbeth_sentences = gutenberg.sents("shakespeare-macbeth.txt")
longest_len = max(len(s) for s in macbeth_sentences)
[s for s in macbeth_sentences if len(s) == longest_len]

words(),raw(),sents()之外还有更多操作，以后再讲。

Web and Chat Text
content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Carribean, personal advertisements, wine reviews.

1
2
3

from nltk.corpus import webtext
for fileid in webtext.fileids():
	print(fileid, webtext.raw(fileid)[:65], '...')

instant messaging chat sessions.

1
2
3

from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
chatroom[123]

Brown Corpus
first million-word electronic corpus of English, sources categorized by genre.
genre的完整列表见http://icame.uib.no/brown/bcm-los.html。

# compare genres in their usage of modal verbs
# 1. produce the counts for a particular genre
fro nltk.corpus import brown
brown.categories()
news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
	print(m + ':', fdist[m], end=' ')
# 2. obtain counts for each genre of interest
cfd = nltk.ConditionalFreqDist(
	(genre, word)
	for genre in brown.categories()
	for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

Reuters Corpus
classified into 90 topics, grouped into “training” and “test”.

# categories overlap with each other
from nltk.corpus import reuters
reuters.fileids()
reuters.categories()
reuters.categories(['training/9865', 'training/9880'])
reuters.fileids(['barley', 'corn'])
# words or sentences in terms of files or categories
reuters.words('training/9865')[:14]
reuters.words(categories=['barley', 'corn'])

Inaugural Address Corpus
a collection of 55 texts.

# get the year of each text
from nltk.corpus import inaugural
inaugural.fileids()
[fileid[:4] for fileid in inaugural.fileids()]
# 'America' and 'citizen' are used over time
cfd = nltk.ConditionalFreqDist(
	(target, fileid[:4])
	for fileid in inaugural.fileids()
	for w in inaugural.words(fileid)
	for target in ['america', 'citizen']
	if w.lower().startswith(target))
cfd.plot()

Plot of a Conditional Frequency Distribution: all words in the Inaugural Address Corpus that begin with america or citizen are counted; separate counts are kept for each address; these are plotted so that trends in usage over time can be observed; counts are not normalized for document length.

Annotated Text Corpora
contain linguistic annotations, representing POS tags, named entities, syntactic structures, semantic roles, and so forth.
For information about downloading them, see http://nltk.org/data. For more examples of how to access NLTK corpora, please consult the Corpus HOWTO at http://nltk.org/howto.

Some of the Corpora and Corpus Samples Distributed with NLTK

Corpus	Compiler	Contents
Brown Corpus	Francis, Kucera	15 genres, 1.15M words, tagged, categorized
CESS Treebanks	CLiC-UB	1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files	Pereira & Warren	World Geographic Database
CMU Pronouncing Dictionary	CMU	127k entries
CoNLL 2000 Chunking Data	CoNLL	270k words, tagged and chunked
CoNLL 2002 Named Entity	CoNLL	700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel)	CoNLL	150k words, dependency parsed (Basque, Catalan)
Dependency Treebank	Narad	Dependency parsed version of Penn Treebank sample
FrameNet	Fillmore, Baker et al	10k word senses, 170k manually annotated sentences
Floresta Treebank	Diana Santos et al	9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists	Various	Lists of cities and countries
Genesis Corpus	Misc web sources	6 texts, 200k words, 6 languages
Gutenberg (selections)	Hart, Newby, et al	18 texts, 2M words
Inaugural Address Corpus	CSpan	US Presidential Inaugural Addresses (1789-present)
Indian POS-Tagged Corpus	Kumaran et al	60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus	NILC, USP, Brazil	1M words, tagged (Brazilian Portuguese)
Movie Reviews	Pang, Lee	2k movie reviews with sentiment polarity classification
Names Corpus	Kantrowitz, Ross	8k male and female names
NIST 1999 Info Extr (selections)	Garofolo	63k words, newswire and named-entity SGML markup
Nombank	Meyers	115k propositions, 1400 noun frames
NPS Chat Corpus	Forsyth, Martell	10k IM chat posts, POS-tagged and dialogue-act tagged
Open Multilingual WordNet	Bond et al	15 languages, aligned to English WordNet
PP Attachment Corpus	Ratnaparkhi	28k prepositional phrases, tagged as noun or verb modifiers
Proposition Bank	Palmer	113k propositions, 3300 verb frames
Question Classification	Li, Roth	6k questions, categorized
Reuters Corpus	Reuters	1.3M words, 10k news documents, categorized
Roget’s Thesaurus	Project Gutenberg	200k words, formatted text
RTE Textual Entailment	Dagan et al	8k sentence pairs, categorized
SEMCOR	Rus, Mihalcea	880k words, part-of-speech and sense tagged
Senseval 2 Corpus	Pedersen	600k words, part-of-speech and sense tagged
SentiWordNet	Esuli, Sebastiani	sentiment scores for 145k WordNet synonym sets
Shakespeare texts (selections)	Bosak	8 books in XML format
State of the Union Corpus	CSPAN	485k words, formatted text
Stopwords Corpus	Porter et al	2,400 stopwords for 11 languages
Swadesh Corpus	Wiktionary	comparative wordlists in 24 languages
Switchboard Corpus (selections)	LDC	36 phonecalls, transcribed, parsed
Univ Decl of Human Rights	United Nations	480k words, 300+ languages
Penn Treebank (selections)	LDC	40k words, tagged and parsed
TIMIT Corpus (selections)	NIST/LDC	audio files and transcripts for 16 speakers
VerbNet 2.1	Palmer et al	5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus	OpenOffice.org et al	960k words and 20k affixes for 8 languages
WordNet 3.0 (English)	Miller, Fellbaum	145k synonym sets

Corpora in Other Languages

nltk.corpus.cess_esp.words()
nltk.corpus.floresta.words()
nltk.corpus.indian.words('hindi.pos')
# udhr, contains the Universal Declaration of Human Rights in over 300 languages
nltk.corpus.udhr.fileids()
nltk.corpus.udhr.words('Javanese-Latin1')[11:]
# use a conditional frequency distribution to examine the differences 
# in word lengths for a selection of languages included in the udhr corpus
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch',
	'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
	(lang, len(word))
	for lang in languages
	for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)

Cumulative Word Length Distributions: Six translations of the Universal Declaration of Human Rights are processed; this graph shows that words having 5 or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.

Text Corpus Structure
Common Structures for Text Corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories like genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus).

Corpus Functionality

Basic Corpus Functionality defined in NLTK [More]

Example	Description
fileids()	the files of the corpus
fileids([categories])	the files of the corpus corresponding to these categories
categories()	the categories of the corpus
categories([fileids])	the categories of the corpus corresponding to these files
raw()	the raw content of the corpus
raw(fileids=[f1,f2,f3])	the raw content of the specified files
raw(categories=[c1,c2])	the raw content of the specified categories
words()	the words of the whole corpus
words(fileids=[f1,f2,f3])	the words of the specified fileids
words(categories=[c1,c2])	the words of the specified categories
sents()	the sentences of the whole corpus
sents(fileids=[f1,f2,f3])	the sentences of the specified fileids
sents(categories=[c1,c2])	the sentences of the specified categories
abspath(fileid)	the location of the given file on disk
encoding(fileid)	the encoding of the file (if known)
open(fileid)	open a stream for reading the given corpus file
root	if the path to the root of locally installed corpus
readme()	the contents of the README file of the corpus

corpus readers can be used to work with new corpora.

Loading your own Corpus

Use NLTK’s PlaintextCorpusReader.
Set file location as the value of corpus_root.

from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '[abc]/.*\.txt')
wordlists.fileids()
wordlists.words('sample1')

from nltk.corpus import BracketParseCorpusReader
corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"
file_pattern = r".*/wsj_.*\.mrg"
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
ptb.fileids()
ptb.sents(fileids='20/wsj_2013.mrg')[19]

Conditional Frequency Distributions

Conditions and Events
A conditional frequency distribution needs to pair each event with a condition, so we will process a sequence of pairs with the form (condition, event).

Counting Words by Genre

from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
	(genre, word)
	for genre in brown.categories()
	for word in brown.words(categories=genre))
cfd.conditions()
cfd['romance'].most_common(20)
cfd['romance']['could']

Plotting and Tabulating Distributions

# 前文america和citizen在不同年份词频图
from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
	(target, fileid[:4])
	for fileid in inaugural.fileids()
	for w in inaugural.words(fileid)
	for target in ['america', 'citizen']
	if w.lower().startswith(target))

# 前文不同语言词汇数图
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
	(lang, len(word))
	for lang in languages
	for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'], samples=range(10), cumulative=True)

还可以看前文Brown Corpus那里的例子。

Generating Random Text with Bigrams

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

NLTK’s Conditional Frequency Distributions
commonly-used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters.

Example	Description
cfdist = ConditionalFreqDist(pairs)	create a conditional frequency distribution from a list of pairs
cfdist.conditions()	the conditions
cfdist[condition]	the frequency distribution for this condition
cfdist[condition][sample]	frequency for the given sample for this condition
cfdist.tabulate()	tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)	tabulation limited to the specified samples and conditions
cfdist.plot()	graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)	graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2	test if samples in cfdist1 occur less frequently than in cfdist2

More Python: Reusing Code

text editors and Python functions.

Creating Programs with a Text Editor
IDLE -> File Menu -> New File
Run Menu -> Run Module or from test import *

Functions

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'
plural('fairy')
plural('woman')

Modules

1 2	from text_proc import plural plural('wish')

Lexical Resources

Lexicon(or Lexical Resource): a collection of words and/or phrases along with associated information such as part of speech and sense definitions.
A lexical entry consists of a headword (also known as a lemma) along with additional information such as the part of speech and the sense definition. Two distinct words having the same spelling are called homonyms.
Lexicon Terminology: lexical entries for two lemmas having the same spelling (homonyms), providing part of speech and gloss information.

Wordlist Corpora
nothing more than wordlist. The Words Corpus is the /usr/share/dict/words file from Unix, used by some spell checkers.

# Filtering a Text: 
# computes the vocabulary of a text, then removes all items that occur in an existing wordlist, 
# leaving just the uncommon or mis-spelt words.
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)
unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
unusual_words(nltk.corpus.nps_chat.words())

stopwords: high-frequency words like the, to and also that we sometimes want to filter out of a document before further processing.

from nltk.corpus import stopwords
def content_fraction(text):
	stopwords = nltk.corpus.stopwords.words('english')
	content = [w for w in text if w.lower() not in stopwords]
	return len(content) / len(text)
content_fraction(nltk.corpus.reuters.words())

Example:
A Word Puzzle: a grid of randomly chosen letters with rules for creating words out of the letters; this puzzle is known as "Target."

# The FreqDist comparison method permits us to check that the frequency of each letter in the candidate word 
# is less than or equal to the frequency of the corresponding letter in the puzzle.
puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
[w for w in wordlist if len(w) >= 6 [1]
		and obligatory in w [2]
		and nltk.FreqDist(w) <= puzzle_letters]

Names ending in the letter a are almost always female.

from nltk.corpus import names
cfd = nltk.ConditionalFreqDist(
	(fileid, name[-1])
	for fileid in names.fileids()
	for name in names.words(fileid))
cfd.plot()

Conditional Frequency Distribution: this plot shows the number of female and male names ending with each letter of the alphabet; most names ending with a, e or i are female; names ending in h and l are equally likely to be male or female; names ending in k, o, r, s, and t are likely to be male.

A Pronouncing Dictionary
spreadsheet containing a word plus some properties in each row.
CMU Pronouncing Dictionary for US English, which was designed for use by speech synthesizers. 单词和其发音，如('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L'])，数字表示重音。

entries = nltk.corpus.cmudict.entries()
for word, pron in entries:
	if len(pron) == 3:
		ph1, ph2, ph3 = pron
		if ph1 == 'P' and ph3 == 'T':
			print(word, ph2, end=' ')

# finds all words whose pronunciation ends with a syllable sounding like nicks.
syllable = ['N', 'IH0', 'K', 'S']
[word for word, pron in entries if pron[-4:] == syllable]
[w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']
sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))

# find words having a particular stress pattern.
def stress(pron):
	return [char for phone in pron for char in phone if char.isdigit()]
[w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']]

# find all the p-words consisting of three sounds, and group them according to their first and last sounds.
p3 = [(pron[0]+'-'+pron[2], word)
	for (word, pron) in entries
	if pron[0] == 'P' and len(pron) == 3]
cfd = nltk.ConditionalFreqDist(p3)
for template in sorted(cfd.conditions()):
	if len(cfd[template]) > 10:
		words = sorted(cfd[template])
		wordstring = ' '.join(words)
		print(template, wordstring[:70] + "...")

1
2
3

prondict = nltk.corpus.cmudict.dict()
prondict['fire']
# [['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]

Comparative Wordlists
Swadesh wordlists consist of of about 200 common words in several languages. The languages are identified using an ISO 639 two-letter code.

from nltk.corpus import swadesh
swadesh.fileids()
swadesh.words('en')
fr2en = swadesh.entries(['fr', 'en'])	# 法英对照list
translate = dict(fr2en)
translate['chien']
de2en = swadesh.entries(['de', 'en'])
translate.update(dict(de2en))
translate['Hund']

Shoebox and Toolbox Lexicons
A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields. loose structured.

1 2	from nltk.corpus import toolbox toolbox.entries('rotokas.dic')

WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure.

Senses and Synonyms
The entity car.n.01 is called a synset, or “synonym set”, a collection of synonymous words (or “lemmas”):

from nltk.corpus import wordnet as wn
wn.synsets('motorcar')	# 有几种释义
wn.synset('car.n.01').lemma_names()	# 此释义下的所有同义词
wn.synset('car.n.01').definition()	# 释义
wn.synset('car.n.01').examples()	# 例句
wn.synset('car.n.01').lemmas()
wn.lemma('car.n.01.automobile').synset()
wn.lemma('car.n.01.automobile').name()

The WordNet Hierarchy
Fragment of WordNet Concept Hierarchy: nodes correspond to synsets; edges indicate the hypernym/hyponym relation, i.e. the relation between superordinate and subordinate concepts.

motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())
motorcar.hypernyms()
paths = motorcar.hypernym_paths()
len(paths)
[synset.name() for synset in paths[0]]
motorcar.root_hypernyms()

More Lexical Relations
from items to their components (meronyms) or to the things they are contained in (holonyms)

1
2
3

wn.synset('tree.n.01').part_meronyms()
wn.synset('tree.n.01').substance_meronyms()
wn.synset('tree.n.01').member_holonyms()

the act of walking involves the act of stepping, so walking entails stepping.

1	wn.synset('eat.v.01').entailments()

Some lexical relationships hold between lemmas, e.g., antonymy

1	wn.lemma('supply.n.02.supply').antonyms()

查看synset的lexical relation可以用dir(wn.synset('harmony.n.02'))。

Semantic Similarity

right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
right.lowest_common_hypernyms(minke)
 wn.synset('baleen_whale.n.01').min_depth()
right.path_similarity(minke)

查看帮助help(wn)。另一个相似的verb lexicon是nltk.corpus.verbnet。

Processing Raw Text

本章要解决的问题：

How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
How can we write programs to produce formatted output and save it in a file?

从本章起，代码执行前先：

1 2	import nltk, re, pprint from nltk import word_tokenize

Accessing Text from the Web and from Disk

Electronic Books
http://www.gutenberg.org/catalog/上有25000本书，多语种的。

# Access number 2554 from gutenberg
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')	# a string
raw[:75]

如果开了代理，要用：

1 2	proxies = {'http': 'http://www.someproxy.com:3128'} request.ProxyHandler(proxies)

Tokenization: break up the string into words and punctuation.

# tokenization
tokens = word_tokenize(raw)
tokens[:10]
# other nltk operations
text = nltk.Text(tokens)
text[1024:1062]
text.collocations()

Deal with HTML

# original web page
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]
# extract text out of html
from bs4 import BeautifulSoup
raw = BeautifulSoup(html).get_text()
tokens = word_tokenize(raw)
tokens
# manually find start and end of the text
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Processing Search Engine Results
Massive yet unstable.

Processing RSS Feeds

import feedparser
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
llog['feed']['title']
len(llog.entries)
post = llog.entries[2]
post.title
content = post.content[0].value
content[:70]
raw = BeautifulSoup(content).get_text()
word_tokenize(raw)

Reading Local Files

# read whole text
f = open('document.txt')
raw = f.read()
# read line by line
f = open('document.txt', 'rU')
for line in f:
	print(line.strip())
# read file of nltk
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'rU').read()

1 2	import os os.listdir('.')

Extracting Text from PDF, MSWord and Other Binary Formats
Use thirdparty libraries like pypdf and pywin32.
Better save is as a txt file manually. Or if the file is on the web, use search engine’s HTML version of the document.

Capturing User Input

1 2	s = input("Enter some text: ") print("You typed", len(word_tokenize(s)), "words.")

The NLP Pipiline
The Processing Pipeline: We open a URL and read its HTML content, remove the markup and select a slice of characters; this is then tokenized and optionally converted into an nltk.Text object; we can also lowercase all the words and extract the vocabulary.

Strings: Text Processing at the Lowest Level

Strings are immutable, Lists are mutable.

couplet = "Shall I compare thee to a Summer's day?"\
	"Thou are more lovely and more temperate:"	# 一个string
couplet = """Shall I compare thee to a Summer's day?
	Thou are more lovely and more temperate:"""	# 一个string并保留换行符
print(monty, "and the", grail)

from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.most_common(5)
[char for (char, count) in fdist.most_common()]

Useful String Methods:

Method	Functionality
s.find(t)	index of first instance of string t inside s (-1 if not found)
s.rfind(t)	index of last instance of string t inside s (-1 if not found)
s.index(t)	like s.find(t) except it raises ValueError if not found
s.rindex(t)	like s.rfind(t) except it raises ValueError if not found
s.join(text)	combine the words of the text into a string using s as the glue
s.split(t)	split s into a list wherever a t is found (whitespace by default)
s.splitlines()	split s into a list of strings, one per line
s.lower()	a lowercased version of the string s
s.upper()	an uppercased version of the string s
s.title()	a titlecased version of the string s
s.strip()	a copy of s without leading or trailing whitespace
s.replace(t, u)	replace instances of t with u inside s

Text Processing with Unicode

What is Unicode?
In Python, code points are written in the form \uXXXX, where XXXX is the number in 4-digit hexadecimal form.
Unicode Decoding and Encoding
From a Unicode perspective, characters are abstract entities which can be realized as one or more glyphs. Only glyphs can appear on a screen or be printed on paper. A font is a mapping from characters to glyphs.

Extracting encoded text from files

path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(path, encoding='latin2')
for line in f:
	line = line.strip()
	print(line)
	print(line.encode('unicode_escape'))	# 显示codepoint

In Python 3, source code is encoded using UTF-8 by default.

ord('ń')	# 324
hex(324)	# 0144
nacute = '\u0144'	# ń
print(nacute.encode('utf8'))

import unicodedata
lines = open(path, encoding='latin2').readlines()
line = lines[2]
print(line.encode('unicode_escape'))
for c in line: [1]
	if ord(c) > 127:
	print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))

# re dealing with unicode
line.find('zosta\u0142y')
line = line.lower()
line.encode('unicode_escape')
import re
m = re.search('\u015b\w*', line)
m.group()

1 2	# NLTK tokenizers allow Unicode word_tokenize(line)

Using your local encoding in Python
include the string # -*- coding: <coding> -*- as the first or second line of your file. Note that has to be a string like ‘latin-1’, ‘big5’ or ‘utf-8’.
Unicode and IDLE: UTF-8 encoded string literals in the IDLE editor; this requires that an appropriate font is set in IDLE's preferences; here we have chosen Courier CE.

Regular Expressions for Detecting Word Patterns

import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
# find words ending with ed
[w for w in wordlist if re.search('ed$', w)]
# find an 8-letter word with j as its third letter and t as its sixth letter
[w for w in wordlist if re.search('^..j..t..$', w)]
# what words can be produced using 4 words of each bucket
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
# + means one or more instances of the preceding item
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$', w)]
[w for w in chat_words if re.search('^[ha]+$', w)]
# use of \ {} () |
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
[w for w in wsj if re.search('(ed|ing)$', w)]

Basic Regular Expression Meta-Characters, Including Wildcards, Ranges and Closures

Operator	Behavior
.	Wildcard, matches any character
^abc	Matches some pattern abc at the start of a string
abc$	Matches some pattern abc at the end of a string
[abc]	Matches one of a set of characters
[A-Z0-9]	Matches one of a range of characters
ed¦ing¦s	Matches one of the specified strings (disjunction)
*	Zero or more of previous item, e.g. a, [a-z] (also known as Kleene Closure)
+	One or more of previous item, e.g. a+, [a-z]+
?	Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n}	Exactly n repeats where n is a non-negative integer
{n,}	At least n repeats
{,n}	No more than n repeats
{m,n}	At least m and no more than n repeats
a(b¦c)+	Parentheses that indicate the scope of the operators

Useful Applications of Regular Expressions

Extract Word Pieces

# find all the vowels in a word, then count them
word = 'supercalifragilisticexpialidocious'
len(re.findall(r'[aeiou]', word))
# look for all sequences of two or more vowels in some text, and determine their relative frequency
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
	for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)

Doing More with Word Pieces

# compress word like 'declaration' -> 'dclrtn'
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
	pieces = re.findall(regexp, word)
	return ''.join(pieces)
english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))	# Unvrsl Dclrtn of Hmn Rghts Prmble...
# extract all consonant-vowel sequences from the words of Rotokas
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
# add index
cv_word_pairs = [(cv, w) for w in rotokas_words
	for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['su']
cv_index['po']

Finding Word Stems
比如搜索的时候，laptops和laptop的效果是一样的。
我们将来用NLTK的内置stemmer，不过可以先看看用正则表达式可以怎么做。

def stem(word):
	regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
	stem, suffix = re.findall(regexp, word)[0]
	return stem
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
	is no basis for a system of government.  Supreme executive power derives from
	a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
[stem(t) for t in tokens]

Searching Tokenized Text
a special kind of regular expression for searching across multiple words in a text.

from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby.findall(r"<a> (<.*>) <man>")	# monied; nervous; dangerous; white; white; white; pious;...
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")	# you rule bro; telling you bro; u twizted bro
chat.findall(r"<l.*>{3,}")	# lol lol lol; lmao lol lol; lol lol lol; la la la la la;...

nltk.re_show(p, s)annotates the string s to show every place where pattern p was matched. nltk.app.nemo()provides a graphical interface for exploring regular expressions.

Normalizeing Text

Stemmers

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
	is no basis for a system of government.  Supreme executive power derives from
	a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
# 3 different stemmers
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]
[lancaster.stem(t) for t in tokens]

# indexing some texts and want to support search using alternative forms of words
class IndexedText(object):
    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))
    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = int(width/4)                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '{:>{width}}'.format(lcontext[-width:], width=width)
            rdisplay = '{:{width}}'.format(rcontext[:width], width=width)
            print(ldisplay, rdisplay)
    def _stem(self, word):
        return self._stemmer.stem(word).lower()
porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

Lemmatization

1
2
3

# compile the vocabulary of some texts and want a list of valid lemmas
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

Regular Expressions for Tokenizing Text

Simple Approaches to Tokenization
最简单的就是从空格劈开。re.split(r'[ \t\n]+', raw)。
所有字母组劈开。re.split(r'\W+', raw)。
考虑连字符等。re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)。

Regular Expression Symbols

Symbol	Function
\b	Word boundary (zero width)
\d	Any decimal digit (equivalent to [0-9])
\D	Any non-digit character (equivalent to [^0-9])
\s	Any whitespace character (equivalent to [ \t\n\r\f\v])
\S	Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w	Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W	Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t	The tab character
\n	The newline character

NLTK’s Regular Expression Tokenizer

text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
	([A-Z]\.)+        # abbreviations, e.g. U.S.A.
	| \w+(-\w+)*        # words with optional internal hyphens
	| \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
	| \.\.\.            # ellipsis
	| [][.,;"'?():-_`]  # these are separate tokens; includes ], [
	'''
nltk.regexp_tokenize(text, pattern)	# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

可以先列好一组词，然后和tokenlizer的结果比较判断好坏。set(tokens).difference(wordlist)。

Segmentation

Sentence Segmentation
Punkt sentence segmenter

1
2
3

text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
pprint.pprint(sents[79:89])

Word Segmentataion
没有明显词语边界的语言，比如我们中文。或者一些实时语言流，没有标注，也无法预知后一词。
Object Function，该字后要分词就标1，用下图方式评估好坏，然后随机出一大堆，选里面坠吼得。

# Reconstruct Segmented Text from String Representation
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words
# ------------------------------------------
# Computing the Cost of Storing the Lexicon and Reconstructing the Source Text
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size
# ------------------------------------------
# Non-Deterministic Search Using Simulated Annealing
# begin searching with phrase segmentations only;
# randomly perturb the zeros and ones proportional to the "temperature"; 
# with each iteration the temperature is lowered and the perturbation of boundaries is reduced.
from random import randint
def flip(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]
def flip_n(segs, n):
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs
def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print(evaluate(text, segs), segment(text, segs))
    print()
    return segs
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
segment(text, seg1)
evaluate(text, seg1)
anneal(text, seg1, 5000, 1.2)

Formatting: From Lists to Strings

From Lists to Strings
join()

1 2	silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.'] ';'.join(silly) # We;called;him;Tortoise;because;he;taught;us;.

Strings and Formats

1
2
3

fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in sorted(fdist):
	print('{0}->{1};'.format(word, fdist[word]), end=' ')	# cat->3; dog->4; snake->1;

Lining Things Up

'{:<6}' .format(41)	# '41    '
'{:>6}'.format('dog')	# '   dog'
'{:.4f}'.format(math.pi)	# '3.1416'
count, total = 3205, 9375
"accuracy for {} words: {:.4%}".format(total, count / total)	# 'accuracy for 9375 words: 34.1867%'

# tabulating data
def tabulate(cfdist, words, categories):
    print('{:16}'.format('Category'), end=' ')                    # column headings
    for word in words:
        print('{:>6}'.format(word), end=' ')
    print()
    for category in categories:
        print('{:16}'.format(category), end=' ')                  # row heading
        for word in words:                                        # for each word
            print('{:6}'.format(cfdist[category][word]), end=' ') # print table cell
        print()                                                   # end the row
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
	(genre, word)
	for genre in brown.categories()
	for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
tabulate(cfd, modals, genres)

Writing Results to a File
文件名中不要有空格，或是只有大小写不同。

output_file = open('output.txt', 'w')
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
for word in sorted(words):
	print(word, file=output_file)

Text Wrapping

from textwrap import fill
format = '%s (%d),'
pieces = [format % (word, len(word)) for word in saying]
output = ' '.join(pieces)
wrapped = fill(output)
print(wrapped)

Writing Structured Programs

本章要解决的问题：

How can you write well-structured, readable programs that you and others will be able to re-use easily?
How do the fundamental building blocks work, such as loops, functions and assignment?
What are some of the pitfalls with Python programming and how can you avoid them?

Back to the Basics

Assignment
List的Assignment实际上是引用。用id()可以看到id是否相同。

nested = [[]] * 3
nested[1].append('Python')	# 三个引用都变成Python
nested[1] = ['Monty']	# 其中一个指向另一个list，剩下两个不受影响
nested	# [['Python'], ['Monty'], ['Python']]

想要拷贝list，使用bar = foo[:]，不过list里面的引用还是老样子。如果完全不想拷贝任何引用，使用copy.deepcopy()。

Equality
==表示值相等，is表示是同一物体。

Conditionals
在if条件中，非空string或list被视为True，空的string或list视为False。
all()和any()可以被用在list上，检查all或any项满足某条件。

1 2	all(len(w) > 4 for w in sent) # False any(len(w) > 4 for w in sent) # True

Sequences

tuple初始化时用,，可以省略()。

1	typleT = 'walk', 'fem', 3

Operating on Sequence Types

Python Expression	Comment
for item in s	iterate over the items of s
for item in sorted(s)	iterate over the items of s in order
for item in set(s)	iterate over unique elements of s
for item in reversed(s)	iterate over elements of s in reverse
for item in set(s).difference(t)	iterate over elements of s not in t

words = ['I', 'turned', 'off', 'the', 'spectroroute']
tags = ['noun', 'verb', 'prep', 'det', 'noun']
list(zip(words, tags))	# [('I', 'noun'), ('turned', 'verb'), ('off', 'prep'), ('the', 'det'), ('spectroroute', 'noun')]
list(enumerate(words))	# [(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]

1
2
3

text = nltk.corpus.nps_chat.words()
cut = int(0.9 * len(text))
training_data, test_data = text[:cut], text[cut:]

Combining Different Sequence Types

# sorting the words in a string by their length
words = 'I turned off the spectroroute'.split() [1]
wordlens = [(len(word), word) for word in words]
wordlens.sort()
' '.join(w for (_, w) in wordlens)

We often use lists to hold sequences of words. In contrast, a tuple is typically a collection of objects of different types, of fixed length. We often use a tuple to hold a record, a collection of different fields relating to some entity.

lexicon = [
		('the', 'det', ['Di:', 'D@']),
		('off', 'prep', ['Qf', 'O:f'])
	]

Generator Expressions

1 2	max([w.lower() for w in word_tokenize(text)]) max(w.lower() for w in word_tokenize(text)) # generator，优化内存使用

Questions of Style

“The Art of Computer Programming”.

Python Coding Style
A style guide for Python code published by the designers of the Python language.
consistency.

Procedural vs Declarative Style
Procedural更像机器操作步骤，应该用Declarative风格。

total = sum(len(t) for t in tokens)
print(total / len(tokens))
maxlen = max(len(word) for word in text)
[word for word in text if len(word) == maxlen]

Some Legitimate Uses for Counters

1
2
3

m, n = 3, 7
array = [[set() for i in range(n)] for j in range(m)]
array[2][5].add('Alice')

Functions: The Foundation of Structured Programming

import re
def get_text(file):
    """Read text from a file, normalizing whitespace and stripping HTML markup."""
    text = open(file).read()
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub('\s+', ' ', text)
    return text
>>> help(get_text)
|   Help on function get_text in module __main__:
|
|   get(text)
|       Read text from a file, normalizing whitespace and stripping HTML markup.
# 所以要写说明啊

Function Inputs and Outputs
Functions should modify the contents of a parameter, or return a value, not both.

Parameter Passing
value只传值，有结构的东西传的是引用。

Variable Scopr
Function可以通过global改变global变量的值，不过应该尽量避免使用。应该用参数和返回值。

Checking Parameter Types

def tag(word):
	assert isinstance(word, basestring), "argument to tag() must be a string"
	if word in ['a', 'the', 'all']:
		return 'det'
	else:
		return 'noun'

Functional Decomposition
大函数分解成若干小函数。不要有副作用。

from urllib import request
from bs4 import BeautifulSoup
def freq_words(url, n):
    html = request.urlopen(url).read().decode('utf8')
    text = BeautifulSoup(html).get_text()
    freqdist = nltk.FreqDist(word.lower() for word in word_tokenize(text))
    return [word for (word, _) in fd.most_common(n)]

Documenting Functions

def accuracy(reference, test):
    """
    Calculate the fraction of test items that equal the corresponding reference items.
    Given a list of reference values and a corresponding list of test values,
    return the fraction of corresponding values that are equal.
    In particular, return the fraction of indexes
    {0<i<=len(test)} such that C{test[i] == reference[i]}.
        >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
        0.5
    :param reference: An ordered list of reference values
    :type reference: list
    :param test: A list of values to compare against the corresponding
        reference values
    :type test: list
    :return: the accuracy score
    :rtype: float
    :raises ValueError: If reference and length do not have the same length
    """
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    num_correct = 0
    for x, y in zip(reference, test):
        if x == y:
            num_correct += 1
    return float(num_correct) / len(reference)

Doing More with Functions

Functions as Arguments

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
	'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
def extract_property(prop):
	return [prop(word) for word in sent]
extract_property(len)	# [4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]
extract_property(lambda w: w[-1])	# ['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']
sorted(sent, lambda x, y: cmp(len(y), len(x)))

Accumulative Functions
These functions start by initializing some storage, and iterate over input to build it up, before returning some final object.

def search2(substring, words):
    for word in words:
        if substring in word:
            yield word
for item in search2('zz', nltk.corpus.brown.words()):
	print(item, end=" ")

def permutations(seq):
	if len(seq) <= 1:
		yield seq
	else:
		for perm in permutations(seq[1:]):
			for i in range(len(perm)+1):
				yield perm[:i] + seq[0:1] + perm[i:]
list(permutations(['police', 'fish', 'buffalo']))

Higher-Order Functions
filter()

def is_content_word(word):
	return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']
sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
	'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
list(filter(is_content_word, sent))	# ['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']

map()applies a function to every item in a sequence.

# equivalent
lengths = list(map(len, nltk.corpus.brown.sents(categories='news')))
lengths = [len(sent) for sent in nltk.corpus.brown.sents(categories='news')]
sum(lengths) / len(lengths)	# 21.75081116158339
# equivalent
list(map(lambda w: len(filter(lambda c: c.lower() in "aeiou", w)), sent))	# [2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]
[len(c for c in w if c.lower() in "aeiou") for w in sent]

Named Arguments
Take care not to use a mutable object as the default value of a parameter.

def repeat(msg='<empty>', num=1):
	return msg * num
repeat(num=3)
repeat(num=5, msg='Alice')

*args= all the unnamed parameters of the function. **kwargs=named parameters.

def generic(*args, **kwargs):
	print(args)
	print(kwargs)
generic(1, "African swallow", monty="python")
# (1, 'African swallow')
# {'monty': 'python'}

1
2
3

# 用with会自动关闭打开的文档
with open("lexicon.txt") as f:
	data = f.read()

Program Development

Key high-level abilities are algorithm design and its manifestation in structured programming. Key low-level abilities include familiarity with the syntactic constructs of the language, and knowledge of a variety of diagnostic methods for trouble-shooting a program which does not exhibit the expected behavior.

Structure of a Python Module
The purpose of a program module is to bring logically-related definitions and functions together in order to facilitate re-use and abstraction.
individual .py files.
可以用__file__查看NLTK module的位置：

1	nltk.metrics.distance.__file__ # '/usr/lib/python2.5/site-packages/nltk/metrics/distance.pyc'

Some module variables and functions are only used within the module. These should have names beginning with an underscore, e.g. _helper(), since this will hide the name. If another module imports this one, using the idiom: from module import *, these names will not be imported. You can optionally list the externally accessible names of a module using a special built-in variable like this: __all__ = ['edit_distance', 'jaccard_distance'].

Multi-Module Programs
Structure of a Multi-Module Program: The main program my_program.py imports functions from two other modules; unique analysis tasks are localized to the main program, while common loading and visualization tasks are kept apart to facilitate re-use and abstraction.

Sources of Error
First, the input data may contain some unexpected characters.
Second, a supplied function might not behave as expected.
Third, our understanding of Python’s semantics may be at fault.

Debugging Techniques
print, stack trace
Python’s debugger:

1
2
3

import pdb
import mymodule
pdb.run('mymodule.myfunction()')

help/ step/ next/ break/ continue

Defensive Programming
assert如assert(isinstance(text, list))。
maintain a suite of test cases.regression testing

Algorithm Design

看我自己的算法总结系列。
Examples:

# Four Ways to Compute Sanskrit Meter: 
# (i) recursive; (ii) bottom-up dynamic programming; (iii) top-down dynamic programming; and (iv) built-in memoization.
def virahanka1(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        return s + l
def virahanka2(n):
    lookup = [[""], ["S"]]
    for i in range(n-1):
        s = ["S" + prosody for prosody in lookup[i+1]]
        l = ["L" + prosody for prosody in lookup[i]]
        lookup.append(s + l)
    return lookup[n]
def virahanka3(n, lookup={0:[""], 1:["S"]}):
    if n not in lookup:
        s = ["S" + prosody for prosody in virahanka3(n-1)]
        l = ["L" + prosody for prosody in virahanka3(n-2)]
        lookup[n] = s + l
    return lookup[n]
from nltk import memoize
@memoize
def virahanka4(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka4(n-1)]
        l = ["L" + prosody for prosody in virahanka4(n-2)]
        return s + l
# test
>>> virahanka1(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> virahanka2(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> virahanka3(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> virahanka4(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']

A Sample of Python Libraries

Matplotlib: 二维图形。

from numpy import arange
from matplotlib import pyplot
colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black
def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    ind = arange(len(words))
    width = 1 / (len(categories) + 1)
    bar_groups = []
    for c in range(len(categories)):
        bars = pyplot.bar(ind+c*width, counts[categories[c]], width,
                         color=colors[c % len(colors)])
        bar_groups.append(bars)
    pyplot.xticks(ind+width, words)
    pyplot.legend([b[0] for b in bar_groups], categories, loc='upper left')
    pyplot.ylabel('Frequency')
    pyplot.title('Frequency of Six Modal Verbs by Genre')
    pyplot.show()
# test
>>> genres = ['news', 'religion', 'hobbies', 'government', 'adventure']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfdist = nltk.ConditionalFreqDist(
...              (genre, word)
...              for genre in genres
...              for word in nltk.corpus.brown.words(categories=genre)
...              if word in modals)
...
>>> counts = {}
>>> for genre in genres:
...     counts[genre] = [cfdist[genre][word] for word in modals]
>>> bar_chart(genres, modals, counts)

Bar Chart Showing Frequency of Modals in Different Sections of Brown Corpus: this visualization was produced by the program

NetworkX: for defining and manipulating structures consisting of nodes and edges, known as graphs.

import networkx as nx
import matplotlib
from nltk.corpus import wordnet as wn

def traverse(graph, start, node):
    graph.depth[node.name] = node.shortest_path_distance(start)
    for child in node.hyponyms():
        graph.add_edge(node.name, child.name) [1]
        traverse(graph, start, child) [2]
def hyponym_graph(start):
    G = nx.Graph() [3]
    G.depth = {}
    traverse(G, start, start)
    return G
def graph_draw(graph):
    nx.draw_graphviz(graph,
         node_size = [16 * graph.degree(n) for n in graph],
         node_color = [graph.depth[n] for n in graph],
         with_labels = False)
    matplotlib.pyplot.show()
# test
>>> dog = wn.synset('dog.n.01')
>>> graph = hyponym_graph(dog)
>>> graph_draw(graph)

Visualization with NetworkX and Matplotlib: Part of the WordNet hypernym hierarchy is displayed, starting with dog.n.01 (the darkest node in the middle); node size is based on the number of children of the node, and color is based on the distance of the node from dog.n.01; this visualization was produced by the program

csv: read and write files stored in this format.

import csv
input_file = open("lexicon.csv", "rb") [1]
for row in csv.reader(input_file): [2]
	print(row)

NumPy: provides substantial support for numerical processing in Python.

# array
>>> from numpy import array
>>> cube = array([ [[0,0,0], [1,1,1], [2,2,2]],
...                [[3,3,3], [4,4,4], [5,5,5]],
...                [[6,6,6], [7,7,7], [8,8,8]] ])
>>> cube[1,1,1]
4
>>> cube[2].transpose()
array([[6, 7, 8],
       [6, 7, 8],
       [6, 7, 8]])
>>> cube[2,1:]
array([[7, 7, 7],
       [8, 8, 8]])
# algebra functions
>>> from numpy import linalg
>>> a=array([[4,0], [3,-5]])
>>> u,s,vt = linalg.svd(a)
>>> u
array([[-0.4472136 , -0.89442719],
       [-0.89442719,  0.4472136 ]])
>>> s
array([ 6.32455532,  3.16227766])
>>> vt
array([[-0.70710678,  0.70710678],
       [-0.70710678, -0.70710678]])

NLTK’s clustering package nltk.cluster makes extensive use of NumPy arrays, and includes support for k-means clustering, Gaussian EM clustering, group average agglomerative clustering, and dendrogram plots. For details, type help(nltk.cluster).

Other Python Libraries
help of the Python Package Index.

Categorizing and Tagging Words

本章要解决的问题：

What are lexical categories and how are they used in natural language processing?
What is a good Python data structure for storing words and their categories?
How can we automatically tag each word of a text with its word class?

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.

Using a Tagger

>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man day time year car moment world family house boy child country job
state girl place war way case question

Tagged Corpora

Representing Tagged Tokens

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

Reading Tagged Corpora

1 2	nltk.corpus.sinica_treebank.tagged_words() nltk.corpus.brown.tagged_words(tagset='universal')

相对应的，tagged_sents() divides up the tagged words into sentences rather than presenting them as one big list.

A Universal Part-of-Speech Tagset

Tag	Meaning	English Examples
ADJ	adjective	new, good, high, special, big, local
ADP	adposition	on, of, at, with, by, into, under
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner, article	the, a, some, most, every, no, which
NOUN	noun	year, home, costs, time, Africa
NUM	numeral	twenty-four, fourth, 1991, 14:24
PRT	particle	at, on, out, over per, that, up, with
PRON	pronoun	he, their, her, its, my, I, us
VERB	verb	is, say, told, given, playing, would
.	punctuation marks	. , ; !
X	other	ersatz, esprit, dunno, gr8, univeristy

# most common tags in the news category of the Brown corpus.
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.most_common()
[('NOUN', 30640), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389),
 ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264),
 ('NUM', 2166), ('X', 106)]

Nouns
Syntactic Patterns involving some Nouns:

Word	After a determiner	Subject of the verb
woman	the woman who I saw yesterday …	the woman sat down
Scotland	the Scotland I remember as a child …	Scotland has five million people
book	the book I bought yesterday …	this book recounts the colonization of Australia
intelligence	the intelligence displayed by the child …	Mary’s intelligence impressed her teachers

N for common nouns like book, NP for proper nouns like Scotland.

# what parts of speech occur before a noun
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
>>> fdist = nltk.FreqDist(noun_preceders)
>>> [tag for (tag, _) in fdist.most_common()]
['NOUN', 'DET', 'ADJ', 'ADP', '.', 'VERB', 'CONJ', 'NUM', 'ADV', 'PRT', 'PRON', 'X']

Verbs
Syntactic Patterns involving some Verbs:

Word	Simple	With modifiers and adjuncts (italicized)
fall	Rome fell	Dot com stocks suddenly fell like a stone
eat	Mice eat cheese	John ate the pizza with gusto

VBD past tense, VBN past participle.

wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
# frequency-ordered list of tags given a word
cfd1 = nltk.ConditionalFreqDist(wsj)
cfd1['yield'].most_common()
# find words which can be both VBD and VBN
[w for w in cfd1.conditions() if 'VBD' in cfd1[w] and 'VBN' in cfd1[w]]
# surrounding text
idx2 = wsj.index(('kicked', 'VBN'))
wsj[idx2-4:idx2+1]

Adjectives and Adverbs
Each dictionary and grammar classifies differently.

Unsiplified Tags

# finds all tags starting with NN, and provides a few example words for each one.
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())
>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print(tag, tagdict[tag])
NN [('year', 137), ('time', 97), ('state', 88), ('week', 85), ('man', 72)]
NN$ [("year's", 13), ("world's", 8), ("state's", 7), ("nation's", 6), ("company's", 6)]
NN$-HL [("Golf's", 1), ("Navy's", 1)]
NN$-TL [("President's", 11), ("Army's", 3), ("Gallery's", 3), ("University's", 3), ("League's", 3)]
NN-HL [('sp.', 2), ('problem', 2), ('Question', 2), ('business', 2), ('Salary', 2)]
NN-NC [('eva', 1), ('aya', 1), ('ova', 1)]
NN-TL [('President', 88), ('House', 68), ('State', 59), ('University', 42), ('City', 41)]
NN-TL-HL [('Fort', 2), ('Dr.', 1), ('Oak', 1), ('Street', 1), ('Basin', 1)]
NNS [('years', 101), ('members', 69), ('people', 52), ('sales', 51), ('men', 46)]
NNS$ [("children's", 7), ("women's", 5), ("janitors'", 3), ("men's", 3), ("taxpayers'", 2)]
NNS$-HL [("Dealers'", 1), ("Idols'", 1)]
NNS$-TL [("Women's", 4), ("States'", 3), ("Giants'", 2), ("Bros.'", 1), ("Writers'", 1)]
NNS-HL [('comments', 1), ('Offenses', 1), ('Sacrifices', 1), ('funds', 1), ('Results', 1)]
NNS-TL [('States', 38), ('Nations', 11), ('Masters', 10), ('Rules', 9), ('Communists', 9)]
NNS-TL-HL [('Nations', 1)]

Exploring Tagged Corpora

# "often" used as what?
>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
>>> tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
 PRT  ADV  ADP    . VERB  ADJ
   2    8    7    4   37    6

# searching for Three-Word Phrases Using POS Tags
from nltk.corpus import brown
def process(sentence):
    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
            print(w1, w2, w3)
>>> for tagged_sent in brown.tagged_sents():
...     process(tagged_sent)
...
combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become

Mapping Words to Properties Using Python Dictionaries

Indexing Lists vs Dictionaries
Linguistic Objects as Mappings from Keys to Values:

Linguistic Object	Maps From	Maps To
Document Index	Word	List of pages (where word is found)
Thesaurus	Word sense	List of synonyms
Dictionary	Headword	Entry (part-of-speech, sense definitions, etymology)
Comparative Wordlist	Gloss term	Cognates (list of words, one per language)
Morph Analyzer	Surface form	Morphological analysis (list of component morphemes)

Dictionaries in Python

pos = {}
for key, val in sorted(pos.items()):
	print(key + ":", val)
pos.keys()
pos.values()
pos.items()

Defining Dictionaries
dictionary keys must be immutable types.

1 2	pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'} pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')

Default Dictionaries
寻某值不遇时，自动创建一个有默认值的项。

pos = defaultdict(list)
pos['sleep'] = ['NOUN', 'VERB']
pos['ideas']	# []
# 可以这样设置值
pos = defaultdict(lambda: 'NOUN')
pos['colorless'] = 'ADJ'
 pos['blog']	# 'NOUN'
list(pos.items())	# [('blog', 'NOUN'), ('colorless', 'ADJ')] # [_automatically-added]

# 出现太少的词用UNK代替
alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
vocab = nltk.FreqDist(alice)
v1000 = [word for (word, _) in vocab.most_common(1000)]
mapping = defaultdict(lambda: 'UNK')
for v in v1000:
	mapping[v] = v
alice2 = [mapping[v] for v in alice]
alice2[:100]

Incrementally Updating a Dictionary

# Incrementally Updating a Dictionary, and Sorting by Value
from collections import defaultdict
counts = defaultdict(int)
from nltk.corpus import brown
for (word, tag) in brown.tagged_words(categories='news', tagset='universal'):
	counts[tag] += 1
from operator import itemgetter
sorted(counts.items(), key=itemgetter(1), reverse=True)	# [('NOUN', 30640), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ...]

NLTK中创建defaultdict(list)的一个更方便的形式。其实nltk.FreqDist实际上也是脱胎于defaultdict(int)。

1 2	anagrams = nltk.Index((''.join(sorted(w)), w) for w in words) anagrams['aeilnrt']

Complex Keys and Values

pos = defaultdict(lambda: defaultdict(int))
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
for ((w1, t1), (w2, t2)) in nltk.bigrams(brown_news_tagged):
	pos[(t1, w2)][t2] += 1
pos[('DET', 'right')]	# defaultdict(<class 'int'>, {'ADJ': 11, 'NOUN': 5})

Inverting a Dictionary
当想根据value查找key的时候。

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})
pos2 = defaultdict(list)
for key, value in pos.items():
	pos2[value].append(key)

Python’s Dictionary Methods: A summary of commonly-used methods and idioms involving dictionaries.

Example	Description
d = {}	create an empty dictionary and assign it to d
d[key] = value	assign a value to a given dictionary key
d.keys()	the list of keys of the dictionary
list(d)	the list of keys of the dictionary
sorted(d)	the keys of the dictionary, sorted
key in d	test whether a particular key is in the dictionary
for key in d	iterate over the keys of the dictionary
d.values()	the list of values in the dictionary
dict([(k1,v1), (k2,v2), …])	create a dictionary from a list of key-value pairs
d1.update(d2)	add all items from d2 to d1
defaultdict(int)	a dictionary whose default value is zero

Automatic Tagging

Automatically add part-of-speech tags to text. 因为word的tag在句子中才有意义，所以我们分析tagged过的sentence。

# 我们用这个示例
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

The Default Tagger
把所有词都tag成最可能的那一款。

tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
nltk.FreqDist(tags).max()	# 'NN'
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)

The Regular Expression Tagger
根据一定pattern对word做匹配。

patterns = [
		(r'.*ing$', 'VBG'),               # gerunds
		(r'.*ed$', 'VBD'),                # simple past
		(r'.*es$', 'VBZ'),                # 3rd singular present
		(r'.*ould$', 'MD'),               # modals
		(r'.*\'s$', 'NN$'),               # possessive nouns
		(r'.*s$', 'NNS'),                 # plural nouns
		(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
		(r'.*', 'NN')                     # nouns (default)
	]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[3])
regexp_tagger.evaluate(brown_tagged_sents)

The Lookup Tagger
先总结最常用词的tag，然后用他们做表在里面查找，没有就用默认值。
随着model size变大，tagger表现迅速达到一个高值，然后就停滞了。

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
def display():
    import pylab
    word_freqs = nltk.FreqDist(brown.words(categories='news')).most_common()
    words_by_freq = [w for (w, _) in word_freqs]
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()

Evaluation
评估表现很重要，因为错误在流水线中会不断放大。

N-Gram Tagging

Unigram Tagging
给每个词其最可能的tag。有点像lookup tagger，只不过是通过training建立的model。
Separating the Training and Testing Data.

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

General N-Gram Tagging
An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens, as shown in 5.1. The tag to be chosen, tn, is circled, and the context is shaded in grey. In the example of an n-gram tagger shown in 5.1, we have n=3; that is, we consider the tags of the two preceding words in addition to the current word. An n-gram tagger picks the tag that is most likely in the given context.
Tagger Context
NgramTagger
对见过的句子表现的很好，没见过的就糟糟糟。

bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])
unseen_sent = brown_sents[4203]
bigram_tagger.tag(unseen_sent)

Combining Taggers
Use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary.
例如：
Try tagging the token with the bigram tagger.
If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
If the unigram tagger is also unable to find a tag, use a default tagger.
用NLTK tagger的backoff参数。

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)

Tagging Unknown Words
A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK.
During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag.

Storing Taggers
保存训练好的tagger。

# save tagger t2 to a file t2.pk1
from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()
# load saved tagger
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()

Performance Limitations

# part-of-speech ambiguity of a trigram tagger.
cfd = nltk.ConditionalFreqDist(
           ((x[1], y[1], z[0]), z[1])	# curent word and the previous two tags
           for sent in brown_tagged_sents
           for x, y, z in nltk.trigrams(sent))
ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()	# 0.049297702068029296

confusion matrix charts expected tags (the gold standard) against actual tags generated by a tagger.

test_tags = [tag for sent in brown.sents(categories='editorial')
                 for (word, tag) in t2.tag(sent)]
gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]
print(nltk.ConfusionMatrix(gold_tags, test_tags))

Transformation-Based Tagging

n-gram tagger会有较大的稀疏散列表存储n-gram table(or language model)。而且只考虑之前word的tagger而忽略word本身的信息。
Brill Tagging，begin with broad brush strokes then fix up the details, with successively finer changes.
比如先用Unigram标记一遍，然后根据Brill Tagging中的每条Rule修改一遍结果，这些Rule也是training出来的。

1	nltk.tag.brill.demo()

How to Determine the Category of a Word

In general, linguists use morphological, syntactic, and semantic clues to determine the category of a word.
Morphological Clues
词的内部结构提供的信息，例如-ness is a suffix that combines with an adjective to produce a noun。
Syntactic Clues
该词出现的上下文。
Semantic Clues
根据词的定义。
New Words
名词多。noun属于open class，而preposition就是closed class。
Morphology in Part of Speech Tagsets
不同的语法结构暗示该词词性。

Learning to Classify Text

本章要解决的问题：

How can we identify particular features of language data that are salient for classifying it?
How can we construct models of language that can be used to perform language processing tasks automatically?
What can we learn about language from these models?

Supervised Classification

Gender Identification
首先要想好what features of the input are relevant, and how to encode those features.

# feature set
def gender_features(word):
	ruturn {'last_letter':word[-1]}
# examples
from nltk.corpus import names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])
import random
random.shuffle(labeled_names)
# divide into training set and test set and train
#featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
#train_set, test_set = featuresets[500:], featuresets[:500]
# does not store all the feature sets in memory
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])
classifier = nltk.NaiveBayesClassifier.train(train_set)
# test
classifier.classify(gender_features('Neo'))	# male
classifier.classify(gender_features('Trinity'))	# female
# evaluate
print(nltk.classify.accuracy(classifier, test_set))	# 0.77
classifier.show_most_informative_features(5)

Choosing the Right Features
充分理解手头任务，好好设计feature。
可以先使用想到的所有可能feature，然后检查哪些才是真正有用的。feature用得太多容易对training data过拟合。
使用一个dev set，用其进行error analysis。看看出错的都是哪些项，总结一下，修改feature。

def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features
# use
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set) [1]
print(nltk.classify.accuracy(classifier, devtest_set))	# 0.75
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

Document Classification
看看一篇影评到底是在肯定还是否定。

# get movie reviews in a tuple
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# define feature extractor. 2000 most frequent words in the overall corpus, check whether each of these words is present in a given document
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000] [1]

def document_features(document): [2]
    document_words = set(document) [3]
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

print(document_features(movie_reviews.words('pos/cv957_8737.txt'))) 	# {'contains(waste)': False, 'contains(lot)': False, ...}

# train a classifier to label new movie reviews
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))	# 0.81
classifier.show_most_informative_features(5)
###
Most Informative Features
   contains(outstanding) = True              pos : neg    =     11.1 : 1.0
        contains(seagal) = True              neg : pos    =      7.7 : 1.0
   contains(wonderfully) = True              pos : neg    =      6.8 : 1.0
         contains(damon) = True              pos : neg    =      5.9 : 1.0
        contains(wasted) = True              neg : pos    =      5.8 : 1.0
###

Part-of-Speech Tagging
通过后缀猜测词性。

# find the most common suffixes
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

# define a feature extractor function which checks a given word for these suffixes
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

# train
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)	# 0.62705121829935351
classifier.classify(pos_features('cats'))	# NNS

# decision tree model is easy to interpret, we can print out the pseudocode
print(classifier.pseudocode(depth=4))

Exploiting Context
不再只是孤立看待每个词，还要看上下文。所以参数不能只是一个词了，而是整句话。

def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features
# use
pos_features(brown.sents()[0], 8)	# {'suffix(3)': 'ion', 'prev-word': 'an', 'suffix(2)': 'on', 'suffix(1)': 'n'}
tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

nltk.classify.accuracy(classifier, test_set)	# 0.78915962207856782

Sequence Classification
In order to capture the dependencies between related classification tasks, we can use joint classifier models, which choose an appropriate labeling for a collection of related inputs. In the case of part-of-speech tagging, a variety of different sequence classifier models can be used to jointly choose part-of-speech tags for all the words in a given sentence.
Consecutive classification(greedy sequence classification): find the most likely class label for the first input, then to use that answer to help find the best label for the next input. The process can then be repeated until all of the inputs have been labeled. Used by bigram tagger.

# Part of Speech Tagging with a Consecutive Classifier
def pos_features(sentence, i, history):
     features = {"suffix(1)": sentence[i][-1:],
                 "suffix(2)": sentence[i][-2:],
                 "suffix(3)": sentence[i][-3:]}
     if i == 0:
         features["prev-word"] = "<START>"
         features["prev-tag"] = "<START>"
     else:
         features["prev-word"] = sentence[i-1]
         features["prev-tag"] = history[i-1]
     return features

class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)
# use
tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
tagger = ConsecutivePosTagger(train_sents)
print(tagger.evaluate(test_sents))	# 0.79796012981

Other Methods for Sequence Classification
之前的算法都是定了tag就不能回头。
一种方法是改用一个transformational strategy的classifier。Transformational joint classifiers work by creating an initial assignment of labels for the inputs, and then iteratively refining that assignment in an attempt to repair inconsistencies between related inputs. 比如The Brill tagger.
另一种方法是给每个tag打分，然后比较总分。就是Hidden Markov Models用的方法。还有Maximum Entropy Markov和Linear-Chain Conditional Random Field Models。

Further Examples of Supervised Classification

Sentence Segmentation
Sentence Segmentation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.

# obtain some data that has already been segmented into sentences and convert it into a form that is suitable for extracting features
sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)	# token is a merged list of tokens from the individual sentences
    offset += len(sent)
    boundaries.add(offset-1)	# boundaries is a set containing the indexes of all sentence-boundary tokens.

# specify the features
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

# create a list of labeled featuresets by selecting all the punctuation tokens, and tagging whether they are boundary tokens or not.
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']

# train and evaluate a punctuation classifier
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

# use
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents

Identifying Dialogue Act Types

# extract the basic messaging data
posts = nltk.corpus.nps_chat.xml_posts()[:10000]

# define a feature extractor that checks what words the post contains
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features

# construct the training and testing data by applying the feature extractor to each post and create a new classifier
featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))	# 0.67

Recognizing Textual Entailment
Recognizing Textual Entailment: the task of determining whether a given piece of text T entails another text called the “hypothesis”.
两句话是否是合理的上下文对话。注意这个合理并不是逻辑上，而是whether a human would conclude that the text provides reasonable evidence for taking the hypothesis to be true.也就是从text中能否合理推论出hypothesis的论点。

# word types serve as proxies for information, and out features count the degree of word overlap, and the degree to which there are words in the hypothesis but not in the text
# not all words are equally important, Named Entity is more significant. some high frequency function function words are filtered out as stopwords.
# "Recognizing Text Entailment" Feature Extractor. The RTEFeatureExtractor class builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.
def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
extractor = nltk.RTEFeatureExtractor(rtepair)
print(extractor.text_words)	# {'Russia', 'Organisation', 'Shanghai', 'Asia', 'four', 'at', 'operation', 'SCO', ...}
print(extractor.hyp_words)	# {'member', 'SCO', 'China'}
print(extractor.overlap('word'))	# set()
print(extractor.overlap('ne'))	# {'SCO', 'China'}
print(extractor.hyp_extra('word'))	# {'member'}

Scaling Up to Large Datasets
python的效率比较低，如果数据量大的话看看NLTK’s facilities for interfacing with external machine learning packages.

Evaluation

看看classification model到底可靠否。

The Test Set
即evaluation set。training data和test data不要太相似，别是同一个种类同一个文件的。

Accuracy
nltk.classify.accuracy()

Precision and Recall

True positives are relevant items that we correctly identified as relevant.
True negatives are irrelevant items that we correctly identified as irrelevant.
False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.

于是

Precision, which indicates how many of the items that we identified were relevant, is TP/(TP+FP).
Recall, which indicates how many of the relevant items that we identified, is TP/(TP+FN).
The F-Measure (or F-Score), which combines the precision and recall to give a single score, is defined to be the harmonic mean of the precision and recall: (2 × Precision × Recall) / (Precision + Recall).

Confusion Matrices
A confusion matrix is a table where each cell [i,j] indicates how often label j was predicted when the correct label was i.

>>> def tag_list(tagged_sents):
...     return [tag for sent in tagged_sents for (word, tag) in sent]
>>> def apply_tagger(tagger, corpus):
...     return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]
>>> gold = tag_list(brown.tagged_sents(categories='editorial'))
>>> test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
>>> cm = nltk.ConfusionMatrix(gold, test)
>>> print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
    |                                         N                      |
    |      N      I      A      J             N             V      N |
    |      N      N      T      J      .      S      ,      B      P |
----+----------------------------------------------------------------+
 NN | <11.8%>  0.0%      .   0.2%      .   0.0%      .   0.3%   0.0% |
 IN |   0.0%  <9.0%>     .      .      .   0.0%      .      .      . |
 AT |      .      .  <8.6%>     .      .      .      .      .      . |
 JJ |   1.7%      .      .  <3.9%>     .      .      .   0.0%   0.0% |
  . |      .      .      .      .  <4.8%>     .      .      .      . |
NNS |   1.5%      .      .      .      .  <3.2%>     .      .   0.0% |
  , |      .      .      .      .      .      .  <4.4%>     .      . |
 VB |   0.9%      .      .   0.0%      .      .      .  <2.4%>     . |
 NP |   1.0%      .      .   0.0%      .      .      .      .  <1.8%>|
----+----------------------------------------------------------------+
(row = reference; col = test)

Cross-Validation
perform multiple evaluations on different test sets, then to combine the scores from those evaluations, a technique known as cross-validation.

Decision Trees

Three machine learning methods that can be used to automatically build classification models: decision trees, naive Bayes classifiers, and Maximum Entropy classifiers.
Decision Tree model for the name gender task. Note that tree diagrams are conventionally drawn "upside down," with the root at the top, and the leaves at the bottom.

Entropy and Information Gain
information gain, measures how much more organized the input values become when we divide them up using a given feature. To measure how disorganized the original set of input values are, we calculate entropy of their labels, which will be high if the input values have highly varied labels, and low if many input values all have the same label. In particular, entropy is defined as the sum of the probability of each label times the log probability of that same label. 每个label都要有一定分量才好，不要只有一个label最多。

import math
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum(p * math.log(p,2) for p in probs)
# use
print(entropy(['male', 'male', 'male', 'male']))	# 0.0
print(entropy(['female', 'male', 'female', 'male']))	# 1.0
print(entropy(['female', 'female', 'male', 'female']))	# 0.811

Decision Tree的缺点是分支多就需要更大的training data，branches几何级数增长，影响力小的feature很难用上，被迫按照一定顺序进行判断。

Naive Bayes Classifiers

In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is the highest is then assigned to the input value.

每个feature在每种label里的可能性自己乘起来。是假设feature之间是完全独立的，所以naive。

Maximum Entropy Classifiers

不用probabilites，而是用search techniques to find a set of parameters that will maximize the performance of the classifier.
In particular, it looks for the set of parameters that maximizes the total likelihood of the training corpus, which is defined as:
$P(features) = \sum_{x|in|corpus}P(label(x)|features(x))$

$P(label|features) = P(label, features) / \sum_{label}P(label, features)$
使用iterative optimization，从随机参数开始，逐步迭代改进。算法可能会比较慢。

Generative vs Conditional Classifiers

Naive Bayes Classifier is an example of a generative classifier, which builds a model that predicts P(input, label), the joint probability of a (input, label) pair.可以回答以下问题：

What is the most likely label for a given input?
How likely is a given label for a given input?
What is the most likely input value?
How likely is a given input value?
How likely is a given input value with a given label?
What is the most likely label for an input that might have one of two values (but we don’t know which)?

Maximum Entropy Classifier is an example of a conditional classifier. Conditional classifiers build models that predict P(label|input) — the probability of a label given the input value. Thus, conditional models can still be used to answer questions 1 and 2. However, conditional models can not be used to answer the remaining questions 3-6.

Modeling Linguistic Patterns

Classifier的存在是为了更好地分析Linguistic Patterns。
descriptive models and explanatory models.
Descriptive models capture patterns in the data but they don’t provide any information about why the data contains those patterns.
Explanatory models attempt to capture properties and relationships that cause the linguistic patterns.
从corpus直接分析出来的一般都是descriptive models。

Extracting Information from Text

本章要解决的问题：

How can we build a system that extracts structured data, such as tables, from unstructured text?
What are some robust methods for identifying the entities and relationships described in a text?
Which corpora are appropriate for this work, and how do we use them for training and evaluating our models?

Information Extraction

First convert the unstructured data of natural language sentences into the structured data of the text. Then we reap the benefits of powerful query tools such as SQL.
使用此技术的应用有：business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning.

Information Extraction Architecture
Simple Pipeline Architecture for an Information Extraction System. This system takes the raw text of a document as its input, and generates a list of (entity, relation, entity) tuples as its output. For example, given a document that indicates that the company Georgia-Pacific is located in Atlanta, it might generate the tuple ([ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']).

## first 3 steps
import nltk, re, pprint
def ie_preprocess(document):
   sentences = nltk.sent_tokenize(document)
   sentences = [nltk.word_tokenize(sent) for sent in sentences]
   sentences = [nltk.pos_tag(sent) for sent in sentences]

Next, in named entity detection, we segment and label the entities that might participate in interesting relations with one another.
Finally, in relation extraction, we search for specific patterns between pairs of entities that occur near one another in the text, and use those patterns to build tuples recording the relationships between the entities.

Chunking

Entity Detection中一个基础技术Chunking。小方块表示word-level tokenization和part-of-speech tagging，大方块表示higher-level chunking。大方块就是Chunking。chunk们不会重叠。
Segmentation and Labeling at both the Token and Chunk Levels

Noun Phrase Chunking
NP-chunking。找individual noun phrases。part-of-speech tags在这里很有用。

# Example of a simple regular expression based NP Chunker
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
'''
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
'''
result.draw()

Image Loading

Tag Patterns
如<DT>?<JJ.*>*<NN.*>+表示 any sequence of tokens beginning with an optional determiner, followed by zero or more adjectives of any type (including relative adjectives like earlier/JJR), followed by one or more nouns of any type.

Chunking with Regular Expressions
可以同时指定多条chunking rule。

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), [1]
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))
'''
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
'''

Exploring Text Corpora
extract phrases matching a particular sequence of part-of-speech tags.

cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK': print(subtree)
'''
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
...
(CHUNK seems/VBZ to/TO overtake/VB)
(CHUNK want/VB to/TO buy/VB)
'''

Chinking
Chink is a sequence of tokens that is not included in a chunk.

||Entire chunk| Middle of a chunk| End of a chunk|
|—|---|—|
|Input| [a/DT little/JJ dog/NN]| [a/DT little/JJ dog/NN]| [a/DT little/JJ dog/NN]
|Operation| Chink “DT JJ NN”| Chink “JJ”| Chink “NN”
|Pattern| }DT JJ NN{| }JJ{| }NN{
|Output| a/DT little/JJ dog/NN| [a/DT] little/JJ [dog/NN]| [a/DT little/JJ] dog/NN

# put the entire sentence into a single chunk, then excise the chinks.
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
'''
(S
   (NP the/DT little/JJ yellow/JJ dog/NN)
   barked/VBD
   at/IN
   (NP the/DT cat/NN))
'''

Representing Chunks: Tags vs Trees
最广泛使用的文件表达方式用了IOB tags。In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O.
表现成Tag是这样的：
Tag Representation of Chunk Structures
在文件中表示是这样的：

We PRP B-NP
saw VBD O
the DT B-NP
yellow JJ I-NP
dog NN I-NP

表示成Tree是这样的：
Tree Representation of Chunk Structures

Developing and Evaluating Chunkers

Reading IOB Format and the CoNLL 2000 Corpus
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
Image Loading

1
2
3

# 从CoNLL 2000里拿出100句NP chunk的句子。
from nltk.corpus import conll2000
print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])

Simple Evaluation and Baselines

# Noun Phrase Chunking with a Unigram Tagger
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):	# train_sents is chunk tree
        train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]	# word, tag, chunk triples
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word,pos) in sentence]	# sentence is a tagged sentence, extract the part-of-speech tag from that sentence
        tagged_pos_tags = self.tagger.tag(pos_tags)	# tag the part-of-speech tags with IOB chunk tags
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)	# convert result back into a chunk tree
# use
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))
'''
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%
'''
# check, assign the unigram tagger to each of the part-of-speech tags that appear in the corpus
postags = sorted(set(pos for sent in train_sents
                     for (word,pos) in sent.leaves()))
print(unigram_chunker.tagger.tag(postags))
'''
[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'),
 (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'),
 ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'),
 ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'),
 ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'),
 ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'),
 ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'),
 ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
 ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'),
 ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]
'''

Training Classifier-Based Chunkers
Regular-expression based chunkers和N-gram chunkers都是完全根据part-of-speech tags来判断如何分chunk的。有时这个信息并不够，还应该用word本身有的信息。
例如下面的句子，chunk分法应完全不同。

1 2	a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./. b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

# Noun Phrase Chunking with a Consecutive Classifier
class ConsecutiveNPChunkTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w,t),c) for (w,t,c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

# use
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)}

def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
'''
ChunkParse score:
    IOB Accuracy:  96.0%
    Precision:     88.6%
    Recall:        91.0%
    F-Measure:     89.8%
'''

Recursion in Linguistic Structure

Building Nested Structure with Cascaded Chunkers
build chunk structures of arbitrary depth, simply by creating a multi-stage chunk grammar containing recursive rules.

# a chunker that handles NP, PP, VP and S
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """
cp = nltk.RegexpParser(grammar, loop=2)	# after the chunker trying all of the rules, it repeats the process.
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
    ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN")]

print(cp.parse(sentence))
'''
(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))
'''

并不是最好的方法，期待第8章学的full parsing。

Trees

tree1 = nltk.Tree('NP', ['Alice'])
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)	# (S (NP Alice) (VP chased (NP the rabbit)))
tree4[1].label()	# 'VP'
tree4.leaves()	# ['Alice', 'chased', 'the', 'rabbit']
tree4[1][1][1]	# 'rabbit'
tree3.draw()

Tree Traversal

# a recursive function to traverse a tree
def traverse(t):
    try:
        t.label()	# duck typing, to detect that t is a tree
    except AttributeError:
        print(t, end=" ")
    else:
        # Now we know that t.node is defined
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

 >>> t = nltk.Tree('(S (NP Alice) (VP chased (NP the rabbit)))')
 >>> traverse(t)
 ( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) )

Named Entity Recognition

“Facility”: human-made artifacts in the domains of architecture and civil engineering; and “GPE”: geo-political entities such as city, state/province, and country.

Commonly Used Types of Named Entity

NE Type	Examples
ORGANIZATION	Georgia-Pacific Corp., WHO
PERSON	Eddy Bonte, President Obama
LOCATION	Murray River, Mount Everest
DATE	June, 2008-06-29
TIME	two fifty a m, 1:30 p.m.
MONEY	175 million Canadian Dollars, GBP 10.40
PERCENT	twenty pct, 18.75 %
FACILITY	Washington Monument, Stonehenge
GPE	South East Asia, Midlothian

NER的目标就是identifying the boundaries of the NE, and identifying its type.

NLTK已经有一个训练好的classifer来识别named entities，nltk.ne_chunk().

>>> print(nltk.ne_chunk(sent, binary=False)) # binary为True则只识别出NE，不区分出子类。 
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)

Relation Extraction

当named entites确定了以后就要找关系了。
一个简单的方法，找(X, a, Y)的关系，其中a是一串表达此关系的词汇。

>>> IN = re.compile(r'.*\bin\b(?!\b.+ing)')  # 其中的(?!\b.+ing\b)是删掉类似success in supervising the transition of这样的结构。
>>> for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
...     for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
...                                      corpus='ieer', pattern = IN):
...         print(nltk.sem.rtuple(rel))
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']

用NLTK库里自带的方法，使用part-of-speech tags。例子是Dutch的。

from nltk.corpus import conll2002
vnv = """
(
is/V|    # 3rd sing present and
was/V|   # past forms of the verb zijn ('be')
werd/V|  # and also present
wordt/V  # past of worden ('become)
)
.*       # followed by anything
van/Prep # followed by van ('of')
"""
VAN = re.compile(vnv, re.VERBOSE)
for doc in conll2002.chunked_sents('ned.train'):
    for r in nltk.sem.extract_rels('PER', 'ORG', doc,
                                   corpus='conll2002', pattern=VAN):
        print(nltk.sem.clause(r, relsym="VAN"))
'''
VAN("cornet_d'elzius", 'buitenlandse_handel')
VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
VAN('annie_lennox', 'eurythmics')
'''

Analyzing Sentence Structure

本章要解决的问题：

How can we use a formal grammar to describe the structure of an unlimited set of sentences?
How do we represent the structure of sentences using syntax trees?
How do parsers analyze a sentence and automatically build a syntax tree?

Some Grammatical Dilemmas

Linguistic Data and Unlimited Possibilities
使用一些template可以将句子组合成无限长，由许多grammatical sentences组成。
Grammars use recursive productions of the form S → S and S.

Ubiquitous Ambiguity
例句：While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don’t know.

groucho_grammar = nltk.CFG.fromstring("""
	S -> NP VP
	PP -> P NP
	NP -> Det N | Det N PP | 'I'
	VP -> V NP | VP PP
	Det -> 'an' | 'my'
	N -> 'elephant' | 'pajamas'
	V -> 'shot'
	P -> 'in'
	""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
    print(tree)

"""
(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
"""

这个句子可以被解读为这样两棵树。
Image Loading

What’s the Use of Syntax?

Beyond n-grams