Feature Extraction (Feature Generation)
Numeric Features
- Numeric feature preprocessing differs for tree-based and non-tree-based models:
  a. Tree-based models do not depend on feature scaling.
  b. Non-tree-based models depend heavily on scaling.
- The most often used preprocessings are:
  a. MinMaxScaler - scales to [0, 1]
  b. StandardScaler - scales to mean == 0, std == 1
  c. Rank - sets the spaces between sorted values to be equal
  d. np.log(1 + x) and np.sqrt(1 + x)
- Feature generation is powered by:
  a. Prior knowledge
  b. Exploratory data analysis
```python
# scaling to [0, 1]
from sklearn.preprocessing import MinMaxScaler
X_scaled = MinMaxScaler().fit_transform(X)
```
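The snippet above covers MinMax scaling only; below is a minimal sketch, on a hypothetical numeric column, of the other preprocessings listed (StandardScaler, rank transform, log/sqrt transforms).

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata
from sklearn.preprocessing import StandardScaler

# hypothetical numeric feature with an outlier, e.g. a price column
x = pd.Series([1.0, 3.0, 100.0, 1000.0])

# StandardScaler: mean == 0, std == 1
x_std = StandardScaler().fit_transform(x.values.reshape(-1, 1))

# Rank: spaces between sorted values become equal (robust to outliers)
x_rank = rankdata(x)

# Log / sqrt transforms: shrink large values, helpful for non-tree models
x_log = np.log1p(x)       # same as np.log(1 + x)
x_sqrt = np.sqrt(x + 1)
```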
Categorical and ordinal features
- Values in ordinal features are sorted in some meaningful order.
- Label encoding maps categories to numbers.
- Frequency encoding maps categories to their frequencies.
- Label and frequency encodings are often used for tree-based models.
- One-hot encoding is often used for non-tree-based models.
- Interactions of categorical features can help linear models and KNN.
Ordinal features
```python
# Label encoding - categories mapped to integers in alphabetical (sorted) order
from sklearn.preprocessing import LabelEncoder
codes = LabelEncoder().fit_transform(['S', 'C', 'Q'])  # C -> 0, Q -> 1, S -> 2, i.e. [2, 0, 1]
```
Categorical features
```python
# One-hot encoding
import pandas as pd
dummies = pd.get_dummies(pd.Series(['S', 'C', 'Q', 'S']), prefix='embarked')
```
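Frequency encoding has no snippet above, so here is a minimal sketch on a hypothetical categorical column: each category is mapped to its share of the column, which is often sufficient for tree-based models.

```python
import pandas as pd

# hypothetical categorical column
embarked = pd.Series(['S', 'C', 'Q', 'S', 'S', 'C'], name='Embarked')

# Frequency encoding: map each category to its relative frequency
freq = embarked.value_counts(normalize=True)   # S: 0.5, C: 0.33..., Q: 0.17...
embarked_freq = embarked.map(freq)
```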
Datetime
- Periodicity
  Day number in week, month, season, year; second, minute, hour.
- Time since
  a. A row-independent moment
     For example: since 00:00:00 UTC, 1 January 1970.
  b. A row-dependent important moment
     For example: number of days left until the next holiday, or time passed since the last holiday.
- Difference between dates
  datetime_feature_1 - datetime_feature_2
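A minimal sketch of these three kinds of datetime features, assuming a pandas DataFrame with a 'date' column and a hypothetical 'last_purchase' column:

```python
import pandas as pd

# hypothetical data: a transaction date and the date of the previous purchase
df = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-03', '2014-02-14']),
    'last_purchase': pd.to_datetime(['2013-12-20', '2014-02-01']),
})

# Periodicity
df['weekday'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month

# Time since a row-independent moment (the Unix epoch)
df['days_since_epoch'] = (df['date'] - pd.Timestamp('1970-01-01')).dt.days

# Difference between two datetime features (row-dependent)
df['days_since_last_purchase'] = (df['date'] - df['last_purchase']).dt.days
```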
Coordinates
Useful features can be derived from coordinates, for example:
a. Distances to interesting places taken from the train/test data or from additional data
b. Distances to centers of clusters
c. Aggregated statistics for the surrounding area
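A minimal sketch of the cluster-center idea: cluster the coordinates and use the distance to the nearest center as a feature. The coordinates and the choice of k = 2 clusters are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# hypothetical coordinates
df = pd.DataFrame({'lat': [55.75, 55.76, 55.60, 55.61],
                   'lon': [37.62, 37.63, 37.50, 37.52]})

coords = df[['lat', 'lon']].values
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

# distance from each point to its nearest cluster center
centers = kmeans.cluster_centers_[kmeans.labels_]
df['dist_to_center'] = np.linalg.norm(coords - centers, axis=1)
```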
Feature Extraction from Texts
Text -> vector
- Bag of words
  a. Very large vectors.
  b. The meaning of each value in the vector is known.
  c. N-grams can help to use local context.
  d. TFiDF can be of use as postprocessing.
- Embeddings (~word2vec)
  a. Relatively small vectors.
  b. Values in the vector can be interpreted only in some cases.
  c. Words with similar meanings often have similar embeddings.
  d. Pretrained models are available.
Bag of Words
Pipeline of applying BOW
- Preprocessing:
  Lowercasing, stemming, lemmatization, stopword removal
  (stopwords: sklearn.feature_extraction.text.CountVectorizer, parameter max_df)
- Bag of words:
  N-grams can help to use local context
  (sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer)
- Postprocessing: TFiDF
  (word counts: sklearn.feature_extraction.text.CountVectorizer;
  TFiDF: sklearn.feature_extraction.text.TfidfVectorizer)
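A minimal sketch of this pipeline on a toy corpus (the texts are placeholders), using the two sklearn classes mentioned above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical corpus
texts = ["The cat sat on the mat", "The dog sat on the log"]

# Bag of words with unigrams and bigrams; lowercasing is on by default
bow = CountVectorizer(ngram_range=(1, 2))
X_counts = bow.fit_transform(texts)        # sparse matrix: documents x n-grams

# TFiDF as "counting + postprocessing" in one step
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(texts)
```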
The main idea of the tf-idf model: if a word w appears frequently in a document d but rarely in other documents, then w has good discriminative power and is well suited for separating document d from the rest of the collection. The model combines two factors:
- The term frequency tf (Term Frequency) of word w in document d, i.e. the ratio of the number of occurrences count(w, d) of w in d to the total number of words size(d) in d:
  tf(w, d) = count(w, d) / size(d)
- The inverse document frequency idf (Inverse Document Frequency) of word w over the whole document collection D, i.e. the logarithm of the ratio of the total number of documents n to the number of documents docs(w, D) that contain w:
  idf(w, D) = log(n / docs(w, D))
From tf and idf, the tf-idf model computes, for each document d and each query q made up of keywords w[1]…w[k], a weight that expresses how well query q matches document d:
  tf-idf(q, d) = sum_{i=1..k} tf-idf(w[i], d) = sum_{i=1..k} tf(w[i], d) * idf(w[i], D)
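To make the formulas concrete, here is a tiny hand-rolled sketch (toy corpus, plain Python) that computes tf, idf, and the query weight exactly as defined above:

```python
import math

# toy corpus: a list of tokenized documents
docs = [["cat", "sat", "on", "mat"],
        ["dog", "sat", "on", "log"],
        ["cat", "chased", "dog"]]

def tf(w, d):
    return d.count(w) / len(d)                   # count(w, d) / size(d)

def idf(w, D):
    docs_with_w = sum(1 for d in D if w in d)    # docs(w, D)
    return math.log(len(D) / docs_with_w)        # log(n / docs(w, D))

def tf_idf(query, d, D):
    return sum(tf(w, d) * idf(w, D) for w in query)

print(tf_idf(["cat", "sat"], docs[0], docs))
```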
Word2Vec
- Words: Word2vec, Glove, FastText, etc.
- Sentences: Doc2vec, etc.
There are pretrained models.
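A minimal sketch of training word embeddings on a toy corpus with gensim, assuming the gensim 4.x API; the corpus and hyperparameters are placeholders, and in practice starting from a pretrained model is often preferable.

```python
from gensim.models import Word2Vec

# toy tokenized corpus (placeholder data)
sentences = [["cat", "sat", "on", "mat"],
             ["dog", "sat", "on", "log"],
             ["cat", "chased", "dog"]]

# vector_size, window and min_count are arbitrary choices here
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=50)

vec = model.wv["cat"]                    # embedding of a single word
similar = model.wv.most_similar("cat")   # words with similar embeddings
```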
Feature Extraction from Images
Image -> Vector
- Descriptors
- Train network from scratch
- Finetuning
a. Features can be extracted from different layers of the network.
b. Careful choice of the pretrained network can help.
c. Finetuning allows refining pretrained models for the task at hand.
d. Data augmentation can improve the model.
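A minimal sketch of using a pretrained CNN as a feature extractor and then finetuning it, assuming torchvision >= 0.13 (for the weights argument); ResNet-18 and the 10-class head are placeholder choices.

```python
import torch
import torch.nn as nn
from torchvision import models

# load a pretrained network (ResNet-18 as a placeholder choice)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# feature extraction: drop the final classification layer
feature_extractor = nn.Sequential(*list(model.children())[:-1])
features = feature_extractor(torch.randn(1, 3, 224, 224)).flatten(1)  # 512-dim vector

# finetuning: replace the head for a new task (e.g. 10 classes) and keep training
model.fc = nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```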
Ref
[1] Coursera - How to Win a Data Competition
[2] https://coolshell.cn/articles/8422.html
[3] http://datascience.la/meetup-summary-winning-data-science-competitions/