Feature Extraction

Feature Engineering

Numeric features

  1. Numeric feature preprocessing is different for tree-based and non-tree-based models:
    a. Tree-based models do not depend on scaling
    b. Non-tree-based models hugely depend on scaling
  2. The most often used preprocessing methods are:
    a. MinMaxScaler - to [0,1]
    b. StandardScaler - to mean==0, std==1
    c. Rank - sets spaces between sorted values to be equal
    d. np.log(1+x) and np.sqrt(1+x)
  3. Feature generation is powered by:
    a. Prior knowledge
    b. Exploratory data analysis
import numpy as np
from scipy.stats import rankdata

# Scaling to [0, 1] (sklearn.preprocessing.MinMaxScaler does the same)
X = (X - X.min()) / (X.max() - X.min())

# Scaling to mean = 0, std = 1 (sklearn.preprocessing.StandardScaler does the same)
X = (X - X.mean()) / X.std()

# Outliers: clip to the 1st and 99th percentiles (winsorization)
LOWERBOUND, UPPERBOUND = np.percentile(x, [1, 99])
y = np.clip(x, LOWERBOUND, UPPERBOUND)

# Rank - sets the spaces between sorted values to be equal, e.g.
#   rank([-100, 0, 1e5]) == [0, 1, 2]
#   rank([1000, 1, 10])  == [2, 0, 1]
# (scipy.stats.rankdata returns 1-based ranks)
ranks = rankdata(x)

# Log transform
x_log = np.log(1 + x)

# Raising to a power < 1
x_sqrt = np.sqrt(x + 2/3)

Categorical and ordinal features

  1. Values in ordinal features are sorted in some meaningful order.
  2. Label encoding maps categories to numbers.
  3. Frequency encoding maps categories to their frequencies.
  4. Label and frequency encodings are often used for tree-based models.
  5. One-hot encoding is often used for non-tree-based models.
  6. Interactions of categorical features can help linear models and KNN (see the sketch after the one-hot encoding example below).

Ordinal features

import pandas as pd
from scipy.stats import rankdata
from sklearn.preprocessing import LabelEncoder

# Label encoding - alphabetical (sorted): [S,C,Q] -> [2,0,1]
titanic['Embarked_le'] = LabelEncoder().fit_transform(titanic.Embarked)

# Label encoding - order of appearance: [S,C,Q] -> [0,1,2]
titanic['Embarked_fact'] = pd.factorize(titanic.Embarked)[0]

# Frequency encoding: [S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)
# scipy.stats.rankdata can be applied to the frequencies if ties matter

Categorical features

# One-hot encoding
dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
# alternative: sklearn.preprocessing.OneHotEncoder
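
The interaction idea from item 6 above, as a minimal sketch: concatenate two categorical columns into one feature and one-hot encode the result. The DataFrame and column names here are made-up assumptions, not from the original notes.

import pandas as pd

# Hypothetical data: Pclass and Sex are assumed example columns
df = pd.DataFrame({'Pclass': [3, 1, 3], 'Sex': ['male', 'female', 'female']})

# Interaction of two categorical features: concatenate, then one-hot encode.
# Linear models and KNN cannot construct this combination by themselves,
# so feeding it in explicitly can help them.
df['Pclass_Sex'] = df.Pclass.astype(str) + '_' + df.Sex
interaction_dummies = pd.get_dummies(df.Pclass_Sex, prefix='Pclass_Sex')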

Datetime

  1. Periodicity
    Day number in the week, month, season, or year; second, minute, hour.
  2. Time since
    a. A row-independent moment
    For example: since 00:00:00 UTC, 1 January 1970.
    b. A row-dependent important moment
    For example: number of days left until the next holiday / time passed since the last holiday.
  3. Difference between dates
    datetime_feature_1 - datetime_feature_2
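
A minimal pandas sketch of these three kinds of datetime features; the DataFrame and its column names are assumptions made up for illustration.

import pandas as pd

# Hypothetical data (column names are assumptions)
df = pd.DataFrame({
    'sale_date': pd.to_datetime(['2024-12-23 14:30', '2024-12-26 09:10']),
    'last_holiday': pd.to_datetime(['2024-11-28', '2024-12-25']),
})

# 1. Periodicity: day of week, month, hour, ...
df['weekday'] = df.sale_date.dt.weekday
df['month'] = df.sale_date.dt.month
df['hour'] = df.sale_date.dt.hour

# 2a. Time since a row-independent moment (the Unix epoch)
df['days_since_epoch'] = (df.sale_date - pd.Timestamp('1970-01-01')).dt.days

# 2b. Time since a row-dependent moment / 3. Difference between two dates
df['days_since_holiday'] = (df.sale_date - df.last_holiday).dt.days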

Coordinates

a. Distance to interesting places taken from train/test data or from additional data
b. Distance to centers of clusters
c. Aggregated statistics of surrounding objects
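
A sketch of point b with scikit-learn's KMeans: cluster the coordinates and use the distance to the nearest cluster center as a feature. The lat/lon columns are assumed example names.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical coordinates (column names are assumptions)
df = pd.DataFrame({'lat': [55.75, 55.76, 59.93, 59.95],
                   'lon': [37.62, 37.60, 30.31, 30.35]})
coords = df[['lat', 'lon']].values

# b. Cluster the coordinates and use the distance to the nearest center
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
dists = np.linalg.norm(coords[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
df['dist_to_nearest_center'] = dists.min(axis=1)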

Feature Extraction from Texts

Text -> vector

  1. Bag of words
    a. Very large vectors.
    b. The meaning of each value in the vector is known.
    c. N-grams can help to use local context.
    d. TFiDF can be of use as postprocessing.
  2. Embeddings (~word2vec)
    a. Relatively small vectors.
    b. Values in the vector can be interpreted only in some cases.
    c. Words with similar meaning often have similar embeddings.
    d. Pretrained models are available.

Bag of Words

Pipeline of applying BOW

  1. Preprocessing:
    Lowercasing, stemming, lemmatization, stopword removal.
    Stopwords: sklearn.feature_extraction.text.CountVectorizer (max_df, stop_words)
  2. Bag of words:
    N-grams can help to use local context: sklearn.feature_extraction.text.CountVectorizer (ngram_range, analyzer)
  3. Postprocessing: TFiDF
    Count words: sklearn.feature_extraction.text.CountVectorizer
    TFiDF: sklearn.feature_extraction.text.TfidfVectorizer
    The main idea of the TFiDF model: if a word w appears frequently in a document d but rarely in other documents, then w has good discriminating power and is well suited to separating document d from the rest. The model has two factors:
  1. The term frequency tf of word w in document d, i.e. the ratio of the number of occurrences count(w, d) of w in d to the total number of words size(d) in d:
    tf(w, d) = count(w, d) / size(d)
  2. The inverse document frequency idf of word w over the whole document collection, i.e. the logarithm of the ratio of the total number of documents n to the number of documents docs(w, D) that contain w:
    idf(w) = log(n / docs(w, D))
    From tf and idf, the TFiDF model computes, for each document d and a query q made of keywords w[1]…w[k], a weight that measures how well query q matches document d:
tf-idf(q, d) = sum { i = 1..k | tf-idf(w[i], d) }
             = sum { i = 1..k | tf(w[i], d) * idf(w[i]) }
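
A minimal scikit-learn sketch of the whole pipeline on a made-up toy corpus; the example texts and parameter values are assumptions chosen only for illustration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made-up example texts)
texts = ['the cat sat on the mat',
         'the dog sat on the log',
         'cats and dogs']

# Bag of words with unigrams and bigrams; max_df drops overly frequent terms
bow = CountVectorizer(ngram_range=(1, 2), max_df=0.9)
X_bow = bow.fit_transform(texts)        # sparse matrix: documents x n-grams

# TFiDF postprocessing in one step (counting + TFiDF reweighting)
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9)
X_tfidf = tfidf.fit_transform(texts)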


Word2Vec

  • Words: Word2vec, Glove, FastText, etc.
  • Sentences: Doc2vec, etc.

There are pretrained models.
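
A small sketch with gensim (the library choice is an assumption; the original notes do not name one), training word2vec on a toy tokenized corpus and querying it:

from gensim.models import Word2Vec

# Toy tokenized corpus (made-up sentences)
sentences = [['cat', 'sat', 'on', 'mat'],
             ['dog', 'sat', 'on', 'log'],
             ['cat', 'and', 'dog']]

# vector_size is the embedding dimension (gensim >= 4.0 API)
model = Word2Vec(sentences, vector_size=32, window=3, min_count=1, epochs=50)

vec = model.wv['cat']                            # embedding vector for one word
similar = model.wv.most_similar('cat', topn=2)   # nearest words by cosine similarity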

Feature Extraction from Images

Image -> Vector

  1. Descriptors
  2. Train network from scratch
  3. Finetuning

a. Features can be extracted from different layers of a pretrained network.
b. Careful choice of the pretrained network can help.
c. Finetuning allows refining pretrained models for the task at hand.
d. Data augmentation can improve the model.
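
A sketch of this kind of feature extraction with torchvision (the library and model choice are assumptions; requires torchvision >= 0.13 for the weights argument): drop the classification head of a pretrained ResNet and use the pooled activations as the image vector.

import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained ResNet-18 with its final fully connected layer replaced by identity,
# so the forward pass returns the pooled convolutional features (a 512-d vector).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# 'image.jpg' is a placeholder path
img = preprocess(Image.open('image.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    features = backbone(img)    # tensor of shape (1, 512)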

Ref

[1] Coursera - How to Win a Data Science Competition
[2] https://coolshell.cn/articles/8422.html
[3] http://datascience.la/meetup-summary-winning-data-science-competitions/