Data Cleaning

A Stellar Hiker

2018-07-21

Machine Learning

Data Cleaning and Transformation

Missing and repeated values.
Cleaning outliers and errors.
Categorical to Numeric.
Scaling Data.

Missing and Repeated Values

Missing values and repeated values are common.
Many ML algorithms cannot handle missing values.
Repeated rows bias results by over-weighting the duplicated observations.

Treating Missing Values

If a column uses an invalid placeholder value, mark it as missing with df.loc[df[col]=='?', col] = np.nan, or replace it with another value.
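A minimal sketch of the placeholder replacement, assuming a toy frame in which the column name `horsepower` and its values are made up for illustration. After the placeholder becomes NaN, the column is still stored as strings, so the sketch also converts it to a numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric column read in as strings, with '?' for unknowns.
df = pd.DataFrame({"horsepower": ["130", "?", "150", "?"]})
col = "horsepower"

# Mark the placeholder as missing.
df.loc[df[col] == "?", col] = np.nan

# The remaining entries are still strings; convert them to numbers.
df[col] = pd.to_numeric(df[col])

print(df[col].isna().sum())  # number of missing entries
```

If the data comes from a CSV file, passing na_values='?' to pd.read_csv does the same thing at load time.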
1. The choice of method to fill NaN depends on the situation.
  A sentinel value outside the feature's range: -999, -1, etc.
  Mean or median.
  Reconstruct the value.
  Interpolate values.
  Forward fill.
  Backward fill.
  Impute.
2. A binary “isnull” feature can be beneficial.
3. In general, avoid filling NaNs before feature generation, since filled values can distort features computed from them.
4. XGBoost can handle NaN.
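The fill options listed above can be compared side by side. This is a sketch on a made-up five-value feature named `x`; the column names in `filled` are labels chosen for illustration, not part of any API. The "isnull" indicator is computed first so that the information about which entries were missing survives the fill.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with two gaps.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Binary "isnull" feature, built before any filling.
df["x_isnull"] = df["x"].isna().astype(int)

# Each column applies one fill strategy to the same data.
filled = pd.DataFrame({
    "sentinel": df["x"].fillna(-999),               # value outside the feature's range
    "mean": df["x"].fillna(df["x"].mean()),         # mean of observed values
    "median": df["x"].fillna(df["x"].median()),     # median of observed values
    "interpolate": df["x"].interpolate(),           # linear interpolation between neighbors
    "ffill": df["x"].ffill(),                       # carry the last observed value forward
    "bfill": df["x"].bfill(),                       # carry the next observed value backward
})

print(filled)
```

Sentinel values tend to suit tree-based models, which can split them off, while mean or median fills are usually safer for linear models and neural networks.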
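If a column is mostly empty, drop it: df.drop(drop_list, axis = 1, inplace = True)
Rows with missing values can be dropped: df.dropna(axis=0, inplace = True)
Columns that are clearly uninformative, such as id, can also be dropped.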
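One way to build drop_list is to measure the fraction of missing values per column. The sketch below assumes a 50% cutoff and a toy frame; both the `threshold` value and the column names are illustrative choices, not a rule from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one identifier, one useful column, one mostly empty column.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "a": [1.0, 2.0, np.nan, 4.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

threshold = 0.5  # assumed cutoff: drop columns with more than 50% missing values

# Fraction of missing values in each column.
null_frac = df.isna().mean()
drop_list = null_frac[null_frac > threshold].index.tolist()

# Drop the mostly empty columns and the uninformative id column.
df = df.drop(columns=drop_list + ["id"])

# Drop the remaining rows that still contain a missing value.
df = df.dropna(axis=0)

print(drop_list, df.shape)
```

Reassigning the result instead of using inplace=True keeps each step easy to inspect and undo.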

Treating Repeated Values

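Check for constant data with nunique: traintest.nunique(axis = 1) == 1 flags rows whose columns all hold one value; with axis = 0 it flags constant columns.
Drop duplicate rows: df.drop_duplicates(subset = '', inplace = True), with subset set to the column(s) that define a duplicate.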
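A short sketch of both checks on a toy frame. The column names, and the choice of `user` as the key that defines a duplicate, are assumptions made for the example.

```python
import pandas as pd

# Hypothetical combined train/test frame.
traintest = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3"],
    "city": ["A", "A", "B", "C"],
    "const": [0, 0, 0, 0],
})

# Columns with a single distinct value carry no information.
constant_cols = traintest.columns[traintest.nunique(axis=0) == 1].tolist()

# Count rows that exactly repeat an earlier row.
n_dupes = int(traintest.duplicated().sum())

# Keep the first row for each value of the assumed key column.
deduped = traintest.drop_duplicates(subset=["user"], keep="first")

print(constant_cols, n_dupes, len(deduped))
```

Leaving out subset compares all columns, which only removes rows that are identical in every field.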

Outliers

Visualizing Outliers
Scatter plot matrix helps validate outliers.
pandas.plotting.scatter_matrix (formerly pandas.tools.plotting.scatter_matrix)
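A minimal sketch of a scatter plot matrix, assuming matplotlib is installed and using randomly generated data with one injected outlier. The frame and its column names are made up; the Agg backend line is only needed when running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical data: three roughly normal columns.
rng = np.random.default_rng(0)
frame1 = pd.DataFrame(rng.normal(size=(50, 3)), columns=["Col1", "Col2", "Col3"])

# Inject one obvious outlier so it stands out in every panel involving Col1.
frame1.loc[0, "Col1"] = 15.0

# One scatter panel per pair of columns, histograms on the diagonal.
axes = scatter_matrix(frame1, figsize=(6, 6), diagonal="hist")

# Save the figure; the file name is arbitrary.
axes[0, 0].get_figure().savefig("scatter_matrix.png")
```

A point that sits apart from the cloud in several panels at once is a stronger outlier candidate than one that is extreme in a single column.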

Removing Outliers
frame1 = frame1[(frame1['Col1'] > 40.0) & (frame1['Col2'] < 30.0) & (frame1['Col3'] > 3.0)]
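The row filter can be tried on a toy frame. The values below are invented so that exactly one row satisfies all three conditions; the thresholds are the ones from the line above. Each comparison needs its own parentheses because & binds more tightly than > and <.

```python
import pandas as pd

# Hypothetical data for the three columns used in the filter.
frame1 = pd.DataFrame({
    "Col1": [45.0, 50.0, 10.0, 60.0],
    "Col2": [20.0, 35.0, 25.0, 10.0],
    "Col3": [5.0, 4.0, 6.0, 1.0],
})

# Build the boolean mask once so it can be inspected before filtering.
mask = (frame1["Col1"] > 40.0) & (frame1["Col2"] < 30.0) & (frame1["Col3"] > 3.0)

# Keep only the rows that pass every condition.
cleaned = frame1[mask]

# The same filter written with DataFrame.query.
cleaned_q = frame1.query("Col1 > 40.0 and Col2 < 30.0 and Col3 > 3.0")

print(len(frame1) - len(cleaned), "rows removed")
```

Checking how many rows the mask removes before overwriting frame1 helps catch a threshold that is too aggressive.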

Others

see this.

Ref

[1] edX - Data Science Essentials