Data Cleaning

Data Cleaning and Transformation

  • Missing and repeated values.
  • Cleaning outliers and errors.
  • Categorical to Numeric.
  • Scaling Data.

Missing and Repeated Values

Missing values and repeated values are common in real-world data.
Many ML algorithms cannot handle missing values.
Repeated values bias results by giving duplicated observations extra weight.
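
Before treating either problem, it helps to measure how much of it the data contains. A minimal sketch, assuming a small made-up DataFrame df (the column names and values are illustrative and not from the course):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value per column and one exact duplicate row.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 31, 40],
    "city": ["Oslo", "Lima", "Rome", "Rome", None],
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", n_duplicate_rows)
```

Here df.duplicated() marks only the later copies, so the count is the number of rows that would be removed by deduplication.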

Treating Missing Values

  • If a column contains invalid placeholder values, convert them to NaN with df.loc[df[col] == '?', col] = np.nan, or replace them with another value.
    1. The choice of method to fill NaN depends on the situation. Common options:
      • A sentinel constant such as -999 or -1.
      • The mean or median.
      • A reconstructed value.
      • Interpolated values.
      • Forward fill.
      • Backward fill.
      • Imputation.
    2. A binary “isnull” indicator feature can be beneficial.
    3. In general, avoid filling NaNs before feature generation.
    4. XGBoost can handle NaN natively.
  • If a column is mostly empty, drop it: df.drop(drop_list, axis=1, inplace=True)
  • Rows that contain missing values can be dropped: df.dropna(axis=0, inplace=True)
  • Columns that are clearly uninformative, such as id, can be dropped.
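
The options in this list can be combined in a few lines of pandas. A minimal sketch, assuming a made-up DataFrame in which "?" is the placeholder for missing data; the column names (id, income, temp) are hypothetical, and median filling is just one of the listed choices, not a recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "?" stands in for a placeholder used by the data source.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "income": ["10", "?", "30", "50"],
    "temp": [1.0, np.nan, 3.0, np.nan],
})

# Turn the placeholder into a real NaN, then make the column numeric.
df.loc[df["income"] == "?", "income"] = np.nan
df["income"] = pd.to_numeric(df["income"])

# Record where values were missing before filling anything.
df["income_isnull"] = df["income"].isna()

# Fill with the median of the observed values.
df["income"] = df["income"].fillna(df["income"].median())

# Alternatives for ordered data: interpolation, forward fill, backward fill.
temp_interp = df["temp"].interpolate()
temp_ffill = df["temp"].ffill()
temp_bfill = df["temp"].bfill()

# Drop a column that carries no predictive information.
df = df.drop(columns=["id"])
```

The indicator column is created before the fill, so the information about which rows were originally missing is not lost. Backward fill leaves a trailing NaN untouched, because there is no later value to copy from.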

Treating Repeated Values

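Check for rows that hold a single distinct value across all columns: traintest.nunique(axis=1) == 1
Drop duplicate rows (subset names the columns to compare): df.drop_duplicates(subset = '', inplace = True)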
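
Both checks can be tried on a toy frame. A minimal sketch, assuming made-up columns a, b, and c (hypothetical names); it returns new frames instead of using inplace=True so the original stays available for comparison:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical; row 3 holds one value everywhere.
df = pd.DataFrame({
    "a": [1, 2, 1, 7],
    "b": [5, 6, 5, 7],
    "c": [9, 8, 9, 7],
})

# Boolean mask of rows whose columns all hold the same single value.
constant_rows = df.nunique(axis=1) == 1

# Keep the first occurrence of each repeated row, comparing all columns.
deduped = df.drop_duplicates()

# Compare only selected columns by naming them in `subset`.
deduped_on_a = df.drop_duplicates(subset=["a"])
```

With subset given, two rows count as duplicates as soon as they agree on the named columns, even if they differ elsewhere.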

Outliers

Visualizing Outliers
A scatter plot matrix helps identify and validate outliers.
pandas.plotting.scatter_matrix (formerly pandas.tools.plotting.scatter_matrix)
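
A minimal sketch of such a plot, assuming matplotlib is installed and using randomly generated data with one planted outlier (the frame, its column names, and the output file name are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical data: 50 rows of standard-normal noise in three columns.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(50, 3)), columns=["Col1", "Col2", "Col3"])
frame.loc[0, "Col1"] = 15.0  # plant one obvious outlier

# One scatter plot per column pair, histograms on the diagonal.
axes = scatter_matrix(frame, alpha=0.6, figsize=(6, 6), diagonal="hist")

# Save the figure to a file (hypothetical name) for inspection.
axes[0, 0].figure.savefig("scatter_matrix.png")
```

The planted point should appear isolated in every panel that involves Col1, which is the kind of visual check the matrix is meant to support.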

Removing Outliers
frame1 = frame1[(frame1['Col1'] > 40.0) & (frame1['Col2'] < 30.0) & (frame1['Col3'] > 3.0)]
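
The filter keeps only rows that satisfy all three conditions. A minimal runnable sketch with made-up values for Col1, Col2, and Col3 (the thresholds come from the line above; the data is hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the first row passes all three conditions.
frame1 = pd.DataFrame({
    "Col1": [45.0, 38.0, 50.0, 60.0],
    "Col2": [20.0, 10.0, 35.0, 25.0],
    "Col3": [5.0, 4.0, 6.0, 2.0],
})

# Each comparison needs its own parentheses because & binds tighter than > and <.
keep = (frame1["Col1"] > 40.0) & (frame1["Col2"] < 30.0) & (frame1["Col3"] > 3.0)

# Apply the mask and renumber the surviving rows.
frame1 = frame1[keep].reset_index(drop=True)
```

Building the mask as a separate variable makes it easy to count how many rows will be removed before committing to the filter.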

Ref

[1] edX - Data Science Essentials