Data Cleaning
Data Cleaning and Transformation
- Missing and repeated values.
- Cleaning outliers and errors.
- Categorical to Numeric.
- Scaling Data.
Missing and Repeated Values
Missing and repeated values are common in real datasets.
Many ML algorithms cannot handle missing values.
Repeated values bias results by giving duplicated records extra weight.
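Before choosing a treatment, it helps to measure how much of each problem is present. The following is a minimal sketch, assuming a small made-up DataFrame; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: one missing age, one fully repeated row.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": [34.0, 27.0, 27.0, np.nan],
    "city": ["Oslo", "Lima", "Lima", "Pune"],
})

# Number of missing values per column.
missing_per_column = df.isna().sum()

# Fraction of missing values per column, useful for spotting mostly-empty columns.
missing_fraction = df.isna().mean()

# Number of rows that exactly repeat an earlier row.
n_duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print(missing_fraction)
print(n_duplicate_rows)
```

Counting first keeps the later choice (fill, drop, or deduplicate) tied to how much data each option would affect.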
Treating Missing Values
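- If a column contains an unsuitable placeholder value (here '?'), convert it to NaN with
df.loc[df[col]=='?', col] = np.nan
or replace it with another value.
- The choice of method to fill NaN depends on the situation. Options include:
  - A sentinel value: -999, -1, etc.
  - Mean or median.
  - Reconstruct the value.
  - Interpolate values.
  - Forward fill.
  - Backward fill.
  - Impute.
- A binary feature “isnull” can be beneficial.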
- In general, avoid filling nans before feature generation.
- XGBoost can handle NaN natively.
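- Columns that are mostly empty can simply be dropped.
df.drop(drop_list, axis = 1, inplace = True)
- Rows that contain missing values can be dropped.
df.dropna(axis=0, inplace = True)
- Obviously uninformative columns, such as id, can also be dropped.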
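The treatments listed above can be sketched end to end on a toy frame. This is an illustration under stated assumptions, not a recommended recipe: the data, the column names, and the 0.5 "mostly empty" threshold are all invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; "?" stands in for an unknown value.
df = pd.DataFrame({
    "temp": ["20.5", "?", "22.5", "?", "23.0"],
    "notes": [np.nan, np.nan, np.nan, "ok", np.nan],
})

# 1. Turn the placeholder into NaN, then make the column numeric.
df.loc[df["temp"] == "?", "temp"] = np.nan
df["temp"] = pd.to_numeric(df["temp"])

# 2. Record where values were missing before any filling happens.
df["temp_isnull"] = df["temp"].isna().astype(int)

# 3. A few alternative fills, each kept in its own column for comparison.
df["temp_median"] = df["temp"].fillna(df["temp"].median())
df["temp_ffill"] = df["temp"].ffill()
df["temp_bfill"] = df["temp"].bfill()
df["temp_interp"] = df["temp"].interpolate()

# 4. Drop columns that are mostly empty (the 0.5 threshold is an assumption).
drop_list = [c for c in df.columns if df[c].isna().mean() > 0.5]
df = df.drop(drop_list, axis=1)

print(df)
```

The indicator column is built before any fill so that the information about which values were originally missing is not lost. Forward and backward fill, like interpolation, only make sense when row order is meaningful, for example in a time series.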
Treating Repeated Values
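Check whether any row holds the same value in every column:
traintest.nunique(axis = 1) == 1
Drop rows that repeat in the chosen column(s); omit subset to compare all columns:
df.drop_duplicates(subset = [col], inplace = True)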
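A small sketch of both checks, assuming a made-up frame in which `user_id` is the column that should be unique. The frame, its column names, and the choice of `keep="first"` are illustrative assumptions.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "user_id": [10, 11, 11, 7],
    "q1": [3, 5, 4, 7],
    "q2": [4, 5, 5, 7],
})

# Rows that hold a single distinct value across all columns.
constant_rows = df.nunique(axis=1) == 1

# Rows whose user_id already appeared in an earlier row.
repeated_ids = df.duplicated(subset=["user_id"])

# Keep the first row for each user_id; the original frame is left untouched.
deduped = df.drop_duplicates(subset=["user_id"], keep="first")

print(constant_rows.tolist())
print(repeated_ids.tolist())
print(deduped)
```

Returning a new frame instead of using `inplace = True` makes it easy to compare row counts before and after deduplication.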
Outliers
Visualizing Outliers
Scatter plot matrix helps validate outliers.
pandas.plotting.scatter_matrix
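A minimal sketch of a scatter plot matrix, assuming synthetic data with one planted outlier. The column names, the distributions, and the `Agg` backend (chosen so the code runs without a display) are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 8, 100),
    "weight": rng.normal(70, 10, 100),
    "age": rng.normal(40, 12, 100),
})
df.loc[0, "weight"] = 250.0  # planted outlier

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist")
```

In the resulting grid, the planted point should sit far from the main cloud in every panel that involves `weight`, which is the kind of visual confirmation the matrix is meant to give.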
Removing Outliers
frame1 = frame1[(frame1['Col1'] > 40.0) & (frame1['Col2'] < 30.0) & (frame1['Col3'] > 3.0)]
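The same threshold filter can be sketched on a toy frame, building the boolean mask as a separate step. The data values are invented for illustration; the thresholds mirror the filter above.

```python
import pandas as pd

# Toy data, invented for illustration.
frame1 = pd.DataFrame({
    "Col1": [45.0, 38.0, 60.0, 50.0],
    "Col2": [10.0, 12.0, 35.0, 20.0],
    "Col3": [5.0, 4.0, 6.0, 3.5],
})

# Build the mask once so it can be inspected before rows are discarded.
keep = (frame1["Col1"] > 40.0) & (frame1["Col2"] < 30.0) & (frame1["Col3"] > 3.0)
n_removed = int((~keep).sum())

# Keep only the rows that satisfy every condition.
frame1 = frame1[keep]

print(n_removed)
print(frame1)
```

Each comparison needs its own parentheses because `&` binds more tightly than `>` and `<` in Python. Counting the removed rows first is a cheap check that the thresholds are not discarding more data than intended.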
Others
see this.
Ref
[1] edX - Data Science Essentials