Feature Engineering
You can never turn sand into gold, but you can turn sand into glass. The same goes for data. You can never turn bad data into good data, but you can turn bad data into something useful. Feature engineering is the process of transforming raw data into features that can be used by machine learning models. It is a crucial step in the machine learning pipeline and can have a significant impact on the performance of the model.
Normalization
Normalization removes the scale of the features. It is typically used when features have very different ranges and the model is sensitive to feature scale. The most common techniques are min-max normalization and z-score normalization.
Min-max normalization:
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
Z-score normalization (where \mu and \sigma are the mean and standard deviation of the feature):
x' = \frac{x - \mu}{\sigma}
During gradient descent, normalization helps the model converge faster and avoid getting stuck in poor local minima, and it can also help the model generalize better to unseen data. It mainly matters for models that are sensitive to feature scale, such as linear regression, logistic regression, and SVMs. For tree-based models such as decision trees and random forests, normalization is not necessary: they are insensitive to feature scale because node splitting relies on the information gain (ratio) of the dataset D with respect to feature x, which does not change under monotonic rescaling of x.
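As a minimal sketch (the feature values below are made up for illustration), both normalizations can be computed directly with NumPy; in practice, scikit-learn's MinMaxScaler and StandardScaler do the same thing and should be fitted on the training set only.
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0])             # hypothetical feature column
x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max: values mapped into [0, 1]
x_zscore = (x - x.mean()) / x.std()              # z-score: zero mean, unit variance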
Categorical Features
Categorical features are features that can take on a limited number of values. They can be nominal (no order) or ordinal (with an order), for example gender or education level. Categorical features need to be encoded into numerical values before they can be used by most machine learning models. Common encoding techniques include:
- One-hot Encoding: One-hot encoding is used when the categorical feature is nominal. It creates a new binary feature for each category and assigns a value of 1 to the feature corresponding to the category and 0 to all others. For example, male, female, and non-binary can be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively. One-hot encoding can produce a very high-dimensional feature space when the categorical feature has many categories, which invites the curse of dimensionality and may require dimensionality reduction techniques. To store such vectors we can use sparse formats, such as sparse matrices, to save memory and computation:
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero dense vector
dense = np.array([0, 0, 3.2, 0, 0, 5.1, 0])
# CSR format keeps only the non-zero values and their positions
sparse = csr_matrix(dense)
Equivalently, each one-hot vector can be stored as a single (position, value) pair for its non-zero entry:
[1,0,0] -> (1, 1)
[0,1,0] -> (2, 1)
[0,0,1] -> (3, 1)
- Ordinal Encoding: Ordinal encoding is used when the categorical feature is ordinal. It assigns a unique integer value to each category based on the order of the categories. For example, low, medium, high can be encoded as 1, 2, and 3 respectively.
- Binary Encoding: Binary encoding is a combination of one-hot encoding and ordinal encoding. It converts the integer values from ordinal encoding into binary code and creates a new binary feature for each bit. For example, with 4 categories we can encode them as 00, 01, 10, and 11. Binary encoding reduces the dimensionality of the feature space compared to one-hot encoding while still capturing the category information. There are also other encoding techniques, such as target encoding, frequency encoding, Helmert contrast, sum contrast, polynomial contrast, and backward difference contrast. The choice of encoding technique depends on the specific use case and the type of categorical feature being encoded; it is important to consider its impact on model performance and to experiment with different techniques for your problem (a small sketch of the basic encodings follows below).
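A minimal sketch of one-hot, ordinal, and binary encoding with pandas; the column names and the category order are assumptions for illustration. Libraries such as category_encoders also ship ready-made BinaryEncoder and TargetEncoder implementations.
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "non-binary"],   # nominal feature
    "education": ["low", "high", "medium"],       # ordinal feature
})

# One-hot encoding for the nominal feature
one_hot = pd.get_dummies(df["gender"], prefix="gender")

# Ordinal encoding: map categories to integers that respect their order
order = {"low": 1, "medium": 2, "high": 3}
df["education_ordinal"] = df["education"].map(order)

# Binary encoding: write the ordinal integer in binary and split the bits into columns
codes = df["education_ordinal"].to_numpy()
n_bits = 2  # two bits are enough for the three categories used here
for bit in range(n_bits):
    df[f"education_bin_{bit}"] = (codes >> bit) & 1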
High-dimensional Feature Processing
Combining features: This involves creating new features by combining existing ones. For example, from the two features “age” and “income” you can create a new feature “age_income” by multiplying them together, which can capture interactions between the features that matter for the model. But if the two features being combined have high dimensionality x and y (e.g. two categorical features with many distinct values each), the combined feature has on the order of x · y possible values, which leads to the curse of dimensionality and makes it difficult for the model to learn and generalize. In that case it may be necessary to use dimensionality reduction techniques to reduce the representations of x and y to m and n dimensions, with m, n ≪ x, y.
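A minimal sketch of both cases with pandas (all column names and values here are made up): a numeric interaction feature and a naive cross of two categorical features. scikit-learn's PolynomialFeatures(interaction_only=True) can generate such numeric interaction terms automatically.
import pandas as pd

# Numeric interaction feature
df = pd.DataFrame({"age": [25, 40, 60], "income": [30000, 80000, 50000]})
df["age_income"] = df["age"] * df["income"]

# Naive cross of two high-cardinality categorical features: one value per (x, y) pair,
# so the number of possible values grows as x * y
df2 = pd.DataFrame({"user_id": ["u1", "u2", "u1"], "item_id": ["i9", "i9", "i3"]})
df2["user_item"] = df2["user_id"] + "_" + df2["item_id"]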
How to find combinatorial features
- Domain Knowledge: Use your understanding of the problem domain to identify potential interactions between features. For example, in a housing price prediction model, you might consider the interaction between “number of bedrooms” and “square footage” as a potential combinatorial feature.
- Decision Tree: Train a decision tree (or a gradient boosted ensemble of trees) on the raw features; each path from the root to a leaf passes through a sequence of feature splits, so every root-to-leaf path can be read off as a candidate feature combination, and the leaf a sample falls into can itself be used as a new combined feature (see the sketch below).
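A minimal sketch of that idea on assumed synthetic data, in the spirit of the GBDT + linear-model pipeline: fit a small gradient boosted tree ensemble, then use the leaf index each sample reaches in every tree as a learned combinatorial feature.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.random((200, 4))                          # synthetic raw features
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)        # target driven by an interaction

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=3).fit(X, y)

# apply() returns the leaf index each sample reaches in every tree; each leaf
# corresponds to one root-to-leaf path, i.e. one learned feature combination
leaves = gbdt.apply(X)[:, :, 0]                   # shape: (n_samples, n_trees)

# One-hot encode the leaf indices so a downstream linear model can consume them
combo_features = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)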