The skill of feature engineering — crafting data features optimized for machine learning — is as old as data science itself. But I’ve noticed it is becoming increasingly neglected. The high demand for machine learning has produced a large pool of data scientists who have developed expertise in tools and algorithms but lack the experience and industry-specific domain knowledge that feature engineering requires, and they try to compensate with better tools and algorithms. However, algorithms are now a commodity and don’t generate corporate IP.
Generic data is becoming commoditized, and cloud-based machine-learning-as-a-service (MLaaS) platforms like Amazon ML and Google AutoML now make it possible for even less experienced team members to run data models and get predictions within minutes. As a result, power is shifting to companies that develop an organizational competency in collecting or manufacturing proprietary data — enabled by feature engineering. Simple data acquisition and model building are no longer enough.
Corporate teams can learn a lot from the winners of modeling competitions such as the KDD Cup and the Heritage Provider Network Health Prize, who have credited feature engineering as a key element of their success.
Feature engineering techniques
To power feature engineering, data scientists have developed a range of techniques. They can be broadly viewed as:
Contextual transformation. One set of methods involves transforming the individual features from the original set into more contextually meaningful information for each specific model.
For example, when dealing with a categorical feature, ‘unknown’ might communicate special information in the context of a specific situation. Inside the model, however, it looks like just another category value. In this case a team might want to introduce a new binary feature, such as ‘has_value’, to separate ‘unknown’ from all other options. For a ‘color’ feature, a companion ‘has_color’ flag would be false for records whose color is unknown and true otherwise.
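As a minimal sketch in pandas (the ‘color’ column and its ‘unknown’ sentinel are the hypothetical example from above), the indicator takes one line:

```python
import pandas as pd

# Toy records where 'unknown' hides inside the ordinary category values
df = pd.DataFrame({"color": ["red", "unknown", "green", "blue", "unknown"]})

# Binary flag that lets the model treat 'unknown' as its own signal
df["has_color"] = (df["color"] != "unknown").astype(int)
print(df)
```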
Another approach is to turn a categorical feature into a set of variables using one-hot encoding. In the above example, turning the ‘color’ category into three features (one each for ‘red’, ‘green’, and ‘blue’) may allow for a better learning process depending on the goals of the model.
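Continuing the hypothetical ‘color’ example, pandas can produce the one-hot columns directly:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category value: color_blue, color_green, color_red
df = df.join(pd.get_dummies(df["color"], prefix="color"))
print(df)
```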
Machine learning teams also frequently use binning as a method of transforming a single feature into multiple features for better insight. For example, splitting an ‘age’ feature into ‘young’ for < 40, ‘middle_age’ for 40-60, and ‘old’ for > 60.
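A sketch of that binning with pandas.cut, assuming integer ages (the bin edges mirror the thresholds above):

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 45, 67, 35, 58]})

# Bin integer ages into the three bands from the example:
# (0, 39] -> young (< 40), (39, 60] -> middle_age (40-60), (60, 120] -> old (> 60)
df["age_group"] = pd.cut(df["age"], bins=[0, 39, 60, 120],
                         labels=["young", "middle_age", "old"])
print(df)
```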
Some other examples of transformations, both sketched in code after the list, are:
- Scaling a variable’s values (such as age) into the [0, 1] range using its min and max
- Dividing the number of visits to each type of restaurant by total visits as an indicator of ‘interest’ in each cuisine
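Both transformations are one-liners in pandas; the column names here (‘italian_visits’, ‘total_visits’) are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical columns for illustration
df = pd.DataFrame({
    "age": [23, 45, 67, 35],
    "italian_visits": [4, 0, 2, 6],
    "total_visits": [10, 5, 8, 12],
})

# Min-max scale 'age' into [0, 1]
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Share of visits to one cuisine as an 'interest' signal
df["italian_interest"] = df["italian_visits"] / df["total_visits"]
print(df)
```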
Multi-feature arithmetic. Another approach to feature engineering applies arithmetic formulas to a set of existing data points. The formulas can create derivatives based on interactions between features, ratios, and other relationships.
This type of feature engineering can deliver high value but requires a solid understanding of the subject matter and the goals of the model.
Examples, two of which are sketched in code after the list, include using formulas to:
- Calculate ‘neighborhood quality’ from a combination of ‘school rating’ and ‘crime rate’
- Determine a ‘casino luck factor’ by comparing visitor ‘actual spending’ with ‘expected spending’
- Produce a ‘utilization rate’ by dividing credit card ‘balance’ by ‘limit’
- Derive an RFM score (Recency, Frequency, Monetary) to segment customers from a combination of ‘most recent transaction,’ ‘transaction frequency,’ and ‘amount spent’ during a particular timeframe.
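Here are the utilization-rate and RFM examples in code. All column names are hypothetical, and the RFM formula is a deliberately simplified rank-and-sum variant rather than a standard scoring scheme:

```python
import pandas as pd

# Hypothetical customer records
df = pd.DataFrame({
    "balance": [1200.0, 300.0, 4500.0],
    "limit": [5000.0, 2000.0, 6000.0],
    "days_since_last_txn": [5, 40, 90],
    "txn_count_90d": [30, 8, 2],
    "amount_spent_90d": [2500.0, 400.0, 150.0],
})

# Ratio feature: credit card utilization rate
df["utilization_rate"] = df["balance"] / df["limit"]

# Simplified RFM score: rank each dimension and sum (higher = better customer);
# recency is ranked descending because fewer days since last transaction is better
df["rfm_score"] = (
    df["days_since_last_txn"].rank(ascending=False)
    + df["txn_count_90d"].rank()
    + df["amount_spent_90d"].rank()
)
print(df[["utilization_rate", "rfm_score"]])
```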
Advanced techniques. Teams may also choose more advanced algorithmic methods that analyze existing data to find opportunities for creating new features (a PCA sketch follows the list).
- Principal component analysis (PCA) and independent component analysis (ICA) map existing data to another feature space
- Deep feature synthesis (DFS) automatically generates new features from relational datasets by stacking simple transformation and aggregation primitives across related tables
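A minimal PCA sketch with scikit-learn, where random data stands in for a real feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a real feature matrix: 100 samples, 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Map the data onto its top 3 principal components and use them as new features
pca = PCA(n_components=3)
X_new = pca.fit_transform(X)

print(X_new.shape)                    # (100, 3)
print(pca.explained_variance_ratio_)  # variance captured per component
```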
Setting a framework for success
Teams must continuously look for more effective features and models. However, to be successful, this work must be done inside a methodical and repeatable framework. Here are the six critical steps for any feature engineering effort:
1. Clarify model usage. Start by clarifying the primary objectives and use cases of the model. The entire team must be in sync and working toward a single purpose. Otherwise, you’ll dilute efforts and waste resources.
2. Set the criteria. The process of building a high performing model requires careful exploration and analysis of available data. But the work plan also needs to accommodate real world barriers. Consider factors such as cost, accessibility, computational limits, storage constraints, and other requirements during featurization. The team must align on such preferences or limitations early.
3. Ideate new features. Think broadly about ways to create new data to better describe and solve the problem. Domain knowledge and involvement of subject matter experts at this point will ensure the results of your feature engineering add value.
4. Construct features as inputs. Once you’ve identified new feature concepts, select the most effective techniques to construct them from the data available. Picking the right technique is key to ensuring the usefulness of the new features.
5. Study the impact. Assess the impact of new features on model performance. Conclusions about the value added by new features depend directly on how the efficacy of the model is measured.
Model performance measurement must relate to business metrics in order to be meaningful. Today, teams have a vast set of measurement options that go well beyond accuracy, such as precision, recall, F1 score, and the receiver operating characteristic (ROC) curve.
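One way to run this comparison is to cross-validate the same model with and without the candidate feature on a business-relevant metric such as ROC AUC. This sketch uses scikit-learn and synthetic data, with the last column standing in for a newly engineered feature:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in; in practice X would be your real feature matrix
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_base = X[:, :7]  # pretend the last column is the engineered feature
model = LogisticRegression(max_iter=1000)

for name, features in [("baseline", X_base), ("with new feature", X)]:
    auc = cross_val_score(model, features, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```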
6. Refine the features. Feature engineering is an iterative process involving testing, adjusting, and refining new features. The optimization loop sometimes results in the removal of low-performing features, or their replacement with close variants, until the highest-impact features are identified.
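One concrete way to drive that loop is permutation importance: shuffle each feature on held-out data and see how much performance drops. A sketch with scikit-learn follows; the 0.001 cutoff is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Score each feature by how much shuffling it hurts held-out performance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Low scorers are candidates for removal or replacement with close variants
drop_candidates = [i for i, imp in enumerate(result.importances_mean) if imp < 0.001]
print("candidate features to drop or rework:", drop_candidates)
```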
The takeaway
Feature engineering is the new alchemy of the modern data world, with successful teams turning generic data into value-added intellectual property for their organizations.
Several important principles help drive success in this work:
- Include subject matter expertise to ensure programs start with a clear understanding of business objectives and related measures of model effectiveness
- Work through an iterative and systematic process
- Consider the many possible featurization options available
- Understand and monitor how the choice of features affects model performance
This ability to turn data into proprietary features that drive meaningful models can create significant value and ensure an organization’s competitive edge.