
What is a Feature Vector?

Feature vectors represent the features used by machine learning models as ordered lists of numerical values. Because machine learning models can only work with numerical values, converting raw features into feature vectors is a crucial step. Here, we discuss feature vectors in various use cases and explain the difficulties in generating and managing them.

In this post, you will learn about:

  • The definition of a feature vector
  • Examples of feature vectors in computer vision, text classification, and recommendation systems
  • Feature vectors in exploratory data analysis and feature engineering
  • Building feature vectors with the Iguazio feature store

Definition of Feature Vector

A feature vector is an ordered list of numerical properties of observed phenomena. It represents input features to a machine learning model that makes a prediction.

Humans can analyze qualitative data to make decisions. For example, we see a cloudy sky, feel the damp breeze, and decide to take an umbrella before going outside. Our five senses transform outside stimuli into neural activity in our brains, handling multiple inputs as they occur, in no particular order.

However, machine learning models can only deal with quantitative data. As such, we must convert the features of observed phenomena into numerical values and feed them into a machine learning model in a consistent order. In short, we must represent features as feature vectors.
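The umbrella example above can be sketched as a tiny encoding function. The feature names and normalization choices here are illustrative assumptions, not a standard:

```python
# A minimal sketch: encoding the qualitative weather observations above
# as an ordered list of numbers. Feature names and scaling are illustrative.

def to_feature_vector(cloud_cover_pct, humidity_pct, is_windy):
    """Convert qualitative observations into an ordered list of numbers."""
    return [
        cloud_cover_pct / 100.0,   # normalize percentage to [0, 1]
        humidity_pct / 100.0,      # normalize percentage to [0, 1]
        1.0 if is_windy else 0.0,  # boolean encoded as 0/1
    ]

vector = to_feature_vector(cloud_cover_pct=90, humidity_pct=80, is_windy=True)
print(vector)  # [0.9, 0.8, 1.0]
```

The model never sees "cloudy" or "damp"; it only sees this ordered list of numbers, and the order must stay the same for every example.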

Examples of Feature Vectors

There are different types of features and techniques that are useful for building a feature vector, including: 

Computer Vision

  • We often use image pixels represented in RGB (red, green, blue) format. Each pixel is a three-dimensional vector whose channels range from 0 to 255 in 8-bit encoding.
  • For semantic segmentation problems, we can encode each class (class 1, class 2, class 3, and so on) into its own channel.
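As a minimal sketch of the first bullet, here is a toy 2×2 RGB image flattened into a single feature vector; the pixel values are made up for illustration:

```python
# Illustrative sketch: a tiny 2x2 RGB "image" flattened into one feature vector.
image = [
    [(255, 0, 0), (0, 255, 0)],      # row 0: red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],  # row 1: blue pixel, white pixel
]

# Flatten row-major: each pixel contributes three values (R, G, B),
# scaled from 0-255 into [0, 1] as models commonly expect.
feature_vector = [
    channel / 255.0
    for row in image
    for pixel in row
    for channel in pixel
]
print(len(feature_vector))  # 12 values: 2 x 2 pixels x 3 channels
```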

Feature Vectors for Text Classification

  • A bag-of-words model represents a document in a vector format where each element holds the number of occurrences of a particular word. Although each index in the vector corresponds to a word, a machine learning model simply sees a list of numerical values to make a prediction.
  • Tf-idf (term frequency-inverse document frequency) measures the importance of each word in a document. The calculation multiplies how often a word occurs in a document (term frequency) by the inverse of how many documents contain that word (inverse document frequency). If one document uses a particular word very often, but other documents do not, then the word must be important in that document.
  • One-hot encoding represents each word uniquely as a vector with zeros everywhere except at one index, where the value is one. In contrast, the word2vec (word-to-vector) format uses a distributed representation, meaning that the elements in a vector are mostly nonzero. This uses much less memory than one-hot encoding and even allows linear algebra operations to measure the similarity of words. This type of word vector is generally called a word embedding vector.
  • The use of word embedding vectors is prevalent today, as they can represent many words in a natural language concisely and yet convey the semantics and contexts very well. As we can perform matrix operations on them, they are suitable for deep-learning-based language models.

Recommendation System

  • A vector can represent many properties of users’ purchase activity patterns, such as time of purchase, product category, price, store ID, age, and so on.
  • Recommendation systems perform matrix operations on a large amount of customer data, represented in feature vectors.
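A hedged sketch of the first bullet: one hypothetical purchase event encoded into a feature vector, with an assumed category list and a one-hot encoding for the categorical field:

```python
# Hypothetical sketch: turning one purchase event into a feature vector.
# The category list and field names are assumptions for illustration.
CATEGORIES = ["electronics", "grocery", "clothing"]

def purchase_to_vector(hour_of_day, category, price, age):
    # Categorical feature becomes a one-hot sub-vector; numeric
    # features are passed through (optionally scaled).
    one_hot = [1.0 if category == c else 0.0 for c in CATEGORIES]
    return [hour_of_day / 23.0, price, float(age)] + one_hot

vec = purchase_to_vector(hour_of_day=14, category="grocery", price=12.5, age=34)
print(vec)
```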

Feature Vectors in Exploratory Data Analysis

In exploratory data analysis, researchers try to discover features from raw data. They may start with qualitative research, looking at visualizations and applying their domain expertise to deduce an idea that can transform the observation into feature vectors. For example, a feature vector in data mining represents a hidden pattern in large data sets, such as equity trading buy/sell signals from the historical trading price and volume data.

In the field of natural language processing, the process of splitting sentences into distinct entities is known as tokenization. For instance, researchers could treat each word or phoneme as a unique token to generate feature vectors for further analysis and experiments.
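A minimal sketch of word-level tokenization follows; real NLP pipelines typically use more sophisticated tokenizers (subword- or phoneme-level), so this regex-based splitter is only illustrative:

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens: lowercase, then split on
    runs of non-alphanumeric characters, dropping empty strings."""
    return [t for t in re.split(r"[^a-z0-9]+", sentence.lower()) if t]

print(tokenize("Feature vectors, tokenized!"))  # ['feature', 'vectors', 'tokenized']
```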

In computer vision, the RGB color scheme isn’t the only way to represent image pixels. For example, there are also HSL (hue, saturation, lightness) and HSV (hue, saturation, value). Sometimes, practitioners even use a monochrome scheme to reduce noise originating from color images.
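Converting an RGB pixel to a single monochrome value can be sketched with the common ITU-R BT.601 luma weights; this is one standard weighting among several:

```python
def rgb_to_gray(r, g, b):
    """Weighted average of RGB channels using ITU-R BT.601 luma weights;
    green is weighted most heavily because the eye is most sensitive to it."""
    return 0.299 * r + 0.587 * g + 0.114 * b

print(rgb_to_gray(255, 255, 255))  # approximately 255.0 for a white pixel
```

Using a single channel per pixel instead of three shrinks the feature vector to a third of its size.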

Ultimately, researchers explore different feature vectors to evaluate the performance of their predictive models. Once the feature design is ready, they can move on to the next stage.

Feature Vectors in Feature Engineering

Feature engineering is, in large part, the systematic process of generating feature vectors from raw data. There are, however, some obstacles to setting up such a process. First, we need a place to store generated feature vectors for later retrieval. We also need to update feature definitions from time to time to accommodate changes in the underlying dynamics or new discoveries.

In other words, we must keep features up to date, as they change over time. However, applications cannot jump from an old feature definition to a new one overnight, so we also need to keep track of multiple versions of feature definitions. This complicates the management of feature vectors. Moreover, various teams need to share feature vectors, even though they are in different AI product development stages.
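To illustrate the versioning problem, here is a hypothetical in-memory registry that keeps multiple versions of a feature definition side by side; the names and transforms are made up, and a production feature store handles storage, serving, and migration far more robustly:

```python
# Hypothetical sketch: tracking multiple versions of a feature definition
# so that applications can migrate from an old definition gradually.
feature_registry = {}

def register_feature(name, version, transform):
    feature_registry[(name, version)] = transform

def compute_feature(name, version, raw_value):
    return feature_registry[(name, version)](raw_value)

# v1 stores price in integer cents; v2 scales price into [0, 1]
# assuming an illustrative $1000 cap. Both stay available at once.
register_feature("price", "v1", lambda dollars: int(dollars * 100))
register_feature("price", "v2", lambda dollars: dollars / 1000.0)

print(compute_feature("price", "v1", 12.5))  # 1250
```

Because both versions are registered, older applications keep reading "v1" while newer ones adopt "v2", which is exactly the bookkeeping burden described above.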

Building Feature Vectors with Iguazio

The Iguazio MLOps platform’s feature store simplifies feature engineering for both batch and real-time processing. It stores and monitors feature vectors, making them available with versions and accessible through various API calls. This ensures easy management and transitions for applications to new feature definitions over time.

Learn more: