Instead of comparing manually combined feature data, you can reduce the feature data to representations called embeddings, and then compare the embeddings. Embeddings are generated by training a supervised deep neural network (DNN) on the feature data itself. The embeddings map the feature data to a vector in an embedding space. Typically, the embedding space has fewer dimensions than the feature data, and the mapping captures some latent structure of the feature data set. The embedding vectors for similar examples, such as YouTube videos watched by the same users, end up close together in the embedding space. We will see how the similarity measure uses this "closeness" to quantify the similarity for pairs of examples.
Remember, we’re discussing supervised learning only to create our similarity measure. The similarity measure, whether manual or supervised, is then used by an algorithm to perform unsupervised clustering.
Comparison of Manual and Supervised Measures
This table describes when to use a manual or supervised similarity measure depending on your requirements.
| Requirement | Manual | Supervised |
|---|---|---|
| Eliminate redundant information in correlated features. | No, you need to separately investigate correlations between features. | Yes, DNN eliminates redundant information. |
| Provide insight into calculated similarities. | Yes | No, embeddings cannot be deciphered. |
| Suitable for small datasets with few features. | Yes, designing a manual measure with a few features is easy. | No, small datasets do not provide enough training data for a DNN. |
| Suitable for large datasets with many features. | No, manually eliminating redundant information from multiple features and then combining them is very difficult. | Yes, the DNN automatically eliminates redundant information and combines features. |
Process for Supervised Similarity Measure
The following figure shows how to create a supervised similarity measure:
You've already learned the first step. This page discusses the next step, and the following pages discuss the remaining steps.
Choose DNN Based on Training Labels
Reduce your feature data to embeddings by training a DNN that uses the same feature data both as input and as the labels. For example, in the case of house data, the DNN would use the features—such as price, size, and postal code—to predict those features themselves. In order to use the feature data to predict the same feature data, the DNN is forced to reduce the input feature data to embeddings. You use these embeddings to calculate similarity.
A DNN that learns embeddings of input data by predicting the input data itself is called an autoencoder. Because an autoencoder’s hidden layers are smaller than the input and output layers, the autoencoder is forced to learn a compressed representation of the input feature data. Once the DNN is trained, you extract the embeddings from the last hidden layer to calculate similarity.
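As a rough illustration of the idea, here is a minimal sketch of an autoencoder written in plain Python. It is not a production implementation: it assumes a purely linear network with a single hidden layer (so the bottleneck is also the last hidden layer), no biases, and three made-up, already-scaled numeric features. The names `encode`, `decode`, `mean_loss`, and `train_autoencoder` are hypothetical helpers introduced for this example.

```python
import random

def encode(x, w_enc):
    # Hidden (bottleneck) layer: one value per row of w_enc. This is the embedding.
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]

def decode(h, w_dec):
    # Output layer: one reconstructed value per input feature.
    return [sum(w * hj for w, hj in zip(row, h)) for row in w_dec]

def mean_loss(data, w_enc, w_dec):
    # Average squared reconstruction error over all examples.
    total = 0.0
    for x in data:
        y = decode(encode(x, w_enc), w_dec)
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(x)
    return total / len(data)

def train_autoencoder(data, w_enc, w_dec, epochs=200, lr=0.05):
    # Stochastic gradient descent; the input features double as the labels.
    n, k = len(w_dec), len(w_enc)
    for _ in range(epochs):
        for x in data:
            h = encode(x, w_enc)
            y = decode(h, w_dec)
            err = [2 * (y[i] - x[i]) / n for i in range(n)]
            grad_h = [sum(err[i] * w_dec[i][j] for i in range(n)) for j in range(k)]
            for i in range(n):
                for j in range(k):
                    w_dec[i][j] -= lr * err[i] * h[j]
            for j in range(k):
                for i in range(n):
                    w_enc[j][i] -= lr * grad_h[j] * x[i]

rng = random.Random(0)
# Assumed toy data: three correlated, pre-scaled numeric features per example.
data = [[t, 0.5 * t, -0.5 * t] for t in (rng.uniform(-1, 1) for _ in range(20))]
# Three inputs -> two hidden units -> three outputs.
w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]

loss_before = mean_loss(data, w_enc, w_dec)
train_autoencoder(data, w_enc, w_dec)
loss_after = mean_loss(data, w_enc, w_dec)

# After training, the hidden-layer activations serve as the embedding.
embedding = encode(data[0], w_enc)
```

A real autoencoder would normally use nonlinear activations and a deep learning library, but the shape of the computation is the same: train to reproduce the input, then read the embedding off the hidden layer.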
An autoencoder is the simplest choice to generate embeddings. However, an autoencoder isn't the optimal choice when certain features could be more important than others in determining similarity. For example, in house data, let's assume “price” is more important than “postal code”. In such cases, use only the important feature as the training label for the DNN. Since this DNN predicts a specific input feature instead of predicting all input features, it is called a predictor DNN. Use the following guidelines to choose a feature as the label:
- Prefer numeric features to categorical features as labels because loss is easier to calculate and interpret for numeric features.
- Do not use categorical features with cardinality \(\lesssim\) 100 as labels. If you do, the DNN will not be forced to reduce your input data to embeddings because a DNN can easily predict low-cardinality categorical labels.
- Remove the feature that you use as the label from the input to the DNN; otherwise, the DNN will perfectly predict the output.
Depending on your choice of labels, the resulting DNN is either an autoencoder DNN or a predictor DNN.
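The label guidelines for a predictor DNN can be sketched as a small data-preparation step. The helper below is hypothetical (the name `split_features_and_label`, the dict-based example format, and the toy house records are all assumptions for illustration), and it treats the approximate cardinality guideline as a hard cutoff of 100, which is a simplification.

```python
def split_features_and_label(examples, label_name, min_cardinality=100):
    """Split each example (a dict) into DNN inputs and a training label."""
    labels = [example[label_name] for example in examples]
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in labels
    )
    # Guideline: avoid low-cardinality categorical labels, because the DNN
    # can predict them without learning a useful embedding.
    if not is_numeric and len(set(labels)) <= min_cardinality:
        raise ValueError(
            f"'{label_name}' is categorical with low cardinality; "
            "choose a different label."
        )
    # Guideline: remove the label feature from the inputs; otherwise the
    # DNN could simply copy it to the output.
    inputs = [
        {name: value for name, value in example.items() if name != label_name}
        for example in examples
    ]
    return inputs, labels

# Assumed toy house data; real data would have many more rows and features.
houses = [
    {"price": 350000.0, "size": 120, "postal_code": "94043"},
    {"price": 510000.0, "size": 180, "postal_code": "94301"},
    {"price": 275000.0, "size": 95, "postal_code": "94043"},
]

inputs, labels = split_features_and_label(houses, "price")
```

Here `price` is accepted as the label because it is numeric, and it no longer appears in `inputs`. Passing `"postal_code"` instead would raise a `ValueError` in this sketch, because the toy data has only two distinct postal codes.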
Loss Function for DNN
To train the DNN, you need to create a loss function by following these steps:
- Calculate the loss for every output of the DNN. For outputs that are:
  - Numeric, use mean square error (MSE).
  - Univalent categorical, use log loss. Note that you won't need to implement log loss yourself because you can use a library function to calculate it.
  - Multivalent categorical, use softmax cross entropy loss. Note that you won't need to implement softmax cross entropy loss yourself because you can use a library function to calculate it.
- Calculate the total loss by summing the loss for every output.
When summing the losses, ensure that each feature contributes proportionately to the loss. For example, if you convert color data to RGB values, then you have three outputs. But summing the loss for three outputs means the loss for color is weighted three times as heavily as other features. Instead, multiply the loss for each of the three color outputs by 1/3 before summing.
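The following sketch shows one way the per-output losses and the 1/3 weighting could fit together. In practice you would use a library's loss functions; the plain-Python versions here exist only to make the arithmetic visible. The function names, the toy predictions, and the feature names (`price`, `type`, `tags`) are assumptions, as is the choice to represent a multivalent target as a probability distribution over its values.

```python
import math

def mse(preds, targets):
    """Mean squared error for one numeric output over a batch."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def log_loss(prob_rows, true_indices, eps=1e-12):
    """Mean negative log-probability of the true class (univalent categorical)."""
    total = sum(-math.log(max(row[i], eps)) for row, i in zip(prob_rows, true_indices))
    return total / len(prob_rows)

def softmax_cross_entropy(logit_rows, target_rows):
    """Mean cross entropy between softmax(logits) and a target distribution."""
    total = 0.0
    for logits, targets in zip(logit_rows, target_rows):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total -= sum(t * (z - log_z) for z, t in zip(logits, targets))
    return total / len(logit_rows)

def total_loss(weighted_losses):
    """Sum of (loss, weight) pairs."""
    return sum(weight * loss for loss, weight in weighted_losses)

# Toy batch of two examples (all values are made up).
price_loss = mse([0.5, 0.2], [0.4, 0.3])
red_loss = mse([1.0, 0.0], [0.7, 0.0])
green_loss = mse([0.5, 0.5], [0.5, 0.2])
blue_loss = mse([0.0, 0.3], [0.0, 0.0])
type_loss = log_loss([[0.8, 0.2], [0.3, 0.7]], [0, 1])
tags_loss = softmax_cross_entropy(
    [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]],
)

# Color is one feature spread over three outputs, so each gets weight 1/3.
loss = total_loss([
    (price_loss, 1.0),
    (red_loss, 1 / 3), (green_loss, 1 / 3), (blue_loss, 1 / 3),
    (type_loss, 1.0),
    (tags_loss, 1.0),
])
```

With these weights, color contributes the average of its three channel losses, so it counts as a single feature alongside the others.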
Using DNN in an Online System
An online machine learning system has a continuous stream of new input data. You’ll need to train your DNN on the new data. However, if you retrain your DNN from scratch, then your embeddings will be different because DNNs are initialized with random weights. Instead, always warm-start the DNN with the existing weights and then update the DNN with new data.
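To make the difference between warm-starting and retraining from scratch concrete, here is a small sketch under heavy simplifying assumptions: a linear autoencoder with one hidden unit, two made-up correlated features, and new data drawn from the same distribution as the old data. The helpers `train`, `random_weights`, `embed`, and `make_batch` are hypothetical names for this example only.

```python
import random

def train(data, w_enc, w_dec, epochs=300, lr=0.1):
    """SGD on a one-hidden-unit linear autoencoder; returns updated copies."""
    w_enc, w_dec = list(w_enc), list(w_dec)
    n = len(w_enc)
    for _ in range(epochs):
        for x in data:
            h = sum(w * xi for w, xi in zip(w_enc, x))  # the embedding
            err = [2 * (w_dec[i] * h - x[i]) / n for i in range(n)]
            grad_h = sum(e * w for e, w in zip(err, w_dec))
            w_dec = [w - lr * e * h for w, e in zip(w_dec, err)]
            w_enc = [w - lr * grad_h * xi for w, xi in zip(w_enc, x)]
    return w_enc, w_dec

def random_weights(n, seed):
    rng = random.Random(seed)
    enc = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    dec = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    return enc, dec

def embed(x, w_enc):
    return sum(w * xi for w, xi in zip(w_enc, x))

def make_batch(seed, size=20):
    # Assumed toy data: two correlated, pre-scaled features.
    rng = random.Random(seed)
    return [[t, 0.5 * t] for t in (rng.uniform(-1, 1) for _ in range(size))]

old_batch, new_batch = make_batch(1), make_batch(2)

# Original model, trained on the data available so far.
enc_old, dec_old = train(old_batch, *random_weights(2, seed=10))

# Warm start: continue from the existing weights using only the new data.
enc_warm, dec_warm = train(new_batch, enc_old, dec_old)

# From scratch: new random initialization, trained on all the data.
enc_cold, dec_cold = train(old_batch + new_batch, *random_weights(2, seed=11))

# Compare how far the embedding of a fixed example moves in each case.
probe = [0.8, 0.4]
warm_drift = abs(embed(probe, enc_warm) - embed(probe, enc_old))
cold_drift = abs(embed(probe, enc_cold) - embed(probe, enc_old))
```

In this toy setup the warm-started embedding of `probe` stays very close to the original one, while the retrained model lands on a different embedding even though both models reconstruct the data well. With a deep learning library, the equivalent step would be loading the saved weights before continuing training.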