Generating Embeddings Example

This example shows how to generate the embeddings used in a supervised similarity measure.

Imagine you have the same housing data set that you used when creating a manual similarity measure:

Feature             Type
Price               Positive integer
Size                Positive floating-point value in units of square meters
Postal code         Integer
Number of bedrooms  Integer
Type of house       A text value from “single_family,” “multi-family,” “apartment,” “condo”
Garage              0/1 for no/yes
Colors              Multivalent categorical: one or more values from standard colors “white,” “yellow,” “green,” etc.

Preprocessing Data

Before you use feature data as input, you need to preprocess the data. The preprocessing steps are based on the steps you took when creating a manual similarity measure. Here's a summary:

Feature             Type or Distribution  Action
Price               Poisson distribution  Quantize and scale to [0,1].
Size                Poisson distribution  Quantize and scale to [0,1].
Postal code         Categorical           Convert to longitude and latitude, quantize and scale to [0,1].
Number of bedrooms  Integer               Clip outliers and scale to [0,1].
Type of house       Categorical           Convert to one-hot encoding.
Garage              0 or 1                Leave as is.
Colors              Categorical           Convert to RGB values and process as numeric data.

For more information on one-hot encoding, see Embeddings: Categorical Input Data.
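
As a rough illustration, three of the actions in the table above could be sketched in plain Python as follows. The function names, the bucket count, and the clipping bounds are assumptions made for this sketch, and "quantize" is interpreted here as assigning each value to a quantile bucket and scaling the bucket index to [0,1].

```python
from bisect import bisect_right

# Vocabulary taken from the feature table; the list order is an arbitrary choice.
HOUSE_TYPES = ["single_family", "multi-family", "apartment", "condo"]


def quantile_scale(values, n_buckets=4):
    """Assign each value to a quantile bucket, then scale the bucket index to [0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    # Boundaries are the values found at the 1/n_buckets, 2/n_buckets, ... positions.
    boundaries = [ordered[(n * k) // n_buckets] for k in range(1, n_buckets)]
    return [bisect_right(boundaries, v) / (n_buckets - 1) for v in values]


def clip_and_scale(values, low, high):
    """Clip outliers to [low, high], then scale linearly to [0, 1]."""
    return [(min(max(v, low), high) - low) / (high - low) for v in values]


def one_hot(value, vocabulary):
    """Return a vector with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == entry else 0.0 for entry in vocabulary]


# Hypothetical sample values, used only to show the calls.
prices = quantile_scale([100, 200, 300, 400, 500, 600, 700, 800])
bedrooms = clip_and_scale([1, 2, 3, 12], low=1, high=5)
house_type = one_hot("condo", HOUSE_TYPES)
```

The postal code and color conversions are omitted because they depend on lookup data (coordinates per postal code, RGB values per color name) that this example does not specify.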

Choose Predictor or Autoencoder

To generate embeddings, you can choose either an autoencoder or a predictor. Remember, your default choice is an autoencoder. You choose a predictor instead if specific features in your dataset determine similarity. For completeness, let's look at both cases.

Train a Predictor

As training labels for your DNN, choose the features that are important in determining similarity between your examples. Let's assume price is most important in determining similarity between houses.

Choose price as the training label, and remove it from the input feature data to the DNN. Train the DNN by using all other features as input data. For training, the loss function is simply the MSE between predicted and actual price. To learn how to train a DNN, see Training Neural Networks.
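
The data side of this setup can be sketched as below: the label is split off from the input features, and the loss is the MSE between predicted and actual price. The helper names and the dictionary layout of an example are assumptions for illustration. The model and its training loop are left to whichever framework you use.

```python
def split_label(example, label_key="price"):
    """Separate the training label from the input features of one example."""
    features = {name: value for name, value in example.items() if name != label_key}
    return features, example[label_key]


def mse(predicted, actual):
    """Mean squared error between predicted and actual label values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical preprocessed example; every value is already scaled to [0, 1].
house = {"price": 0.4, "size": 0.7, "bedrooms": 0.5, "garage": 1}
inputs, label = split_label(house)
# `inputs` no longer contains the price, so the DNN cannot simply copy it.
```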

Train an Autoencoder

Train an autoencoder on our dataset by following these steps:

  1. Ensure the hidden layers of the autoencoder are smaller than the input and output layers.
  2. Calculate the loss for each output as described in Supervised Similarity Measure.
  3. Create the loss function by summing the losses for each output. Ensure you weight the loss equally for every feature. For example, because color data is processed into RGB, weight each of the RGB outputs by 1/3.
  4. Train the DNN.
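
Steps 2 and 3 can be sketched as a weighted sum of per-output losses. In this sketch, the output names are hypothetical, squared error is used for every output as a simplification, and the 1/3 weighting rule for RGB is extended to every feature that spans several outputs. That extension is an assumption, not something stated above.

```python
# Hypothetical mapping from each feature to the DNN outputs that represent it.
FEATURE_OUTPUTS = {
    "price": ["price"],
    "size": ["size"],
    "postal_code": ["latitude", "longitude"],
    "bedrooms": ["bedrooms"],
    "type_of_house": ["single_family", "multi-family", "apartment", "condo"],
    "garage": ["garage"],
    "color": ["red", "green", "blue"],
}


def output_weights(feature_outputs):
    """Weight each output by 1/k, where k is the number of outputs of its feature."""
    weights = {}
    for outputs in feature_outputs.values():
        for name in outputs:
            weights[name] = 1.0 / len(outputs)
    return weights


def autoencoder_loss(inputs, reconstructions, weights):
    """Weighted sum of squared reconstruction errors over all outputs."""
    return sum(
        weight * (reconstructions[name] - inputs[name]) ** 2
        for name, weight in weights.items()
    )


weights = output_weights(FEATURE_OUTPUTS)
# Each RGB output gets weight 1/3, so color as a whole counts as much as price.
```

With this weighting, the weights of all outputs belonging to one feature sum to 1, so no feature dominates the loss just because it is encoded with more outputs.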

Extracting Embeddings from the DNN

After training your DNN, whether predictor or autoencoder, extract the embedding for an example from the DNN. To do so, feed the feature data of the example into the DNN as input, and read the outputs of the final hidden layer. These outputs form the embedding vector. Remember, the vectors for similar houses should be closer together than vectors for dissimilar houses.
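
A minimal sketch of the extraction step, assuming a fully connected network with ReLU activations whose trained weights are available as plain lists: run the forward pass through the hidden layers only and stop before the output layer. The layer representation and the weight values are hypothetical.

```python
def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by ReLU; `weights` holds one row per unit."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def embed(features, hidden_layers):
    """Return the activations of the final hidden layer as the embedding vector."""
    activation = features
    for weights, biases in hidden_layers:
        activation = dense_relu(activation, weights, biases)
    return activation


# Hypothetical "trained" hidden layers: 3 inputs -> 2 units -> 1 unit.
# The output layer is deliberately excluded; its values are not part of the embedding.
hidden_layers = [
    ([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0.0, 0.0]),
    ([[0.5, 0.5]], [-0.25]),
]
embedding = embed([1.0, 0.5, 0.0], hidden_layers)
```

In a deep learning framework, the equivalent is usually a second model that shares the trained layers but ends at the final hidden layer.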

Next, you'll see how to quantify the similarity for pairs of examples by using their embedding vectors.
