Given a set of items \( \{ t_i \}, i \in \left[ 0, I \right] \) and a textual embedding \( E: \textrm{text} \rightarrow \mathbb{R}^t \), we use training data in the form of \( l \) runs of preferences. Each run belongs to a single user (users can be repeated, as a user can submit multiple runs). A run contains pairs \( (i, p) \) with \( i \in \left[ 0, I \right] \) the item over which the user stated the preference and \( p \in \left\{ -1, 0, 1 \right\} \) the stated preference (negative or positive; 0 means the text was not understood, and those tuples are ignored, see the annotation instructions). Tell-and-Show trains a Siamese Neural Network as follows:
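For concreteness, a run can be represented as a plain list of (item index, preference) tuples; here is a minimal sketch, where the field names and container types are our own illustration, not an actual schema:

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One annotation run: a single user's stated preferences.

    Field names are illustrative placeholders.
    """
    user_id: int
    pairs: list[tuple[int, int]]  # (item index i in [0, I], preference p in {-1, 0, 1})

def usable_pairs(run: Run) -> list[tuple[int, int]]:
    """Drop tuples with p == 0 (text not understood by the annotator)."""
    return [(i, p) for (i, p) in run.pairs if p != 0]
```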

Given two tuples \( (i, q), (j, r) \) in a run for the same user such that \( q \neq 0 \wedge r \neq 0 \), the following network is trained:

\[ t_i \rightarrow E \rightarrow \textrm{MLP}_p \rightarrow e_i \]
\[ t_j \rightarrow E \rightarrow \textrm{MLP}_p \rightarrow e_j \]

then

\[ e_i \odot e_j \rightarrow \textrm{output} \]

The output is trained against \( q \, \mathrm{xor} \, r \), i.e., whether the two stated preferences differ. Only \( \textrm{MLP}_p \) is trained; \( E \) is kept fixed.
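A minimal PyTorch sketch of one training step follows. The MLP widths, the small linear head on top of the elementwise product (the text only says the product feeds the output), and the binary cross-entropy loss are our own assumptions:

```python
import itertools
import torch
import torch.nn as nn

T_DIM, P_DIM = 128, 128  # text and preference embedding sizes (see the detail paragraph below)

# MLP_p is the only trainable part; the text encoder E stays frozen.
mlp_p = nn.Sequential(nn.Linear(T_DIM, 256), nn.ReLU(), nn.Linear(256, P_DIM))
head = nn.Linear(P_DIM, 1)  # assumed output layer over e_i ⊙ e_j
opt = torch.optim.Adam(itertools.chain(mlp_p.parameters(), head.parameters()))
loss_fn = nn.BCEWithLogitsLoss()

def train_step(E_ti: torch.Tensor, E_tj: torch.Tensor,
               q: torch.Tensor, r: torch.Tensor) -> float:
    """One Siamese update on a pair of tuples from the same run.

    E_ti, E_tj: frozen text embeddings E(t_i), E(t_j), shape (batch, T_DIM).
    q, r: preferences in {-1, 1} (0-labeled tuples already filtered out).
    """
    e_i, e_j = mlp_p(E_ti), mlp_p(E_tj)     # shared weights: the Siamese pair
    logits = head(e_i * e_j).squeeze(-1)    # e_i ⊙ e_j -> output
    target = (q != r).float()               # q xor r: 1 if the preferences differ
    loss = loss_fn(logits, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```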

The intuition is that when \( q \) and \( r \) are the same, \( e_i \) and \( e_j \) should be close, and when they differ, they should be far apart. In effect, \( \textrm{MLP}_p \) maps the text embedding space to a new, preference embedding space.

To learn which parts of the textual embedding space are preference-carrying, we used BIRCH to cluster 5.5 million text descriptions from the English Wikimedia Commons. From there, we kept the 1,000 clusters with 200 items or more (they account for 27% of all items). For each cluster, the item closest to the centroid is computed; we call it the center. Four other random items in each cluster are also kept. This constitutes the 5,000 items over which we are soliciting annotations.
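A sketch of this selection step using scikit-learn's Birch; the size threshold matches the text, while the embedding file and variable names are placeholders:

```python
import numpy as np
from sklearn.cluster import Birch

# X: embeddings of the ~5.5M Commons descriptions, shape (n_items, t). Hypothetical file.
X = np.load("commons_description_embeddings.npy")

labels = Birch(n_clusters=None).fit_predict(X)

rng = np.random.default_rng(0)
annotation_items = []
for c in np.unique(labels):
    members = np.flatnonzero(labels == c)
    if len(members) < 200:              # keep only the large clusters
        continue
    centroid = X[members].mean(axis=0)
    center = members[np.argmin(np.linalg.norm(X[members] - centroid, axis=1))]
    extras = rng.choice(members[members != center], size=4, replace=False)
    annotation_items.append((c, center, extras))  # 1 center + 4 random items per cluster
```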

The goal of the annotation process is to find as many crossing preferences as possible for texts that are close in embedding space. As such, each time a cluster is selected for a given user, the center and the 4 extra elements are selected for annotation. Finally, there might be areas of the preference space that the user likes or dislikes completely, irrespective of the items. To steer away from those, the system selects clusters equidistant from items marked positive and items marked negative, therefore probing the user's decision boundary, as sketched below.
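One way to implement that selection, under the assumption that "equidistant" means the cluster center's distance to the nearest positively and nearest negatively annotated items is as balanced as possible:

```python
import numpy as np

def next_cluster(centers: np.ndarray, pos: np.ndarray, neg: np.ndarray) -> int:
    """Pick the cluster whose center is closest to equidistant from the
    positive and negative annotations, i.e. near the user's decision boundary.

    centers: (n_clusters, d) cluster centers.
    pos, neg: (n_pos, d) and (n_neg, d) embeddings of items marked +1 / -1.
    """
    d_pos = np.linalg.norm(centers[:, None, :] - pos[None, :, :], axis=-1).min(axis=1)
    d_neg = np.linalg.norm(centers[:, None, :] - neg[None, :, :], axis=-1).min(axis=1)
    return int(np.argmin(np.abs(d_pos - d_neg)))  # the most "equidistant" cluster
```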

Further detail: for annotation scheduling, the current version uses code inherited from an earlier project based on LASER embeddings, with a dimensionality of 1,024. The target embedding space has dimension 128. For training the Siamese network, we use the Universal Sentence Encoder, with a 128-dimensional text embedding and a preference embedding of the same size.

Once the preference metric (the mapping from text to the preference embedding space) has been trained, all text descriptions of Wikimedia Commons items can be mapped to the preference space and clustered, with the large clusters and their centers identified. The set of centers of large clusters can then be downloaded, and a user can annotate their decision boundary over that set. That constitutes their preference profile: a set \( \{ (e, p) \} \) where \( e \) is a preference embedding and \( p \in \{ -1, 1 \} \) (texts not understood by the user are ignored).
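Putting it together, the profile construction might look like the following sketch, where `mlp_p` is the trained network from the earlier snippet and the alignment of labels to centers is assumed:

```python
import numpy as np
import torch

def build_profile(center_embeddings: np.ndarray,
                  user_labels: list[int]) -> list[tuple[np.ndarray, int]]:
    """Preference profile: {(e, p)} over the centers of large clusters.

    center_embeddings: text embeddings E(t) of the downloaded centers, shape (n, T_DIM).
    user_labels: the user's annotations in {-1, 0, 1}, aligned with the centers.
    """
    with torch.no_grad():
        e = mlp_p(torch.from_numpy(center_embeddings).float()).numpy()  # map to preference space
    return [(e_k, p) for e_k, p in zip(e, user_labels) if p != 0]       # drop "not understood"
```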

Instead of retrieving from a server a ranked list of items based on the user's preferences, the Bucket API returns a random list of 100-1,000 items \( \{ \left( \textrm{text description}, \textrm{preference embedding}, \textrm{URL} \right) \} \); the web browser then uses a \( k \)-nearest-neighbors algorithm against the preference profile to rank the items.
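The client-side ranking then amounts to a few lines, sketched here in Python rather than browser JavaScript; the distance-weighted \( k \)-NN vote and the value of \( k \) are our own choices, as the weighting scheme is not specified above:

```python
import numpy as np

def rank_bucket(bucket, profile, k: int = 5):
    """Rank a random bucket of items against the user's preference profile.

    bucket: list of (text_description, preference_embedding, url) from the Bucket API.
    profile: list of (preference_embedding, p) with p in {-1, 1}.
    """
    P = np.stack([e for e, _ in profile])
    labels = np.array([p for _, p in profile], dtype=float)

    def score(item):
        _, emb, _ = item
        d = np.linalg.norm(P - emb, axis=1)
        nearest = np.argsort(d)[:k]                                  # k closest profile points
        return float(np.sum(labels[nearest] / (d[nearest] + 1e-9)))  # higher = more liked

    return sorted(bucket, key=score, reverse=True)
```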