Community Content-sharing Without Idols Nor Masters
Truly, it is not knowing but learning, not possessing but acquiring, not being there but getting there, that gives the greatest enjoyment.
Carl Friedrich Gauss.
Given a set of items
Given two tuples
The output is trained against
The intuition being that when
To learn which parts of the textual embedding space are preference-carrying, we used BIRCH to cluster 5.5 million text descriptions in the English Wikimedia Commons. From there, we kept the 1,000 clusters that have 200 items or more (they account for 27% of all items). For each cluster, we identify the item closest to the centroid and call it the center. Four other random items from the cluster are also kept. Together these make up the 5,000 items for which we solicit annotations.
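The selection step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the cluster labels have already been produced (for instance by BIRCH), and the function name `select_annotation_items` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def select_annotation_items(embeddings, labels, min_size=200, extras=4, seed=0):
    """Return {cluster_label: [center_index, extra_index, ...]} for large clusters.

    embeddings: list of equal-length numeric vectors, one per item.
    labels: cluster label of each item (assumed to come from a prior clustering run).
    """
    rng = random.Random(seed)

    # Group item indices by cluster label.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    selection = {}
    for label, idxs in members.items():
        if len(idxs) < min_size:
            continue  # keep only the large clusters

        # Centroid = per-dimension mean of the cluster's embeddings.
        dim = len(embeddings[idxs[0]])
        centroid = [sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)]

        def sq_dist(i):
            return sum((a - b) ** 2 for a, b in zip(embeddings[i], centroid))

        # The "center" is the actual item closest to the centroid.
        center = min(idxs, key=sq_dist)

        # Add a few other random items from the same cluster.
        others = [i for i in idxs if i != center]
        selection[label] = [center] + rng.sample(others, min(extras, len(others)))
    return selection
```

With `min_size=200` and `extras=4`, each retained cluster contributes five items, which matches the 1,000 clusters and 5,000 items mentioned above.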
The goal of the annotation process is to find as many crossing preferences as possible among texts that are close to each other in embedding space. To that end, each time a cluster is selected for a given user, its center and its 4 extra elements are all presented for annotation. Finally, there might be areas of the preference space that the user likes or dislikes completely, irrespective of the items. To steer away from those areas, the system selects clusters equidistant from items marked positive and items marked negative, thereby probing the user's decision boundary.
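One way to read "equidistant from items marked positive and negative" is to pick the cluster whose center has the smallest gap between its distance to the nearest positive item and its distance to the nearest negative item. The sketch below illustrates that reading only; the scoring rule and the names `boundary_score` and `next_cluster` are assumptions, not the scheduler's actual implementation.

```python
import math


def boundary_score(center, positives, negatives):
    """Gap between distance to nearest positive and nearest negative item.

    A score near zero means the center sits roughly on the decision boundary.
    """
    d_pos = min(math.dist(center, p) for p in positives)
    d_neg = min(math.dist(center, n) for n in negatives)
    return abs(d_pos - d_neg)


def next_cluster(centers, positives, negatives, already_annotated=()):
    """Pick the not-yet-annotated cluster closest to the decision boundary.

    centers: {cluster_id: embedding of the cluster center}
    positives / negatives: embeddings of items the user marked so far.
    """
    candidates = [c for c in centers if c not in already_annotated]
    return min(
        candidates,
        key=lambda c: boundary_score(centers[c], positives, negatives),
    )
```

Clusters deep inside a region the user uniformly likes or dislikes get a large score under this rule and are therefore scheduled last.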
Further detail: for annotation scheduling, the current version uses code inherited from an earlier project, with LASER embeddings of dimensionality 1,024. The target embedding space has 128 dimensions. For training the Siamese network, we use Universal Sentence Encoders with a 128-dimensional text embedding and a preference embedding of the same size.
Once the preference metric (the mapping from text to the preference embedding space) has been trained, all text descriptions of Wikimedia Commons items can be mapped to the preference space and clustered, and the large clusters and their centers identified. The set of large-cluster centers can then be downloaded, and a user can annotate their decision boundary over that set. That constitutes their preference profile: a set
Instead of retrieving from a server a ranked list of items based on the user's preferences, the Bucket API returns a random list of 100-1000 items: