The **Union Visual Translation Embedding network (UVTransE)**, which learns three projection matrices $W_{s}$, $W_{o}$, $W_{u}$ which map the respective feature vectors of the bounding boxes enclosing the subject, object, and union of subject and object into a common embedding space, as well as translation vectors $t_{p}$ (to be consistent with [[(VtransE) Visual Translation Embedding Network for Visual Relation Detection]]) in the same space corresponding to each of the predicate labels that are present in the dataset.