Beyond LIGER: A New Benchmark in Sequential Recommendation
In November 2024, a paper titled "Unifying Generative and Dense Retrieval for Sequential Recommendation" was released. The paper introduced a new state-of-the-art model, LIGER, improving over a previous generative recommender model, TIGER (see our blogpost comparing TIGER and BaseModel), with particularly strong gains in cold-start settings.
In this article, we compare the performance of BaseModel with cold-start capabilities against the TIGER and LIGER models.
Overview of the LIGER model
We start with a quick introduction to LIGER's predecessor, TIGER. The core innovation of this model is the concept of Semantic IDs: lightweight representations that encode an item's attributes, such as title, price, and brand. These compressed representations allow a smaller memory and computational footprint while still providing the model with crucial item metadata.
In particular, TIGER utilises a bi-directional transformer architecture: given a sequence of item representations corresponding to a user's purchase history, it predicts the most likely next item. At each step, TIGER represents the item by its Semantic ID:
TIGER: item representation = Semantic ID
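To make the Semantic ID idea concrete, the sketch below quantises an item's content embedding into a short tuple of code indices using a simplified residual quantisation. The original papers learn these codes with an RQ-VAE; the fixed random codebooks, names, and sizes here are purely illustrative assumptions.

```python
import numpy as np

def semantic_id(item_embedding, codebooks):
    """Quantise a content embedding into a short tuple of code indices.

    `codebooks` is a list of (num_codes, dim) arrays; at each level we pick
    the nearest code and pass the residual to the next level, mimicking the
    residual-quantisation idea behind Semantic IDs.
    """
    residual = item_embedding.copy()
    codes = []
    for codebook in codebooks:
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))
        codes.append(idx)
        residual = residual - codebook[idx]
    return tuple(codes)

# Toy example: a 3-level Semantic ID with 256 codes per level.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]
item_embedding = rng.normal(size=64)  # e.g. an embedding of title/brand/price text
print(semantic_id(item_embedding, codebooks))
```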
The authors of the LIGER paper compare TIGER to dense retrieval, where instead of collapsing the item attributes into the Semantic ID, the properties are used as part of the item representation directly. More accurately, in dense retrieval each item is represented as:
DENSE RETRIEVAL: item representation = learnable item encoding + item properties + item position
In addition, in dense retrieval all items participate in the prediction process: instead of asking the model to generate the next item, all items are compared against the encoded history, and the best-matching item wins.
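A minimal sketch of this scoring step, with illustrative names and shapes rather than the paper's implementation: every catalogue item has a vector, and the prediction is the item whose vector best matches the encoded user history.

```python
import numpy as np

def dense_retrieval_topk(history_encoding, item_matrix, k=10):
    """Score *every* catalogue item against the encoded user history and
    return the indices of the k best-matching items (inner-product score)."""
    scores = item_matrix @ history_encoding   # one score per catalogue item
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
item_matrix = rng.normal(size=(50_000, 64))   # learnable encoding + properties, per item
history_encoding = rng.normal(size=64)        # output of the sequence encoder
print(dense_retrieval_topk(history_encoding, item_matrix, k=5))
```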
While dense retrieval outperforms TIGER, the authors’ experiments verify that combining TIGER’s Semantic IDs with the dense method’s auxiliary information sources yields improved performance over both TIGER and state-of-the-art dense retrieval methods. More precisely, they use the following item representation:
LIGER: item representation = Semantic ID + item properties + item position
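As a rough sketch of how such a combined representation could be assembled (the names, embedding sizes, and the choice to simply sum the components are our assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
sid_embeddings = [rng.normal(size=(256, 64)) for _ in range(3)]  # one table per Semantic ID level
pos_embeddings = rng.normal(size=(50, 64))                       # one vector per position in the history

def liger_item_input(semantic_id, property_embedding, position):
    """Compose a LIGER-style input vector: Semantic-ID embedding + item-property
    embedding + positional embedding. Summing the parts is a simplifying
    assumption; the paper's exact composition may differ."""
    sid_vec = sum(sid_embeddings[level][code] for level, code in enumerate(semantic_id))
    return sid_vec + property_embedding + pos_embeddings[position]

vec = liger_item_input((137, 42, 201), rng.normal(size=64), position=5)
```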
A consequence of the revised LIGER item representations is improved cold-start recommendation quality. Cold start refers to a common issue in recommendation systems, where items newly introduced to a catalog lack sufficient interaction data for traditional recommendation methods to work well. If no customer has interacted with an item yet, the model has limited signal to infer which users would be likely to purchase it.
BaseModel vs LIGER
As with TIGER, the BaseModel architecture shares some core ideas with LIGER. Namely, BaseModel uses our proprietary Tremde algorithm to compute density-aware item sketches, which can be thought of as Semantic IDs for a user's behavioural patterns rather than for item content. The density-awareness means that the sketches adapt their resolution to how much interaction data is available, allocating more representational capacity to dense regions of the interaction space. In practice, the Tremde sketches appear to significantly improve representational power.
In addition, BaseModel’s core prediction engine is based on a neural network backbone specifically engineered for tasks involving behavioural data, and therefore achieves superior computational performance compared to LIGER’s bi-directional transformer.
Finally, we mention BaseModel’s new cold-start generative engine. This engine incorporates item properties into the usual interactional representations, thus allowing BaseModel to generate accurate predictions even in the absence of interaction data.
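We cannot show BaseModel's internals here, but the general mechanism can be illustrated with a hypothetical fallback: when an item has little or no interaction history, its representation leans on a content embedding built from its properties, so it can still be scored against a user. The sketch below is purely illustrative and not BaseModel code.

```python
import numpy as np

def item_vector(item_id, behavioural_embeddings, content_embeddings, interaction_counts):
    """Hypothetical cold-start fallback (not BaseModel code): blend the
    behavioural (interaction-based) embedding with a content embedding built
    from item properties, leaning on content when interactions are scarce."""
    n = interaction_counts.get(item_id, 0)
    alpha = n / (n + 10.0)  # ~0 for brand-new items, ~1 for frequently bought ones
    behavioural = behavioural_embeddings.get(item_id, np.zeros(64))
    return alpha * behavioural + (1.0 - alpha) * content_embeddings[item_id]
```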
Results
We compare TIGER, LIGER, and BaseModel on the publicly available Amazon datasets.
We ran BaseModel in two different configurations to allow the fairest possible comparison with LIGER: the authors report contradictory dataset statistics in their original paper. To make sure we are not gaining an advantage by choosing the interpretation that suits BaseModel better, we report BaseModel's results under both possible interpretations. A complete description of the two setups is given below.
We observe that LIGER achieves significant improvements over TIGER across all datasets and metrics. BaseModel, in turn, significantly outperforms LIGER on all three datasets. A few results are worth highlighting:
- Cold-start performance stands out the most. On the Sports dataset, BaseModel achieves up to +263% improvement in cold-start NDCG@10 and +243% in cold-start Recall@10 over LIGER. Similarly, on Toys, cold-start NDCG@10 improves by up to +103%. These gains are particularly meaningful given that cold-start is the setting where recommendation models are most needed and most likely to fail.
- Warm-start ("Inset") results are also consistently strong. BaseModel improves Inset NDCG@10 by 24–60% over LIGER across datasets. The closest competition is on Toys Inset Recall@10 under the BaseModel 2 configuration, where LIGER is within 3% - the only instance where LIGER is competitive with BaseModel on any metric.
In the LIGER paper, the authors report a table of statistics for the datasets they used.
When regenerating the data for our experiments, we checked the result against this table to make sure the comparison we perform is fair. It turned out that while the regenerated dataset matched the reported numbers of users, items, and actions, it did not match the number of cold-start items. Since cold-start performance is a central focus of this comparison, the discrepancy is worth noting: a different number of cold-start items could shift the difficulty of the cold-start evaluation in either direction. Nevertheless, we ran experiments in this setting; these results are labelled BaseModel 1.
After truncating each user's purchase history to the last 20 purchases, we were able to recover the reported user, item, and cold-start item counts, but this also reduces the total number of actions, again creating a discrepancy. Results obtained on data prepared in this second way are labelled BaseModel 2.
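For concreteness, the truncation step behind BaseModel 2 looks roughly like the following pandas sketch (the column and file names are assumptions about the preprocessed Amazon data):

```python
import pandas as pd

# One row per (user_id, item_id, timestamp) interaction; the file name is hypothetical.
events = pd.read_csv("amazon_sports_events.csv")

# Keep only the 20 most recent purchases per user (the "BaseModel 2" interpretation).
truncated = (
    events.sort_values("timestamp")
          .groupby("user_id")
          .tail(20)
)
```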
By reporting both configurations, we aim to bracket the true comparison: BaseModel outperforms LIGER convincingly under both interpretations, giving us confidence that the gains are robust to the data preparation differences.
Choice of baseline results
In the Results section of this article, we reported LIGER's scores from the original paper's Table 4, "Ours (K=N)" row, rather than the numbers the authors emphasise in the more prominent Table 2. This is because we disagree with an assumption made in the experiments behind Table 2, which we explain here.
The results reported in Table 2 assume a very particular retrieval setup, in which the number of cold-start items in the catalog is assumed to be small at any given time; specifically, not much larger than 20.
Informally, the setup works roughly as follows: the model first takes a small candidate set of "likely" items (the top 20 from a warm-start retrieval stage), then adds all cold-start items to that pool. Cold-start accuracy is then evaluated against this heavily biased pool.
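To make the objection concrete, here is our reading of that candidate-pool construction, expressed as hypothetical code:

```python
def cold_start_candidate_pool(warm_ranking, cold_start_items, k=20):
    """Our reading of the Table 2 protocol: the candidate pool is the top-k
    items from a warm-start retrieval stage plus *all* cold-start items. A
    cold-start target is therefore always in this small pool, which makes the
    cold-start task far easier than ranking against the full catalogue."""
    return list(warm_ranking[:k]) + list(cold_start_items)
```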
That’s why we prefer the Table 4 K=N setting for comparison: it better reflects the regime where the model is evaluated without relying on the assumption that the cold-start pool is tiny and pre-inserted into an artificially small shortlist.
Conclusion
The comparison between BaseModel and LIGER reveals substantial differences in their architectural choices and performance.
While LIGER represents a notable advancement in generative retrieval for recommender systems, extracting state-of-the-art performance from TIGER's Semantic ID idea, BaseModel's approach demonstrates superior efficiency and effectiveness in sequential recommendation tasks.
The potential of our methods encourages continued innovation and comparison with leading models in the field to push the boundaries of what behavioral models can achieve.