Database Optimization And The Problem With AI-Powered Cost Models
Database optimization is an important part of running any query: it helps the database select the most optimal way of accessing the data the user asked for. Although research in this field reports good accuracy, it is still lacking in terms of real-world results.
This post is based on the article "How Good Are Learned Cost Models, Really? Insights from Query Optimization Tasks".
The code, along with citations for the article and for the other papers (and their code) used in the comparison, can be found here:
https://github.com/DataManagementLab/lcm-eval
This blog post summarizes the article, together with my interpretation of several points made throughout it.
Article: https://arxiv.org/abs/2502.01229
TL;DR
The paper compares classical cost estimation models (the cost estimators from PostgreSQL v10 and PostgreSQL v16) with the new learned cost models (LCMs). It defines new metrics to compare real-world results instead of just using model accuracy, and finds that even though the learned cost models outperform the classical ones in terms of accuracy, in terms of real-world statistics (mainly the total runtime of the selected queries) the old classical models pull ahead most of the time.

Better Metrics
Even though accuracy and Q-Error are sufficient for most use cases at the model level, in the case of databases the most important metric is the cost of running the query, whether that is latency (the time it takes to run the query plus model inference) or resource cost (CPU, memory usage, etc.).
However, the research into LCMs focuses mainly on maximizing accuracy and minimizing Q-Error.
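For reference, the Q-Error used in this line of work is essentially the factor by which an estimate is off, in either direction. A minimal sketch (the aggregation over a whole workload, e.g. median or 95th percentile, varies between papers):

```python
def q_error(estimate: float, actual: float) -> float:
    """Factor by which the estimate deviates from the actual value (always >= 1)."""
    # Clamp to a small epsilon so zero values don't blow up the ratio.
    estimate, actual = max(estimate, 1e-9), max(actual, 1e-9)
    return max(estimate / actual, actual / estimate)

# Example: predicting 2.0s for a query that actually takes 0.5s gives a Q-Error of 4.
print(q_error(2.0, 0.5))  # 4.0
```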
To capture real-world impact, the authors define three optimizer tasks and separate metrics for each.
Join Ordering:
Selecting the optimal sequence in which the tables are joined. For this task, they define the following metrics (a small sketch of how some of them can be computed follows the list):
- Selected Runtime: the runtime (in seconds) of the plan that the model considered optimal
- Surpassed Plans: the percentage of candidate plans that the selected plan is better than (100% means the selected plan is the best plan, 0% means it is the worst)
- Rank Correlation: the correlation between the actual and predicted runtimes of the different query plans, to assess the ranking ability of a model
- Maximal Relative Under/Over Estimation: the likelihood that the model picks a non-optimal plan by over- or under-estimating runtimes
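To make the first three metrics concrete, here is a minimal sketch of how they could be computed for a single query, given the actual and predicted runtimes of all its candidate join orders. This is my own illustration rather than the paper's evaluation code, and it uses Spearman's correlation as a stand-in for whichever rank correlation the authors report:

```python
from scipy.stats import spearmanr

# Hypothetical runtimes (in seconds) for all candidate join orders of one query.
actual_runtimes    = [1.2, 0.4, 3.5, 0.9, 2.1]
predicted_runtimes = [1.0, 0.7, 2.8, 1.5, 2.3]

# The model "selects" the plan it predicts to be the cheapest.
selected = min(range(len(predicted_runtimes)), key=lambda i: predicted_runtimes[i])

# Selected Runtime: the actual runtime of the chosen plan.
selected_runtime = actual_runtimes[selected]

# Surpassed Plans: share of the other candidate plans that the chosen plan beats.
others = [r for i, r in enumerate(actual_runtimes) if i != selected]
surpassed = 100 * sum(r > selected_runtime for r in others) / len(others)

# Rank Correlation: how well the predicted ordering of plans matches the actual one.
rank_corr, _ = spearmanr(predicted_runtimes, actual_runtimes)

print(f"selected runtime: {selected_runtime}s, surpassed: {surpassed:.0f}%, rho: {rank_corr:.2f}")
```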
Looking at an example (the full results are in the paper), the results shown by LCMs are unsatisfactory; on the other hand, it is interesting that database-agnostic models perform better in some tests than their database-specific counterparts.

Another test was done by giving the actual cardinalities to the models. In this test, the classical models gave near-perfect estimations and plan selections, while the LCM of choice, zero-shot, showed only a slight improvement.
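To see what feeding in true cardinalities corrects for, it helps to compare PostgreSQL's estimated row counts against the actual ones, both of which appear in an `EXPLAIN (ANALYZE, FORMAT JSON)` plan. A rough sketch, using a hard-coded toy plan instead of a live database:

```python
def cardinality_errors(node, path=""):
    """Recursively compare estimated vs. actual row counts for every plan node."""
    name = f"{path}/{node['Node Type']}"
    est, act = max(node["Plan Rows"], 1), max(node["Actual Rows"], 1)
    yield name, node["Plan Rows"], node["Actual Rows"], max(est / act, act / est)
    for child in node.get("Plans", []):
        yield from cardinality_errors(child, name)

# Shaped like the output of: EXPLAIN (ANALYZE, FORMAT JSON) <query>;  (toy values)
plan = {"Node Type": "Hash Join", "Plan Rows": 100, "Actual Rows": 12000,
        "Plans": [{"Node Type": "Seq Scan", "Plan Rows": 5000, "Actual Rows": 5000},
                  {"Node Type": "Index Scan", "Plan Rows": 20, "Actual Rows": 2400}]}

for name, est, act, q in cardinality_errors(plan):
    print(f"{name}: estimated={est} actual={act} q-error={q:.1f}")
```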
Access Path Selection:
Whether the model selects the optimal access path (index scan vs. full table scan). Although indexes are usually considered a good choice, using them on small tables can have the opposite effect. For this task, the following metric is defined (a sketch of its computation follows):
- Balanced Accuracy: how accurately the model selects the access path when one class is under- or over-represented.
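Balanced accuracy averages the recall of each class, so a model that almost always predicts the over-represented access path no longer looks artificially good. A small sketch with hypothetical labels (equivalent to scikit-learn's `balanced_accuracy_score`):

```python
# 1 = index scan is optimal, 0 = sequential scan is optimal (hypothetical labels).
actual    = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

def balanced_accuracy(actual, predicted):
    """Mean of per-class recall, so the rare class counts as much as the common one."""
    recalls = []
    for cls in set(actual):
        idx = [i for i, a in enumerate(actual) if a == cls]
        recalls.append(sum(predicted[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

print(balanced_accuracy(actual, predicted))  # 0.75, while plain accuracy would be 0.9
```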

With this test, there are technically no winners, as all cost models still struggle to select the best access method.
Physical Operator Selection:
Selecting the optimal join algorithm. Many factors influence the choice between the different algorithms (Hash Join, Sort-Merge Join, Nested-Loop Join, etc.). The following metric is defined (a sketch of its computation follows):
- Pick Rate: how often the model picks the query plan with the optimal physical operator.
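A minimal sketch of the pick rate, under the assumption that for every query we know the actual runtime of each operator variant and which variant the model chose (hypothetical data):

```python
# For each query: actual runtime per physical operator, plus the model's choice.
queries = [
    {"runtimes": {"hash": 0.8, "merge": 1.4, "nestloop": 5.0}, "choice": "hash"},
    {"runtimes": {"hash": 2.1, "merge": 1.9, "nestloop": 0.3}, "choice": "merge"},
    {"runtimes": {"hash": 0.5, "merge": 0.9, "nestloop": 0.6}, "choice": "hash"},
]

# Pick rate: fraction of queries where the chosen operator is the actually fastest one.
picks = [q["choice"] == min(q["runtimes"], key=q["runtimes"].get) for q in queries]
print(f"pick rate: {sum(picks) / len(picks):.2f}")  # 0.67
```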

Some of the results in this task show potential for LCMs to be competitive with the classical models, but they still do not provide a significant enough benefit.
The authors also note that the PostgreSQL cost estimates help the models that use them produce better results, and they recommend including these estimates in future LCMs.
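PostgreSQL already exposes this estimate in every plan, so feeding it to a learned model is cheap. A hedged sketch of what that could look like, pulling the optimizer's `Total Cost` out of an `EXPLAIN (FORMAT JSON)` plan and appending it to whatever feature vector the LCM already uses (the surrounding feature layout is made up for illustration):

```python
import json

def postgres_cost_feature(explain_output: str) -> float:
    """Extract the optimizer's total cost estimate for the root plan node."""
    return json.loads(explain_output)[0]["Plan"]["Total Cost"]

# Abbreviated output of: EXPLAIN (FORMAT JSON) SELECT ...;  (hypothetical values)
raw = '[{"Plan": {"Node Type": "Seq Scan", "Total Cost": 431.5, "Plan Rows": 10000}}]'

features = [0.3, 1.0, 0.0]                   # whatever plan features the LCM already uses
features.append(postgres_cost_feature(raw))  # add the classical estimate as one more input
print(features)  # [0.3, 1.0, 0.0, 431.5]
```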
The Need For Worse Data?
The biggest takeaway from the three tasks is that the models are always trained on a dataset that is biased in some way: in access path selection the bias is towards index scans, and in physical operator selection the bias is towards index nested-loop joins.
The authors make this a pivotal point in their recommendations section, where they recommend that future research diversify its queries, not train only on pre-optimized plans, and take queries that time out into consideration.
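One practical way to collect plans that are not pre-optimized is to toggle PostgreSQL's `enable_*` planner settings, forcing it into alternative (often worse) plans and recording their runtimes too. A rough sketch using psycopg2; the connection string and query are placeholders, and the paper's actual data collection pipeline may well differ:

```python
import itertools
import json
import psycopg2  # assumes a reachable PostgreSQL instance with a benchmark schema

# Planner switches that steer PostgreSQL away from its preferred join operators.
SWITCHES = ["enable_hashjoin", "enable_mergejoin", "enable_nestloop"]

# Placeholder query; a real pipeline would iterate over a whole workload.
QUERY = "SELECT count(*) FROM orders o JOIN customers c ON o.customer_id = c.id"

with psycopg2.connect("dbname=benchmark") as conn, conn.cursor() as cur:
    # Try every on/off combination so the training set also contains "bad" plans.
    for combo in itertools.product(["on", "off"], repeat=len(SWITCHES)):
        if all(v == "off" for v in combo):
            continue  # keep at least one join method available
        for name, value in zip(SWITCHES, combo):
            cur.execute(f"SET {name} = {value}")
        cur.execute(f"EXPLAIN (ANALYZE, FORMAT JSON) {QUERY}")
        raw = cur.fetchone()[0]
        plan = raw if isinstance(raw, list) else json.loads(raw)
        print(dict(zip(SWITCHES, combo)), plan[0]["Execution Time"], "ms")
```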

The corresponding figure in the paper shows that fine-tuned models trained with diversified data had improved accuracy, and most of them also had better runtimes, though only zero-shot achieved a better total runtime than the default PG10 optimizer.
Model Architecture And Featurization
Though the LCMs didn't outperform the default optimizer in any of the tasks that the authors tried, they give the following recommendations for better results (a sketch illustrating the first point follows the list):
- It's better to learn from the query plan than the SQL Query itself
- More complex architectures don't necessarily mean better performance
- DB-Agnostic LCMs often outperform DB-Specific models.
- Precise cardinalities help.
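To illustrate the first recommendation: instead of tokenizing the SQL text, a plan-based LCM walks the operator tree that the database already produces and turns each node into a feature vector. A simplified sketch of what such an encoding could look like; the node types and features here are a toy subset and not the featurization of any specific model:

```python
# Toy vocabulary of operators; real models cover the full set of plan node types.
NODE_TYPES = ["Seq Scan", "Index Scan", "Hash Join", "Nested Loop", "Merge Join"]

def encode_node(node):
    """One-hot of the operator type plus the optimizer's estimates as numeric features."""
    one_hot = [1.0 if node["Node Type"] == t else 0.0 for t in NODE_TYPES]
    return one_hot + [float(node.get("Plan Rows", 0)), float(node.get("Total Cost", 0.0))]

def encode_plan(node):
    """Flatten the plan tree into per-node feature vectors (pre-order).
    A real LCM would usually feed these into a tree- or graph-structured model
    instead of flattening, so that parent/child relationships are preserved."""
    vectors = [encode_node(node)]
    for child in node.get("Plans", []):
        vectors.extend(encode_plan(child))
    return vectors

# Minimal example, shaped like a node from EXPLAIN (FORMAT JSON) output.
plan = {"Node Type": "Hash Join", "Plan Rows": 1000, "Total Cost": 250.0,
        "Plans": [{"Node Type": "Seq Scan", "Plan Rows": 5000, "Total Cost": 80.0},
                  {"Node Type": "Index Scan", "Plan Rows": 10, "Total Cost": 4.2}]}
print(encode_plan(plan))
```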
Conclusion
The authors conclude that LCMs still have to improve to match or outperform traditional models; these improvements can come from using better metrics, diversifying the training data, and even building hybrid models that utilize estimates from traditional models.
For my conclusion:
- The metrics defined by the article are a great way of comparing model performance, better than accuracy and Q-Error alone, as the gap between the traditional models and the LCMs shows.
- The DB workload generator introduced by the authors of the Zero-Shot model works great, but only gives pre-optimized queries, so there is room for improvement on that front.
- It's more important to have better query encoding than a more complex architecture (to be researched).