
LightGBM vs XGBoost vs CatBoost

Data Science Letter
12 min read · Nov 4, 2024


Quick summary

Hello 👋 In this article, I will compare LightGBM, XGBoost and CatBoost in the following areas:

  • Boosting algorithm
  • Node splitting
  • Missing data handling
  • Feature handling
  • Data sampling
  • LightGBM-specific features
  • XGBoost-specific features
  • CatBoost-specific features
  • Tips for choosing between LightGBM, XGBoost and CatBoost
  • Resources

🚀 Subscribe to us @ newsletter.datascienceletter.com

Boosting algorithm

Conventional boosting (LightGBM, XGBoost) vs ordered boosting (CatBoost)

One of the major differences in tree building between LightGBM/XGBoost and CatBoost is CatBoost's use of ‘ordered boosting’.

In conventional boosting algorithms (used by LightGBM and XGBoost), every boosting iteration builds the new tree on the same set of data points. It is argued that this repeated use of a single set of data points can increase the chance of overfitting.
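To make this concrete, here is a minimal sketch of the conventional scheme for a squared-error loss, written with plain scikit-learn trees rather than any of the three libraries (the data, tree depth and learning rate are arbitrary choices for the example): every iteration recomputes residuals on, and fits the next tree to, the same full training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data; every boosting step below reuses all N points.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
prediction = np.zeros(len(y))
trees = []

for _ in range(50):
    residuals = y - prediction                     # gradients computed on the same N points
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # and the same N points are updated again
    trees.append(tree)
```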

To mitigate this effect, CatBoost supports a different boosting algorithm known as ordered boosting. The central idea is to avoid repeatedly using the same data points for both tree building and gradient/Hessian computations. The method is briefly explained as follows:

  1. First, the original training dataset of size N is shuffled S times, producing S random permutations.
  2. At each boosting iteration, for each shuffled dataset, a separate tree is built for each data position i (where i = 1, 2, …, N), using only the data points that come before it (j < i).
  3. The gradients and Hessians for a particular data point k are then computed using only trees built from the data points preceding k, so k's own label never influences its own gradient estimate.

In reality, it is not practical to train a tree for each data position for each shuffled dataset, as the computational complexity would scale as SN². The actual algorithm builds trees for a fixed number of…
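As a rough illustration of the idea (and not CatBoost's actual implementation), the sketch below mimics the ordered residual computation for one boosting iteration with a squared-error loss, again using scikit-learn trees; the permutation count, tree depth and minimum prefix length are arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

S = 3            # number of random permutations
min_prefix = 20  # skip the first positions, whose prefixes are too small to fit on

for s in range(S):
    perm = rng.permutation(len(X))
    Xp, yp = X[perm], y[perm]
    residuals = np.zeros(len(X))
    for i in range(min_prefix, len(X)):
        # Fit a small tree only on the points that come before position i,
        # so the residual for point i never uses its own label.
        prefix_model = DecisionTreeRegressor(max_depth=3).fit(Xp[:i], yp[:i])
        residuals[i] = yp[i] - prefix_model.predict(Xp[i:i + 1])[0]
    # `residuals` would then drive the tree built for this permutation.
```

In the CatBoost library itself, the choice between the two schemes is exposed through the boosting_type parameter (its default can vary with the task configuration, so it is worth setting it explicitly when comparing the two behaviours):

```python
from catboost import CatBoostRegressor

# 'Ordered' requests ordered boosting; 'Plain' requests the conventional scheme.
model = CatBoostRegressor(boosting_type='Ordered', verbose=False)
```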



Written by Data Science Letter

Deep dive into a topic in data science and machine learning every two weeks. Subscribe to us @ newsletter.datascienceletter.com
