**Authors:** Yury Gorishniy, Ivan Rubachev, Artem Babenko
**Published:** NeurIPS 2022
**Link:** [GitHub](https://github.com/yandex-research/tabular-dl-num-embeddings)
---
### **What is the Paper About?**
This paper investigates how **numerical feature embeddings** affect the performance of **deep learning (DL)** models for **tabular data**. While past research focused on architectures (MLP, ResNet, Transformer), this paper shows that **how you represent numerical features matters a lot**, often more than the model itself.
---
### **Key Points:**
- **The Problem:**
  Most DL models convert numerical features to high-dimensional embeddings using simple functions (e.g., a single linear layer), which may limit performance.
- **Two Embedding Approaches Proposed:**
1. **Piecewise Linear Encoding (PLE):**
   - Splits each feature's range into bins (via quantiles or target-aware decision trees) and encodes a value componentwise: bins below the value get 1, the bin containing it gets the fractional position within that bin, and bins above get 0.
   - Simple, interpretable, and fast; a minimal sketch follows this list.
2. **Periodic Embeddings:**
   - Encodes each value with sine and cosine functions whose frequency coefficients are trainable (see the sketch after the key points).
   - Inspired by positional encodings in Transformers.
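
A minimal NumPy sketch of the quantile-based PLE variant, under my reading of the paper; the function name and variables are placeholders, not the reference implementation's API:

```python
import numpy as np

def ple_encode(x, edges):
    """Piecewise linear encoding of raw values `x` against bin `edges`.

    x:     [n] array of feature values.
    edges: sorted [T + 1] array of bin boundaries.
    Returns [n, T]: 1 for bins entirely below x, the fractional position
    for the bin containing x, and 0 for bins entirely above x.
    """
    left, right = edges[:-1], edges[1:]
    frac = (x[:, None] - left) / (right - left)  # position inside each bin
    return np.clip(frac, 0.0, 1.0)               # clamps to the 1 / frac / 0 pattern

# Bins from training-set quantiles (the unsupervised variant; the paper
# also builds target-aware bins from decision-tree splits):
rng = np.random.default_rng(0)
x_train = rng.normal(size=1000)
edges = np.quantile(x_train, np.linspace(0.0, 1.0, 9))  # 8 equal-mass bins
encoded = ple_encode(x_train, edges)                    # shape: (1000, 8)
```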
- **General Framework:**
- Embeddings are created **per feature**, without sharing parameters.
- Works with MLP, ResNet, and Transformers.
- **Results:**
- With proper embeddings, **MLPs can match or outperform Transformers**.
- DL models with good embeddings can **compete with or beat GBDTs** (like XGBoost, [[CatBoost]]) on “GBDT-friendly” datasets.
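
As a companion to the PLE sketch, here is a hedged PyTorch sketch of a periodic embedding for a single feature, assuming trainable frequencies initialized from a normal distribution with tunable scale `sigma`; the class name and default sizes are mine:

```python
import torch
import torch.nn as nn

class PeriodicEmbedding(nn.Module):
    """Sine/cosine embedding of one numerical feature with trainable
    frequency coefficients (initialization scale `sigma` is a hyperparameter)."""

    def __init__(self, k: int = 16, sigma: float = 1.0):
        super().__init__()
        self.coef = nn.Parameter(sigma * torch.randn(k))  # trainable frequencies

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch] scalar feature -> output: [batch, 2 * k]
        v = 2 * torch.pi * self.coef * x[:, None]
        return torch.cat([torch.sin(v), torch.cos(v)], dim=-1)

emb = PeriodicEmbedding(k=8, sigma=0.5)
print(emb(torch.randn(32)).shape)  # torch.Size([32, 16])
```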
---
### **Performance Highlights:**
- **MLP + PLR** (Periodic → Linear → ReLU) and **MLP + T-LR** (target-aware PLE → Linear → ReLU) gave **state-of-the-art results**; a PLR sketch follows this list.
- Embedding techniques were critical on datasets like California Housing and Adult Income.
- Embeddings improved DL model robustness to preprocessing choices (e.g., scaling, normalization).
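
To make the PLR naming concrete: the letters spell out the per-feature pipeline (periodic encoding, then a linear layer, then ReLU). A self-contained sketch of such a module and its per-feature application, with hypothetical names and sizes, might look like this:

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """One feature's PLR embedding: Periodic -> Linear -> ReLU."""

    def __init__(self, k: int = 16, d_embedding: int = 32, sigma: float = 1.0):
        super().__init__()
        self.coef = nn.Parameter(sigma * torch.randn(k))  # trainable frequencies
        self.linear = nn.Linear(2 * k, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = 2 * torch.pi * self.coef * x[:, None]                   # [batch, k]
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)  # [batch, 2k]
        return torch.relu(self.linear(periodic))                    # [batch, d_embedding]

# One module per feature (no parameter sharing), outputs concatenated
# and fed to the backbone (MLP, ResNet, or Transformer):
n_features, batch = 5, 64
embedders = nn.ModuleList(PLREmbedding() for _ in range(n_features))
x = torch.randn(batch, n_features)
tokens = torch.cat([e(x[:, i]) for i, e in enumerate(embedders)], dim=-1)
# tokens: [64, 5 * 32], ready for an MLP head
```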
---
### **Strengths:**
- Establishes **embeddings as a critical design choice** for tabular DL.
- Shows **simple DL models can be very competitive** with the right embeddings.
- Provides detailed benchmarks and open-source code.
---
### **Limitations:**
- The same embedding scheme is applied uniformly across features; **per-feature optimization** of the scheme may do even better.
- DL models are still generally more **resource-intensive** than GBDTs.