**Authors:** Yury Gorishniy, Ivan Rubachev, Artem Babenko
**Published:** NeurIPS 2022
**Link:** [GitHub](https://github.com/yandex-research/tabular-dl-num-embeddings)
---
### **What is the Paper About?**
This paper investigates how **numerical feature embeddings** affect the performance of **deep learning (DL)** models for **tabular data**. While past research focused on architectures (MLP, ResNet, Transformer), this paper shows that **how you represent numerical features matters a lot**, often more than the model itself.
---
### **Key Points:**
- **The Problem:**
  Most DL models convert numerical features to high-dimensional embeddings using simple functions (e.g., a single linear layer), which may limit performance.
- **Two Embedding Approaches Proposed:**
1. **Piecewise Linear Encoding (PLE):**
   - Splits each feature's range into bins (via quantiles or target-aware decision trees) and encodes a value componentwise: bins below the value get 1, the bin containing it gets the fractional position within that bin, and bins above get 0.
   - Simple, interpretable, and fast; a minimal sketch follows this list.
2. **Periodic Embeddings:**
   - Encodes each value with sine and cosine functions whose frequency coefficients are trainable (see the sketch after the key points).
   - Inspired by positional encodings in Transformers.
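
A minimal NumPy sketch of the quantile-based PLE variant, under my reading of the paper; the function name and variables are placeholders, not the reference implementation's API:

```python
import numpy as np

def ple_encode(x, edges):
    """Piecewise linear encoding of raw values `x` against bin `edges`.

    x:     [n] array of feature values.
    edges: sorted [T + 1] array of bin boundaries.
    Returns [n, T]: 1 for bins entirely below x, the fractional position
    for the bin containing x, and 0 for bins entirely above x.
    """
    left, right = edges[:-1], edges[1:]
    frac = (x[:, None] - left) / (right - left)  # position inside each bin
    return np.clip(frac, 0.0, 1.0)               # clamps to the 1 / frac / 0 pattern

# Bins from training-set quantiles (the unsupervised variant; the paper
# also builds target-aware bins from decision-tree splits):
rng = np.random.default_rng(0)
x_train = rng.normal(size=1000)
edges = np.quantile(x_train, np.linspace(0.0, 1.0, 9))  # 8 equal-mass bins
encoded = ple_encode(x_train, edges)                    # shape: (1000, 8)
```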
- **General Framework:**
- Embeddings are created **per feature**, without sharing parameters.
- Works with MLP, ResNet, and Transformers.
- **Results:**
- With proper embeddings, **MLPs can match or outperform Transformers**.
- DL models with good embeddings can **compete with or beat GBDTs** (like XGBoost, [[CatBoost]]) on “GBDT-friendly” datasets.
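
As a companion to the PLE sketch, here is a hedged PyTorch sketch of a periodic embedding for a single feature, assuming trainable frequencies initialized from a normal distribution with tunable scale `sigma`; the class name and default sizes are mine:

```python
import torch
import torch.nn as nn

class PeriodicEmbedding(nn.Module):
    """Sine/cosine embedding of one numerical feature with trainable
    frequency coefficients (initialization scale `sigma` is a hyperparameter)."""

    def __init__(self, k: int = 16, sigma: float = 1.0):
        super().__init__()
        self.coef = nn.Parameter(sigma * torch.randn(k))  # trainable frequencies

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch] scalar feature -> output: [batch, 2 * k]
        v = 2 * torch.pi * self.coef * x[:, None]
        return torch.cat([torch.sin(v), torch.cos(v)], dim=-1)

emb = PeriodicEmbedding(k=8, sigma=0.5)
print(emb(torch.randn(32)).shape)  # torch.Size([32, 16])
```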
---
### **Performance Highlights:**
- **MLP + PLR** (Periodic → Linear → ReLU) and **MLP + T-LR** (target-aware PLE → Linear → ReLU) gave **state-of-the-art results**; a PLR sketch follows this list.
- Embedding techniques were critical on datasets like California Housing and Adult Income.
- Embeddings improved DL model robustness to preprocessing choices (e.g., scaling, normalization).
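
To make the PLR naming concrete: the letters spell out the per-feature pipeline (periodic encoding, then a linear layer, then ReLU). A self-contained sketch of such a module and its per-feature application, with hypothetical names and sizes, might look like this:

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """One feature's PLR embedding: Periodic -> Linear -> ReLU."""

    def __init__(self, k: int = 16, d_embedding: int = 32, sigma: float = 1.0):
        super().__init__()
        self.coef = nn.Parameter(sigma * torch.randn(k))  # trainable frequencies
        self.linear = nn.Linear(2 * k, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = 2 * torch.pi * self.coef * x[:, None]                   # [batch, k]
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)  # [batch, 2k]
        return torch.relu(self.linear(periodic))                    # [batch, d_embedding]

# One module per feature (no parameter sharing), outputs concatenated
# and fed to the backbone (MLP, ResNet, or Transformer):
n_features, batch = 5, 64
embedders = nn.ModuleList(PLREmbedding() for _ in range(n_features))
x = torch.randn(batch, n_features)
tokens = torch.cat([e(x[:, i]) for i, e in enumerate(embedders)], dim=-1)
# tokens: [64, 5 * 32], ready for an MLP head
```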
---
### **Strengths:**
- Establishes **embeddings as a critical design choice** for tabular DL.
- Shows **simple DL models can be very competitive** with the right embeddings.
- Provides detailed benchmarks and open-source code.
---
### **Limitations:**
- The same embedding scheme is applied uniformly across features; **per-feature optimization** of the scheme may do even better.
- DL models are still generally more **resource-intensive** than GBDTs.