ML practitioners spend up to 60% of their time on feature engineering (Airbnb data). The same feature — "customer's average purchase value over the last 30 days" — is independently computed by three different teams with slightly different SQL, producing divergent values. The offline (batch training) computation uses a different code path than the online (real-time inference) computation, causing silent training-serving skew that is "extremely hard to debug." Features are not discoverable: GoJek found 10 different versions of the same feature being independently maintained. There is no point-in-time correctness guarantee, allowing data leakage from future information during training.
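The "slightly different SQL" failure mode is easy to reproduce. The sketch below (hypothetical data and function names, illustrative only) shows two teams computing "average purchase value over the last 30 days" that differ only in whether the window boundary is inclusive — enough to produce divergent values for the same customer:

```python
from datetime import date, timedelta

# Hypothetical purchase history: (purchase_date, amount)
purchases = [
    (date(2024, 3, 1), 100.0),
    (date(2024, 3, 15), 50.0),
    (date(2024, 3, 31), 30.0),
]

def avg_purchase_team_a(purchases, as_of):
    """Team A: last 30 days, window boundary inclusive."""
    cutoff = as_of - timedelta(days=30)
    window = [amt for d, amt in purchases if cutoff <= d <= as_of]
    return sum(window) / len(window) if window else 0.0

def avg_purchase_team_b(purchases, as_of):
    """Team B: same metric on paper, but boundary exclusive."""
    cutoff = as_of - timedelta(days=30)
    window = [amt for d, amt in purchases if cutoff < d <= as_of]
    return sum(window) / len(window) if window else 0.0

as_of = date(2024, 3, 31)  # cutoff lands exactly on the Mar 1 purchase
print(avg_purchase_team_a(purchases, as_of))  # 60.0 (includes Mar 1)
print(avg_purchase_team_b(purchases, as_of))  # 40.0 (drops Mar 1)
```

A single shared feature definition removes this class of bug: there is one window specification, not three teams' interpretations of it.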
A feature store maintains two synchronized stores: an offline store (typically a columnar format on the lakehouse — Delta Lake, Parquet) for batch retrieval of training data, and an online store (Redis, DynamoDB, Cassandra) for low-latency serving at real-time inference. A single feature definition drives the materialization pipelines for both stores, keeping offline and online values consistent. Point-in-time joins ensure training data reflects only what was known at prediction time. A feature registry enables discovery and reuse across teams. Transformation pipelines can be written in Python, Spark, or SQL — the feature store handles scheduling and backfill.
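The point-in-time join semantics can be sketched with `pandas.merge_asof`: for each training label, take the latest feature value whose timestamp is at or before the label's timestamp. This is a minimal illustration with invented data, not a feature-store implementation — production systems (Feast, Tecton, etc.) apply the same logic at scale:

```python
import pandas as pd

# Hypothetical feature table: each row records when a feature value
# became known (event_timestamp).
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
    "avg_purchase_30d": [40.0, 55.0, 20.0],
})

# Hypothetical training labels: each row is a prediction event.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "label_timestamp": pd.to_datetime(["2024-01-10", "2024-02-10", "2024-01-10"]),
    "churned": [0, 1, 0],
})

# Point-in-time join: direction="backward" picks the most recent feature
# value with event_timestamp <= label_timestamp, per customer, so no
# future information leaks into the training set.
training = pd.merge_asof(
    labels.sort_values("label_timestamp"),
    features.sort_values("event_timestamp"),
    left_on="label_timestamp",
    right_on="event_timestamp",
    by="customer_id",
    direction="backward",
)
print(training[["customer_id", "label_timestamp", "avg_purchase_30d"]])
```

Note that customer 2's label on 2024-01-10 gets no feature value (NaN) rather than the 2024-01-15 value — that value did not yet exist at prediction time, and silently using it would be exactly the data leakage the pattern guards against.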
Feature store platform (Feast / Tecton / Hopsworks / Vertex AI Feature Store / SageMaker Feature Store / Databricks Feature Store) + offline store (Delta Lake / Parquet on S3) + online store (Redis / DynamoDB / Cassandra) + feature pipeline orchestration + feature discovery UI + drift monitoring.
Modular, version-controlled SQL transformations executed inside the warehouse, bringing software engineering practices to analytics code.
Feature pipelines depend on clean, modeled data from the transformation layer.
Unified data lake + warehouse architecture on open-format object storage, eliminating copy pipelines and providing ACID semantics at petabyte scale.
The feature store's offline store typically sits on top of the lakehouse.