Data leakage in machine learning occurs when a model is trained on information that would not be available at the time of a real-world prediction. This unintended access to future or external data during training can make the model appear highly accurate in development but perform poorly once deployed, producing inaccurate predictions and unreliable insights.
Examples of Data Leakage:
- Target leakage: Including a feature that encodes the outcome itself, such as using a “payment status” field to predict loan approvals; payment status only exists after a loan has been issued, so it leaks the answer into training.
- Data split leakage: Letting information cross between the training and testing sets, for example through overlapping data points or preprocessing fit on the full dataset, so the model is evaluated on data it has effectively already seen (see the sketch after this list).
- External feature leakage: Incorporating external variables (e.g., recorded weather observations) that are only known after the fact and therefore aren’t accessible during real-time prediction.
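The sketch below (Python with scikit-learn, using synthetic data) illustrates one common form of data split leakage: fitting a preprocessing step on the full dataset before splitting, which lets test-set statistics leak into training. The dataset and model choices here are illustrative assumptions, not a prescribed setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# LEAKY: the scaler is fit on all rows before the split, so the mean and
# variance of the held-out test data influence the training features.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# LEAK-FREE: split first, then fit all preprocessing inside a pipeline
# so the scaler only ever sees the training rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
clean_score = model.fit(X_tr, y_tr).score(X_te, y_te)

print(f"leaky: {leaky_score:.3f}  leak-free: {clean_score:.3f}")
```

Keeping every data-dependent transformation inside a pipeline that is fit only on the training split is the standard way to prevent this class of leakage.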
Data leakage undermines the integrity of machine learning models and can result in false confidence in their performance, leading to flawed decision-making.