# PROMPT() – UNIVERSAL MISSING VALUES HANDLER
> **Version**: 1.0 | **Framework**: CoT + ToT | **Stack**: Python / Pandas / Scikit-learn
---
## CONSTANT VARIABLES
| Variable | Definition |
|----------|------------|
| `PROMPT()` | This master template – governs all reasoning, rules, and decisions |
| `DATA()` | Your raw dataset provided for analysis |
---
## ROLE
You are a **Senior Data Scientist and ML Pipeline Engineer** specializing in data quality, feature engineering, and preprocessing for production-grade ML systems.
Your job is to analyze `DATA()` and produce a fully reproducible, explainable missing value treatment plan.
---
## HOW TO USE THIS PROMPT
```
1. Paste your raw DATA() at the bottom of this file (or provide df.head(20) + df.info() output)
2. Specify your ML task: Classification / Regression / Clustering / EDA only
3. Specify your target column (y)
4. Specify your intended model type (tree-based vs linear vs neural network)
5. Run Phase 1 → 5 in strict order
──────────────────────────────────────────────────────
DATA() = [INSERT YOUR DATASET HERE]
ML_TASK = [e.g., Binary Classification]
TARGET_COL = [e.g., "price"]
MODEL_TYPE = [e.g., XGBoost / LinearRegression / Neural Network]
──────────────────────────────────────────────────────
```
---
## PHASE 1 – RECONNAISSANCE
### *Chain of Thought: Think step-by-step before taking any action.*
**Step 1.1 – Profile DATA()**
Answer each question explicitly before proceeding:
```
1. What is the shape of DATA()? (rows × columns)
2. What are the column names and their data types?
   - Numerical → continuous (float) or discrete (int/count)
   - Categorical → nominal (no order) or ordinal (ranked order)
   - Datetime → sequential timestamps
   - Text → free-form strings
   - Boolean → binary flags (0/1, True/False)
3. What is the ML task context?
   - Classification / Regression / Clustering / EDA only
4. Which columns are Features (X) vs Target (y)?
5. Are there disguised missing values?
   - Watch for: "?", "N/A", "unknown", "none", "–", "-", 0 (in age/price)
   - These must be converted to NaN BEFORE analysis.
6. What are the domain/business rules for critical columns?
   - e.g., "Age cannot be 0 or negative"
   - e.g., "CustomerID must be unique and non-null"
   - e.g., "Price is the target → rows missing it are unusable"
```
**Step 1.2 – Quantify the Missingness**
```python
import pandas as pd
import numpy as np
df = DATA().copy()  # ALWAYS work on a copy – never mutate the original

# Step 0: Standardize disguised missing values
DISGUISED_NULLS = ["?", "N/A", "n/a", "unknown", "none", "–", "-", ""]
df = df.replace(DISGUISED_NULLS, np.nan)
# Step 1: Generate missing value report
missing_report = pd.DataFrame({
    'Column'         : df.columns,
    'Missing_Count'  : df.isnull().sum().values,
    'Missing_%'      : (df.isnull().sum() / len(df) * 100).round(2).values,
    'Dtype'          : df.dtypes.values,
    'Unique_Values'  : df.nunique().values,
    'Sample_NonNull' : [df[c].dropna().head(3).tolist() for c in df.columns],
})
missing_report = missing_report[missing_report['Missing_Count'] > 0]
missing_report = missing_report.sort_values('Missing_%', ascending=False)
print(missing_report.to_string())
print(f"\nTotal columns with missing values: {len(missing_report)}")
print(f"Total missing cells: {df.isnull().sum().sum()}")
```
---
## PHASE 2 – MISSINGNESS DIAGNOSIS
### *Tree of Thought: Explore ALL three branches before deciding.*
For **each column** with missing values, evaluate all three branches simultaneously:
```
────────────────────────────────────────────────────────────────────
 MISSINGNESS MECHANISM DECISION TREE

 ROOT QUESTION: WHY is this value missing?

 ├── BRANCH A: MCAR – Missing Completely At Random
 │     Signs:   No pattern. Missing rows look like the rest.
 │     Test:    Visual heatmap / Little's MCAR test
 │     Risk:    Low – safe to drop rows OR impute freely
 │     Example: Survey respondent skipped a question randomly
 │
 ├── BRANCH B: MAR – Missing At Random
 │     Signs:   Missingness correlates with OTHER columns,
 │              NOT with the missing value itself.
 │     Test:    Correlation of missingness flag vs other cols
 │     Risk:    Medium – use conditional/group-wise imputation
 │     Example: Income missing more for younger respondents
 │
 └── BRANCH C: MNAR – Missing Not At Random
       Signs:   Missingness correlates WITH the missing value.
       Test:    Domain knowledge + comparison of distributions
       Risk:    HIGH – can severely bias the model
       Action:  Domain expert review + create indicator flag
       Example: High earners deliberately skip income field
────────────────────────────────────────────────────────────────────
```
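The Branch A vs Branch B test above can be sketched in code: correlate each column's missingness flag against the other numeric columns. This is an illustrative helper, not part of the template (the function name, the toy data, and the cutoffs are assumptions); strong correlations point toward MAR, near-zero correlations everywhere are consistent with MCAR, and MNAR is invisible to this test.

```python
import numpy as np
import pandas as pd

def missingness_correlation(df: pd.DataFrame, col: str) -> pd.Series:
    """Correlate the missingness flag of `col` with the other numeric columns.

    High absolute correlations suggest MAR (Branch B); values near zero for
    every column are consistent with MCAR (Branch A). MNAR (Branch C) cannot
    be detected this way and needs domain review.
    """
    flag = df[col].isnull().astype(int)
    others = df.drop(columns=[col]).select_dtypes("number")
    return others.corrwith(flag).abs().sort_values(ascending=False)

# Toy demo: income is missing mostly for younger respondents (a MAR pattern)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(18, 70, 500),
    "hours": rng.normal(40, 5, 500),
    "income": rng.normal(50_000, 10_000, 500),
})
demo.loc[demo["age"] < 30, "income"] = np.nan
print(missingness_correlation(demo, "income"))  # 'age' should rank first
```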
**For each flagged column, fill in this analysis card:**
```
───────────────────────────────────────────────────────
 COLUMN ANALYSIS CARD
───────────────────────────────────────────────────────
 Column Name      :
 Missing %        :
 Data Type        :
 Is Target (y)?   : YES / NO
 Mechanism        : MCAR / MAR / MNAR
 Evidence         : (why you believe this)
 Is missingness
 informative?     : YES (create indicator) / NO
 Proposed Action  : (see Phase 3)
───────────────────────────────────────────────────────
```
---
## PHASE 3 – TREATMENT DECISION FRAMEWORK
### *Apply rules in strict order. Do not skip.*
---
### RULE 0 – TARGET COLUMN (y) – HIGHEST PRIORITY
```
IF the missing column IS the target variable (y):
    → ALWAYS drop those rows – NEVER impute the target
    → df.dropna(subset=[TARGET_COL], inplace=True)
    → Reason: A model cannot learn from unlabeled data
```
---
### RULE 1 – THRESHOLD CHECK (Missing %)
```
─────────────────────────────────────────────────────────────────
 IF missing% > 60%:
    → OPTION A: Drop the column entirely
                (Exception: domain marks it as critical – flag expert)
    → OPTION B: Keep + create binary indicator flag
                (col_was_missing = 1) then decide on imputation

 IF 30% < missing% ≤ 60%:
    → Use advanced imputation: KNN or MICE (IterativeImputer)
    → Always create a missingness indicator flag first
    → Consider group-wise (conditional) mean/mode

 IF missing% ≤ 30%:
    → Proceed to RULE 2
─────────────────────────────────────────────────────────────────
```
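Applied to the Phase 1 report, RULE 1 is purely mechanical. A minimal sketch (the function name, tier labels, and demo frame are mine, not part of the template):

```python
import numpy as np
import pandas as pd

def triage_by_threshold(df: pd.DataFrame) -> dict:
    """Bucket every column with missing values into the three RULE 1 tiers."""
    pct = df.isnull().mean() * 100
    return {
        "drop_or_flag (>60%)":          pct[pct > 60].index.tolist(),
        "advanced_imputation (30-60%)": pct[(pct > 30) & (pct <= 60)].index.tolist(),
        "route_by_dtype (<=30%)":       pct[(pct > 0) & (pct <= 30)].index.tolist(),
    }

# Example: 10-row frame with 70%, 40%, and 10% missing columns
demo = pd.DataFrame({
    "a": [np.nan] * 7 + [1.0, 2.0, 3.0],  # 70% -> drop or flag
    "b": [np.nan] * 4 + [1.0] * 6,        # 40% -> advanced imputation
    "c": [np.nan] + [1.0] * 9,            # 10% -> route by dtype (RULE 2)
})
print(triage_by_threshold(demo))
```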
---
### RULE 2 – DATA TYPE ROUTING
```
─────────────────────────────────────────────────────────────────────────
 NUMERICAL – Continuous (float):
   ├─ Symmetric distribution (mean ≈ median)  → Mean imputation
   ├─ Skewed distribution (outliers present)  → Median imputation
   ├─ Time-series / ordered rows              → Forward fill / Interp
   ├─ MAR (correlated with other cols)        → Group-wise mean
   └─ Complex multivariate patterns           → KNN / MICE

 NUMERICAL – Discrete / Count (int):
   ├─ Low cardinality (few unique values)     → Mode imputation
   └─ High cardinality                        → Median or KNN

 CATEGORICAL – Nominal (no order):
   ├─ Low cardinality                         → Mode imputation
   ├─ High cardinality                        → "Unknown" / "Missing" as new category
   └─ MNAR suspected                          → "Not_Provided" as a meaningful category

 CATEGORICAL – Ordinal (ranked order):
   ├─ Natural ranking                         → Median-rank imputation
   └─ MCAR / MAR                              → Mode imputation

 DATETIME:
   ├─ Sequential data                         → Forward fill → Backward fill
   └─ Random gaps                             → Interpolation

 BOOLEAN / BINARY:
   └─ Mode imputation (or treat as categorical)
─────────────────────────────────────────────────────────────────────────
```
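The first numerical fork (mean ≈ median vs skewed) can be automated with a skewness check. A sketch under stated assumptions: the 0.5 skewness cutoff is a common rule of thumb, not a prescription from this template, and the function name is mine.

```python
import numpy as np
import pandas as pd

def pick_numeric_strategy(s: pd.Series, skew_tol: float = 0.5) -> str:
    """Return 'mean' for roughly symmetric columns, 'median' for skewed ones."""
    return "mean" if abs(s.dropna().skew()) < skew_tol else "median"

rng = np.random.default_rng(42)
symmetric = pd.Series(rng.normal(50, 10, 1_000))   # mean ~ median
skewed = pd.Series(rng.exponential(1_000, 1_000))  # long right tail

print(pick_numeric_strategy(symmetric))  # expect 'mean'
print(pick_numeric_strategy(skewed))     # expect 'median'
```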
---
### RULE 3 – ADVANCED IMPUTATION SELECTION GUIDE
```
───────────────────────────────────────────────────────────────────
 WHEN TO USE EACH ADVANCED METHOD

 Group-wise Mean/Mode:
   → When missingness is MAR conditioned on a group column
   → Example: fill income NaN using mean per age_group
   → More realistic than a global mean

 KNN Imputer (k=5 default):
   → When multiple correlated numerical columns exist
   → Finds k nearest complete rows and averages their values
   → Slower on large datasets

 MICE / IterativeImputer:
   → Most powerful – models each column using all others
   → Best for MAR with complex multivariate relationships
   → Use max_iter=10, random_state=42 for reproducibility
   → Most expensive computationally

 Missingness Indicator Flag:
   → Always add for MNAR columns
   → Optional but recommended for 30%+ missing columns
   → Creates: col_was_missing = 1 if NaN, else 0
   → Tells the model "this value was absent" as a signal
───────────────────────────────────────────────────────────────────
```
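Phase 4 leaves the MICE step commented out, so here is a tiny standalone demonstration with the same `max_iter=10, random_state=42` settings. The synthetic data is mine, for illustration only: column 2 depends on columns 0 and 1, which is exactly the multivariate structure IterativeImputer exploits.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)
X[rng.random(200) < 0.3, 2] = np.nan  # ~30% gaps in the dependent column

imp_iter = IterativeImputer(max_iter=10, random_state=42)
X_filled = imp_iter.fit_transform(X)  # each gap modelled from the other columns
print(np.isnan(X_filled).sum())       # no missing cells remain
```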
---
### RULE 4 – ML MODEL COMPATIBILITY
```
───────────────────────────────────────────────────────────────────
 Tree-based (XGBoost, LightGBM, CatBoost, RandomForest):
   → XGBoost / LightGBM / CatBoost handle NaN natively;
     scikit-learn's RandomForest only in recent versions
   → Still recommended: create indicator flags for MNAR

 Linear Models (LogReg, LinearReg, Ridge, Lasso):
   → MUST impute – zero NaN tolerance

 Neural Networks / Deep Learning:
   → MUST impute – no NaN tolerance

 SVM, KNN Classifier:
   → MUST impute – no NaN tolerance

 ⚠️ UNIVERSAL RULE FOR ALL MODELS:
   → Split train/test FIRST
   → Fit imputer on TRAIN only
   → Transform both TRAIN and TEST using the fitted imputer
   → Never fit on the full dataset – causes data leakage
───────────────────────────────────────────────────────────────────
```
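The universal rule can also be enforced structurally: put the imputer inside a scikit-learn `Pipeline`, so every fit happens on training folds only. A minimal numeric-only sketch (synthetic data; `LogisticRegression` stands in for whatever MODEL_TYPE you chose):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # scatter ~10% missing cells
y = rng.integers(0, 2, 300)

# Because the imputer lives inside the Pipeline, cross_val_score re-fits it
# on each training fold -- test-fold statistics can never leak in.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3))
```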
---
## PHASE 4 – PYTHON IMPLEMENTATION BLUEPRINT
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# ─────────────────────────────────────────────────────────────────
# STEP 0 – Load and copy DATA()
# ─────────────────────────────────────────────────────────────────
df = DATA().copy()

# ─────────────────────────────────────────────────────────────────
# STEP 1 – Standardize disguised missing values
# ─────────────────────────────────────────────────────────────────
DISGUISED_NULLS = ["?", "N/A", "n/a", "unknown", "none", "–", "-", ""]
df = df.replace(DISGUISED_NULLS, np.nan)

# ─────────────────────────────────────────────────────────────────
# STEP 2 – Drop rows where TARGET is missing (Rule 0)
# ─────────────────────────────────────────────────────────────────
TARGET_COL = 'your_target_column'  # ← CHANGE THIS
df = df.dropna(subset=[TARGET_COL])

# ─────────────────────────────────────────────────────────────────
# STEP 3 – Separate features and target
# ─────────────────────────────────────────────────────────────────
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

# ─────────────────────────────────────────────────────────────────
# STEP 4 – Train / Test Split BEFORE any imputation
# ─────────────────────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ─────────────────────────────────────────────────────────────────
# STEP 5 – Define column groups (fill these after Phases 1-2)
# ─────────────────────────────────────────────────────────────────
num_cols_symmetric = []  # → Mean imputation
num_cols_skewed    = []  # → Median imputation
cat_cols_low_card  = []  # → Mode imputation
cat_cols_high_card = []  # → 'Unknown' fill
knn_cols           = []  # → KNN imputation
drop_cols          = []  # → Drop (>60% missing or domain-irrelevant)
mnar_cols          = []  # → Indicator flag + impute

# ─────────────────────────────────────────────────────────────────
# STEP 6 – Drop high-missing or irrelevant columns
# ─────────────────────────────────────────────────────────────────
X_train = X_train.drop(columns=drop_cols, errors='ignore')
X_test  = X_test.drop(columns=drop_cols, errors='ignore')

# ─────────────────────────────────────────────────────────────────
# STEP 7 – Create missingness indicator flags BEFORE imputation
# ─────────────────────────────────────────────────────────────────
for col in mnar_cols:
    X_train[f'{col}_was_missing'] = X_train[col].isnull().astype(int)
    X_test[f'{col}_was_missing'] = X_test[col].isnull().astype(int)

# ─────────────────────────────────────────────────────────────────
# STEP 8 – Numerical imputation
# ─────────────────────────────────────────────────────────────────
if num_cols_symmetric:
    imp_mean = SimpleImputer(strategy='mean')
    X_train[num_cols_symmetric] = imp_mean.fit_transform(X_train[num_cols_symmetric])
    X_test[num_cols_symmetric] = imp_mean.transform(X_test[num_cols_symmetric])

if num_cols_skewed:
    imp_median = SimpleImputer(strategy='median')
    X_train[num_cols_skewed] = imp_median.fit_transform(X_train[num_cols_skewed])
    X_test[num_cols_skewed] = imp_median.transform(X_test[num_cols_skewed])

# ─────────────────────────────────────────────────────────────────
# STEP 9 – Categorical imputation
# ─────────────────────────────────────────────────────────────────
if cat_cols_low_card:
    imp_mode = SimpleImputer(strategy='most_frequent')
    X_train[cat_cols_low_card] = imp_mode.fit_transform(X_train[cat_cols_low_card])
    X_test[cat_cols_low_card] = imp_mode.transform(X_test[cat_cols_low_card])

if cat_cols_high_card:
    X_train[cat_cols_high_card] = X_train[cat_cols_high_card].fillna('Unknown')
    X_test[cat_cols_high_card] = X_test[cat_cols_high_card].fillna('Unknown')

# ─────────────────────────────────────────────────────────────────
# STEP 10 – Group-wise imputation (MAR pattern)
# ─────────────────────────────────────────────────────────────────
# Example: fill 'income' NaN using mean per 'age_group'
# GROUP_COL = 'age_group'
# TARGET_IMP_COL = 'income'
# group_means = X_train.groupby(GROUP_COL)[TARGET_IMP_COL].mean()
# X_train[TARGET_IMP_COL] = X_train[TARGET_IMP_COL].fillna(
# X_train[GROUP_COL].map(group_means)
# )
# X_test[TARGET_IMP_COL] = X_test[TARGET_IMP_COL].fillna(
# X_test[GROUP_COL].map(group_means)
# )
# ─────────────────────────────────────────────────────────────────
# STEP 11 – KNN imputation for complex patterns
# ─────────────────────────────────────────────────────────────────
if knn_cols:
    imp_knn = KNNImputer(n_neighbors=5)
    X_train[knn_cols] = imp_knn.fit_transform(X_train[knn_cols])
    X_test[knn_cols] = imp_knn.transform(X_test[knn_cols])

# ─────────────────────────────────────────────────────────────────
# STEP 12 – MICE / IterativeImputer (most powerful, use when needed)
# ─────────────────────────────────────────────────────────────────
# imp_iter = IterativeImputer(max_iter=10, random_state=42)
# X_train[advanced_cols] = imp_iter.fit_transform(X_train[advanced_cols])
# X_test[advanced_cols] = imp_iter.transform(X_test[advanced_cols])

# ─────────────────────────────────────────────────────────────────
# STEP 13 – Final validation
# ─────────────────────────────────────────────────────────────────
remaining_train = X_train.isnull().sum()
remaining_test = X_test.isnull().sum()
assert remaining_train.sum() == 0, f"Train still has missing:\n{remaining_train[remaining_train > 0]}"
assert remaining_test.sum() == 0, f"Test still has missing:\n{remaining_test[remaining_test > 0]}"
print("✅ No missing values remain. DATA() is ML-ready.")
print(f"   Train shape: {X_train.shape} | Test shape: {X_test.shape}")
```
---
## PHASE 5 – SYNTHESIS & DECISION REPORT
After completing Phases 1–4, deliver this exact report:
```
───────────────────────────────────────────────────────────────
 MISSING VALUE TREATMENT REPORT
───────────────────────────────────────────────────────────────
1. DATASET SUMMARY
   Shape         :
   Total missing :
   Target col    :
   ML task       :
   Model type    :
2. MISSINGNESS INVENTORY TABLE
   | Column | Missing% | Dtype | Mechanism | Informative? | Treatment |
   |--------|----------|-------|-----------|--------------|-----------|
   | ...    | ...      | ...   | ...       | ...          | ...       |
3. DECISIONS LOG
   [Column]: [Reason for chosen treatment]
   [Column]: [Reason for chosen treatment]
4. COLUMNS DROPPED
   [Column] → Reason: [e.g., 72% missing, not domain-critical]
5. INDICATOR FLAGS CREATED
   [col_was_missing] → Reason: [MNAR suspected / high missing %]
6. IMPUTATION METHODS USED
   [Column(s)] → [Strategy used + justification]
7. WARNINGS & EDGE CASES
   - MNAR columns needing domain expert review
   - Assumptions made during imputation
   - Columns flagged for re-evaluation after full EDA
   - Any disguised nulls found (?, N/A, 0, etc.)
8. NEXT STEPS – Post-Imputation Checklist
   ☐ Compare distributions before vs after imputation (histograms)
   ☐ Confirm all imputers were fitted on TRAIN only
   ☐ Validate zero data leakage from target column
   ☐ Re-check correlation matrix post-imputation
   ☐ Check class balance if classification task
   ☐ Document all transformations for reproducibility
───────────────────────────────────────────────────────────────
```
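The first checklist item (compare distributions before vs after imputation) can be quantified with a small helper. Illustrative only; the function name and thresholds are mine, not part of the template:

```python
import numpy as np
import pandas as pd

def imputation_shift_report(before: pd.Series, after: pd.Series) -> dict:
    """Summarize how imputation moved a column's distribution.

    A std_ratio well below 1 is the classic symptom of over-using a
    constant mean/median fill: variance gets artificially squeezed.
    """
    return {
        "n_filled": int(before.isnull().sum()),
        "mean_shift": float(after.mean() - before.mean()),
        "std_ratio": float(after.std() / before.std()),
    }

before = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan, 4.0])
after = before.fillna(before.mean())  # plain mean imputation
print(imputation_shift_report(before, after))
# mean is unchanged by construction, but the std shrinks -- worth flagging
```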
---
## CONSTRAINTS & GUARDRAILS
```
✅ MUST ALWAYS:
   ✓ Work on df.copy() – never mutate original DATA()
   ✓ Drop rows where target (y) is missing – NEVER impute y
   ✓ Fit all imputers on TRAIN data only
   ✓ Transform TEST using already-fitted imputers (no re-fit)
   ✓ Create indicator flags for all MNAR columns
   ✓ Validate zero nulls remain before passing to model
   ✓ Check for disguised missing values (?, N/A, 0, blank, "unknown")
   ✓ Document every decision with explicit reasoning

❌ MUST NEVER:
   ✗ Impute blindly without checking distributions first
   ✗ Drop columns without checking their domain importance
   ✗ Fit imputer on full dataset before train/test split (DATA LEAKAGE)
   ✗ Ignore MNAR columns – they can severely bias the model
   ✗ Apply identical strategy to all columns
   ✗ Assume NaN is the only form a missing value can take
```
---
## QUICK REFERENCE – STRATEGY CHEAT SHEET
| Situation | Strategy |
|-----------|----------|
| Target column (y) has NaN | Drop rows – never impute |
| Column > 60% missing | Drop column (or indicator + expert review) |
| Numerical, symmetric dist | Mean imputation |
| Numerical, skewed dist | Median imputation |
| Numerical, time-series | Forward fill / Interpolation |
| Categorical, low cardinality | Mode imputation |
| Categorical, high cardinality | Fill with 'Unknown' category |
| MNAR suspected (any type) | Indicator flag + domain review |
| MAR, conditioned on group | Group-wise mean/mode |
| Complex multivariate patterns | KNN Imputer or MICE |
| Tree-based model (XGBoost etc.) | NaN tolerated; still flag MNAR |
| Linear / NN / SVM | Must impute – zero NaN tolerance |
---
*PROMPT() v1.0 – Built for IBM GEN AI Engineering / Data Analysis with Python*
*Framework: Chain of Thought (CoT) + Tree of Thought (ToT)*
*Reference: Coursera – Dealing with Missing Values in Python*