Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival

Vilardell, Mireia; Buxo, Maria; Cleries, Ramon; Martinez, Jose Miguel; Garcia, Gemma; Ameijide, Alberto; Font, Rebeca; Civit, Sergi; Marcos-Gragera, Rafael; Vilardell, Maria Loreto; Carulla, Maria; Espinas, Josep Alfons; Galceran, Jaume; Izquierdo, Angel;

ARTIFICIAL INTELLIGENCE IN MEDICINE
2020
Volume 107
Abstract
Background: Two common issues may arise in certain population-based breast cancer (BC) survival studies: (i) missing values in a variable predictive of survival, such as "Stage" at diagnosis, and (ii) small sample sizes caused by the "class imbalance problem" in certain subsets of patients. Both issues call for data modeling/simulation methods. Methods: We present a procedure, ModGraProDep, based on graphical modeling (GM) of a dataset to overcome these two issues. The performance of the models derived from ModGraProDep is compared with a set of frequently used classification and machine learning algorithms (Missing Data Problem) and with oversampling algorithms (Synthetic Data Simulation). For the Missing Data Problem we assessed two scenarios: missing completely at random (MCAR) and missing not at random (MNAR). Two validated BC datasets provided by the cancer registries of Girona and Tarragona (northeastern Spain) were used. Results: In both the MCAR and MNAR scenarios, all models showed poorer prediction performance than three GM models: the saturated one (GM.SAT) and two with penalty factors on the partial likelihood (GM.K1 and GM.TEST). However, GM.SAT predictions could lead to unreliable conclusions in BC survival analysis. Simulating a "synthetic" dataset derived from GM.SAT could be the worst strategy, but using the remaining GM models could be better than oversampling. Conclusion: Our results suggest using the presented GM procedure for one-variable imputation/prediction of missing data and for simulating "synthetic" BC survival datasets. The "synthetic" datasets derived from GMs could also be used in clinical applications of cancer survival data, such as predictive risk analysis.
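
The difference between the two missing-data scenarios named in the abstract (MCAR and MNAR) can be illustrated with a minimal sketch. This is not the authors' ModGraProDep procedure: the function names, the four stage labels, and every missingness rate below are hypothetical values chosen only for illustration, and the data are randomly generated rather than taken from the Girona or Tarragona registries.

```python
import random

# Hypothetical stage labels; the actual coding in the registries may differ.
STAGES = ["I", "II", "III", "IV"]


def make_stages(n, seed=0):
    """Draw n toy 'Stage' values uniformly at random (illustrative only)."""
    rng = random.Random(seed)
    return [rng.choice(STAGES) for _ in range(n)]


def mask_mcar(values, rate, seed=0):
    """MCAR: every value has the same probability of being set to missing."""
    rng = random.Random(seed)
    return [None if rng.random() < rate else v for v in values]


def mask_mnar(values, rate_by_value, seed=0):
    """MNAR: the probability of being missing depends on the value itself."""
    rng = random.Random(seed)
    return [None if rng.random() < rate_by_value[v] else v for v in values]


def missing_share(original, masked):
    """Fraction of masked entries within each original category."""
    shares = {}
    for stage in set(original):
        idx = [i for i, v in enumerate(original) if v == stage]
        shares[stage] = sum(masked[i] is None for i in idx) / len(idx)
    return shares


stages = make_stages(10_000, seed=1)

# Assumed rates: 20% everywhere for MCAR, rising with stage for MNAR.
mcar = mask_mcar(stages, rate=0.2, seed=2)
mnar = mask_mnar(stages, {"I": 0.05, "II": 0.10, "III": 0.30, "IV": 0.50}, seed=2)

mcar_share = missing_share(stages, mcar)
mnar_share = missing_share(stages, mnar)

for stage in STAGES:
    print(stage, round(mcar_share[stage], 3), round(mnar_share[stage], 3))
```

Under MCAR the share of missing values is roughly the same in every stage, whereas under MNAR it varies with the unobserved stage itself, which is why an imputation method can behave differently in the two scenarios.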
