Analysis of multivariate longitudinal and survival data: From joint models to random forests

Abstract: Health studies usually involve the collection and analysis of variables repeatedly measured over time. These include exposures (e.g., treatment, blood pressure, nutrition) and markers of progression (e.g., brain volumes, blood tests, cognitive functioning, tumor size). Modeling how these variables are associated with clinical endpoints such as death in survival models raises statistical challenges. In particular, the repeated measurements are imperfect observations of the underlying continuous-time processes of interest: they are measured with error and only at sparse visit times. Neglecting these characteristics may lead to biased associations with clinical endpoints. A dedicated solution is the joint analysis of the longitudinal processes and the time-to-event in so-called joint models (1). This methodology, now available in many software packages, can be easily applied. However, it reaches numerical limits when the number of repeated variables increases substantially (2). In this talk, I first introduce the specific features of longitudinal data collected in health studies and show through simulations how naive techniques may lead to incorrect inference. Then I describe the methodology of joint models for longitudinal and survival data (1,3). Finally, I present how this methodology can be incorporated into random survival forests to handle a large number of longitudinal variables when the goal is to predict a time-to-event (potentially with multiple causes) and to identify the most important predictors (4). Throughout the talk, I illustrate the methods with examples from epidemiological cohorts, notably in cerebral aging research to predict the risk of Alzheimer’s disease.
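
As a rough illustration of the measurement-error point made in the abstract, the sketch below simulates a marker observed with noise and compares the association estimated from the true marker with the one estimated naively from the noisy measurement. It is not the simulation design used in the talk: the exponential hazard, the absence of censoring, the single measurement per subject, the noise level, and the helper name `fit_exponential_regression` are all assumptions made for this example.

```python
import numpy as np

def fit_exponential_regression(x, t, n_iter=25):
    """Newton-Raphson MLE for an exponential hazard exp(a + b * x), no censoring.

    Hypothetical helper written for this illustration only.
    """
    X = np.column_stack([np.ones_like(x), x])
    theta = np.zeros(2)  # (intercept a, slope b)
    for _ in range(n_iter):
        rate = np.exp(X @ theta)
        # Log-likelihood: sum(log(rate) - t * rate)
        score = X.T @ (1.0 - t * rate)
        hessian = -(X * (t * rate)[:, None]).T @ X
        theta = theta - np.linalg.solve(hessian, score)
    return theta

rng = np.random.default_rng(0)
n = 20000
true_beta = 0.5  # assumed log hazard ratio per unit of the true marker

# True (unobserved) marker value and its error-prone measurement.
marker = rng.normal(size=n)
observed = marker + rng.normal(scale=1.0, size=n)  # assumed noise SD of 1

# Event times depend on the true marker, not on the noisy measurement.
event_time = rng.exponential(scale=1.0 / np.exp(true_beta * marker))

beta_true_marker = fit_exponential_regression(marker, event_time)[1]
beta_naive = fit_exponential_regression(observed, event_time)[1]

print(f"true log hazard ratio:       {true_beta:.3f}")
print(f"estimate using true marker:  {beta_true_marker:.3f}")
print(f"naive estimate (noisy data): {beta_naive:.3f}")
```

Under these assumptions the naive estimate should be pulled toward zero relative to the estimate based on the true marker, which is the kind of bias that joint models are designed to correct by modeling the underlying marker trajectory rather than the raw measurements.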