Counterfactual Data-Fusion for Online Reinforcement Learners
DSA ADS Course - 2021
Causal Reinforcement Learning, Counterfactuals, Counterfactual Data-Fusion, Online Reinforcement Learners
Discusses counterfactuals, the risks of data fusion, causal reinforcement learning, and the fusion of observational and experimental data.
Counterfactual Data-Fusion for Online Reinforcement Learners - June, 2017
The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and received rewards (Bareinboim et al., 2015). Recent findings showed that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data); if UCs are naively averaged out, these algorithms behave sub-optimally, possibly incurring infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support its efficacy with extensive simulations.
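To make the role of the unobserved confounder concrete, the following is a minimal sketch of a two-armed MABUC, not the paper's enhanced Thompson Sampling player. It compares a standard Beta-Bernoulli Thompson Sampling agent, which averages the confounder out, with a variant that keeps a separate posterior for each value of the agent's intent (the arm it would naturally have chosen). The payout table `PAYOUT`, the function `run_bandit`, and the assumption that intent equals the confounder are all hypothetical and chosen only for illustration; the sketch also omits the fusion with observational data that the abstract describes.

```python
import random

# Hypothetical payout table (not from the paper): P(reward = 1 | u, arm).
PAYOUT = {0: (0.2, 0.8), 1: (0.8, 0.2)}


def run_bandit(use_intent, rounds=5000, seed=0):
    """Return the average reward of a Beta-Bernoulli Thompson Sampling player."""
    rng = random.Random(seed)
    wins, losses = {}, {}  # counts keyed by (context, arm)
    total = 0
    for _ in range(rounds):
        u = rng.randint(0, 1)  # unobserved confounder, never shown to the agent
        intent = u             # assumption: the agent's natural choice tracks u
        # The intent-aware player conditions on intent; the standard one does not.
        ctx = intent if use_intent else None
        # Draw one sample per arm from its Beta(wins + 1, losses + 1) posterior.
        draws = [
            rng.betavariate(wins.get((ctx, arm), 0) + 1,
                            losses.get((ctx, arm), 0) + 1)
            for arm in (0, 1)
        ]
        arm = 0 if draws[0] >= draws[1] else 1
        reward = 1 if rng.random() < PAYOUT[u][arm] else 0
        key = (ctx, arm)
        if reward:
            wins[key] = wins.get(key, 0) + 1
        else:
            losses[key] = losses.get(key, 0) + 1
        total += reward
    return total / rounds


standard = run_bandit(use_intent=False)     # averages the confounder out
intent_aware = run_bandit(use_intent=True)  # conditions on the agent's intent
```

Under this hypothetical table, each arm pays 0.5 on average once the confounder is averaged out, so the standard player has nothing to learn, while always following intent pays only 0.2. Conditioning on intent lets the player discover that choosing against its intent pays 0.8, which is the kind of gain the counterfactual approach is meant to capture.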