# configure plotting
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8, 5)
from matplotlib import pyplot as plt
This notebook shows how to implement a horseshoe prior model in PyMC3, a probabilistic programming framework for Python. We consider a toy dataset and fit a penalised logistic regression model using the horseshoe prior. First we introduce the logistic regression model and three formulations of the prior. Then we introduce the toy data we wish to model: it contains a sparse set of salient features which we hope the model will recover. Finally we implement the different formulations in PyMC3.
Logistic regression can be written as

$$ p(y_i = 1 \mid x_i) = \mathrm{logit}^{-1}\!\left(x_i^\top \beta + \beta_0\right). $$

Priors need to be placed on $\beta$ and $\beta_0$. The horseshoe prior on $\beta$ can be written as

$$ \beta_j \mid \lambda_j, \tau \sim \mathcal{N}\!\left(0, \tau^2 \lambda_j^2\right), \qquad \lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad \tau \sim \mathrm{C}^{+}(0, 1), $$

as shown in http://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf (Equation 1).
The half-Cauchy can be written instead as a scale mixture of inverse-gamma distributions,

$$ \lambda_j^2 \mid \epsilon_j \sim \text{Inv-Gamma}\!\left(\tfrac{1}{2}, \tfrac{1}{\epsilon_j}\right), \qquad \epsilon_j \sim \text{Inv-Gamma}\!\left(\tfrac{1}{2}, 1\right), $$

which implies $\lambda_j \sim \mathrm{C}^{+}(0, 1)$ (the global scale $\tau$ is handled the same way via an auxiliary variable $\xi$), as shown in Equation 26 of https://arxiv.org/pdf/1611.06649.pdf.
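As a quick sanity check (this snippet is an addition, not part of the model code below), we can draw from the inverse-gamma mixture with scipy and confirm that the quantiles of the implied scale match those of a standard half-Cauchy:

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
n = 200000
# eps ~ Inv-Gamma(1/2, 1), lambda^2 | eps ~ Inv-Gamma(1/2, 1/eps)
eps = stats.invgamma.rvs(a=0.5, scale=1.0, size=n, random_state=rng)
lam2 = stats.invgamma.rvs(a=0.5, scale=1.0/eps, random_state=rng)
lam_mixture = np.sqrt(lam2)
# direct half-Cauchy draws for comparison
lam_halfcauchy = stats.halfcauchy.rvs(size=n, random_state=rng)
# the distribution is heavy-tailed, so compare quantiles rather than means
print(np.percentile(lam_mixture, [25, 50, 75, 90]))
print(np.percentile(lam_halfcauchy, [25, 50, 75, 90]))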
Alternatively, the half-Cauchy scales can be written in non-centred form as

$$ \lambda_j = r_j \sqrt{\rho_j}, \qquad r_j \sim \mathcal{N}(0, 1), \qquad \rho_j \sim \text{Inv-Gamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right), $$

so that $\lambda_j$ follows a Student-$t$ distribution with $\nu$ degrees of freedom, whose magnitude is half-Cauchy when $\nu = 1$; the global scale $\tau$ is constructed in the same way. This is shown in the appendix of "On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior" (Piironen and Vehtari, 2017). See also Equation 5 of https://arxiv.org/pdf/1705.10388.pdf, "Model Selection in Bayesian Neural Networks via Horseshoe Priors" (Ghosh and Doshi-Velez, 2017), and https://arxiv.org/pdf/1602.03807.pdf, "Variational Inference for Sparse and Undirected Models" (Ingraham and Marks, 2017).
This is an example of a non-centred parameterisation.
See also this blog post http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/
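The same kind of Monte Carlo check as above (again an addition, with illustrative variable names) confirms the non-centred construction: with ν = 1, the magnitude of r√ρ matches a half-Cauchy.

import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
n, ν = 200000, 1
r = rng.randn(n)
ρ = stats.invgamma.rvs(a=0.5*ν, scale=0.5*ν, size=n, random_state=rng)
lam = np.abs(r) * np.sqrt(ρ)
lam_halfcauchy = stats.halfcauchy.rvs(size=n, random_state=rng)
print(np.percentile(lam, [25, 50, 75, 90]))
print(np.percentile(lam_halfcauchy, [25, 50, 75, 90]))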
import numpy as np
from scipy.special import expit
from scipy.stats import binom
def build_toy_dataset(N, D=50, noise_std=0.1):
    # noise_std is currently unused; labels are generated deterministically below
    np.random.seed(1234)
    X = np.random.random((N, D)) - 0.5
    W = np.random.random((D, 1)) - 0.5
    b = np.random.random() - 0.5
    # make the weights sparse: only the first D//4 coefficients are non-zero
    zero_ind = np.arange(D//4, D)
    W[zero_ind, :] = 0
    y = expit(X.dot(W) + b)
    # threshold the probabilities to get deterministic labels
    # (alternatively, sample labels with y = binom.rvs(1, y))
    y[y < 0.5] = 0
    y[y >= 0.5] = 1
    y = y.flatten()
    return X, y, W
N, D = 200, 50
X, y, W = build_toy_dataset(N, D)
plt.scatter(np.arange(W.size), W.flatten());
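Before fitting anything, a quick check of the toy data (our addition): only the first D//4 weights should be non-zero, and the classes should be roughly balanced.

print('non-zero weights:', np.count_nonzero(W), 'of', W.size)
print('positive class fraction:', y.mean())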
We now implement these formulations in PyMC3. We perform inference with Markov chain Monte Carlo (MCMC), using the No-U-Turn Sampler (NUTS) and, for comparison, plain Hamiltonian Monte Carlo (HMC). For each run we plot the inferred coefficients, which can be compared against the true coefficients of the toy data.
import pymc3 as pm
import theano as T
# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)
# define the model using formulation 1
with pm.Model() as logreg:
    λ = pm.HalfCauchy('lambda', beta=1, shape=D)
    τ = pm.HalfCauchy('tau', beta=1)
    # the horseshoe standard deviation of each coefficient is tau * lambda
    σ = pm.Deterministic('horseshoe', τ*λ)
    β = pm.Normal('beta', mu=0, sd=σ, shape=D)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + βₒ)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
# perform the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.NUTS(target_accept=.8)
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
βₕ = logreg_posterior['beta']
βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
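A couple of quick checks on this posterior (these lines are an addition to the original notebook): the posterior mean of the global scale tau gives a sense of the overall shrinkage, and the posterior mean of beta should be much closer to zero for the coefficients that are truly zero than for the salient ones.

print('posterior mean of tau:', logreg_posterior['tau'].mean())
print('mean |beta| over truly zero coefficients:', np.abs(βₕ[D//4:]).mean())
print('mean |beta| over truly non-zero coefficients:', np.abs(βₕ[:D//4]).mean())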
import pymc3 as pm
import theano as T
# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)
# define the model using formulation 2 (inverse-gamma scale mixture)
with pm.Model() as logreg:
    ϵ = pm.InverseGamma('epsilon', alpha=0.5, beta=1., shape=D, testval=0.1)
    λ2 = pm.InverseGamma('lambda', alpha=0.5, beta=1./ϵ, shape=D, testval=0.1)
    ξ = pm.InverseGamma('xi', alpha=0.5, beta=1., testval=0.1)
    τ2 = pm.InverseGamma('tau', alpha=0.5, beta=1./ξ, testval=0.1)
    # lambda2 and tau2 are squared scales, so the standard deviation is their square root
    σ = pm.Deterministic('horseshoe', pm.math.sqrt(τ2*λ2))
    β = pm.Normal('beta', mu=0, sd=σ, shape=D)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + βₒ)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
# perform the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.NUTS(target_accept=.8)
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
βₕ = logreg_posterior['beta']
βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
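The centred parameterisations (formulations 1 and 2) can exhibit funnel-like geometry, so it is worth checking whether NUTS reported divergent transitions. This snippet is an addition and assumes a PyMC3 version that records the 'diverging' sampler statistic; adjust the name if your version differs.

divergent = logreg_posterior.get_sampler_stats('diverging')
print('number of divergent transitions:', divergent.sum())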
import pymc3 as pm
import theano as T
# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)
# define the model using formulation 3 (non-centred parameterisation)
with pm.Model() as logreg:
    ν = 1  # 1 degree of freedom gives half-Cauchy local scales
    rₗ = pm.Normal('r_local', mu=0, sd=1., shape=D)
    ρₗ = pm.InverseGamma('rho_local', alpha=0.5*ν, beta=0.5*ν, shape=D, testval=0.1)
    rᵧ = pm.Normal('r_global', mu=0, sd=1)
    ρᵧ = pm.InverseGamma('rho_global', alpha=0.5, beta=0.5, testval=0.1)
    τ = rᵧ * pm.math.sqrt(ρᵧ)
    λ = rₗ * pm.math.sqrt(ρₗ)
    z = pm.Normal('z', mu=0, sd=1, shape=D)
    β = pm.Deterministic('beta', z*λ*τ)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + βₒ)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
# perform the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.NUTS(target_accept=.8)
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
βₕ = logreg_posterior['beta']
βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
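The non-centred model does not store tau as a named variable, but it can be reconstructed from the sampled components to inspect the global shrinkage (this reconstruction is our addition).

τ_samples = logreg_posterior['r_global'] * np.sqrt(logreg_posterior['rho_global'])
print('posterior mean of |tau|:', np.abs(τ_samples).mean())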
import pymc3 as pm
import theano as T
# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)
# define the model using formulation 3 (non-centred), this time sampled with Hamiltonian Monte Carlo
with pm.Model() as logreg:
    ν = 1  # 1 degree of freedom gives half-Cauchy local scales
    rₗ = pm.Normal('r_local', mu=0, sd=1., shape=D)
    ρₗ = pm.InverseGamma('rho_local', alpha=0.5*ν, beta=0.5*ν, shape=D, testval=0.1)
    rᵧ = pm.Normal('r_global', mu=0, sd=1)
    ρᵧ = pm.InverseGamma('rho_global', alpha=0.5, beta=0.5, testval=0.1)
    τ = rᵧ * pm.math.sqrt(ρᵧ)
    λ = rₗ * pm.math.sqrt(ρₗ)
    z = pm.Normal('z', mu=0, sd=1, shape=D)
    β = pm.Deterministic('beta', z*λ*τ)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + βₒ)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
# perform the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.HamiltonianMC()
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
βₕ = logreg_posterior['beta']
βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
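HMC with a hand-tuned path length often mixes less well than NUTS on this posterior. A rough way to compare the runs is the effective sample size of beta across the two chains; this snippet is an addition and assumes pm.effective_n is available in your PyMC3 version (newer releases expose the same diagnostic through ArviZ instead).

ess = pm.effective_n(logreg_posterior)
print('min / median effective sample size for beta:', ess['beta'].min(), np.median(ess['beta']))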
import pymc3 as pm
import theano as T
# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)
# define the model using formulation 3 with ν = 3 (half-t local scales instead of half-Cauchy)
with pm.Model() as logreg:
    ν = 3  # 3 degrees of freedom give half-t local scales
    rₗ = pm.Normal('r_local', mu=0, sd=1., shape=D)
    ρₗ = pm.InverseGamma('rho_local', alpha=0.5*ν, beta=0.5*ν, shape=D, testval=0.1)
    rᵧ = pm.Normal('r_global', mu=0, sd=1)
    ρᵧ = pm.InverseGamma('rho_global', alpha=0.5, beta=0.5, testval=0.1)
    τ = rᵧ * pm.math.sqrt(ρᵧ)
    λ = rₗ * pm.math.sqrt(ρₗ)
    z = pm.Normal('z', mu=0, sd=1, shape=D)
    β = pm.Deterministic('beta', z*λ*τ)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + βₒ)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
# perform the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.NUTS(target_accept=.8)
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
βₕ = logreg_posterior['beta']
βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
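To compare any of the runs against the ground truth more quantitatively than by eye, a simple summary (our addition) is the root-mean-square error between the posterior-mean coefficients and the true weights W.

rmse = np.sqrt(np.mean((βₕ - W.flatten())**2))
print('RMSE between posterior mean of beta and true W:', rmse)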
We have shown three formulations of the horseshoe prior and implemented them in PyMC3. We find that the third, non-centred formulation produces a result more quickly.