In [1]:
#configure plotting
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib;matplotlib.rcParams['figure.figsize'] = (8,5)
from matplotlib import pyplot as plt

Horseshoe priors with pymc3

This notebook is to show how to implement a horseshoe prior model in a pymc3. Pymc3 is a probabilistic programming framework for the python programming language. We will just consider a toy dataset and fit a penalised logistic regression model using the horseprior. First we introduce the logistic regression model and 3 formulations of the prior. Then we introduce the toy data we wish to model. The toy data will a sparse set of salient features which we will wish to find with our model. We then implement the different formulations in pymc3.

The model

Logistic Regression

Logistic Regression can be written as,

yBernoulli(logit=Xβ+βo)

Horseshoe prior

Formulation 1

Priors need to be placed on β and βo. The horseshoe prior on β can be written as,

τCauchy+(0,1)λiCauchy+(0,1)βiNormal(0,τ2λi2)

as shown in http://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf (Equation 1)

Alternative formulations

Formulation 2

The half-Cauchy Cauchy+ can be written instead as

ξInverseGamma(12,1)τ2InverseGamma(12,1ξ)ϵiInverseGamma(12,1)λi2InverseGamma(12,1ϵi)βiNormal(0,τ2λi2)

as shown in Equation 26 in https://arxiv.org/pdf/1611.06649.pdf

Formulation 3

Alternatively,

ν=1rilocalNormal(0,1)ρilocalInverseGamma(12ν,12ν)rglobalNormal(0,1)ρglobalInverseGamma(12,12)zNormal(1,100)λi=rilocalρilocalτ=rglobalρglobalβi=zλiτ

as shown in Appendix of "On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior" - Piironen, Vehtari 2017 . Also see Equation 5 of https://arxiv.org/pdf/1705.10388.pdf "Model Selection in Bayesian Neural Networks via Horseshoe Priors" - Ghosh and Doshi-Velez 2017. https://arxiv.org/pdf/1602.03807.pdf "Variational Inference for Sparse and Undirected Models" - Ingraham and Marks 2017

This is an example of a non-centred model.

See also this blog post http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/

Toy Data

In [2]:
import numpy as np
from scipy.special import expit
from scipy.stats import binom


def build_toy_dataset(N, D=50, noise_std=0.1):
    np.random.seed(1234)
    X = np.random.random((N, D))-0.5
    W = np.random.random((D, 1))-0.5
    b = np.random.random()-0.5
    zero_ind = np.arange(D//4, D)
    W[zero_ind, :] = 0
    y = expit(X.dot(W) + b)
    #y = binom.rvs(1, y)  
    y[y < 0.5] = 0
    y[y >= 0.5] = 1  
    y = y.flatten()
    return X, y, W

N, D = 200, 50
X, y, W = build_toy_dataset(N, D)
plt.scatter(np.arange(W.size), W.flatten());

PyMC3 implementations

We now implement the formulations in pymc3. We perform inference using the Markov Chain Monte Carlo (MCMC) sampling procedure NUTS (No U-turn Sampler) and the Hamilton Markov Chain (HMC) procedure. For each we plot the inferred parameters of the model which can be compared to the true parameters of the toy data.

Formulation 1 (NUTS)

In [17]:
import pymc3 as pm
import theano as T

# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)


# define the model using formulation 1
with pm.Model() as logreg:
    λ = pm.HalfCauchy('lambda', beta=1, shape=D)
    τ = pm.HalfCauchy('tau', beta=1)
    σ = pm.Deterministic('horseshoe', τ*τ*λ*λ)
    β = pm.Normal('beta', mu=0, sd=σ, shape=D)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + βₒ)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
    
# performance the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.NUTS(target_accept=.8)
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
    βₕ = logreg_posterior['beta']
    βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
 96%|█████████▌| 3821/4000 [2:11:10<09:12,  3.09s/it]  /home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:459: UserWarning: Chain 1 reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
  'reparameterize.' % self._chain_id)
/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 495 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 4000/4000 [2:15:46<00:00,  1.09s/it]/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:459: UserWarning: Chain 0 reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
  'reparameterize.' % self._chain_id)
/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 495 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))

Formulation 2 (NUTS)

In [6]:
import pymc3 as pm
import theano as T

# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)


# define the model using formulation 1
with pm.Model() as logreg:
    ϵ = pm.InverseGamma('epsilon', alpha=0.5, beta=1., shape=D, testval=0.1)
    λ2 = pm.InverseGamma('lambda', alpha=0.5, beta=1./ϵ, shape=D, testval=0.1)
    ξ = pm.InverseGamma('xi', alpha=0.5, beta=1., testval=0.1)
    τ2 = pm.InverseGamma('tau', alpha=0.5, beta=1./ξ, testval=0.1)
    σ = pm.Deterministic('horseshoe', τ2*λ2)
    β = pm.Normal('beta', mu=0, sd=σ, shape=D)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + βₒ)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
    
# performance the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.NUTS(target_accept=.8)
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
    βₕ = logreg_posterior['beta']
    βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
 98%|█████████▊| 3909/4000 [2:14:43<03:23,  2.23s/it]  /home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:459: UserWarning: Chain 1 reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
  'reparameterize.' % self._chain_id)
/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 1 does not match the target. It is 0.709515864297, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 955 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 4000/4000 [2:16:40<00:00,  2.04s/it]/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:459: UserWarning: Chain 0 reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
  'reparameterize.' % self._chain_id)
/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:451: UserWarning: The acceptance probability in chain 0 does not match the target. It is 0.709515864297, but should be close to 0.8. Try to increase the number of tuning steps.
  % (self._chain_id, mean_accept, target_accept))
/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 955 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))

Formulation 3 (NUTS)

In [4]:
import pymc3 as pm
import theano as T

# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)


# define the model using formulation 1
with pm.Model() as logreg:
    ν = 1
    rₗ = pm.Normal('r_local', mu=0, sd=1., shape=D)
    ρₗ = pm.InverseGamma('rho_local', alpha=0.5*ν, beta=0.5*ν, shape=D, testval=0.1)
    rᵧ = pm.Normal('r_global', mu=0, sd=1)
    ρᵧ = pm.InverseGamma('rho_global', alpha=0.5, beta=0.5, testval=0.1)
    τ = rᵧ * pm.math.sqrt(ρᵧ)
    λ = rₗ * pm.math.sqrt(ρₗ)
    z = pm. Normal('z', mu=0, sd=1, shape=D)
    β = pm.Deterministic('beta', z*λ*τ)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + βₒ)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
    
# performance the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.NUTS(target_accept=.8)
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
    βₕ = logreg_posterior['beta']
    βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
 95%|█████████▍| 3781/4000 [22:57<00:54,  4.00it/s] /home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 233 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 4000/4000 [23:37<00:00,  4.87it/s]/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 233 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))

Formulation 3 (HMC)

In [3]:
import pymc3 as pm
import theano as T

# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)


# define the model using formulation 1
with pm.Model() as logreg:
    ν = 1
    rₗ = pm.Normal('r_local', mu=0, sd=1., shape=D)
    ρₗ = pm.InverseGamma('rho_local', alpha=0.5*ν, beta=0.5*ν, shape=D, testval=0.1)
    rᵧ = pm.Normal('r_global', mu=0, sd=1)
    ρᵧ = pm.InverseGamma('rho_global', alpha=0.5, beta=0.5, testval=0.1)
    τ = rᵧ * pm.math.sqrt(ρᵧ)
    λ = rₗ * pm.math.sqrt(ρₗ)
    z = pm. Normal('z', mu=0, sd=1, shape=D)
    β = pm.Deterministic('beta', z*λ*τ)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + \beta\_o)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
    
# performance the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.HamiltonianMC()
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
    βₕ = logreg_posterior['beta']
    βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
100%|██████████| 4000/4000 [04:32<00:00, 14.66it/s]

Formulation 3 (NUTS ν=3)

In [14]:
import pymc3 as pm
import theano as T

# set up a placeholder to input data to the pymc3 model
Xᵢ = T.shared(X)
yᵢ = T.shared(y)


# define the model using formulation 1
with pm.Model() as logreg:
    ν = 3
    rₗ = pm.Normal('r_local', mu=0, sd=1., shape=D)
    ρₗ = pm.InverseGamma('rho_local', alpha=0.5*ν, beta=0.5*ν, shape=D, testval=0.1)
    rᵧ = pm.Normal('r_global', mu=0, sd=1)
    ρᵧ = pm.InverseGamma('rho_global', alpha=0.5, beta=0.5, testval=0.1)
    τ = rᵧ * pm.math.sqrt(ρᵧ)
    λ = rₗ * pm.math.sqrt(ρₗ)
    z = pm. Normal('z', mu=0, sd=1, shape=D)
    β = pm.Deterministic('beta', z*λ*τ)
    βₒ = pm.Normal('b', mu=0, sd=1)
    μ = pm.math.invlogit(Xᵢ.dot(β) + \beta\_o)
    yₒ = pm.Bernoulli('obs', p=μ, observed=yᵢ)
    
# performance the inference
SEED = 20090425
warmup = 2000
num_samples = 2000
with logreg:
    step = pm.NUTS(target_accept=.8)
    logreg_posterior = pm.sample(num_samples, step=step, njobs=2, tune=warmup, random_seed=SEED)
    βₕ = logreg_posterior['beta']
    βₕ = βₕ[βₕ.shape[0]//2:, :].mean(axis=0)
plt.scatter(np.arange(βₕ.size), βₕ.flatten());
 97%|█████████▋| 3893/4000 [10:31<00:17,  6.27it/s]/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 1 contains 100 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|█████████▉| 3998/4000 [10:40<00:00,  8.78it/s]/home/joe/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py:467: UserWarning: Chain 0 contains 100 diverging samples after tuning. If increasing `target_accept` does not help try to reparameterize.
  % (self._chain_id, n_diverging))
100%|██████████| 4000/4000 [10:40<00:00,  6.25it/s]

Conclusion

We have shown 3 formulations of the horseshoe prior and implemented them in pymc3. We find the 3rd formulation produces a result more quickly.