1 Introduction
The latent class logit (lclogit) or discrete mixture logit model is a Stata module that uses the EM algorithm to fit latent class conditional logit models (Pacifico and Yoo, 2013). Traditionally, latent class models are estimated using gradient-based optimization techniques such as the Newton-Raphson or Berndt–Hall–Hall–Hausman (BHHH) algorithm (Berndt et al., 1974). With these techniques, as the number of parameters or latent classes increases, maximising the likelihood becomes difficult and computing the gradient takes more time. Bhat (1997) and Train (2008) stated that the expectation-maximisation (EM) algorithm can be used in place of these traditional algorithms, as it makes the estimation numerically more stable and estimates the parameters efficiently even when the number of parameters is large.
Compared to the mixed logit model, the latent class logit model has a lower computational cost and a faster processing time. The EM algorithm iterates until the log likelihood converges. The discrete mixture approach is also more flexible and easier to implement.
2 EM algorithm for latent class logit
Train (2008) built on Bhat's (1997) work on latent class models to show that the EM algorithm can be used with a large number of parameters. This paper uses the EM algorithm as implemented in Pacifico and Yoo (2013). Let N denote the number of agents, J the number of alternatives and T the number of choice scenarios, and let y_njt be a dummy variable equal to one if agent n chooses alternative j in choice situation t.
P_n(β_c) = ∏_{t=1}^{T} ∏_{j=1}^{J} [ exp(β_c x_njt) / Σ_{k=1}^{J} exp(β_c x_nkt) ]^{y_njt}   (1)
Equation 1 shows the panel choice probability of a conditional logit model, where x_njt is the vector of attributes of alternative j faced by agent n in situation t, P_n(β_c) is the choice probability of agent n's observed sequence of choices, β_c is the parameter vector for class c, and C is the number of classes.
π_cn(θ) = exp(θ_c z_n) / [ 1 + Σ_{l=1}^{C−1} exp(θ_l z_n) ]   (2)
In equation 2, π_cn(θ) is the weight, or class membership probability, for class c, where z_n denotes agent n's characteristics and θ = (θ_1, θ_2, ..., θ_{C−1}) denotes the class membership parameters; the parameters of the Cth class are normalised to zero.
ln L(β, θ) = Σ_{n=1}^{N} ln Σ_{c=1}^{C} π_cn(θ) P_n(β_c)   (3)
The sample log likelihood in equation 3 is obtained by summing, over agents, the log of the unconditional likelihood, i.e. the class-share-weighted average of the class-specific choice probabilities.
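As an illustration of equations (1) to (3), the following Python sketch computes the class-specific panel probability P_n(β_c), the class shares π_cn(θ), and the sample log likelihood; the array shapes and variable names are assumptions made for this example and are not part of lclogit itself.

import numpy as np

def panel_prob(beta_c, X, y):
    # Equation (1): product over choice situations of conditional logit probabilities.
    # X has shape (T, J, K); y holds the index of the chosen alternative in each situation.
    v = X @ beta_c                                # (T, J) systematic utilities
    p = np.exp(v - v.max(axis=1, keepdims=True))  # stabilised softmax numerator
    p /= p.sum(axis=1, keepdims=True)
    return np.prod(p[np.arange(len(y)), y])       # probability of the observed sequence

def class_shares(theta, z):
    # Equation (2): class membership probabilities; theta is (C-1, M), z is (M,), class C is the base.
    e = np.exp(theta @ z)
    return np.append(e, 1.0) / (1.0 + e.sum())

def log_likelihood(betas, theta, X_all, y_all, z_all):
    # Equation (3): sum over agents of the log of the share-weighted panel probabilities.
    ll = 0.0
    for X, y, z in zip(X_all, y_all, z_all):
        pi = class_shares(theta, z)
        Pc = np.array([panel_prob(b, X, y) for b in betas])
        ll += np.log(pi @ Pc)
    return ll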
β^{s+1} = argmax_β Σ_{n=1}^{N} Σ_{c=1}^{C} h_cn(β^s, θ^s) ln P_n(β_c)
θ^{s+1} = argmax_θ Σ_{n=1}^{N} Σ_{c=1}^{C} h_cn(β^s, θ^s) ln π_cn(θ)   (4)
The term being maximised in the first line of equation 4 is the log likelihood function of a conditional logit model in which each choice situation of each agent is treated as a separate observation. Superscript s denotes the estimates at the sth iteration, and h_cn(β^s, θ^s) is the posterior probability that agent n belongs to class c:
h_cn(β^s, θ^s) = π_cn(θ^s) P_n(β^s_c) / Σ_{l=1}^{C} π_ln(θ^s) P_n(β^s_l)   (5)
The updating procedure can be implemented easily in Stata by exploiting the clogit and fmlogit routines as follows. β^{s+1} is computed by fitting a conditional logit model (clogit) C times, each time using h_cn(β^s, θ^s) for a particular c to weight the observations on each n. θ^{s+1} is obtained by fitting a fractional multinomial logit model (fmlogit) that takes h_1n(β^s, θ^s), h_2n(β^s, θ^s), ..., h_Cn(β^s, θ^s) as dependent variables. When z_n includes only the constant term, so that each class share is the same for all agents, that is, when π_cn(θ) = π_c(θ), each class share can be updated directly with the following analytical solution, without fitting the fractional multinomial logit model:
π_c(θ^{s+1}) = Σ_{n=1}^{N} h_cn(β^s, θ^s) / Σ_{l=1}^{C} Σ_{n=1}^{N} h_ln(β^s, θ^s)   (6)
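The E-step and the analytical share update in equations (5) and (6) can be sketched in the same way; this is only a schematic illustration reusing the hypothetical helpers from the previous sketch, not the actual lclogit code, which instead obtains β^{s+1} from C weighted clogit fits and θ^{s+1} from fmlogit as described above.

import numpy as np

def posteriors(betas, theta, X_all, y_all, z_all):
    # Equation (5): posterior probability h_cn that agent n belongs to class c.
    H = []
    for X, y, z in zip(X_all, y_all, z_all):
        pi = class_shares(theta, z)
        Pc = np.array([panel_prob(b, X, y) for b in betas])
        H.append(pi * Pc / (pi * Pc).sum())
    return np.array(H)                 # shape (N, C)

def update_shares(H):
    # Equation (6): analytical class-share update when z_n contains only a constant.
    return H.sum(axis=0) / H.sum()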
3 The lclogit command
The lclogit Stata command is a user-written program that provides a numerically stable, faster and less computationally costly way of obtaining a nonparametric estimate of the mixing distribution. These characteristics allow the command to estimate a large number of latent classes in a short period of time. In addition, log probabilities and Stata's generate command are used, which also reduces the estimation time. The clogit maximum likelihood evaluator is used, as lclogit does not have its own.
The results are displayed in a table by using the estimates store and estimates table commands, with the columns labelled by class. If there are 20 or more latent classes, the results are reported in matrix form rather than as a table.
Pacifico and Yoo (2013) state that the lclogit command requires certain options. group() and id() take numeric variables that identify the choice occasions and the choice makers respectively; if cross-sectional data are used, the same variable can be given for both. nclasses(#) specifies the number of latent classes, which can be selected using the CAIC and BIC information criteria. convergence(#) sets the threshold used to declare convergence of the log likelihood, and iterate(#) specifies the maximum number of iterations. membership(varlist) specifies the independent variables, constant within each decision maker, that enter the fractional multinomial logit model of class membership.
4 Post-estimation command: lclogitpr
Pacifico and Yoo (2012) state that lclogitpr predicts the probability of selecting each alternative in a choice occasion. The options for lclogitpr are:
• class(numlist) specifies the classes for which predictions are computed;
• pr0 predicts the unconditional choice probability;
• up predicts the class shares, or prior probabilities, that the agent is in particular classes.
5 Post-estimation command: lclogitcov
This command predicts the variances and covariances of the choice model coefficients.
"The default setting stores the predicted variances in a set of variables named var1, var2, ..., where vark is the predicted variance of the coefficient on the kth variable listed in varlist, and to store the predicted covariances in cov12, cov13, ..., cov23, ..., where covkj is the predicted covariance between the coefficients on the kth variable and the jth variable in varlist." (Pacifico and Yoo, 2012, p. 631)
• nokeep displays the average covariance matrix and drops the predicted variances and covariances;
• varname(stubname) states that the predicted variances should be saved as stubname1, stubname2, ...;
• covname(stubname) states that the predicted covariances should be saved as stubname12, stubname13, ...;
• matrix(name) stores the reported average covariance matrix in a Stata matrix called name.
6 Application
The lclogit command is applied to the example that Pacifico and Yoo (2013) used to estimate a latent class logit model. The data in the example are used to determine households' preferences over electricity suppliers. The example consists of 100 customers, each facing at least 12 choice situations in which they must choose one of 4 suppliers. The data contain: the price of the contract (price); the length of contract that the supplier offered, in years (contract); whether the supplier is a local company (local); whether the supplier is a well-known company (wknown); whether the supplier offers a time-of-day rate instead of a fixed rate (tod); and whether the supplier offers a seasonal rate instead of a fixed rate (seasonal). The data can be seen in Table 1, where y is the dummy variable for choice, and pid and gid are numeric variables that identify the agents and the choice situations.
Table 1: Variables and Data
y price contract local wknown tod seasonal gid pid x1
1. 0 7 5 0 1 0 0 1 1 27
2. 0 9 1 1 0 0 0 1 1 27
3. 0 0 0 0 0 0 1 1 1 27
4. 1 0 5 0 1 1 0 1 1 27
5. 0 7 0 0 1 0 0 2 1 27
6. 0 9 5 0 1 0 0 2 1 27
7. 1 0 1 1 0 1 0 2 1 27
8. 0 0 5 0 0 0 1 2 1 27
9. 0 9 5 0 0 0 0 3 1 27
10. 0 7 1 0 1 0 0 3 1 27
11. 0 0 0 0 1 1 0 3 1 27
12. 1 0 0 1 0 0 1 3 1 27
The above table was derived by using the following commands in Stata:
• use http://fmwww.bc.edu/repec/bocode/t/traindata.dta
• set seed 1234567890
• by pid, sort: egen x1=sum(round(rnormal(0.5),1))
• list in 1/12, sepby(gid)
The CAIC and the BIC were used as information criteria to select the optimal number of latent classes. The estimation results can be seen in Table 2: the CAIC decreases from 2337.273 to 2292.538 when the fifth class is added and increases to 2313.103 when the sixth class is added. The same comparison was made for the BIC, which points to eight latent classes. This example, however, will use 5 classes. The following commands were used in Stata:
• forvalues c = 2/10 {
• quietly lclogit y price contract local wknown tod seasonal, group(gid) id(pid) nclasses(`c') membership(x1) seed(1234567890)
• matrix b = e(b)
• matrix ic = nullmat(ic) \ `e(nclasses)', `e(ll)', `=colsof(b)', `e(caic)', `e(bic)'
• }
(output omitted)
• matrix colnames ic = "Classes" "LLF" "Nparam" "CAIC" "BIC"
• matlist ic, name(columns)
Table 2: Number of Class Selection
Classes LLF Nparam CAIC BIC
2 -1211.232 14 2500.935 2486.935
3 -1117.521 22 2258.356 2336.356
4 -1084.559 30 2337.273 2307.273
5 -1039.771 38 2292.538 2254.538
6 -1027.633 46 2313.103 2267.103
7 -999.9628 54 2302.605 2248.605
8 -987.7199 62 2322.96 2260.96
9 -985.1933 70 2362.748 2292.748
10 -966.3487 78 2369.901 2291.901
Table 3 gives the estimated model with 5 classes; Class 2 is the largest class, with a share of about 28 percent. Because the membership variable makes the class shares agent specific, the reported class shares are averages over agents; the agent-specific (posterior) class probabilities can be obtained with the lclogitpr command:
• by `e(id)', sort: generate first = _n==1
• lclogitpr cp, cp
• egen double cpmax = rowmax(cp1-cp5)
Table 3: Choice Model parameters and average class share
Variable Class1 Class2 Class3 Class4 Class5
price -0.315 -0.562 -0.887 -1.497 -0.762
contract 0.025 -0.083 -0.470 -0.380 -0.538
local 3.072 4.512 0.400 0.803 0.526
wknown 2.256 3.405 0.424 1.075 0.317
tod -2.183 -7.872 -8.245 -15.229 -5.356
seasonal -2.484 -7.705 -6.225 -14.419 -7.760
Class Share 0.300 0.174 0.112 0.254 0.160
Variable Class1 Class2 Class3 Class4 Class5
x1 -0.011 0.024 -0.022 -0.027 0.000
cons 0.902 -0.556 0.172 1.119 0.000
• summarize cpmax if first, sep(0)
The lclogitpr command can also be used to describe how well the model differentiates the preferences of the classes. The mean highest posterior class probability of about 0.96 indicates that the model does a good job of distinguishing the preferences of each class; this can be seen in Table 4. The model was estimated with the following command in Stata: lclogit y price contract local wknown tod seasonal, group(gid) id(pid) nclasses(5) membership(x1) seed(1234567890)
Table 4: Fitness of Model
Variable Obs Mean Std. Dev. Min Max
cpmax 100 .9596674 0.860159 .5899004 1
Each respondent is assigned to the class that gives that agent the highest posterior probability (Pacifico and Yoo, 2013), so that the choice outcomes within the model can be examined. For each class, the unconditional choice probability and the choice probability conditional on being in that class are then computed.
• lclogitpr pr, pr
• generate byte class = .
(4780 missing values generated)
• forvalues c = 1/`e(nclasses)' {
• quietly replace class = `c' if cpmax==cp`c'
• }
• forvalues c = 1/`e(nclasses)' {
• quietly summarize pr if class == `c' & y==1
• local n=r(N)
• local a=r(mean)
• quietly summarize pr`c' if class == `c' & y==1
• local b=r(mean)
• matrix pr = nullmat(pr) \ `n', `c', `a', `b'
• }
• matrix colnames pr = "Obs" "Class" "Uncond Pr" "Cond Pr"
• matlist pr, name(columns)
Table 5 shows the conditional and unconditional probabilities of the model. Pacifico and Yoo (2013) state that the reference values are an average conditional probability of 0.5 and an unconditional probability of 0.25. The probabilities in the table are higher than these reference values, which indicates that the model predicts the observed choice situations well. The following Stata code was used:
• matrix list e(PB)
• e(PB)[1,6]
Table 5: Conditional and Unconditional Probabilities
Obs Class Uncond Pr Cond Pr
129 1 .3364491 .5387555
336 2 .3344088 .4585939
191 3 .3407353 .5261553
300 4 .4562778 .7557497
239 5 .4321717 .6582177
7 Conclusion
lclogit is a Stata command that uses the EM algorithm to estimate choice models with a discrete mixing distribution. The EM algorithm allows models with a large number of parameters to be estimated in a shorter period of time and at a lower computational cost, and it makes the estimation numerically more stable. The post-estimation commands can also be used to describe how well the model differentiates the preferences of the classes, and the CAIC and BIC are used to select the number of latent classes.
References
Berndt, E. R., Hall, B. H., Hall, R. E., and Hausman, J. A. (1974). Estimation and inference in nonlinear structural models. In Annals of Economic and Social Measurement, Volume 3, Number 4, pages 653–665. NBER.
Bhat, C. R. (1997). An endogenous segmentation mode choice model with an application to intercity travel. Transportation Science, 31(1):34–48.
Pacifico, D. and Yoo, H. I. (2012). A Stata module for estimating latent class conditional logit models via the expectation-maximization algorithm. Technical report, School of Economics, The University of New South Wales.
Pacifico, D. and Yoo, H. I. (2013). lclogit: A Stata command for fitting latent-class conditional logit models via the expectation-maximization algorithm. The Stata Journal, 13(3):625–639.
Train, K. E. (2008). EM algorithms for nonparametric estimation of mixing distributions. Journal of Choice Modelling, 1(1):40–69.
A systematic comparison of continuous
and discrete mixture models
Stephane Hess∗ Michel Bierlaire† John W. Polak‡
November 17, 2006
Report TRANSP-OR 061117
Transport and Mobility Laboratory
School of Architecture, Civil and Environmental Engineering
École Polytechnique Fédérale de Lausanne
transp-or.epfl.ch
∗Centre for Transport Studies, Imperial College London, stephane.hess@imperial.ac.uk, Tel: +44(0)20 7594 6105, Fax: +44(0)20 7594 6102
†Transport and Mobility Laboratory, School of Civil and Environmental Engineering, École Polytechnique Fédérale de Lausanne, michel.bierlaire@epfl.ch, Tel: +41(0)21 693 25 37, Fax: +41(0)21 693 55 70
‡Centre for Transport Studies, Imperial College London, j.polak@imperial.ac.uk, Tel: +44(0)20 7594 6089, Fax: +44(0)20 7594 6102
Abstract
Modellers are increasingly relying on the use of continuous random coefficients models, such as Mixed Logit, for the representation of variations in tastes across individuals. In this paper, we provide an in-depth comparison of the performance of the Mixed Logit model with that of its far less commonly used discrete mixture counterpart, making use of a combination of real and simulated datasets. The results not only show significant computational advantages for the discrete mixture approach, but also highlight greater flexibility, and show that, across a host of scenarios, the discrete mixture models are able to offer comparable or indeed superior model performance.
1 Introduction and context
Allowing for variations in behaviour across decision makers is one of the
most fundamental principles in discrete choice modelling, given that the
assumption of a purely homogeneous population cannot in general be seen
to be valid. The typical way of allowing for such variation is through
a deterministic approach, linking the taste heterogeneity to variations in
socio-demographic factors such as income or trip purpose.
While appealing from the point of view of interpretation (and especially for forecasting), it is often not possible to represent all variations in tastes in a deterministic fashion, for reasons of data quality, but also due to inherent randomness in choice behaviour. For this reason, random coefficient structures, such as the Mixed Multinomial Logit (MMNL) model, which allow for random variations in behaviour across respondents, have an important advantage in terms of flexibility. In general, such models have the disadvantage that their choice probabilities take on the form of integrals that do not possess a closed form solution, such that numerical processes, typically simulation, are required during estimation and application of the models. This greatly limited the use of these structures for many years after their initial developments. Over recent years, gains in computer speed and the efficiency of simulation based estimation processes (see for example Hess et al., 2006) have however led to increased interest in the MMNL model in particular, by researchers and, to a lesser degree, also practitioners.
Despite the improvements in estimation capability, the cost of using the MMNL model remains high. While this might be acceptable in many cases, another important issue remains, namely the choice of distribution to be used for representing the random variations in tastes across respondents. Here, there is a major risk of producing misleading results when making an inappropriate choice of distribution, as discussed by Hess et al. (2005).
In this paper, we explore an alternative approach, based on the idea of replacing the continuous distribution functions by discrete distributions, spreading the mass among several discrete values. Mathematically, the model structure of a DM model is a special case of a latent class model (cf. Kamakura and Russell, 1989; Chintagunta et al., 1991), assigning different coefficient values to different parts of the population of respondents, a concept discussed in the field of transport studies for example by Greene and Hensher (2003) and Lee et al. (2003). Latent class approaches make use of two sub-models, one for class allocation, and one for within class choice. The former models the probability of an individual being assigned to a specific class as a function of attributes of the respondent and possibly of the alternatives in the choice set. The within class model is then used to compute the class-specific choice probabilities for the different alternatives, conditional on the tastes within that class. The actual choice probability for individual n and alternative i is given by a sum of the class-specific choice probabilities, weighted by the class allocation choice probabilities for that specific individual.
The latent class approach is appealing from the point of view that it allows for differences in sensitivities across population groups, where the group allocation can be related to socio-demographic characteristics. However, in practice, it may not always be possible to explain group allocation with the help of a probabilistic model relating the outcome to observed variables. This situation is similar to the case where taste heterogeneity cannot be explained deterministically, leading to a requirement for using random coefficients models. As such, in this paper, we explore the use of models in which the class allocation probabilities are independent of explanatory variables, and are simply given by constants that are to be estimated during model calibration. As such, the resulting model exploits the class membership concept in the context of random coefficients models, with a limited set of possible values for the coefficients.
Thus far, there have seemingly been only two applications of this approach in the area of transport research, by Gopinath (1995), in the context of mode choice for freight shippers, and by Dong and Koppelman (2003), who made use of discrete mixtures of MNL models in the analysis of mode choice for work trips in New York, referring to the resulting model as the "Mass Point Mixed Logit model". Although the properties of DM models have been discussed by several other authors (e.g., Wedel et al., 1999), the model structure does not seem to have received widespread exposure or application, despite its many appealing characteristics.
Given the above discussion, part of the aim of this paper is to re-explore the potential advantages of DM models, with the hope of encouraging their more widespread use. Additionally, the paper aims to offer a systematic comparison of the performance of discrete and continuous mixture models across a host of situations, making use of simulated data.
The remainder of this paper is organised as follows. The next section sets out the theory behind DM models. Section 3 presents a case study using real data, while Section 4 uses four different simulated datasets in a systematic comparison of discrete and continuous mixture models. Finally, Section 5 presents the conclusions of the paper.
2 Methodology
We begin by introducing some general notation, which is used throughout the remainder of this paper. Specifically, let x_in be a vector defining the attributes of alternative i as faced by respondent n (potentially including interactions with socio-demographic variables), and let β be a vector defining the tastes of the decision maker, where, in purely deterministic models, β is constant across respondents. Let x_n be a vector grouping together the individual vectors x_jn across the alternatives contained in the choice set of respondent n, and let γ represent an additional set of parameters, which can for example contain the structural parameters (and possibly allocation parameters) used to represent inter-alternative correlation in a Generalised Extreme Value (GEV) context. In a very general form, we can then define P_n(i | x_n, C_n, γ, β) to give the choice probability of alternative i for individual n, with a choice set C_n, conditional on the observed vector x_n, and for given values for the vectors of parameters β and γ (to be estimated). Due to the potential inclusion of socio-demographic attributes in x_n, this notation allows for deterministic variations in tastes across respondents.
In a discrete mixture context, the number of possible values for the taste coefficients β is finite. Here, we divide the set of parameters β into two sets; β̄ represents a part of β containing deterministic parameters, while β̂ is a set of K random parameters that have a discrete distribution. Within this set, the parameter β̂_k has m_k mass points β̂_k^j, j = 1, ..., m_k, each of them associated with a probability π_k^j, where we impose the conditions that

0 ≤ π_k^j ≤ 1,  k = 1, ..., K;  j = 1, ..., m_k,   (1)

and

Σ_{j=1}^{m_k} π_k^j = 1,  k = 1, ..., K.   (2)

(These constraints can be avoided by setting π_j = e^{α_j} / Σ_{l=1}^{J} e^{α_l}, where the α_j, j = 1, ..., J, are estimated without constraints. While avoiding the need for constraints, this formulation becomes highly non-linear and difficult to handle in estimation.)

For each realisation β̂_1^{j1}, ..., β̂_K^{jK} of β̂, the choice probability is given by

P_n( i | x_n, C_n, γ, β = ⟨β̄, β̂_1^{j1}, ..., β̂_K^{jK}⟩ ),   (3)

where the deterministic part β̄ stays constant across realisations of the vector β̂.

The unconditional (on a specific realisation of β, not on the distribution of β̂) choice probability for alternative i and decision maker n can now be written straightforwardly as a mixture over the discrete distributions of the various elements contained in β̂ as:

P_n( i | x_n, C_n, γ, β̄, β̂, π ) = Σ_{j1=1}^{m_1} ··· Σ_{jK=1}^{m_K} P_n( i | x_n, C_n, γ, β = ⟨β̄, β̂_1^{j1}, ..., β̂_K^{jK}⟩ ) π_1^{j1} · ... · π_K^{jK},   (4)

where β̄, β̂ and π (π = ⟨π_1^1, ..., π_1^{m_1}, ..., π_K^1, ..., π_K^{m_K}⟩) are vectors of parameters to be estimated in a regular maximum likelihood estimation procedure. An obvious advantage of this approach is that, if the model (3) used inside the mixture has a closed form, then so does the DM itself.
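To make equation (4) concrete, the Python sketch below evaluates the DM choice probabilities for an MNL kernel by enumerating all combinations of mass points; the attribute matrix, the split of the coefficient vector and the names used here are illustrative assumptions rather than anything prescribed by the paper. Because every term in the sum has a closed form, no simulation is required.

import itertools
import numpy as np

def mnl_probs(v):
    # MNL choice probabilities from a vector of systematic utilities.
    e = np.exp(v - v.max())
    return e / e.sum()

def dm_mnl_probs(x, beta_det, mass_points, masses):
    # Equation (4): probability-weighted sum of MNL probabilities over all combinations
    # of the mass points of the K discretely distributed coefficients.
    # x: (J, K_det + K_random) attributes; mass_points[k] and masses[k] hold the support
    # points and probabilities of the kth random coefficient.
    P = np.zeros(x.shape[0])
    for combo in itertools.product(*[range(len(m)) for m in mass_points]):
        beta_rand = np.array([mass_points[k][j] for k, j in enumerate(combo)])
        weight = np.prod([masses[k][j] for k, j in enumerate(combo)])
        beta = np.concatenate([np.asarray(beta_det), beta_rand])
        P += weight * mnl_probs(x @ beta)
    return P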
In this paper, we mainly focus on the simple case where the underlying choice model is of MNL form; however, the form given in equation (4) is appropriate for any underlying model, where, with an underlying GEV structure, the resulting model obtains a closed form expression, avoiding the need for simulation in estimation and application. In this case, the
vector γ would contain parameters that determine the nesting structure of the model. The approach can easily be extended to the case of combined discrete and continuous random taste variation, by partitioning β into three parts; the above defined parts β̄ and β̂, and an additional part β̃, whose elements follow continuous distributions (this approach can then also be used to include error components for correlation or heteroscedasticity). This however leads to a requirement to use simulation, as with all continuous mixture models.
Finally, independently of the additional treatment of random variations in tastes, a treatment of repeated choice observations analogous to the standard continuous mixture treatment, with tastes varying across individuals, but not across observations for the same individual, is made possible by replacing the conditional choice probabilities for individual observations in equation (4) by probabilities for sequences of choices, and by using the resulting DM term inside the log-likelihood function.
Several issues arise in the estimation of DM models. Firstly, the non-concavity of the log likelihood function does not allow the identification of a global maximum, even for discrete mixtures of MNL. Given the potential presence of a high number of local maxima, performing several estimations from various starting points is advisable. Also, it is good practice to use starting values other than 0 or 1 for the π_k^j parameters. Secondly, constrained maximum likelihood must be used to account for constraints (1) and (2). Thirdly, clustering of mass points (for example around the mode of the true distribution) is a frequent phenomenon with DM models, and the use of additional bounds on the mass points can be useful, based on the definition of (potentially mutually exclusive) a priori intervals for the individual mass points. In this context, a heuristic is needed to determine the optimal number of support points in actual applications.
For the purpose of this analysis, the model was coded into BIOGEME (Bierlaire, 2003), where various constraints on the parameters can be imposed to address the issues described above. This also allows modellers to test the validity of specific assumptions, such as a mass at zero for the VTTS, a concept discussed for example by Cirillo and Axhausen (2006).
3 VTTS case study
In this section, we present the findings of an analysis making use of real world data. We first give a brief description of the data in Section 3.1, before looking at model specification in Section 3.2. The estimation results are presented in Section 3.3.
3.1 Data
The study presented here makes use of Stated Preference (SP) data collected as part of a recent value of time study undertaken in Denmark (Burge and Rohr, 2004). Specifically, we make use of data describing a binary choice process for car travellers, with alternatives described only in terms of travel cost and travel time. Each respondent was presented with 9 choice situations, including one with a dominating alternative.
After eliminating the observations with a dominating alternative, as well as additional data cleaning (removing non-traders and respondents who did not choose the dominating alternative), a sample of 13,386 observations from 1,723 respondents was obtained. This equates to 3,037 observations from 392 commuters, 1,081 observations from 142 respondents travelling for education purposes, 1,767 observations from 230 people on shopping trips, 3,155 observations from 404 people travelling to visit friends or relatives, 1,752 observations from 224 general leisure travellers and 2,594 observations from 331 respondents travelling for other purposes.
To allow us to gauge the stability of the results, random subsamples of around 80% of the original sample size were generated for each of the above listed six purpose segments (the selection was performed at the individual-specific level, rather than the observation-specific level), where in each case, 10 such subsamples were created.
3.2 Model specification
The models used in this paper were estimated in log-WTP (willingness to pay) space (cf. Fosgerau, 2004), avoiding the effect of heterogenous scale (cf. Fosgerau and Bierlaire, 2006), while allowing us to represent random variations in the VTTS without the issue of calculating the VTTS on the basis of separate randomly distributed coefficients for travel time and travel cost. Some modifications of the utility functions are required before estimation. Specifically, let Ti and Ci define the time and cost attributes of alternative i, and let us rearrange the data such that T1 > T2 and C1 < C2, i.e., the first alternative is slower but cheaper than the second alternative. Then, a very basic specification of utility is given by:
Ui = βT Ti + βC Ci,   (5)

where βT and βC represent time and cost coefficients respectively, and where i = 1, 2.
The first alternative is then chosen if the respondent is not willing to pay C2 − C1 to obtain a reduction in travel time by T1 − T2. This equates to:

P1 = P( βT (T1 − T2) > βC (C2 − C1) ).   (6)
With both βT and βC forced to take on negative values (cf. Hess et al., 2005), and with the above detailed relationships between the cost and time attributes for the two alternatives, equation (6) can be rewritten as:

P1 = P( −ΔC/ΔT > βT/βC ),   (7)
where ΔC = C1 − C2 and ΔT = T1 − T2. After noting that the VTTS is given by βT/βC, and after a further change to equation (7), we obtain the choice probabilities in log-WTP space as follows:

P1 = P[ ln(−ΔC/ΔT) > αLV ],   (8)

where αLV = ln(VTTS).
By noting that the absence of an estimated coefficient in the utility for alternative 1 leads to a need to explicitly estimate the scale, the utility functions for alternative 1 and 2 are given by:

U1 = λ ln(−ΔC/ΔT) + ε1   (9)

and

U2 = λ αLV + ε2,   (10)

where λ is estimated in addition to αLV, and where ε1 and ε2 give the usual type I iid extreme value terms. With travel costs given in Danish Krona (DKK) and travel times given in minutes, the actual VTTS in DKK per hour is obtained by 60 · exp(αLV).
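Under this specification, the probability of choosing the slower but cheaper alternative reduces to a binary logit in the log cost-time trade-off; the short Python sketch below makes this explicit, with purely illustrative numbers rather than estimates from the paper.

import numpy as np

def p_slow_cheap(delta_c, delta_t, alpha_lv, lam):
    # P1 from equations (8)-(10): binary logit on lam * (ln(-dC/dT) - alpha_LV),
    # with delta_c = C1 - C2 < 0 and delta_t = T1 - T2 > 0.
    log_tradeoff = np.log(-delta_c / delta_t)
    return 1.0 / (1.0 + np.exp(-lam * (log_tradeoff - alpha_lv)))

# Illustrative values only: alternative 1 is 10 DKK cheaper and 15 minutes slower,
# and alpha_LV corresponds to a VTTS of 30 DKK/hour, i.e. 0.5 DKK/minute.
print(p_slow_cheap(delta_c=-10.0, delta_t=15.0, alpha_lv=np.log(0.5), lam=1.0))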
The specification set out above can now be used in a standard discrete choice framework, with either a fixed estimate for αLV, or with random variation across respondents. At this point, it should also be noted that attempts to estimate models with an additional constant (in preference space) associated with the first of the two SP alternatives, hence accounting for a left-right reading effect, did not lead to any significant differences in the VTTS estimates.
3.3 Model results
During the analysis, four different types of model were estimated on the data; a simple MNL model, a MMNL model using a Normal distribution, and two DM specifications, one with two support points, DM(2), and one with three support points, DM(3) (models with more than three support points collapsed back to the more basic specifications). In the MMNL and DM models, the repeated choice nature of the data was taken into account by specifying the likelihood function with the integration (respectively summation in the DM models) outside the product over replications for the same respondent.
Each of these models was estimated across the six population segments and the ten subsets of the data, leading to 240 estimated models. Given this wealth of results, we present detailed results only for shopping trips (Section 3.3.1), and give summary results for the remaining five population segments (Section 3.3.2).
3.3.1 Detailed results for shopping trips
The results for the various models estimated on the data for shopping trips are summarised in Table 1.

[Table 1: Estimation results on Danish shopping data. For each of the ten subsamples, and overall, the table reports the final log likelihood, adjusted ρ2 and estimation time of the MNL, MMNL, DM(2) and DM(3) models, the estimates and asymptotic t-ratios of αLV (its mean and standard deviation, or its support points and associated probabilities) and λ, and the implied mean and standard deviation of the VTTS.]

Several differences arise across models in the
presentation of the results. As such, for the MNL model, only αLV and λ are estimated. For the MMNL model, αLV follows a Normal distribution, with mean αLV,µ and standard deviation αLV,σ. For the two DM models, the value of αLV is spread across several support points αLV,k with associated probabilities 0 ≤ πk ≤ 1, such that Σ_{k=1}^{K} πk = 1, with K = 2 and K = 3 in DM(2) and DM(3) respectively. In addition, the table shows the calculated VTTS. For the MNL model, the mean VTTS is simply obtained through 60 · exp(αLV). However, for the three mixture models, the non-linearity in the exponential means that a different approach is required. With αLV ∼ N(µα, σα) in the MMNL model, the actual VTTS follows a log-normal distribution with mean µVTTS = exp(µα + σα²/2) and standard deviation σVTTS = µVTTS √(exp(σα²) − 1). Both µVTTS and σVTTS can then be multiplied by 60 to obtain hourly values. For the DM models, a slightly different approach was used. As such, with K support points αLV,k and associated probabilities πk, a sequence of draws was generated that contained πk · N points with a value equal to exp(αLV,k), with k = 1, ..., K. The sample mean and standard deviation from this sequence were then used as estimates of the mean and standard deviation for the actual VTTS. For the results presented here, the value of N was set to 100,000, beyond which no visible differences were observed for σVTTS. Finally, along with the results for individual subsamples, the table also shows some overall measures, namely the average of the adjusted ρ2 measure, the average estimation time, and the average for µVTTS and σVTTS (together with a standard deviation of this mean across subsamples).
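The moment calculations described above can be reproduced in a few lines of Python; the lognormal formulas are exact for the MMNL model, and for the DM models the weighted-moment version shown here is a shortcut equivalent to the draw-based procedure used in the paper. The values passed in at the end are placeholders, not estimates from Table 1.

import numpy as np

def vtts_moments_mmnl(mu_alpha, sigma_alpha):
    # Hourly VTTS mean and standard deviation when alpha_LV ~ N(mu, sigma^2): lognormal moments.
    mean = np.exp(mu_alpha + sigma_alpha**2 / 2.0)
    sd = mean * np.sqrt(np.exp(sigma_alpha**2) - 1.0)
    return 60.0 * mean, 60.0 * sd

def vtts_moments_dm(alpha_points, pi):
    # Hourly VTTS mean and standard deviation for a discrete mixture over support points alpha_LV,k.
    v = 60.0 * np.exp(np.asarray(alpha_points))
    pi = np.asarray(pi)
    mean = pi @ v
    sd = np.sqrt(pi @ (v - mean) ** 2)
    return mean, sd

# Placeholder parameter values for illustration only:
print(vtts_moments_mmnl(-1.0, 0.9))
print(vtts_moments_dm([-0.5, -1.5], [0.2, 0.8]))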
The first observation that can be made from Table 1 is that all three mixture models offer significant improvements in model fit over the base MNL model, across all ten subsamples. Given the structural differences between the continuous and discrete mixture models, the comparison between these models is carried out using the adjusted ρ2 measure rather than the log-likelihood function. Here, we can see that, overall, DM(2) offers the best performance, ahead of DM(3) and the MMNL model. While the model with three support points always obtains slightly better model fit than the model with two support points, the gains are not large enough to be significant when taking into account the additional cost in terms of the number of parameters. In other words, the model with three support points is not able to retrieve significant amounts of additional heterogeneity when compared to the model with two support points. This can partly be seen as a reflection of the success of the model with two support points, but is also an illustration of the difficulties of estimating models with more than two support points, as alluded to in Section 2.
With three exceptions (samples 3, 9 and 10), the DM(2) model obtains the best performance across the three structures. Overall, the differences in performance between the DM(2) model and the MMNL model are very small, such that we now focus on other factors. Here, the first observation relates to the much lower estimation cost for the DM(2) model, with an average estimation time of one second, compared to seventy-five with the MMNL model. This much lower estimation cost would give the DM models a significant advantage in the case of larger datasets, where the absolute estimation times would be more substantial. Furthermore, the estimation time for the MMNL model was in this case kept low through the use of only 250 Halton draws in the estimation.
In terms of substantive results, the mean VTTS measures obtained by the three mixture models are significantly higher than the point estimate obtained with the MNL model. This is at least partly a result of the asymmetrical distribution of the VTTS in the mixture models. While there are also some differences between the three mixture models in the estimates for µVTTS, these are much smaller than the difference when compared to the MNL estimates. Finally, the estimate for σVTTS is much higher in the DM(3) model, while the estimates for the DM(2) model and the MMNL model are very similar.
3.3.2 Other results
Table 2 summarises the results for the various models estimated on the remaining five purpose segments. With very little variation across the ten subsamples, only the overall results are shown here. These in turn are very similar to those obtained on the data for shopping trips. As such, all three mixture models outperform the MNL model, where the best performance is
Commuters Education Leisure Other Visit
MNL
adj. ρ2: 0.1017 0.1282 0.1102 0.0888 0.1007
estimation time (s): 1 1 1 1 1
Mean VTTS (DKK/hour): 29.08 29.32 26.40 22.73 23.82
MMNL
adj. ρ2: 0.1263 0.1599 0.1395 0.1127 0.1294
estimation time (s): 131 51 74 107 127
Mean VTTS (DKK/hour): 39.51 37.28 37.62 34.76 35.83
Std.dev. VTTS: 35.90 29.24 38.43 39.85 39.61
DM(2)
adj. ρ2: 0.1291 0.1609 0.1433 0.1156 0.1337
estimation time (s): 2 1 1 2 2
Mean VTTS (DKK/hour): 39.78 36.96 37.03 34.43 37.36
Std.dev. VTTS: 30.50 24.01 28.03 29.17 36.28
DM(3)
adj. ρ2: 0.1279 0.1576 0.1412 0.1142 0.1326
estimation time (s): 4 1 2 3 4
Mean VTTS (DKK/hour): 39.78 37.04 37.03 34.43 37.18
Std.dev. VTTS: 30.50 24.36 28.03 29.17 36.16
Table 2: Summary of results for commuters, education trips, leisure trips, other purposes and visits
consistently obtained by the DM(2) model. Again, the DM(3) model is not able to retrieve significant levels of additional taste heterogeneity to warrant the estimation of two additional parameters. In fact, the estimates for µVTTS and σVTTS are almost universally equivalent across the two models (it is worth noting that, with the exception of the education segment, the adjusted ρ2 measure is higher for the DM(3) model than for the MMNL model). As in the case of shopping trips, the advantages of the DM models in terms of estimation time are again very significant, across all five purpose segments. Finally, while there are almost no differences in the estimates for µVTTS between the three different mixture models (where the estimates are again significantly higher than those for the MNL models), the estimates for σVTTS are now lower in the DM models, something that was not the case in the shopping segment.
4 Simulated data case studies
The application presented in Section 3 has shown the potential advantages of using a discrete mixture approach. However, it is clearly impossible to generalise these results, which could well be specific to the data at hand. For this, a systematic comparison between discrete and continuous mixture models is required; this is the topic of this section, which presents the findings of four case studies making use of simulated data.
In each of the four case studies, the generation of the data is based on the Danish VOT data used in the case study described in Section 3. Specifically, we use 10,776 observations from 1,347 respondents, and generate choices based on the attributes used in the original survey data. For each of the four different true models, ten sets of choices are generated for each observation, allowing us to gauge the stability of results across different samples. Unlike in the case study described in Section 3, we now work in preference space, with separate coefficients for travel time and travel cost. In each case, the travel cost coefficient is kept fixed while some random distribution is used for the travel time coefficient. Finally, the data generation was in each case carried out under the assumption of constant tastes across replications for the same individual, and the same approach was later used in model estimation.
In the first two case studies, the true model is a discrete mixture, while in the final two case studies, the true model is a continuous mixture. This allows us to gauge the relative difficulties of the two types of model in dealing with data for which the other model type is more appropriate.
Before proceeding to the discussion of the results, it should be noted that all MMNL models presented here make use of a Normal distribution. Attempts to use alternative continuous distribution functions, such as Johnson's SB, did not lead to consistent results on the data used here. While the findings from this analysis are thus limited to a comparison between a discrete mixture and a normal mixture, it should be remembered that the vast majority of MMNL studies make use specifically of this Normal distribution, such that the results are still relevant.
4.1 Case study 1: discrete mixture with two support points
The first case study makes use of data generated with the help of a discrete mixture model with two mass points for βT, at −1 and −0.5, with probabilities of 0.25 and 0.75 respectively. The travel cost coefficient is fixed at a value of −1, such that we obtain a mean VTTS of 37.5 DKK per hour with a standard deviation of 13.33 DKK per hour.
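The data generation for this case study can be sketched as follows: each individual's βT is drawn from the two-point distribution described above, βC is fixed at −1, and choices are simulated from the implied MNL probabilities. The attribute ranges and variable names in the Python sketch below are illustrative assumptions; the actual study generated choices from the attributes of the original Danish survey data.

import numpy as np

rng = np.random.default_rng(0)

def simulate_case1(n_people=1347, n_obs=8, n_alt=2):
    # True model of case study 1: beta_C = -1 and beta_T drawn once per individual
    # from {-1, -0.5} with probabilities {0.25, 0.75} (tastes constant across replications).
    # 1,347 x 8 = 10,776 observations, matching the sample size described in Section 4.
    beta_t = rng.choice([-1.0, -0.5], size=n_people, p=[0.25, 0.75])
    times = rng.uniform(10, 60, size=(n_people, n_obs, n_alt))   # minutes (illustrative)
    costs = rng.uniform(5, 50, size=(n_people, n_obs, n_alt))    # DKK (illustrative)
    v = beta_t[:, None, None] * times - costs                    # systematic utilities
    v -= v.max(axis=2, keepdims=True)                            # numerical stabilisation
    p = np.exp(v) / np.exp(v).sum(axis=2, keepdims=True)         # MNL probabilities
    choices = np.array([[rng.choice(n_alt, p=p[n, t]) for t in range(n_obs)]
                        for n in range(n_people)])
    return times, costs, choices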
The estimation results obtained on this dataset are presented in two parts. Table 3 presents detailed results for the first of the ten subsamples, while Table 4 summarises the results obtained across all ten subsamples. In addition to a basic MNL model, we estimated a MMNL model using a Normal distribution and a discrete mixture model with two support points on this dataset (no further gains in model performance were obtained by allowing for more than two support points). In both cases, we allowed for random variations in βC as well as βT. Consistent with the true model, no variations were observed for βC in the discrete mixture model, labelled DM(2)A, such that a second model, DM(2)B, was estimated, in which βC was kept fixed.
In a comparison between the three remaining models, MNL, MMNL and DM(2)B, we observe that the discrete mixture model outperforms the continuous mixture model, which in turn outperforms the MNL model. In terms of estimation time, DM(2)B has clear advantages over the MMNL model, and the higher estimation cost when compared to MNL is well justified on the basis of the improvements in model performance. All three models offer very good performance in retrieving the mean VTTS, while the two mixture models additionally offer good performance in the estimation of the standard deviation.
A final point deserves some special attention. As mentioned above, we initially allowed for random variation in βC as well as βT. The estimation of the first discrete mixture model, DM(2)A, offered no evidence of such heterogeneity, such that the model was replaced by DM(2)B. However, for the continuous mixture model, MMNL, we retrieved significant heterogeneity for βC as well as for βT, despite the fact that βC was kept fixed in the generation of the data. This offers clear evidence of confounding; by being unable to retrieve the correct patterns of heterogeneity for βT, the MMNL model explains part of the remaining error in the model through heterogeneity in βC. As such, while the model is able to correctly retrieve the mean and standard deviation of the VTTS, it does so by incorrectly indicating a variation across respondents in the sensitivity to changes in travel cost.

MNL MMNL DM(2)A DM(2)B
Final LL -4565.42 -4122.22 -4007.05 -4007.05
par. 2 4 8 5
adj. ρ2 0.3885 0.4476 0.4625 0.4629
est. time (s) 2 234 17 6
est. asy.t-rat. est. asy.t-rat. est. asy.t-rat. est. asy.t-rat.
βT -0.4081 -36.16 – – – – – –
βT,µ – – -0.6409 -36.28 – – – –
βT,σ – – 0.1553 10.31 – – – –
βT,1 – – – – -0.5050 -40.99 -0.5050 -40.99
πT,1 – – – – 0.7258 50.22 0.7258 50.22
βT,2 – – – – -1.0231 -40.81 -1.0231 -40.81
πT,2 – – – – 0.2742 18.97 0.2742 18.97
βC -0.6424 -34.09 – – – – -1.0083 -42.20
βC,µ – – -1.0613 -36.64 – – – –
βC,σ – – 0.2071 9.66 – – – –
βC,1 – – – – -1.0083 -12.20 – –
πC,1 – – – – 0.3035 0.00 – –
βC,2 – – – – -1.0083 -24.03 – –
πC,2 – – – – 0.6965 0.00 – –
µVTTS 38.11 37.81 38.50 38.50
σVTTS – 12.75 13.75 13.75
Table 3: Detailed estimation results for first subsample for first simulated dataset
The findings from Table 3 are confirmed by a graphical analysis of the shape of the distribution of βT in Figure 1, where this comparison is made possible by the fact that the mean estimate for βC is essentially equal to −1 in all models.
Due to space considerations, no detailed results are presented for the remaining nine subsamples. The results are available on request.

Figure 1: Cumulative distribution function for βT, first subsample for first simulated dataset

Nevertheless, the results presented in Table 4 give an indication of the stability of the results across the ten samples. As such, there is very little variation in terms of model performance (adj. ρ2), where the advantages of the DM model clearly remain, with the same applying in the case of estimation time. Finally, while for the mean VTTS, the results are very stable across datasets and models, the estimation of the MMNL models led to very high standard deviations for the VTTS measures in some of the subsamples, which is reflected in a higher mean value for σVTTS, along with greater variation across samples. This is a direct result of the incorrect patterns of heterogeneity retrieved for βC in these models, leading to a wider range for the VTTS.

MNL MMNL DM(2)A DM(2)B
adj. ρ2 (mean) 0.3919 0.4475 0.4601 0.4605
adj. ρ2 (std.dev.) 0.0053 0.0057 0.0057 0.0057
Est. time (s) (mean) 1.8 258.6 19.5 5.4
Est. time (s) (std.dev.) 0.42 16.96 4.72 0.52
µVTTS (mean) 37.94 37.88 38.22 38.22
µVTTS (std.dev.) 0.25 0.41 0.31 0.31
σVTTS (mean) – 21.89 13.25 13.25
σVTTS (std.dev.) – 16.43 0.38 0.38
Table 4: Summary of results across subsamples for first simulated dataset
4.2 Case study 2: discrete mixture with three support points
In the second case study, the true model is again a discrete mixture of a MNL model, where this time, three support points are used for βT, at −1, −0.7 and −0.4, with probabilities of 0.3, 0.35 and 0.35. This leads to a true mean VTTS of 41.1 DKK per hour, with a standard deviation of 14.48 DKK per hour. Four different models were estimated on these data; along with the usual MNL and MMNL models, we estimated a DM with two support points, and a DM with three support points (no further gains could be made by using more than three support points). Again, the DM models were estimated with two different specifications, using a randomly distributed βC coefficient in DM(2)A and DM(3)A, and a fixed βC coefficient in DM(2)B and DM(3)B. The detailed results for the first sample are presented in Table 5, while the overall results are summarised in Table 6.
The results show major improvements for the MMNL and various DM models when compared to the MNL model. All six models perform very well in terms of retrieving the mean VTTS, while the five mixture models also obtain a good approximation to the true standard deviation of the VTTS.
[Table 5: Detailed estimation results for the first subsample of the second simulated dataset, covering the MNL, MMNL, DM(2)A, DM(2)B, DM(3)A and DM(3)B specifications.]
MNL MMNL DM(2)A DM(2)B DM(3)A DM(3)B
adj. ρ2 (mean) 0.3690 0.4429 0.4457 0.4351 0.4460 0.4467
adj. ρ2 (std.dev.) 0.0051 0.0035 0.0039 0.0038 0.0037 0.0037
Est. time (s) (mean) 1.9 338.6 17.5 6.3 133.5 13.6
Est. time (s) (std.dev.) 1.1 8.4 2.32 1.7 56.4 2.4
µVTTS (mean) 40.93 40.74 40.87 40.75 40.79 40.78
µVTTS (std.dev.) 0.16 0.25 0.24 0.29 0.22 0.23
σVTTS (mean) – 14.62 14.43 13.77 14.35 14.30
σVTTS (std.dev.) – 0.25 0.25 0.27 0.21 0.24
Table 6: Summary of results across subsamples for second simulated dataset
We now look in more detail at the differences between the various mixture models. As was the case in the case study discussed in Section 4.1, the MMNL model again falsely recovers some random variation for βC, where the level of variation is however much lower than was the case in the first case study. When only allowing for two support points, the DM models also retrieve significant variation for βC, as reflected in the drop in model fit observed from DM(2)A to DM(2)B when constraining βC to a fixed value. This is no longer the case when using three support points. Finally, as was the case in Section 4.1, the DM models again have a significant advantage over the MMNL model in terms of estimation cost.
Figure 2 shows the cumulative distribution functions for βT in the
MMNL model, as well as in DM(2)A and DM(2)B. The advantages of
the DM models are again very obvious, especially in the case of the model
with three support points.
The results from Table 6 show very stable performance across the ten samples, for all four indicators. The fact that, unlike in the first case study (cf. Table 4), the estimate for σVTTS in the MMNL model is now very stable can be explained by the lower coefficient of variation for βC in the MMNL estimates in the second case study.
Figure 2: Cumulative distribution function for βT, first subsample for second simulated dataset
4.3 Case study 3: Normal mixture
For the third case study, a MMNL model with a normally distributed travel
time coefficient was chosen as the true model. Specifically, βC is still fixed
to a value of −1, while βT now follows a Normal distribution with mean
of −0.8 and a standard deviation of 0.3, leading to a mean VTTS of 48
DKK/hour, with a standard deviation of 18 DKK.
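As a short worked check (ours; it assumes travel time is measured in minutes and cost in DKK, so that the ratio of coefficients is scaled by 60 to obtain DKK/hour, consistent with the reported figures):

\mathrm{VTTS} = 60\,\beta_T/\beta_C, \qquad E[\mathrm{VTTS}] = 60 \cdot \tfrac{-0.8}{-1} = 48~\mathrm{DKK/hour}, \qquad \sigma_{\mathrm{VTTS}} = 60 \cdot \tfrac{0.3}{|-1|} = 18~\mathrm{DKK/hour}.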
The results for the first subsample of the third simulated dataset are
summarised in Table 7. A slightly different strategy was employed in the
              MNL               MMNLA             MMNLB             DM(5)A            DM(5)B            DM(6)A            DM(6)B
Final LL      -4742.06          -3912.57          -3913.9           -3910.2           -3913.54          -3908.43          -3908.61
par.          2                 4                 3                 14                11                16                13
adj.ρ2        0.3649            0.4756            0.4756            0.4746            0.4746            0.4746            0.4750
est.time (s)  1                 341               233               143               41                141               59
              (estimate with asy. t-ratio in parentheses)
βT            -0.4008 (-32.11)  –                 –                 –                 –                 –                 –
βT,µ          –                 -0.8359 (-36.67)  -0.8329 (-36.69)  –                 –                 –                 –
βT,σ          –                 0.3134 (25.27)    0.3113 (25.03)    –                 –                 –                 –
βT,1          –                 –                 –                 -0.1343 (-1.99)   -0.1867 (-4.82)   -0.0859 (-1.55)   -0.1293 (-1.88)
πT,1          –                 –                 –                 0.0372 (2.34)     0.0566 (3.82)     0.0245 (2.55)     0.0362 (2.32)
βT,2          –                 –                 –                 -0.4585 (-13.87)  -0.5071 (-17.37)  -0.3621 (-8.79)   -0.4449 (-14.26)
πT,2          –                 –                 –                 0.1838 (6.55)     0.2326 (9.14)     0.0742 (1.90)     0.1710 (6.66)
βT,3          –                 –                 –                 -0.7021 (-22.49)  -0.7904 (-28.13)  -0.7102 (-20.65)  -0.6800 (-24.56)
πT,3          –                 –                 –                 0.2231 (3.96)     0.3529 (9.63)     0.2026 (3.28)     0.2247 (4.71)
βT,4          –                 –                 –                 -1.1872 (-29.66)  -1.1208 (-26.75)  -0.9006 (-19.61)  -0.8830 (-20.63)
πT,4          –                 –                 –                 0.2905 (7.70)     0.3296 (9.67)     0.2653 (5.11)     0.2597 (5.79)
βT,5          –                 –                 –                 -0.8964 (-19.79)  -1.6253 (-20.78)  -0.5177 (-10.19)  -1.1502 (-29.93)
πT,5          –                 –                 –                 0.2654 (5.25)     0.0283 (2.85)     0.1441 (3.01)     0.2818 (7.47)
βT,6          –                 –                 –                 –                 –                 -1.1902 (-29.58)  -1.6423 (-21.24)
πT,6          –                 –                 –                 –                 –                 0.2893 (7.61)     0.0267 (3.10)
βC            -0.4999 (-30.13)  –                 -1.0267 (-38.53)  –                 -1.0135 (-37.67)  –                 -1.0213 (-37.66)
βC,µ          –                 -1.0254 (-38.58)  –                 –                 –                 –                 –
βC,σ          –                 0.0080 (0.47)     –                 –                 –                 –                 –
βC,1          –                 –                 –                 -0.7467 (-18.57)  –                 -0.7485 (-18.50)  –
πC,1          –                 –                 –                 0.0862 (2.67)     –                 0.0861 (2.68)     –
βC,2          –                 –                 –                 -1.0545 (-34.46)  –                 -1.0569 (-34.41)  –
πC,2          –                 –                 –                 0.9138 (28.31)    –                 0.9139 (28.42)    –
µVTTS         48.10             48.93             48.68             48.96             48.72             48.77             48.81
σVTTS         –                 18.34             18.20             18.08             18.15             18.15             18.15
Table 7: Detailed estimation results for first subsample for third simulated dataset
model estimation in this case study. From the experience of the first two
case studies, it had to be assumed that some of the distribution of βT would
erroneously be picked up as heterogeneity in βC. This would apply espe-
cially in the discrete mixture models with a low number of support points.
As such, alongside the MNL model, two different MMNL models were es-
timated, one with βC kept fixed, and one with a randomly distributed βC.
In the discrete mixture models, 2 support points were used for βC, while
the number of support points for βT was gradually increased up to the
point where no heterogeneity was retrieved for βC, i.e. the random taste
heterogeneity in the data is captured correctly by βT on its own. It was
found that this point was reached between five and six support points for
βT. No further gains in model performance could be obtained by increas-
ing the number of support points for βT any further, independently of the
treatment of βC.
Again, all the different models offer good performance in retrieving the
true mean value of the VTTS, while the various mixture models addition-
ally offer a good approximation to the true standard deviation. The six
mixture models offer significant improvements in model performance when
compared to the MNL model. As in the other examples, the DM mod-
els again have computational advantages over the MMNL model. Given the
results from the other case studies, it is of interest to look at the issue
of confounding between the heterogeneity for βT and βC. In the MMNL
model and the DM model with six support points, the reductions in model
fit resulting from using a fixed βC coefficient are not significant. With only
five support points, the drop in model fit is slightly more visible (DM(5)A
vs DM(5)B), yet still not significant when taking into account the cost of
estimating three additional parameters. However, in earlier models, using
fewer than five support points for βT, this was not the case, and there was
significant confounding.9
Finally, it is of interest to look at the specific patterns of heterogene-
ity retrieved by the discrete mixture models, where we focus on MMNLB,
DM(5)A and DM(6)B. Here, it can be seen from Figure 3 that the two DM
models offer a very good approximation to the Normal distribution.
9 Detailed results available on request.
Figure 3: Cumulative distribution function for βT, first subsample for third simulated dataset
The average results across the ten subsamples are summarised in Table
8. The results show that, on average (and unlike in the first subsample),
not allowing for heterogeneity in βC leads to a minor drop in the adjusted
ρ2 measure for the MMNL model and the DM(5) model. This is however
again not the case for DM(6), showing that six support points are sufficient
to retrieve the true heterogeneity in the data. In terms of estimation time,
the DM mixtures retain their advantage, even with a higher number of
support points. Finally, the results for the mean and standard deviation of
the VTTS are very stable across subsamples.
                     MNL     MMNLA   MMNLB   DM(5)A  DM(5)B  DM(6)A  DM(6)B
adj.ρ2       mean    0.3701  0.4724  0.4722  0.4715  0.4710  0.4714  0.4718
             std.dev 0.0057  0.0057  0.0057  0.0057  0.0055  0.0057  0.0056
Est.time (s) mean    1       338.1   242.9   124.8   45.3    172     63.7
             std.dev 0.0     7.9     25.5    31.27   8.3     30.4    14.4
µVTTS        mean    47.71   48.71   48.58   48.70   48.61   48.64   48.60
             std.dev 0.35    0.40    0.41    0.48    0.49    0.52    0.49
σVTTS        mean    –       17.63   17.63   17.68   17.46   17.60   17.60
             std.dev –       0.48    0.48    0.45    0.49    0.43    0.45
Table 8: Summary of results across subsamples for third simulated dataset
4.4 Case study 4: Mixture of two Normals
For the fourth case study, a more complex mixture was used. As such,
the true distribution is now a mixture of two Normal distributions, where
βT = π1 βT1 + π2 βT2, with π1 = π2 = 0.5, and with βT1 ∼ N(−0.8, 0.2)
and βT2 ∼ N(−0.3, 0.1). The cost coefficient βC was again kept fixed at
−1. With this, we obtain a true mean VTTS of 33 DKK/hour, with a
standard deviation of 17.76 DKK. In model estimation, the strategy from
the third case study was again adopted, gradually increasing the number
of support points for βT in the DM models, while maintaining the number
of support points for βC fixed at 2. Again, the issue of confounding largely
disappeared when using five or more support points.
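The reported moments for this mixture can be verified with a short calculation (ours; it interprets the second argument of N(·, ·) as a standard deviation and uses the same minutes-to-hours scaling as in the previous case study):

E[\beta_T] = 0.5(-0.8) + 0.5(-0.3) = -0.55 \;\Rightarrow\; E[\mathrm{VTTS}] = 60 \cdot 0.55 = 33~\mathrm{DKK/hour},

\mathrm{Var}[\beta_T] = 0.5(0.2^2 + 0.8^2) + 0.5(0.1^2 + 0.3^2) - 0.55^2 = 0.0875 \;\Rightarrow\; \sigma_{\mathrm{VTTS}} = 60\sqrt{0.0875} \approx 17.75~\mathrm{DKK},

which matches the reported 17.76 DKK up to rounding.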
The results for the first subsample are presented in Table 9, with Table
10 presenting a summary of the results across all ten subsamples. Along
with the MNL model, two MMNL models were estimated, where MMNLA
and MMNLB again differ by using a randomly distributed and fixed βC
coefficient respectively. Although the standard deviation for βC is signifi-
cantly different from zero in model MMNLA, it is very small compared to
the mean value, such that it is no surprise that the effect of using a fixed
coefficient is very small, with very similar model performance for MMNLB.
In the DM models, we experience a very small, and insignificant drop in
model fit when constraining βC to a single value. Here, two further obser-
vations can be made. In model DM(5)A, the difference between βC,1 and
βC,2 is not significant beyond the 48% level of confidence, while, in model
              MNL               MMNLA             MMNLB             DM(5)A            DM(5)B            DM(6)A            DM(6)B
Final LL      -5296.84          -4405.54          -4406.11          -4359.07          -4363.23          -4359.03          -4363.23
par.          2                 4                 3                 14                11                16                13
adj.ρ2        0.2906            0.4096            0.4097            0.4145            0.4144            0.4143            0.4141
est.time (s)  7                 341               213               197               33                174               75
              (estimate with asy. t-ratio in parentheses)
βT            -0.2153 (-27.80)  –                 –                 –                 –                 –                 –
βT,µ          –                 -0.5115 (-30.23)  -0.5155 (-30.62)  –                 –                 –                 –
βT,σ          –                 0.2930 (29.23)    0.2954 (29.82)    –                 –                 –                 –
βT,1          –                 –                 –                 -0.0844 (-1.33)   -0.0787 (-1.27)   -0.0855 (-1.36)   -0.0787 (-1.27)
πT,1          –                 –                 –                 0.0377 (1.42)     0.0363 (1.47)     0.0385 (1.41)     0.0363 (1.47)
βT,2          –                 –                 –                 -0.2831 (-7.90)   -0.2713 (-16.19)  -0.2835 (-7.01)   -0.2713 (-16.19)
πT,2          –                 –                 –                 0.3461 (0.98)     0.3761 (6.32)     0.3395 (0.77)     0.3761 (6.32)
βT,3          –                 –                 –                 -0.9833 (-20.76)  -0.3788 (-11.94)  -0.3216 (-3.95)   -0.3788 (-11.94)
πT,3          –                 –                 –                 0.1900 (4.71)     0.0921 (1.46)     0.1102 (0.25)     0.0921 (1.46)
βT,4          –                 –                 –                 -0.3253 (-4.58)   -0.6543 (-28.15)  -0.7247 (-9.82)   -0.4676 (0.00)
πT,4          –                 –                 –                 0.1047 (0.29)     0.2848 (10.16)    0.0749 (0.33)     0.0000 (0.00)
βT,5          –                 –                 –                 -0.6690 (-20.61)  -0.9546 (-30.00)  -0.6555 (-11.72)  -0.6543 (-28.15)
πT,5          –                 –                 –                 0.3215 (8.36)     0.2107 (8.94)     0.2508 (1.13)     0.2848 (10.16)
βT,6          –                 –                 –                 –                 –                 -0.9842 (-21.79)  -0.9546 (-30.00)
πT,6          –                 –                 –                 –                 –                 0.1860 (4.77)     0.2107 (8.94)
βC            -0.4194 (-31.89)  –                 -0.9721 (-37.89)  –                 -0.9737 (-37.40)  –                 -0.9737 (-37.40)
βC,µ          –                 -0.9718 (-38.17)  –                 –                 –                 –                 –
βC,σ          –                 0.0352 (1.97)     –                 –                 –                 –                 –
βC,1          –                 –                 –                 -1.0781 (-22.66)  –                 -0.8725 (-15.68)  –
πC,1          –                 –                 –                 0.5965 (4.03)     –                 0.3572 (1.70)     –
βC,2          –                 –                 –                 -0.8801 (-20.83)  –                 -1.0662 (-18.58)  –
πC,2          –                 –                 –                 0.4035 (2.73)     –                 0.6428 (3.06)     –
µVTTS         30.80             31.60             31.82             32.60             32.49             32.64             32.49
σVTTS         –                 18.15             18.23             17.39             17.10             17.38             17.10
Table 9: Detailed estimation results for first subsample for fourth simulated dataset
DM(6)A, it is not significant beyond the 50% level of confidence. It can
also be seen that, on average, when moving from DM(5)A to DM(5)B and
from DM(6)A to DM(6)B, the standard errors associated with the various
πT,k parameters decrease. Finally, model DM(6)B can be seen to reduce
to model DM(5)B; the additional support point, as well as its associated
probability, are not significantly different from zero. All seven models again
offer good performance in the retrieval of the true mean VTTS, where the
six mixture models also perform well for the standard deviation. The DM
models maintain their advantages in terms of estimation cost, where these
are naturally smaller than before given the higher number of parameters.
In terms of model performance, the MMNL models clearly outperform the
MNL model, while the various DM models have a small advantage over the
MMNL models. The results from Table 10 again show stable performance
over the ten subsamples.
When looking at the retrieval of the true shape for the distribution of βT,
it can be seen that the MMNL models using a single Normal distribution
produce a mean that is the weighted average of the mean of the two Normal
distributions. The DM models on the other hand do recover the multi-
modality of the true distribution.10 These findings are reflected in the shape
of the distributions for βT in Figure 4, where the DM models (DM(5)A
and DM(6)B) are better able to account for the multi-modality of the true
distribution.
In closing, it should be noted that, in this example, the uni-modal
MMNL model still manages to retrieve the true mean and standard de-
viation of the multi-modal true distribution of the VTTS. This can be
explained by the fact that the probabilities for the two Normal distribu-
tions were set evenly to 0.5, where the difference in the standard deviation
for βT1 and βT2 was also rather small. Different patterns could be expected
in a more asymmetrical scenario.
10 It should be noted that, in the retained DM model, DM(5)B, two of the probabilities
for support points, πT,1 and πT,3, are only significant at the 85% level of confidence.
Figure 4: Cumulative distribution function for βT, first subsample for fourth simulated dataset
5 Summary and Conclusions
With the availability of powerful computers and estimation tools, researchers
and practitioners are increasingly making use of continuous mixture struc-
tures, such as Mixed Logit, in the representation of random taste hetero-
geneity across respondents. Despite the gains in estimation power, the cost
of using such mixture models remains high, especially in large scale stud-
ies. Furthermore, several issues arise due to the models’ reliance on specific
distribution functions, whose shape is not necessarily consistent with that
                     MNL     MMNLA   MMNLB   DM(5)A  DM(5)B  DM(6)A  DM(6)B
adj.ρ2       mean    0.2896  0.4082  0.4082  0.4118  0.4115  0.4116  0.4118
             std.dev 0.0036  0.0041  0.0041  0.0045  0.0040  0.0046  0.0045
Est.time (s) mean    8.8     363.7   235.4   142.1   40.8    168     59.7
             std.dev 2.4     14.8    10.9    37.28   10.0    18.0    11.8
µVTTS        mean    31.32   31.90   32.27   32.87   32.74   32.83   32.76
             std.dev 0.38    0.48    0.44    0.51    0.46    0.46    0.47
σVTTS        mean    –       18.48   18.67   17.71   17.58   17.73   17.66
             std.dev –       0.33    0.31    0.31    0.35    0.29    0.34
Table 10: Summary of results across subsamples for fourth simulated dataset
of the true, unobserved distribution.
In this paper, we have discussed an alternative approach for the repre-
sentation of random taste heterogeneity, making use of discrete mixtures
instead of continuous mixtures. Although several issues can also arise in the
estimation of such models, they have the advantage of a closed form solu-
tion, and can hence be estimated and applied without relying on simulation
processes. Furthermore, the models are free from a priori assumptions as
to the shape of the true distribution.
The paper presents several case studies offering an in-depth comparison
of the two modelling approaches, making use of real data as well as four
separate simulated datasets. The results of these analyses clearly show the
major advantage of the discrete mixture approach in terms of estimation
cost. They also show that, across scenarios, the discrete mixture models are
able to attain similar or indeed better performance than their continuous
counterparts. Finally, they are better able to deal with complicated true
distributions, such as the presence of multiple modes.
Although further comparisons between the two modelling approaches
are required, the results from this paper do suggest that discrete mixture
models present a viable alternative, partly thanks to their lower cost in
estimation and application, but also due to the absence of a priori shape
assumptions, which is of great interest in the context of recent discussions
of the issue of the specification of continuous heterogeneity by Hess et al.
(2005).
Acknowledgements
Part of the work described in this paper was carried out during a guest
stay by the first author in the Institute of Transport and Logistics Studies
at the University of Sydney.
References
Bierlaire, M. (2003). BIOGEME: a free package for the estimation of
discrete choice models, Proceedings of the 3rd Swiss Transport Re-
search Conference, Monte Verità, Ascona.
Burge, P. and Rohr, C. (2004). DATIV: SP Design: Proposed approach
for pilot survey, Tetra-Plan in cooperation with RAND Europe and
Gallup A/S.
Chintagunta, P., Jain, D. and Vilcassim, N. (1991). Investigating hetero-
geneity in brand preference in logit models for panel data, Journal of
Marketing Research 28: 417–428.
Cirillo, C. and Axhausen, K. W. (2006). Evidence on the distribution of
values of travel time savings from a six-week diary, Transportation
Research Part A: Policy and Practice 40(5): 444–457.
Dong, X. and Koppelman, F. S. (2003). Mass Point Mixed Logit Model:
Development and Application, paper presented at the 10th Interna-
tional Conference on Travel Behaviour Research,
Lucerne.
Fosgerau, M. (2004). Nonparametric and semiparametric estimation of
the distribution of the value of travel time, paper presented at the
European Transport Conference, Strasbourg.
Fosgerau, M. and Bierlaire, M. (2006). Discrete choice models with multi-
plicative error terms, Technical Report TRANSP-OR 060831, Trans-
port and Mobility Laboratory, School of Architecture, Civil and En-
vironmental Engineering, Ecole Polytechnique Fédérale de Lausanne.
Gopinath, D. (1995). Modeling Heterogeneity in Discrete Choice Pro-
cesses: Application to Travel Demand, PhD thesis, MIT, Cam-
bridge, MA.
Greene, W. H. and Hensher, D. A. (2003). A latent class model for discrete
choice analysis: contrasts with mixed logit, Transportation Research
Part B: Methodological 37(8): 681–698.
Hess, S., Bierlaire, M. and Polak, J. W. (2005). Estimation of value of
travel-time savings using mixed logit models, Transportation Re-
search Part A: Policy and Practice 39(2-3): 221–236.
Hess, S., Train, K. and Polak, J. W. (2006). On the use of a Modified
Latin Hypercube Sampling (MLHS) method in the estimation of a
Mixed Logit model for vehicle choice, Transportation Research Part
B: Methodological 40(2): 147–163.
Kamakura, K. W. and Russell, G. (1989). A probabilistic choice model for
market segmentation and elasticity structure, Journal of Marketing
Research 26: 379–390.
Lee, B. J., Fujiwara, A., Zhang, J. and Sugie, Y. (2003). Analysis of Mode
Choice Behaviours based on Latent Class Models, paper presented
at the 10th International Conference on Travel Behaviour Research,
Lucerne.
Wedel, M., Kamakura, W., Arora, N., Bemmaor, A., Chiang, J., Elrod, T.,
Johnson, R., Lenk, P., Neslin, S. and Poulsen, C. S. (1999). Discrete
and continuous representations of unobserved heterogeneity in choice
modeling, Marketing Letters 10(3): 219–232.
Estimating nonparametric mixed logit models
via EM algorithm
Daniele Pacifico
Italian Department of the Treasury
daniele.pacifico@tesoro.it
Abstract. The aim of this paper is to describe a Stata routine for the nonparamet-
ric estimation of mixed logit models with an Expectation-Maximisation algorithm
proposed in Train (2008). We also show how to use the Stata command lclogit,
which performs the estimation automatically.
Keywords: st0001, lclogit, latent classes, EM algorithm, mixed logit
1 Introduction
In a recent contribution, Train (2008) has shown how EM algorithms can be used for
the nonparametric estimation of mixing distributions in discrete choice models. In this
paper we consider one of the three nonparametric methods he proposes and show how
it can be implemented in Stata. The method presented here allows estimating discrete
mixing distributions with mass points and share probabilities as parameters. Therefore,
the nonparametric estimation is based on a typical latent class model and is achieved
by increasing the number of mass points of each coefficient so as to approximate their
mixing distributions.
Traditionally, latent class models have been estimated using gradient-based algo-
rithms, such as Newton-Raphson or BHHH. However, the estimation based on standard
optimization procedures becomes difficult when the number of mass points increases, as
the higher the number of latent classes the more difficult the empirical inversion of the
Hessian matrix, with the possibility of singularity at some iterations. In such a situation
an EM algorithm could help as it implies the repeated evaluation of a function that is
far easier to maximize. Moreover, EM recursions have proved to be particularly stable
and – under conditions given by Boyles (1983) and Wu (1983) – they always climb
uphill until convergence to a local maximum.
The paper is structured as follows: section 2 presents a mixed logit model based on
discrete mixed distributions; section 3 shows how this model can be estimated via EM
algorithm; section 4 presents a detailed step-by-step description of EM estimation in
Stata; section 5 introduces the Stata command lclogit; section 6 contains an empirical
application based on accessible data; section 7 concludes.
2 A mixed logit model with discrete mixing distributions
Assume there are N agents who face J alternatives on T choice occasions. Agents
choose the alternative that maximizes their utility in each choice occasion. The random
utility of agent i from choosing alternative j in period t is defined as follows:
U_{ijt} = \beta_i x_{ijt} + \varepsilon_{ijt}    (1)
Where xijt is a vector of alternative-specific attributes and εijt is a stochastic term,
which is assumed to be distributed IID extreme value. Importantly, each βi is assumed
to be random with unconditional density f(β |ϑ), where ϑ collects the parameters that
characterize the distribution.
Conditional on knowing βi the probability of the observed sequence of choices for
agent i is given by the traditional McFadden’s choice model:1
Pr_i(\beta) = \prod_{t=1}^{T} \prod_{j=1}^{J} \left( \frac{\exp(\beta_i x_{ijt})}{\sum_{k=1}^{J} \exp(\beta_i x_{ikt})} \right)^{d_{ijt}}    (2)
Where dijt is a dummy that selects the chosen alternative in each choice occasion.
However, since βi is unknown the conditional probability of the sequence of observed
choices has to be evaluated for any possible value of βi. Hence, assuming that f(β |ϑ)
has a continuous distribution, the unconditional probability becomes:
Pr_i(\vartheta) = \int \prod_{t=1}^{T} \prod_{j=1}^{J} \left( \frac{\exp(\beta x_{ijt})}{\sum_{k=1}^{J} \exp(\beta x_{ikt})} \right)^{d_{ijt}} f(\beta\,|\,\vartheta)\, d\beta    (3)
Typically, the log likelihood function derived from this model is estimated with simula-
tion methods.2
If the distribution of each βi is discrete, the probability in equation 3 becomes:
Pr_i(\vartheta) = \sum_{c=1}^{C} \pi_c \prod_{t=1}^{T} \prod_{j=1}^{J} \left( \frac{\exp(\beta_c x_{ijt})}{\sum_{k=1}^{J} \exp(\beta_c x_{ikt})} \right)^{d_{ijt}}    (4)
Where πc = f(βc|ϑ) represents the share of the population with coefficients βc.
Equation 4 is a typical latent class model. Nevertheless, here we follow the classifica-
tion proposed by McFadden and Train (2000) and define it as a discrete mixed model,
in order to emphasize the similarities with the continuous mixed model of equation 3.
The estimation of discrete mixed models is usually based on standard gradient-based
methods. However, these methods often fail to achieve convergence when the number of
latent classes becomes high. In this case an EM algorithm could help, as it requires the
repeated maximization of a function that is far simpler with respect to the log likelihood
derived from equation 4.
1. See McFadden (1973).
2. See Train (2003). In Stata, continuous mixed logit models can be estimated with Simulated
Maximum Likelihood through the program mixlogit, written by Hole (2007).
3 An EM algorithm for the estimation of mixed logit
models with discrete mixing distributions
EM algorithms were initially proposed in the literature to deal with missing data prob-
lems. Nevertheless, they turned out to be an efficient method to estimate latent class
models, where the missing information consists of the class share probabilities. Nowa-
days, they are widely used in many economic fields where the assumption that people
can be grouped in classes with different unobserved preference heterogeneity is reason-
able.
The recursion is known as “E-M” because it consists of two steps, namely an “Expec-
tation” and a “Maximization”. As explained in Train (2008), the term to be maximized
is the expectation of the missing-data log likelihood – i.e. the joint density of the ob-
served choice and the missing data – whilst the expectation is over the distribution of the
missing information, conditional on the density of the data and the previous parameter
estimates.
Consider the conditional logit model with discrete mixing distributions outlined in
the previous section. Following equation 4, the log likelihood is defined as:
LL = \sum_{i=1}^{N} \ln \sum_{c=1}^{C} \pi_c \prod_{t=1}^{T} \prod_{j=1}^{J} \left( \frac{\exp(\beta_c x_{ijt})}{\sum_{k=1}^{J} \exp(\beta_c x_{ikt})} \right)^{d_{ijt}}    (5)
Which can be maximized by means of standard, gradient-based optimization methods.
However, the same log likelihood can also be maximized by repeatedly updating the
following recursion:
\eta^{s+1} = \arg\max_{\eta} \sum_i \sum_c C_i(\eta^s) \ln\!\left[ \pi_c \prod_t \prod_j \left( \frac{\exp(\beta_c x_{ijt})}{\sum_{k=1}^{J} \exp(\beta_c x_{ikt})} \right)^{d_{ijt}} \right] = \arg\max_{\eta} \sum_i \sum_c C_i(\eta^s) \ln(L_i \,|\, \mathrm{class}_i = c)    (6)
Where η is a vector that contains the whole set of parameters to be estimated – i.e.
those that enter the probability of the observed choice plus those that may define the
class shares – Li is the missing-data likelihood function and Ci(η^s) is the conditional
probability that household i belongs to class c, which depends on the density of the
data and the previous value of the parameters.
This conditional probability – Ci(η^s) – is the key feature of the EM recursion and can
be computed by means of Bayes’ theorem:
C_i(\eta^s) = \frac{L_i \,|\, \mathrm{class}_i = c}{\sum_{c=1}^{C} L_i \,|\, \mathrm{class}_i = c}    (7)
Now, given that:
\ln(L_i \,|\, \mathrm{class}_i = c) = \ln \pi_c + \ln \prod_{t=1}^{T} \prod_{j=1}^{J} \left( \frac{\exp(\beta_c x_{ijt})}{\sum_{k=1}^{J} \exp(\beta_c x_{ikt})} \right)^{d_{ijt}}    (8)
The recursion in equation 6 can be split into the following steps:
1. Form the contribution to the likelihood (Li | classi = c) as defined in equation 6
for each class;3
2. Form the individual-specific conditional probabilities of class membership as in
equation 7;
3. Following equation 8, update the sets of βc, c = 1, 2, …,C by maximizing – for
each class – a conditional logit model with weighted observations, with weights
given by the conditional probabilities of class membership:
\beta_c^{s+1} = \arg\max_{\beta} \sum_{i=1}^{N} C_i(\eta^s) \ln \prod_{t=1}^{T} \prod_{j=1}^{J} \left( \frac{\exp(\beta_c x_{ijt})}{\sum_{k=1}^{J} \exp(\beta_c x_{ikt})} \right)^{d_{ijt}}    (9)
4. Maximize the other part of the log likelihood in equation 6 and get the updated
vector of class shares:
\pi^{s+1} = \arg\max_{\pi} \sum_{i=1}^{N} \sum_{c=1}^{C} C_i(\eta^s) \ln \pi_c    (10)
• If the class share probabilities depend on a vector of demographics – zi – the
corresponding parameters are updated as:
\alpha^{s+1} = \arg\max_{\alpha} \sum_{i=1}^{N} \sum_{c=1}^{C} C_i(\eta^s) \ln \frac{\exp(\alpha_c z_i)}{\sum_{c} \exp(\alpha_c z_i)}, \qquad \alpha_C = 0    (11)
This is a grouped-data log likelihood, where we have used a logit specification
so as to constrain the estimated class share probabilities to the right range.4
The updated class share probabilities – πc, c = 1, 2, …,C – are then computed
as:
\pi_c^{s+1} = \frac{\exp(\hat{\alpha}_c^{s+1} z_i)}{\sum_{c} \exp(\hat{\alpha}_c^{s+1} z_i)}, \qquad c = 1, 2, \ldots, C    (12)
Which, in turn, allows updating the conditional probabilities of class membership
by means of the new vectors β_c^{s+1} and π_c^{s+1}, c = 1, 2, …,C.
• If the class share probabilities do not depend on demographics the empirical
maximization of the function in equation 11 can be avoided, as its analytical
solution would be given by:
\pi_c^{s+1} = \frac{\sum_i C_i(\eta^{s+1})}{\sum_i \sum_c C_i(\eta^{s+1})}, \qquad c = 1, \ldots, C    (13)
3. For the first iteration, starting values have to be used for the densities that enter the model. Note
that these starting values must be different in every class. Otherwise, the recursion estimates the
same set of parameters for all the classes.
4. Differently from the βcs, the vectors αc, c = 1, 2, …,C are jointly estimated. This is needed in
order to ensure that \sum_c \pi_c = 1.
Where the updated conditional probabilities – Ci(η^{s+1}) – are computed by
using the updated values of βc, c = 1, 2, …,C and the previous values of the
class shares.
5. Once the conditional probabilities of class membership have been updated – either
in models with or without covariates on zi – the recursion can start again from
point 3 until convergence.
4 A Stata routine for the estimation of mixed logit models with discrete mixing distributions
In this section we show how the EM algorithm outlined above can be coded into Stata.
We propose a detailed step-by-step procedure that can be easily implemented in a simple
do file and work with accessible data from Huber and Train (2001) on household’s choice
of electricity supplier. Note that this is the same database used by Hole (2007) for an
application of his Stata program mixlogit, which performs the estimation of parametric
mixed logit models via Simulated Maximum Likelihood.
The data contain information on 100 residential electricity customers, who were
asked up to 12 choice experiments.5 In each experiment the customer was asked which
of four hypothetical electricity suppliers he/she would prefer.
The following characteristics of each offer were stated:
• The price of the contract (in cents per kWh) whenever the supplier offers a contract
with a fixed rate (price)
• The length of contract that the supplier offered, expressed in years (contract)
• Whether the supplier is a local company (local)
• Whether the supplier is a well-known company (wknown)
• Whether the supplier offers a time-of-day rate instead of a fixed rate (tod)
• Whether the supplier offers a seasonal rate instead of a fixed rate (seasonal)
Each customer is identified by the variable pid. For each customer, the variable gid
identifies a given choice occasion, while the dummy variable y identifies the stated choice
in each choice occasion.
The data setup required for estimating the model is as follows:
. use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
. list in 1/12, sepby(gid)
5. Since some customers stopped before answering all 12 experiments, there is a total of 1195 choice
occasions in the sample.
y price contract local wknown tod seasonal gid pid
1. 0 7 5 0 1 0 0 1 1
2. 0 9 1 1 0 0 0 1 1
3. 0 0 0 0 0 0 1 1 1
4. 1 0 5 0 1 1 0 1 1
5. 0 7 0 0 1 0 0 2 1
6. 0 9 5 0 1 0 0 2 1
7. 1 0 1 1 0 1 0 2 1
8. 0 0 5 0 0 0 1 2 1
9. 0 9 5 0 0 0 0 3 1
10. 0 7 1 0 1 0 0 3 1
11. 0 0 0 0 1 1 0 3 1
12. 1 0 0 1 0 0 1 3 1
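As a quick sanity check on this structure (a sketch added here, not part of the original routine; it only uses the variables gid and y described above), one can verify that every choice occasion contains four alternatives and exactly one stated choice:
. sort gid
. by gid: gen _nalt = _N              // number of alternatives in each choice occasion
. assert _nalt == 4                   // four hypothetical suppliers per occasion
. by gid: egen byte _chk = total(y)   // number of chosen alternatives per occasion
. assert _chk == 1                    // exactly one stated choice per occasion
. drop _nalt _chk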
The next subsection presents the steps for estimating a model with covariates on the
class share probabilities. Alternatively, the optimization of the algorithm when only a
constant term is included among the class probabilities is presented in subsection 4.2.
4.1 A model with covariates on the class share probabilities
In order to present a flexible routine we work with global variables, so that the code
can be easily adapted to other databases. The dependent variable is called $depvar,
the list of covariates that enter the probability of the observed choice $varlist; the
list of variables that enter the grouped-data log likelihood $varlist2; the variable that
identifies the panel dimension – i.e. the choice makers – $id; the variable that defines
the choice situations for each choice maker $group. We also define the number of latent
classes $nclasses and the number of maximum iterations $niter.6
. **(1) Set the estimation setup**
. global depvar “y”
. global varlist “price contract local wknown tod seasonal”
. gen _con=1
. global varlist2 “_con x2”
. global id “pid”
. global group “gid”
. global nclasses “2”
. global niter “35”
In order to compute the starting values, we randomly split the sample into C different
sub-samples – one for each class – and estimate a separate clogit for each of them.7
6. In the following estimation setup the covariate that we included in $varlist2 – x2 – can be created
manually, as the database does not contain individual-level variables.
7. If the same starting values were used for all the classes the EM algorithm would perform the same
computations for each class and return the same results at each iteration.
After each clogit estimation we use the command predict to obtain the probability
of every alternative in each class and we store them in the variables l1, l2,…, lC.
As for the starting values for the probability of class membership we simply define
equal shares, that is 1/C:
. **(2) Split the sample**
. bysort $id: gen double _p=runiform() if _n==_N
. bysort $id: egen double _pr=sum(_p)
. local prop 1/$nclasses
. gen double _ss=1 if _pr<=`prop'
. forvalues s=2/$nclasses {
. replace _ss=`s' if _pr>(`s'-1)*`prop' & _pr<=`s'*`prop'
. }
. **(3) Get starting values for the beta coefficients and the class shares**
. forvalues s=1/$nclasses {
. gen double _prob`s'=1/$nclasses
. clogit $depvar $varlist if _ss==`s', group($group) technique(nr dfp)
. predict double _l`s'
. }
In what follows, the steps to calculate the conditional probabilities of equation 7
from these starting values are presented.
First, for each latent class we multiply the variables l1, l2,…, lC by the dummy
variable that identifies the observed choice in each choice situation, i.e. $depvar. Note
that this allows storing the probabilities of the observed choice for each class.
Second, for each latent class we multiply the probabilities of the observed choices
in each choice situation in order to obtain the probability of the agent’s sequence of
choices:8
. **(4) Compute the probability of the agent's sequence of choices**
. forvalues s=1/$nclasses {
. gen double _kbb`s'=_l`s'*$depvar
. recode _kbb`s' 0=.
. bysort $id: egen double _kbbb`s'=prod(_kbb`s')
. bysort $id: replace _kbbb`s'=. if _n!=_N
. }
Third, we construct the denominator of equation 7, i.e. the unconditional choice
probability for each choice maker. This is done by computing a weighted average of the
probabilities of the agent’s sequence of choices in each class, with weights given by the
class shares, i.e. the variables prob1, prob2,…, probC:
. **(5) Compute the choice probability**
8. This last step is done through the user-written program gprod. Type findit gprod and install
the package dm71.
. gen double _den=_prob1*_kbbb1
. forvalues s=2/$nclasses {
. replace _den=_den+_prob`s'*_kbbb`s'
. }
Finally, we compute the ratios defined in equation 7 and rearrange them in order
to create individual-level variables. These are the conditional probabilities of class
membership as defined in the previous section:
. **(6) Compute the conditional probabilities of class membership**
. forvalues s=1/$nclasses {
. gen double _h`s'=(_prob`s'*_kbbb`s')/_den
. bysort $id: egen double _H`s'=sum(_h`s')
. }
Before starting the loop that iterates the EM recursion until convergence, we need
to specify a Stata ml program that performs the estimation of the grouped-data model
defined in equation 11:9
**(7) Provide Stata with the ML command for the grouped-data model**
. bysort $group: gen _alt=sum(1)
. su _alt
. bysort $id: gen double _id1=1 if _n<=r(mean)
. program logit_lf
. args lnf a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
. tempvar denom
. gen double `denom'=1
. forvalues c=2/$nclasses {
. replace `denom'=`denom'+exp(`a`c'')
. }
. replace `lnf'=_H1*ln(1/`denom') if $depvar==1 & _id1==1
. forvalues c=2/$nclasses {
. replace `lnf'=`lnf'+_H`c'*ln(exp(`a`c'')/`denom') if $depvar==1 & _id1==1
. }
. replace `lnf'=0 if `lnf'==.
. **Note: the ML command updates the class shares internally:**
. capture drop _prob*
. gen double _prob1=1/`denom'
. forvalues c=2/$nclasses {
. gen double _prob`c'=exp(`a`c'')/(`denom')
. }
. end
There are two important remarks about the above-mentioned routine. First, we set
to zero one vector of parameters for identification. Second, the class shares – prob1,
9. The ml program presented in the following lines allows for up to 20 latent classes. However, the
routine can be easily modified so as to account for more – or less – classes.
prob2,…, probC – are updated internally when the ml program is called.
We now present the loop that repeats the steps above until convergence:
. local i=1
. while `i'<= $niter {
To begin with, estimate again the C clogit models (one for each class) using the
conditional probabilities of class membership as weights. Then, recompute the proba-
bility of the chosen alternative in each choice occasion so as to update probabilities of
the agent’s sequence of choices:
**(8) Update the probability of the agent's sequence of choices**
. capture drop _l* _kbbb*
. forvalues s=1/$nclasses {
. clogit $depvar $varlist [iw=_H`s'], group($group) technique(nr dfp)
. predict double _l`s'
. replace _kbb`s'=_l`s'*$depvar
. recode _kbb`s' 0=.
. bysort $id: egen double _kbbb`s'=prod(_kbb`s')
. bysort $id: replace _kbbb`s'=. if _n!=_N
. }
Now launch the ml model for the estimation of the grouped-data log likelihood:
**(9) Update the class share probability**
. global classes="($varlist2)"
. forvalues s=3/$nclasses {
. global classes="$classes ($varlist2)"
. }
. ml model lf logit_lf $classes, max
Once the class share probabilities have been updated within the ml program, the
next step requires updating both the choice probability – i.e. the variable den – and
the conditional probabilities of class membership, i.e. the variables H*:
**(10) Update the choice probability**
. replace _den=_prob1*_kbbb1
. forvalues s=2/$nclasses {
. replace _den=_den+_prob`s'*_kbbb`s'
. }
**(11) Update the conditional probabilities of class membership**
. drop _H*
. forvalues s=1/$nclasses {
. replace _h`s'=(_prob`s'*_kbbb`s')/_den
. bysort $id: egen double _H`s'=sum(_h`s')
. }
Once the parameters and the conditional probabilities have been updated the routine
computes the log likelihood and restarts the loop until convergence.
As pointed out in Train (2008), convergence of EM algorithms for nonparametric
estimation is still controversial and constitutes an area of future research. As in Train
(2008) and Weeks and Lange (1989) here we define convergence when the change in the
log likelihood function from one iteration to the other is sufficiently small. Hence, the
routine automatically stops the internal loop before the selected number of iterations
provided that the log likelihood has not changed more than 0.001% during the last five
iterations:10
. **(12) Update the log likelihood**
. capture drop _sumll
. egen _sumll=sum(ln(_den))
. sum _sumll
**Check if the log likelihood has increased from the previous 5 iterations:**
. global z=r(mean)
. local _sl`i'=$z
. if `i'>=6 {
. local a=`i'-5
. }
**Stop the loop if the LL has not changed by more than 0.001% over the last 5 iterations:**
. if `i'>=6 {
. local _vpsl`i' = -(`_sl`i'' - `_sl`a'')/`_sl`a''
. if `_vpsl`i''<= 0.00001 {
. local i=$niter
. }
. }
**Restart the loop and display the log likelihood**
. local i=`i' +1
. display as green "Iteration " `i' ": log likelihood = " as yellow $z
. }
Once the algorithm has converged the results can be displayed by typing:
. forvalues s=1/$nclasses {
. clogit $depvar $varlist [iw=_H`s'], group($group) technique(nr dfp)
. }
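The estimated class shares can be inspected in the same way (a small addition of ours, not in the original routine; _prob1, …, _probC are the share variables maintained by the loop above, so r(mean) gives the – possibly individual-averaged – share of each class):
. forvalues s=1/$nclasses {
. quietly su _prob`s'
. display as text "Share of class `s': " as result %6.4f r(mean)
. }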
4.2 A model without covariates on the class share probabilities
When the model does not include demographics on the class share probabilities the
maximization of the grouped-data log likelihood can be avoided. In fact, its solution
can be provided analytically as it is shown in equation 13. This is important because
the maximization of the grouped-data model slows down the overall estimation process,
which could become time-consuming when the number of latent classes is high.
If there are no covariates on the class share probabilities the EM algorithm can be
optimized by simply dropping out the 9th step from the loop presented in the previous
10. Such a small value is advisable because – as discussed in Dempster et al. (1977) – EM algorithms
may move very slowly when they are close to a maximum.
subsection. Therefore – following the routine – the choice probability is updated by
means of the previous values of the class shares and this, in turn, allows updating the
conditional probabilities of class membership during the 11th step.
Once the conditional probabilities have been updated they are used to construct the
numerator and the denominator of equation 13, so that also the class share probabilities
can be updated to the next iteration. Hence, the following lines should be added after
the 11th step:
**Compute the numerator of equation 13**
. forvalues s=1/$nclasses {
. capture drop _nums`s'
. egen double _nums`s'=sum(_h`s')
. }
**Compute the denominator of equation 13**
. capture drop _dens
. gen double _dens=_nums1
. forvalues s=2/$nclasses {
. replace _dens=_dens+_nums`s'
. }
. **Update the class shares**
. forvalues s=1/$nclasses {
. replace _prob`s'=_nums`s'/_dens
. }
Subsequently, the loop continues as before from point 12 till the end.
5 The lclogit command
lclogit is a Stata command that generalizes the routine outlined above. Therefore, this
command does not contain its own ml evaluator and it simply makes use of the Stata
clogit estimation command at each maximization step. This reduces significantly the
estimation time and – importantly – it bases the EM estimation on the quality and the
efficiency of the clogit evaluator.
Following the routine outlined in the previous section, lclogit declares convergence
when the variation in the last 5 values of the maximized log likelihood is smaller than
a given threshold.11 When this happens lclogit stops the internal loop and displays
the estimated coefficients and the class shares.
The results are displayed in a formatted table, with the columns containing the
results by classes.12 When there are more than 5 latent classes the table of results is
divided into blocks of 6 columns. This should allow for a better view of the results and it
11. Users are allowed to change this threshold, the default is 0.001%.
12. lclogit uses the built-in programs estimates store and estimates table in order to display the
table of results. Moreover, the command estimates store allows displaying the last set of internal
clogit estimations by simply typing estimates dir after lclogit.
also avoids visualization problems when the table becomes too wide.13
The syntax for lclogit is:
lclogit depvar [varlist] [if] [in], group(varname) id(varname) nclasses(#) [options]
The options allowed by lclogit are:
• group(varname) is required; it specifies the variable for the choice occasions.
• id(varname) is required; it specifies the variable for the choice-makers.
• nclasses(#) is required; it sets the number of latent classes to be estimated.
• seed(#) is used to set the starting values; the default is 1234567890.14
• niter(#) specifies the maximum number of iterations; the default is 150.
• convergence(#) specifies the minimum variation with respect to the last 5 values
of the log likelihood in order to declare convergence; the default is 0.001%.
• cpname(newvarname) tells lclogit to save C new variables called newvarname1,
newvarname2, …, newvarnameC containing the individual conditional probabilities
of class membership.
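For illustration, a full call using these options might look as follows (a sketch only; the numeric option values are arbitrary and the variable names are those of the dataset used in the next section):
. lclogit y price contract local wknown tod seasonal, group(gid) id(pid) nclasses(5) seed(12345) niter(200) cpname(cp)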
6 Application
For our application we use the data illustrated in the previous section and, therefore,
a model without covariates on the class share probabilities. We begin by estimating a
conditional logit model using the Stata command clogit:
. use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
. clogit y price contract local wknown tod seasonal, group(gid)
Iteration 0: log likelihood = -1379.3159
(output omitted )
Iteration 4: log likelihood = -1356.3867
Conditional (fixed-effects) logistic regression Number of obs = 4780
LR chi2(6) = 600.47
Prob > chi2 = 0.0000
Log likelihood = -1356.3867 Pseudo R2 = 0.1812
y Coef. Std. Err. z P>|z| [95% Conf. Interval]
price -.6354853 .0439523 -14.46 0.000 -.7216302 -.5493403
13. For models with more than 20 latent classes lclogit does not show the results in a formatted table
but in a plain matrix, which can be also found in the ereturn list.
14. Note that the seed is set internally only when computing the starting values. Thereafter, the seed
is set back to the original value. This allows for comparable results and simulation-based inference.
contract -.13964 .0161887 -8.63 0.000 -.1713693 -.1079107
local 1.430578 .0963826 14.84 0.000 1.241672 1.619485
wknown 1.054535 .086482 12.19 0.000 .8850338 1.224037
tod -5.698954 .3494016 -16.31 0.000 -6.383769 -5.01414
seasonal -5.899944 .35485 -16.63 0.000 -6.595437 -5.204451
From the results above we can see that, on average, customers prefer lower prices,
shorter contract lengths, local and well-known companies, and fixed-rate plans.
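To put these magnitudes in perspective (a back-of-the-envelope calculation of ours, not in the original text), the ratio of an attribute coefficient to the price coefficient gives the implied willingness to pay in cents per kWh; for instance, for a local supplier:

\mathrm{WTP}_{\mathrm{local}} = \hat{\beta}_{\mathrm{local}} / (-\hat{\beta}_{\mathrm{price}}) = 1.4306 / 0.6355 \approx 2.25 \text{ cents per kWh}.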
The Stata command mixlogit from Hole (2007) can be used to estimate a para-
metric mixed logit model with independent, normally-distributed coefficients:15
. mixlogit y, id(pid) group(gid) rand(price contract local wknown tod seasonal)
> nrep(300)
Iteration 0: log likelihood = -1249.8219 (not concave)
(output omitted )
Iteration 7: log likelihood = -1101.6085
Mixed logit model Number of obs = 4780
LR chi2(6) = 509.56
Log likelihood = -1101.6085 Prob > chi2 = 0.0000
y Coef. Std. Err. z P>|z| [95% Conf. Interval]
Mean
price -1.004329 .0721185 -13.93 0.000 -1.145679 -.8629798
contract -.2274985 .047386 -4.80 0.000 -.3203735 -.1346236
local 2.208746 .2439681 9.05 0.000 1.730578 2.686915
wknown 1.656329 .1707167 9.70 0.000 1.32173 1.990927
tod -9.364151 .5858618 -15.98 0.000 -10.51242 -8.215883
seasonal -9.496181 .5792009 -16.40 0.000 -10.63139 -8.360968
SD
price .2151655 .0311095 6.92 0.000 .154192 .2761389
contract .384136 .044778 8.58 0.000 .2963728 .4718992
local 1.788806 .2370063 7.55 0.000 1.324282 2.25333
wknown 1.185838 .1731652 6.85 0.000 .8464401 1.525235
tod 1.6553 .2094545 7.90 0.000 1.244777 2.065824
seasonal -1.119371 .2836182 -3.95 0.000 -1.675252 -.5634893
As can be seen from the maximized log likelihood, we can reject the conditional
logit specification in favor of a random coefficient model. Moreover, the magnitude of
the coefficients is significantly different when compared to the estimates from clogit.16
We now show how to use lclogit to estimate a nonparametric mixed logit model
by means of the EM algorithm outlined in the previous sections. As explained in section
1, the main idea is to use a latent class model with a relatively high number of classes
so as to approximate the mixing distribution nonparametrically.
15. The Simulated Maximum Likelihood estimation is done with 300 Halton draws; mixlogit can be
installed in Stata by typing findit mixlogit.
16. This is an indication of the bias produced by the IIA property of standard conditional logit models.
See Bhat (2000) for this point.
As stated in Greene and Hensher (2003) and Train (2008), the choice of the
appropriate number of classes is made by means of some information criteria. Here
we opt for the BIC and the CAIC, which penalize more heavily models with a large
number of parameters.17 The next lines show how to use lclogit to estimate a latent
class model with an increasing number of classes:
. forvalues c=2/15 {
. lclogit y price contract local wknown tod seasonal, gr(gid) id(pid) ncl(`c´)
. display e(bic)
. display e(caic)
. display e(ll)
. }
The routine took about 30 minutes to estimate the whole set of 14 models on our
standard-issue PC.18 The next table shows – for an increasing number of latent classes
– the maximized log likelihood, the number of parameters, the BIC and the CAIC:
Classes Log Likelihood N.param. CAIC BIC
Cl.1 -1356.39 6 2746.40 2740.40
Cl.2 -1211.35 13 2495.57 2482.57
Cl.3 -1118.23 20 2348.57 2328.57
Cl.4 -1085.30 27 2321.95 2294.95
Cl.5* -1040.49 34 2271.55* 2237.55
Cl.6 -1028.56 41 2286.93 2245.93
Cl.7 -1006.37 48 2281.79 2233.79
Cl.8* -990.24 55 2288.76 2233.76*
Cl.9 -983.64 62 2314.80 2252.80
Cl.10 -979.23 69 2345.22 2276.22
Cl.11 -965.76 76 2357.52 2281.52
Cl.12 -952.68 83 2370.58 2287.58
Cl.13 -947.24 90 2398.94 2308.94
Cl.14 -945.59 97 2434.89 2337.89
Cl.15 -943.42 104 2469.78 2365.78
From these results we can infer that the 5-class model is optimal according to the
CAIC, whilst the Bayesian criterion points to a model with 8 latent classes.
The output below shows how the program works for a model with 8 latent classes:
. display c(current_time)
12:48:14
. lclogit y price contract local wknown tod seasonal, id(pid) gr(gid) ncl(8)
Iteration 1: log likelihood = -1252.5014
Iteration 2: log likelihood = -1094.1094
Iteration 3: log likelihood = -1043.4077
Iteration 4: log likelihood = -1031.4924
Iteration 5: log likelihood = -1024.0092
Iteration 6: log likelihood = -1016.9219
17. lclogit returns in e() three different information criteria: the AIC, the CAIC and the BIC.
18. We used a PC with a 2.2GHz Intel core 2 duo and 4MB RAM.
Iteration 7: log likelihood = -1008.2313
Iteration 8: log likelihood = -1002.0564
Iteration 9: log likelihood = -998.89545
Iteration 10: log likelihood = -996.875
Iteration 11: log likelihood = -995.84229
Iteration 12: log likelihood = -995.19885
Iteration 13: log likelihood = -994.69055
Iteration 14: log likelihood = -994.31946
Iteration 15: log likelihood = -994.08838
(output omitted )
Iteration 45: log likelihood = -990.23932
Iteration 46: log likelihood = -990.23895
Iteration 47: log likelihood = -990.23877
Iteration 48: log likelihood = -990.23865
Iteration 49: log likelihood = -990.23859
Iteration 151: log likelihood = -990.23853
Latent class model with 8 latent classes
Variable Class1 Class2 Class3 Class4 Class5
price -0.910 -0.737 -0.488 -2.110 -0.642
contract -0.438 0.218 -0.592 -0.662 0.096
local 0.370 2.416 0.782 0.717 2.186
wknown 0.369 2.840 0.710 0.241 1.207
tod -8.257 -6.690 -4.132 -14.191 -3.836
seasonal -6.440 -7.213 -6.560 -17.207 -4.052
Prob 0.120 0.097 0.091 0.070 0.096
Variable Class6 Class7 Class8
price -1.208 -1.533 -0.082
contract -0.198 -0.409 -0.156
local 6.578 0.621 4.937
wknown 5.103 0.930 3.444
tod -14.847 -16.007 -1.088
seasonal -15.334 -14.818 -1.060
Prob 0.111 0.236 0.178
Note: model estimated via EM algorithm
. display c(current_time)
12:49:20
As can be seen, the EM algorithm took 49 iterations before reaching convergence,
i.e. about one minute on our standard-issue PC. However, the routine is already close
to the maximum at the 11th iteration, i.e. after less than 10 seconds. This is a common
feature of EM algorithms and it actually suggests another useful application of lclogit,
as it could be used to obtain good starting values for the estimation of latent class models
via gradient-based algorithms.19
19. This could be particularly useful either to speed up the estimation process or to avoid convergence
problems when estimating models with a high number of latent classes.
As Train (2008) points out, when the number of classes rises, summary statistics for
the distribution of each coefficient could be more informative than the single point esti-
mates. For this reason, lclogit stores the predicted weighted average of each coefficient
in the vector e(PB):
. matrix list e(PB)
e(PB)[1,6]
Average of Average of Average of Average of Average of Average of
price contract local wknown tod seasonal
Coeff -.94577 -.26901 2.3690 1.9184 -9.0036 -9.058
Interestingly, although the parameters are rather different from class to class, their
weighted averages are close to the corresponding values obtained from the parametric
estimation via mixlogit.
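As a worked check (ours), the entry for price in e(PB) can be reproduced by hand from the class-specific estimates and shares reported in the table above:

\sum_{c=1}^{8} \hat{\pi}_c \hat{\beta}_{c,\mathrm{price}} = 0.120(-0.910) + 0.097(-0.737) + \dots + 0.178(-0.082) \approx -0.945,

which matches the stored value of −.94577 up to the rounding of the displayed class-level estimates.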
7 Conclusions
This article has shown how to estimate a nonparametric mixed logit model in Stata
with one of the methods proposed in Train (2008). The method makes use of an EM
algorithm that – thanks to its desirable properties – allows estimating a latent class model
with a high number of classes so that the unobserved distribution of the coefficients can
be approximated nonparametrically. We have also shown how to use the Stata command
lclogit, which performs the EM estimation for latent class logit models automatically.
8 Acknowledgements
I am grateful to Kenneth Train for helpful comments on the implementation of the EM
algorithm. Thanks also to an anonymous referee for relevant comments and suggestions
on how to improve both this paper and the lclogit routine.
9 References
Bhat, C., 2000. Flexible model structures for discrete choice analysis. Handbook of
transport modelling 1, 71-90.
Boyles, R., 1983. On the convergence of the EM algorithm. Journal of the Royal Statistical Society B 45, 47-50.
Dempster, A., Laird, N., Rubin, D., et al., 1977. Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society. Series B
(Methodological) 39 (1), 1-38.
Greene, W., Hensher, D., 2003. A latent class model for discrete choice analysis: con-
trasts with mixed logit. Transportation Research Part B 37 (8), 681-698.
Hole, A.R., 2007. Fitting mixed logit models by using maximum simulated likelihood.
Stata Journal, 7 (3), 388-401.
Huber, J. and K. Train, 2001. On the similarity of classical and Bayesian estimates of
individual mean partworths, Marketing Letters 12, 259-269
McFadden, D., 1973. Conditional logit analysis of qualitative choice models. Frontiers
of Econometrics, ed. P. Zarembka. New York: Academic Press.
McFadden, D., Train, K., 2000. Mixed MNL models for discrete responses. Journal of
Applied Econometrics 15 (5), 447-470.
Train, K., 2003. Discrete choice methods with simulation. Cambridge Books.
Train, K., 2008. EM Algorithms for Nonparametric Estimation of Mixing Distributions.
Journal of Choice Modelling 1 (1).
Weeks, D. and K. Lange, 1989. Trials, tribulations, and triumphs of the EM algorithm in pedigree analysis. Journal of Mathematics Applied in Medicine and Biology 6, 209-232.
Wu, C., 1983. On the convergence properties of the EM algorithm. The Annals of
Statistics 11 (1), 95-103.
About the author
Daniele Pacifico works with the Italian Department of the Treasury (Rome, Italy) and is a
research fellow of the Centre for the Analysis of Public Policies (Modena, Italy).
The Stata Journal (yyyy) vv, Number ii, pp. 1–15
lclogit: A Stata module for estimating latent
class conditional logit models via the
Expectation-Maximization algorithm
Daniele Pacifico
Italian Department of the Treasury
daniele.pacifico@tesoro.it
Hong il Yoo
University of New South Wales
h.yoo@unsw.edu.au
Abstract. This paper describes lclogit, a Stata module for estimating a discrete
mixture or latent class logit model via the EM algorithm.
Keywords: st0001, lclogit, latent class model, EM algorithm, mixed logit
1 Introduction
Mixed logit or random parameter logit is used in many empirical applications to cap-
ture more realistic substitution patterns than traditional conditional logit. The random
parameters are usually assumed to follow a normal distribution and the resulting model
is estimated through simulated maximum likelihood, as in Hole (2007)’s Stata module
mixlogit. Several recent studies, however, note potential gains from specifying a dis-
crete instead of normal mixing distribution, including the ability to approximate the
true parameter distribution more flexibly at lower computational costs.1
Pacifico (2012) implements the Expectation-Maximization (EM) algorithm for esti-
mating a discrete mixture logit model, also known as latent class logit model, in Stata.
As Bhat (1997) and Train (2008) emphasize, the EM algorithm is an attractive alterna-
tive to the usual (quasi-)Newton methods in the present context, because it guarantees
numerical stability and convergence to a local maximum even when the number of la-
tent classes is large. In contrast, the usual optimization procedures often fail to achieve
convergence since inversion of the (approximate) Hessian becomes numerically difficult.
With this contribution, we aim to generalize Pacifico (2012)'s code into a Stata
module that introduces a series of important functionalities and provides improved
performance in terms of run time and stability.
2 EM algorithm for latent class logit
This section recapitulates the EM algorithm for estimating a latent class logit model
(LCL).2 Suppose that each of N agents faces, for notational simplicity, J alternatives
1. For example, see Hess et al. (2011), Shen (2009) and Greene and Hensher (2003).
2. Further details are available in Bhat (1997) and Train (2008).
in each of T choice scenarios.3 Let ynjt denote a binary variable which equals 1 if agent
n chooses alternative j in scenario t and 0 otherwise. Each alternative is described by
alternative-specific characteristics, xnjt, and each agent by agent-specific characteristics
including a constant, zn.
LCL assumes that there are C distinct sets (or classes) of taste parameters, β =
(β1,β2, …,βC ). If agent n is in class c, the probability of observing her sequence of
choices is a product of conditional logit formulas:
\[
P_n(\beta_c) = \prod_{t=1}^{T} \prod_{j=1}^{J} \left( \frac{\exp(\beta_c x_{njt})}{\sum_{k=1}^{J} \exp(\beta_c x_{nkt})} \right)^{y_{njt}} \tag{1}
\]
Since the class membership status is unknown, the researcher needs to specify the
unconditional likelihood of agent n’s choices, which equals the weighted average of
equation 1 over classes. The weight for class c, πcn(θ), is the population share of that
class and usually modeled as fractional multinomial logit:
\[
\pi_{cn}(\theta) = \frac{\exp(\theta_c z_n)}{1 + \sum_{l=1}^{C-1} \exp(\theta_l z_n)} \tag{2}
\]
where θ = (θ1,θ2, …,θC−1) are class membership model parameters; note that θC has
been normalized to zero for identification.
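To make this normalization explicit, the implied share of the reference class C (a direct
consequence of equation 2, stated here for completeness) is
\[
\pi_{Cn}(\theta) = \frac{1}{1 + \sum_{l=1}^{C-1} \exp(\theta_l z_n)},
\]
so the C class shares sum to one by construction.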
The sample log likelihood is then obtained by summing each agent’s log uncondi-
tional likelihood:
\[
\ln L(\beta,\theta) = \sum_{n=1}^{N} \ln \sum_{c=1}^{C} \pi_{cn}(\theta)\, P_n(\beta_c) \tag{3}
\]
Bhat (1997) and Train (2008) note numerical difficulties associated with maximizing
equation 3 directly. They show that β and θ can be more conveniently estimated via
a well-known EM algorithm for likelihood maximization in the presence of incomplete
data, treating each agent’s class membership status as the missing information. Let
superscript s denote the estimates obtained at the sth iteration of this algorithm. Then,
at iteration s + 1, the estimates are updated as:
\[
\begin{aligned}
\beta^{s+1} &= \arg\max_{\beta} \sum_{n=1}^{N} \sum_{c=1}^{C} \eta_{cn}(\beta^{s},\theta^{s}) \ln P_n(\beta_c) \\
\theta^{s+1} &= \arg\max_{\theta} \sum_{n=1}^{N} \sum_{c=1}^{C} \eta_{cn}(\beta^{s},\theta^{s}) \ln \pi_{cn}(\theta)
\end{aligned} \tag{4}
\]
where η_cn(β^s, θ^s) is the posterior probability that agent n is in class c evaluated at the
sth estimates:
\[
\eta_{cn}(\beta^{s},\theta^{s}) = \frac{\pi_{cn}(\theta^{s})\, P_n(\beta_c^{s})}{\sum_{l=1}^{C} \pi_{ln}(\theta^{s})\, P_n(\beta_l^{s})} \tag{5}
\]
3. lclogit is also applicable when the number of scenarios varies across agents, and that of alternatives
varies both across agents and over scenarios.
The updating procedure can be implemented easily in Stata, exploiting clogit and
fmlogit routines as follows.4 β^(s+1) is computed by estimating a conditional logit model
(clogit) C times, each time using η_cn(β^s, θ^s) for a particular c to weight observations on
each n. θ^(s+1) is obtained by estimating a fractional multinomial logit model (fmlogit)
which takes η_1n(β^s, θ^s), η_2n(β^s, θ^s), ..., η_Cn(β^s, θ^s) as dependent variables. When z_n
only includes the constant term, so that each class share is the same for all agents, i.e.
π_cn(θ) = π_c(θ), each class share can be directly updated using the following analytical
solution:
\[
\pi_c(\theta^{s+1}) = \frac{\sum_{n=1}^{N} \eta_{cn}(\beta^{s},\theta^{s})}{\sum_{l=1}^{C} \sum_{n=1}^{N} \eta_{ln}(\beta^{s},\theta^{s})} \tag{6}
\]
without estimating the fractional multinomial logit model.
With a suitable selection of starting values, the updating procedure can be repeated
until changes in the estimates and/or improvement in the log likelihood between itera-
tions are small enough.
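As a concrete illustration, one β-update of this scheme could be sketched in a Stata do-file
as follows. This is only a simplified sketch, not lclogit's actual internal code: it reuses the
variable names of the application in Section 7 and assumes hypothetical variables h1-h5 already
hold the current posterior class probabilities η_cn(β^s, θ^s) for each agent's observations.

forvalues c = 1/5 {
    // weight each agent's choice observations by the class-c posterior probability
    clogit y price contract local wknown tod seasonal [iweight=h`c'], group(gid)
    matrix bclass`c' = e(b)   // class-c taste parameters for the next iteration
}

The θ-update would analogously pass h1-h5 to fmlogit as dependent variables, or use equation 6
directly when the class shares are constant across agents.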
An often highlighted feature of LCL is its ability to accommodate unobserved inter-
personal taste variation without restricting the shape of the underlying taste distribu-
tion. Hess et al. (2011) have recently emphasized that LCL also provides convenient
means to account for observed interpersonal heterogeneity in correlations among tastes
for different attributes. For example, let β_q and β_h denote taste coefficients on the qth
and hth attributes respectively. Each coefficient may take one of C distinct values, and
is a random parameter from the researcher's perspective. Their covariance is given by:
\[
\operatorname{cov}_n(\beta_q,\beta_h) = \sum_{c=1}^{C} \pi_{cn}(\theta)\,\beta_{c,q}\,\beta_{c,h} - \left( \sum_{c=1}^{C} \pi_{cn}(\theta)\,\beta_{c,q} \right) \left( \sum_{c=1}^{C} \pi_{cn}(\theta)\,\beta_{c,h} \right) \tag{7}
\]
where βc,q is the value of βq when agent n is in class c, and βc,h is defined similarly.
As long as zn in equation 2 includes a non-constant variable, this covariance will vary
across agents with different observed characteristics through the variation in πcn(θ).
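Put differently, equation 7 is just the usual covariance identity applied to the discrete mixing
distribution:
\[
\operatorname{cov}_n(\beta_q,\beta_h) = E_n[\beta_q \beta_h] - E_n[\beta_q]\, E_n[\beta_h],
\]
where the expectations are taken over the C mass points (β_{c,q}, β_{c,h}) with agent-specific
weights π_cn(θ).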
3 The lclogit command
lclogit is a Stata module which implements the EM iterative scheme outlined in the
previous section. This module generalizes Pacifico (2012)’s step-by-step procedure and
4. fmlogit is a user-written program. See footnote 5 for a further description.
introduces an improved internal loop along with other important functionalities. The
overall effect is to make the estimation process more convenient, significantly faster and
more stable numerically.
To give a few examples, the internal code of lclogit executes fewer algebraic oper-
ations per iteration to update the estimates; uses the standard generate command to
perform tasks which were previously executed with slightly slower egen functions; and
works with log probabilities instead of probabilities when possible. All these changes
substantially reduce the estimation run time, especially in the presence of a large num-
ber of parameters and/or observations. Taking the 8-class model estimated by Pacifico
(2012) for example, lclogit produces the same results as the step-by-step procedure
while using less than a half of the latter’s run time.
The data setup for lclogit is identical to that required by clogit.
The generic syntax for lclogit is:
lclogit depvar [varlist] [if] [in], group(varname) id(varname) nclasses(#) [options]
The options for lclogit are:
• group(varname) is required and specifies a numeric identifier variable for the
choice scenarios.
• id(varname) is required and specifies a numeric identifier variable for the choice
makers or agents. When only one choice scenario is available for each agent, users
may specify the same variable for both group() and id().
• nclasses(#) is required and specifies the number of latent classes. A minimum
of 2 latent classes is required.
• membership(varlist) specifies independent variables that enter the fractional multi-
nomial logit model of class membership, i.e. the variables included in the vector
zn of equation 2. These variables must be constant within the same agent as iden-
tified by id().5 When this option is not specified, the class shares are updated
algebraically following equation 6.
• convergence(#) specifies the tolerance for the log likelihood. When the propor-
tional increase in the log likelihood over the last five iterations is less than the
specified criterion, lclogit declares convergence. The default is 0.00001.
5. Pacifico (2012) specified a ml program with the method lf to estimate the class membership model.
lclogit uses another user-written program from Maarten L. Buis – fmlogit – which performs the
same estimation with the significantly faster and more accurate d2 method. lclogit is downloaded
with a modified version of the prediction command of fmlogit – fmlogit pr – since we had to
modify this command to obtain double-precision class shares.
• iterate(#) specifies the maximum number of iterations. If convergence is not
achieved after the selected number of iterations, lclogit stops the recursion and
notes this fact before displaying the estimation results. The default is 150.
• seed(#) sets the seed for pseudo uniform random numbers. The default is
c(seed).
The starting values for taste parameters are obtained by splitting the sample into
nclasses() different subsamples and estimating a clogit model for each of them.
During this process, a pseudo uniform random number is generated for each agent
to assign the agent into a particular subsample.6 As for the starting values for
the class shares, the module assumes equal shares, i.e. 1/nclasses().
• constraints(Class#1 numlist: Class#2 numlist: ...) specifies the constraints
that are imposed on the taste parameters of the designated classes, i.e. βc in equa-
tion 1. For instance, suppose that x1 and x2 are alternative-specific characteristics
included in indepvars for lclogit and the user wishes to restrict the coefficient
on x1 to zero for Class1 and Class4, and the coefficient on x2 to 2 for Class4.
Then, the relevant series of commands would look like:
. constraint 1 x1 = 0
. constraint 2 x2 = 2
. lclogit depvar indepvars, gr() id() ncl() constraints(Class1 1: Class4 1 2)
• nolog suppresses the display of the iteration log.
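For orientation, a minimal call on a data set already in clogit form might look like the line
below; the variable names (a choice indicator choice, a single attribute cost, a scenario
identifier gid and an agent identifier pid) are hypothetical placeholders, and Section 7 gives
a full worked example.

. lclogit choice cost, group(gid) id(pid) nclasses(3)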
4 Post-estimation command: lclogitpr
lclogitpr predicts the probabilities of choosing each alternative in a choice situation
(choice probabilities hereafter), the class shares or prior probabilities of class member-
ship, and the posterior probabilities of class membership. The predicted probabilities
are stored in a variable named stubname# where # refers to the relevant class number;
the only exception is the unconditional choice probability, as it is stored in a variable
named stubname. The syntax for lclogitpr is:
lclogitpr stubname [if] [in] [, options]
The options for lclogitpr are:
• class(numlist) specifies the classes for which the probabilities are going to be
predicted. The default setting assumes all classes.
6. More specifically, the unit interval is divided into nclasses() equal parts and if the agent’s pseudo
random draw is in the cth part, the agent is allocated to the subsample whose clogit results serve
as the initial estimates of Class c’s taste parameters. Note that, like asmprobit, lclogit restores the
seed to its state as at the beginning of the command’s execution once all necessary pseudo random
draws have been made.
• pr0 predicts the unconditional choice probability, which equals the average of
class-specific choice probabilities weighted by the corresponding class shares. That
is,
∑C
c=1 πcn(θ)[exp(βcxnjt)/(
∑J
k=1 exp(βcxnkt))] in the context of Section 2.
• pr predicts the unconditional choice probability and the choice probabilities condi-
tional on being in particular classes; exp(βcxnjt)/(
∑J
k=1 exp(βcxnkt)) in equation
1 corresponds to the choice probability conditional on being in class c. This is the
default option when no other option is specified.
• up predicts the class shares or prior probabilities that the agent is in particular
classes. They correspond to the class shares predicted by using the class member-
ship model parameter estimates; see equation 2 in Section 2.
• cp predicts the posterior probabilities that the agent is in particular classes taking
into account her sequence of choices. They are computed by evaluating equation
5 at the final estimates for each c = 1, 2, · · · ,C.
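For example, after an lclogit fit the following two commands (with arbitrary stub names) would
store the prior class shares and the unconditional choice probabilities:

. lclogitpr share, up
. lclogitpr p0, pr0

Here share1, share2, ... hold π_cn(θ) for each class, while p0 holds the class-share-weighted
choice probability.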
5 Post-estimation command: lclogitcov
lclogitcov predicts the implied variances and covariances of taste parameters by eval-
uating equation 7 at the active lclogit estimates. They could be a useful tool for
studying the underlying taste patterns; see Hess et al. (2011) for a related application.
The generic syntax for lclogitcov is:
lclogitcov [varlist] [if] [in] [, options]
The default is to store the predicted variances in a set of hard-coded variables named
var_1, var_2, ... where var_q is the predicted variance of the coefficient on the qth variable
listed in varlist, and the predicted covariances in cov_12, cov_13, ..., cov_23, ... where
cov_qh is the predicted covariance between the coefficients on the qth variable and the
hth variable in varlist.
The averages of these variance and covariances over agents – as identified by the
required option id() of lclogit – in the prediction sample are reported as a covariance
matrix at the end of lclogitcov’s execution.
The options for lclogitcov are:
• nokeep drops the predicted variances and covariances from the data set at the end
of the command’s execution. The average covariance matrix is still displayed.
• varname(stubname) requests the predicted variances to be stored as stubname1,
stubname2,…
• covname(stubname) requests the predicted covariances to be stored as stubname12,
stubname13,…
• matrix(name) stores the reported average covariance matrix in a Stata matrix
called name.
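As an illustration, using the attribute names from the application in Section 7, one might
request the implied covariances of the price and contract coefficients, keep only the average
covariance matrix, and drop the generated variables:

. lclogitcov price contract, matrix(V) nokeep
. matrix list V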
6 Post-estimation command: lclogitml
lclogitml is a wrapper for gllamm (Rabe-Hesketh et al., 2002) which uses the d0
method to fit generalised linear latent class and mixed models including LCL via the
Newton-Raphson (NR) algorithm for likelihood maximization.7 This post-estimation
command passes the active lclogit specification and estimates to gllamm, and its primary
usage depends mainly on how the iterate() option is specified; see below for details.
The default setting relabels and transforms the ereturn results of gllamm in ac-
cordance with those of lclogit, before reporting and posting them. Users can exploit
lclogitpr and lclogitcov, as well as Stata’s usual post-estimation commands re-
quiring the asymptotic covariance matrix such as nlcom. When switch is specified,
the original ereturn results of gllamm are reported and posted; users gain access to
gllamm’s post-estimation commands, but lose access to lclogitpr and lclogitcov.
lclogitml can also be used as its own post-estimation command, for example to
pass the currently active lclogitml results to gllamm for further NR iterations.
The generic syntax for lclogitml is:
lclogitml [if] [in] [, options]
The options for lclogitml are:
• iterate(#) specifies the maximum number of NR iterations for gllamm’s likeli-
hood maximization process. The default is 0 in which case the likelihood function
and its derivatives are evaluated at the current lclogit estimates; this allows
obtaining standard errors associated with the current estimates without boot-
strapping.
With a non-zero argument, this option can implement a hybrid estimation strategy
similar to Bhat (1997)’s. He executes a relatively small number of EM iterations to
obtain intermediate estimates, and uses them as starting values for direct likelihood
maximization via a quasi-Newton algorithm until convergence, because the EM
algorithm tends to slow down near a local maximum.
Specifying a non-zero argument for this option can also be a useful tool for check-
ing whether lclogit has declared convergence prematurely, for instance because
convergence() has not been set stringently enough for an application at hand.
• level(#) sets confidence level; the default is 95.
• nopost restores the currently active ereturn results at the end of the command’s
execution.
7. gllamm can be downloaded by entering ssc install gllamm into the command window.
• switch displays and posts the original gllamm estimation results, without relabel-
ing and transforming them in accordance with the lclogit output.
• compatible gllamm options refer to gllamm’s estimation options which are com-
patible with the LCL model specification. See gllamm’s own help menu for more
information.
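Two typical uses are sketched below: the first evaluates standard errors at the current lclogit
estimates (the default of zero NR iterations), while the second is a Bhat (1997)-style hybrid run
that refines the EM estimates with further NR iterations (the cap of 50 iterations is an arbitrary
choice for illustration).

. lclogitml
. lclogitml, iterate(50)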
7 Application
We illustrate the use of lclogit and its companion post-estimation commands by ex-
panding upon the example Pacifico (2012) uses to demonstrate his step-by-step proce-
dure for estimating LCL in Stata. This example analyzes the stated preference data on
household’s electricity supplier choice accompanying Hole (2007)’s mixlogit module,
which in turn are a subset of data used in Huber and Train (2001). There are 100
customers who face up to 12 different choice occasions, each of them consisting of a
single choice among 4 suppliers with the following characteristics:
• The price of the contract (in cents per kWh) whenever the supplier offers a contract
with a fixed rate (price)
• The length of contract that the supplier offered, expressed in years (contract)
• Whether the supplier is a local company (local)
• Whether the supplier is a well-known company (wknown)
• Whether the supplier offers a time-of-day rate instead of a fixed rate (tod)
• Whether the supplier offers a seasonal rate instead of a fixed rate (seasonal)
The dummy variable y collects the stated choice in each choice occasion whilst the
numeric variables pid and gid identify customers and choice occasions respectively. To
illustrate the use of membership() option, we generate a pseudo random regressor x1
which mimics a demographic variable. The data are organized as follows:
. use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
. set seed 1234567890
. bysort pid: egen _x1=sum(round(rnormal(0.5),1))
. list in 1/12, sepby(gid)
y price contract local wknown tod seasonal gid pid _x1
1. 0 7 5 0 1 0 0 1 1 26
2. 0 9 1 1 0 0 0 1 1 26
3. 0 0 0 0 0 0 1 1 1 26
4. 1 0 5 0 1 1 0 1 1 26
5. 0 7 0 0 1 0 0 2 1 26
6. 0 9 5 0 1 0 0 2 1 26
7. 1 0 1 1 0 1 0 2 1 26
8. 0 0 5 0 0 0 1 2 1 26
9. 0 9 5 0 0 0 0 3 1 26
10. 0 7 1 0 1 0 0 3 1 26
11. 0 0 0 0 1 1 0 3 1 26
12. 1 0 0 1 0 0 1 3 1 26
In empirical applications, it is common to choose the optimal number of latent classes
by examining information criteria such as BIC and CAIC. The next lines show how to
estimate 9 LCL specifications repeatedly and obtain the related information criteria: 8
. forvalues c = 2/10 {
2. lclogit y price contract local wknown tod seasonal, group(gid) id(pid)
> nclasses(`c') membership(_x1) seed(1234567890)
3. matrix b = e(b)
4. matrix ic = nullmat(ic)\`e(nclasses)',`e(ll)',`=colsof(b)',`e(caic)',`e(bic)'
5. }
(output omitted)
. matrix colnames ic = "Classes" "LLF" "Nparam" "CAIC" "BIC"
. matlist ic, name(columns)
Classes LLF Nparam CAIC BIC
2 -1211.232 14 2500.935 2486.935
3 -1117.521 22 2358.356 2336.356
4 -1084.559 30 2337.273 2307.273
5 -1039.771 38 2292.538 2254.538
6 -1027.633 46 2313.103 2267.103
7 -999.9628 54 2302.605 2248.605
8 -987.7199 62 2322.96 2260.96
9 -985.1933 70 2362.748 2292.748
10 -966.3487 78 2369.901 2291.901
CAIC and BIC are minimized with 5 and 7 classes respectively. In the remainder of
this section, our analysis focuses on the 5-class specification to economize on space.
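As a quick arithmetic check of the reported criteria (using the formulas in footnote 8), the
5-class model has ln L = -1039.771, m = 38 and N = 100, so
\[
\mathrm{BIC} = -2\ln L + m\ln N = 2079.54 + 38 \times 4.605 \approx 2254.5,
\]
which matches the tabulated value; CAIC works out analogously to 2292.5.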
lclogit reports the estimation results as follows:
. lclogit y price contract local wknown tod seasonal, group(gid) id(pid) nclass
> es(5) membership(_x1) seed(1234567890)
Iteration 0: log likelihood = -1313.967
Iteration 1: log likelihood = -1195.5476
(output omitted)
Iteration 22: log likelihood = -1039.7709
Latent class model with 5 latent classes
Choice model parameters and average class shares
Variable Class1 Class2 Class3 Class4 Class5
8. lclogit saves three information criteria in its ereturn list: AIC, BIC and CAIC. AIC equals
−2 ln L + 2m, where ln L is the maximized sample log likelihood and m is the total number of
estimated model parameters. BIC and CAIC penalize models with extra parameters more heavily,
by using penalty functions increasing in the number of choice makers, N: BIC = −2 ln L + m ln N
and CAIC = −2 ln L + m(1 + ln N).
price -0.902 -0.325 -0.763 -1.526 -0.520
contract -0.470 0.011 -0.533 -0.405 -0.016
local 0.424 3.120 0.527 0.743 3.921
wknown 0.437 2.258 0.325 1.031 3.063
tod -8.422 -2.162 -5.379 -15.677 -6.957
seasonal -6.354 -2.475 -7.763 -14.783 -6.941
Class Share 0.113 0.282 0.162 0.243 0.200
Class membership model parameters : Class5 = Reference class
Variable Class1 Class2 Class3 Class4 Class5
_x1 0.045 0.040 0.047 0.048 0.000
_cons -1.562 -0.544 -1.260 -0.878 0.000
Note: Model estimated via EM algorithm
It is worth noting that the reported class shares are the average shares over agents,
because the class shares vary across agents when the membership() option is included
in the syntax. If needed, agent-specific class shares can be easily computed by using the
post-estimation command lclogitpr with the up option.
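For instance, the following commands (with an arbitrary stub name s) store the agent-specific
prior shares and display them for the first choice occasion of the first customer:

. lclogitpr s, up
. list pid s1-s5 in 1/4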
In order to obtain a quantitative measure of how well the model does in differen-
tiating several classes of preferences, we use lclogitpr to compute the average (over
respondents) of the highest posterior probability of class membership:9
. bys `e(id)': gen first = _n==1
. lclogitpr cp, cp
. egen double cpmax = rowmax(cp1-cp5)
. sum cpmax if first, sep(0)
Variable Obs Mean Std. Dev. Min Max
cpmax 100 .9596674 .0860159 .5899004 1
As can be seen, the mean highest posterior probability is about 0.96, meaning that
the model does very well in distinguishing among different underlying taste patterns for
the observed choice behavior.
We next examine the model’s ability to make in-sample predictions of the actual
choice outcomes. For this purpose, we first classify a respondent as a member of class c
if class c gives her highest posterior membership probability. Then, for each subsample
of such respondents, we predict the unconditional probability of actual choice and the
probability of actual choice conditional on being in class c:
. lclogitpr pr, pr
. gen byte class = .
(4780 missing values generated)
. forvalues c = 1/`e(nclasses)' {
9. A dummy variable which equals 1 for the first observation on each respondent is generated because
not every agent faces the same number of choice situations in this specific experiment.
2. quietly replace class = `c' if cpmax==cp`c'
3. }
. forvalues c = 1/`e(nclasses)' {
2. qui sum pr if class == `c' & y==1
3. local n=r(N)
4. local a=r(mean)
5. qui sum pr`c' if class == `c' & y==1
6. local b=r(mean)
7. matrix pr = nullmat(pr) \ `n', `c', `a', `b'
8. }
. matrix colnames pr = "Obs" "Class" "Uncond_Pr" "Cond_PR"
. matlist pr, name(columns)
Obs Class Uncond_Pr Cond_PR
129 1 .3364491 .5387555
336 2 .3344088 .4585939
191 3 .3407353 .5261553
300 4 .4562778 .7557497
239 5 .4321717 .6582177
In general, the average unconditional choice probability is much higher than 0.25,
which is what a naive model would predict given that there are 4 alternatives per choice
occasion. The average conditional probability is even better, exceeding 0.5 in
all but one class. Once again, we see that the model describes the observed choice
behavior very well.
When taste parameters are modeled as draws from a normal distribution, the esti-
mated preference heterogeneity is described by their mean and covariances. The same
summary statistics can be easily computed for LCL by combining class shares and
taste parameters; see Hess et al. (2011) for a detailed discussion. lclogit saves these
statistics as part of its ereturn list:
. matrix list e(PB)
e(PB)[1,6]
Average Average Average Average Average Average
price contract local wknown tod seasonal
Coefficients -.79129 -.23756 1.9795 1.6029 -7.62728 -7.64949
. matrix list e(CB)
symmetric e(CB)[6,6]
price contract local wknown tod seasonal
price .20833629
contract .07611239 .05436665
local .48852574 .32683725 2.1078043
wknown .27611961 .22587673 1.4558029 1.045789
tod 2.2090348 .65296465 4.0426714 1.9610973 25.12504
seasonal 1.9728148 .65573999 3.8801716 2.0070985 21.845013 20.189302
Since we estimated a model with the membership() option, the class shares (hence
the covariances; see equation 7) now vary across respondents and the matrix e(CB) above
is an average covariance matrix. In this case, the post-estimation command lclogitcov
can be very useful for studying variation in taste correlation patterns within and across
different demographic groups. To illustrate this point, we compute the covariances of
the coefficients on price and contract, and then summarize the results for two groups
defined by whether x1 is greater or less than 20:
. quietly lclogitcov price contract
. sum var_1 cov_12 var_2 if _x1 >20 & first
Variable Obs Mean Std. Dev. Min Max
var_1 62 .2151655 .0061303 .2065048 .2301424
cov_12 62 .0765989 .000348 .0760533 .0773176
var_2 62 .0545157 .0000987 .0543549 .0547015
. sum var_1 cov_12 var_2 if _x1 <=20 & first
Variable Obs Mean Std. Dev. Min Max
var_1 38 .1971939 .0053252 .1841499 .2050795
cov_12 38 .0753185 .0004483 .0741831 .075949
var_2 38 .0541235 .0001431 .0537589 .0543226
Standard errors associated with any results provided by lclogit can be obtained
via bootstrap. However, the bootstrapped standard errors of class-specific results are
much less reliable than those of averaged results because the class labeling may vary
arbitrarily across bootstrapped samples; see Train (2008) for a detailed discussion.
Users interested in class-specific inferences may consider passing the lclogit re-
sults to user-written ml programs like gllamm (Rabe-Hesketh et al., 2002), to take
advantage of the EM algorithm and obtain conventional standard errors at the same
time. lclogitml simplifies this process.
. lclogitml, iter(5)
-gllamm- is initializing. This process may take a few minutes.
Iteration 0: log likelihood = -1039.7709 (not concave)
Iteration 1: log likelihood = -1039.7709
Iteration 2: log likelihood = -1039.7706
Iteration 3: log likelihood = -1039.7706
Latent class model with 5 latent classes
y Coef. Std. Err. z P>|z| [95% Conf. Interval]
choice1
price -.9023068 .2012345 -4.48 0.000 -1.296719 -.5078945
contract -.4698861 .089774 -5.23 0.000 -.64584 -.2939322
local .4241342 .3579407 1.18 0.236 -.2774167 1.125685
wknown .4370318 .2864782 1.53 0.127 -.1244552 .9985188
tod -8.422232 1.584777 -5.31 0.000 -11.52834 -5.316126
seasonal -6.354626 1.569516 -4.05 0.000 -9.43082 -3.278431
choice2
price -.3249095 .1090047 -2.98 0.003 -.5385548 -.1112642
contract .0108523 .0384404 0.28 0.778 -.0644894 .0861941
local 3.122255 .2842558 10.98 0.000 2.565124 3.679387
wknown 2.258772 .2553446 8.85 0.000 1.758306 2.759238
tod -2.157726 .8906932 -2.42 0.015 -3.903453 -.4119997
seasonal -2.470511 .894278 -2.76 0.006 -4.223264 -.7177582
choice3
price -.7629762 .1415072 -5.39 0.000 -1.040325 -.4856271
contract -.5331056 .0739354 -7.21 0.000 -.6780162 -.3881949
local .526889 .2633905 2.00 0.045 .0106531 1.043125
wknown .3249201 .2391513 1.36 0.174 -.1438078 .793648
tod -5.379464 1.100915 -4.89 0.000 -7.537218 -3.22171
seasonal -7.763171 1.191777 -6.51 0.000 -10.09901 -5.427331
choice4
price -1.526036 .1613542 -9.46 0.000 -1.842284 -1.209787
contract -.4051809 .0754784 -5.37 0.000 -.5531158 -.2572459
local .7413859 .3599632 2.06 0.039 .0358711 1.446901
wknown 1.029899 .3032522 3.40 0.001 .4355353 1.624262
tod -15.68543 1.523334 -10.30 0.000 -18.67111 -12.69975
seasonal -14.78921 1.463165 -10.11 0.000 -17.65696 -11.92146
choice5
price -.5194972 .1357407 -3.83 0.000 -.7855442 -.2534503
contract -.0141426 .0915433 -0.15 0.877 -.1935641 .165279
local 3.907502 .7079699 5.52 0.000 2.519907 5.295097
wknown 3.055901 .4653005 6.57 0.000 2.143928 3.967873
tod -6.939564 1.428877 -4.86 0.000 -9.740112 -4.139015
seasonal -6.92799 1.363322 -5.08 0.000 -9.600052 -4.255928
share1
_x1 .0443861 .0510411 0.87 0.385 -.0556525 .1444247
_cons -1.562361 1.197298 -1.30 0.192 -3.909022 .7843001
share2
_x1 .0400449 .0427769 0.94 0.349 -.0437962 .1238861
_cons -.5443567 .956636 -0.57 0.569 -2.419329 1.330615
share3
_x1 .0470822 .0458336 1.03 0.304 -.0427501 .1369144
_cons -1.260251 1.061043 -1.19 0.235 -3.339857 .8193542
share4
_x1 .0479228 .042103 1.14 0.255 -.0345976 .1304431
_cons -.8794649 .9718415 -0.90 0.365 -2.784239 1.025309
The estimated choice model or taste parameters, βc, and class membership model pa-
rameters, θc, are grouped under equations choicec and sharec respectively. lclogitml
relabels and transforms the original gllamm estimation results in accordance with the
lclogit’s ereturn list (see Section 6), facilitating interpretation of the new output
table.10 The active lclogitml coefficient estimates can also be displayed in the stan-
dard lclogit output format, by entering lclogit into the command window without
any additional statement.
Note that the log likelihood increases slightly after 3 iterations, though the parameter
estimates remain almost the same. This may happen since lclogit uses only the relative
change in the log likelihood as convergence criterion. gllamm works with the standard ml
command with a d0 evaluator, which declares convergence in a more stringent manner:
specifically, when the relative changes in both the scaled gradient and either the log
10. The original output table gllamm reports is lengthier and somewhat less intuitive in comparison. For
instance, it splits the six estimates displayed under equation choice1 over six different equations,
labeled z 1 1, z 2 1, z 3 1, z 4 1, z 5 1 and z 6 1 respectively.
likelihood or the parameter vector are smaller than a given tolerance level.11
When lclogit is used in a final production run, it is advisable to specify more
stringent convergence() than the default, and experiment with alternative starting
values by changing seed(). Train (2008) contains references highlighting the importance
of these issues for applications exploiting EM algorithms.
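For the application above, such a final run might look like the line below; the tighter tolerance
and the alternative seed are purely illustrative choices rather than recommendations from the text.

. lclogit y price contract local wknown tod seasonal, group(gid) id(pid) nclasses(5) membership(_x1) convergence(1e-8) seed(12345)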
8 Acknowledgments
We thank an anonymous referee for useful comments and suggestions. Hong il Yoo’s
work was supported under Australian Research Council’s Discovery Projects funding
scheme (project number: DP0881205).
9 References
Bhat, C., 1997. An endogenous segmentation mode choice model with an application to
intercity travel. Transportation Science, 31 (1), pp. 34-48.
Greene, W. and Hensher, D., 2003. A latent class model for discrete choice analysis:
contrasts with mixed logit. Transportation Research Part B, 37 (8), pp. 681-698.
Hess, S., Ben-Akiva, M., Gopinath, D. and Walker, J., 2011. Advantages of latent class
over mixture of logit models, mimeo.
Hole, A.R., 2007. Fitting mixed logit models by using maximum simulated likelihood.
Stata Journal, 7 (3), pp. 388-401.
Huber, J. and K. Train, 2001. On the similarity of classical and bayesian estimates of
individual mean partworths. Marketing Letters, 12, pp. 259-269.
Pacifico, D., 2012. Estimating nonparametric mixed logit models via EM algorithm.
Stata Journal 12 (2), pp. 284-298.
Rabe-Hesketh, S., Skrondal, A. and Pickles, A. 2002. Reliable estimation of generalized
linear mixed models using adaptive quadrature. Stata Journal, 2 (1), pp. 1-21.
Shen, J., 2009. Latent class model or mixed logit model? A comparison by transport
mode choice data. Applied Economics, 41 (22), pp. 2915-2924.
Train, K., 2008. EM Algorithms for Nonparametric Estimation of Mixing Distributions.
Journal of Choice Modelling, 1 (1), pp. 40-69.
11. The benefit of running lclogit first cannot be overstated. Since gllamm uses the d0 evaluator and
the LCL log likelihood is not amenable to direct maximization, each iteration tends to take a long
time, and finding initial values which lead to convergence often involves a laborious search. lclogit
exploits the EM algorithm, which guarantees convergence to a local maximum in theory, and takes
the estimates to a local maximum or its close neighborhood relatively quickly in practice.
About the author
Daniele Pacifico works with the Italian Department of the Treasury (Rome, Italy). Hong il
Yoo is a PhD student at the School of Economics, the University of New South Wales (Sydney,
Australia).
An Endogenous Segmentation Mode Choice
Model with an Application to Intercity
Travel
CHANDRA R. BHAT
Department of Civil and Environmental Engineering, University of Massachusetts, Amherst
This article uses an endogenous segmentation approach to model mode choice. This approach
jointly determines the number of market segments in the travel population, assigns individuals
probabilistically to each segment, and develops a distinct mode choice model for each segment
group. The author proposes a stable and effective hybrid estimation approach for the endogenous
segmentation model that combines an Expectation-Maximization algorithm with standard likelihood
maximization routines. If access to general maximum-likelihood software is not available, the
multinomial-logit based Expectation-Maximization algorithm can be used in isolation. The
endogenous segmentation model, and other commonly used models in the travel demand field to
capture systematic heterogeneity, are estimated using a Canadian intercity mode choice dataset.
The results show that the endogenous segmentation model fits the data best and provides
intuitively more reasonable results compared to the other approaches.
The estimation of travel mode choice models is an important component of urban and intercity
travel demand analysis and has received substantial attention in the transportation literature
(see BEN-AKIVA and LERMAN, 1985). The most widely used
model for urban as well as intercity mode choice is
the multinomial logit model (MNL). The MNL model
is derived from random utility maximizing behavior
at the disaggregate individual level. Therefore, ide
ally, we should estimate the logit model at the indi
vidual level and obtain individual-specific parame
ters for the intrinsic mode biases and for the mode
level-of-service attributes. However, the data used
for mode choice estimation is usually cross-sectional;
that is, there is only one observation per individual.
This precludes estimation of the logit parameters at
the individual level and constrains the modeler to
pool the data across individuals (even in panel data
comprising repeated choices from the same individ
ual, the number of observations per individual is
rarely sufficient for consistent and efficient estima
tion of individual-specific parameters). In such
pooled estimations, it is important to accommodate
differences in intrinsic mode biases (preference het
erogeneity) and differences in responsiveness to lev
el-of-service attributes (response heterogeneity)
across individuals. In particular, imposing an as
sumption of preference and response homogeneity in
the population is rather strong and is untenable in
most cases (hensher, 1981). Further, if the as
sumption of homogeneity is imposed when, in fact,
there is heterogeneity, the result is biased and in
consistent parameter and choice probability esti
mates (see Chamberlain, 1980).
An issue of interest then is: how can preference
and response heterogeneity be incorporated into the
multinomial logit model when studying mode choice
behavior from cross-sectional data?
One approach is to estimate a model with (pure)
random coefficients where the logit mode bias and
level-of-service parameters are assumed to be ran
domly distributed in the population. This approach
ignores any systematic variations in preferences and
response across individuals. As such, it cannot be
considered as a substitute for careful identification
of systematic variations in the population. The ran
dom coefficients cannot even be considered as an
alternative approach (i.e., alternative to accommo
dating systematic effects) to account for heterogene
ity in choice models; it can only be considered as a
potential “add-on” to a model that has attributed as
much heterogeneity to systematic variations as pos
sible (HOROWITZ, 1991 makes a similar point in the
context of the use of multinomial probit models in
travel demand modeling). Typical applications of the
random-coefficients specification have used a base
systematic model with some allowance for system
atic preference heterogeneity (by including individ
ual-related variables directly in the utility function),
but with little (or no) allowance for systematic re
sponse heterogeneity.1
It is clear from above that accommodating system
atic preference and response heterogeneity in multi
nomial logit models must be a critical focus in mode
choice modeling. Systematic heterogeneity may be
accommodated in one of two broad ways: exogenous
market segmentation or endogenous market seg
mentation. We discuss each of these two approaches
in the subsequent two paragraphs (as indicated ear
lier, a random coefficients specification can be su
perimposed over the systematic model in each of
these approaches; however, in the rest of this paper,
we focus only on the systematic specifications).
The exogenous market segmentation approach to
capturing systematic heterogeneity assumes the ex
istence of a fixed, finite number of mutually exclu
sive market segments (each individual can belong to
one and only one segment). The segmentation is
based on key sociodemographic variables (sex, in
come, etc.) and possibly trip characteristics (whether
an individual travels alone, distance of trip, etc.).
Within each segment, all individuals are assumed to
have identical preferences and identical sensitivities
to level-of-service variables (i.e., the same utility
function).2 The total number of segments is a func
tion of the number of segmentation variables and
the number of segments defined for each segmenta
tion variable. Ideally, the analyst would consider all
sociodemographic and trip-related variables avail
able in the data for segmentation (we will refer to
such a segmentation scheme as full-dimensional ex
ogenous market segmentation). However, a practi
cal problem with the full-dimensional exogenous
1. The reader is referred to a report by the Electric Power Research Institute, 1977, for an application
of the random-coefficients logit formulation using cross-sectional data; Gonul and Srinivasan (1993)
applied the random-coefficients logit formulation to estimate brand choice models from panel data. The
random-coefficients formulation has also been used in combination with the probit model of choice
(Hausman and Wise, 1978; Fischer and Nagin, 1981; Horowitz, 1993).
2. To be precise, all individuals in a segment are assumed to have
identical preference patterns and sensitivity to level-of-service
patterns, not necessarily identical preferences and level-of-service
sensitivity. This is because continuous variables such as income
and distance can appear in the segment-specific utility functions
even if they are used as segmentation variables. For example,
within each segment, preference may be a function of income, and
level-of-service sensitivity may be a function of trip distance.
segmentation scheme is that the number of seg
ments grows very fast with the number of segmen
tation variables, creating both interpretational and
estimation problems due to inadequate observations
in each segment. To overcome this limitation, re
searchers have used one of two methods. The first
method, which we label as the Refined Utility Func
tion Specification method, accommodates preference
heterogeneity by introducing key segmentation vari
ables directly into the utility function as alternative
specific variables and recognizes response heteroge
neity by interacting the level-of-service variables
with the segmentation variables (we will refer to
sociodemographic and trip characteristics that are
likely to impact mode preferences and level of ser
vice sensitivity as segmentation variables). The re
fined utility function specification method is a re
strictive version of the full-dimensional exogenous
market segmentation approach where only lower
order interaction effects of the segmentation vari
ables on preference and response are allowed. The
second method, which we label the Limited-Dimen
sional Exogenous Market Segmentation method,
overcomes the practical problem of the full-segmen
tation approach by using only a subset of the demo
graphic and trip variables (typically one or two) for
segmentation. The advantage of the two methods
just discussed is that they are practical (the param
eters can be efficiently estimated with data sizes
generally available for mode choice analysis) and are
easy to implement (requires only the MNL soft
ware). The disadvantage is that their practicality
comes at the expense of suppressing potentially
higher-order interaction effects of the segmentation
variables on preference and response to level-of-ser
vice measures. In addition, an intrinsic problem
with all exogenous market segmentation methods is
that the threshold values of the continuous segmen
tation variables (for example, income) which define
segments have to be established in a rather ad hoc
fashion.
The endogenous market segmentation approach,
on the other hand, attempts to accommodate sys
tematic heterogeneity in a practical manner not by
suppressing higher-order interaction effects of seg
mentation variables (on preference and response to
level-of-service measures), but by reducing the di
mensionality of the segment-space. Each segment,
however, is allowed to be characterized by a large
number of segmentation variables. The appropriate
number of segments representing the reduced seg
ment-space is determined statistically by succes
sively adding an additional segment till a point is
reached where an additional segment does not re
sult in a significant improvement in fit. Individuals
are assigned to segments in a probabilistic fashion
based on the segmentation variables. The approach
jointly determines the number of segments, the as
signment of individuals to segments, and segment
specific choice model parameters. Since this ap
proach identifies segments without requiring a
multi-way partition of data as in the full-dimen
sional exogenous market segmentation method, it
allows the use of many segmentation variables in
practice and, therefore, facilitates incorporation of
the full order of interaction effects of the segmenta
tion variables on preference and level-of-service sen
sitivity. The method also obviates the need to (arbi
trarily) establish the threshold values defining
segments for continuous segmentation variables. As
we indicate later, the approach also does not exhibit
the individual-level independence from irrelevant
alternatives (IIA) property of the exogenous segmen
tation approach. A potential disadvantage is that
the model cannot be estimated directly using the
MNL software (we overcome this issue in the cur
rent paper).
The model formulation in this paper falls under
the endogenous segmentation approach to accommo
dating systematic heterogeneity. We use a multino
mial logit formulation for modeling both segment
membership as well as mode choice. To the author’s
knowledge, no previous market segmentation anal
ysis in the travel demand field has adopted this
approach. But the approach has been applied earlier
in the marketing field. kamakura and russell
(1989) originally proposed such a method to model
brand choice. Their model assumed that the seg
ment membership probabilities were invariant
across households (i.e., the only variables in their
multinomial logit model for segment membership
were segment-specific constants). Such a model is of
limited value since the objective of segmentation
schemes is to associate heterogeneity with observ
able individual characteristics. gupta and chinta
gunta (1994) extended the work of Kamakura and
Russell by including segmentation variables in the
segment membership MNL model (dayton and Ma
cready, 1988 also propose an endogenous segmen
tation model in which segment membership is func
tionally related to individual-related variables;
however, the segment membership and choice prob
abilities do not take a MNL structure).
The model in this paper takes the same form as
the model of Gupta and Chintagunta. However, the
paper proposes and applies an efficient, stable, hy
brid estimation procedure that combines an Expec
tation-Maximization (EM) formulation (which ex
ploits the special structure of the model) with a
traditional quasi-Newton maximization algorithm
(both Kamakura and Russell, 1989 and Gupta and
Chintagunta, 1994 use a direct maximum likelihood
method, which we found to be rather unstable and to
require more time than the method proposed here).
The EM formulation requires only the standard
MNL software and can be used in isolation to esti
mate the endogenous segmentation model. Hence, it
should be of interest to researchers and practitio
ners who do not have access to, or who are not in a
position to invest time and effort in learning to use,
general-purpose maximum likelihood software. The
endogenous segmentation model is applied in a
travel demand context to study intercity business
mode choice behavior in the Toronto-Montreal cor
ridor of Canada and its empirical performance is
assessed relative to alternative methods to account
for systematic preference and response heterogene
ity.
The rest of this paper is structured as follows. The
next section presents the structure for the mode
choice model with endogenous market segmenta
tion. Section 2 outlines the estimation procedure.
Section 3 discusses the empirical results obtained
from applying the model to intercity mode choice
modeling. Section 4 presents the choice elasticities
and examines the policy implications of the results.
The final section provides a summary of the re
search findings.
1. MODEL STRUCTURE
THE MODE CHOICE MODEL with endogenous segmentation rests on the assumption that there are S
relatively homogenous segments in the intercity travel market (S is to be determined); within each
segment, the pattern of intrinsic mode preferences and the sensitivity to level-of-service measures
are identical across individuals. However, there are differences in intrinsic preference patterns and
level-of-service sensitivity among the segments. Thus, there is a distinct mode choice model for each
segment s (s = 1, 2, 3, . . . , S).
We assume a random utility framework as the basis for individuals' choice of mode. We also assume
that the random components in the mode utilities have a type I extreme value distribution and are
independent and identically distributed. Then, the probability that an individual q chooses mode i
from the set C_q of available alternatives, conditional on the individual belonging to segment s,
takes the familiar multinomial logit form (McFADDEN, 1973),
\[
P_q(i)\mid s = \frac{e^{\beta_s' x_{qi}}}{\sum_{j \in C_q} e^{\beta_s' x_{qj}}} \tag{1}
\]
where x_qi is a vector comprising level-of-service and alternative-specific variables associated with
alternative i and individual q, and β_s is a parameter vector to be estimated.
The probability that individual q belongs to segment s is next written as a function of a vector z_q
of sociodemographic and trip-related variables associated with the individual (z_q includes a
constant). Using a multinomial logit formulation, this probability can be expressed as
\[
P_{qs} = \frac{e^{\gamma_s' z_q}}{\sum_{s'=1}^{S} e^{\gamma_{s'}' z_q}} \tag{2}
\]
The unconditional (on segment membership)
probability of individual q choosing mode i from the
set Cq of available alternatives can be written from
equations (1) and (2) as
\[
P_q(i) = \sum_{s=1}^{S} P_{qs} \times \left[ P_q(i)\mid s \right] \tag{3}
\]
The assignment of individuals to segments based
on sociodemographic and trip characteristics is a
critical part of market segmentation, as discussed in
section 1. In the current model, this assignment is
probabilistic and is based on equation (2) after
replacing the γ_s with their estimated counterparts.
The size of each segment (in terms of share), Rs, may
be obtained as
\[
R_s = \frac{\sum_{q} P_{qs}}{Q} \tag{4}
\]
where Q is the total number of individuals in the
estimation sample.
The sociodemographic and trip-related attributes
that characterize each segment can be inferred from
the signs of the coefficients in equation (2). A more
intuitive way is to estimate the mean of the at
tributes in each segment as follows:
2 q PqsZq
zs= y p (5) Z^q * qs
The model can be used to predict the choice of
mode at the individual level, segment level, or the
market level. The individual-level choice probabili
ties can be obtained from equation (3). The segment
level mode choice shares can be obtained as
\[
W_s(i) = \frac{\sum_{q} P_{qs}\,[P_q(i)\mid s]}{\sum_{q} P_{qs}} \tag{6}
\]
Finally, the market-level mode choice shares may be
computed as
\[
W(i) = \frac{\sum_{q} \sum_{s} P_{qs}\,[P_q(i)\mid s]}{Q} = \sum_{s} R_s \times W_s(i) \tag{7}
\]
The elasticities of the effect of level-of-service attributes can also be computed at the individual
level, the segment level, or the market level. These can be derived in a straightforward fashion from
the expressions above and are not presented here due to space considerations. An examination of the
individual-level cross-elasticities from the model will show that the endogenous segmentation model
is not saddled with the independence from irrelevant alternatives property of the MNL model.
2. MODEL ESTIMATION
THE PARAMETERS to be estimated in the mode choice model with endogenous market segmentation are the parameter vectors \beta_s and \gamma_s for each s, and the number of segments S. The log likelihood function to be maximized for given S can be written as (we discuss the procedure employed to determine S in section 2.2)
\[ \mathcal{L} = \sum_{q=1}^{Q} \log \sum_{s=1}^{S} \Big[ P_{qs} \times \prod_{i \in C_q} \{ P_q(i|s) \}^{\delta_{qi}} \Big] \qquad (8) \]
where C_q is the choice set of alternatives for the qth individual and \delta_{qi} is defined as follows:
\[ \delta_{qi} = \begin{cases} 1 & \text{if the $q$th individual chooses alternative } i \\ 0 & \text{otherwise} \end{cases} \qquad (q = 1, 2, \ldots, Q;\ i \in C_q) \qquad (9) \]
Equation (8) represents an unconditional likelihood of an observed choice sample and is characteristic of finite probability mixture models.
It has been noted earlier (see McLachlan and Basford, 1988 and Redner and Walker, 1984) that maximization of the likelihood function using the usual Newton or quasi-Newton (secant) routines in such mixture models can be computationally unstable (we document our own experience of this unstable behavior in section 3.4). A critical issue in such cases is to start the maximum likelihood iterations with good start parameter estimates. To obtain good start values, we develop a two-stage iterative method which belongs to the EM family of algorithms (Dempster, Laird and Rubin, 1977). This family of algorithms has been suggested earlier as a natural candidate for estimation in the general class of finite-mixture models since it is stable and tends to increase the log-likelihood function more than usual quadratic maximization routines in areas of the parameter space distant from the likelihood maximum (Ruud, 1991). The specific EM method is discussed below.
2.1. EM Method for Start Values

Consider the likelihood function in equation (8) and rewrite it as
\[ \mathcal{L} = \sum_{q=1}^{Q} \log \Big\{ \sum_{s=1}^{S} P_{qs} L_{qs} \Big\} \qquad (10) \]
where
\[ L_{qs} = \prod_{i \in C_q} [P_q(i|s)]^{\delta_{qi}}. \]
The value L_{qs} is the choice likelihood for individual q conditional on the individual belonging to segment s. Next, note that the expected value of membership in segment s for individual q conditional on sociodemographic/trip characteristics and on the observed mode choice of the individual (i.e., the posterior segment membership probability \bar{P}_{qs}) can be obtained by revising the expected value of segment membership conditional only on the individual's sociodemographic/trip characteristics (i.e., the prior segment membership probability P_{qs}) in a Bayesian fashion as
\[ \bar{P}_{qs} = \frac{P_{qs} L_{qs}}{\sum_{s'=1}^{S} P_{qs'} L_{qs'}} \qquad (11) \]
The necessary first-order conditions for maximizing the likelihood function can be written from equations (10) and (11) as
\[ \frac{\partial \mathcal{L}}{\partial \beta_s} = \sum_{q=1}^{Q} \bar{P}_{qs}\, \frac{\partial \log L_{qs}}{\partial \beta_s} = 0 \quad \text{for } s = 1, 2, \ldots, S \qquad (12) \]
\[ \frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{q=1}^{Q} \sum_{s=1}^{S} \bar{P}_{qs}\, \frac{\partial \log P_{qs}}{\partial \gamma} = 0. \qquad (13) \]
Equation (12) indicates that, for given values of \bar{P}_{qs}, the maximum likelihood estimate of each \beta_s vector (s = 1, 2, . . . , S) is obtained from a standard multinomial logit estimation for choice of mode with the corresponding \bar{P}_{qs} values as weights (thus, there are S individual multinomial logit estimations). Equation (13) shows that for given values of \bar{P}_{qs}, the maximum likelihood estimate of the \gamma_s vector is obtained from a single multinomial logit estimation for choice of segment with the posterior probabilities \bar{P}_{qs} being used as the "dependent variable" instead of the "missing" segment choice data. A more efficient way to simultaneously obtain the estimates of \beta_s for each segment and estimates of \gamma_s (for given values of \bar{P}_{qs}) is to construct a new log likelihood function as follows:
\[ \mathcal{L}^* = \sum_{q=1}^{Q} \Big[ \sum_{s=1}^{S} \bar{P}_{qs} \log L_{qs} + \log L^*_q \Big] = \sum_{q=1}^{Q} \sum_{s=1}^{S} \bar{P}_{qs}\, \{ \log L_{qs} + \log P_{qs} \}, \qquad (15) \]
where
\[ L^*_q = \prod_{s=1}^{S} [P_{qs}]^{\bar{P}_{qs}}. \qquad (14) \]
Since the log-likelihood function above is a combination of MNL log-likelihood functions, it is globally concave for given \bar{P}_{qs} values.
Equations (11) and (15) lend themselves gainfully to the application of an EM algorithm. The E-step involves the computation of the a posteriori membership probabilities using equation (11) for some given initial guesses for the parameters. In the M-step, equation (15) is maximized with respect to the parameters using the a posteriori membership probabilities computed in the E-step. Since the function in equation (15) is globally concave for given \bar{P}_{qs}, a simple Newton-Raphson search will converge quickly. The two steps are then alternated. Dempster, Laird and Rubin (1977) prove (under a very general setting) that, at each iteration, the EM algorithm provides monotone increasing values of the original log-likelihood function in equation (8).
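As a concrete illustration of the E-step and M-step just described, the following sketch implements one pass of the iteration for simulated data. Everything here is an assumption of the illustration, not the author's code: the data layout, the helper names (cond_probs, member_probs, em_step), and the use of a general-purpose BFGS optimizer in place of the standard multinomial logit estimations referred to in the text.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def cond_probs(beta_s, x):                       # P_q(i|s) for one segment, eq. (1)
    v = x @ beta_s
    v -= v.max(axis=1, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=1, keepdims=True)

def member_probs(gamma_free, z, S):              # prior P_qs, eq. (2), last segment as base
    g = np.vstack([gamma_free.reshape(S - 1, -1), np.zeros((1, z.shape[1]))])
    w = z @ g.T
    w -= w.max(axis=1, keepdims=True)
    e = np.exp(w)
    return e / e.sum(axis=1, keepdims=True)

def em_step(beta, gamma_free, x, z, choice):
    Q, S = x.shape[0], beta.shape[0]
    # E-step, eq. (11): posterior membership probabilities
    L = np.stack([cond_probs(beta[s], x)[np.arange(Q), choice] for s in range(S)], axis=1)
    post = member_probs(gamma_free, z, S) * L
    post /= post.sum(axis=1, keepdims=True)
    # M-step, eq. (15): S weighted logit fits for the beta_s ...
    new_beta = np.empty_like(beta)
    for s in range(S):
        nll = lambda b: -np.sum(post[:, s] *
                                np.log(cond_probs(b, x)[np.arange(Q), choice]))
        new_beta[s] = minimize(nll, beta[s], method='BFGS').x
    # ... and one fit for gamma, with the posteriors acting as the "dependent variable"
    nll_g = lambda g: -np.sum(post * np.log(member_probs(g, z, S)))
    new_gamma = minimize(nll_g, gamma_free, method='BFGS').x
    return new_beta, new_gamma

# toy usage: simulate data and run a single EM iteration from rough start values
rng = np.random.default_rng(1)
Q, I, S, K, M = 300, 3, 2, 2, 2
x = rng.normal(size=(Q, I, K))
z = np.column_stack([np.ones(Q), rng.normal(size=Q)])
choice = rng.integers(0, I, size=Q)
beta0 = np.vstack([np.zeros(K), 0.1 * np.ones(K)])   # slight shift to break symmetry
gamma0 = np.zeros((S - 1) * M)
beta1, gamma1 = em_step(beta0, gamma0, x, z, choice)
\end{verbatim}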
In principle, the EM steps can be iterated until convergence (at convergence, the EM algorithm will return the maximum-likelihood estimates). But, as we show in section 3.4, the rate of convergence of the EM algorithm is slow in the neighborhood of the likelihood maximum (this has also been documented in other applications of the EM algorithm; see Redner and Walker, 1984 and Watson and Engle, 1983). So we stopped the EM algorithm prematurely after a few iterations and used the resulting parameter estimates as the start values for the direct maximization of the log-likelihood in equation (8).
However, the reader will note that the EM algorithm proposed here can be used in isolation (without combining with the direct maximum likelihood procedure) to obtain the parameters of the endogenously segmented mode choice model if access to general maximum-likelihood software is not available (the EM method requires only the standard multinomial logit software).
2.2. Determination of Number of Segments

The estimation procedure discussed above is applicable for a given number of segments S. To identify the most appropriate value for S, we estimate the model for increasing values of S (S = 2, 3, 4, . . .) until a point is reached where an additional segment does not significantly improve model fit. The evaluation of model fit is based on the Bayesian Information Criterion (BIC)
\[ \mathrm{BIC} = -\mathcal{L} + 0.5\, R \ln(N). \qquad (16) \]
The first term on the right side is the negative of the log-likelihood value at convergence; R is the number of parameters estimated, and N is the number of observations (see Allenby, 1990). As the number of segments, S, increases, the BIC value keeps declining till a point is reached where an increase in S results in an increase in the BIC value. Estimation is terminated at this point and the number of segments corresponding to the lowest value of BIC is considered the appropriate number for S.3

[Footnote 3: An alternative procedure to determine the number of segments is to apply the Akaike Information Criterion developed by Akaike (1974). This is the procedure adopted by Kamakura and Russell (1989). Gupta and Chintagunta (1994), on the other hand, use the BIC. An advantage of the BIC is that it takes into consideration the sample size used in the analysis. The Akaike Information Criterion tends to unduly favor a model with a larger number of segments for large data samples; the BIC corrects this situation (see Allenby, 1990).]
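A minimal sketch of the segment-selection rule in equation (16), using hypothetical log-likelihood values (the numbers below are placeholders, not estimates from the paper):
\begin{verbatim}
import numpy as np

def bic(loglik_at_convergence, n_params, n_obs):
    """Equation (16): BIC = -L + 0.5 * R * ln(N)."""
    return -loglik_at_convergence + 0.5 * n_params * np.log(n_obs)

# hypothetical candidates: S -> (log-likelihood at convergence, number of parameters)
candidates = {2: (-1180.0, 20), 3: (-1130.0, 31), 4: (-1124.0, 42)}
N = 1500
bic_values = {S: bic(ll, R, N) for S, (ll, R) in candidates.items()}
best_S = min(bic_values, key=bic_values.get)   # keep the S with the lowest BIC
\end{verbatim}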
3. AN APPLICATION TO INTERCITY MODE CHOICE MODELING

3.1. Background

In this section, we report the empirical results obtained from applying the endogenous segmentation model and other alternative methods of accommodating systematic heterogeneity to an analysis of intercity travel mode choice behavior. The data used in the current empirical analysis was assembled by VIA Rail (the Canadian national rail carrier) in 1989 to develop travel demand models to forecast future intercity travel in the Toronto-Montreal corridor and to estimate shifts in mode split in response to a variety of potential rail service improvements (see KPMG Peat Marwick and Koppelman, 1990 for a detailed description of this data). The data includes sociodemographic and general trip-making characteristics of the traveler, and detailed information on the current trip (purpose, party size, origin and destination cities, etc.). The set of modes available to travelers for their intercity travel was determined based on the geographic location of the trip (the universal choice set included car, air, train, and bus). Level-of-service data were generated for each available mode and each trip based on the origin/destination information of the trip (the assembly of the level-of-service data was done by KPMG Peat Marwick for VIA Rail; for information on detailed procedures to compute level-of-service measures, the reader is referred to KPMG Peat Marwick and Koppelman, 1990).
In this article, we focus on intercity mode choice for paid business travel in the corridor. The study is further confined to a mode choice examination among air, train, and car due to the very small number of individuals choosing the bus mode in the sample and also because of the poor quality of the bus data (see Forinash, 1992). This is not likely to affect the applicability of the mode choice model in any serious way since the bus share for paid business travel in the corridor is less than one percent. The sample used is a choice-based sample and comprises 3593 business travelers. The share of the choice of each mode in the population of travelers was available from supplementary aggregate data. Since the sample is choice-based with known aggregate shares, we employed the Weighted Exogenous Sample Maximum Likelihood method proposed by Manski and Lerman (1977) in estimation. The asymptotic covariance matrix of parameters is computed as H^{-1}AH^{-1}, where H is the hessian and A is the cross-product matrix of the gradients (H and A are evaluated at the estimated parameter values). This provides consistent standard errors of the parameters (Börsch-Supan, 1987).
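The covariance formula just mentioned can be evaluated directly once the Hessian and the per-individual score contributions are available from whatever routine produced the estimates; a minimal sketch (the inputs and the function name are assumptions of the illustration) is:
\begin{verbatim}
import numpy as np

def wesml_covariance(hessian, scores):
    """Robust covariance H^{-1} A H^{-1}; `scores` is the Q x K matrix of
    per-individual gradient contributions evaluated at the estimates."""
    H_inv = np.linalg.inv(hessian)
    A = scores.T @ scores
    return H_inv @ A @ H_inv
\end{verbatim}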
Five variables associated with individual sociodemographics and trip characteristics were available in the data, and all of these were considered for market segmentation. The variables are: income, sex (female or male), travel group size (traveling alone or traveling in a group), day of travel (weekend travel or weekday travel), and (one-way) trip distance. The segmentation variables were introduced as alternative-specific variables in the logit model of equation (2) with the last segment being the base.
The level-of-service variables used to model choice of mode included modal level-of-service measures (frequency of service, total cost, in-vehicle travel time and out-of-vehicle travel time) and a large city indicator which identified whether a trip originated, terminated, or originated and terminated in a large city. This variable may be viewed as a "proxy" level-of-service measure capturing mode characteristics associated with large cities. Table I provides descriptive statistics on the modal level-of-service measures in the sample.

TABLE I. Mean (and Standard Deviation) of Modal Level-of-Service Variables
Mode   Frequency (departures/day)   Total Cost (Canadian $)   In-Vehicle Time (mins.)   Out-of-Vehicle Time (mins.)
Train  4.21 (2.3)                   58.58 (17.7)              244.50 (115.0)            86.32 (22.0)
Air    25.24 (14.0)                 157.33 (21.7)             57.72 (19.2)              106.74 (24.9)
Car    Not applicable               70.56 (32.7)              249.60 (107.5)            0.00 (0.0)
Note: All measures are one-way values. Air and train frequency refer to total departures by all carriers.
3.2. Endogenous Market Segmentation Model Results

We estimated the endogenous segmentation model with S = 2, 3, and 4 segments. In all these estimations, travel time was best represented simply by decomposition into in-vehicle and out-of-vehicle travel times. For all S, we found that the use of travel cost was preferred to travel cost deflated by income. The final mode choice model specification for the segmented models included the alternative-specific mode-bias constants and the level-of-service variables (we examined if there were differences in preferences based on income and trip distance within each segment by including these variables as alternative-specific variables in the mode choice utility, but did not find any statistically significant effects).
We computed the BIC value as shown in equation (16) for each number of segments. The BIC value decreased as we moved from two to three segments, but increased when we added the fourth segment (the BIC value was 2291 for the two-segment case, 2247.9 for the three-segment case and 2259.8 for the four-segment case). Thus, we selected three segments as the appropriate number of segments.
The estimation results for the three-segment solution are shown in Table II. The signs of all the level-of-service variables in the segment-specific mode choice models are consistent with a priori expectations. However, there are clear differences in intrinsic preferences and sensitivities to level-of-service measures among the segments. Turning to the segmentation model, the coefficients in this model are, in general, very significant.4
TABLE II. Intercity Mode Choice Analysis: Parameter Estimates for the Three-Segment Solution
[Table II reports, for each of the three segments, the segment-specific mode choice model parameters (mode constants for train and air, the large city indicator, frequency of service, travel cost in Canadian dollars, and in-vehicle and out-of-vehicle travel time in minutes) and the segmentation model parameters (constant, income, female, traveling alone, weekend travel, trip distance), together with t-statistics. The estimated segment sizes are 0.4866 (segment 1), 0.1220 (segment 2), and 0.3914 (segment 3, the base segment). The individual coefficient entries are not reproduced here.]
Note: The log-likelihood value at zero is -3947.3; the log-likelihood value for the unsegmented model with only alternative-specific constants is -3648.2. The log-likelihood value at convergence for the three-segment solution is -2100.5.
TABLE III. Qualitative Characterization of Mode Choice Preferences and Sociodemographic/Trip Attributes of Segments
[Table III summarizes, for each of the three segments, the intrinsic mode preferences (car and air for segment 1, with air preferred if the trip originates or ends in a large city; train for segment 2; air for segment 3), the sensitivity to frequency, cost, in-vehicle travel time (IVTT) and out-of-vehicle travel time (OVTT), the implied money values of time (roughly $26-$44/hour for segment 1, $1.05-$8.30/hour for segment 2, and $237-$588/hour for segment 3), and the income, sex, travel group size, day-of-travel, and trip-distance characteristics of each segment.]
Note: MV stands for "money value of time".
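The money values of time reported in Table III follow from the ratio of a travel-time coefficient to the cost coefficient within a segment. A small sketch with hypothetical coefficients (these are illustrative values, not the estimates from Table II) is:
\begin{verbatim}
def money_value_of_time(beta_time_per_min, beta_cost_per_dollar):
    """Implied willingness to pay, in $ per hour, for a one-minute travel-time saving."""
    return 60.0 * beta_time_per_min / beta_cost_per_dollar

# hypothetical segment coefficients: time in minutes, cost in Canadian dollars
print(money_value_of_time(beta_time_per_min=-0.019, beta_cost_per_dollar=-0.044))
\end{verbatim}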
A comparison of the coefficients across the segments provides important qualitative information regarding the characteristics of each segment vis-à-vis other segments. The constants in the segmentation model contribute to the size of each segment and do not have any substantive interpretation. The segment sizes are computed using equation (4); segment one is the largest (48.7% of intercity travelers) and segment two is the smallest (12.2% of intercity travelers).
Table III summarizes (qualitatively) the mode choice characteristics and the sociodemographic and trip characteristics of individuals within each segment. Segment two has a high overall preference for train, is highly sensitive to cost, and is low in sensitivity to travel time. This result is reasonable since individuals in segment two earn low incomes, and travel in a group, and on weekends; the high intrinsic preference for train in this segment can also be attributed to the intermediate trip distance of travel in the segment. Further, previous studies (for example, KPMG Peat Marwick et al., 1993) have indicated that women prefer common-carrier surface modes more than men do, which is consistent with the result obtained here (segment two has the highest female proportion). The high frequency sensitivity of segment two may be associated with group travel; individuals traveling together may have different schedule conveniences and may therefore place a high premium on frequency of service. Segment three has a high overall preference for air, is not very sensitive to cost, and is highly sensitive to travel time; this is consistent with the high income of individuals in this segment and the single-individual, weekday, long distance travel characteristics of the segment. The money value of time is very high in this segment. Segment one is the intermediate segment, both in level-of-service sensitivity and sociodemographic/trip characteristics. Individuals in the first segment whose trips do not originate or end in a large city have an intrinsic preference for the car and air modes over the train mode, and for the car mode over the air mode. For individuals whose trips originate or end in a large city, there is a greater intrinsic preference for air over the car mode in this segment.

[Footnote 4: It is useful to note that had we adopted a full-dimensional exogenous segmentation scheme, there would have been (assuming only two income segments and two trip-distance segments) 2^5 = 32 market segments with an average of only about 110 observations within each segment.]
To characterize each segment more intuitively, we compute the mean values of the segmentation variables within each segment (see equation (5)).

TABLE IV. Mean Values of Demographic and Trip Variables in Each Segment
Variable                 Segment 1   Segment 2   Segment 3   Overall Market
Income (x 10^3 Can$)     52.16       44.09       60.28       54.36
Trip distance (km)       311.80      373.37      444.76      371.35
Female                   0.13        0.48        0.20        0.20
Traveling alone          0.69        0.57        0.77        0.70
Weekend travel           0.20        0.62        0.19        0.25
TABLE V. Empirical Comparison of the Endogenous Segmentation Model with Refined Utility Function Specification Models
Statistic                          First-Order Preference Interactions   First-Order Preference and First-Order Response Interactions (Full / Preferred version)   Second-Order Preference and First-Order Response Interactions (Full / Preferred version)
Number of parameters               18                                    46 / 24                                                                                   66 / 32
Log-likelihood at convergence      -2299.38                              -2199.07 / -2211.48                                                                       -2154.09 / -2174.71
Adjusted likelihood ratio index(a) 0.4129                                0.4312 / 0.4337                                                                           0.4376 / 0.4409
Result of non-nested test with endogenous segmentation model: reject very strongly the hypotheses that any (and each) of the above models is as good as the endogenous segmentation model.
Note: The likelihood value at convergence for the endogenous segmentation model is -2100.5 and the adjusted likelihood ratio index for this model is 0.4587 (number of parameters in the model is 36).
(a) The adjusted likelihood ratio index is computed as \bar{\rho}^2 = 1 - [(\mathcal{L}(\hat{\beta}) - k)/\mathcal{L}(0)], where \mathcal{L}(\hat{\beta}), \mathcal{L}(0), and k are the log-likelihood value at convergence, the log-likelihood value at zero, and the number of model parameters, respectively.
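The adjusted likelihood ratio index defined in the note above can be reproduced directly from the log-likelihood values quoted in the text and table notes for the endogenous segmentation model; a small check:
\begin{verbatim}
def adjusted_rho_squared(ll_convergence, ll_zero, n_params):
    """rho-bar^2 = 1 - (L(beta) - k) / L(0), as defined in the table notes."""
    return 1.0 - (ll_convergence - n_params) / ll_zero

# values quoted for the endogenous segmentation model
print(adjusted_rho_squared(ll_convergence=-2100.5, ll_zero=-3947.3, n_params=36))  # ~0.4587
\end{verbatim}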
The results are presented in Table IV and support our previous observations regarding segment characteristics.
3.3. Comparison of Endogenous Market Segmentation with Alternative Approaches

In this section, we compare the empirical performance of the endogenous market segmentation approach relative to the two alternative extant methods of accommodating systematic heterogeneity in preference and response across the population (these alternative methods are the refined utility function specification method and the limited-dimensional exogenous market segmentation method).
Table V summarizes the results of models obtained using the refined utility function method. We present the overall measures of fit for three different specifications. The first specification includes a) first-order interactions of all the five segmentation variables with mode preferences, b) the large city indicator variable, and c) the four modal level-of-service variables: frequency, cost, and in-vehicle and out-of-vehicle travel times (variants of this specification that incorporate the cost level-of-service as cost/income and the out-of-vehicle time measure as out-of-vehicle time over distance, as has been done in some earlier mode choice analyses, were rejected by the specification reported here). All the five segmentation variables had a statistically significant impact on train and air mode preferences (car is the base; the impact of each of these variables was also not the same across the train and air modes). The second specification adds first-order interactions of the five segmentation variables with the level-of-service measures.5 This specification, in its full version, effectively adds 28 level-of-service variables to the first specification. However, of these, only six of the interaction terms were significant and these are retained in the preferred version. The third specification, in its full version, adds all (ten) second-order interactions of the five segmentation variables with mode preferences to the full version of the second specification (since there are three modes, this third specification adds 20 additional parameters to the full version of the second specification). In this third specification, only seven of the second-order interactions with preference and seven of the first-order interactions with the level-of-service variables were significant. These are retained in the preferred version. As can be observed from the table, the adjusted likelihood ratio indices of the refined utility function specification models are smaller than that of the endogenous segmentation model. The endogenous segmentation model rejects the refined utility function models in Table V in non-nested adjusted likelihood ratio tests (see Ben-Akiva and Lerman, 1985, page 172 for the structure of the test). Thus, it provides a better fit to the data. We also found that some of the statistically significant coefficients in the refined utility models were non-intuitive. For example, individuals in the lower income bracket were found to be less sensitive to cost and more sensitive to out-of-vehicle time than individuals in the middle and higher income brackets.

[Footnote 5: For the two continuous variables, income and trip distance, we defined three distinct ranges and allowed the level-of-service sensitivity to be different for the different ranges. The income segments were: less than $40,000, $40,000-$65,000, and greater than $65,000. The segments for trip distance were less than 250 miles, 250-500 miles, and greater than 500 miles.]
TABLE VI. Empirical Comparison of the Endogenous Segmentation Model with Limited-Dimensional Exogenous Segmentation Models
[Table VI reports, for one-way exogenous segmentations by income, sex, travel group size, day of travel, and trip distance, and for all two-way combinations of these variables, the number of segments, the number of parameters, the log-likelihood at convergence, the adjusted likelihood ratio index, and a "favorable" adjusted likelihood ratio index. The individual cell entries are not reproduced here.]
Note: The adjusted likelihood ratio index for the endogenous segmentation model is 0.4587 and the number of parameters in this model is 36.
(a) The adjusted likelihood ratio index is computed as \bar{\rho}^2 = 1 - [(\mathcal{L}(\hat{\beta}) - k)/\mathcal{L}(0)], where \mathcal{L}(\hat{\beta}), \mathcal{L}(0), and k are the log-likelihood value at convergence, the log-likelihood value at zero, and the number of model parameters, respectively.
(b) The "favorable" adjusted likelihood ratio index for the one-way segmentations is computed by using the log-likelihood value at convergence for the segmented models, but by assuming the number of parameters to be equal to that in the unsegmented model.
(c) The "favorable" adjusted likelihood ratio index for two-way segmentations is computed by using the log-likelihood at convergence from the segmented models, but assuming the number of parameters to be the same as that in the endogenous segmentation model.
(d) There was no variation in income in the highest income segment and hence income is not introduced as an alternative-specific variable in the high-income segment.
Table VI presents model results for one-way segmentations and for all possible combinations of two-way variable segmentations in the spirit of the limited-dimensional exogenous market segmentation method. The specification adopted within each segment is based on the first specification of the refined utility function specification in Table V.
The one-way segmentation results show that, on the basis of the adjusted likelihood ratio index \bar{\rho}^2, trip distance is the most important segmentation variable (all segmentations reject the unsegmented model in nested likelihood ratio tests at the 0.05 significance level or better). We also find that all the exogenously segmented models have a \bar{\rho}^2 which is significantly less than that of the endogenous segmentation model based on a non-nested test. Since these results may be deceiving (the analyst can specify improved exogenous segmentation models by constraining coefficients across segments), we computed the \bar{\rho}^2 for the one-way segmented models under the most favorable (albeit unrealistic) scenario where the likelihood value corresponds to the exogenously segmented model, but the number of parameters is assumed to be equal to that of the unsegmented model. The resulting "favorable" \bar{\rho}^2 values are shown in the table and indicate that, even in this most favorable scenario, the \bar{\rho}^2 for the one-way segmentation models are significantly smaller than that of the endogenous segmentation model.
The results of the two-way exogenous segmentation models reject the corresponding one-way segmentation models in nested likelihood ratio tests. This suggests that there is benefit to increasing the order of interaction effects of the segmentation variables on preference and response. But, even for these two-way segmentations, the \bar{\rho}^2 values are significantly smaller than that of the endogenous segmentation model. As in the case of the one-way segmentation model, we computed a "favorable" \bar{\rho}^2 value for the two-way models, this time assuming that the number of parameters in the two-way segmentation models is the same as in the endogenous segmentation model (i.e., 36 parameters). These values are presented in the final column and indicate that, even under this favorable scenario, the only model that is close to the endogenous segmentation model is the one that segments by income and trip distance. However, a non-nested test convincingly rejects even this best "favorable-scenario" specification relative to the endogenous segmentation model. In addition, as in the refined utility specifications, we obtained non-intuitive results in the income-distance two-way segmentation model; the parameter on cost was positive in all the segments (and in two segments, significantly so) except in the high income-high trip distance segment. Further, in all but two segments, the parameter on in-vehicle time was statistically insignificant (in part, this is a consequence of imprecision in coefficient estimates due to the small number of observations in the segments).
In summary, the empirical results indicate clearly that the endogenous segmentation model provides a better fit to the data and offers more intuitively appealing results compared to other extant approaches to incorporating systematic heterogeneity. This can be attributed to the inclusion of the full order of interaction effects of the segmentation variables on preference and level-of-service sensitivity.
3.4. Assessment of the Performance of the Hybrid EM-DFP Algorithm

The EM algorithm produces iteration sequences with good global convergence characteristics (i.e., the EM iteration sequence converges to a local maximizer of the likelihood function) in mixture models, even when the starting values are distant from the maximizing values (the theory and proofs of this result, and a review of an exhaustive body of empirical evidence to support this result, can be found in Redner and Walker, 1984). On the other hand, the complex nature of the dependence of the log-likelihood function on the mixture-model parameters can produce Newton and quasi-Newton iteration sequences that get stalled at some point and never make further progress toward convergence, especially when the starting parameter values are distant from the maximizing values (this is because the Newton and quasi-Newton iterative schemes rely on approximations that are valid only in the neighborhood of the maximizing parameter values; see Redner and Walker, 1984 and Cramer, 1986, page 72).
Our own empirical experience during the maximization of the likelihood function of the endogenous segmentation model reinforces the theoretical analysis and empirical evidence discussed earlier. A natural starting point for the iterations in the endogenous segmentation model is to use the unsegmented mode choice model parameters for the mode choice model in each segment and set all segment membership model parameters to zero.6 We considered three algorithms for estimation: the EM method, the DFP (Davidon, Fletcher, Powell) algorithm (which is a widely used and effective quasi-Newton algorithm; see Greene, 1990, page 370), and a hybrid EM-DFP algorithm that starts the iteration with the EM algorithm but switches to the DFP algorithm later.

[Footnote 6: In practice, at least one mode choice parameter should be shifted slightly (up or down) in one of the segments to start the iterations; otherwise, there is no incentive for the segment membership parameters to be updated and "convergence" will be achieved immediately.]
In terms of stability, we found that the EM algorithm outperformed the DFP algorithm. In particular, the EM algorithm was predictable; it produced very similar and monotonically increasing (in the likelihood function) iteration sequences in repeated trials. The DFP algorithm, on the other hand, was quite unpredictable in its iteration sequence in repeated trials. Some iteration sequences appeared to proceed well, while others wandered aimlessly without any improvement to the log-likelihood function and without ever converging. The cause of such behavior appeared to be the line search; the DFP algorithm had some difficulty in early iterations in finding an appropriate step length and, therefore, entered a random direction-search mode. The random search sometimes resulted in a subsequent stable iteration sequence and sometimes did not proceed anywhere. We never obtained a convergent iteration sequence with the DFP algorithm for the endogenous segmentation model with four segments. We obtained a convergent sequence in three out of five trials for the two-segment solution and in two of five trials for the three-segment solution.7 We did not encounter any stability problems in the hybrid EM-DFP method since the DFP iterations were started with parameter values close to their maximizing values.
We now compare the computational speeds of the EM algorithm, the DFP algorithm, and the hybrid EM-DFP algorithm. We focus on the three-segment solution. For the DFP algorithm, we present the results for the fastest successful run. All estimations were done using the GAUSS matrix programming language on a 486-DX2, 66 MHz computer. The gradient of the log-likelihood function was coded for the DFP algorithm and the gradient and hessian were coded for the MNL estimation in the EM algorithm. The results are shown in Figure 1. Each point represents an iteration of the indicated algorithm. The vertical axis represents the log-likelihood function divided by the number of observations in the sample.
[Footnote 7: Both Kamakura and Russell (1989) and Gupta and Chintagunta (1994) do not cite any problems in their estimations. It is not clear whether they did not encounter any problems or whether they just did not report them. It should also be emphasized that whereas the study here and many earlier empirical studies with mixture models have documented difficulties with the usual maximization algorithms, this need not be a universal finding in all data contexts. Theoretical and empirical examination of the relationship between data contexts and algorithm performance would be a useful direction for future research.]

[Fig. 1. Performance of alternative algorithms for the three-segment solution. The panels plot the log-likelihood value (divided by the number of observations) against execution time in minutes for the EM algorithm, the best DFP algorithm trial, and the EM-DFP hybrid algorithm; the convergence value is -0.584609, and convergence times of 149.1 minutes and 63.3 minutes are reported on the panels.]
The EM algorithm has a stable S-shaped pattern. It takes a few small steps at the beginning before taking large steps toward increasing the log-likelihood function (this is in contrast to some earlier applications of the algorithm where the algorithm makes the largest leap in the first iteration and the rate of the leap decreases monotonically thereafter; see Ruud, 1991 and Watson and Engle, 1983). The secondary MNL iterations in each main EM iteration ranged from two to three, with the largest EM iteration leaps being associated with three MNL iterations. As documented in other applications of the EM algorithm (see Watson and Engle, 1983 and Redner and Walker, 1984), most of the change in the log-likelihood function is realized in the first few iterations and the algorithm is very slow in convergence rate near the likelihood maximum.
Each iteration of the DFP algorithm, in general, is less time-consuming than the EM iteration; however, because of entering into a random direction-search mode, some early iterations take a substantial amount of time. The iteration sequence for the DFP is not smooth; there are sequences with little improvement in the log-likelihood function. But the algorithm converges much more rapidly near the likelihood maximum relative to the EM method.
In the hybrid EM-DFP algorithm, we used a criterion of less than 0.01 difference in successive values of the average log-likelihood function of equation (15) to switch from the EM algorithm to the DFP algorithm (this criterion was applied only after a minimum of five EM iterations). The results clearly show the effectiveness of the hybrid algorithm, which combines the large and stable steps of the EM algorithm at the beginning of the maximization with the stable and relatively rapid convergence of the DFP algorithm near the likelihood maximum.
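The switching rule described in this paragraph can be sketched as follows; the callables em_step, average_loglik, and neg_loglik are assumed to be supplied by the user, and scipy's BFGS quasi-Newton routine stands in for the DFP update used in the paper:
\begin{verbatim}
from scipy.optimize import minimize

def hybrid_em_quasi_newton(params, em_step, average_loglik, neg_loglik,
                           tol=0.01, min_em_iters=5, max_em_iters=200):
    """Run EM until the average log-likelihood improves by less than `tol`
    (checked only after `min_em_iters` iterations), then switch to a
    quasi-Newton search started from the EM iterate."""
    prev = average_loglik(params)
    for it in range(1, max_em_iters + 1):
        params = em_step(params)
        curr = average_loglik(params)
        if it >= min_em_iters and curr - prev < tol:
            break                      # EM has slowed down near the maximum
        prev = curr
    return minimize(neg_loglik, params, method='BFGS').x
\end{verbatim}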
4. CHOICE ELASTICITIES AND POLICY IMPLICATIONS

THE RESULTS IN TABLE VII present the choice elasticities and the mode shares obtained from the endogenous segmentation model at the segment level and the market level. We present the relevant elasticities for changes in rail level-of-service characteristics only. The entries in the segment-level columns represent the contribution of the corresponding segment to the overall percentage change in market share of the modes due to a one percent increase in the train level-of-service variables. We present these values rather than the actual segment-level elasticities since the current entries provide a better basis to examine the differential effects (across segments) of policy actions to improve rail level-of-service (the segment-level elasticities provide information only on within-segment shifts; they are not useful for cross-segment comparisons since they do not account for the different segment sizes and because they are a function of segment-level mode choice probabilities).
The choice elasticities in the table reflect our earlier qualitative discussion of the level-of-service sensitivity differences among the segments. It is also useful to note that the market share sensitivity of car and air are about the same due to an increase in rail frequency or rail cost for segments two and three, but the sensitivity of the car mode is greater than air for segment one.

[Footnote 8: Since the objective of the original study for which the data were collected was to examine the effect of alternative improvements in rail level-of-service characteristics, we focus on the elasticity matrix corresponding to changes in rail level-of-service here.]
TABLE VII. Choice Elasticities (in Response to Change in Rail Service) and Mode Choice Shares from the Endogenous Segmentation Model
Rail level-of-service attribute   Segment 1 (Rail, Air, Car)   Segment 2 (Rail, Air, Car)   Segment 3 (Rail, Air, Car)   Market(a) (Rail, Air, Car)
Frequency                         0.15, -0.01, -0.04           0.20, -0.03, -0.04           0.06, -0.01, -0.01           0.41, -0.05, -0.10
Cost                              -0.57, 0.06, 0.14            -0.77, 0.14, 0.15            -0.14, 0.03, 0.03            -1.48, 0.22, 0.31
IVTT(b)                           -0.32, 0.01, 0.09            -0.03, 0.00, 0.00            -0.51, 0.07, 0.12            -0.86, 0.09, 0.21
OVTT(b)                           -0.35, 0.03, 0.09            -0.18, 0.03, 0.04            -0.67, 0.08, 0.16            -1.20, 0.14, 0.29
Mode choice shares(c)             0.07, 0.20, 0.73             0.80, 0.07, 0.13             0.06, 0.76, 0.18             0.15, 0.40, 0.44
(a) The overall market elasticity is the sum of the "elasticity" entries in the individual segments. The "elasticity" entries at the segment level represent the contribution of the corresponding segment to the overall percentage change in market share due to a 1% increase in the rail level-of-service attribute. The actual increase (decrease) in rail share at the market level is equal to the decrease (increase) in the air and car shares at the market level. These conditions do not hold exactly in the table due to rounding of the elasticity entries.
(b) IVTT represents In-Vehicle Travel Time; OVTT represents Out-of-Vehicle Travel Time.
(c) The overall market share of each mode in the joint segmentation/mode choice model is equal to the weighted average of the corresponding segment-level mode shares, where the weights are the segment sizes.
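The aggregation described in notes (a) and (c) can be checked numerically. In the sketch below, the first two segment sizes are the 48.7% and 12.2% quoted in section 3.2 (the third share is taken as the remainder), and the mode shares and the frequency-row contributions are the Table VII entries; the code itself is only an illustration of the aggregation rule.
\begin{verbatim}
import numpy as np

segment_sizes = np.array([0.487, 0.122, 0.391])        # R_s; third share inferred as remainder
segment_shares = np.array([[0.07, 0.20, 0.73],         # W_s(i): rail, air, car by segment
                           [0.80, 0.07, 0.13],
                           [0.06, 0.76, 0.18]])
print(segment_sizes @ segment_shares)                  # equation (7): approx. [0.15, 0.40, 0.44]

freq_contributions = np.array([[0.15, -0.01, -0.04],   # segment contributions, frequency row
                               [0.20, -0.03, -0.04],
                               [0.06, -0.01, -0.01]])
# market elasticities: [0.41, -0.05, -0.09]; Table VII reports -0.10 for car,
# and its note attributes such small gaps to rounding of the entries
print(freq_contributions.sum(axis=0))
\end{verbatim}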
The mode choice shares in individual segments in the segmented model show that the car share is highest for segment one, the rail share is highest for segment two, and the air share is highest for segment three. The market share predictions from the segmented model are not exactly equal to the actual market shares (as represented by the weighted estimation sample) if we consider all significant digits; they are, however, the same as the actual market shares up to two decimal places in this empirical analysis (if we use the a posteriori segment membership probabilities instead of the a priori segment membership probabilities in computing the mode shares, then it is relatively straightforward to show using equation (12) that the market share predictions from the segmented model will equal the actual market shares represented by the weighted estimation sample).
The choice elasticities of the segmented model provide valuable insights for the targeting of informational campaigns regarding rail service improvements and for positioning of rail improvements in the travel market. In particular, the results indicate that it is most efficient to target individuals who have low-income earnings, who are female, and who travel in a group (segment two) for informational campaigns on rail frequency improvements or rail cost decreases. Rail frequency improvements and rail cost reductions are best positioned in the medium-distance intercity travel corridors and in the weekend business travel market. The above strategies lead to the maximum increase in rail share in response to frequency improvements and cost reductions, but the increase in rail share is due to almost equal percentage decreases in car and air shares. If the primary objective is to decrease auto-congestion on highways, while at the same time ensuring a reasonable increase in rail share, then it may be more appropriate to target the male traveling population with medium incomes, and to position frequency improvements and cost reductions in the short-distance travel corridors and in the weekday travel market (segment one). Informational campaigns regarding rail travel time improvements are best directed toward the high-income travelers; such improvements are best positioned in the long-distance travel corridors and in the weekday travel market.
5. SUMMARY AND CONCLUSIONS

IN THIS PAPER, we propose the use of an endogenous market segmentation model in a travel demand context. In this model, membership in a segment is probabilistic and is a function of the sociodemographic and trip characteristics of each individual. The formulation enables the use of a number of segmentation variables without requiring the multi-way partition of data, and obviates the need to establish ad hoc threshold values of continuous segmentation variables. An examination of the individual-level cross-elasticities indicates that the endogenous segmentation model does not have the independence from irrelevant alternatives property of the standard multinomial logit formulation. We have also developed a stable, efficient estimation procedure for estimating the segmentation model. This procedure involves the use of an EM algorithm to obtain starting values for the subsequent direct maximization of the likelihood function of the segmentation model. Whereas we have developed the hybrid EM-DFP technique in the context of a multinomial logit formulation for mode choice, the same technique can be used for more complex mode choice formulations such as the nested logit or multinomial probit. A particularly valuable contribution of our estimation approach is that the EM algorithm (which requires only the standard multinomial logit software for estimation) can be used in isolation to obtain the parameters of the segmentation model and their asymptotic standard errors if access to general maximum-likelihood software is not available (this approach will, however, be quite time-consuming due to the slow nature of convergence of the EM algorithm in the vicinity of the likelihood maximum).
We applied the endogenous segmentation model to the estimation of intercity travel mode choice in the Toronto-Montreal corridor. Using the Bayesian Information Criterion, we found that the preferred specification used a three-segment solution. The probability of belonging to any segment was a function of income, sex, travel group size, day of travel, and trip distance. In a comparative empirical assessment of the endogenous segmentation model with other commonly used methods to account for systematic heterogeneity, we found the endogenous segmentation model to be the most appropriate, based on fit to the data and reasonability of the results. Our assessment of the performance of the hybrid EM-DFP algorithm vis-à-vis the DFP method also clearly indicated the superiority of the former over the latter.
The results of the preferred three-segment solution of the endogenous segmentation model showed that the intrinsic preferences for modes and level-of-service sensitivity were quite different among the three segment groups. These differences, in combination with the sociodemographic and trip-related characteristics of individuals in each segment, provide important information for effective targeting and strategic positioning of rail service improvements. More generally, the endogenous segmentation model represents a valuable methodology for evaluating the effects of congestion alleviation strategies in urban and intercity travel contexts.
REFERENCES

G. Allenby, "Hypothesis Testing with Scanner Data: The Advantage of Bayesian Methods," Journal of Marketing Research 27, 379-389 (1990).
M. Ben-Akiva and S. R. Lerman, Discrete Choice Analysis: Theory and Application to Travel Demand, The MIT Press, Cambridge, 1985.
A. Börsch-Supan, Econometric Analysis of Discrete Choice, Lecture Notes in Economics and Mathematical Systems, Springer-Verlag, Berlin, Germany, 1987.
G. Chamberlain, "Analysis of Covariance with Qualitative Data," Review of Economic Studies 47, 225-238 (1980).
J. S. Cramer, Econometric Applications of Maximum Likelihood Methods, Cambridge University Press, Cambridge, 1986.
M. C. Dayton and G. B. Macready, "Concomitant-Variable Latent Class Models," Journal of the American Statistical Association 83, 173-179 (1988).
A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society B39, 1-38 (1977).
Electric Power Research Institute, Methodology for Predicting the Demand for New Electricity-Using Goods, EA-593, Project 488-1, Final Report, 1977.
G. W. Fischer and D. Nagin, "Random Versus Fixed Coefficient Quantal Choice Models," in Structural Analysis of Discrete Data with Econometric Applications, pp. 273-304, Manski, C. F. and D. McFadden (eds.), MIT Press, Cambridge, Massachusetts, 1981.
C. V. Forinash, A Comparison of Model Structures for Intercity Travel Mode Choice, unpublished Master's Thesis, Department of Civil Engineering, Northwestern University, Evanston, Illinois, 1992.
F. Gonul and K. Srinivasan, "Modeling Multiple Sources of Heterogeneity in Multinomial Logit Models: Methodological and Managerial Issues," Marketing Science 12, 213-229 (1993).
W. H. Greene, Econometric Analysis, Macmillan Publishing Company, New York, 1990.
S. Gupta and P. K. Chintagunta, "On Using Demographic Variables to Determine Segment Membership in Logit Mixture Models," Journal of Marketing Research 31, 128-136 (1994).
J. A. Hausman and D. Wise, "A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences," Econometrica 46, 403-426 (1978).
D. A. Hensher, "Two Contentions Related to Conceptual Context in Behavioral Travel Modeling," in New Horizons in Travel Behavior Research, pp. 509-514, P. R. Stopher, A. H. Meyburg, and W. Brog (eds.), Lexington Books, Lexington, Massachusetts, 1981.
J. L. Horowitz, "Reconsidering the Multinomial Probit Model," Transportation Research 25B, 433-438 (1991).
J. L. Horowitz, "Semiparametric Estimation of a Work-Trip Mode Choice Model," Journal of Econometrics 58, 49-70 (1993).
W. A. Kamakura and G. J. Russell, "A Probabilistic Choice Model for Market Segmentation and Elasticity Structure," Journal of Marketing Research 26, 379-390 (1989).
KPMG Peat Marwick and F. S. Koppelman, Proposal for Analysis of the Market Demand for High Speed Rail in the Quebec/Ontario Corridor, submitted to Ontario/Quebec Rapid Train Task Force, 1990.
KPMG Peat Marwick in association with ICF Kaiser Engineers, Inc., Midwest System Sciences, Resource Systems Group, Comsis Corporation and Transportation Consulting Group, Florida High Speed and Intercity Rail Market and Ridership Study: Final Report, submitted to Florida Department of Transportation, 1993.
C. Manski and S. Lerman, "The Estimation of Choice Probabilities from Choice-Based Samples," Econometrica 45, 1977-1988 (1977).
D. McFadden, "Conditional Logit Analysis of Qualitative Choice Behavior," in Frontiers in Econometrics, Zarembka, P. (ed.), Academic Press, New York, 1973.
G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, Inc., New York, 1988.
R. A. Redner and H. F. Walker, "Mixture Densities, Maximum Likelihood, and the EM Algorithm," SIAM Review 26, 195-239 (1984).
P. A. Ruud, "Extensions of Estimation Methods Using the EM Algorithm," Journal of Econometrics 49, 305-341 (1991).
M. W. Watson and R. F. Engle, "Alternative Algorithms for the Estimation of Dynamic Factor, MIMIC and Varying Coefficient Regression Models," Journal of Econometrics 23, 385-400 (1983).

(Submitted: April 1995; revisions received: September 1995, February 1996; accepted: February 1996)