ECONOMETRICS LECTURE: HECKMAN'S SAMPLE SELECTION MODEL

Heckman J (1979) Sample selection bias as a specification error, Econometrica, 47, pp. 153-61.

Note: Heckman was awarded the Nobel prize in economics in 2000, in large part for this work. The model was developed within the context of a wage equation:

THE WAGE EQUATION

$W_i = \beta X_i + \varepsilon_i$   (1)

where $W_i$ is the wage, $X_i$ are observed variables relating to the i'th person's productivity and $\varepsilon_i$ is an error term. W is observed only for workers, i.e. only people in work receive a wage.

SAMPLE SELECTION (i.e. being in the labour force so W is observed)
There is a second equation relating to employment:

$E^*_i = Z_i \gamma + u_i$   (2)

$E^*_i = W_i - E'_i$ is the difference between the wage and the reservation wage $E'_i$. The reservation wage is the minimum wage at which the i'th individual is prepared to work. If the wage is below that, they choose not to work. We observe only an indicator variable for employment, defined as $E_i = 1$ if $E^*_i > 0$ and $E_i = 0$ otherwise.

ASSUMPTIONS
The Heckman model also uses the following assumptions:

$(\varepsilon, u) \sim N(0, 0, \sigma^2_\varepsilon, \sigma^2_u, \rho_{\varepsilon u})$   (3)

That is, both error terms are normally distributed with mean 0 and variances as indicated, and the error terms are correlated, where $\rho_{\varepsilon u}$ is the correlation coefficient.

$(\varepsilon, u)$ is independent of X and Z   (4)

The error terms are independent of both sets of explanatory variables.

$\mathrm{Var}(u) = \sigma^2_u = 1$   (5)

This is not so much an assumption as a simplification: it normalises the variance of the error term in what will be a probit regression.

THE SAMPLE SELECTION PROBLEM
Take the expected value of (1) conditional upon the individual working and the values of X:

$E(W_i | E_i = 1, X_i) = E(W_i | X_i, Z_i, u_i)$

(the right-hand side comes from (2)). Substituting (1), $W_i = \beta X_i + \varepsilon_i$:

$E(W_i | E_i = 1, X_i) = E(W_i | X_i, Z_i, u_i) = \beta X_i + E(\varepsilon_i | X_i, Z_i, u_i)$   (6)

This comes from recognising that the expected value of X given X is simply X, i.e. $E(X|X) = X$ (and from the assumption that $X_i$ is independent of the two error terms).

The final term in (6), $E(\varepsilon_i | X_i, Z_i, u_i)$, can be simplified by noting that selection into employment depends just on $Z_i$ and $u_i$, not upon $X_i$. Specifically:

$E(W_i | E_i = 1, X_i) = \beta X_i + E(\varepsilon_i | E_i = 1) = \beta X_i + E(\varepsilon_i | u_i > -Z_i \gamma)$   (7)

This is from equation (2): $E_i = 1$ iff $E^*_i > 0$, i.e. if $Z_i \gamma + u_i > 0$, i.e. if $u_i > -Z_i \gamma$.

The key problem is that in regressing wages on characteristics for those in employment we are not observing the equation for the population as a whole. Those in employment will tend to have higher wages than those not in the labour force would have (that is why they are not in the labour force).
Hence the results will tend to be biased (sample selection bias); for instance, we are likely to get biased results when estimating the returns to education.

For example, take two groups of people: (i) industrious; (ii) lazy. Industrious people get higher wages and have jobs; lazy people do not. In this simplified example we are, in effect, running the regression on the industrious part of the labour force. The returns to education will be estimated on them alone, not on the whole of the population (which includes the lazy people).

In terms of (7) the problem comes from $E(\varepsilon_i | u_i > -Z_i \gamma)$. The error term u is restricted to be above a certain value, i.e. it is bounded from below. Those individuals who do not satisfy this are excluded from the regression.
This becomes a problem because of the assumption in (3) that the error terms are correlated, with correlation coefficient $\rho_{\varepsilon u}$. Hence a lower bound on u means that $\varepsilon$ is restricted too. Recall:

$E(W_i | E_i = 1, X_i) = \beta X_i + E(\varepsilon_i | E_i = 1) = \beta X_i + E(\varepsilon_i | u_i > -Z_i \gamma)$   (7)

HECKMAN'S METHODOLOGY
Heckman's first insight in his 1979 Econometrica paper was that this can be approached as an omitted variables problem: $E(\varepsilon_i | u_i > -Z_i \gamma)$ is the 'omitted variable' in (7). An estimate of the omitted variable would solve this problem and hence solve the problem of sample selection bias.
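
The selection bias described above can be illustrated with a small simulation. The sketch below is not taken from Heckman's paper. It assumes, purely for illustration, a single regressor that also drives selection (Z = X), beta = 1, gamma = 1, a unit variance for both errors and a correlation of 0.8 between them. It uses only the Python standard library and compares the regression slope in the full population with the slope in the selected ('employed') subsample.

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple regression of ys on xs (intercept included)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, beta=1.0, gamma=1.0, rho=0.8, seed=0):
    """Return (full-sample slope, selected-sample slope).

    All parameter values are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    x_all, w_all, x_sel, w_sel = [], [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        u = rng.gauss(0, 1)  # selection error, Var(u) = 1 as in (5)
        # Wage error with unit variance and correlation rho with u.
        eps = rho * u + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        w = beta * x + eps  # wage equation (1)
        x_all.append(x)
        w_all.append(w)
        if gamma * x + u > 0:  # employed: E* > 0, as in (2) with Z = X
            x_sel.append(x)
            w_sel.append(w)
    return ols_slope(x_all, w_all), ols_slope(x_sel, w_sel)


slope_full, slope_selected = simulate()
print("slope, whole population:", round(slope_full, 3))
print("slope, employed only:   ", round(slope_selected, 3))
```

Under these assumptions the full-sample slope stays close to the true beta of 1, while the slope estimated on the employed subsample comes out noticeably lower, because the conditional mean of the wage error varies with X among those selected.
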
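Specifically, we can model the omitted variable by:

$E(\varepsilon_i | u_i > -Z_i \gamma) = \rho_{\varepsilon u} \sigma_\varepsilon \lambda_i(-Z_i \gamma) = \beta_\lambda \lambda_i(-Z_i \gamma)$   (8)

where $\lambda_i(-Z_i \gamma)$ is 'just' the inverse Mills ratio evaluated at the indicated value and $\beta_\lambda$ is an unknown parameter ($= \rho_{\varepsilon u} \sigma_\varepsilon$).

THE INVERSE MILLS RATIO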
Many analyses stop there. Let's see if we can go a little further and look at the inverse Mills ratio. Named after John P. Mills, it is the ratio of the probability density function to the complementary cumulative distribution function (one minus the cumulative distribution function) of a distribution. Use of the inverse Mills ratio is often motivated by the following property of the truncated normal distribution.
If x is a random variable distributed normally with mean $\mu$ and variance $\sigma^2$, then it is possible to show that

$E(x | x > \alpha) = \mu + \sigma \left[ \frac{\phi((\alpha - \mu)/\sigma)}{1 - \Phi((\alpha - \mu)/\sigma)} \right]$   (9)

where $\alpha$ is a constant, $\phi$ denotes the standard normal density function, and $\Phi$ denotes the standard normal cumulative distribution function. The term in square brackets is the inverse Mills ratio. Compare (9) with (8):

$E(\varepsilon_i | u_i > -Z_i \gamma) = \rho_{\varepsilon u} \sigma_\varepsilon \lambda_i(-Z_i \gamma) = \beta_\lambda \lambda_i(-Z_i \gamma)$   (8)

x equates to u; hence $\mu$, the mean of u (previously x), is 0. Also $\sigma^2$ is the variance of u (previously x) and by (5) has been standardised to equal 1. $\alpha$ equates to $-Z_i \gamma$. Hence:

$E(u_i | u_i > -Z_i \gamma) = \frac{\phi(-Z_i \gamma)}{1 - \Phi(-Z_i \gamma)}$   (10)
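
Equation (10) can be checked numerically. The following is a minimal sketch, assuming an arbitrary truncation point c = -0.5 standing in for $-Z_i \gamma$. The function names are illustrative and only the Python standard library is used. It compares the inverse Mills ratio with the simulated mean of a standard normal variable truncated from below at c.

```python
import math
import random


def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills(c):
    """phi(c) / (1 - Phi(c)), i.e. E(u | u > c) for u ~ N(0, 1)."""
    return norm_pdf(c) / (1.0 - norm_cdf(c))


def simulated_truncated_mean(c, n=200000, seed=1):
    """Average of the standard normal draws that exceed c."""
    rng = random.Random(seed)
    kept = [u for u in (rng.gauss(0, 1) for _ in range(n)) if u > c]
    return sum(kept) / len(kept)


c = -0.5  # hypothetical value of -Z_i * gamma
analytic = inverse_mills(c)
simulated = simulated_truncated_mean(c)
print("inverse Mills ratio:     ", round(analytic, 4))
print("simulated E(u | u > c):  ", round(simulated, 4))
```

The two numbers should agree up to simulation noise, which is what (10) asserts for this choice of c.
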
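
However, we want $E(\varepsilon_i | u_i > -Z_i \gamma)$, not $E(u_i | u_i > -Z_i \gamma)$. Now $\rho_{\varepsilon u} = \sigma_{\varepsilon u} / (\sigma_\varepsilon \sigma_u)$; hence $\rho_{\varepsilon u} \sigma_\varepsilon \sigma_u = \sigma_{\varepsilon u}$; $\sigma_u = 1$ by the normalisation (5); hence $\rho_{\varepsilon u} \sigma_\varepsilon = \sigma_{\varepsilon u}$. We have found the expected value of $u_i$; to find the expected value of $\varepsilon_i$ we must multiply by this covariance, i.e. by $\rho_{\varepsilon u} \sigma_\varepsilon$. (Under joint normality with $\sigma_u = 1$, $E(\varepsilon_i | u_i) = \sigma_{\varepsilon u} u_i$, so conditioning on $u_i > -Z_i \gamma$ simply scales (10) by $\sigma_{\varepsilon u}$.) $\rho_{\varepsilon u}$ is the correlation between the two errors and thus, in relative terms, translates the impact of a specific value of the error term u onto $\varepsilon$; $\sigma_\varepsilon$ is then a scale factor. This gives us

$E(\varepsilon_i | u_i > -Z_i \gamma) = \rho_{\varepsilon u} \sigma_\varepsilon \cdot \frac{\phi(-Z_i \gamma)}{1 - \Phi(-Z_i \gamma)}$   (11)

Compare with:

$E(\varepsilon_i | u_i > -Z_i \gamma) = \rho_{\varepsilon u} \sigma_\varepsilon \lambda_i(-Z_i \gamma) = \beta_\lambda \lambda_i(-Z_i \gamma)$   (8)

The two are the same, where $\lambda_i(-Z_i \gamma) = \frac{\phi(-Z_i \gamma)}{1 - \Phi(-Z_i \gamma)}$.

USE IN STATA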
What follows below is a special application of Heckman's sample selection model, in which the second-stage equation is also a probit. To use the standard Heckman model, where the second-stage estimation involves a continuous variable, the following type of command should be used:

    heckman wage educ age, select(married children educ age)

i.e. heckman rather than heckprob, which we now use:

STATA COMMAND
    heckprob intbankr lgnipc male age agesq rlaw estonia village town unemp selfemp if missy==1, select(marrd educ2 lgnipc age agesq village town unemp manual fphoneacd)

intbankr lgnipc male age agesq rlaw estonia village town unemp selfemp: specification of variables in the internet banking equation (lgnipc = log GNI per capita; educ2 = education; marrd = married; agesq = age squared; unemp = unemployed).

select(marrd educ2 lgnipc age agesq village town unemp manual fphoneacd): specification of variables in the sample selection equation (fphoneacd = quality
of fixed phone access).

    Probit model with sample selection              Number of obs      =     23446
                                                    Censored obs       =     14706
                                                    Uncensored obs     =      8740

                                                    Wald chi2(10)      =   1066.68
    Log pseudolikelihood = -16461.32                Prob > chi2        =    0.0000

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    intbankr     |
          lgnipc |  -.1043315   .0599919    -1.74   0.082    -.2219134    .0132505
            male |   .1230764   .0270944     4.54   0.000     .0699723    .1761805
             age |   .0364993   .0059936     6.09   0.000     .0247522    .0482465
           agesq |  -.0332365   .0072216    -4.60   0.000    -.0473905   -.0190825
            rlaw |   .4961302   .0242105    20.49   0.000     .4486785    .5435819
         estonia |   1.621941   .0761046    21.31   0.000     1.472779    1.771103
         village |   .0422248   .0356796     1.18   0.237     -.027706    .1121556
            town |   .0603227   .0332633     1.81   0.070    -.0048722    .1255175
           unemp |  -.0036408   .0693268    -0.05   0.958    -.1395189    .1322372
         selfemp |   .2013792   .0462062     4.36   0.000     .1108166    .2919418
           _cons |  -3.207285   .2232697   -14.37   0.000    -3.644886   -2.769685
    -------------+----------------------------------------------------------------
    select       |
           marrd |   .1168095   .0209772     5.57   0.000     .0756949    .1579241
           educ2 |    .678366   .0148053    45.82   0.000     .6493482    .7073838
          lgnipc |   .6928837   .0251465    27.55   0.000     .6435975    .7421699
             age |   .0294313    .003864     7.62   0.000      .021858    .0370047
           agesq |  -.0661635   .0041628   -15.89   0.000    -.0743223   -.0580046
         village |  -.2005996    .024718    -8.12   0.000     -.249046   -.1521532
            town |  -.0914685   .0243485    -3.76   0.000    -.1391906   -.0437464
           unemp |  -.6330489   .0393924   -16.07   0.000    -.7102567   -.5558412
          manual |  -.3387754   .0240658   -14.08   0.000    -.3859435   -.2916074
       fphoneacd |  -.3426305   .0343699    -9.97   0.000    -.4099943   -.2752668
           _cons |  -4.257136   .1210887   -35.16   0.000    -4.494465   -4.019806
    -------------+----------------------------------------------------------------
         /athrho |  -.4907283   .0492128    -9.97   0.000    -.5871836    -.394273
    -------------+----------------------------------------------------------------
             rho |  -.4547943   .0390337
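
The /athrho and rho rows of the output are two views of the same quantity: Stata estimates the inverse hyperbolic tangent of rho, the correlation between the two error terms, and obtains rho by transforming back with tanh. A short check of that relationship, using only the point estimate of /athrho reported above (the variable names are illustrative):

```python
import math

athrho = -0.4907283      # /athrho point estimate from the output above
rho = math.tanh(athrho)  # back-transform to the correlation scale

print("rho implied by /athrho:", round(rho, 7))

# The inverse transformation recovers /athrho from rho.
athrho_check = 0.5 * math.log((1 + rho) / (1 - rho))
print("atanh(rho):            ", round(athrho_check, 7))
```

The implied rho should match the rho row of the output to the printed precision. Its negative sign indicates that the errors of the selection and internet banking equations are negatively correlated in this sample.
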