Moore–Penrose Estimators of Age–Period–Cohort Effects: Their Interrelationship and Properties

The intrinsic estimator (IE) has become a widely used tool for the analysis of age–period–cohort (APC) data in sociology, demography, and other fields. However, it has been recently recognized that the IE is a subtype of a larger class of estimators based on the Moore–Penrose generalized inverse (MP estimators) and that different estimators can lead to radically divergent estimates of the true, unknown APC effects. To clarify the differences and similarities of MP estimators, we introduce a canonical form of the linear constraints imposed on the true temporal effects. Using this canonical form, we compare the IE to related MP estimators, examining the conditions under which they recover the true temporal effects, the impact of the size and sign of nonlinearities on the estimated linear effects, and their sensitivity to the number of age, period, and cohort categories. We show that two MP estimators, which we call the difference estimator (DE) and the orthogonal estimator (OE), impose constraints that are both less sensitive and easier to interpret than those of the IE. We conclude with practical guidelines for researchers interested in using MP estimators to estimate temporal effects.

The intrinsic estimator (IE) has become a popular technique for estimating age-period-cohort (APC) effects¹ across a wide range of fields. In recent years, researchers have used the IE to examine temporal trends in pornography use (Price et al. 2016), behavioral problems in adolescents (Keyes et al. 2017), heart disease mortality (Kramer, Valderrama, and Casper 2015), breast cancer mortality (Li, Yu, and Wang 2015), obesity in China (Fu and Land 2015), and social trust (Hu 2015), among other topics. Although popular, the IE is just one of a broader class of estimators based on the Moore-Penrose generalized inverse (MP estimators hereafter).² As recently recognized in the methodological literature (Land et al. 2016), MP estimators are defined by applying the Moore-Penrose generalized inverse to different design matrices, and the IE is that particular MP estimator that uses a design matrix of sum-to-zero effect (or deviation) coding.³
The fact that the IE is a subtype of a larger class of MP estimators raises a crucial question, especially in light of recent criticisms leveled against the IE (Luo 2013; O'Brien 2011, 2015): which MP estimator, if any, should researchers use when analyzing temporal trends? The desirable statistical properties typically used to justify the IE are of little guidance because these are shared by all MP estimators. That is, all MP estimators produce results that are estimable,⁴ are unbiased given the linear constraint imposed by the estimator, and have minimum variance among estimators based on the same design matrix (Fu 2016; O'Brien 2015; Yang and Land 2013a). However, MP estimators are unalike in that they impose different linear constraints on the true, unknown APC effects, thereby generating divergent sets of parameter estimates, as shown by two recent articles (Luo et al. 2016; Pelzer et al. 2014). Unfortunately, the form of the linear constraints also differs across MP estimators, making it exceedingly difficult, if not impossible, for researchers to compare the specific assumptions of these estimators, let alone ascertain whether or not such assumptions are reasonable in any given application. This is especially problematic because the typical goal of APC analysis is to recover the true, unknown temporal effects (Luo 2013: 1951-52).
To resolve this problem, we introduce a canonical form of the linear constraints imposed by the IE and related MP estimators on the true, unknown APC effects. Using this canonical form of the constraints, we compare various MP estimators, evaluating the conditions under which they recover the true temporal effects, the impact of the magnitude and direction of nonlinearities on their estimated linear effects, and their sensitivity to the number of age, period, and cohort categories. Importantly, we show that two MP estimators, which we call the orthogonal estimator (OE) and the difference estimator (DE), impose linear constraints that are both less sensitive and easier to interpret than those of the IE. However, we emphasize that there is, to our knowledge, no social, biological, or cultural theory that claims that the true, unknown linear effects should conform to the particular constraints imposed by MP estimators. In deriving the canonical constraints of MP estimators, we further prove that, mathematically, the APC identification problem is always restricted to the linear effects and that the solution line, which is a geometric representation of the identification problem, is always reducible to just three dimensions defined by the set of possible slopes. Accordingly, the IE and related MP estimators will, in general, produce the same set of nonlinear effects but different linear effects.
The rest of the article is organized as follows. First, we briefly review the APC identification problem and the Moore-Penrose generalized inverse. Second, we define the IE and discuss how all MP estimators are minimum-norm,⁵ least-squares estimators. Third, we discuss the shared as well as divergent properties of MP estimators. Fourth, we illustrate how any APC design matrix can be separated into the linear and nonlinear components using a transformation matrix. We then prove that the APC identification problem is always restricted to the linear effects and that the solution line can always be simplified to three dimensions. Fifth, we introduce a canonical form of the linear constraints of the IE and related MP estimators, demonstrating that these estimators impose a linear constraint on the slopes but not the nonlinearities. Next, using the canonical form of the constraints, we compare the IE and related MP estimators mathematically, showing explicitly how the constraints can differ based on the size and direction of nonlinear effects, the choice of reference category, and the number of age, period, and cohort groups. Then, using simulated data, we evaluate the efficacy of several MP estimators, including the IE, in recovering the true temporal effects. Finally, we conclude with practical guidelines for applied researchers wishing to use the IE and related MP estimators to analyze APC data.

Generalized Inverses and the APC Identification Problem
To clarify the discussion that follows, suppose we have categorically coded age, period, and cohort data for a set of $n$ respondents.⁶ We let $i = 1, \dots, I$ denote the unique age groups, $j = 1, \dots, J$ the unique period groups, and $k = 1, \dots, K$ the unique cohort groups, with $k = j - i + I$ and $K = I + J - 1$.⁷ The model we would like to run is
$$Y_{ijk} = \mu + \alpha_i + \pi_j + \gamma_k + \epsilon_{ijk}, \quad (1)$$
which we refer to as the classical APC (C-APC) model, also known as the multiple classification or accounting model. For simplicity, we can express the C-APC in matrix terms as
$$y = Xb + \epsilon, \quad (2)$$
where $y$ is an $n \times 1$ outcome vector, which, without loss of generality, we assume to be continuous; $X$ is a design matrix of categorical age, period, and cohort variables with dimension $n \times p$; $b$ is a $p \times 1$ parameter vector with elements corresponding to the age, period, and cohort groups; and $\epsilon$ is an $n \times 1$ vector of random errors.⁸ If there were no linear dependence in $X$, then we could obtain a unique least-squares solution
$$b_{OLS} = (X^T X)^{-1} X^T y, \quad (3)$$
where the superscripted $-1$ indicates the regular inverse. However, due to linear dependence, $X$ is rank deficient one, and a regular inverse of $X^T X$ does not exist. Thus, we cannot estimate $b_{OLS}$, and any particular least-squares solution requires an additional constraint.
Using what is known as a generalized inverse, it is possible to produce constrained estimates of the parameters $b$ that are consistent with the data (e.g., see O'Brien 2015: 27-29). Unfortunately, there are an infinite number of generalized inverses, and in general, different inverses will produce different sets of estimates. Each of the constrained estimates for a particular design matrix lies on what is called the solution line of estimates in multidimensional space (O'Brien 2015: 27-28). The solution line is a geometric representation of the identification problem, reflecting the lack of a unique set of estimates. To construct the solution line, we let $b^*$ denote any specific constrained least-squares solution to the least-squares normal equations. For any particular constraint, we can construct a generalized inverse of $X^T X$ to find a corresponding solution
$$b^* = (X^T X)^* X^T y, \quad (4)$$
where the superscript $*$ denotes the appropriate generalized inverse. The vector $b^*$ is a least-squares solution to the normal equations such that $X^T X b^* = X^T y$. We can then write⁹
$$b = b^* + sv, \quad (5)$$
where $s$ is an arbitrary scalar that can take on any real number and $v$ is the eigenvector with the zero eigenvalue, or null vector, of a given design matrix. The null vector gives the form of the linear dependency for a particular design matrix and is unique up to a scalar (O'Brien 2015: 30-32, 56-57). By varying $s$, we trace out the solution line for a given design matrix, resulting in an infinite number of constrained least-squares solutions.
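The rank deficiency and the solution line can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes one observation per age-by-period cell with $I = J = 3$ and sum-to-zero effect coding with the last category omitted, and the helper names `effect` and `apc_design` are hypothetical.

```python
import numpy as np

def effect(m):
    """Sum-to-zero effect contrasts for m levels, last level omitted."""
    return np.vstack([np.eye(m - 1), -np.ones(m - 1)])

def apc_design(I, J, coding=effect):
    """One row per age-by-period cell: intercept, age, period, cohort columns."""
    i, j = np.divmod(np.arange(I * J), J)   # zero-based age and period indices
    k = j - i + I - 1                       # zero-based version of k = j - i + I
    K = I + J - 1
    return np.hstack([np.ones((I * J, 1)), coding(I)[i], coding(J)[j], coding(K)[k]])

X = apc_design(3, 3)
n, p = X.shape                               # 9 cells, p = 1 + 2 + 2 + 4 = 9
rank = np.linalg.matrix_rank(X, tol=1e-10)   # one less than p

# Null vector: right singular vector belonging to the zero singular value
v = np.linalg.svd(X)[2][-1]

rng = np.random.default_rng(0)
y = rng.normal(size=n)                       # arbitrary outcome, for illustration only

b_star = np.linalg.pinv(X, rcond=1e-10) @ y  # one least-squares solution
b_alt = b_star + 2.5 * v                     # another point on the solution line

same_fit = np.allclose(X @ b_star, X @ b_alt)
normal_eq_ok = np.allclose(X.T @ X @ b_alt, X.T @ y)
```

Any value of the scalar multiplying `v` yields the same fitted values, which is why the data alone cannot select a point on the line.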
Among the infinite number of generalized inverses for a particular design matrix, what is known as the Moore-Penrose generalized inverse has received significant attention due to its unique properties (Ben-Israel 2002). Using the Moore-Penrose generalized inverse gives us, parallel to the traditional ordinary least-squares (OLS) formula,
$$b^+ = (X^T X)^+ X^T y, \quad (6)$$
where the superscript $+$ denotes the Moore-Penrose generalized inverse and, again, $X$ is a design matrix of categorical age, period, and cohort variables. Equation 6 underscores that the MP estimator is based on $(X^T X)^+$ or, equivalently, $X^+$. Formally, $X^+$ is defined as a generalized inverse meeting four Moore-Penrose conditions:
1. General condition: $X X^+ X = X$
2. Reflexive condition: $X^+ X X^+ = X^+$
3. Normalizing condition: $(X X^+)^T = X X^+$
4. Reverse normalizing condition: $(X^+ X)^T = X^+ X$
These conditions specify that for any particular matrix $X$, the Moore-Penrose generalized inverse always exists (i.e., it is well defined) and is unique (i.e., there is only one such generalized inverse that meets these conditions). From the four conditions above, we can express the solution in terms of the normal equations, and it can be shown that if $X$ is of full rank, then $X^+ y = (X^T X)^{-1} X^T y$.¹⁰ In the following section, we discuss how to interpret the MP solution in the case of APC data, which is rank deficient one.
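The four conditions can be verified directly with NumPy's `numpy.linalg.pinv`. This is an illustrative sketch on a small random matrix with one dependent column, used here as a stand-in rather than an APC design matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(6, 3))                   # full column rank
X = np.column_stack([F, F[:, 0] - F[:, 1]])   # fourth column is linearly dependent

Xp = np.linalg.pinv(X, rcond=1e-10)           # Moore-Penrose generalized inverse

conditions = {
    "general": np.allclose(X @ Xp @ X, X),
    "reflexive": np.allclose(Xp @ X @ Xp, Xp),
    "normalizing": np.allclose((X @ Xp).T, X @ Xp),
    "reverse normalizing": np.allclose((Xp @ X).T, Xp @ X),
}

# With full column rank, the generalized inverse reduces to the OLS formula
full_rank_match = np.allclose(np.linalg.pinv(F), np.linalg.inv(F.T @ F) @ F.T)
```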

Deriving and Defining MP Estimators
As mentioned previously, the application of the Moore-Penrose generalized inverse to different design matrices or coding schemes defines a broad class of constrained estimators, which we term MP estimators. As long as the design matrix represents the full range of possible APC categories, various coding schemes may be used, such as treatment coding with the first category omitted or sum-to-zero effect coding with the last category omitted.¹¹ The IE is defined as that particular MP estimator based on a design matrix of sum-to-zero effect (or deviation) contrasts (Land et al. 2016: 964). However, it is important to clarify that the phrase "intrinsic estimator" has been used in at least two different ways by the proponents of the IE (e.g., Fu 2000, 2008, 2016; Land et al. 2016; Yang and Land 2013a: 79). In particular, in his sole-authored work, Fu defines the IE as a general class of estimators derived from the Moore-Penrose generalized inverse without reference to the specific structure of the design matrix (Fu 2000; Fu 2008: 332-3; Fu 2016). In contrast, Land and colleagues (Land et al. 2016: 964-71) state explicitly that "only the sum-to-zero [effect] coding is used to define and estimate the IE." Furthermore, they assert that because there are any number of design matrices that can produce different sets of estimates, there are "infinitely many possible pseudo-IE estimators." To eliminate this ambiguity, we will refer to Fu's broad class of estimators simply as MP estimators and the narrower set of MP estimators with sum-to-zero effect coding as the IE, as preferred by Land and colleagues. We now turn to how to derive and define MP estimators formally.

Deriving MP Estimators
There are many ways to derive the estimates of an MP estimator (e.g., see Yang and Land 2013a: 79-80). An especially intuitive approach, which has not received significant attention in the APC literature, is to find the Moore-Penrose generalized inverse of $X^T X$ using a decomposition technique. For a given design matrix $X$, we can write
$$X^T X = V \Lambda V^T, \quad (7)$$
where $V \Lambda V^T$ is the spectral (or eigenvalue) decomposition of $X^T X$. The $p \times p$ diagonal matrix $\Lambda$ consists of the eigenvalues of $X^T X$ in descending order, that is, the nonzero eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_r$ followed by a single zero, where $r = p - 1$ is the rank of the matrix. It is straightforward to find the Moore-Penrose generalized inverse of $X^T X$ using this decomposition. First, we take the reciprocal of the nonzero eigenvalues along the diagonal, keeping the zero eigenvalue. This will give the generalized inverse of $\Lambda$, denoted as $\Lambda^+$. Second, we calculate $V \Lambda^+ V^T = (X^T X)^+$. Finally, we use $(X^T X)^+$ to find the Moore-Penrose generalized inverse estimates $b^+ = (X^T X)^+ X^T y$.
In addition to allowing for the estimation of MP estimators, the decomposition of $X^T X$ reveals two crucial features of the data. First, the number of nonzero eigenvalues in $\Lambda$ gives the rank of the matrix $X^T X$. Because the columns of the data are linearly dependent, there will always be one zero eigenvalue, and thus the design matrix for any APC data set is always rank deficient one. Second, along with the zero eigenvalue in $\Lambda$, there will always be a corresponding eigenvector in $V$, which is an orthonormal basis for the null space of $X^T X$. This is the null vector, or the eigenvector with a zero eigenvalue, which encodes the linear dependency in the data.¹² In the case of APC data, different design matrices have different forms of linear dependencies and thus different null vectors. As we discuss later, converting the null vector of an MP estimator into a canonical form is crucial for understanding and comparing the assumptions it requires about the true, unknown temporal effects.
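The decomposition route to the MP estimates can be sketched in a few lines. The example below is illustrative only: it assumes a one-observation-per-cell design with $I = J = 3$ and effect coding, the helper names are hypothetical, and the numerical tolerance for treating an eigenvalue as zero is an arbitrary choice. Note that `numpy.linalg.eigh` returns eigenvalues in ascending rather than descending order, which does not affect the result.

```python
import numpy as np

def effect(m):
    """Sum-to-zero effect contrasts for m levels, last level omitted."""
    return np.vstack([np.eye(m - 1), -np.ones(m - 1)])

def apc_design(I, J, coding=effect):
    """One row per age-by-period cell: intercept, age, period, cohort columns."""
    i, j = np.divmod(np.arange(I * J), J)
    k = j - i + I - 1                       # zero-based version of k = j - i + I
    K = I + J - 1
    return np.hstack([np.ones((I * J, 1)), coding(I)[i], coding(J)[j], coding(K)[k]])

X = apc_design(3, 3)
rng = np.random.default_rng(2)
y = rng.normal(size=X.shape[0])             # arbitrary outcome, for illustration only

XtX = X.T @ X
eigvals, V = np.linalg.eigh(XtX)            # spectral decomposition of X'X

# Step 1: reciprocal of the nonzero eigenvalues, keeping the zero eigenvalue
nonzero = eigvals > 1e-10 * eigvals.max()
lam_plus = np.zeros_like(eigvals)
lam_plus[nonzero] = 1.0 / eigvals[nonzero]

# Step 2: assemble the generalized inverse of X'X
XtX_plus = V @ np.diag(lam_plus) @ V.T

# Step 3: MP estimates
b_plus = XtX_plus @ X.T @ y

rank = int(nonzero.sum())                   # number of nonzero eigenvalues
v = V[:, np.argmin(eigvals)]                # null vector: eigenvector with zero eigenvalue
```

The estimates agree with those obtained by applying `numpy.linalg.pinv` to `X` directly.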

Defining MP Estimators
An MP estimator can be defined formally as the solution orthogonal (or perpendicular) to the null vector of a particular design matrix (O'Brien 2015: 30-32). To understand this definition, note that like any constrained estimator, we can use an MP estimator to construct a solution line for a given design matrix such that $b = b_{MP} + sv$. Among the values of $s$, an MP estimator assumes $s = 0$ in the equation $b = b_{MP} + sv$ (Fu 2016: 183). Geometrically, an MP estimator is the least-squares solution corresponding to the point on the line closest to the origin in terms of Euclidean distance. This point coincides with the minimum (Euclidean) length of $b_{MP} + sv$, which is attained when $s = 0$.¹³ Equivalently, the vector $b_{MP}$ is the projection of $b$ on the nonnull space of $X$, which is orthogonal to the null space (Yang and Land 2013a):
$$b_{MP} = (I - vv^T)b,$$
where $I$ is a $p \times p$ identity matrix and $v$ is the normalized null vector so that $v^T v = 1$.¹⁴ That is, an MP estimator separates the true, unknown parameter $b$ into two orthogonal components, the null vector $v$ and the vector of MP estimates $b_{MP}$. Because these two vectors are perpendicular to each other, $v^T b_{MP} = 0$ or, equivalently, $s = 0$ in the solution line defined by $b_{MP} + sv$. Because an MP estimator finds that particular set of least-squares estimates on the solution line for which the (Euclidean) length is minimized, it is a minimum-norm, least-squares (MNLS) solution (Ben-Israel 2002: 109). Although it has not been explicitly stated as such in the APC literature, any MP estimator is fundamentally a two-stage minimization algorithm:
$$\text{Stage 1: } \min_b \lVert y - Xb \rVert_2, \qquad \text{Stage 2: } \min_b \lVert b \rVert_2 \text{ among the Stage 1 solutions},$$
where the operator $\lVert \cdot \rVert_2$ denotes the $L_2$ (or Euclidean) norm and $X$ is a given design matrix of age, period, and cohort variables. In the first stage, the MP estimator finds the least-squares solution to $Xb = y$. For full rank data, there is a unique solution, and the algorithm ends. However, in the case of APC data, the design matrix $X$ is rank deficient one, and thus there is no unique least-squares solution; rather, as described previously, there are many such solutions lying on a line in multidimensional space. In the second stage, the MP estimator selects a solution by applying the minimum-norm constraint. Among the least-squares estimates on the solution line, the MP estimator selects that particular set of estimates with the minimum (Euclidean) length or, equivalently, that is closest to the origin in terms of Euclidean distance.¹⁵
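The two-stage logic can be illustrated numerically. The sketch below is not the authors' implementation: it assumes an effect-coded design with $I = J = 3$ and one observation per cell, hypothetical helper names, and an arbitrary outcome vector. It starts from an arbitrary least-squares solution, removes its component along the null vector, and checks that the result is the shortest point on the solution line.

```python
import numpy as np

def effect(m):
    """Sum-to-zero effect contrasts for m levels, last level omitted."""
    return np.vstack([np.eye(m - 1), -np.ones(m - 1)])

def apc_design(I, J, coding=effect):
    """One row per age-by-period cell: intercept, age, period, cohort columns."""
    i, j = np.divmod(np.arange(I * J), J)
    k = j - i + I - 1                       # zero-based version of k = j - i + I
    K = I + J - 1
    return np.hstack([np.ones((I * J, 1)), coding(I)[i], coding(J)[j], coding(K)[k]])

X = apc_design(3, 3)
n, p = X.shape
v = np.linalg.svd(X)[2][-1]                 # unit-length null vector
rng = np.random.default_rng(7)
y = rng.normal(size=n)                      # arbitrary outcome, for illustration only

b_mp = np.linalg.pinv(X, rcond=1e-10) @ y   # MP estimates

# Stage 1: some least-squares solution on the solution line (here s = 1.3)
b_any = b_mp + 1.3 * v
# Stage 2: project out the null-vector component, (I - vv')b
b_proj = (np.eye(p) - np.outer(v, v)) @ b_any

# Euclidean length along the line is smallest at s = 0
s_grid = np.linspace(-2.0, 2.0, 41)
lengths = np.array([np.linalg.norm(b_mp + s * v) for s in s_grid])
s_shortest = s_grid[lengths.argmin()]

# NumPy's least-squares routine also returns the minimum-norm solution
b_lstsq = np.linalg.lstsq(X, y, rcond=1e-10)[0]
```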

Properties of MP Estimators
We now turn to an examination of the shared as well as divergent properties of MP estimators, including the IE. We focus here on those shared properties that are considered desirable in the methodological literature (Fu 2000; Fu 2016; Fu and Hall 2006; Fu, Land, and Yang 2011; Yang et al. 2008; Yang, Fu, and Land 2004): first, an MP estimator has minimum sampling variance among all possible estimators based on its specific design matrix; second, it is an estimable function, meaning that it produces a unique set of estimates for the effects of age, period, and cohort; finally, it is unbiased, meaning that the average of any estimates produced by an MP estimator over an infinite number of simple random samples will be equal to that estimator's values when it is applied to the full population data. However, MP estimators diverge in a critically important way: because they are based on different design matrices, they will have different null vectors and in turn impose different linear constraints on the true, unknown temporal effects (Luo et al. 2016; Pelzer et al. 2014).

Shared Desirable Statistical Properties
There are several desirable statistical properties shared by all MP estimators. First, for any given design matrix, an MP estimator will be that estimator with the minimum sampling variance (Yang et al. 2004: 102-3, 108; Yang et al. 2008: 1709; Yang and Land 2013a: 86, 116-7). This is a function of the fact that any MP estimator will give that set of estimates that is shortest in Euclidean norm. Consequently, if we were to use another type of generalized inverse other than the Moore-Penrose for some fixed design matrix, the resulting estimates would have greater sampling variance.
Second, an MP estimator is always an estimable function (Fu et al. 2011: 456-8; Yang and Land 2013a: 84-85). To state that a function is estimable means that when applied to data, it produces a unique set of estimates. Intuitively, it means that it is possible for the data to tell us what the function equals. For example, if we have variables that are linearly dependent, then the standard OLS function is not estimable because it is based on a regular inverse that is not well defined (i.e., it does not exist) due to the linear dependence in the design matrix. In contrast, MP estimators are estimable because the Moore-Penrose generalized inverse is well defined (i.e., it exists) even when the variables are linearly dependent. Specifically, an MP estimator applies a particular mathematical constraint on the estimates that will only be satisfied by a single point on the solution line. As stated previously, the estimates will be that set closest to the origin under the Euclidean distance metric (or, equivalently, the set with the minimum $L_2$ norm). The MP estimates are estimable in the very specific sense that for a particular design matrix and outcome, there is a single set of estimates on the solution line closest in Euclidean distance to the origin.¹⁶
Finally, all MP estimators are unbiased, meaning that the average of any estimates produced by an MP estimator over an infinite number of simple random samples will equal that estimator's values when it is applied to the full population data (Yang et al. 2004: 101-2, 107; Yang et al. 2008: 1709; Yang and Land 2013a: 86, 115-6). A function is unbiased if, when it is calculated for an infinite number of random samples, its average is equal to its value when it is calculated on the population as a whole.¹⁷ In the context of any specific MP estimator, this means that if we had an infinite number of samples and applied this estimator in each sample, then the average of the estimates across these samples would equal the estimates produced by using this same estimator on the whole population.
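This sense of unbiasedness can be illustrated with a small simulation. The sketch below rests on several assumptions that are not from the text: an effect-coded design with $I = J = 3$, normally distributed errors with standard deviation 0.05, 2,000 replications, and a hypothetical set of true effects constructed so that $v^T b = 2$. Under these assumptions the average MP estimate settles near the projection $(I - vv^T)b$, which is the estimator's own population value, rather than near the true effects $b$.

```python
import numpy as np

def effect(m):
    """Sum-to-zero effect contrasts for m levels, last level omitted."""
    return np.vstack([np.eye(m - 1), -np.ones(m - 1)])

def apc_design(I, J, coding=effect):
    """One row per age-by-period cell: intercept, age, period, cohort columns."""
    i, j = np.divmod(np.arange(I * J), J)
    k = j - i + I - 1                       # zero-based version of k = j - i + I
    K = I + J - 1
    return np.hstack([np.ones((I * J, 1)), coding(I)[i], coding(J)[j], coding(K)[k]])

X = apc_design(3, 3)
n, p = X.shape
v = np.linalg.svd(X)[2][-1]                 # unit-length null vector
Xp = np.linalg.pinv(X, rcond=1e-10)

rng = np.random.default_rng(42)
b_raw = rng.normal(size=p)
b_true = b_raw - (v @ b_raw) * v + 2.0 * v  # hypothetical truth with v'b = 2
target = b_true - (v @ b_true) * v          # (I - vv')b, the estimator's population value

R = 2000                                    # number of simulated samples
Y = X @ b_true + rng.normal(scale=0.05, size=(R, n))
B = Y @ Xp.T                                # row r holds the MP estimates for sample r
mean_est = B.mean(axis=0)

gap_to_target = np.abs(mean_est - target).max()   # small
gap_to_truth = np.linalg.norm(mean_est - b_true)  # close to |v'b| = 2
```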

The Linear Constraint on the True Temporal Effects
The foregoing underscores that the desirable statistical properties typically used to justify the IE are not unique to it but shared by all MP estimators.¹⁸ To emphasize, using various alternative design matrices with the Moore-Penrose generalized inverse will also produce estimates, like the IE, that are estimable, unbiased, and have minimum variance relative to other estimators based on that design matrix. However, results from MP estimators differ, sometimes radically so, due to the structure of the design matrix.
To recognize why this is the case, it is crucial to understand that the null vector encodes the linear dependency among the columns of a design matrix. For any given design matrix, at least one of the columns of $X$ can be rewritten as a linear combination of the other columns.¹⁹ Formally, there is a nontrivial linear combination of the columns of $X$ that results in a vector of zeros such that $Xv = 0$, where $v$ is the $p \times 1$ null vector and $0$ is an $n \times 1$ vector of zeros. As mentioned previously, the null vector $v$ represents the null space of $X$ and is unique up to multiplication by an arbitrary scalar $s$. Accordingly, the equation $Xv = 0$ generalizes to $Xsv = 0$.
The main assumption of any MP estimator is that the true, unknown temporal effects conform to the linear dependency of the particular design matrix on which it is based. To show this, note that the linear constraint imposed by an MP estimator will yield the true parameter vector of APC effects $b$ only if $s = 0$ in the equation $b = b_{MP} + sv$ (see Land et al. 2016: 966; Yang and Land 2013a: 82). Premultiplying this equation by $v^T$, with $v$ normalized so that $v^T v = 1$ and with $v^T b_{MP} = 0$, gives $v^T b = s$, so the condition $s = 0$ is equivalent to the linear constraint $v^T b = 0$. In other words, any MP estimator assumes that the true, unknown temporal effects conform to the linear dependency of its particular design matrix.
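To see how the choice of design matrix matters, one can apply the Moore-Penrose inverse to two codings of the same data. The sketch below is illustrative and makes several assumptions: $I = J = 3$ with one observation per cell, an arbitrary outcome vector, effect coding versus treatment coding (both with the last category omitted), and hypothetical helper names. The two estimators reproduce the same fitted values, yet their implied age effects differ, and for three age groups the difference is purely linear.

```python
import numpy as np

def effect(m):
    """Sum-to-zero effect contrasts, last level omitted."""
    return np.vstack([np.eye(m - 1), -np.ones(m - 1)])

def treatment(m):
    """Treatment (dummy) contrasts with the last level as reference."""
    return np.vstack([np.eye(m - 1), np.zeros(m - 1)])

def apc_design(I, J, coding):
    """One row per age-by-period cell: intercept, age, period, cohort columns."""
    i, j = np.divmod(np.arange(I * J), J)
    k = j - i + I - 1                       # zero-based version of k = j - i + I
    K = I + J - 1
    return np.hstack([np.ones((I * J, 1)), coding(I)[i], coding(J)[j], coding(K)[k]])

def mp_fit(coding, y, I=3, J=3):
    """Fitted values and centered age effects implied by the MP estimates."""
    X = apc_design(I, J, coding)
    b = np.linalg.pinv(X, rcond=1e-10) @ y
    age = coding(I) @ b[1:I]                # effect of each age category
    return X @ b, age - age.mean()

rng = np.random.default_rng(3)
y = rng.normal(size=9)                      # arbitrary outcome, for illustration only

fit_effect, age_effect = mp_fit(effect, y)
fit_treat, age_treat = mp_fit(treatment, y)

diff = age_effect - age_treat               # proportional to (-1, 0, 1) when I = 3
```

Because the fitted values are identical, the data cannot adjudicate between the two sets of age effects; only the constraint differs.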
Unfortunately, the linear constraints imposed by MP estimators have divergent forms, precluding substantive interpretation or comparison across estimators. To illustrate the differing linear constraints, in Table 1, we show the linear constraints of nine MP estimators applied to data with $I = 3$ age groups and $J = 3$ period groups. We focus here on MP estimators based on contrast coding schemes commonly discussed in the applied statistics literature (e.g., Fox 2002: 126-30; Onyiah 2008: 121-36, 148-61; Venables and Ripley 2013: 146-9) as well as those employed in the methodological literature on APC effects in sociology and demography (e.g., Luo et al. 2016; O'Brien 2015: 55-57; Pelzer et al. 2014). We present the parameters in detailed form in Table 1 so as to clarify their substantive interpretation. Specifically, for some parameter $\beta$, we let $\beta_k$ denote the average of the $k$th category, $\bar{\beta}$ the average of all $k = 1, \dots, K$ levels, $\bar{\beta}_{i:j}$ the average of levels $i$ through $j$, and $\beta^k$ the $k$th orthogonal polynomial term. For example, $\gamma_2$ is the average of the second cohort category; $\bar{\gamma}$ is the average across all cohort categories; $\bar{\gamma}_{2:4}$ is the average of cohort categories two, three, and four; and $\gamma^2$ is the second-order orthogonal polynomial term for the cohort variable.
As shown in Table 1, we examine the linear constraints of the following estimators: (1) IE First, or the IE, which uses sum-to-zero effect coding with the first category of each variable omitted (e.g., $\alpha_2 - \bar{\alpha}$ compares the average of the second age category with the average of all age categories); (2) IE Last, also the IE, which uses sum-to-zero effect coding with the last category omitted for each variable; and the remaining estimators listed in Table 1, including the TE Last.
Table 1 notes: Based on data with $I = 3$ age, $J = 3$ period, and $K = I + J - 1 = 5$ cohort groups. Linear constraints are calculated using $v^T b = 0$, where $b$ are the true, unknown temporal effects for a given design matrix. All MP estimators assume that $v^T b_{MP} = v^T b = 0$.
Table 1 reveals that MP estimators differ greatly in terms of the linear constraints they impose on the true temporal effects. The constraint $v^T b = 0$ will differ across MP estimators because the parameter vector $b$ as well as the null vector $v$ both depend on the structure of the design matrix. As shown in Table 1, not only is it exceedingly difficult to interpret the linear constraint of any particular MP estimator but it is seemingly impossible to compare the constraints across various MP estimators. For instance, as indicated in Table 1, with $I = 3$ age groups, $J = 3$ period groups, and $K = 5$ cohort groups, the IE Last imposes one linear constraint on the true effects. In contrast, when used on a data set with the identical number of age, period, and cohort groups, the TE Last imposes a constraint of a quite different form. Due to the disparate forms of these constraints, it is not at all clear how the IE Last and the TE Last are similar and how they are different, let alone how they compare to any of the other MP estimators in Table 1. Fortunately, in the following sections, we develop a canonical form of the linear constraints that facilitates the systematic interpretation and comparison of MP estimators.

The Transformation Matrix
An MP estimator consists of a particular design matrix in conjunction with the Moore-Penrose generalized inverse. Given that there are many such design matrices, potentially an infinite number, a technique is needed for comparing the results of different MP estimators. In this section, we show how, using a special transformation matrix, any APC design matrix can be converted into a canonical form that separates the linear from the nonlinear components. Using this transformation matrix, we then show that the APC identification problem is always restricted to the linear effects, that any particular parameter can be decomposed into linear and nonlinear components, and that the solution line can always be simplified to three dimensions.

Constructing the Transformation Matrix
If we have two design matrices, $X$ and $X^*$, each of which provides a full representation of the age, period, and cohort effects and as such is rank deficient one, then there will always be an invertible $p \times p$ transformation matrix $T$ of full rank such that (Luo et al. 2016: 947)
$$XT = X^* \quad \text{and} \quad X = X^* T^{-1}. \quad (12)$$
However, given a set of MP estimators defined by different design matrices along with their corresponding estimates, we need to decide on a common design matrix $X^*$ as a basis for comparison. We will let $X_O$ denote the canonical form of any APC design matrix. The corresponding parameter vector $b_O = (\mu, \alpha, \alpha^2, \dots, \alpha^{I-1}, \pi, \pi^2, \dots, \pi^{J-1}, \gamma, \gamma^2, \dots, \gamma^{K-1})$ represents the full set of orthogonally separated linear and nonlinear components for age, period, and cohort. The null vector of $X_O$ has the simple form
$$v_O = (0, 1, -1, 1, 0, \dots, 0)^T, \quad (13)$$
where the first zero corresponds to the intercept; the elements one, negative one, and one correspond to the age, period, and cohort linear components; and the remaining zeros correspond to the $(I - 2) + (J - 2) + (K - 2)$ nonlinear components. That is, the null vector $v_O$ encodes the fundamental linear relationship underlying all APC data, namely that a person's age minus their year of measurement plus their birth year equals zero.
To convert any APC data set into canonical form, we will construct a special transformation matrix $T$ (cf. Luo et al. 2016: 947-52). Let $A$, $P$, and $C$ denote the original, untransformed contrast matrices for age, period, and cohort terms, with dimensions $I \times (I - 1)$, $J \times (J - 1)$, and $K \times (K - 1)$, respectively. Although sum-to-zero effect (or deviation) coding is the most frequently used in the APC literature, these matrices may be coded with any number of schemes without loss of generality. Each of these contrast matrices has full column rank but not full row rank. Hence, we can construct three left inverses
$$A^L = (A^T A)^{-1} A^T, \quad P^L = (P^T P)^{-1} P^T, \quad C^L = (C^T C)^{-1} C^T, \quad (14)$$
where the superscript $L$ denotes a left inverse. Let $A_O$, $P_O$, and $C_O$ denote corresponding contrast matrices for age, period, and cohort, in which the linear and nonlinear components are orthogonal to each other (Draper and Smith 2014: 461-72). We are now in position to construct the transformation matrix $T$, which is a block diagonal matrix in which the main diagonal blocks are square matrices and off-diagonal blocks are zero matrices. The transformation matrix $T$ has the generic form
$$T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & A^L A_O & 0 & 0 \\ 0 & 0 & P^L P_O & 0 \\ 0 & 0 & 0 & C^L C_O \end{pmatrix} \quad (15)$$
or, equivalently,
$$T = 1 \oplus A^L A_O \oplus P^L P_O \oplus C^L C_O, \quad (16)$$
where $\oplus$ is the direct sum. As we demonstrate in the next section, using $T$, we can convert any design matrix and parameter vector of an APC data set into canonical form.
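The construction of the transformation matrix can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the original coding is sum-to-zero effect coding with the last category omitted, $I = J = 3$, the left inverse is taken to be $(A^T A)^{-1} A^T$, and the canonical contrasts use the centered category index as the linear column with the nonlinear columns obtained by a QR decomposition. All helper names are hypothetical. Columns follow the order of $b_O$, so the nonzero entries of the canonical null vector sit at the three linear positions.

```python
import numpy as np

def effect(m):
    """Sum-to-zero effect contrasts, last level omitted."""
    return np.vstack([np.eye(m - 1), -np.ones(m - 1)])

def orth(m):
    """Linear column (centered index) plus nonlinear columns orthogonal to it."""
    x = np.arange(1, m + 1) - (m + 1) / 2
    Q, _ = np.linalg.qr(np.vander(x, m, increasing=True))
    return np.column_stack([x, Q[:, 2:]])   # drop constant and linear from Q

def apc_design(I, J, coding):
    """One row per age-by-period cell: intercept, age, period, cohort columns."""
    i, j = np.divmod(np.arange(I * J), J)
    k = j - i + I - 1                       # zero-based version of k = j - i + I
    K = I + J - 1
    return np.hstack([np.ones((I * J, 1)), coding(I)[i], coding(J)[j], coding(K)[k]])

def left_inverse(A):
    """Left inverse of a full-column-rank matrix: A_L @ A = identity."""
    return np.linalg.inv(A.T @ A) @ A.T

def direct_sum(blocks):
    """Block diagonal matrix built from square blocks."""
    size = sum(b.shape[0] for b in blocks)
    out, pos = np.zeros((size, size)), 0
    for b in blocks:
        m = b.shape[0]
        out[pos:pos + m, pos:pos + m] = b
        pos += m
    return out

I, J = 3, 3
K = I + J - 1
X, X_O = apc_design(I, J, effect), apc_design(I, J, orth)
p = X.shape[1]

blocks = [np.ones((1, 1))] + [left_inverse(effect(m)) @ orth(m) for m in (I, J, K)]
T = direct_sum(blocks)

# Canonical null vector: 1, -1, 1 on the age, period, and cohort linear terms
v_O = np.zeros(p)
v_O[[1, I, I + J - 1]] = [1.0, -1.0, 1.0]
```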

Critical Insights Using the Transformation Matrix
Using the transformation matrix, we can make several important insights that will greatly simplify the interpretation and comparison of APC estimators. First, we can prove that the results of any constrained estimator, including any MP estimator, will differ only in terms of their estimated linear effects. Specifically, using $T$, we can convert any design matrix $X$ as well as its corresponding parameter vector $b$ into canonical form (cf. Luo et al. 2016: 947)
$$Xb = XTT^{-1}b = X_O b_O, \quad (17)$$
where again $X_O$ is a design matrix of orthogonal linear and nonlinear components and $b_O$ is a vector of the new, transformed estimates expressed in terms of linear and nonlinear effects. Because the null vector of $X_O$ consists of nonzero elements only for the age, period, and cohort linear effects,²⁰ and we can convert any design matrix into the canonical form $X_O$, this proves that the APC identification problem is restricted to the linear effects and that the intercept and nonlinear effects of any constrained APC estimator are identified.²¹
Second, using $T$, we can decompose any particular parameter into its constituent linear and nonlinear components. To decompose a parameter vector, we use the equation $b = Tb_O$, where $b$ is the original set of temporal effects, $T$ is the transformation matrix defined above, and $b_O$ are the effects expressed in canonical form. In Table 2, we show how to use the transformation matrix to decompose sum-to-zero effects (with the last category omitted) into a set of linear and nonlinear parameters. For example, in the case of $I = 3$ age groups, the first age parameter with sum-to-zero effect coding represents $\alpha_1 - \bar{\alpha}$, or the difference between the average of the first age category and the average across all age categories. This particular parameter is mathematically equal to a weighted sum of the age slope and the quadratic age effect: $(\alpha_1 - \bar{\alpha}) = (-1)\alpha + (1)\alpha^2$. As we show later, this decomposition allows us to express the linear constraints of MP estimators in a canonical form.
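The age decomposition for $I = 3$ can be reproduced with a few lines of linear algebra. The sketch below assumes unnormalized polynomial contrasts, $(-1, 0, 1)$ for the linear term and $(1, -2, 1)$ for the quadratic term, a scaling chosen because it reproduces the weights $(-1)$ and $(1)$ quoted in the text; the slope and quadratic values are hypothetical.

```python
import numpy as np

# Effect contrasts for three age groups, last category omitted
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])

# Assumed canonical contrasts: linear (-1, 0, 1) and quadratic (1, -2, 1)
A_O = np.array([[-1.0, 1.0],
                [0.0, -2.0],
                [1.0, 1.0]])

A_L = np.linalg.inv(A.T @ A) @ A.T          # left inverse of A
T_A = A_L @ A_O                             # age block of the transformation matrix

slope, quad = 2.0, 0.5                      # hypothetical age slope and quadratic effect
effects = T_A @ np.array([slope, quad])     # (alpha_1 - abar, alpha_2 - abar)

# Both parameterizations imply the same three category effects
category_effects = A @ effects              # equals A_O @ [slope, quad]
```

The first row of `T_A` holds the weights $(-1, 1)$, so the first effect-coded age parameter equals minus the slope plus the quadratic term, here $-2.0 + 0.5 = -1.5$.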
Finally, using the transformation matrix, we can derive a canonical form of the solution line. Researchers have noted that any particular design matrix defines a multidimensional solution line in parameter space (Luo et al. 2016; O'Brien 2011; O'Brien 2015: 59-91). However, it has not been fully appreciated that the solution line for any constrained estimator, including the MP estimator, can be greatly simplified. Using the transformation matrix outlined previously, we can make an important generalization: all constrained APC estimators lie on the same simplified solution line in three-dimensional space, or what we call the canonical solution line. Specifically, because $T^{-1} b = b_O$, we can express the equation for the solution line for any APC design matrix as
$$b_O = b_O^* + sT^{-1}v, \quad (18)$$
where $b_O^*$ is a constrained set of estimates separated into nonlinear and linear components. Moreover, because $sT^{-1}v$ is a scalar multiple of the canonical null vector $v_O$, whose only nonzero elements correspond to the three linear effects, the solution line reduces to
$$(\alpha, \pi, \gamma)^T = (\alpha^*, \pi^*, \gamma^*)^T + s(1, -1, 1)^T \quad (19)$$
or, equivalently,
$$\alpha = \alpha^* + s, \quad \pi = \pi^* - s, \quad \gamma = \gamma^* + s, \quad (20)$$
where $\alpha$, $\pi$, and $\gamma$ denote the true, unknown data-generating slopes; $s$ is an arbitrary scalar; and the asterisks denote any particular constrained set of estimates on the solution line. Because, as mentioned previously, $s$ can take on any real number, Equation 20 defines a line in a three-dimensional space for the design matrix $XT = X_O$ or, in other words, the canonical design matrix. For example, Figure 1 visualizes the solution line for APC data generated using values of $\alpha = 1$, $\pi = -4$, and $\gamma = 6$ along with a set of nonlinear effects. Because the nonlinearities are identified, all possible estimates of the temporal effects lie on the line in the three-dimensional space in Figure 1.
Table 2 notes: Decomposition based on $I = 3$ age, $J = 3$ period, and $K = I + J - 1 = 5$ cohort groups. Inverse of the transformation is used to decompose original sum-to-zero effect coding into linear and nonlinear components such that $b_{Effect} = Tb_O$. Last category is omitted for each variable. For some parameter $\beta$, we let $\beta_k$ denote the average of the $k$th category, $\bar{\beta}$ the average of all $k = 1, \dots, K$ levels, and $\beta^k$ the $k$th orthogonal polynomial term. For example, $\gamma_2$ is the average of the second cohort category, $\bar{\gamma}$ is the average across all cohort categories, and $\gamma^2$ is the second-order orthogonal polynomial term for the cohort variable.

Canonical Linear Constraints of MP Estimators
The major obstacle to understanding and comparing the linear constraints imposed by various MP estimators is that they differ greatly because they are a function of the design matrix. However, using the transformation matrix described in the previous section, we can derive a canonical form of the linear constraints of the IE and related MP estimators. To obtain the canonical linear constraint of an MP estimator, we can use the equation

$$v^{\top} b = v^{\top} T b_O = 0, \tag{21}$$

where again $b$ is the original set of parameters; $T$ is the transformation matrix; $v$ is the original, untransformed null vector; 22 and $b_O$ is the set of the parameters expressed in canonical form, with separate linear and nonlinear effects. There are several steps to deriving the canonical linear constraint of a particular MP estimator. First, we use the transformation matrix to decompose each parameter into its linear and nonlinear components: $b = Tb_O$. Next, we multiply the parameter vector of decomposed effects by the transpose of the original, untransformed null vector, $v^{\top}$. Finally, we rearrange terms so that the parameters representing the nonlinear effects are on the right-hand side of the equation and then simplify. This gives us the canonical linear constraint for that particular MP estimator. Like the linear constraints discussed previously, the canonical linear constraints will differ across estimators. However, the canonical linear constraints of the IE and other MP estimators have a general form that clarifies their assumptions and their sensitivity to changes in the data:

$$w_1\alpha - w_2\pi + w_3\gamma = v, \tag{22}$$

where $w_1$, $w_2$, and $w_3$ are weights and $v$ is here a scalar (not the null vector of Equation 21). Equation 22 is crucial for understanding the divergent properties of MP estimators, including the IE. In general, across MP estimators, the estimates of the linear effects vary depending on at least three aspects of the data: first, the number of APC groups in the data set, which alters the weights for $\alpha$, $\pi$, and $\gamma$ as well as the value of the scalar $v$; second, the size and sign of the nonlinearities, which shift the value of $v$; finally, the choice of the reference category, which also shifts the value of $v$. If there are no nonlinearities in the data set, then regardless of the MP estimator, $v = 0$; otherwise, the IE and other MP estimators constrain the weighted sum of the slopes to equal some other arbitrary value of $v$.
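To see how a canonical linear constraint of this general form picks out one point on the canonical solution line, consider the following minimal sketch. It assumes that the solution line moves the slopes in the direction (1, -1, 1), as implied by the linear dependency among age, period, and cohort; the function name `constrained_slopes` and its arguments are hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

def constrained_slopes(true_slopes, weights, v=0):
    """Point on the canonical solution line satisfying w1*a - w2*p + w3*g = v.

    Assumes the line is (alpha + s, pi - s, gamma + s) for a scalar s.
    """
    alpha, pi, gamma = (Fraction(x) for x in true_slopes)
    w1, w2, w3 = (Fraction(w) for w in weights)
    # Substitute the line into the constraint and solve for s.
    s = (Fraction(v) - (w1 * alpha - w2 * pi + w3 * gamma)) / (w1 + w2 + w3)
    return alpha + s, pi - s, gamma + s

true = (-2, 4, 1)
# Weights (1, 1, 6) with v = 0: the true slopes already satisfy the constraint.
print(constrained_slopes(true, (1, 1, 6)))
# Same weights, but v shifted to 1: the estimates move along the line.
print(constrained_slopes(true, (1, 1, 6), v=1))
# Equal weights with v = 0: a different point on the same line.
print(constrained_slopes(true, (1, 1, 1)))
```

The three calls use the same true slopes but return different estimates, which illustrates that the weights and the scalar on the right-hand side, not the data, determine which point on the line is reported.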
We examine the canonical linear constraints of nine MP estimators in Tables 3, 4, and 5. For all tables, we keep the number of age groups fixed at $I = 3$ but vary the number of period groups (and accordingly, the number of cohort groups). Note that in practice, the constraints in Tables 3, 4, and 5 can be simplified because the nonlinearities are identified; that is, the right-hand side of the canonical linear constraints reduces to a scalar when applied to any given set of data. For example, the canonical linear constraint of the IE First in Table 3

Mathematically, the canonical linear constraints in these tables reveal that all of the MP estimators examined, except for the OE, will produce differing linear constraints and thus divergent estimates of the true temporal effects because of the following: changes in the category omitted (or similarly, the reference category used), the number of APC categories, and the size and direction of the nonlinearities. Out of all the estimators examined, only the DE Forward, DE Backward, and OE produce estimates that do not depend on whether the first or last category is omitted. However, only the OE is robust to the size and sign of the nonlinearities as well as the number of APC categories used. This analysis also informs us when MP estimators will appear to produce reliable results. MP estimators will perform well when (1) there are significant nonlinearities and zero or very small linear effects or (2) the underlying true linear effects have the same relationship among each other as that of the canonical linear constraint. It is important to emphasize that the mathematical constraint imposed by MP estimators, including the IE, will not, generally speaking, recover the underlying data-generating parameters. The reason for this is that the estimates produced by an MP estimator and the true data-generating parameters will equal each other only if the parameters happen to conform to that estimator's specific mathematical constraint. We know of no reason why this should ever be the case in any particular substantive application.

Notes to Table 3: Decomposition based on $I = 3$ age, $J = 3$ period, and $K = I + J - 1 = 5$ cohort groups. The transformation matrix is used to decompose original effects into linear and nonlinear components, which are then multiplied by the original null vector. For ease of exposition, null vectors and orthogonal polynomial contrasts with integer elements are used to calculate the linear constraints. For some parameter $\beta$, we let $\beta^k$ denote the $k$th orthogonal polynomial term. For example, $\gamma^2$ is the second-order orthogonal polynomial term for the cohort variable.

Examples
We now turn to several simulations to supplement the mathematical discussion of the previous section. We first compare the various MP estimators when applied to a set of simulated data. These results are shown in Table 6. For simplicity, and without loss of generality, we assume there is no random error. The results show that the MP estimators examined differ only in their linear components; the IE and related MP estimators all recover the intercept and the nonlinear effects. Note that none of the estimators recover the true linear effects. 23

The results in Table 6 raise an important issue regarding the use of fit statistics to determine whether or not an MP estimator should be used, as recommended by Land and colleagues (e.g., Yang and Land 2013a: 125-53; Yang and Land 2013b). They contend that a researcher should adopt a three-step procedure when considering using the IE. In the first step, the researcher conducts a descriptive analysis of the temporal effects using graphical techniques. In the second step, the researcher uses fit statistics to determine whether or not "the data are sufficiently well described by any single factor or two-way combination" of age, period, and cohort (Yang and Land 2013a: 126). If graphical techniques and model fitting suggest that "only one or two of the three effects are operative," then the researcher "can proceed with a reduced model that omits one or two groups of variables" because then "there is no identification problem" (Yang and Land 2013a: 126; Yang and Land 2013b: 1969). However, if these analyses suggest that "all three dimensions are at work," then one should proceed to step three and implement the IE (Yang and Land 2013a: 126). They emphasize that the IE should not be used unless "all three dimensions are operative" (Yang and Land 2013b: 1969).

Notes to Table 5: Decomposition based on $I = 3$ age, $J = 5$ period, and $K = I + J - 1 = 7$ cohort groups. The transformation matrix is used to decompose original effects into linear and nonlinear components, which are then multiplied by the original null vector. For some parameter $\beta$, we let $\beta^k$ denote the $k$th orthogonal polynomial term. For example, $\gamma^2$ is the second-order orthogonal polynomial term for the cohort variable.

Notes: The transformation matrix is used to decompose original effects into linear and nonlinear components, which are then multiplied by the original null vector. For some parameter $\beta$, we let $\beta^k$ denote the $k$th orthogonal polynomial term. For example, $\gamma^2$ is the second-order orthogonal polynomial term for the cohort variable.
Unfortunately, it is impossible to determine from the data alone whether or not all three temporal variables are operating. Believing otherwise can seriously mislead researchers. For example, consider the models in Table 7. The underlying data-generating process consists of nonzero slopes for all three temporal variables: $\alpha = 0.500$, $\pi = 1.500$, and $\gamma = 2.000$. However, whereas the age and period variables have nonzero nonlinear effects in the data-generating process, all of the cohort nonlinearities are zero. Land and colleagues argue that one should use a reduced model when it fits the data equally well or better than a full model with all three temporal variables (Yang and Land 2013a: 109). The fit statistics in Table 7 suggest that one should, according to Land and colleagues, fit a two-factor age-period (AP) model rather than a three-factor APC model. However, by doing so, one does not avoid the nonidentifiability of the three linear effects. Rather, by fitting the two-factor model with only age and period effects, one is imposing the identification assumption that the cohort linear effect is zero even though its true linear effect is $\gamma = 2.000$. This zero-linear-trend constraint on the cohort variable is external to the data, imposed by the researcher. Depending on the substantive application, it may or may not be reasonable to assume that because the nonlinear effects of cohort are observed to be zero, its linear effect is also zero. However, this is an assumption that can only be justified by appealing to theory or the inclusion of additional data.
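The point that a well-fitting two-factor model still imposes a constraint can be illustrated with a minimal sketch. It uses the slopes from the example above but, as simplifying assumptions of this sketch, omits the nonlinear effects and the random disturbances and uses treatment coding for the age-period model; all variable names are hypothetical.

```python
import numpy as np

I, J = 5, 5
K = I + J - 1
age, period = (g.ravel() for g in np.meshgrid(np.arange(I), np.arange(J), indexing="ij"))
cohort = period - age + I - 1          # cohort index implied by age and period

def centred(x, n):
    """Centre a 0-based level index with n levels."""
    return x - (n - 1) / 2

# Purely linear, noiseless data with nonzero slopes for all three variables.
alpha, pi, gamma = 0.5, 1.5, 2.0
y = 1.0 + alpha * centred(age, I) + pi * centred(period, J) + gamma * centred(cohort, K)

# Two-factor age-period model: intercept plus age and period dummies
# (first level of each variable as the reference category).
X_ap = np.column_stack([np.ones(I * J), np.eye(I)[age][:, 1:], np.eye(J)[period][:, 1:]])
coef, *_ = np.linalg.lstsq(X_ap, y, rcond=None)
max_abs_resid = float(np.max(np.abs(y - X_ap @ coef)))

def slope(effects):
    """Least-squares slope of a set of effects on the centred level index."""
    x = np.arange(len(effects)) - (len(effects) - 1) / 2
    return float(x @ effects / (x @ x))

age_slope = slope(np.concatenate([[0.0], coef[1:I]]))
period_slope = slope(np.concatenate([[0.0], coef[I:]]))
print(max_abs_resid, age_slope, period_slope)
```

In this sketch the age-period model fits the data perfectly, yet its age and period slopes equal $\alpha - \gamma$ and $\pi + \gamma$ rather than the true $\alpha$ and $\pi$: the cohort slope has been absorbed by the other two variables because it was implicitly set to zero.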
Notes to Table 7: Based on simulated data with $I = 5$ age groups, $J = 5$ period groups, and $I + J - 1 = K = 9$ cohort groups. Sample size is $N = 25{,}000$. Data-generating parameters are $\mu = 1.000$, $\alpha = 0.500$, $\alpha^2 = -1.450$, $\alpha^3 = -0.200$, $\alpha^4 = -0.150$, $\pi = 1.500$, $\pi^2 = 0.900$, $\pi^3 = 0.700$, $\pi^4 = -0.200$, and $\gamma = 2.000$. Data-generating model includes disturbances drawn from a normal distribution with a mean of zero and standard deviation of 3.000. Zero-sum effect coding with the last category omitted is used for the two-factor AP (age-period), AC (age-cohort), and PC (period-cohort) models. Three-factor age-period-cohort (APC) model is estimated using IE Last. Shaded column indicates model with lowest AIC and BIC scores.

We now compare two estimators in particular: the IE with the last category omitted (the most widely used of the MP estimators) and the OE (which, as we have shown mathematically, is robust to the number of APC categories, the choice of reference group, and the size and direction of the nonlinearities). The general canonical linear constraint of the previous section tells us that mathematically, the IE Last is sensitive to the number of APC groups. Table 8 shows how the IE Last's estimates of the slopes change as the number of period groups increases from $J = 3$ to $J = 1000$. 24 For all simulations in Table 8, we keep the data-generating process the same with $\alpha = -2.000$, $\pi = 4.000$, and $\gamma = 1.000$. For simplicity, we keep the number of age groups constant at $I = 3$ as we increase the number of period groups (and accordingly, the number of cohorts). A different number of age groups does not alter our findings regarding the sensitivity of the IE Last to the number of APC groups. We purposely constructed the data-generating process so that it initially conforms to the IE's constraint. With $I = 3$ age groups and $J = 3$ period groups, as well as no nonlinearities, $w_1 = 1$, $w_2 = 1$, $w_3 = 6$, and $v = 0$. We purposely chose values of $\alpha$, $\pi$, and $\gamma$ so that $\alpha - \pi + 6\gamma = -2 - (4) + 6(1) = 0$. This is indicated in the first row of Table 8, which is shaded. As we increase the number of period groups, the values of the weights change, thereby altering the IE Last's canonical linear constraint. The canonical linear constraints are shown in the last column of Table 8. Because the data-generating process contains no nonlinearities and remains the same as we increase the number of groups, the value of $v$ for each constraint is zero. However, the values of the $w$'s for the period and cohort slopes increase as we increase the number of period (and cohort) groups. 25

Although our initial data set satisfies the IE Last's canonical linear constraint, as the number of period groups increases, the IE Last estimates diverge dramatically from the true data-generating parameters. In this particular case, as we increase the number of period groups, the IE Last's estimate of the age slope increases, its estimate of the period slope decreases, and its estimate of the cohort slope increases. The reason is that as the number of period and cohort groups increases, their values of $w$ become very similar. With $J = 1000$ period groups, the weights for the period and cohort slopes are approximately the same. Thus, the claim (Fu 2016) that the IE's constraint on the true, unknown slopes is consistent as the number of time periods increases towards infinity is not generally true.
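The dependence of the IE Last's weights on the number of groups can be reproduced with a short calculation. The sketch below rests on an assumption that is not stated in the text: with unit-spaced linear contrasts, no nonlinearities, and the last category omitted, each weight is the sum of squared centred level indices over the retained categories, rescaled so that $w_1 = 1$. The function names are hypothetical. Under that assumption the calculation matches the weights quoted in the text for $J = 3$, $J = 10$, and $J = 15$.

```python
from fractions import Fraction

def centred_sq_sum(n_levels):
    """Sum of squared centred indices over levels 1..n-1 (last level omitted)."""
    mean = Fraction(n_levels + 1, 2)
    return sum((Fraction(level) - mean) ** 2 for level in range(1, n_levels))

def ie_last_weights(I, J):
    """Weights (w1, w2, w3) on the age, period, and cohort slopes, with w1 = 1."""
    K = I + J - 1
    raw = [centred_sq_sum(n) for n in (I, J, K)]
    return tuple(w / raw[0] for w in raw)

for J in (3, 10, 15, 1000):
    w1, w2, w3 = ie_last_weights(3, J)
    print(J, w1, w2, w3, float(w3 / w2))
```

The last printed column is the ratio of the cohort weight to the period weight, which approaches one as $J$ grows, in line with the observation that the two weights become approximately the same with $J = 1000$ period groups.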
The last column of Table 8 underscores the complicated, nonintuitive, and highly variable nature of the IE Last's canonical linear constraint as well as the extreme difficulty of interpreting it substantively. For example, with $J = 10$ groups, the constraint in Table 8 is $\alpha - \frac{249}{4}\pi + \frac{451}{4}\gamma = 0$, with weights of $w_1 = 1$, $w_2 = \frac{249}{4}$, and $w_3 = \frac{451}{4}$. However, with $J = 15$ groups, the IE constraint becomes $\alpha - 231\pi + 344\gamma = 0$, with weights of $w_1 = 1$, $w_2 = 231$, and $w_3 = 344$. It is exceedingly difficult, if not impossible, to muster a reason why these specific values of $w$ have any theoretical importance. In contrast, as shown in Table 9, regardless of the number of groups, the OE's canonical linear constraint is $\alpha - \pi + \gamma = 0$.

The canonical linear constraint also clarifies that the IE Last's estimates depend on the size (e.g., large or small in absolute value) and sign (e.g., positive or negative) of the nonlinearities. For example, using the same data-generating slopes as in the previous section, Table 10 shows how the IE Last's estimates change depending on the magnitude and direction of the age nonlinearity. For all simulations, we keep the number of age and period groups at $I = 3$ and $J = 3$, respectively. Again, because we specifically constructed a data-generating process that conforms to the IE constraint when there are $I = 3$ age groups and $J = 3$ period groups as well as zero nonlinearities, the IE Last indeed recovers the true slopes when the age nonlinearity is zero. This row is shaded in Table 10.
Because the number of age groups is set at $I = 3$ for all simulations in Table 10, the age weight is set at $w_1 = 1$, and the value of $v$ changes directly with the age nonlinearity. As the age nonlinearity becomes more positive, the age and cohort slopes move towards positive infinity on the real number line, whereas the period slope moves towards negative infinity. In contrast, as the age nonlinearity becomes more negative, the age and cohort slopes move towards negative infinity on the real number line, whereas the period slope moves towards positive infinity. For the same underlying age, period, and cohort linear effects in the population, the IE Last will give radically different estimates of the slopes depending on the nonlinear effects. This is in contrast to the OE, which (as shown in Table 11) provides the same constraint regardless of the nonlinearity of age.

Notes to Table 10: True linear effects are $\alpha = -2.000$, $\pi = 4.000$, and $\gamma = 1.000$. For all simulations, number of age, period, and cohort groups is set at $I = 3$, $J = 3$, and $K = I + J - 1 = 5$, respectively. Sample size for each simulation is $n = 1{,}000 \times (I \times J) = 9{,}000$. Shaded row indicates simulated data in which, by construction, the IE constraint is satisfied.
The last column of Table 10 again brings to light the complex and variable nature of the IE Last's constraint, even in canonical form. For instance, with $\alpha^2 = -1.000$, the constraint in Table 10 is $\alpha - \pi + 6\gamma = -1.000$, but with $\alpha^2 = +1.000$, the constraint becomes $\alpha - \pi + 6\gamma = +1.000$. Again, it is very difficult, if not impossible, to give a theoretical reason why this particular linear combination of the temporal slopes in the population should equal $v = +1.000$ rather than $v = -1.000$ simply because the quadratic age trend is a positive one rather than a negative one. In contrast, the OE imposes the same linear constraint regardless of the magnitude or direction of the nonlinearities. As demonstrated in Table 11, the OE's canonical linear constraint is always $\alpha - \pi + \gamma = 0$ despite changes in the size or sign of the age-squared effect.
Notes to Table 11: True linear effects are $\alpha = -2.000$, $\pi = 1.000$, and $\gamma = 3.000$. For all simulations, number of age, period, and cohort groups is set at $I = 3$, $J = 3$, and $K = I + J - 1 = 5$, respectively. Sample size for each simulation is $n = 1{,}000 \times (I \times J) = 9{,}000$. Shaded row indicates simulated data in which, by construction, the OE constraint is satisfied.
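The sensitivity of the IE Last to the age nonlinearity can also be checked directly with the Moore-Penrose inverse. The following is a minimal sketch for $I = 3$ and $J = 3$ with noiseless data (the simulations reported in the tables include sampling error); it assumes the integer quadratic contrast (1, -2, 1) for age, and the function names `effect_coding`, `slope`, and `ie_last_slopes` are hypothetical.

```python
import numpy as np

def effect_coding(n):
    """Sum-to-zero contrasts with the last level omitted (n rows, n - 1 columns)."""
    return np.vstack([np.eye(n - 1), -np.ones((1, n - 1))])

def slope(effects):
    """Least-squares slope of a full set of effects on the centred level index."""
    x = np.arange(len(effects)) - (len(effects) - 1) / 2
    return float(x @ effects / (x @ x))

def ie_last_slopes(alpha, pi, gamma, age_quad):
    """Slopes implied by the Moore-Penrose solution under effect coding (I = J = 3)."""
    I, J = 3, 3
    K = I + J - 1
    age, period = (g.ravel() for g in np.meshgrid(np.arange(I), np.arange(J), indexing="ij"))
    cohort = period - age + I - 1
    quad = np.array([1.0, -2.0, 1.0])      # integer quadratic contrast for 3 levels
    # Noiseless outcome: three linear trends plus a quadratic age effect.
    y = (alpha * (age - (I - 1) / 2) + pi * (period - (J - 1) / 2)
         + gamma * (cohort - (K - 1) / 2) + age_quad * quad[age])
    X = np.column_stack([np.ones(I * J), effect_coding(I)[age],
                         effect_coding(J)[period], effect_coding(K)[cohort]])
    b = np.linalg.pinv(X, rcond=1e-10) @ y  # minimum-norm least-squares solution
    a = effect_coding(I) @ b[1:I]           # restore the omitted last category
    p = effect_coding(J) @ b[I:I + J - 1]
    c = effect_coding(K) @ b[I + J - 1:]
    return slope(a), slope(p), slope(c)

for q in (-1.0, 0.0, 1.0):
    a, p, c = ie_last_slopes(-2.0, 4.0, 1.0, q)
    print(q, round(a, 3), round(p, 3), round(c, 3), round(a - p + 6 * c, 3))
```

In this sketch the last printed value, $\alpha - \pi + 6\gamma$ evaluated at the estimated slopes, equals the age quadratic term, so the true slopes are recovered only when that term is zero.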

Conclusion
This article draws a number of important conclusions not recognized or fully appreciated in the current APC literature. First, we compare the similarities and differences of MP estimators, clarifying that although all MP estimators share the same desirable statistical properties, they diverge in a decisive way: because they are based on varying design matrices, they impose differing linear constraints on the true, unknown temporal effects. Second, we show how to explicitly construct a transformation matrix that allows one to convert any APC design matrix into canonical form with the linear and nonlinear components orthogonally separated. Third, using this transformation matrix and canonical form of the design matrix, we then prove that all constrained estimators, including MP estimators, generate the same set of nonlinear effects but differing linear effects. Moreover, we show that the solution line of any APC data set can always be simplified to a canonical form that spans just three dimensions. Fourth, we show mathematically that the IE and related MP estimators have a general canonical linear constraint, parallel to the fundamental linear dependency among the time scales, such that $w_1\alpha - w_2\pi + w_3\gamma = v$, where the $w$'s are weights and $v$ is a scalar. To our knowledge, this is the first analysis to reveal this general canonical linear constraint and use it to compare multiple MP estimators. Finally, we show, both mathematically and using simulations, that the IE and a number of other MP estimators produce varying canonical linear constraints depending on the size and sign of the nonlinear effects, the number of APC categories, and the choice of reference category. However, two MP estimators, the OE and DE, are both easier to interpret and more robust than the IE. In particular, we find that the OE's constraint can always be expressed as $\alpha - \pi + \gamma = 0$.
We now conclude by offering a set of practical guidelines for APC researchers. In general, we do not recommend using MP estimators, including the IE, in an attempt to recover the true APC effects. However, if a researcher chooses to use an MP estimator, including the IE, then we have several recommendations. First, regardless of the MP estimator one uses, the full set of linear and nonlinear effects should be reported. This will allow the researcher to evaluate the legitimacy of the constraint imposed on the true, unknown linear APC effects. Fit statistics should be used judiciously, with full appreciation that they cannot discern whether or not two or three linear effects are operating in any particular data set. Ultimately, any linear constraint should be grounded in an underlying social, cultural, or biological theory.
Second, if using an MP estimator, then we recommend using the OE. The OE's linear constraint on the true temporal effects does not vary based on changes in the design matrix, such as the number of period groups or the size and sign of the nonlinearities. Moreover, the OE has the minimum sampling variance with respect to the canonical solution line spanning three dimensions and has the null vector with the smallest number of nonzero elements. In this way, the OE is the most parsimonious representation of the linear dependency inherent in any APC data set.
Finally, we caution that researchers who use the IE or related MP estimators in the hopes of uncovering the true data-generating parameters are unlikely to attain their goal. If using the IE, one should state explicitly why it is reasonable to assume that the model parameters in the population have the same linearly dependent relationship among each other as their corresponding columns in the design matrix. In general, we suspect that this constraint will be difficult, if not impossible, to justify. At this point, it is unclear what type of logic would be used to connect a set of theoretical considerations to a set of assumptions about the linear dependency in the design matrix. To emphasize, there is, to our knowledge, no social, biological, or cultural theory that claims that the true, unknown APC effects must conform to some specific linear dependence in the data.
Although we think that the OE is preferable to the IE, this in no way implies that it is preferable to other approaches to analyzing APC data. In general, one can divide the set of methods for identifying APC effects into statistical and theoretical approaches. The strength of a statistical approach, such as the OE or IE, is that it is not based on any explicit theoretical or substantive assumptions that researchers may strongly disagree about. The weakness of a statistical approach is that there is, in general, no reason to believe that it estimates the true parameters for the model that generated the data. A variety of theoretically based strategies for identifying APC effects are possible. For example, one could assume, using social, biological, or cultural theory, that the linear effect of the age, period, or cohort variables is positive or negative. As Fosse and Winship (2018) show, this can be used to bound estimates of the data-generating parameters, in some cases leading to quite narrow bounds despite weak assumptions. An alternative theoretical approach entails specifying the mechanisms through which age, period, and cohort impact the outcome. As Winship and Harding (2008) demonstrate, if one has variables measuring all the pathways through which at least one of the age, period, or cohort variables operates, then it is possible to identify the underlying data-generating parameters. To be sure, an appreciable issue with any theoretical approach is that researchers may well disagree on the validity of particular assumptions. Although this can be a serious problem, at least when there is disagreement, it will be clear why different approaches lead to different estimates.
Ultimately, a theoretical approach is to be preferred to a purely statistical one. In most cases, researchers are interested in the true but unknown APC effects. We appreciate that others see more value in a statistical approach. Nonetheless, if one is going to use a statistical approach to identification, there is merit in using an estimator, such as the OE, that provides separate estimates of the nonidentified linear and identified nonlinear parameters. As we have demonstrated, doing so produces estimates of the linear effects that are not affected by the number of periods and cohorts available to the researcher, the choice of reference category, or the size and sign of the nonlinear effects.
number of samples is equal to its value calculated in the population) does not indicate that it produces an unbiased estimate of the parameters of the underlying model that generated the data (that is, the true, unknown temporal effects).
18 This can be further revealed by simply examining the various proofs showing that the IE has these properties and noting that none depend on using a zero-sum, effects-coded design matrix (for example, see Fu 2000: 263-8, 276-7; Yang et al. 2004: 107-8; Yang and Land 2013a: 75-123).
19 Recall that a linear combination is any mathematical expression that entails adding a set of terms each multiplied by a constant, where the constant can include one. For example, the well-known relationship between temperature in degrees Fahrenheit ($F$) and Celsius ($C$) is a linear combination: $F = C \times \frac{9}{5} + 32$. When we convert from degrees Celsius to Fahrenheit, we are not changing the temperature; rather, we are simply recentering (by adding 32) and rescaling (by multiplying by $\frac{9}{5}$) the distribution of the temperature in degrees Celsius. Crucially, once we know the temperature in degrees Celsius, we know the temperature in Fahrenheit because it is a simple linear combination. In a similar way, regardless of the coding scheme for a set of APC data, we can express at least one variable as a linear relationship of the other variables.

Stage 1: Find those values of $b$ that minimize $\lVert Xb - y \rVert^2$. (9)

Stage 2: Minimize $\lVert b \rVert^2$ among all solutions from Stage 1.

so $b_{MP} = b$ when $s = 0$ and $b_{MP} \neq b$ when $s \neq 0$. (11)

Rearranging $b = b_{MP} + sv$ and taking the expectation, we know that $E(b_{MP}) = b - sv$. Thus, the expected value of the MP estimate will not equal the true, unknown temporal effects unless s

Table 1: Linear constraints of MP estimators with $I = 3$ and $J = 3$ groups.
the average of the first age category with the average of all age categories); (3) TE First, or the treatment estimator (TE), which uses treatment contrasts with the first category of each variable as a reference (e.g., $\alpha_3 - \alpha_1$ compares the average of the third age category with the average of the first age category); (4) TE Last, also the TE, which uses treatment contrasts with the last category of each variable as a reference (e.g., $\alpha_1 - \alpha_3$ compares the average of the first age category with the average of the last age category); (5) HE Backward, or the Helmert estimator (HE), which uses backward Helmert contrasts with the first category of each variable omitted (e.g., $\alpha_3 - \alpha_{1:2}$ compares the average of the last age category with the average of the two preceding age categories); (6) HE Forward, also the HE, which uses forward Helmert contrasts with the last category of each variable omitted (e.g., $\alpha_1 - \alpha_{2:3}$ compares the average of the first age category with the average of the two subsequent age categories); (7) DE Backward, or the difference estimator (DE), which uses backward successive differences with the first category of each variable omitted (e.g., $\alpha_2 - \alpha_1$ and $\alpha_3 - \alpha_2$ compare the averages of adjacent age categories); (8) DE Forward, also the DE, which uses forward successive differences with the last category of each variable omitted (e.g., $\alpha_1 - \alpha_2$ and $\alpha_2 - \alpha_3$ compare the averages of adjacent age categories); and (9) OE, or the orthogonal estimator (OE), which uses sum-to-zero orthogonal polynomial contrasts or, more generally, sum-to-zero effect coding with the linear and nonlinear components orthogonalized (e.g., $\alpha$ is the age linear effect, $\alpha^2$ is the quadratic age effect, and $\alpha^3$ is the cubic age effect). Table

Table 2: Decomposition of sum-to-zero effects using the transformation matrix.

Table 3: Canonical form of linear constraints with $I = 3$ and $J = 3$ groups.

Table 4: Canonical form of linear constraints with $I = 3$ and $J = 4$ groups.

Table 5: Canonical form of linear constraints with I =

Table 6: Comparison of MP estimators: nonlinear and linear effects.

Table 7: Estimates and goodness-of-fit statistics for IE Last and two-factor models.

Table 8: Sensitivity of the IE Last to number of period (and cohort) groups. Number of age groups is set at $I = 3$ for all simulations. Sample size for each simulation is $n = 100 \times (I \times J)$. Shaded row indicates initial simulated data in which, by construction, the IE constraint is satisfied. Due to rounding, some IE constraints displayed here will not equal zero exactly.

Table 9: Robustness of the OE to number of period (and cohort) groups. Number of age groups is set at $I = 3$ for all simulations. Sample size for each simulation is $n = 100 \times (I \times J)$. Shaded row indicates initial simulated data in which, by construction, the OE constraint is satisfied.

Table 10: Sensitivity of the IE Last to age nonlinearities.

Table 11: Robustness of the OE to age nonlinearities.