LOG LINEAR ANALYSIS

Log linear analysis is a multivariate statistical technique that allows ordinal or even nominal level variables to be included in multivariate analysis. The underlying assumptions of techniques such as ordinary least squares (OLS) regression or path analysis mean that problems of spuriousness may arise when dummy variables and ordinal variables are included in these types of analyses. For example, one assumption of OLS regression is that the variables are related in a linear fashion. When we create dummy variables, such as Black, White, Asian, Hispanic, and Others from a nominal level variable, the true value of each of these categories is not ascertainable. The value we assign is arbitrary.

Likewise, the values assigned to ordinal level data are not meaningful values, except for the manner in which they are hierarchically arranged. There is not a measurable distance between "very," "somewhat," "not very," and "not at all." Did the respondent choose "very" because he/she felt closer to "very" on the issue at hand than he/she did to "somewhat?" Or did he/she choose "very" because he/she felt more than "somewhat" and wanted to express this?

Log linear analysis transforms non-linear models into essentially linear models through the use of logarithms. Some or all of the variables included can be nominal or ordinal measurements. Like all other causal modelling techniques, log linear analysis requires the researcher to specify a theoretical model prior to testing with the data. In practice, successive models are sometimes tested to find the "best" fit.

The Log-Linear Model

The term log-linear derives from the fact that one can, through logarithmic transformations, restate the problem of analyzing multi-way frequency tables in terms similar to ANOVA. Specifically, we can think of the multi-way frequency table as reflecting various main effects and interaction effects that add together in a linear fashion to bring about the observed table of frequencies.
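As a sketch of what "add together in a linear fashion" means, the model for a two-way table can be written as below. The notation is an assumption introduced here for illustration and does not come from the text above: m_ij is the expected frequency in cell (i, j), A and B are the two factors, and the lambda terms play the role of the grand mean, main effects, and interaction in ANOVA.

```latex
\log m_{ij} = \lambda + \lambda_i^{A} + \lambda_j^{B} + \lambda_{ij}^{AB}
```

Under this notation, a model that assumes no relationship between the two factors is the same equation with the interaction term \lambda_{ij}^{AB} left out.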

Why Log-Linear Analysis?

Crosstabulation is a basic, straightforward method for analyzing data. For example, a researcher may tabulate the scores on a racism index by categories of respondents' race and gender; one could tabulate the number of high school drop-outs by age, gender, and school district. In these cases, the major results can be summarized in a multi-way frequency table -- a crosstabulation table with two or more variables.

Log-Linear analysis is a more "sophisticated" way of looking at crosstabulation tables. Specifically, each of the factors used in the crosstabulation (e.g., age, gender, region, etc.) and their interactions can be tested for statistical significance. The following provides a brief introduction to these methods, their logic, and interpretation.

Correspondence analysis is a descriptive/exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by Factor Analysis techniques. They allow one to explore the structure of the categorical variables included in the table.

Two-way Frequency Tables

The simplest crosstabulation is a 2 by 2 table. Suppose we were interested in the relationship between age and people's work status. We took a sample of 100 subjects, and determined who works 35 hours or more per week (Full Time, FT) and who works fewer than 35 hours per week (Part Time, PT). We also recorded the respondents' age and regrouped them into two categories. The results of this study are summarized as follows:

                         Age
Work Status   Less than 45   45 or older   Total
FT                 40              5         45
PT                 20             35         55
Total              60             40        100

While interpreting the results of this study, we'll introduce the terminology that will allow us to generalize to complex tables more easily.

Dependent and independent variables. In multiple regression or analysis of variance we usually distinguish between independent and dependent variables. The dependent variable is the one that we are trying to explain. We hypothesize the dependent variable to depend on the independent variables. We could classify the variables in the above 2 by 2 table as follows: we may think of work status as the dependent variable, and age as the independent variable. (Some fields of study may use the terms "response variables" and "design variables," respectively. Response variables vary in response to the design variables.) We might hypothesize that people 45 years of age and older are less likely to work full time than people under age 45.
Most statistics test the null hypothesis -- the hypothesis that there is no difference or there is no relationship between the variables. (The null hypothesis would state that there is no difference in work status by age.)

Fitting marginal frequencies (column and row totals). To analyze the example table, we first ask what the frequencies would look like if there were no relationship between the variables (the null hypothesis). Without going into details, intuitively one could expect that the frequencies in each cell would proportionately reflect the marginal frequencies (Totals). For example, consider the following table, which gives the cell frequencies we would expect, given the marginal frequencies, if the null hypothesis of no relationship were true:
 
 

                         Age
Work Status    Less than 45   45 or older   Row Total
FT                  27             18           45
PT                  33             22           55
Column Total        60             40          100

In this table, the proportions of the marginal frequencies are reflected in the individual cells. Sixty percent of the sample are under age 45, and 40% are 45 or older. Forty-five percent work full-time, while 55% work part-time. If you compare this table with the previous one you will see that the previous table does reflect a relationship between the two variables: There are more cases than expected (under the null hypothesis) below age 45 working full-time, and more cases aged 45 or older working part-time.

This illustrates the general principle on which the log-linear analysis is based: Given the marginal totals for two (or more) variables, we can compute the cell frequencies that would be expected if the two (or more) variables are unrelated. Significant differences between the observed frequencies and the expected frequencies suggest that a relationship exists between the two (or more) variables.
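The computation of expected frequencies from the marginal totals can be sketched in a few lines of Python. This is a minimal illustration using the observed age by work status counts from the example; the function and variable names are chosen here and are not part of any particular statistics package.

```python
# Observed counts: rows are work status (FT, PT),
# columns are age (less than 45, 45 or older).
observed = [[40, 5],
            [20, 35]]


def expected_under_independence(table):
    """Return expected cell counts if rows and columns were unrelated.

    Each expected count is (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_under_independence(observed)
for label, row in zip(["FT", "PT"], expected):
    print(label, row)
```

The result reproduces the table of expected frequencies shown earlier (27 and 18 for FT, 33 and 22 for PT).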

Model fitting approach. Fitting the model of two variables that are not related (age and work status) amounts to computing the cell frequencies in the table based on the respective marginal frequencies (totals). Significant deviations of the observed table from those fitted frequencies reflect the lack of fit of the independence (between two variables) model. In that case we would reject that model for our data, and instead accept the model that allows for a relationship or association between age and work status.

Multi-way Frequency Tables

The reasoning for the analysis of a 2 by 2 table can be generalized to more complex tables. Suppose we had a third variable in our study, namely whether or not the individuals in our sample have a disability. Because we are interested in the effect of disability on work status, we will consider Disability as another independent variable.  The resulting table is a three-way frequency table.

Fitting models. We apply our previous reasoning to analyze this table. Specifically, we fit different models that reflect different hypotheses about the data. For example, we could begin with a model that hypothesizes independence between all factors. As before, the expected frequencies in that case would reflect the respective marginal frequencies. If any significant deviations occur, we would reject this model.

Interaction effects. Another conceivable model would be that age is related to work status, and disability is related to work status, but the two (age and disability) independent variables do not interact in their effect. In that case, we would need to simultaneously fit the marginal totals for the two-way table of age by work status collapsed across categories of disability, and the two-way table of disability by work status collapsed across the levels of age. If this model does not fit the data, we would conclude that age, disability, and work status all are interrelated -- age and disability interact in their effect on work status.

The concept of interaction here is analogous to that used in analysis of variance. For example, the age by disability interaction could be interpreted such that the relationship of age to work status is modified by disability. While age brings about only little difference in work status in the absence of disability, age is highly related to work status when disability is present. The effects of age and disability on work status are not additive, but interactive.

Iterative proportional fitting. The computation of expected frequencies becomes increasingly complex when there are more than two variables in the table. However, they can be computed, and, therefore, we can easily apply the reasoning developed for the 2 by 2 table to complex tables. The commonly used method for computing the expected frequencies is the so-called iterative proportional fitting procedure.
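The following is a minimal sketch of iterative proportional fitting for the model described above, in which age and disability are each related to work status but do not interact. The three-way counts are hypothetical (the text gives no three-way table); they were chosen only so that they collapse to the age by work status table used earlier. The function name, the tolerance, and the iteration limit are likewise assumptions made for illustration.

```python
# Hypothetical counts, indexed as observed[age][disability][work]:
# age: 0 = less than 45, 1 = 45 or older
# disability: 0 = no, 1 = yes
# work: 0 = FT, 1 = PT
observed = [
    [[36, 12], [4, 8]],
    [[4, 20], [1, 15]],
]


def fit_two_margins(observed, tol=1e-8, max_iter=100):
    """Fit expected counts that match the age-by-work and
    disability-by-work marginal tables (assumes all margins are positive)."""
    n_age = len(observed)
    n_dis = len(observed[0])
    n_work = len(observed[0][0])
    # Start from a table of ones, which contains no associations at all.
    fitted = [[[1.0] * n_work for _ in range(n_dis)] for _ in range(n_age)]

    for _ in range(max_iter):
        largest_change = 0.0
        # Step 1: rescale cells so the age-by-work margin matches the data.
        for a in range(n_age):
            for w in range(n_work):
                target = sum(observed[a][d][w] for d in range(n_dis))
                current = sum(fitted[a][d][w] for d in range(n_dis))
                for d in range(n_dis):
                    updated = fitted[a][d][w] * target / current
                    largest_change = max(largest_change,
                                         abs(updated - fitted[a][d][w]))
                    fitted[a][d][w] = updated
        # Step 2: rescale cells so the disability-by-work margin matches.
        for d in range(n_dis):
            for w in range(n_work):
                target = sum(observed[a][d][w] for a in range(n_age))
                current = sum(fitted[a][d][w] for a in range(n_age))
                for a in range(n_age):
                    updated = fitted[a][d][w] * target / current
                    largest_change = max(largest_change,
                                         abs(updated - fitted[a][d][w]))
                    fitted[a][d][w] = updated
        # Stop once a full cycle no longer changes the fitted table.
        if largest_change < tol:
            break
    return fitted


fitted = fit_two_margins(observed)
```

For this particular model the fitted values settle after the first cycle; the loop is kept because models that fit more margins generally need several cycles before the fitted table stops changing.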

Goodness-of-Fit

In the previous discussion we made reference to the "significance" of deviations of the observed frequencies from the expected frequencies. The statistical significance of the goodness-of-fit of a particular model can be tested with a Chi-square test. Two types of Chi-squares can be computed, the traditional Pearson Chi-square statistic and the maximum likelihood ratio Chi-square statistic. The two statistics are interpreted in the same way, and their magnitudes are usually very similar when the sample is large. Both tests evaluate whether the expected cell frequencies under the respective model are significantly different from the observed cell frequencies. If so, the respective model for the table is rejected.
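As an illustration, both statistics can be computed for the 2 by 2 age by work status example, using the observed counts and the expected counts under independence given earlier. This is a sketch using only the Python standard library; the variable names are chosen here, and the p-value shortcut shown applies only to one degree of freedom.

```python
import math

# Observed and expected counts (rows FT, PT; columns under 45, 45 or older).
observed = [[40, 5], [20, 35]]
expected = [[27.0, 18.0], [33.0, 22.0]]

# Pair each observed count with its expected count.
pairs = [(o, e)
         for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row)]

# Pearson Chi-square: sum of (observed - expected)^2 / expected.
pearson = sum((o - e) ** 2 / e for o, e in pairs)

# Likelihood ratio Chi-square: 2 * sum of observed * ln(observed / expected).
# Cells with an observed count of zero contribute nothing to the sum.
likelihood_ratio = 2 * sum(o * math.log(o / e) for o, e in pairs if o > 0)

# A 2 by 2 table under independence has (2 - 1) * (2 - 1) = 1 degree of freedom.
# For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
p_pearson = math.erfc(math.sqrt(pearson / 2))
p_likelihood_ratio = math.erfc(math.sqrt(likelihood_ratio / 2))

print(round(pearson, 2), round(likelihood_ratio, 2))
```

Both statistics are far above the conventional critical value for one degree of freedom, so the independence model is rejected for this table.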

Reviewing and plotting residual frequencies. After one has chosen a model for the observed table, it is always a good idea to inspect the residual frequencies, that is, the observed minus the expected frequencies. If the model is appropriate for the table, then all residual frequencies should be "random noise," that is, consist of positive and negative values of approximately equal magnitudes that are distributed evenly across the cells of the table.

Statistical significance of effects. The Chi-squares of models that are hierarchically related to each other can be directly compared. For example, if we first fit a model with the age by work status interaction and the disability by work status interaction, and then fit a model with the age by disability by work status (three-way) interaction, then the second model is a superset of the previous model. We could evaluate the difference in the Chi-square statistics, based on the difference in the degrees of freedom; if the differential Chi-square statistic is significant, then we would conclude that the three-way interaction model provides a significantly better fit to the observed table than the model without this interaction. Therefore, the three-way interaction is statistically significant.
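The comparison can be written compactly as follows. The symbols are assumptions introduced here for illustration: the subscript "reduced" refers to the model without the additional interaction, "full" to the model that includes it, G^2 to the likelihood ratio Chi-square of each model, and df to its degrees of freedom.

```latex
\Delta G^2 = G^2_{\text{reduced}} - G^2_{\text{full}},
\qquad
\Delta df = df_{\text{reduced}} - df_{\text{full}}
```

The difference \Delta G^2 is then evaluated against the Chi-square distribution with \Delta df degrees of freedom.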

In general, two models are hierarchically related to each other if one can be produced from the other by either adding terms (variables or interactions) or deleting terms (but not both at the same time).

Automatic Model Fitting

When analyzing four-way or higher-way tables, finding the best fitting model can become increasingly difficult. You can use automatic model fitting options to facilitate the search for a "good model" that fits the data. The general logic of this algorithm is as follows. First, the program fits a model with no relationships between factors; if that model does not fit (i.e., the respective Chi-square statistic is significant), then it fits a model with all two-way interactions. If that model does not fit either, then the program fits all three-way interactions, and so on. Let us assume that this process found the model with all two-way interactions to fit the data. The program will then proceed to eliminate all two-way interactions that are not statistically significant. The resulting model will be the one that includes the fewest interactions necessary to fit the observed table.

Thanks to UTSA for this Web space.