# Data Analysis

This section describes the statistics and methods used to describe the characteristics of the members of a sample or population, to explore the relationships between variables, to test research hypotheses, and to represent data visually. Terms relating to the topics covered are defined in the Research Glossary.

• Descriptive Statistics

• Tests of Significance

• Graphical/Pictorial Methods

• Analytical Techniques

Descriptive statistics can be useful for two purposes:

1. To provide basic information about the characteristics of a sample or population. These characteristics are represented by variables in a research study dataset.

2. To highlight potential relationships between these characteristics, or the relationships among the variables in the dataset.

The four most common descriptive statistics are:

#### Proportions, Percentages and Ratios

One of the most basic ways of describing the characteristics of a sample or population is to classify its individual members into mutually exclusive categories and to count the number of cases in each category. In research, variables with discrete, qualitative categories are called nominal or categorical variables. The categories can be given numerical codes, but they cannot be ranked, added, or multiplied. Examples of nominal variables include gender (male, female), preschool program attendance (yes, no), and race/ethnicity (White, African American, Hispanic, Asian, American Indian). Researchers calculate proportions, percentages and ratios in order to summarize the data from nominal or categorical variables and to allow for comparisons to be made between groups.

Proportion—The number of cases in a category divided by the total number of cases across all categories of a variable.

Percentage—The proportion multiplied by 100 (or the number of cases in a category divided by the total number of cases across all categories of a variable, times 100).

Ratio—The number of cases in one category to the number of cases in a second category.

Example:

A researcher selects a sample of 100 students from a Head Start program. The sample includes 20 White children, 30 African American children, 40 Hispanic children and 10 children of mixed-race/ethnicity.

Proportion of Hispanic children in the program = 40 / (20+30+40+10) = .40.

Percentage of Hispanic children in the program = .40 x 100 = 40%.

Ratio of Hispanic children to White children in the program = 40/20 = 2.0, or the ratio of Hispanic to White children enrolled in the Head Start program is 2 to 1.
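The same arithmetic can be written as a short Python sketch. It uses only the counts given in the example; the variable names are illustrative and not part of the original guide.

```python
# Counts from the Head Start example above.
counts = {
    "White": 20,
    "African American": 30,
    "Hispanic": 40,
    "Mixed race/ethnicity": 10,
}

total = sum(counts.values())  # total number of cases across all categories

# Proportion: cases in one category divided by all cases.
proportion_hispanic = counts["Hispanic"] / total

# Percentage: the proportion multiplied by 100.
percentage_hispanic = proportion_hispanic * 100

# Ratio: cases in one category relative to cases in a second category.
ratio_hispanic_to_white = counts["Hispanic"] / counts["White"]

print(f"Proportion: {proportion_hispanic:.2f}")
print(f"Percentage: {percentage_hispanic:.0f}%")
print(f"Ratio (Hispanic to White): {ratio_hispanic_to_white:.1f} to 1")
```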

#### Measures of Central Tendency

Proportions, percentages and ratios are used to summarize the characteristics of a sample or population that fall into discrete categories. Measures of central tendency are the most basic and, often, the most informative description of a population's characteristics, when those characteristics are measured using an interval scale. The values of an interval variable are ordered, and the distance between any two adjacent values is the same, but the zero point is arbitrary. Values on an interval scale can be added and subtracted. Examples of interval scales or interval variables include household income, years of schooling, hours a child spends in child care and the cost of child care.

Measures of central tendency describe the "average" member of the sample or population of interest. There are three measures of central tendency:

Mean—The arithmetic average of the values of a variable. To calculate the mean, all the values of a variable are summed and divided by the total number of cases.

Median—The value within a set of values that divides the values in half (i.e. 50% of the variable's values lie above the median, and 50% lie below the median).

Mode—The value of a variable that occurs most often.

Example:

The annual incomes of five randomly selected people in the United States are \$10,000, \$10,000, \$45,000, \$60,000, and \$1,000,000.

Mean Income = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = \$225,000.

Median Income = \$45,000.

Modal Income = \$10,000.

The mean is the most commonly used measure of central tendency. Medians are generally used when a few values are extremely different from the rest of the values (this is called a skewed distribution). For example, the median income is often the best measure of the average income because, while most individuals earn between \$0 and \$200,000 annually, a handful of individuals earn millions.
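As a minimal sketch, the three measures for the income example can be computed with Python's standard `statistics` module. The list below simply restates the five incomes from the example.

```python
import statistics

# Annual incomes from the example above, in dollars.
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

mean_income = statistics.mean(incomes)      # sum of values / number of cases
median_income = statistics.median(incomes)  # middle value of the sorted list
modal_income = statistics.mode(incomes)     # most frequently occurring value

print(mean_income, median_income, modal_income)
```

The single income of 1,000,000 pulls the mean far above the median, which is why the median is often preferred for skewed variables such as income.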

#### Measures of Dispersion

Measures of dispersion provide information about the spread of a variable's values. There are three key measures of dispersion:

Range is simply the difference between the smallest and largest values in the data. Researchers often report simply the values of the range (e.g., 75 – 100).

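Variance is a commonly used measure of dispersion, or how spread out a set of values is around the mean. It is calculated by averaging the squared differences between each value and the mean. (When the data are a sample, the sum of the squared differences is usually divided by the number of cases minus one, n - 1, rather than by n.) The variance is the standard deviation squared.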

Standard deviation, like variance, is a measure of the spread of a set of values around the mean of the values. The wider the spread, the greater the standard deviation and the greater the range of the values from their mean. A small standard deviation indicates that most of the values are close to the mean. A large standard deviation on the other hand indicates that the values are more spread out. The standard deviation is the square root of the variance.

Example:

Five randomly selected children were administered a standardized reading assessment. Their scores on the assessment were 50, 50, 60, 75 and 90, with a mean score of 65.

Range = 90 - 50 = 40.

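Variance = [(50 - 65)² + (50 - 65)² + (60 - 65)² + (75 - 65)² + (90 - 65)²] / (5 - 1) = 1,200 / 4 = 300. (The sum of squared differences is divided by n - 1 = 4, the usual convention for a sample; dividing by n = 5 would give 240.)

Standard Deviation = Square Root (300) = 17.32.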
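The reported variance of 300 corresponds to dividing the sum of squared differences (1,200) by n - 1 = 4, the usual sample convention; dividing by n = 5 gives 240 instead. The sketch below shows both conventions using Python's standard `statistics` module, with the scores taken from the example.

```python
import statistics

# Reading assessment scores from the example above.
scores = [50, 50, 60, 75, 90]

score_range = max(scores) - min(scores)

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = statistics.variance(scores)

# Population variance: sum of squared deviations divided by n.
population_variance = statistics.pvariance(scores)

# Sample standard deviation: square root of the sample variance.
sample_sd = statistics.stdev(scores)

print(score_range, sample_variance, population_variance, round(sample_sd, 2))
```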

Skewness and Kurtosis

The range, variance and standard deviation are measures of dispersion and provide information about the spread of the values of a variable. Two additional measures provide information about the shape of the distribution of values.

Skew is a measure of whether some values of a variable are extremely different from the majority of the values. Skewness refers to the tendency of the values of a variable to depart from symmetry. A distribution is symmetric if one half of the distribution is exactly equal to the other half. For example, the distribution of annual income in the U.S. is skewed because most people make between \$0 and \$200,000 a year, but a handful of people earn millions. A variable is positively skewed (skewed to the right) if the extreme values are higher than the majority of values. A variable is negatively skewed (skewed to the left) if the extreme values are lower than the majority of values. In the example of students' standardized test scores, the distribution is slightly positively skewed.

Kurtosis measures how outlier-prone a distribution is. Outliers are values of a variable that are much smaller or larger than most of the values found in a dataset. The kurtosis of a normal distribution is 0 when it is reported as excess kurtosis (that is, kurtosis minus 3). If the kurtosis is different from 0, then the distribution produces outliers that are either more extreme (positive kurtosis) or less extreme (negative kurtosis) than are produced by the normal distribution.
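One common way to compute these two shape measures is from the central moments of the data. The sketch below applies the simple moment-based formulas to the five test scores from the dispersion example. This is an assumption about the formula: statistical packages often apply small-sample corrections, so their values may differ slightly, and with only five cases the results are purely illustrative.

```python
# Reading assessment scores from the dispersion example.
scores = [50, 50, 60, 75, 90]
mean_score = sum(scores) / len(scores)


def central_moment(values, center, power):
    """Average of (value - center) raised to the given power."""
    return sum((v - center) ** power for v in values) / len(values)


m2 = central_moment(scores, mean_score, 2)  # population variance
m3 = central_moment(scores, mean_score, 3)
m4 = central_moment(scores, mean_score, 4)

# Moment-based skewness: positive when the extreme values are high.
skewness = m3 / m2 ** 1.5

# Excess kurtosis: 0 for a normal distribution.
excess_kurtosis = m4 / m2 ** 2 - 3

print(f"Skewness: {skewness:.3f}")
print(f"Excess kurtosis: {excess_kurtosis:.3f}")
```

The skewness comes out positive, consistent with the statement that the test score distribution is slightly positively skewed.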

#### Measures of Association

Measures of association indicate whether two variables are related. Two measures are commonly used:

• Chi-square test of independence

• Correlation

Chi-Square test of independence is used to evaluate whether there is an association between two variables. (The chi-square test can also be used as a measure of goodness of fit, to test if data from a sample come from a population with a specific distribution, as an alternative to Anderson-Darling and Kolmogorov-Smirnov goodness-of-fit tests.)

• It is most often used with nominal data (i.e., data that are put into discrete categories: e.g., gender [male, female] and type of job [unskilled, semi-skilled, skilled]) to determine whether they are associated. However, it can also be used with ordinal data.

• Assumes that the samples being compared (e.g., males, females) are independent.

• Tests the null hypothesis of no association between the two variables (i.e., type of job is not related to gender).

To test for associations, a chi-square is calculated in the following way: Suppose a researcher wants to know whether there is a relationship between gender and two types of jobs, construction worker and administrative assistant. To perform a chi-square test, the researcher counts the number of female administrative assistants, the number of female construction workers, the number of male administrative assistants, and the number of male construction workers in the data. These counts are compared with the number that would be expected in each category if there were no association between job type and gender (this expected count is based on statistical calculations). The association between the two variables is determined to be significant (the null hypothesis is rejected), if the value of the chi-square test is greater than or equal to the critical value for a given significance level (typically .05) and the degrees of freedom associated with the test found in a chi-square table. The degrees of freedom for the chi-square are calculated using the following formula: df = (r-1)(c-1) where r is the number of rows and c is the number of columns in a contingency or cross-tabulation table. For example, the critical value for a 2 x 2 table with 1 degree of freedom ([2-1][2-1]=1) is 3.841.
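The calculation described above can be sketched in a few lines of Python. The counts are hypothetical and were made up for illustration. The expected count for each cell is computed as (row total x column total) / grand total, which is the standard way of obtaining the counts expected under no association. The critical value 3.841 is the one quoted above for 1 degree of freedom at the .05 level.

```python
# Hypothetical counts (not from any real study).
# Rows: female, male. Columns: administrative assistant, construction worker.
observed = [
    [30, 10],
    [15, 45],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Count expected in this cell if gender and job type were unrelated.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) x (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_VALUE_05_DF1 = 3.841  # from a chi-square table, alpha = .05, df = 1
reject_null = chi_square >= CRITICAL_VALUE_05_DF1

print(f"chi-square = {chi_square:.2f}, df = {df}, reject null: {reject_null}")
```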

Correlation coefficient is used to measure the strength and direction of the relationship between numeric variables (e.g., weight and height).

• The most common correlation coefficient is the Pearson's product-moment correlation coefficient (or simply Pearson's r), which can range from -1 to +1.

• Values closer to 1 (either positive or negative) indicate that a stronger association exists between the two variables.

• A positive coefficient (values between 0 and 1) suggests that larger values of one of the variables are accompanied by larger values of the other variable. For example, height and weight are usually positively correlated because taller people tend to weigh more.

• A negative coefficient (values between 0 and -1) suggests that larger values of one of the variables are accompanied by smaller values of the other variable. For example, age and hours slept per night are often negatively correlated because older people usually sleep fewer hours per night than younger people.
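As an illustration, Pearson's r can be computed directly from its definition: the sum of the products of the paired deviations from the means, divided by the square root of the product of the two sums of squared deviations. The heights and weights below are hypothetical values chosen only to show a positive association.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation for two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of paired deviations from the means.
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data: height in centimeters, weight in kilograms.
heights = [150, 160, 165, 172, 180]
weights = [52, 58, 63, 70, 80]

r = pearson_r(heights, weights)
print(f"Pearson's r = {r:.3f}")
```

Because taller people in this made-up sample also weigh more, r is positive and close to 1.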

The findings reported by researchers are typically based on data collected from a single sample that was drawn from the population of interest (e.g., a sample of children selected from the population of children enrolled in Head Start or Early Head Start). If additional random samples of the same size were drawn from this population, the estimated percentages and means calculated using the data from each of these other samples might differ by chance somewhat from the estimates produced from one sample. Researchers use one of several tests to evaluate whether their findings are statistically significant.

Statistical significance refers to the probability or likelihood that the difference between groups or the relationship between variables observed in statistical analyses is not due to random chance (e.g., that differences between the average scores on a measure of language development between 3- and 4-year-olds are likely to be “real” rather than just observed in this sample by chance). If there is a very small probability that an observed difference or relationship is due to chance, the results are said to reach statistical significance. This means that the researcher concludes that there is a real difference between two groups or a real relationship between the observed variables.

Significance tests and the associated p-value tell us only how consistent a statistical result (e.g., a difference between the means of two or more groups, or a correlation between two variables) is with chance alone. The p-value is the probability of obtaining a result at least as extreme as the one observed if there were in fact no difference or relationship in the population (i.e., if the null hypothesis were true). In the social and behavioral sciences, a p-value less than or equal to .05 is usually interpreted to mean that the results are statistically significant (a result this extreme would occur by chance 5 times or fewer out of 100 if the null hypothesis were true), although sometimes researchers use a threshold of .10 to indicate whether a result is statistically significant. The lower the p-value, the less consistent a statistical result is with chance alone. Lower thresholds are therefore a more rigorous criterion for concluding significance.

Researchers use a variety of approaches to test whether their findings are statistically significant or not. The choice depends on several factors, including the number of groups being compared, whether the groups are independent from one another, and the type of variables used in the analysis. Three of the more widely used tests of statistical significance, the Chi-square test, the t-test, and the F-test, are described briefly below.

• Chi-Square test is used when testing for associations between categorical variables (e.g., differences in whether a child has been diagnosed as having a cognitive disability by gender or race/ethnicity). It is also used as a goodness-of-fit test to determine whether data from a sample come from a population with a specific distribution.

• t-test is used to compare the means of two independent samples (independent t-test), the means of one sample at different times (paired sample t-test) or the mean of one sample against a known mean (one sample t-test). For example, when comparing the mean assessment scores of boys and girls or the mean scores of 3- and 4-year-old children, an independent t-test would be used. When comparing the mean assessment scores of girls only at two time points (e.g., fall and spring of the program year) a paired t-test would be used. A one sample t-test would be used when comparing the mean scores of a sample of children to the mean score of a population of children. The t-test is appropriate for small sample sizes (fewer than 30) although it is often used when testing group differences for larger samples. It is also used to test whether correlation and regression coefficients are significantly different from zero.

• F-test is an extension of the t-test and is used to compare the means of three or more independent samples (groups). The F-test is used in Analysis of Variance (ANOVA) to examine the ratio of the between groups to within groups variance. It is also used to test the significance of the total variance explained by a regression model with multiple independent variables.
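To make the ratio of between-groups to within-groups variance concrete, the sketch below computes the F statistic for a one-way ANOVA by hand. The three groups of assessment scores are hypothetical. Judging significance would additionally require comparing the statistic with a critical value from an F table for the two degrees of freedom.

```python
# Hypothetical assessment scores for three independent groups.
groups = [
    [51, 52, 53],
    [52, 53, 54],
    [55, 56, 57],
]

all_values = [v for group in groups for v in group]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(group) / len(group) for group in groups]

# Between-groups sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(group) * (mean - grand_mean) ** 2
    for group, mean in zip(groups, group_means)
)

# Within-groups sum of squares: how far each score is from its own group mean.
ss_within = sum(
    (v - mean) ** 2
    for group, mean in zip(groups, group_means)
    for v in group
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the between-groups to the within-groups mean squares.
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```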

Significance tests alone do not tell us anything about the size of the difference between groups or the strength of the association between variables. Because significance test results are sensitive to sample size, studies with different sample sizes with the same means and standard deviations would have different t statistics and p values. It is therefore important that researchers provide additional information about the size of the difference between groups or the association and whether the difference/association is substantively meaningful.
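The sensitivity to sample size can be seen in a small sketch of the independent-samples t statistic. It assumes two groups of equal size with the same standard deviation, in which case the standard error of the difference is the standard deviation times the square root of 2/n. The means, standard deviation, and sample sizes are hypothetical.

```python
import math


def t_statistic_equal_groups(mean1, mean2, sd, n_per_group):
    """Independent-samples t for two equal-sized groups sharing one SD."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return (mean1 - mean2) / standard_error


# Same means (52 vs. 50) and same standard deviation (10) in both studies;
# only the number of children per group differs.
t_small_study = t_statistic_equal_groups(52, 50, 10, n_per_group=20)
t_large_study = t_statistic_equal_groups(52, 50, 10, n_per_group=500)

print(f"n = 20 per group:  t = {t_small_study:.2f}")
print(f"n = 500 per group: t = {t_large_study:.2f}")
```

The two-point difference is identical in both studies, yet only the larger study yields a t statistic above roughly 2, the approximate cutoff for a two-sided test at the .05 level. This is why the size of the difference should be reported alongside the test result.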

There are several graphical and pictorial methods that enhance understanding of individual variables and the relationships between variables. Graphical and pictorial methods provide a visual representation of the data. Some of these methods include:

• Bar charts

• Pie charts

• Line graphs

• Scatter plots

• Geographic Information Systems (GIS)

• Sociograms

#### Bar charts

• Bar charts visually represent the frequencies or percentages with which different categories of a variable occur.

• Bar charts are most often used when describing the percentages of different groups with a specific characteristic, for example, the percentages of boys and girls who participate in team sports. However, they may also be used when describing averages, such as the average amount of time boys and girls spend per week participating in team sports.

• Each category of a variable (e.g., gender [boys and girls], children's age [3, 4, and 5]) is displayed along the bottom (or horizontal or X axis) of a bar chart.

• The vertical axis (or Y axis) includes the values of the statistic on which the groups are being compared (e.g., percentage participating in team sports).

• A bar is drawn for each of the categories along the horizontal axis and the height of the bar corresponds to the frequency or percentage with which that value occurs.

#### Pie charts

• A pie chart (or a circle chart) is one of the most commonly used methods for graphically presenting statistical data.

• As its name suggests, it is a circular graphic, which is divided into slices to illustrate the proportion or percentage of a sample or population that belongs to each of the categories of a variable.

• The size of each slice represents the proportion or percentage of the total sample or population with a specific characteristic (found in a specific category). For example, the percentage of children enrolled in Early Head Start who are members of different racial/ethnic groups would be represented by different slices with the size of each slice proportionate to the group's representation in the total population of children enrolled in the Early Head Start program.

#### Line graphs

• A line graph is a type of chart that displays information as a series of data points connected by straight line segments.

• Line graphs are often used to show changes in a characteristic over time.

• A line graph has an X-axis (horizontal axis) and a Y-axis (vertical axis). The time segments of interest (e.g., years, months) are displayed on the X-axis. The range of values that the characteristic of interest can take (e.g., annual household income, mean years of schooling, average cost of child care) is displayed along the Y-axis. A data point is plotted at the Y value observed for each value of the X variable, and a line is drawn connecting the points.

#### Scatter plots

• Scatter plots display the relationship between two quantitative or numeric variables by plotting the values of one variable against the values of the other variable.

• The values of one of the two variables are displayed on the horizontal axis (x axis) and the values of the other variable are displayed on the vertical axis (y axis).

• Each person or subject in a study would receive one data point on the scatter plot that corresponds to his or her values on the two variables. For example, a scatter plot could be used to show the relationship between income and children's scores on a math assessment. A data point for each child in the study showing his or her math score and family income would be shown on the scatter plot. Thus, the number of data points would equal the total number of children in the study.

#### Geographic Information Systems (GIS)

• A Geographic Information System is computer software capable of capturing, storing, analyzing, and displaying geographically referenced information; that is, data identified according to location.

• Using a GIS program, a researcher can create a map to represent data relationships visually. For example, the National Center for Education Statistics creates maps showing the characteristics of school districts across the United States such as the percentage of children living in married couple households, median family incomes and percentage of population that speaks a language other than English. The data that are linked to school district location come from the American Community Survey.

#### Sociograms

• Sociograms display networks of relationships among the members of a group, enabling researchers to identify the nature of relationships that would otherwise be too complex to conceptualize.

Researchers use different analytical techniques to examine complex relationships between variables. There are three basic types of analytical techniques:

• Regression Analysis

• Grouping Methods

• Multiple Equation Models

#### Regression Analysis

Regression analysis assumes that the dependent, or outcome, variable is directly affected by one or more independent variables. There are four important types of regression analyses:

1. Ordinary least squares (OLS) regression

• OLS regression (also known as linear regression) is used to determine the relationship between a dependent variable and one or more independent variables.

• OLS regression is used when the dependent variable is continuous. Continuous variables, in theory, can take on any value within a range. For example, family child care expenses, measured in dollars, is a continuous variable.

• Independent variables may be nominal, ordinal or continuous. Nominal variables, which are also referred to as categorical variables, have two or more non-numeric or qualitative categories. Examples of nominal variables are children's gender (male, female), their parents' marital status (single, married, separated, divorced), and the type of child care children receive (center-based, home-based care). Ordinal variables are similar to nominal variables except it is possible to order the categories and the order has meaning. For example, children's families’ socioeconomic status may be grouped as low, middle and high.

• When used to estimate the associations between two or more independent variables and a single dependent variable, it is called multiple linear regression.

• In multiple regression, the coefficient for each independent variable (standardized or unstandardized) tells you how much the dependent variable is expected to change when that independent variable increases by one unit (or, for a standardized coefficient, by one standard deviation), holding all the other independent variables constant.
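As a minimal sketch of how an OLS coefficient is obtained, the example below fits a regression with a single independent variable, the simplest case. Multiple regression follows the same least-squares idea but needs matrix calculations, usually left to statistical software. The data (weekly hours in care and weekly child care expenses) are hypothetical.

```python
# Hypothetical data: weekly hours in child care and weekly expenses in dollars.
hours = [10, 20, 30, 40, 50]
expenses = [55, 95, 160, 190, 250]

mean_hours = sum(hours) / len(hours)
mean_expenses = sum(expenses) / len(expenses)

# Least-squares slope: covariation of x and y divided by the variation in x.
sum_xy = sum((x - mean_hours) * (y - mean_expenses) for x, y in zip(hours, expenses))
sum_xx = sum((x - mean_hours) ** 2 for x in hours)
slope = sum_xy / sum_xx

# The fitted line passes through the point of means.
intercept = mean_expenses - slope * mean_hours

print(f"expenses = {intercept:.2f} + {slope:.2f} * hours")
```

Here the slope is the unstandardized coefficient: the expected change in weekly expenses, in dollars, for each additional hour of care.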

2. Logistic regression

• Logistic regression (or logit regression) is a special form of regression analysis that is used to examine the associations between a set of independent or predictor variables and a dichotomous outcome variable. A dichotomous variable is a variable with only two possible values, e.g. child receives child care before or after the Head Start program day (yes, no).

• As in linear regression, the independent variables may be interval, ordinal, or nominal. A researcher might use logistic regression to study the relationships between parental education, household income, and parental employment and whether children receive child care from someone other than their parents (receives nonparent care/does not receive nonparent care).
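A logistic regression model turns a weighted sum of the independent variables into a predicted probability between 0 and 1 by means of the logistic function. The sketch below shows only that prediction step. The intercept and coefficients are hypothetical values chosen for illustration, not estimates from any study, and the function name is likewise illustrative.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """Apply the logistic function to a linear combination of predictors."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1 / (1 + math.exp(-linear_predictor))


# Hypothetical coefficients for two predictors:
# years of parental education and household income in thousands of dollars.
intercept = -2.0
coefficients = [0.10, 0.03]

# Predicted probability of nonparent care for a family with
# 12 years of parental education and an income of 40 (thousand dollars).
probability = predicted_probability(intercept, coefficients, [12, 40])

# Exponentiating a coefficient gives the odds ratio for a one-unit increase.
odds_ratio_education = math.exp(coefficients[0])

print(f"Predicted probability: {probability:.3f}")
print(f"Odds ratio per year of education: {odds_ratio_education:.3f}")
```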

3. Hierarchical linear modeling (HLM)

• Used when data are nested. Nested data occur when several individuals belong to the same group under study. For example, in child care research, children enrolled in a center-based child care program are grouped into classrooms with several classrooms in a center. Thus, the children are nested within classrooms and classrooms are nested within centers.

• Allows researchers to determine the effects of characteristics for each level of nested data, classrooms and centers, on the outcome variables. HLM is also used to study growth (e.g., growth in children’s reading and math knowledge and skills over time).

4. Duration models

• Used to estimate the length of time before a given event occurs or the length of time spent in a state. For example, in child care policy research, duration models have been used to estimate the length of time that families receive child care subsidies.

• Sometimes referred to as survival analysis or event history analysis.

#### Grouping Methods

Grouping methods are techniques for classifying observations into meaningful categories. Two of the most common grouping methods are discriminant analysis and cluster analysis.

Discriminant analysis

• Identifies characteristics that distinguish between groups. For example, a researcher could use discriminant analysis to determine which characteristics identify families that seek child care subsidies and which identify families that do not.

• It is used when the dependent variable is a categorical variable (e.g., family receives child care subsidies [yes, no], child enrolled in family care [yes, no], type of child care child receives [relative care, non-relative care, center-based care]). The independent variables are interval variables (e.g., years of schooling, family income).

Cluster analysis

• Used to classify similar individuals together. It uses a set of measured variables to classify a sample of individuals (or organizations) into a number of groups such that individuals with similar values on the variables are placed in the same group. For example, cluster analysis would be used to group together parents who hold similar views of child care or children who are suspended from school.

• Its goal is to sort individuals into groups in such a way that individuals in the same group (cluster) are more similar to each other than to individuals in other groups.

• The variables used in cluster analysis may be nominal, ordinal or interval.

#### Multiple Equation Models

Multiple equation modeling, which is an extension of regression, is used to examine the causal pathways from independent variables to the dependent variable. For example, what are the variables that link (or explain) the relationship between maternal education (independent variable) and children's early reading skills (dependent variable)? These variables might include the nature and quality of mother-child interactions or the frequency and quality of shared book reading.

There are two main types of multiple equation models:

• Path analysis

• Structural equation modeling

Path analysis

Path analysis is an extension of multiple regression that allows researchers to examine multiple direct and indirect effects of a set of variables on a dependent, or outcome, variable. In path analysis, a direct effect measures the extent to which the dependent variable is influenced by an independent variable. An indirect effect measures the extent to which an independent variable's influence on the dependent variable is due to another variable.

• A path diagram is created that identifies the relationships (paths) between all the variables and the direction of the influence between them.

• The paths can run directly from an independent variable to a dependent variable (e.g., X→Y), or they can run indirectly from an independent variable, through an intermediary, or mediating, variable, to the dependent variable (e.g. X1→X2→Y).

• The paths in the model are tested to determine the relative importance of each.

• Because the relationships between variables in a path model can become complex, researchers often avoid labeling the variables in the model as independent and dependent variables. Instead, two types of variables are found in these models:

• Exogenous variables are not affected by other variables in the model. They have straight arrows emerging from them and not pointing to them.

• Endogenous variables are influenced by at least one other variable in the model. They have at least one straight arrow pointing to them.
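The direct and indirect effects described above can be illustrated with a small sketch for the path X1→X2→Y. In a linear path model, the indirect effect along a path is conventionally the product of the coefficients on that path, and the total effect is the sum of the direct and indirect effects. The coefficient values below are hypothetical.

```python
# Hypothetical standardized path coefficients.
path_x1_to_x2 = 0.5  # X1 -> X2 (independent variable to mediating variable)
path_x2_to_y = 0.4   # X2 -> Y  (mediating variable to dependent variable)
path_x1_to_y = 0.2   # X1 -> Y  (direct effect)

# Indirect effect of X1 on Y through X2: product of the two path coefficients.
indirect_effect = path_x1_to_x2 * path_x2_to_y

# Total effect of X1 on Y: direct effect plus indirect effect.
total_effect = path_x1_to_y + indirect_effect

print(f"Direct: {path_x1_to_y:.2f}")
print(f"Indirect: {indirect_effect:.2f}")
print(f"Total: {total_effect:.2f}")
```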

Structural equation modeling (SEM)

Structural equation modeling expands path analysis by allowing for multiple indicators of unobserved (or latent) variables in the model. Latent variables are variables that are not directly observed (measured), but instead are inferred from other variables that are observed or directly measured. For example, children's school readiness is a latent variable with multiple indicators of children's development across multiple domains (e.g., children's scores on standardized assessments of early math, literacy, and language, and scores based on teacher reports of children's social skills and problem behaviors).

There are two parts to a SEM analysis. First, the measurement model is tested. This involves examining the relationships between the latent variables and their measures (indicators). Second, the structural model is tested in order to examine how the latent variables are related to one another. For example, a researcher might use SEM to investigate the relationships between different types of executive functions and word reading and reading comprehension for elementary school children. In this example, the latent variables word reading and reading comprehension might be inferred from a set of standardized reading assessments and the latent variables cognitive flexibility and inhibitory control from a set of executive function tasks. The measurement model of SEM allows the researcher to evaluate how well children's scores on the standardized reading assessments combine to identify children's word reading and reading comprehension. Assuming that the results of these analyses are acceptable, the researcher would move on to an evaluation of the structural model, examining the predicted relationships between two types of executive functions and two dimensions of reading.