
(Properly) using statistics in social sciences (and journalism)

Laurence Dierickx

Whether in my research or in my courses with (data)journalism students, I regularly observe that statistics are used mechanically, most often limited to simple descriptions. The aim of this post is to provide a framework for better understanding the role of the main statistical tools: what they can tell us, and what conclusions they cannot draw.

 

Descriptive statistics are not neutral

Descriptive statistics (mean, median, standard deviation, percentages) are often presented as a “neutral” step in the analysis, since they describe the data before moving on to interpretation. This view, however, is misleading because describing already involves choosing an interpretive framework.

An average implicitly assumes a certain degree of data homogeneity. For example, in economic journalism, reporting an “average income” for a population can be misleading if that population is very unequal: a small proportion of high incomes can pull the average upwards, giving a biased picture of the real situation of the majority.

A median is based on the idea that the central position is informative. In social sciences, median income is often used to describe a population. While this helps to avoid the effects of extreme values, it can also mask important realities, such as the presence of distinct groups (for example, a stable middle class alongside a vulnerable population facing significant hardship).
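The contrast is easy to see in a few lines of Python. The income figures below are invented for illustration; the point is simply that a handful of very high values pulls the mean well above what most people actually earn, while the median stays put.

```python
# Invented income figures: a few very high values pull the mean upwards,
# while the median stays close to the situation of the majority.
import numpy as np

incomes = np.array([1800, 1900, 2000, 2100, 2200, 2300, 2400, 25000, 40000])

print(f"mean income:   {incomes.mean():,.0f}")      # ~8,900: driven by the two outliers
print(f"median income: {np.median(incomes):,.0f}")  # 2,200: closer to the 'typical' income
```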

A percentage implies that the aggregated categories are relevant and comparable. In journalism, saying that “60% of young people support a measure” assumes that the category “young people” is homogeneous, whereas it can include very different profiles depending on age, education level, or socio-economic status.

The standard deviation is another good example of the tension between common usage and fragile interpretation. It is ubiquitous, yet rarely questioned. In theory, it measures the dispersion around the mean. In practice, however, its interpretation is much less straightforward. Two distributions with very different characteristics can have the same standard deviation. Skewed distributions can make this measure uninformative. Moreover, it is often used without examining the data’s actual distribution.

This routine use can produce artificially symmetrical comparisons between groups that are not symmetrical. Summarising data with “mean ± standard deviation” gives the impression of a simple, comparable structure, whereas the reality may be much more complex. Bimodal distributions, threshold effects, or even distinct subpopulations can thus be completely obscured by this type of summary.
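A short simulation makes the point. The two samples below are generated purely for illustration: one is unimodal, the other is made of two distinct subgroups, yet their “mean ± standard deviation” summaries are nearly identical.

```python
# Two simulated samples with almost the same mean and standard deviation:
# one unimodal, one bimodal (two distinct subgroups). The "mean ± SD"
# summary cannot tell them apart.
import numpy as np

rng = np.random.default_rng(0)

unimodal = rng.normal(loc=50, scale=10, size=10_000)
bimodal = np.concatenate([
    rng.normal(loc=40, scale=2, size=5_000),
    rng.normal(loc=60, scale=2, size=5_000),
])

for name, x in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(f"{name:>8}: mean = {x.mean():5.1f}, sd = {x.std(ddof=1):5.1f}")
```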

This problem becomes even clearer in certain situations where data aggregation completely alters the interpretation. Simpson’s paradox is a classic example: a trend observed across several subgroups can disappear or even reverse when the data are aggregated. In other words, depending on the level of analysis chosen, the data can appear to tell contradictory stories.
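The counts below are constructed (not real data) so that the paradox appears: treatment A has the better success rate in both subgroups, yet the worse rate once the subgroups are pooled, simply because A mostly handled the difficult cases.

```python
# Constructed counts reproducing Simpson's paradox: A beats B in every
# subgroup, but B beats A once the data are aggregated.
data = {
    #               (successes, total)
    "easy cases": {"A": (9, 10),  "B": (80, 90)},
    "hard cases": {"A": (30, 90), "B": (3, 10)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for subgroup, results in data.items():
    for treatment, (ok, n) in results.items():
        totals[treatment][0] += ok
        totals[treatment][1] += n
        print(f"{subgroup:>10} | {treatment}: {ok / n:.0%} ({ok}/{n})")

for treatment, (ok, n) in totals.items():
    print(f"{'pooled':>10} | {treatment}: {ok / n:.0%} ({ok}/{n})")
```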

This kind of situation is not uncommon in the social sciences or (data) journalism. It serves as a reminder that the same dataset can produce different conclusions depending on how it is segmented, grouped, or summarised. Choices regarding categorisation, grouping, or the level of aggregation are never neutral: they directly influence the results observed.

Descriptive statistics are, therefore, not a mere preamble to analysis. They already constitute a first step in interpretation, one that shapes the way in which data are perceived and understood.

 

Associations and comparisons between categorical variables

Descriptive statistics are insufficient to accurately describe the relationships between variables. Observing differences in percentages or means may suggest associations, but does not reveal whether these differences are robust or simply due to chance.

For categorical variables, the chi-square test is commonly used to assess the independence between two variables, for example, between gender and political opinion. Although this test is simple and effective with large samples, its logic is limited: it detects the presence of an association, but does not describe its structure, indicate its direction, or offer a directly interpretable measure of its strength. For small samples, or when certain categories are underrepresented, Fisher’s exact test is preferred; it follows a similar logic but does not rely on asymptotic approximations.

In both cases, a common mistake is to confuse a statistically significant association with a substantial relationship. To go further, it is necessary to introduce effect size measures adapted to categorical variables, such as Cramér’s V or the phi coefficient, which allow us to assess the strength of the association independently of the sample size.
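As a rough sketch, the snippet below runs both tests on an invented 2×2 contingency table with scipy and derives Cramér's V directly from its definition, so the two p-values and the effect size can be read side by side.

```python
# Association between two categorical variables on an invented 2x2 table:
# chi-square test, Fisher's exact test, and Cramér's V as an effect size.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

#                 opinion A  opinion B
table = np.array([[45,       55],     # group 1
                  [65,       35]])    # group 2

chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)

# Cramér's V rescales chi-square by the sample size and table dimensions,
# so it can be read independently of n (0 = no association, 1 = perfect).
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi-square p-value:   {p_chi2:.3f}")
print(f"Fisher exact p-value: {p_fisher:.3f}")
print(f"Cramér's V:           {cramers_v:.2f}")
```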

When comparing continuous variables between groups, Student’s t-test is the most widely used tool. It allows us to test whether two means differ significantly. However, its interpretation again relies on several key assumptions: an approximately normal distribution of the data, independence of the observations and, depending on the version, homogeneity of variances between groups. In real-world social data, these conditions are rarely met entirely.

This does not mean that the t-test should be abandoned altogether: it remains relatively robust to certain violations, particularly when sample sizes are sufficient. When there are significant deviations from its assumptions (e.g. highly skewed distributions, outliers or small samples), however, other approaches may be more appropriate.

For example, nonparametric tests, such as the Mann-Whitney U test, can be used. This test does not assume normality and compares distributions more generally. In other cases, bootstrap approaches allow for estimating differences between groups without strong assumptions about the data’s shape. Finally, when there are significant differences in variance between groups, adapted versions of the t-test (such as Welch’s test) can be used.
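The sketch below puts these options side by side on simulated, skewed data: Student's and Welch's t-tests, the Mann-Whitney U test, and a simple bootstrap of the difference in means. The distributions and sample sizes are arbitrary choices made for illustration.

```python
# Comparing two groups of simulated, skewed data with the approaches above.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(42)
group_a = rng.lognormal(mean=3.0, sigma=0.6, size=80)   # skewed
group_b = rng.lognormal(mean=3.2, sigma=0.9, size=50)   # skewed, different variance

print("Student t:      p =", ttest_ind(group_a, group_b).pvalue)
print("Welch t:        p =", ttest_ind(group_a, group_b, equal_var=False).pvalue)
print("Mann-Whitney U: p =", mannwhitneyu(group_a, group_b).pvalue)

# Bootstrap 95% interval for the difference in means, with no assumption
# about the shape of the distributions.
diffs = [
    rng.choice(group_a, size=group_a.size).mean()
    - rng.choice(group_b, size=group_b.size).mean()
    for _ in range(5_000)
]
low, high = np.percentile(diffs, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean difference: [{low:.1f}, {high:.1f}]")
```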

These alternatives highlight a crucial point: there is no universal test. The choice of tool depends on the data structure and the question being asked.

Statistical tests do not automatically generate conclusions; they provide elements of an answer within a specific framework that must be understood in order to interpret the results correctly.

Interpretive problems become particularly apparent when moving from statistical results to substantive conclusions. In social sciences and data journalism, several errors recur frequently.

A common first mistake is to interpret a statistically significant difference as a large one. For example, in an opinion poll, a variation of a few percentage points between two groups may be statistically significant but empirically negligible. Conversely, potentially large differences may not be detected as significant due to an insufficient sample size.

A second case concerns group comparisons that do not control for contextual variables. Observing that two populations differ on a single variable (income, opinion, behaviour) says nothing, in itself, about the mechanisms behind this difference. Without taking into account factors such as age, education level, or social position, there is a risk of wrongly attributing a direct relationship to what are actually structural effects.

A third common case is causal interpretations based on simple associations. For example, a correlation or difference in means between two groups may be presented as an implicit ‘effect’, even though there is no methodological framework to support a causal relationship. This is particularly evident in certain media analyses of data, where methodological caution is sacrificed in favour of a more direct narrative.

 

Factor analyses, constructs and regressions: moving to modelling

Even before delving into regression, an important step in the social sciences often involves constructing synthetic variables from multiple indicators. This is the case with factor analyses, which reduce the dimensionality of data by identifying latent structures, and with the construction of composite scales, often validated using indicators such as Cronbach’s alpha. These tools are used to measure theoretical concepts, such as satisfaction or social representations, which are not directly observable.
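Cronbach's alpha itself is little more than a ratio of variances, which a few lines of Python can make explicit. The item scores below are invented Likert-type answers (rows are respondents, columns are the items of one scale).

```python
# Cronbach's alpha computed from its definition on invented Likert data.
import numpy as np

items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
])

k = items.shape[1]                              # number of items in the scale
item_variances = items.var(axis=0, ddof=1).sum()
total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed score
alpha = (k / (k - 1)) * (1 - item_variances / total_variance)

print(f"Cronbach's alpha: {alpha:.2f}")  # values above ~0.7 are usually read as acceptable
```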

Regression goes beyond simple comparisons between groups or variables. Linear regression is used when the dependent variable is continuous, while logistic regression is employed when the dependent variable is binary, for example, to estimate the probability of an event. In both cases, regression allows several variables to be controlled for simultaneously, which makes it a powerful tool for isolating conditional associations.
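As an illustration only, the sketch below fits both models with statsmodels on simulated data (the variable names, coefficients and sample size are invented): a linear regression of income on education and age, and a logistic regression of a binary outcome on the same variables.

```python
# Linear and logistic regression on simulated data, each controlling for
# a second variable. Coefficients describe conditional associations only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "education_years": rng.integers(8, 20, size=n),
    "age": rng.integers(18, 65, size=n),
})
# Simulated continuous outcome (income) and binary outcome (voted or not).
df["income"] = 800 + 120 * df["education_years"] + 15 * df["age"] + rng.normal(0, 300, size=n)
true_p = 1 / (1 + np.exp(-(-4 + 0.15 * df["education_years"] + 0.03 * df["age"])))
df["voted"] = (rng.random(n) < true_p).astype(int)

linear = smf.ols("income ~ education_years + age", data=df).fit()
logistic = smf.logit("voted ~ education_years + age", data=df).fit(disp=False)

print(linear.params)            # conditional associations, not causal effects
print(np.exp(logistic.params))  # odds ratios for the logistic model
```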

However, this tool is very often overinterpreted. Three confusions recur in the use of regression models: 1) the fact that a variable is associated with another in a model does not mean that it causes it; 2) a regression coefficient describes a conditional variation within a given model, not a causal mechanism in itself; 3) a statistical model does not represent the social world as it is, but a structured simplification of the observed data.

Any regression is highly dependent on modelling choices. Three elements are particularly important: the variables included, the variables excluded, and the interactions between variables that are left out. These choices are therefore never neutral. Two models built on the same data can lead to different results, or even opposite conclusions, simply by changing the model specification.

 

The problem of p-values

The p-value is probably the most misunderstood and misused statistical tool in the social sciences. It is often interpreted as a direct measure of the truth of a result, which it is not. It does not indicate the probability that the hypothesis is true, nor the magnitude of an effect, nor the overall robustness of a result. It simply indicates the probability of observing data at least as extreme as those observed if the null hypothesis were true.

In other words, the p-value is conditional on an initial hypothesis: it measures the compatibility of the data with a model in which no effect exists.
In practice, however, it is often reduced to a binary logic: p < 0.05 is taken to indicate a “significant” result, while p ≥ 0.05 is interpreted as “not significant.” This simplification transforms a probabilistic tool into an automatic decision rule, masking the true uncertainty in the estimates and leading to an overly rigid interpretation of the results.

Effect size complements this analysis by measuring the actual magnitude of a phenomenon, independently of its statistical significance. Several indicators are commonly used depending on the context: Cohen’s d for comparing means, odds ratios in logistic models, or R² to measure the proportion of variance explained by a model.

These measures are essential because they allow us to distinguish between an effect that is merely detectable statistically and one that is truly substantial empirically.

A result can indeed be statistically significant while practically negligible, particularly when the sample sizes are very large. Conversely, a large effect may not reach statistical significance if the data are insufficient or too noisy.
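A quick simulation illustrates the asymmetry: with two very large simulated samples differing by a trivially small amount, the p-value is tiny while Cohen's d (computed here from its usual pooled-standard-deviation formula) flags the effect as negligible.

```python
# Statistically significant yet practically negligible: a tiny true
# difference becomes "significant" with very large simulated samples.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
a = rng.normal(loc=100.0, scale=15, size=200_000)
b = rng.normal(loc=100.3, scale=15, size=200_000)  # a 0.3-point true difference

t_stat, p_value = ttest_ind(a, b)

pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value:   {p_value:.2e}")   # extremely small: "significant"
print(f"Cohen's d: {cohens_d:.3f}")  # around 0.02: negligible in practice
```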

 

Statistical power, multiplicity of tests, and causal confounding: the limits of inference

Statistical power is the probability of detecting a real effect when it exists. It is, however, often overlooked in the social sciences. Many studies rely on samples that are too small to provide adequate power, meaning their negative conclusions may be unreliable.

This has several important consequences. First, the results become unstable from one study to another. Second, non-replications are frequent, which weakens the robustness of empirical knowledge. Finally, a lack of significance is often wrongly interpreted as a lack of effect, when it may simply reflect a lack of statistical power.

Therefore, the absence of a result is not proof of the absence of an effect.
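Power can be made tangible with statsmodels' power calculator. The effect size and sample sizes below are illustrative assumptions, not recommendations.

```python
# How many observations are needed to detect a small effect, and how much
# power a small study actually has, for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a small effect (d = 0.2)
# with 80% power at alpha = 0.05.
n_needed = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"observations needed per group: {n_needed:.0f}")  # roughly 400

# Power of a study with only 50 observations per group for the same effect.
achieved = analysis.power(effect_size=0.2, nobs1=50, alpha=0.05)
print(f"power with n = 50 per group:   {achieved:.2f}")   # well below 0.8
```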

This difficulty is compounded by another recurring problem: the sheer number of statistical tests. When a large number of analyses are performed on the same data, the probability of obtaining at least one false positive mechanically increases. This can lead to selecting the most “significant” results, neglecting inconclusive results, or even transforming exploratory analyses into definitive conclusions.

Corrections exist, including Bonferroni adjustments or false discovery rate (FDR) control procedures, but they are still too rarely applied or correctly implemented in current practice.
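The mechanics are easy to simulate: twenty tests run on pure noise, followed by the Bonferroni and Benjamini-Hochberg (FDR) corrections as implemented in statsmodels. The number of tests and the sample sizes are arbitrary.

```python
# Test multiplicity on pure noise: raw results versus corrected results.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)

# Twenty comparisons in which the null hypothesis is true by construction.
p_values = np.array([
    ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(20)
])

print(f"chance of at least one false positive over 20 tests: {1 - 0.95 ** 20:.0%}")
print("raw 'significant' results:      ", int((p_values < 0.05).sum()))
print("after Bonferroni correction:    ",
      int(multipletests(p_values, alpha=0.05, method="bonferroni")[0].sum()))
print("after FDR (Benjamini-Hochberg): ",
      int(multipletests(p_values, alpha=0.05, method="fdr_bh")[0].sum()))
```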

These difficulties in inference are often compounded by a more fundamental confusion between correlation and causation. Two variables can be associated without a direct causal link due to confounding factors, reverse causality, or selection bias.
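A toy simulation of a confounder shows how this happens. The variables are invented: ice-cream sales and drownings are both driven by temperature, so they correlate strongly even though neither causes the other, and the association vanishes once temperature is held constant in a regression.

```python
# Spurious correlation induced by a common cause (temperature), on
# simulated data; controlling for the confounder removes the association.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 1_000
temperature = rng.normal(20, 8, size=n)
df = pd.DataFrame({
    "temperature": temperature,
    "ice_cream": 2.0 * temperature + rng.normal(0, 5, size=n),
    "drownings": 0.5 * temperature + rng.normal(0, 3, size=n),
})

print(f"raw correlation: {df['ice_cream'].corr(df['drownings']):.2f}")

# Once temperature is included, ice cream no longer "explains" drownings.
model = smf.ols("drownings ~ ice_cream + temperature", data=df).fit()
print(model.params.round(3))
```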

 


Conclusion

Statistics in the social sciences and data journalism are not a machine for producing results or explanations. They are a set of tools designed to answer specific questions within well-defined frameworks. Making better use of statistics does not mean using more of them, but rather understanding more precisely what they can tell us and, above all, what conclusions they cannot lead us to draw.