How is the per-language MAL significance computed?

Pegah Faghiri, Kim Gerdes, Sylvain Kahane (2026). Verifying the Menzerath-Altmann law in the verbal domain in 180 languages. UDW26 @ LREC 2026.

This page explains the per-language permutation significance test behind the «Sig.» column on the big effect table and the «21.6% pass the universality test» tile on the home page. It is a standard non-parametric test of the sign and slope of the MAL log-log regression for each language separately, as defined in the cell test_mal_significance_beta() of 08_menzerath_altmann_analysis.ipynb.

Headline result

On 185 UD v2.17 languages (those with enough verbs to fit a regression), the test gives:

MAL✓ 40/185 = 21.6% show a significantly positive MAL slope (β > 0, p < 0.05).
anti✓ 11/185 = 5.9% show a significantly negative slope (anti-MAL).
n.s. 134/185 = 72.4% are not significant: their slope is not distinguishable from zero given the data they have.

[Figure: pie charts and β histogram]

The test, step by step

For each language we collect the cached MAL signal {n → mean_constituent_size} for all n from 1 up to the largest n that meets the minimum-count threshold (default MIN_COUNT = 100 verbs).
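We fit an ordinary least-squares regression in log-log space: log(mean_size) = α + β · log(n). The slope β is the headline statistic. In this raw regression a negative slope means MAL holds (more dependents → smaller constituents) and a positive slope means anti-MAL; the β values reported on this site use the opposite sign, as the last step explains.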
We build a null distribution of β by shuffling the y-values (log(mean_size)) with respect to the x-values (log(n)) and refitting the slope. We repeat this 1 000 times (n_permutations = 1000). Under the null hypothesis «n and constituent size are independent», the observed β should be a typical draw from this distribution.
The two-tailed p-value is the fraction of permuted |β_perm| at least as extreme as the observed |β_obs|. We declare the language significant when p < 0.05 (alpha = 0.05).
We split the «significant» languages by sign. In the raw log-log regression the MAL direction is β < 0 (compression). The cached numbers, however, use the flipped sign convention: we report beta_1max as the positive slope of the inverse relation. The test code therefore flags MAL-compliant languages directly with «significant and beta > 0», and a significantly negative reported slope counts as anti-MAL (see test_mal_significance_beta for the exact sign handling).
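The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the notebook's test_mal_significance_beta(): the helper names ols_slope and permutation_test_slope, the seed argument and the toy signal are all hypothetical, and the slope is kept in its raw sign (negative = MAL direction), without the sign flip applied to the cached numbers.

```python
import math
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


def permutation_test_slope(signal, n_permutations=1000, alpha=0.05, seed=0):
    """Two-tailed permutation test on the raw log-log slope.

    signal maps n -> mean constituent size (assumed already filtered by MIN_COUNT).
    """
    ns = sorted(signal)
    xs = [math.log(n) for n in ns]
    ys = [math.log(signal[n]) for n in ns]
    beta_obs = ols_slope(xs, ys)

    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # re-pair the y-values with the fixed x-values
        if abs(ols_slope(xs, shuffled)) >= abs(beta_obs):
            extreme += 1

    p_value = extreme / n_permutations
    return {"beta_raw": beta_obs, "p_value": p_value, "significant": p_value < alpha}


# Hypothetical toy signal, NOT taken from any treebank.
toy_signal = {1: 4.0, 2: 3.6, 3: 3.3, 4: 3.1, 5: 2.9, 6: 2.8, 7: 2.7}
result = permutation_test_slope(toy_signal)
print(result)
```

With seven strictly decreasing points, almost no re-pairing reaches the observed |β|, so the toy signal comes out significant with a negative raw slope.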

What gets shuffled, exactly? A short English walkthrough

Honest framing first. The permutation does not shuffle individual sentences or individual dependents — it shuffles the per-n column of aggregated mean sizes. But seeing one real verb per bucket makes those means concrete, so let us walk through English end‑to‑end.

A · What does one data point really represent?

Each point on the English MAL curve corresponds to one value of n and aggregates all English verbs in UD v2.17 that have exactly n dependents. The y‑value is the average subtree size of those dependents. Two real EWT verbs, one from the n=2 bucket and one from the n=4 bucket:

n=2 bucket. The verb pushing has 2 dependents (she, subtree size 1; the stroller, subtree size 2). Mean dependent size = (1+2)/2 = 1.5. This sentence contributes one number to the n=2 average.

n=4 bucket. The verb fear has 4 dependents (then 1, I 1, the achieve‑clause 11, the slip‑clause 8). Mean = (1+1+11+8)/4 = 5.25. This sentence contributes one number to the n=4 average.
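The arithmetic for these two verbs, and the way each verb feeds one number into its bucket, can be written out as a short sketch. The variable names are ours, and averaging per-verb means within a bucket is our reading of «contributes one number to the average»; since every verb in a bucket has the same n, this equals the pooled mean over all dependents in that bucket.

```python
from collections import defaultdict
from statistics import mean

# Subtree sizes of each verb's dependents, as given in the two examples.
verbs = [
    ("pushing", [1, 2]),        # she, the stroller
    ("fear", [1, 1, 11, 8]),    # then, I, the achieve-clause, the slip-clause
]

buckets = defaultdict(list)
for lemma, dependent_sizes in verbs:
    n = len(dependent_sizes)                      # number of dependents
    buckets[n].append(mean(dependent_sizes))      # one number per verb

# Bucket average = mean of the per-verb numbers collected for that n.
signal = {n: mean(values) for n, values in sorted(buckets.items())}
print(signal)  # {2: 1.5, 4: 5.25}
```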

B · The aggregated table the test actually sees

After averaging over all English verbs (not just our two examples), the cached signal data/lang2MAL_full.pkl['en']['total'] collapses to just 5 rows (those n‑values that pass MIN_COUNT=100):

n	mean dependent size	log n	log mean	# English verbs in bucket
1	3.724	0.000	1.314	186 933
2	3.383	0.693	1.219	60 660
3	3.134	1.099	1.143	8 918
4	3.209	1.386	1.166	1 019
5	3.348	1.609	1.208	126
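As a rough check, the log columns and the MIN_COUNT filter can be recomputed from the first, second and last columns of the table. The function name loglog_points is hypothetical, and the filter follows our literal reading of «all n from 1 up to the largest n that meets the minimum-count threshold»; the notebook may implement it differently.

```python
import math

MIN_COUNT = 100

# (n, mean dependent size, number of verbs) copied from the table.
rows = [
    (1, 3.724, 186933),
    (2, 3.383, 60660),
    (3, 3.134, 8918),
    (4, 3.209, 1019),
    (5, 3.348, 126),
]


def loglog_points(rows, min_count=MIN_COUNT):
    """Keep n = 1..n_max, where n_max is the largest n with enough verbs,
    and return the (log n, log mean) pairs used by the regression."""
    n_max = max(n for n, _, count in rows if count >= min_count)
    return [(math.log(n), math.log(size)) for n, size, _ in rows if n <= n_max]


points = loglog_points(rows)
for log_n, log_mean in points:
    print(f"{log_n:.3f}  {log_mean:.3f}")

# A stricter, purely illustrative threshold drops the sparse n=5 bucket.
print(len(loglog_points(rows, min_count=1000)))  # 4
```

The recomputed logs agree with the table up to rounding in the last digit.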

Fitting OLS on the (log n, log mean) columns gives the observed slope β_obs ≈ +0.053: a very mild slope, in the MAL direction under the sign convention of the reported numbers (β > 0 flags MAL).

C · The shuffle, in pictures

We now ask: could a slope this small simply arise from a chance pairing of n’s and means? We literally permute the y‑column while keeping the x‑column fixed. Two example shuffles:

log n	original log mean	shuffle #1	shuffle #2
0.000	1.314	1.166	1.208
0.693	1.219	1.143	1.314
1.099	1.143	1.208	1.143
1.386	1.166	1.314	1.219
1.609	1.208	1.219	1.166
refitted β	+0.053	−0.052	−0.018

Each shuffle is one random re-pairing of the same five y‑values to the same five x‑values. We refit OLS on the shuffled table and record the new slope. Repeat 1 000 times.
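With only five points there are just 5! = 120 distinct re-pairings, so the 1 000 random shuffles necessarily revisit the same ones many times. The sketch below swaps in exhaustive enumeration for the random shuffles and lists every possible refitted slope. It works on the rounded table columns and reports raw OLS slopes (no sign flip); ols_slope is a hypothetical helper, not the notebook's code.

```python
from itertools import permutations

# Rounded (log n, log mean) columns of the five-row English table.
log_n = [0.000, 0.693, 1.099, 1.386, 1.609]
log_mean = [1.314, 1.219, 1.143, 1.166, 1.208]


def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Every way of re-pairing the five y-values with the fixed x-values.
null_slopes = [ols_slope(log_n, list(ys)) for ys in permutations(log_mean)]

print(len(null_slopes))                                      # 120
print(f"{min(null_slopes):+.2f} {max(null_slopes):+.2f}")    # -0.10 +0.09
```

The most extreme slopes any re-pairing of these five values can produce are about −0.10 and +0.09.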

D · The null distribution and the p-value

For English the 1 000 permuted slopes form a distribution centred on 0, ranging roughly from −0.10 to +0.09. The observed β_obs = +0.053 sits squarely in the crowd: about 60.5% of the permutations yield a slope at least as extreme in absolute value. That fraction is the two-tailed p-value, and 0.605 ≫ 0.05, so English is reported as not significant (see its row in the big effect table).

A small ASCII picture of where the observed slope lands in the null distribution:

              count of 1000 permuted β
            ↓
   −0.10  ▁▁▂▃▅▇██▇▅▃▂▁▁   +0.09
                       ↑
                 β_obs = +0.053  (≈ 40th percentile of |β|, since ≈ 60% of permuted |β| are at least as large)

E · So do dep trees help here?

Partly, yes. Dep trees make it tangible what one verb instance contributes to one bucket’s mean — they rescue the test from feeling abstract. They also make clear why the n=4 bucket of English is so noisy (only 1 019 verbs, often dragged up by occasional long‑clause dependents like the achieve‑clause above).

But, strictly, no. The permutation operates one level up: on the 5‑row aggregated table in panel B, not on the individual sentences in panel A. Shuffling whole sentences would be a different test (and a much more expensive one). What we shuffle here is just the y‑column of those five averages.

Why a permutation test (and not the textbook OLS p-value)?

No distributional assumptions. Classical OLS p-values assume normally distributed, homoscedastic residuals. With only 4–7 data points per language and errors that come from heavy-tailed corpus counts, those assumptions cannot be checked and are unlikely to hold.
Few data points. Most languages have only a handful of valid (n, mean-size) points. A permutation test builds its reference distribution from the actual data (here approximated by 1 000 random shuffles) rather than relying on asymptotic theory.
Honest about underpowered languages. Languages with very few points or very small treebanks end up n.s. not because MAL is false there, but because the evidence is too thin. The test makes that visible.
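One way to see why thin data ends up n.s.: with k points there are only k! re-pairings, and the observed pairing always counts as «at least as extreme» as itself, so an exact permutation p-value can never fall below 1/k!. The sketch below tabulates that floor. It is a general property of permutation tests rather than something taken from the notebook, it assumes distinct y-values, and the 1 000-shuffle version estimates the same quantity with some Monte Carlo noise.

```python
import math

ALPHA = 0.05


def smallest_possible_p(k):
    """Lower bound on an exact permutation p-value with k distinct points."""
    return 1 / math.factorial(k)


for k in range(2, 8):
    floor = smallest_possible_p(k)
    print(f"k={k}  re-pairings={math.factorial(k):>4}  "
          f"floor={floor:.4f}  can reach alpha: {floor < ALPHA}")
```

Under these assumptions a language with two or three valid points cannot reach p < 0.05 whatever its slope, and one with four points can do so only if its observed pairing is the single most extreme one.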

Three worked examples

Each row shows the actual fitted slope β(1→max) on the cached data and the permutation p-value. Click the language name to jump to its row in the big effect table.

Language	β(1→max)	p-value	Verdict	Interpretation
Georgian	+0.117	<0.001	MAL✓	Slope clearly positive in the MAL direction; the permutation distribution almost never reaches this magnitude by chance, so we reject independence and call it MAL–compliant.
OldChurchSlavonic	−0.232	<0.001	anti✓	Slope clearly negative: verbs with more dependents come with larger constituents on average. The permutation distribution rarely produces a slope this negative, so this is a real anti-MAL signal, not noise.
Turkish	+0.004	0.957	n.s.	Slope is essentially flat (≈ 0). The permutation distribution easily reproduces the observed |β| by pure label-shuffling, so we cannot reject independence: n.s. simply means «the data don’t tell us either way».
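The verdict column follows mechanically from the two numeric columns. A minimal sketch, assuming the sign convention of the reported numbers (β > 0 is the MAL direction) and α = 0.05; the function name verdict is hypothetical, and the value 0.0005 is only a stand-in for p-values displayed as «<0.001».

```python
def verdict(beta, p_value, alpha=0.05):
    """Map a reported slope and permutation p-value to the table's verdict.

    Assumes the reported sign convention: beta > 0 is the MAL direction.
    """
    if p_value >= alpha:
        return "n.s."
    return "MAL✓" if beta > 0 else "anti✓"


# 0.0005 is an assumed stand-in for "<0.001"; the exact p-values are not shown.
examples = {
    "Georgian": (0.117, 0.0005),
    "OldChurchSlavonic": (-0.232, 0.0005),
    "Turkish": (0.004, 0.957),
}
for language, (beta, p_value) in examples.items():
    print(language, verdict(beta, p_value))
```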

Caveats and reading guidance

«n.s.» is not «no MAL». A non-significant verdict means this language’s sample cannot rule out independence. Many of the 134 n.s. languages do show a point estimate of β in the MAL direction, just not strongly enough to clear the 5% threshold.
No multiple-comparisons correction. We run one test per language. With 185 languages and α = 0.05 we expect about 9 false positives (185 × 0.05 ≈ 9.3) by chance alone, counting both directions; the observed 40 significant-MAL signals are well above that floor (a binomial test against the 5% chance level is highly significant; see notebook cell).
Threshold choice. Languages that don’t reach MIN_COUNT verbs at any n ≥ 2 are excluded from the test entirely; that’s why the denominator can be slightly smaller than the language count shown elsewhere on the site.
Aggregate ≠ per-language. The cross-tabulation tests on the statistical-tests page (Fisher’s exact on VO/OV/NDO contrasts) test population-level patterns and behave very differently: they pool languages, so they have much more power.
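The false-positive floor mentioned under «No multiple-comparisons correction» can be checked with a one-sided binomial tail. The sketch below takes the 5% chance level at face value and uses only the standard library; binomial_upper_tail is a hypothetical helper, not the notebook cell referred to above.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_languages, alpha, n_mal = 185, 0.05, 40

expected_false_positives = n_languages * alpha          # about 9.25
tail = binomial_upper_tail(n_mal, n_languages, alpha)   # chance of 40 or more

print(f"expected by chance: {expected_false_positives:.2f}")
print(f"P(X >= {n_mal}) = {tail:.2e}")
```

The tail probability is many orders of magnitude below 0.05, which is what «highly significant» amounts to here.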

Implementation: test_mal_significance_beta() in 08_menzerath_altmann_analysis.ipynb; results cached to data/mal_universality_test_beta.csv and consumed here via _load_universality_significance() in mal_site.py.