How is the per-language MAL significance computed?

Pegah Faghiri, Kim Gerdes, Sylvain Kahane (2026). Verifying the Menzerath-Altmann law in the verbal domain in 180 languages. UDW26 @ LREC 2026.

This page explains the per-language permutation significance test behind the «Sig.» column on the big effect table and the «21.6% pass the universality test» tile on the home page. It is a standard non-parametric test of the sign and slope of the MAL log-log regression for each language separately, as defined in the cell test_mal_significance_beta() of 08_menzerath_altmann_analysis.ipynb.

Headline result

On 185 UD v2.17 languages (those with enough verbs to fit a regression), the test gives:

[Figure: pies and β histogram]

The test, step by step

  1. For each language we collect the cached MAL signal {n → mean_constituent_size} for all n from 1 up to the largest n that meets the minimum-count threshold (default MIN_COUNT = 100 verbs).
  2. We fit an ordinary least-squares regression in log-log space: log(mean_size) = α + β · log(n). The slope β is the headline statistic: negative means MAL holds (more dependents → smaller constituents); positive means anti-MAL.
  3. We build a null distribution of β by shuffling the y-values (log(mean_size)) with respect to the x-values (log(n)) and refitting the slope. We repeat this 1 000 times (n_permutations = 1000). Under the null hypothesis «n and constituent size are independent», the observed β should be a typical draw from this distribution.
  4. The two-tailed p-value is the fraction of permuted |βperm| at least as extreme as the observed |βobs|. We declare the language significant when p < 0.05 (alpha = 0.05).
  5. We split the significant languages by the sign of the slope. In raw log-log space a language is MAL-compliant when β < 0 (compression: more dependents, smaller constituents). Note that the cached numbers use the opposite sign convention: beta_1max is reported as the positive slope of the inverse relation, so in the reported values a positive β is the MAL direction and a negative β is anti-MAL. The test code therefore uses «significant and beta > 0» directly to set significant_mal (see test_mal_significance_beta for the exact sign handling).
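The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the notebook's test_mal_significance_beta(): the function name permutation_slope_test, its arguments, the use of np.polyfit, the fixed seed, and the toy numbers are all hypothetical. The sign flip of the cached beta_1max is deliberately left out, so β here is the raw log-log slope (negative = MAL direction).

```python
import numpy as np

def permutation_slope_test(n_values, mean_sizes, n_permutations=1000,
                           alpha=0.05, seed=0):
    """Hypothetical sketch: permutation test on the raw log-log slope."""
    x = np.log(np.asarray(n_values, dtype=float))
    y = np.log(np.asarray(mean_sizes, dtype=float))

    def slope(xs, ys):
        # np.polyfit with degree 1 returns [slope, intercept]
        return np.polyfit(xs, ys, 1)[0]

    beta_obs = slope(x, y)

    # Null distribution: shuffle y against a fixed x and refit each time.
    rng = np.random.default_rng(seed)
    beta_perm = np.array([slope(x, rng.permutation(y))
                          for _ in range(n_permutations)])

    # Two-tailed p-value: share of permuted |beta| at least as large as
    # the observed |beta| (tiny tolerance so exact ties are counted).
    p_value = float(np.mean(np.abs(beta_perm) >= abs(beta_obs) - 1e-12))
    significant = p_value < alpha
    return {
        "beta": float(beta_obs),
        "p_value": p_value,
        "significant": bool(significant),
        "significant_mal": bool(significant and beta_obs < 0),   # raw sign
        "significant_anti": bool(significant and beta_obs > 0),  # raw sign
    }

# Invented toy signal with steadily shrinking constituents (not real data).
toy = permutation_slope_test([1, 2, 3, 4, 5, 6, 7, 8],
                             [4.0, 3.4, 3.1, 2.9, 2.7, 2.6, 2.5, 2.4])
print(toy)
```

Because the toy signal decreases monotonically over eight points, almost no shuffle reproduces a slope that steep, so the sketch flags it as significant in the MAL direction.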

What gets shuffled, exactly? A short English walkthrough

Honest framing first. The permutation does not shuffle individual sentences or individual dependents — it shuffles the per-n column of aggregated mean sizes. But seeing one real verb per bucket makes those means concrete, so let us walk through English end‑to‑end.

A · What does one data point really represent?

Each point on the English MAL curve corresponds to one value of n and aggregates all English verbs in UD v2.17 that have exactly n dependents. The y‑value is the average subtree size of those dependents. Here are two real EWT verbs, one from the n=2 bucket and one from the n=4 bucket:

n=2 bucket. The verb pushing has 2 dependents (she, subtree size 1; the stroller, subtree size 2). Mean dependent size = (1+2)/2 = 1.5. This sentence contributes one number to the n=2 average.
n=4 bucket. The verb fear has 4 dependents (then 1, I 1, the achieve‑clause 11, the slip‑clause 8). Mean = (1+1+11+8)/4 = 5.25. This sentence contributes one number to the n=4 average.
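The arithmetic for these two verbs can be checked directly. The variable names below are illustrative only. How the per-verb values are pooled into each bucket's average is not reproduced here.

```python
from statistics import mean

# Subtree sizes of each dependent, copied from the two example verbs above.
pushing_dependents = [1, 2]        # she, the stroller
fear_dependents = [1, 1, 11, 8]    # then, I, achieve-clause, slip-clause

for verb, sizes in [("pushing", pushing_dependents), ("fear", fear_dependents)]:
    # n is the number of dependents, which decides the bucket
    print(f"{verb}: n={len(sizes)}, mean dependent size={mean(sizes)}")
# pushing: n=2, mean dependent size=1.5
# fear: n=4, mean dependent size=5.25
```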

B · The aggregated table the test actually sees

After averaging over all English verbs (not just our two examples), the cached signal data/lang2MAL_full.pkl['en']['total'] collapses to just 5 rows (those n‑values that pass MIN_COUNT=100):

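| n | mean dependent size | log n | log mean | # English verbs in bucket |
| --- | --- | --- | --- | --- |
| 1 | 3.724 | 0.000 | 1.314 | 186 933 |
| 2 | 3.383 | 0.693 | 1.219 | 60 660 |
| 3 | 3.134 | 1.099 | 1.143 | 8 918 |
| 4 | 3.209 | 1.386 | 1.166 | 1 019 |
| 5 | 3.348 | 1.609 | 1.208 | 126 |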

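Fitting OLS on the (log n, log mean) columns gives the observed slope βobs ≈ +0.053, which is very mild. Under the reported sign convention (step 5 above), a positive value is the MAL direction, so English leans only faintly towards MAL.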
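As a rough check, the fit can be rerun on the five rounded rows of the table. This sketch assumes a plain unweighted least-squares fit via np.polyfit and prints the raw log-log slope in the convention of step 2 (negative = MAL direction). It is not expected to match the cached value digit for digit, because the cached sign convention and the exact fitting details live in the notebook.

```python
import numpy as np

# The five English rows from the aggregated table (rounded values).
n = np.array([1, 2, 3, 4, 5])
mean_size = np.array([3.724, 3.383, 3.134, 3.209, 3.348])

x, y = np.log(n), np.log(mean_size)

# Degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(x, y, 1)
print(f"raw log-log slope = {slope:+.3f}, intercept = {intercept:.3f}")
```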

C · The shuffle, in pictures

We now ask: could a slope this small simply arise from a chance pairing of n’s and means? We literally permute the y‑column while keeping the x‑column fixed. Two example shuffles:

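| log n | original log mean | shuffle #1 | shuffle #2 |
| --- | --- | --- | --- |
| 0.000 | 1.314 | 1.166 | 1.208 |
| 0.693 | 1.219 | 1.143 | 1.314 |
| 1.099 | 1.143 | 1.208 | 1.143 |
| 1.386 | 1.166 | 1.314 | 1.219 |
| 1.609 | 1.208 | 1.219 | 1.166 |
| refitted β | +0.053 | −0.052 | −0.018 |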

Each shuffle is one random re-pairing of the same five y‑values to the same five x‑values. We refit OLS on the shuffled table and record the new slope. Repeat 1 000 times.

D · The null distribution and the p-value

For English the 1 000 permuted slopes form a distribution centred on 0, ranging roughly from −0.10 to +0.09. The observed βobs = +0.053 sits squarely in the crowd: about 60.5% of the permutations yield a slope at least as extreme in absolute value. That fraction is the two-tailed p-value, and 0.605 ≫ 0.05, so English is reported as not significant (see its row in the big effect table).

A small ASCII picture of where the observed slope lands in the null distribution:

              count of 1000 permuted β
            ↓
   −0.10  ▁▁▂▃▅▇██▇▅▃▂▁▁   +0.09
                       ↑
                 βobs = +0.053  (≈ 60% of permuted |β| are at least this large)
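With only five rows there are just 5! = 120 possible re-pairings, so the null distribution can also be enumerated exhaustively instead of sampled. The sketch below does that on the rounded table values with an unweighted fit. Both choices are assumptions made for illustration, so the resulting range and p-value may differ from the cached figures quoted above.

```python
from itertools import permutations

import numpy as np

x = np.log([1, 2, 3, 4, 5])
y = np.log([3.724, 3.383, 3.134, 3.209, 3.348])

def slope(xs, ys):
    # least-squares slope of ys on xs
    return np.polyfit(xs, ys, 1)[0]

beta_obs = slope(x, y)

# Every possible re-pairing of the five y-values with the fixed x-values.
betas = np.array([slope(x, np.array(p)) for p in permutations(y)])

# Exact two-tailed p-value over the full permutation set
# (tiny tolerance so exact ties with the observed slope are counted).
p_exact = float(np.mean(np.abs(betas) >= abs(beta_obs) - 1e-12))

print(f"{len(betas)} permutations, "
      f"slopes from {betas.min():+.3f} to {betas.max():+.3f}, "
      f"exact p = {p_exact:.3f}")
```

The enumerated slopes average to zero by construction, which is why the sampled null distribution is centred on 0.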

E · So do dep trees help here?

Partly, yes. Dep trees make it tangible what one verb instance contributes to one bucket’s mean — they rescue the test from feeling abstract. They also make clear why the n=4 bucket of English is so noisy (only 1 019 verbs, often dragged up by occasional long‑clause dependents like the achieve‑clause above).

But, strictly, no. The permutation operates one level up: on the 5‑row aggregated table in panel B, not on the individual sentences in panel A. Shuffling whole sentences would be a different test (and a much more expensive one). What we shuffle here is just the y‑column of those five averages.

Why a permutation test (and not the textbook OLS p-value)?

Three worked examples

Each row shows the actual fitted slope β(1→max) on the cached data and the permutation p-value. Click the language name to jump to its row in the big effect table.

| Language | β(1→max) | p-value | Verdict | Interpretation |
| --- | --- | --- | --- | --- |
| Georgian | +0.117 | <0.001 | MAL✓ | Slope clearly positive, in the MAL direction; the permutation distribution almost never reaches this magnitude by chance, so we reject independence and call it MAL-compliant. |
| OldChurchSlavonic | -0.232 | <0.001 | anti✓ | Slope clearly negative: verbs with more dependents come with larger constituents on average. The permutation distribution rarely produces a slope this negative, so this is a real anti-MAL signal, not noise. |
| Turkish | +0.004 | 0.957 | n.s. | Slope is essentially flat (≈ 0). The permutation distribution easily reproduces the observed magnitude of β by pure label-shuffling, so we cannot reject independence: n.s. simply means «the data don’t tell us either way». |

Caveats and reading guidance

Implementation: test_mal_significance_beta() in 08_menzerath_altmann_analysis.ipynb; results cached to data/mal_universality_test_beta.csv and consumed here via _load_universality_significance() in mal_site.py.