A global analysis of matches and mismatches between human genetic and linguistic histories
This project examines the relationship between cultural and biological diversity to better understand population origins, diversification, and interactions. Instances where linguistic and genetic variation do not align are often dismissed as exceptions to the norm. But how frequently do these discrepancies occur? To answer this, we compiled a new genetic database and integrated it with relevant qualitative and quantitative linguistic and cultural data.
The coevolution of languages and genes offers a powerful Darwinian framework for tracing population dynamics across time and geography, serving as one of the most compelling parallels between cultural and biological diversity (Cavalli-Sforza, 1991). In recent years, researchers have explored this relationship to gain insights into population origins, diversification, and interactions. Notable case studies include the spread of major language families such as Indo-European (Haak et al., 2015) and Austronesian (Gray et al., 2009), as well as more localized examples examining the balance between contact and cultural barriers among groups (Pakendorf, 2014).

Mismatches between linguistic and genetic variation are usually disregarded as exceptions to the general pattern. But how often do these events occur? Can we estimate the incidence of language shift and reconstruct more realistic models of cultural evolution? And which circumstances drive these discontinuities in cultural transmission?

To answer these questions at a worldwide as well as at a regional scale, we first need a robust panel of genetic diversity that can be matched with relevant linguistic and cultural information on the sampled populations. With this in mind, I started developing GeLaTo – Genes and Languages Together, a collection of published genetic population data for research on population history, in which populations are matched with unique linguistic identifiers to facilitate cross-database comparisons and multidisciplinary analysis.
Numerous standardized linguistic databases can be matched with GeLaTo. Available resources include, for example, glottobank (which includes grambank, lexibank, parabank, phonobank, and numeralbank), CoBL, soundcomparisons, WALS, Tsammalex, WOLD, AFBO, and AUTOTYP. D-Place (Database of Places, Languages, Cultures and Environment) and Pulotu (Database of Pacific Religions) are valuable resources for quantitative cultural comparisons.
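The matching step described above can be sketched as a simple join on a shared language identifier. This is a minimal illustration, not the GeLaTo implementation: the field names (`population`, `language_id`, `family`), the placeholder identifiers, and the function `match_by_language_id` are all hypothetical.

```python
def match_by_language_id(genetic_rows, linguistic_rows, key="language_id"):
    """Join genetic population records to linguistic records on a shared identifier.

    Returns the merged records plus the names of populations without a match.
    """
    # Index the linguistic records by identifier for constant-time lookup.
    ling_index = {row[key]: row for row in linguistic_rows}
    matched, unmatched = [], []
    for row in genetic_rows:
        ling = ling_index.get(row[key])
        if ling is None:
            unmatched.append(row["population"])
        else:
            # Merge both records into one combined entry.
            matched.append({**ling, **row})
    return matched, unmatched


# Hypothetical toy records; the identifiers are placeholders, not real codes.
genetic_rows = [
    {"population": "PopA", "language_id": "lang0001"},
    {"population": "PopB", "language_id": "lang0002"},
    {"population": "PopC", "language_id": "lang0099"},
]
linguistic_rows = [
    {"language_id": "lang0001", "family": "FamilyX"},
    {"language_id": "lang0002", "family": "FamilyY"},
]

matched, unmatched = match_by_language_id(genetic_rows, linguistic_rows)
```

Keeping the unmatched populations in a separate list makes gaps in coverage visible before any joint analysis is run.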

The choice of genetic data included in GeLaTo follows a set of essential guidelines: maximum compatibility and standardization, modern high-quality data, avoidance of ascertainment bias, availability for different regions of the world, and high resolution to capture recent events. The dataset provides detailed summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions.
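As an illustration of how pairwise genetic proximity could be combined with linguistic labels, the sketch below flags pairs of populations that are genetically close but assigned to different language families. It is a toy heuristic under stated assumptions, not the method used in the published analyses: the distance values, the threshold, the family labels, and the function `candidate_mismatches` are all hypothetical.

```python
from itertools import combinations


def candidate_mismatches(families, distances, threshold):
    """Return population pairs that are genetically close yet linguistically unrelated.

    families:  dict mapping population name -> language family label
    distances: dict mapping frozenset({pop1, pop2}) -> pairwise genetic distance
    threshold: pairs with a distance below this value count as "close"
    """
    flagged = []
    for a, b in combinations(sorted(families), 2):
        d = distances[frozenset((a, b))]
        if d < threshold and families[a] != families[b]:
            flagged.append((a, b, d))
    return flagged


# Hypothetical toy values for illustration only.
families = {"PopA": "FamilyX", "PopB": "FamilyX", "PopC": "FamilyY"}
distances = {
    frozenset(("PopA", "PopB")): 0.01,
    frozenset(("PopA", "PopC")): 0.02,
    frozenset(("PopB", "PopC")): 0.10,
}

flagged = candidate_mismatches(families, distances, threshold=0.05)
```

A fixed threshold is the simplest possible choice; a real analysis would need to calibrate what counts as "close" against the distribution of distances within and between language families.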
SCIENTIFIC RELEVANCE
- Enabling geneticists to accurately interpret human history through molecular data and providing a comprehensive reference dataset for regional and global comparisons.
- Enabling linguists, historians and cultural anthropologists to integrate information on genealogical relationships and demographic patterns, which can be reliably inferred from genetic data.
- Enabling scholars from diverse disciplines to explore critical questions about human diversity through a genuinely multidisciplinary approach, fostering a deeper and more accurate understanding of the complex mechanisms driving human migration, interaction, and cultural transmission.
Our final aim is to develop a more realistic understanding of the complex mechanisms behind cultural transmission. The change of cultural features through time not only affects our ability to trace human prehistory, but also influences the definition of “population” as the unit of research.
This project started from a collaboration with Damián Blasi, Balthasar Bickel, Robert Forkel and Russell Gray.

The current implementation of the quantitative genetic and linguistic analyses takes advantage of a new curated global linguistic dataset (Graff et al. 2025a), which reduces the proportion of missing data and minimizes dependencies among linguistic features. By integrating these genetic and linguistic datasets, we could address the extent to which contact between groups results in linguistic exchange and homogenisation (Graff et al. 2025b). To identify populations that have experienced demographic contact, we applied ADMIXTURE analysis to GeLaTo. Our findings reveal that populations with demographic contact show an increased likelihood of linguistic sharing between unrelated languages, with features such as word order and consonant sounds being more easily transferred. We also observed instances where languages became more distinct, as groups emphasize linguistic differences to assert unique identities. Both convergence and divergence are essential aspects of the global narrative of language evolution, and our interdisciplinary approach highlights the systematic nature of these dynamics throughout human history.
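To make the idea of detecting demographic contact from admixture results more concrete, the sketch below averages per-individual ancestry proportions within each population and flags populations with a substantial secondary component. It assumes the proportions are already loaded as rows that sum to 1 with at least two components, as in the output of ancestry estimation tools. The flagging rule, the threshold of 0.2, and the function names are illustrative assumptions, not the criteria used in the study.

```python
from collections import defaultdict


def mean_ancestry(q_rows, labels):
    """Average per-individual ancestry proportions within each population."""
    sums, counts = {}, defaultdict(int)
    for q, pop in zip(q_rows, labels):
        if pop not in sums:
            sums[pop] = [0.0] * len(q)
        sums[pop] = [s + x for s, x in zip(sums[pop], q)]
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}


def flag_contact(pop_means, minor_threshold=0.2):
    """Flag populations whose second-largest mean component reaches the threshold."""
    flagged = []
    for pop, props in pop_means.items():
        second_largest = sorted(props, reverse=True)[1]
        if second_largest >= minor_threshold:
            flagged.append(pop)
    return sorted(flagged)


# Hypothetical ancestry proportions for four individuals and K = 2 components.
q_rows = [[0.95, 0.05], [0.90, 0.10], [0.60, 0.40], [0.70, 0.30]]
labels = ["PopA", "PopA", "PopB", "PopB"]

pop_means = mean_ancestry(q_rows, labels)
contact = flag_contact(pop_means)
```

In this toy input only PopB carries a large secondary component, so it is the only population flagged as a candidate for demographic contact.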

FURTHER READING:
Barbieri, C., Blasi, D. E., Arango-Isaza, E., Sotiropoulos, A. G., Hammarström, H., Wichmann, S., Greenhill, S. J., Gray, R. D., Forkel, R., Bickel, B., & Shimizu, K. K. 2022. A global analysis of matches and mismatches between human genetic and linguistic histories. Proceedings of the National Academy of Sciences, 119(47), e2122084119. https://www.pnas.org/doi/10.1073/pnas.2122084119
Graff, A., Chousou-Polydouri, N., Inman, D., Skirgård, H., Lischka, M., Zakharko, T., Barbieri, C., & Bickel, B. 2025a. Curating global datasets of structural linguistic features for independence. Scientific Data, 12(1), 106. https://doi.org/10.1038/s41597-024-04319-4
Graff, A., Blasi, D.E., Ringen, E.J., Bajić, V., Bavelier, D., Shimizu, K.K., Pakendorf, B., Barbieri, C., Bickel, B., 2025b. Patterns of genetic admixture reveal similar rates of borrowing across diverse scenarios of language contact. Science Advances 11, eadv7521. https://doi.org/10.1126/sciadv.adv7521
Pakendorf B. 2014. Coevolution of languages and genes. Curr Opin Genet Dev 29:39–44.
Cavalli-Sforza LL. 1991. Genes, Peoples and Languages. Sci Am 265:104–110.
Gray RD, Drummond AJ, Greenhill SJ. 2009. Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement. Science 323:479–483.
Haak W, Lazaridis I, Patterson N, Rohland N, Mallick S, et al. 2015. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature 522:207–211.
