A global analysis of matches and mismatches between human genetic and linguistic histories


The coevolution of languages and genes represents the ultimate Darwinian paradigm to track population dynamics in time and space, and one of the most frequently evoked parallels between cultural and biological diversity (Cavalli-Sforza, 1991). In recent years, scholars have analyzed this congruence to shed light on population origin, diversification and contact. Popular case studies include the diffusion of major language families, such as Indo-European (Haak et al., 2015) or Austronesian (Gray et al., 2009), as well as smaller regional cases of contact vs. cultural barriers between groups (Pakendorf, 2014).

Genealogical Tree of Dead and Living Languages,
by Félix Gallet (c. 1800).

Mismatches between linguistic and genetic variation are usually disregarded as exceptions to the general pattern. But how often do these events occur? Can we estimate the incidence of language shift and reconstruct more realistic models of cultural evolution? And which circumstances drive these discontinuities in cultural transmission?

To answer these questions at a worldwide as well as a regional scale, we first need a robust panel of genetic diversity that can be matched with relevant linguistic and cultural information on the sampled populations. With this in mind I started developing GeLaTo - Genes and Languages Together, a collection of published genetic population data for population history research, matched with unique linguistic identifiers to facilitate cross-database comparisons and multidisciplinary analysis.

Numerous standardized linguistic databases can be matched with GeLaTo. Available resources include, for example, glottobank (which includes grambank, lexibank, parabank, phonobank, numeralbank), CoBL, soundcomparisons, WALS, Tsammalex, WOLD, AFBO and AUTOTYP. In addition, D-Place (Database of Places, Languages, Cultures and Environment) and Pulotu (Database of Pacific Religions) are great resources for quantitative cultural comparisons.
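As a rough illustration of what matching on a shared linguistic identifier can look like in practice, the sketch below joins a toy table of genetic populations with a toy table of linguistic entries. Everything in it is hypothetical: the field names (`language_id`, `language_family`, `samples`), the placeholder identifiers and the function `match_by_identifier` are invented for the example and do not reflect GeLaTo's actual schema or the format of any of the databases listed above.

```python
# Toy records only: names, identifiers and values are placeholders,
# not data from GeLaTo or from any linguistic database.
genetic_populations = [
    {"population": "PopA", "language_id": "lang0001", "samples": 25},
    {"population": "PopB", "language_id": "lang0002", "samples": 18},
    {"population": "PopC", "language_id": "lang0003", "samples": 30},
]

linguistic_entries = [
    {"language_id": "lang0001", "language_family": "FamilyX"},
    {"language_id": "lang0002", "language_family": "FamilyY"},
]


def match_by_identifier(genetic, linguistic, key="language_id"):
    """Attach linguistic fields to every genetic population sharing the identifier.

    Returns the merged records and the names of populations left unmatched.
    """
    # Index the linguistic table by identifier for constant-time lookup.
    lookup = {entry[key]: entry for entry in linguistic}
    matched, unmatched = [], []
    for pop in genetic:
        entry = lookup.get(pop[key])
        if entry is None:
            unmatched.append(pop["population"])
        else:
            matched.append({**pop, **entry})
    return matched, unmatched


matched, unmatched = match_by_identifier(genetic_populations, linguistic_entries)
for record in matched:
    print(record["population"], record["language_family"])
print("No linguistic match:", unmatched)
```

Keeping the unmatched populations in a separate list, rather than dropping them silently, makes gaps in coverage visible before any comparison is attempted.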


The choice of genetic data included in GeLaTo follows a set of essential guidelines: maximum compatibility and standardization, modern high-quality data, avoidance of ascertainment bias, availability for different regions of the world, and high resolution to capture recent events. The dataset provides processed summary statistics such as genetic diversity within a population, genetic proximity between pairs of populations, sharing of identical motifs, and demographic history reconstructions.


GeLaTo is intended to serve three purposes:
  • Allowing geneticists to properly characterize the human history behind the molecular data, and giving them an accessible reference dataset for regional or worldwide comparisons.
  • Allowing linguists, historians and cultural anthropologists to integrate information on genealogical relatedness and demography, which can be robustly extrapolated from the genetic data.
  • Allowing scholars of various disciplines to approach questions of major relevance on human diversity in a true multidisciplinary perspective, and develop a more realistic understanding of the complex mechanisms behind human migration, contact and cultural transmission.

Our final aim is to develop a more realistic understanding of the complex mechanisms behind cultural transmission. The change of cultural features through time not only affects our ability to trace back human prehistory, but also influences the definition of “population” as the unit of research.

This project is developed in collaboration with Damián Blasi, Balthasar Bickel, Robert Forkel and Russell Gray.



Barbieri, C., Blasi, D. E., Arango-Isaza, E., Sotiropoulos, A. G., Hammarström, H., Wichmann, S., Greenhill, S. J., Gray, R. D., Forkel, R., Bickel, B., & Shimizu, K. K. 2022. A global analysis of matches and mismatches between human genetic and linguistic histories. Proceedings of the National Academy of Sciences, 119(47), e2122084119. https://www.pnas.org/doi/10.1073/pnas.2122084119

  • Cavalli-Sforza LL. 1991. Genes, Peoples and Languages. Sci Am 265:104–110.
  • Gray RD, Drummond AJ, Greenhill SJ. 2009. Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement. Science 323:479–483.
  • Haak W, Lazaridis I, Patterson N, Rohland N, Mallick S, et al. 2015. Massive migration from the steppe was a source for Indo-European languages in Europe. Nature 522:207–211.
  • Pakendorf B. 2014. Coevolution of languages and genes. Curr Opin Genet Dev 29:39–44.