Why Most Global DNA Tests Fail the South Asian Test - And What Actually Works
If you took a 23andMe or AncestryDNA test and received a result like "South Asian: 89%" or "Broadly South Asian: 76%" with almost nothing else, you are not alone - and your test did not fail. Your raw DNA data is accurate. What failed is the comparison database. Knowing why this happens, and what it would take to fix it, also explains why ancestry testing in India has been so disappointing for so many people.
The gap between what Indian users expect from a DNA ancestry test and what global services actually deliver is not a minor inconvenience - it is a fundamental technical limitation that no update to the algorithm can fix without first rebuilding the underlying data. The problem lives in the reference panel, and fixing it requires an entirely different approach to data collection than the one global companies have historically taken.
What a Reference Panel Actually Is
A DNA ancestry test does not sequence your entire genome - it reads approximately 700,000 specific positions in your DNA called single-nucleotide polymorphisms (SNPs). These are positions in the genome where individuals differ from one another. The specific pattern of variants you carry at those 700,000 positions is the raw material of an ancestry test.
Your ancestry is not determined by looking at your SNPs in isolation. The algorithm works by comparing your SNP pattern to a reference panel - a curated database of DNA profiles from people with well-documented, multi-generational ancestry in specific places or communities. When the algorithm finds that your DNA pattern most closely resembles the reference profiles for a particular group, it assigns you ancestry from that group.
The mathematical implication of this is critical: if your ancestral population is not in the reference panel, the algorithm cannot identify your ancestry. It will find the closest available category in the database and assign your ancestry there - or, if nothing is close enough, it falls back to a broad catch-all bucket like "South Asian."
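To make the fallback behaviour concrete, here is a minimal sketch of nearest-reference assignment. It is a deliberate simplification: the group names, allele frequencies, distance measure and the `max_distance` cutoff are all invented for illustration, and commercial tests use considerably more sophisticated methods.

```python
import math

# Hypothetical allele frequencies at four SNPs for three reference groups.
# Every name and number here is invented for illustration.
REFERENCE_PANEL = {
    "Reference group A": [0.10, 0.80, 0.35, 0.60],
    "Reference group B": [0.15, 0.70, 0.45, 0.55],
    "Reference group C": [0.60, 0.20, 0.90, 0.10],
}


def distance(a, b):
    """Euclidean distance between two allele-frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(sample, panel, max_distance=0.2, fallback="Broadly South Asian"):
    """Return the closest reference group, or the catch-all label when
    no group lies within max_distance (an arbitrary, assumed cutoff)."""
    best_name, best_dist = min(
        ((name, distance(sample, freqs)) for name, freqs in panel.items()),
        key=lambda pair: pair[1],
    )
    return best_name if best_dist <= max_distance else fallback


close_sample = [0.12, 0.78, 0.36, 0.61]   # resembles group A
unseen_sample = [0.35, 0.50, 0.65, 0.35]  # resembles nothing in the panel

print(assign(close_sample, REFERENCE_PANEL))
print(assign(unseen_sample, REFERENCE_PANEL))
```

The second sample is not "wrong" in any sense; it simply has no neighbour in the panel, so the only label the function can return is the catch-all.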
The reference panel is not a secondary concern or a technical detail. It is the entire product. Everything else - the interface, the reports, the chromosome browser - is built on top of it. A gorgeous app with a thin reference panel gives you a confident-looking answer that is essentially a guess.
The key question to ask any DNA testing company: How many distinct South Asian reference populations does your database include? For most global services, that number is between 5 and 20. For Helixline, it is 200+. That difference is not incremental - it changes the character of what the test can tell you entirely.
How 23andMe and AncestryDNA Built Their South Asian Panels - And Why It Matters
23andMe's reference panel has been constructed primarily from its own customer base, supplemented by external academic datasets. The approach is self-reinforcing: early adopters of DNA testing were disproportionately affluent, English-speaking, and from Western countries. The US diaspora from South Asia skews heavily toward certain communities - Gujaratis, Punjabis, and to a lesser extent Tamils - because these groups had earlier and larger diaspora waves. The reference panel that emerged from this customer base reflects those communities preferentially.
Communities that have smaller diaspora representations - Marathis, Kannadigas, Malayalis, Odias, and thousands of specific endogamous jati groups - are either absent from the reference panel or represented by too few samples to allow accurate assignment. When the algorithm encounters DNA from a Maratha from Pune or a Bunt from coastal Karnataka, there is no adequately populated reference cluster for those communities. The assignment either gets pushed into the nearest available cluster or dissolves into the "Broadly South Asian" catch-all.
AncestryDNA has a structurally similar problem. Their South Asian component was built in part on the 1000 Genomes Project South Asian samples - a foundational genomics dataset that includes five South Asian populations: Gujarati Indians in Houston, Punjabi individuals from Lahore, Telugu speakers from the UK, Tamil Sri Lankans from the UK, and Bangladeshis from Dhaka. Five populations: three sampled from the diaspora and two sampled in Pakistan and Bangladesh, with none collected inside India itself. Out of thousands of actual Indian communities.
The mathematical consequence is straightforward. If your DNA matches 55% to the Punjabi reference cluster, 25% to the Gujarati cluster, and the remaining 20% to no specific South Asian reference with enough statistical confidence, the algorithm does not report the ambiguous 20% as a specific community - it reports it as "Broadly South Asian." That is not the algorithm failing; it is the algorithm being honest about the limits of its data. The problem is that for many Indian users, the ambiguous fraction is the majority of their report.
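The reporting rule described above can be sketched in a few lines. This is an illustration only: the per-segment confidence scores, the 0.8 threshold and the "Unresolved" labels are assumptions, not values published by any testing company.

```python
def build_report(segments, min_confidence=0.8, broad_label="Broadly South Asian"):
    """Keep percentages for confidently assigned clusters and pool the rest.

    segments: list of (cluster_name, percent_of_genome, confidence) tuples.
    min_confidence is an assumed cutoff chosen for this example.
    """
    report = {}
    unresolved = 0
    for cluster, percent, confidence in segments:
        if confidence >= min_confidence:
            report[cluster] = report.get(cluster, 0) + percent
        else:
            unresolved += percent
    if unresolved:
        report[broad_label] = unresolved
    return report


# The 55 / 25 / 20 split from the paragraph above, with invented confidences.
segments = [
    ("Punjabi", 55, 0.93),
    ("Gujarati", 25, 0.88),
    ("Unresolved segment 1", 12, 0.41),
    ("Unresolved segment 2", 8, 0.37),
]

print(build_report(segments))
```

Lowering the threshold would shrink the broad bucket, but only by forcing low-confidence segments into clusters they do not really fit, which is why a denser panel is the real remedy.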
Why South Asian Genetic Diversity Is Particularly Hard to Capture
India does not just have a lot of people - it has an extraordinary concentration of distinct genetic populations compressed into a single subcontinent. The degree of within-India genetic diversity is, in absolute terms, comparable to the genetic diversity of all of Europe combined.
The primary driver of this is endogamy. India has over 4,600 documented caste and community groups, many of which have practiced strict marriage within the group for dozens of generations. When a community of a few thousand people marries exclusively among itself for twenty or thirty generations, its genetic profile becomes distinctive in ways that are measurable and detectable - but only if those communities are represented in a reference panel.
A Tamil Brahmin community is genetically distinguishable from a Tamil Vellalar community, which is distinguishable from a Tamil Nadar community - not because the differences are large, but because endogamy has preserved distinct variant frequencies within each group. The same pattern holds across India: Rajput from Rajasthan versus Jat from Haryana versus Bania from Gujarat versus Kayastha from Bengal are all genetically distinct in ways that a sufficiently dense reference panel can detect.
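A rough way to see why isolation alone produces measurable differences is the textbook Wright-Fisher drift formula, under which a closed population of effective size Ne accumulates F = 1 - (1 - 1/(2·Ne))^t after t generations. The sketch below applies that idealised formula; the population sizes are illustrative assumptions, and real effective sizes are usually smaller than headcounts.

```python
def drift_f(effective_size, generations):
    """Expected drift (loss of heterozygosity) in an idealised closed
    Wright-Fisher population: F = 1 - (1 - 1/(2*Ne))**t."""
    return 1.0 - (1.0 - 1.0 / (2.0 * effective_size)) ** generations


# Illustrative, assumed values: a community with an effective size of
# 1,000 versus one of 10,000, each isolated for 30 generations.
for ne in (1_000, 10_000):
    print(ne, round(drift_f(ne, 30), 4))
```

The smaller community drifts about ten times further over the same period, which is the kind of signal a reference panel can pick up if that community has been sampled.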
South Asia also has one of the most complex population history stories on Earth, with multiple deep admixture layers:
- AASI (Ancient Ancestral South Indians): The deepest layer, related to the earliest settlers of South Asia. Populations with high AASI ancestry are found primarily among Dravidian-speaking communities and Adivasi tribal groups. This component is virtually absent from global reference panels.
- Iranian farmer-related ancestry: Arrived with the spread of agriculture from West Asia and mixed with AASI populations, contributing to the Ancestral South Indian (ASI) and Ancestral North Indian (ANI) components described by Reich et al.
- Steppe pastoralist ancestry: Associated with the Indo-Aryan migrations from the Eurasian steppe, present at varying proportions across all Indian populations - highest in upper-caste North Indians and lowest in South Indian and Adivasi populations.
- Southeast Asian connections: Visible in northeastern Indian populations and in communities like the Munda speakers of central and eastern India.
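The idea of a population being a mixture of ancestral layers can be illustrated with a toy two-source model, in which each allele frequency in the mixed group is a weighted average of the two source frequencies. The sketch below recovers the weight by ordinary least squares. The source names and frequencies are invented, and this is not the method used by research tools or by any particular testing service.

```python
def estimate_mixture(target, source_x, source_y):
    """Least-squares estimate of a in: target = a*source_x + (1-a)*source_y.

    All arguments are equal-length lists of allele frequencies.
    The result is clipped to the range [0, 1].
    """
    numerator = sum((t - y) * (x - y) for t, x, y in zip(target, source_x, source_y))
    denominator = sum((x - y) ** 2 for x, y in zip(source_x, source_y))
    if denominator == 0:
        raise ValueError("sources are identical; the proportion is undefined")
    return max(0.0, min(1.0, numerator / denominator))


# Invented allele frequencies for two hypothetical source populations.
source_x = [0.10, 0.50, 0.80, 0.30]
source_y = [0.60, 0.20, 0.40, 0.90]

# A synthetic target built as 25% source_x and 75% source_y.
target = [0.25 * x + 0.75 * y for x, y in zip(source_x, source_y)]

print(round(estimate_mixture(target, source_x, source_y), 3))
```

The estimate is only as good as the sources supplied: if a layer has no suitable reference, its contribution is absorbed into whichever available source happens to be nearest.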
A reference panel that captures this complexity needs hundreds of distinct Indian population samples, not five or fifteen. The academic literature on South Asian population genetics has documented dozens of genetically distinct community clusters. Replicating even a fraction of that resolution in a commercial reference panel requires deliberate, in-country data collection - not sampling diaspora customers or repurposing academic datasets designed for other purposes.
The Helixline Approach: What a South Asian-First Panel Looks Like
Helixline's reference panel was built specifically for Indian genetic diversity, from the ground up, with India as the starting point rather than an afterthought.
The database includes 200+ South Asian reference populations - not continental or subcontinental aggregates, but community-level groups spanning every major Indian state, major OBC and upper-caste communities, and a significant number of tribal and Adivasi populations. Reference samples were sourced from participants across India - not from diaspora communities, who carry their own admixture from the destination country and whose representation in the reference panel introduces noise rather than signal for users testing in India.
The practical result is resolution that global tests cannot offer. A Tamil Vellalar result looks different from a Tamil Brahmin result. A Punjabi Khatri result is distinguishable from a Punjabi Jat result. A Bengali Brahmin result differs from a Bengali Kayastha result - because these communities are genetically distinct, and the panel is dense enough to detect and report those differences.
Helixline's reports also model the deep ancestral layers specific to South Asia: ANI/ASI proportions, Steppe ancestry contribution, and AASI-related ancestry - components that global tests either do not report or cannot resolve with their sparse South Asian data.
A Practical Example: The Same DNA, Two Different Tests
The difference between a global test and a South Asian-first test is best illustrated with a concrete example. Consider a person from Andhra Pradesh with Telugu heritage from a Kamma family.
The 23andMe result is not wrong. The person is South Asian. But it provides no information about which part of South Asia, which community, or which ancient ancestral streams contribute to their genome. The Helixline result places them in a specific community genetic cluster, separately identifies their Steppe component (from Indo-Aryan ancestral input) from their deep South Indian AASI-rich ancestry, and produces an ancestral narrative that is coherent and informative.
The "Central Asian: 5%" in the 23andMe result is actually the Steppe pastoralist component being misattributed - because the algorithm, lacking dense South Asian references, is comparing the user's Steppe-related ancestry to Central Asian references in the database rather than specifically modelling it as Steppe ancestry that entered India via the Indo-Aryan migration route. With a South Asian-calibrated reference panel, that component is correctly identified and contextualised.
What "Broadly South Asian" Actually Means - The Algorithm Telling You It Does Not Know
When 23andMe returns "Broadly South Asian," it is a specific algorithm output, not a generic label. It means: "I can identify that this DNA is South Asian in origin, but I cannot confidently assign it to any of the more specific South Asian reference populations in my database."
This happens for predictable reasons. If your SNP pattern does not cluster clearly with the Gujarati reference, the Punjabi reference, or the Telugu or Tamil or Bengali references in the database - the only South Asian populations with adequate representation - the algorithm does not invent a category. It reports the assignment it is confident about (South Asian) and flags the rest as unresolvable (Broadly South Asian).
For a Rajput from Rajasthan, a Nair from Kerala, a Kodava from Coorg in Karnataka, a Gond from Madhya Pradesh, or a Meitei from Manipur, the algorithm has no reference cluster that is a close match. Their DNA is clearly South Asian - but none of the available specific clusters fit. The "Broadly" label is the honest output of an algorithm that is doing its best with inadequate data.
This is worth emphasising clearly: there is nothing wrong with your DNA. The limitation is structural and sits entirely in the database, not in the user's genome.
What this means practically: If you received a "Broadly South Asian" result, you likely come from a community that is simply not well-represented in the global reference panel. That does not mean your ancestry is somehow generic - it means the test lacked the data to find your specific community. The answer exists; the database just did not have it.
Does the Reference Panel Problem Affect Health Genetics Too?
For the most common health genetics applications - individual SNP results like BRCA variant status, pharmacogenomic markers, carrier screening - the reference panel limitation does not directly affect accuracy. These are population-agnostic single-variant readings: you either carry the BRCA variant or you do not, and that determination does not depend on the reference panel.
However, for polygenic risk scores (PRS) - the more sophisticated approach to disease risk that combines hundreds of small-effect variants into an overall risk estimate - the reference panel and training population matter enormously.
A polygenic risk score for type 2 diabetes, cardiovascular disease, or breast cancer is calculated based on the effects of hundreds of genetic variants, where the effect size of each variant was estimated in a specific study population. If that population was predominantly European (as most large-scale GWAS studies historically were), the risk score may not be accurate when applied to South Asians - because variant effect sizes can differ between ancestral populations, and the European-calibrated weights may under- or overestimate risk in South Asian genomes.
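In its simplest form a polygenic risk score is a weighted sum: the number of risk alleles carried at each variant (0, 1 or 2) multiplied by that variant's estimated effect size. The sketch below scores one genotype with two different weight sets. The variant names and all weights are invented placeholders, not values from any published study.

```python
def polygenic_score(dosages, weights):
    """Sum of risk-allele dosage (0, 1 or 2) times per-variant effect size.
    Variants missing from the genotype are skipped."""
    return sum(dosages[v] * w for v, w in weights.items() if v in dosages)


# One person's risk-allele counts at four hypothetical variants.
dosages = {"variant_1": 2, "variant_2": 1, "variant_3": 0, "variant_4": 1}

# Two invented weight sets standing in for effect sizes estimated in
# different training populations.
weights_european = {"variant_1": 0.10, "variant_2": 0.20, "variant_3": 0.05, "variant_4": 0.30}
weights_south_asian = {"variant_1": 0.25, "variant_2": 0.05, "variant_3": 0.15, "variant_4": 0.10}

print(round(polygenic_score(dosages, weights_european), 2))
print(round(polygenic_score(dosages, weights_south_asian), 2))
```

The genotype is identical in both calls; only the weights change, and so does the score. That is why the choice of training population matters for this type of report.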
Helixline's health reports in Decode use South Asian-calibrated polygenic risk scores where the relevant training data is available - a distinction that matters for conditions like type 2 diabetes and cardiovascular disease, which have substantially different underlying genetic architectures in South Asian versus European populations.
Why Global Tests Work Better for Europeans: The Structural Reason
This is not a conspiracy or an oversight that went unnoticed. It reflects the economics of the consumer genomics industry.
23andMe was founded in 2006 and launched its consumer test the following year, primarily targeting the US market. The early adopter demographic was affluent, educated, and predominantly European-American. The reference panel grew from this customer base, meaning it progressively accumulated more European samples relative to any other group. As the European reference panel grew denser, European results improved - which attracted more European customers, which further enriched the European reference, in a self-reinforcing cycle.
South Asians represent roughly a quarter of the global population (India alone accounts for about 17%) but made up a much smaller share of consumer genomics customers in the early and mid years of the industry. The commercial incentive to invest in South Asian reference panel development was limited when the revenue-generating customer base was overwhelmingly Western.
There is also a technical reason European diversity resolves more easily with fewer reference populations. European populations, while genetically distinct from each other, are significantly less differentiated from one another than Indian endogamous communities are. The genetic distance between a Swedish and an Italian population is smaller than the genetic distance between a Tamil Brahmin community and a Tamil Nadar community - despite the fact that the latter two communities share geography, language, and culture. Centuries of endogamy in India have produced genetic differentiation that is unusually high relative to other global populations, and capturing it requires proportionally more reference populations.
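"Genetic distance" between two populations is commonly summarised with the fixation index, F_ST. As a minimal sketch, the function below computes a simple heterozygosity-based F_ST from allele frequencies in two equally weighted populations. The frequencies are invented for illustration and do not represent any of the communities named above.

```python
def fst(freqs_a, freqs_b):
    """Heterozygosity-based F_ST for two equally weighted populations:
    (H_T - H_S) / H_T, summed across loci before dividing."""
    total_ht = 0.0
    total_hs = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_mean = (p_a + p_b) / 2.0
        total_ht += 2.0 * p_mean * (1.0 - p_mean)                          # pooled
        total_hs += (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0  # within
    if total_ht == 0.0:
        return 0.0
    return (total_ht - total_hs) / total_ht


# Invented allele frequencies at three loci.
similar_pair = ([0.30, 0.55, 0.70], [0.32, 0.52, 0.71])
differentiated_pair = ([0.30, 0.55, 0.70], [0.45, 0.35, 0.85])

print(round(fst(*similar_pair), 4))
print(round(fst(*differentiated_pair), 4))
```

The more two groups differ in allele frequency, the higher the value, and the more reference populations are needed before a panel can separate them reliably.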
The Number That Matters When Comparing DNA Tests
When evaluating any DNA test for South Asian ancestry, the single most informative specification is: how many distinct South Asian reference populations does the database include?
| Service | South Asian Reference Populations (approx.) | Primary Data Source | India-specific Collection? |
|---|---|---|---|
| 23andMe | ~10-20 (aggregated) | Customer database + external academic datasets | No |
| AncestryDNA | ~5-15 (aggregated) | 1000 Genomes, customer base (US diaspora) | No |
| Xcode Life | Similar to upstream source | Re-analysis of existing test data | No |
| Helixline | 200+ | India-sourced participants, community-level sampling | Yes |
The number itself is not the only consideration - sample quality, sample size per population, and how the reference panel was validated all matter. But as a first filter, the reference population count tells you whether a South Asian user is likely to get a resolution-limited result or a genuinely informative one.
See what your ancestry actually looks like with 200+ South Asian reference populations
Already have a 23andMe or AncestryDNA file? Upload it and get South Asian-specific ancestry results from ₹2,500 - no new kit needed. Or order Origins for the full test from scratch.
Upload Raw Data - from ₹2,500
Order Origins Kit - ₹6,999
Frequently Asked Questions
Is my 23andMe raw data wrong if I got a vague South Asian result?
No. The raw data - the actual SNP readings from your DNA - is accurate regardless of the ancestry report quality. What is limited is the interpretation layer, not the underlying data. This is why uploading your 23andMe raw data to a South Asian-optimised service like Helixline (from ₹2,500) can give you significantly more informative ancestry results without re-testing. The raw file contains all 700,000 SNP readings; what changes when you switch services is which reference panel those readings are compared against.
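For readers curious about what the raw file actually holds: a 23andMe export is a plain-text file with comment lines beginning with `#`, followed by tab-separated rows of rsid, chromosome, position and genotype, with `--` marking a no-call. The sketch below reads that layout. The sample rows are invented, and other vendors use different column layouts, so treat this as an illustration rather than a general-purpose parser.

```python
# Invented sample in the 23andMe raw-data layout (not real genotype data).
SAMPLE_FILE = (
    "# This data file was generated for illustration only\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAG\n"
    "rs0000002\t1\t2000\t--\n"
    "rs0000003\t2\t3000\tCC\n"
)


def parse_raw_data(text):
    """Return {rsid: genotype}, skipping comments, blank lines and no-calls."""
    calls = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rsid, _chromosome, _position, genotype = line.split("\t")
        if genotype == "--":  # the chip could not read this position
            continue
        calls[rsid] = genotype
    return calls


calls = parse_raw_data(SAMPLE_FILE)
print(len(calls), calls)
```

Nothing in this file refers to a reference panel or an ancestry label, which is why the same readings can be re-interpreted against a different database.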
Why does 23andMe seem to work well for European users but not for Indians?
Because European genetic diversity, while real, is easier to resolve with fewer reference populations - European communities are genetically less differentiated from each other than Indian endogamous communities are. More importantly, 23andMe's customer base is predominantly from the US, UK, and Europe, so its reference panel has accumulated far more European samples over time. South Asians are roughly a quarter of the global population (India alone is about 17%) but have historically represented a much smaller share of consumer DNA testing customers - and global reference panels reflect this imbalance directly.
What does "ANI and ASI" mean and why don't global tests show it?
ANI (Ancestral North Indian) and ASI (Ancestral South Indian) are the two major ancestral streams that underlie all South Asian populations in varying proportions - first described in a landmark 2009 paper by Reich et al. published in Nature. All Indian populations are a mixture of these streams, with additional Steppe pastoralist admixture in many North Indian groups, and Southeast Asian-related ancestry in some northeastern populations. Global tests rarely report ANI/ASI components because their South Asian reference panel is not dense enough to model these internal South Asian ancestral layers separately. With too few South Asian references, the algorithm cannot distinguish between an ANI-heavy profile and an ASI-heavy one - it just reports "South Asian" for both.
Can a South Asian-specific test tell me which state or district my ancestors came from?
At community level, yes - at district or village level, no. DNA ancestry testing identifies which genetic cluster your DNA belongs to, not a geographic coordinate. A Helixline result can tell you that your ancestry clusters most strongly with a community profile associated with a specific region of India - say, Rajput from Rajasthan or Lingayat from Karnataka - which, combined with what you already know about your family history, can significantly narrow the geographic picture. Pinpointing a specific village or district requires genealogical records, not genetics. The two approaches are complementary.
I already uploaded my 23andMe data to Xcode Life. Will Helixline give different results?
Very likely yes. Xcode Life's ancestry analysis is constrained by the same fundamental limitation - their South Asian reference panel draws from the same upstream sources as your original test. When you upload your 23andMe data to Xcode Life, they are re-analysing the same SNPs against a reference that has similar South Asian coverage to what 23andMe itself uses. The interpretation is not genuinely different, even if the interface looks different. Helixline's upload service uses a distinct, India-built reference panel with 200+ South Asian populations - the comparison database is different, which means the interpretation can be meaningfully different, not just a reformatted version of the same answer.