
What Is DNA Imputation? How We Fill the Missing SNPs in Your 23andMe / AncestryDNA File

Imputation in one line: your DNA chip read ~600,000 markers; imputation infers the millions it skipped, so an old 23andMe file can yield a far richer ancestry report.

If you have ever wondered how Helixline can take the same 23andMe or AncestryDNA file you already have and squeeze far more detail out of it, the answer is a technique called genotype imputation. It is one of the most important — and least understood — ideas in consumer genomics. This guide explains what it is, how it works, how accurate it is, and why it is the key to fixing vague "Broadly South Asian" results.

First, what your DNA test actually measured

Your genome is about 3 billion base pairs long, and roughly 10 million or more positions vary commonly between people. These variable positions are called SNPs (single-nucleotide polymorphisms). A consumer DNA test does not sequence all of this. Instead it uses a SNP microarray — a chip that reads a pre-selected set of about 600,000 to 700,000 SNPs. That is well under one-tenth of one percent of your genome.
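The proportions quoted above can be checked with a few lines of arithmetic. This is a minimal sketch using the article's round figures; the chip size of 650,000 is an assumed midpoint of the quoted range, not a measured value.

```python
# Round figures taken from the text; CHIP_SNPS is an assumed midpoint.
GENOME_BP = 3_000_000_000   # approximate genome length in base pairs
COMMON_SNPS = 10_000_000    # approximate number of commonly varying positions
CHIP_SNPS = 650_000         # midpoint of the 600,000 to 700,000 range

share_of_genome = CHIP_SNPS / GENOME_BP
share_of_common_snps = CHIP_SNPS / COMMON_SNPS

print(f"{share_of_genome:.4%} of all base pairs")      # 0.0217% of all base pairs
print(f"{share_of_common_snps:.1%} of common SNPs")    # 6.5% of common SNPs
```

So the chip reads a tiny share of the genome as a whole, and only a small minority of even the commonly varying positions.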

Those positions are chosen cleverly to be informative, but they are still a sparse sample. If you think of your genome as a book, the chip reads a few sentences per page and leaves most of the text blank. The raw data file you download from 23andMe is essentially a list of the letters at those few hundred thousand positions.
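To make "a list of letters" concrete, here is a small sketch of reading a 23andMe-style raw file. It assumes the commonly seen layout of tab-separated columns (rsid, chromosome, position, genotype) with `#` comment lines and `--` for no-calls. The rows in `SAMPLE` and the helper name `parse_raw` are illustrative, not taken from a real file or from Helixline's code.

```python
import io

# Illustrative rows only; genotypes here are made up for the example.
SAMPLE = """\
# rsid\tchromosome\tposition\tgenotype
rs4477212\t1\t82154\tAA
rs3094315\t1\t752566\tAG
rs3131972\t1\t752721\t--
"""

def parse_raw(handle):
    """Return {rsid: (chromosome, position, genotype)} from a raw-data file."""
    calls = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, chrom, pos, genotype = line.split("\t")
        calls[rsid] = (chrom, int(pos), genotype)
    return calls

calls = parse_raw(io.StringIO(SAMPLE))
# Markers the chip attempted but could not read are reported as "--".
no_calls = [rsid for rsid, (_, _, g) in calls.items() if g == "--"]
print(len(calls), "markers,", len(no_calls), "no-call")
```

Everything not listed in the file, which is most of the genome, is simply absent rather than marked as missing.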

The core insight: DNA is not inherited one letter at a time. It is inherited in long stretches called haplotypes that are passed down together across generations. Because nearby SNPs are correlated (a phenomenon called linkage disequilibrium), knowing the letters at the measured positions lets you predict, with high confidence, the letters at the positions in between. That prediction is imputation.
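Linkage disequilibrium is usually quantified with a pairwise r² statistic. The sketch below computes it on a toy set of haplotypes, coded 0/1 per site. The data and the function name `ld_r2` are made up for illustration.

```python
def ld_r2(haplotypes, i, j):
    """Squared correlation between alleles at sites i and j across haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n           # allele frequency at site i
    p_b = sum(h[j] for h in haplotypes) / n           # allele frequency at site j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n   # frequency of carrying both
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom else 0.0            # treat monomorphic sites as 0

# Toy haplotypes: sites 0 and 1 always travel together, site 2 is independent.
HAPLOTYPES = [
    [0, 0, 1], [0, 0, 0], [1, 1, 0], [1, 1, 1],
    [0, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1],
]

print(ld_r2(HAPLOTYPES, 0, 1))  # 1.0: knowing site 0 fully predicts site 1
print(ld_r2(HAPLOTYPES, 0, 2))  # 0.0: site 0 says nothing about site 2
```

An unmeasured SNP in high r² with a measured neighbour is easy to predict; one with r² near zero is not.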

How imputation works, step by step

Imputation is a well-established statistical pipeline, used routinely in genome-wide association studies and other large genetics projects. Here is the intuition behind each step:

  1. Phasing. You inherited one copy of each chromosome from each parent. Phasing sorts your measured SNPs into the two parental haplotypes — figuring out which letters travelled together on the same physical strand.
  2. Reference matching. Your phased haplotypes are compared against a reference panel: a large collection of genomes that have been fully sequenced, so every position is known. Algorithms find segments of reference haplotypes that match yours.
  3. Inference. Wherever a matching reference haplotype covers a position your chip did not read, the algorithm copies in the most probable letter and records a confidence score. Repeat this across the whole genome and the sparse 600,000-marker file becomes a dense map of millions of variants.
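The reference-matching and inference steps can be sketched with a deliberately simplified toy. It assumes the target haplotype is already phased, marks untyped positions with `?`, and picks the reference haplotypes with the fewest mismatches at typed positions. Production imputation tools typically use hidden Markov models over mosaics of reference haplotypes; this nearest-match version is only an illustration, and the panel, target, and function name are hypothetical.

```python
from collections import Counter

# Toy reference panel: every position is known for every haplotype.
REFERENCE_PANEL = ["ACGTA", "ACGTA", "ATGCA", "GTGCC"]
# One phased target haplotype; "?" marks positions the chip did not read.
TARGET = "A?G?A"

def impute(target, panel):
    """Return a list of (allele, confidence) pairs, one per position."""
    typed = [i for i, allele in enumerate(target) if allele != "?"]

    def mismatches(ref):
        return sum(ref[i] != target[i] for i in typed)

    # Reference matching: keep the haplotypes closest to the target.
    best = min(mismatches(ref) for ref in panel)
    matches = [ref for ref in panel if mismatches(ref) == best]

    # Inference: copy the most common allele among matches at each gap.
    result = []
    for i, allele in enumerate(target):
        if allele != "?":
            result.append((allele, 1.0))  # directly measured
        else:
            call, count = Counter(ref[i] for ref in matches).most_common(1)[0]
            result.append((call, count / len(matches)))
    return result

imputed = impute(TARGET, REFERENCE_PANEL)
print("".join(allele for allele, _ in imputed))     # ACGTA
print([round(conf, 2) for _, conf in imputed])      # [1.0, 0.67, 1.0, 0.67, 1.0]
```

Here three reference haplotypes agree with the target at every typed position, and two of the three carry the same letters at the gaps, so each gap is filled with a confidence of about 0.67.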

No new DNA is measured, but you are not inventing data either. You are estimating information that was already implied by the inheritance structure of your genome and the patterns seen in thousands of other people.

Why the reference panel is everything

Imputation is only as good as the reference panel behind it. Widely used panels include the 1000 Genomes Project, the Haplotype Reference Consortium, and TOPMed. The catch: the largest of these panels have historically skewed toward European genomes, and even the more diverse ones sample only a small slice of South Asian diversity.

That matters enormously for Indian ancestry. South Asia has some of the highest haplotype diversity on Earth, the legacy of thousands of endogamous communities whose members have married within their own community for dozens of generations. A Europe-centric panel simply does not contain the haplotypes that distinguish a Reddy from a Maratha from a Bengali Brahmin. When the panel is thin, fewer positions can be imputed with confidence, and the downstream ancestry result collapses into the catch-all "Broadly South Asian" label.

| Variant type | Imputation accuracy | Notes |
| --- | --- | --- |
| Common variants (well-represented population) | Very high (r² often > 0.9) | Ideal for ancestry; the bulk of ancestry-informative markers |
| Common variants (under-represented population) | Moderate | Improves dramatically with a population-matched panel |
| Rare variants | Lower | Harder to impute; clinical calls prefer direct genotyping |
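One common way to put a number on imputation accuracy is the squared correlation between imputed allele dosages (0 to 2 copies) and the true genotypes, measured on samples where the truth is known. The sketch below assumes that definition and uses made-up values for a single variant; imputation tools also report their own estimated r² when no truth set is available, which this example does not reproduce.

```python
def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return 0.0  # correlation is undefined when either side never varies
    return cov * cov / (var_t * var_d)

# Hypothetical data for one variant across six people.
TRUE_GENOTYPES = [0, 1, 2, 1, 0, 2]                 # copies of the alternate allele
IMPUTED_DOSAGES = [0.1, 0.9, 1.8, 1.2, 0.0, 2.0]    # expected copies from imputation

print(round(dosage_r2(TRUE_GENOTYPES, IMPUTED_DOSAGES), 3))  # 0.978
```

A value near 1 means the imputed dosages track the truth closely; values well below that are a signal to treat the variant with caution or exclude it.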

Imputation and South Asian ancestry: the missing piece

This is exactly the gap Helixline is built to close. When you upload a raw data file, your genome is imputed against panels enriched for South Asian haplotypes, and then compared to 2,500+ curated South Asian reference samples spanning 1,000+ sub-populations, alongside ancient-DNA references (Indus Valley, Steppe pastoralist, and AASI). The combination of better imputation and a South-Asian-specific comparison set is what turns a flat result into a state- and community-level breakdown with ANI/ASI/AASI proportions and Y-DNA / mtDNA haplogroups.

Put imputation to work on your own file

Upload your 23andMe or AncestryDNA raw data and we'll impute the missing markers and predict your South Asian ancestry — from $25, no new kit needed.


What imputation can and cannot do

It can: dramatically increase marker density from a consumer file, enable far more granular ancestry inference, harmonise files from different providers (23andMe v3/v4/v5, AncestryDNA, MyHeritage and others) onto a common map, and let two people tested on different chips be compared fairly.

It cannot: conjure information about positions where the surrounding region is poorly covered, perfectly reconstruct rare or private variants, or replace clinical-grade sequencing for high-stakes medical decisions. This is why Helixline's ancestry results lean confidently on imputation, while the deepest health and carrier-screening detail is best served by directly measured genotypes — one reason a fresh Decode or Infinite microarray kit remains the gold standard for clinical-grade calls.

Why this beats buying yet another kit

Because imputation reconstructs ancestry signal so well, you usually do not need to spit into another tube to get a dramatically better ancestry report. The information is already in your existing file. Re-analysing it is faster (about 7 days), cheaper (from $25), and avoids shipping a sample internationally. See the practical walkthrough in how to upload your 23andMe raw data for South Asian ancestry.

Frequently Asked Questions

What is DNA imputation in simple terms?

DNA imputation is a statistical method that predicts the genetic variants a test did not directly measure. Your DNA is inherited in long blocks (haplotypes) that tend to travel together, so if you know some of the letters in a block you can confidently infer the rest by comparing your data to a large reference panel of fully sequenced genomes. In effect, imputation turns a sparse 600,000-marker consumer file into a much denser map of millions of variants, without any new lab work.

If my 23andMe file already contains my DNA, why can't it tell me everything?

A consumer SNP chip does not read your whole genome. It samples a fixed set of roughly 600,000 to 700,000 positions chosen by the manufacturer, out of the 10 million or more positions that commonly vary in the human genome. Many ancestry- and trait-informative markers simply are not on the chip. Imputation recovers those un-typed positions by inference, which is why re-analysing the same file with a richer pipeline can reveal detail the original report never showed.

Is imputed DNA data accurate?

For common variants in well-represented populations, imputation is highly accurate, with confidence (measured as an r-squared value) often above 0.9. Accuracy is highest for common variants and lower for rare ones, and it depends heavily on how well your ancestry is represented in the reference panel. That is why panel composition matters: imputing a South Asian genome against a South-Asian-rich panel gives substantially better results than using a Europe-centric panel. Imputation is excellent for ancestry; for clinical decisions, directly measured genotypes remain the gold standard.

Does imputation work well for South Asian and Indian genomes?

It works well when the reference panel adequately represents South Asian haplotype diversity, which is unusually high because of long-standing endogamy across thousands of communities. Generic global panels under-represent this diversity, leading to coarser results such as the familiar "Broadly South Asian" label. Helixline imputes against panels enriched for South Asian genomes and then compares the imputed data to 2,500+ curated regional reference samples, which restores state- and community-level resolution.

Can I upload my raw data to get an imputed ancestry report?

Yes. You can download the raw data file from 23andMe, AncestryDNA, MyHeritage, FamilyTreeDNA or LivingDNA and upload it to Helixline. We impute the missing markers and return a high-resolution South Asian ancestry report. Upload-only ancestry analysis starts at $25, and a full report with health traits and pharmacogenomics is $50, with results in about 7 days.

Curious what your existing test left on the table? Upload your raw data to Helixline and let imputation do the rest.

Arjun Venkatesh, Bioinformatics Lead
MTech Bioinformatics, IIT Madras

Arjun builds Helixline's imputation and ancestry-inference pipelines, specialising in adapting genotype-imputation reference panels for South Asian population structure.
