
23andMe & AncestryDNA Raw Data File Format Explained (rsid, chromosome, position, genotype)

🧬 The 30-second answer: your raw DNA data file is plain text with four columns: rsid, chromosome, position, and genotype (your two alleles, written like AG). There is one row per genetic marker, about 600,000–700,000 rows in total. Below is exactly how to read each column, and how to turn that file into real Indian ancestry.

You downloaded your raw data from 23andMe or AncestryDNA, opened it, and were met with a wall of text: hundreds of thousands of lines that read like rs3131972  1  752721  AG. It looks like nothing, but that file is the single most valuable thing your DNA test produced, and once you can read its four columns, every line makes sense.

This guide is a plain-English reference to the 23andMe and AncestryDNA raw data file format: what each column (rsid, chromosome, position, genotype) means, what the two-letter genotype actually represents, how the formats differ between providers, and — the part most articles skip — what you can do with the file once you understand it. No biology degree required.

Key takeaway: A raw DNA data file is a tab-separated text file. After a short header, each row records one tested SNP as four values: rsid (the marker's name, e.g. rs4988235), chromosome (1–22, X, Y, or MT), position (the base-pair coordinate on that chromosome), and genotype (your two inherited alleles, e.g. AG). That is the whole format — everything else is just variations on those four fields.

What is a raw DNA data file?

When you spit into a 23andMe tube or swab for AncestryDNA, the lab does not sequence your entire genome. It uses a genotyping chip (a microarray) that reads several hundred thousand specific, pre-selected positions in your DNA — the positions known to vary most between people. Those variable positions are called SNPs (single nucleotide polymorphisms, pronounced "snips").

Your genome is roughly 3.2 billion base pairs long, and any two people are about 99.9% identical. The interesting 0.1% — a few million positions — is where ancestry, traits, and disease risk live. A consumer chip samples about 600,000–700,000 of those informative positions. The raw data file is simply the readout of that chip: one line per SNP, recording where the marker is and which two DNA letters you carry there.

That is why the file is so portable. It is not a proprietary "23andMe result" — it is your underlying genotype data, in a simple text format that any other tool (including Helixline) can re-read and re-analyse.

The four columns, explained

Every standard raw data file boils down to these four fields. Here is what each one means.

Column | Example | What it means
rsid | rs4988235 | The marker's unique name. "rs" stands for Reference SNP, and the number is its ID in NCBI's public dbSNP database. The same rsid refers to the same position in everyone's genome, which is what makes files comparable across companies and tools.
chromosome | 2 | Which chromosome the marker sits on: 1–22 for the autosomes, X and Y for the sex chromosomes, and MT (or 26 in some files) for mitochondrial DNA.
position | 136608646 | The base-pair coordinate: how far along the chromosome the marker lies, counted in bases. This number is relative to a specific reference genome build; 23andMe v5 and AncestryDNA both use GRCh37 (build 37 / hg19), so the same SNP can have a different position number in a build-38 file.
genotype | CT | Your two alleles at that position, one copy inherited from each parent, written as two letters drawn from A, C, G, T. (AncestryDNA splits this into two columns, allele1 and allele2.)

What does a 23andMe raw data file look like?

Open the .txt in any text editor (Notepad, TextEdit, VS Code) and you see a block of comment lines starting with #, then the data. Here is an annotated sample of the 23andMe format:

# This data file generated by 23andMe at: Fri Jun 26 2026 10:30:00 UTC
#
# This file contains raw genotype data, including data that is not used in
# 23andMe reports. This data has undergone a general quality review however
# only a subset of markers have been individually validated for accuracy.
#
# rsid	chromosome	position	genotype
rs12564807	1	734462	AA
rs3131972	1	752721	AG
rs12124819	1	776546	AA
rs1815606	1	844113	AG
rs4988235	2	136608646	CT
rs1799945	6	26091179	CG
rs1426654	15	48426484	AA
i5000940	Y	2655180	A
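To make the layout concrete, here is a minimal Python sketch that skips the "#" comment lines and splits each data row on tabs. The Marker type and the parse_23andme function are hypothetical names introduced for this illustration, not part of any 23andMe tooling, and the sketch assumes every data row has exactly four tab-separated fields.

```python
from typing import Iterable, Iterator, NamedTuple


class Marker(NamedTuple):
    """One row of a raw data file (illustrative type, not an official schema)."""
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_23andme(lines: Iterable[str]) -> Iterator[Marker]:
    """Yield one Marker per data row, skipping '#' comment lines and blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield Marker(rsid, chromosome, int(position), genotype)


# A few rows in the same shape as the sample file.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs3131972\t1\t752721\tAG",
    "rs4988235\t2\t136608646\tCT",
    "i5000940\tY\t2655180\tA",
]

for marker in parse_23andme(sample):
    print(marker.rsid, marker.chromosome, marker.position, marker.genotype)
```

To run it on a real export, pass an open file object instead of the sample list, for example parse_23andme(open("genome.txt")), where genome.txt is a placeholder for your own file name.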

Reading it line by line

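Take a few of those rows and translate them into plain English:

rs3131972  1  752721  AG: at the marker named rs3131972, on chromosome 1 at base-pair position 752,721, you carry one A and one G.
rs12564807  1  734462  AA: at rs12564807, on chromosome 1 at position 734,462, both of your copies are A.
rs4988235  2  136608646  CT: at rs4988235, on chromosome 2 at position 136,608,646, you carry one C and one T.
i5000940  Y  2655180  A: an ID starting with "i" is a 23andMe-internal marker name rather than a dbSNP rsid. This one sits on the Y chromosome, and only one letter is listed because there is only one copy of Y.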

What the genotype column actually means

The genotype is where people get confused, so it is worth slowing down. You inherited two copies of each autosome (chromosomes 1–22), one from your mother and one from your father, so you have two alleles at every autosomal SNP. Each allele is one of the four DNA bases: A (adenine), C (cytosine), G (guanine), or T (thymine).

The order of the two letters is not meaningful on its own: AG and GA describe the same genotype, and the file does not tell you which letter came from which parent. What matters is which pair you carry. Multiply that across 600,000+ positions and you have the raw material for every ancestry estimate, trait prediction, and carrier-status call your reports are built from.
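Because letter order carries no information, code that compares genotypes usually normalises them first. The following is a small sketch of that idea. The function names canonical and zygosity are made up for this example, and the terms homozygous (two matching letters) and heterozygous (two different letters) are the standard labels for the two cases.

```python
BASES = set("ACGT")


def canonical(genotype: str) -> str:
    """Sort the letters so that 'GA' and 'AG' compare as equal."""
    return "".join(sorted(genotype.upper()))


def zygosity(genotype: str) -> str:
    """Classify a two-letter genotype; anything else is reported as 'other'."""
    alleles = genotype.upper()
    if len(alleles) != 2 or not set(alleles) <= BASES:
        # Single-letter calls (e.g. on Y), unreadable markers, and so on.
        return "other"
    return "homozygous" if alleles[0] == alleles[1] else "heterozygous"


for g in ["AG", "GA", "AA", "CT", "A"]:
    print(g, canonical(g), zygosity(g))
```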

A note on strand: the same SNP can be reported on either the "plus" or "minus" DNA strand, so one company might write a genotype as AG and another as CT for the identical marker (A↔T and C↔G are complementary). This is why you cannot naively diff two files from different providers — good analysis software strand-aligns every marker first.
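A naive version of that strand check can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names are hypothetical, and it compares a single pair of genotypes without any reference-allele information. Real strand alignment relies on reference data, because for A/T and C/G markers the flipped genotype looks identical to the original and the strand cannot be recovered from the letters alone.

```python
# Complement table: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(genotype: str) -> str:
    """Sort the letters so that allele order does not matter."""
    return "".join(sorted(genotype.upper()))


def flip_strand(genotype: str) -> str:
    """Rewrite a genotype as it would read on the opposite DNA strand."""
    return genotype.upper().translate(COMPLEMENT)


def same_genotype(a: str, b: str) -> bool:
    """True if a and b match directly or after flipping b to the other strand."""
    return canonical(a) in (canonical(b), canonical(flip_strand(b)))


print(flip_strand("AG"))
print(same_genotype("AG", "CT"))
print(same_genotype("AA", "CC"))
```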

23andMe vs AncestryDNA: how the formats differ

All consumer raw data files carry the same four pieces of information, but the packaging differs. The biggest practical difference is that AncestryDNA splits the genotype into two columns (allele1 and allele2) while 23andMe combines them into one. Here is how the major providers compare:

Provider | File type | Genotype layout | SNPs tested | Genome build
23andMe (v5) | .txt (tab-separated) | Single column (AG) | ~640,000 | GRCh37 (build 37)
AncestryDNA | .txt (tab-separated) | Two columns (allele1, allele2) | ~700,000 | GRCh37 (build 37)
MyHeritage | .csv (comma-separated) | Single column, values quoted | ~700,000 | GRCh37 (build 37)
FamilyTreeDNA | .csv (comma-separated) | Single column | ~700,000 | GRCh37 / GRCh38
Whole-genome (WGS) | .vcf | Reference/alt + sample columns | 4,000,000+ | GRCh38 (build 38)

A 23andMe file looks like the sample above. An AncestryDNA file's header row instead reads rsid  chromosome  position  allele1  allele2, so the same lactase marker appears as rs4988235  2  136608646  C  T — the C and T in separate columns rather than as a single CT. MyHeritage and FamilyTreeDNA wrap the same data in comma-separated .csv files. The four concepts never change; only the punctuation does.
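As a sketch of how a tool might fold these layouts into one shape, the Python below reads either four-column rows (combined genotype) or five-column rows (allele1 and allele2) and emits the same record for both. The Marker type and read_markers function are hypothetical names for this example. The comma-separated sample uses an assumed header and quoting style and is not an exact copy of any provider's export.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Marker(NamedTuple):
    rsid: str
    chromosome: str
    position: int
    genotype: str


def read_markers(text: str, delimiter: str = "\t") -> Iterator[Marker]:
    """Parse rows with a combined genotype (4 fields) or split alleles (5 fields)."""
    lines = (
        ln for ln in io.StringIO(text)
        if ln.strip() and not ln.startswith("#")
    )
    for row in csv.reader(lines, delimiter=delimiter):
        if row[0].lower() == "rsid":
            continue  # uncommented header row
        if len(row) == 5:
            rsid, chromosome, position, allele1, allele2 = row
            genotype = allele1 + allele2
        else:
            rsid, chromosome, position, genotype = row
        yield Marker(rsid, chromosome, int(position), genotype)


ancestry_style = (
    "rsid\tchromosome\tposition\tallele1\tallele2\n"
    "rs4988235\t2\t136608646\tC\tT\n"
)
# Assumed CSV layout with quoted values, for illustration only.
csv_style = (
    "RSID,CHROMOSOME,POSITION,RESULT\n"
    '"rs4988235","2","136608646","CT"\n'
)

print(list(read_markers(ancestry_style)))
print(list(read_markers(csv_style, delimiter=",")))
```

Using the csv module rather than a plain split means quoted values are unwrapped automatically, so both samples produce the same Marker.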

Important: always keep the file exactly as the provider exported it. Opening it in Excel and re-saving can silently mangle it: Excel auto-converts values it thinks it recognises (it famously turns some gene names into dates), and re-saving can change the delimiter, quoting, or line endings that upload tools expect. If you need to look inside, use a plain text editor, not a spreadsheet.

Build 37 vs build 38: why the position number matters

The position column only makes sense relative to a reference genome build — a numbered map of the human genome. 23andMe v5 and AncestryDNA report positions on GRCh37 (also called build 37 or hg19). Newer pipelines and whole-genome files use GRCh38 (build 38 / hg38). The same SNP — same rsid — can sit at a different position number in build 37 versus build 38, because the coordinate system shifted.

This is why a tool that accepts your raw data has to know which build it is in before it can place your markers correctly. It is also why the rsid is the safer "key" for comparing markers across files: rsids are build-independent, while raw positions are not.
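One simple way to sanity-check a file's build, sketched below, is to look up a few markers whose GRCh37 coordinates you already trust and see whether the file agrees. The two anchor coordinates here are the GRCh37 values quoted in this guide. The function name and the dictionary layout (rsid mapped to a chromosome and position pair) are assumptions made for the example, and a real tool would use many more anchors.

```python
from typing import Dict, Optional, Tuple

Locus = Tuple[str, int]  # (chromosome, position)

# GRCh37 coordinates taken from the sample rows in this guide.
GRCH37_ANCHORS: Dict[str, Locus] = {
    "rs4988235": ("2", 136608646),
    "rs3131972": ("1", 752721),
}


def matches_grch37(markers: Dict[str, Locus]) -> Optional[bool]:
    """Return True/False if anchor markers agree/disagree, None if none are present."""
    checked = [
        markers[rsid] == expected
        for rsid, expected in GRCH37_ANCHORS.items()
        if rsid in markers
    ]
    if not checked:
        return None
    return all(checked)


file_a = {"rs4988235": ("2", 136608646), "rs1426654": ("15", 48426484)}
# Artificial offset to simulate a file on another build (not a real coordinate).
file_b = {"rs4988235": ("2", 136608646 + 1000)}

print(matches_grch37(file_a))
print(matches_grch37(file_b))

# Joining on rsid works regardless of build; joining on position would not.
print(sorted(file_a.keys() & file_b.keys()))
```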

What you can actually DO with this file

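Here is the part the format guides usually skip. Once you understand that the file is portable genotype data, it stops being a curiosity and becomes a tool. The most useful things you can do with a downloaded raw data file are to re-analyse it with health-report tools such as Promethease, upload it to genealogy databases such as GEDmatch, and get a more detailed ancestry breakdown without taking a new test.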

For the ancestry use case specifically, you do not need to spit into another tube. You can upload the exact same file to Helixline, where it is statistically imputed (the millions of markers your chip never read are inferred from South-Asian-rich reference panels) and then compared against 7,600+ curated South Asian reference samples spanning 1,000+ sub-populations. The result is the state- and community-level breakdown, ANI/ASI/AASI proportions, and Y-DNA / mtDNA haplogroups that "Broadly South Asian" never gave you. (More on the underlying method in what DNA imputation is and why it matters, and on the specific fix in "Broadly South Asian" on 23andMe? Here's the fix.)

Get Your Real Indian Ancestry — from $25

You already own the file. Upload your 23andMe or AncestryDNA raw data and get state- and community-level South Asian ancestry in about 7 days — no new kit, your data stored in India.


How to download your raw data file

From 23andMe

Sign in at 23andme.com → click your profile name → Settings → scroll to 23andMe Data → Download. Request the raw data file (not the report PDFs), re-enter your password, and you will be emailed when the .zip (containing a .txt) is ready.

From AncestryDNA

Sign in at ancestry.com → DNA tab → Settings → Download DNA Data. Confirm your identity, click the link in the confirmation email, and download the .zip / .txt. Our step-by-step walkthrough covers it in detail: downloading and uploading AncestryDNA raw data.

From MyHeritage, FamilyTreeDNA, or LivingDNA

Each provider exposes a "Download raw DNA data" option in account settings. Save the file exactly as exported — do not open and re-save it in a spreadsheet.
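If you want to inspect a download without unpacking or re-saving it, the file can be read straight out of the .zip. The Python sketch below shows one way to do that with the standard library. The function name iter_raw_lines is hypothetical, the sketch assumes UTF-8 text, and it simply takes the first .txt or .csv member it finds, which may not hold for every provider's archive.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Union


def iter_raw_lines(zip_source: Union[str, BinaryIO]) -> Iterator[str]:
    """Yield text lines from the first .txt or .csv member of an export archive."""
    with zipfile.ZipFile(zip_source) as archive:
        names = [
            name for name in archive.namelist()
            if name.lower().endswith((".txt", ".csv"))
        ]
        if not names:
            raise ValueError("no .txt or .csv file found in archive")
        with archive.open(names[0]) as member:
            for line in io.TextIOWrapper(member, encoding="utf-8"):
                yield line.rstrip("\r\n")


# Build a tiny in-memory archive so the example runs without a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr(
        "genome_example.txt",
        "# rsid\tchromosome\tposition\tgenotype\nrs3131972\t1\t752721\tAG\n",
    )
buffer.seek(0)

for line in iter_raw_lines(buffer):
    print(line)
```

With a real export you would pass the path to the downloaded archive instead of the in-memory buffer. The original .zip is only read, never modified.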

Frequently Asked Questions

What is the 23andMe raw data file format (rsid, chromosome, position, genotype)?

A 23andMe raw data file is a plain-text, tab-separated file. After a block of comment lines starting with "#", it has four columns in this exact order: rsid (the marker's reference SNP identifier, e.g. rs4988235), chromosome (1–22, X, Y or MT), position (the base-pair coordinate on that chromosome, on the GRCh37 / build 37 reference for the v5 chip), and genotype (your two alleles written as one two-letter string such as AA, AG or CT). One row equals one tested SNP, and a file has roughly 600,000–640,000 rows.

How do I read raw DNA data and what do the columns mean?

Open the file in any text editor. Each data row is one genetic marker. The rsid column is the marker's unique name from the NCBI dbSNP database. The chromosome column says which chromosome it sits on (1–22 for autosomes, plus X, Y and MT for mitochondrial DNA). The position column is the numeric base-pair coordinate of the marker on that chromosome, relative to a reference genome build. The genotype column is the pair of DNA letters you carry at that spot, one inherited from each parent: identical letters (AA, GG) are homozygous, different letters (AG, CT) are heterozygous.

Is the AncestryDNA raw data file format the same as 23andMe?

It is very similar but not identical. Both are tab-separated text on the GRCh37 build, but AncestryDNA splits the genotype into two separate columns (allele1 and allele2) instead of 23andMe's single combined genotype column, and its column names appear on a plain header row, whereas 23andMe's column names sit on the last "#"-prefixed comment line. AncestryDNA also tests a slightly different and slightly larger set of about 700,000 SNPs. MyHeritage and FamilyTreeDNA typically use comma-separated (.csv) files. The underlying information (rsid, chromosome, position, genotype) is the same in every case.

What does the genotype column (the two letters) actually mean?

The genotype is the pair of nucleotide bases you carry at that position — one copy from your mother and one from your father. Each base is one of A, C, G or T. So "AG" means one A allele and one G allele. When both letters match (AA, GG) you are homozygous; when they differ (AG, CT) you are heterozygous. A genotype of "--" or "00" means the chip could not read that marker (a no-call). This per-marker allele information is what every ancestry, health and trait report is ultimately computed from.
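A quick way to see how many markers the chip failed to read is to count the no-calls. The sketch below treats only the two codes mentioned above ("--" and "00") as no-calls. The function names are invented for the example, and other providers may use different codes.

```python
from typing import Iterable

NO_CALL_CODES = {"--", "00"}


def is_no_call(genotype: str) -> bool:
    """True if the genotype is one of the no-call codes listed above."""
    return genotype.strip() in NO_CALL_CODES


def call_rate(genotypes: Iterable[str]) -> float:
    """Fraction of markers with a readable genotype."""
    values = list(genotypes)
    if not values:
        raise ValueError("no genotypes supplied")
    called = sum(1 for g in values if not is_no_call(g))
    return called / len(values)


print(call_rate(["AG", "--", "CT", "00"]))
```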

What can I do with my 23andMe or AncestryDNA raw data file?

The raw file is portable, so you can re-analyse it elsewhere instead of buying a new test. You can upload it to health-report tools like Promethease, to genealogy databases like GEDmatch, or to Helixline for state- and community-level Indian ancestry. Helixline re-imputes the missing SNPs your chip never read and compares your genome against 7,600+ curated South Asian reference samples, returning the granular caste/region ancestry that "Broadly South Asian" on the original test could not. Upload-only ancestry analysis starts at $25 with results in about 7 days.

Now that you can read the file, see what it has been hiding: upload your raw data to Helixline, learn more about raw DNA data basics, or compare full options on the Helixline International page.

Arjun Venkatesh, Bioinformatics Lead
MTech Bioinformatics, IIT Madras

Arjun builds Helixline's imputation and ancestry-inference pipelines, specialising in adapting genotype-imputation reference panels for South Asian population structure.
