23andMe & AncestryDNA Raw Data File Format Explained (rsid, chromosome, position, genotype)
AG). One row per genetic marker, about 600,000–700,000 rows. Below is exactly how to read each column, and how to turn that file into real Indian ancestry.
You downloaded your raw data from 23andMe or AncestryDNA, opened it, and were met with a wall of text: thousands of lines reading rs3131972 1 752721 AG. It looks like nothing, but that file is the single most valuable thing your DNA test produced, and once you can read its four columns, every line makes sense.
This guide is a plain-English reference to the 23andMe and AncestryDNA raw data file format: what each column (rsid, chromosome, position, genotype) means, what the two-letter genotype actually represents, how the formats differ between providers, and (the part most articles skip) what you can do with the file once you understand it. No biology degree required.
Key takeaway: A raw DNA data file is a tab-separated text file. After a short header, each row records one tested SNP as four values: rsid (the marker's name, e.g. rs4988235), chromosome (1–22, X, Y, or MT), position (the base-pair coordinate on that chromosome), and genotype (your two inherited alleles, e.g. AG). That is the whole format; everything else is just variations on those four fields.
What is a raw DNA data file?
When you spit into a 23andMe tube or swab for AncestryDNA, the lab does not sequence your entire genome. It uses a genotyping chip (a microarray) that reads several hundred thousand specific, pre-selected positions in your DNA, the positions known to vary most between people. Those variable positions are called SNPs (single nucleotide polymorphisms, pronounced "snips").
Your genome is roughly 3.2 billion base pairs long, and any two people are about 99.9% identical. The interesting 0.1%, a few million positions, is where ancestry, traits, and disease risk live. A consumer chip samples about 600,000–700,000 of those informative positions. The raw data file is simply the readout of that chip: one line per SNP, recording where the marker is and which two DNA letters you carry there.
That is why the file is so portable. It is not a proprietary "23andMe result"; it is your underlying genotype data, in a simple text format that any other tool (including Helixline) can re-read and re-analyse.
The four columns, explained
Every standard raw data file boils down to these four fields. Here is what each one means.
| Column | Example | What it means |
|---|---|---|
| rsid | rs4988235 | The marker's unique name. "rs" stands for Reference SNP, and the number is its ID in NCBI's public dbSNP database. The same rsid refers to the same position in everyone's genome, which is what makes files comparable across companies and tools. |
| chromosome | 2 | Which chromosome the marker sits on: 1–22 for the autosomes, X and Y for the sex chromosomes, and MT (or 26 in some files) for mitochondrial DNA. |
| position | 136608646 | The base-pair coordinate: how far along the chromosome the marker lies, counted in bases. This number is relative to a specific reference genome build; 23andMe v5 and AncestryDNA both use GRCh37 (build 37 / hg19), so the same SNP can have a different position number in a build-38 file. |
| genotype | CT | Your two alleles at that position, one copy inherited from each parent, written as two letters drawn from A, C, G, T. (AncestryDNA splits this into two columns, allele1 and allele2.) |
What does a 23andMe raw data file look like?
Open the .txt in any text editor (Notepad, TextEdit, VS Code) and you see a block of comment lines starting with #, then the data. Here is an annotated sample of the 23andMe format:
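As a minimal sketch of how this layout can be read programmatically, the Python snippet below embeds a small sample and parses it. The data rows are the ones discussed in this guide; the "#" comment lines are illustrative placeholders rather than the provider's exact header text, and the names `Marker` and `parse_23andme` are hypothetical helpers introduced here.

```python
from typing import List, NamedTuple

# Illustrative sample. The "#" lines are placeholders (an assumption), not the
# provider's exact header wording; the data rows are the ones used in this guide.
SAMPLE_23ANDME = (
    "# (provider comment lines appear here)\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs12564807\t1\t734462\tAA\n"
    "rs3131972\t1\t752721\tAG\n"
    "rs4988235\t2\t136608646\tCT\n"
    "rs1426654\t15\t48426484\tAA\n"
    "i5000940\tY\t2655180\tA\n"
)


class Marker(NamedTuple):
    rsid: str
    chromosome: str   # kept as text because of X, Y and MT
    position: int
    genotype: str


def parse_23andme(text: str) -> List[Marker]:
    """Parse tab-separated rows, skipping blank lines and '#' comment lines."""
    markers = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        markers.append(Marker(rsid, chromosome, int(position), genotype))
    return markers


markers = parse_23andme(SAMPLE_23ANDME)
for m in markers:
    print(m.rsid, m.chromosome, m.position, m.genotype)
```

For a real export you would read the unzipped .txt from disk and pass its contents to the same function; a full file has several hundred thousand rows, so streaming line by line may be preferable to loading it all at once.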
Reading it line by line
Take a few of those rows and translate them into plain English:
- rs3131972 1 752721 AG: Marker rs3131972, on chromosome 1, at base-pair position 752,721. You carry one A allele and one G allele. Because the two letters differ, you are heterozygous here.
- rs12564807 1 734462 AA: Same chromosome, position 734,462. Two identical A alleles, so you are homozygous at this marker.
- rs4988235 2 136608646 CT: A famous marker near the LCT gene on chromosome 2, associated with lactase persistence (the ability to digest milk into adulthood). A CT genotype here typically means you can.
- rs1426654 15 48426484 AA: A well-known pigmentation marker in the SLC24A5 gene; the A allele is strongly associated with lighter skin and is near-universal in South Asian and European populations.
- i5000940 Y 2655180 A: Note the i prefix instead of rs: 23andMe uses internal "i" identifiers for markers that do not have a public dbSNP rsid. Y-chromosome markers also report a single allele, because the Y is present in only one copy.
What the genotype column actually means
The genotype is where people get confused, so it is worth slowing down. At each position you inherited two copies of the chromosome, one from your mother and one from your father, so you have two alleles at every autosomal SNP. Each allele is one of the four DNA bases: A (adenine), C (cytosine), G (guanine), or T (thymine).
- When both copies are the same letter (AA, GG, CC, TT), you are homozygous at that marker.
- When the two letters differ (AG, CT, AC), you are heterozygous.
- --, 00 or NN means a no-call: the chip could not reliably read that position. A small fraction of no-calls in any file is normal.
The order of the two letters is not meaningful on its own: AG and GA describe the same genotype, and files are not telling you which letter came from which parent. What matters is which pair you carry. Multiply that decision across 600,000+ positions and you have the raw material for every ancestry estimate, trait prediction, and carrier-status call your reports are built from.
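The rules above can be sketched as a small Python classifier. This is an illustration under stated assumptions, not any provider's actual logic: the function names are hypothetical, single-letter calls (as seen on Y or MT markers) get their own label, and anything else falls into a catch-all "other" bucket.

```python
# No-call codes as listed in this guide.
NO_CALLS = {"--", "00", "NN"}
BASES = set("ACGT")


def classify_genotype(genotype: str) -> str:
    """Label a genotype string as homozygous, heterozygous, no-call, etc."""
    g = genotype.strip().upper()
    if not g or g in NO_CALLS:
        return "no-call"
    if len(g) == 1 and g in BASES:
        return "single-allele"      # e.g. Y-chromosome or MT markers
    if len(g) == 2 and set(g) <= BASES:
        return "homozygous" if g[0] == g[1] else "heterozygous"
    return "other"                  # anything outside plain A/C/G/T pairs


def same_genotype(a: str, b: str) -> bool:
    """Compare two genotypes ignoring letter order, so AG equals GA."""
    return sorted(a.upper()) == sorted(b.upper())


for example in ["AA", "AG", "GA", "--", "A"]:
    print(example, "->", classify_genotype(example))
```

Sorting the letters before comparing is one simple way to encode the point that order carries no parent-of-origin information.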
A note on strand: the same SNP can be reported on either the "plus" or "minus" DNA strand, so one company might write a genotype as AG and another as CT for the identical marker (A pairs with T and C pairs with G, so they are complementary). This is why you cannot naively diff two files from different providers; good analysis software strand-aligns every marker first.
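To make the strand issue concrete, here is a hedged Python sketch of complement-aware comparison. It is a simplification: real strand alignment uses each marker's known reference alleles, which this example does not have, and the function names are hypothetical.

```python
# Map each base to its complement on the opposite strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def flip_strand(genotype: str) -> str:
    """Return the genotype as it would read on the opposite strand."""
    return genotype.upper().translate(COMPLEMENT)


def unordered(genotype: str) -> str:
    """Sort the letters so that AG and GA compare equal."""
    return "".join(sorted(genotype.upper()))


def matches_allowing_flip(a: str, b: str) -> bool:
    """True if a and b agree directly or after flipping a's strand."""
    return unordered(a) == unordered(b) or unordered(flip_strand(a)) == unordered(b)


def is_strand_ambiguous(genotype: str) -> bool:
    """A/T and C/G pairs read the same on both strands, so a flip is undetectable."""
    return unordered(genotype) in {"AT", "CG"}


print(flip_strand("AG"))                    # opposite-strand reading of AG
print(matches_allowing_flip("AG", "CT"))    # same marker, different strands
print(is_strand_ambiguous("AT"))
```

The ambiguity check matters because an A/T or C/G heterozygote looks identical on both strands, so the genotype alone cannot reveal whether a flip occurred; tools typically need extra information to handle those markers.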
23andMe vs AncestryDNA: how the formats differ
All consumer raw data files carry the same four pieces of information, but the packaging differs. The biggest practical difference is that AncestryDNA splits the genotype into two columns (allele1 and allele2) while 23andMe combines them into one. Here is how the major providers compare:
| Provider | File type | Genotype layout | SNPs tested | Genome build |
|---|---|---|---|---|
| 23andMe (v5) | .txt (tab-separated) | Single column (AG) | ~640,000 | GRCh37 (build 37) |
| AncestryDNA | .txt (tab-separated) | Two columns (allele1, allele2) | ~700,000 | GRCh37 (build 37) |
| MyHeritage | .csv (comma-separated) | Single column, values quoted | ~700,000 | GRCh37 (build 37) |
| FamilyTreeDNA | .csv (comma-separated) | Single column | ~700,000 | GRCh37 / GRCh38 |
| Whole-genome (WGS) | .vcf | Reference/alt + sample columns | 4,000,000+ | GRCh38 (build 38) |
A 23andMe file looks like the sample above. An AncestryDNA file's header row instead reads rsid chromosome position allele1 allele2, so the same lactase marker appears as rs4988235 2 136608646 C T, with the C and T in separate columns rather than as a single CT. MyHeritage and FamilyTreeDNA wrap the same data in comma-separated .csv files. The four concepts never change; only the punctuation does.
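A short Python sketch shows how the two-column AncestryDNA layout might be folded back into the four-field shape. The sample row is the lactase marker from the text; the "#" comment line is a placeholder, the helper names are hypothetical, and the numeric chromosome codes 23 and 24 are an assumption about AncestryDNA-style exports (only 26 for mitochondrial DNA is mentioned in this guide).

```python
import csv

# 26 -> MT comes from this guide; 23 -> X and 24 -> Y are assumptions.
# Any other value is passed through unchanged.
CHROMOSOME_CODES = {"23": "X", "24": "Y", "26": "MT"}

SAMPLE_ANCESTRY = (
    "# (provider comment lines appear here)\n"
    "rsid\tchromosome\tposition\tallele1\tallele2\n"
    "rs4988235\t2\t136608646\tC\tT\n"
)


def normalize_chromosome(value: str) -> str:
    """Translate numeric sex/mitochondrial codes to letter names."""
    return CHROMOSOME_CODES.get(value, value)


def parse_ancestry(text: str):
    """Return (rsid, chromosome, position, genotype) tuples from a two-allele file."""
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    # The first remaining line is the plain header row naming the columns.
    reader = csv.DictReader(data_lines, delimiter="\t")
    rows = []
    for record in reader:
        genotype = record["allele1"] + record["allele2"]
        rows.append((
            record["rsid"],
            normalize_chromosome(record["chromosome"]),
            int(record["position"]),
            genotype,
        ))
    return rows


rows = parse_ancestry(SAMPLE_ANCESTRY)
print(rows)
```

Joining allele1 and allele2 into one string makes the output directly comparable with a 23andMe-style genotype column, subject to the strand caveat discussed earlier.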
Important: always keep the file exactly as the provider exported it. Opening it in Excel and re-saving can silently mangle it: Excel auto-converts values it takes for dates or numbers (it famously turns some gene names into dates), and re-saving can change the delimiters and quoting. If you need to look inside, use a plain text editor, not a spreadsheet.
Build 37 vs build 38: why the position number matters
The position column only makes sense relative to a reference genome build, a numbered map of the human genome. 23andMe v5 and AncestryDNA report positions on GRCh37 (also called build 37 or hg19). Newer pipelines and whole-genome files use GRCh38 (build 38 / hg38). The same SNP (same rsid) can sit at a different position number in build 37 versus build 38, because the coordinate system shifted.
This is why a tool that accepts your raw data has to know which build it is in before it can place your markers correctly. It is also why the rsid is the safer "key" for comparing markers across files: rsids are build-independent, while raw positions are not.
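One rough way a tool could guess the build is to scan the "#" comment lines for a build name. The Python sketch below assumes the header mentions the build in words such as "build 37", "GRCh37" or "hg19"; that wording is an assumption, the function name is hypothetical, and a robust tool would verify the guess against known marker positions.

```python
import re
from typing import Optional

# Patterns for the build names used in this guide.
BUILD_PATTERNS = {
    "GRCh37": re.compile(r"\b(GRCh37|hg19|build\s*37)\b", re.IGNORECASE),
    "GRCh38": re.compile(r"\b(GRCh38|hg38|build\s*38)\b", re.IGNORECASE),
}


def guess_build(text: str) -> Optional[str]:
    """Return 'GRCh37' or 'GRCh38' if exactly one is named in the header, else None."""
    header = [line for line in text.splitlines() if line.startswith("#")]
    found = {
        name
        for name, pattern in BUILD_PATTERNS.items()
        if any(pattern.search(line) for line in header)
    }
    # Refuse to guess when the header names no build or more than one.
    return found.pop() if len(found) == 1 else None


# Hypothetical header text, used only to exercise the function.
example = "# positions are given on build 37\nrs4988235\t2\t136608646\tCT\n"
print(guess_build(example))
```

Returning None instead of a default keeps the caller from silently placing markers on the wrong coordinate system; matching markers by rsid remains the safer fallback.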
What you can actually DO with this file
Here is the part the format guides usually skip. Once you understand that the file is portable genotype data, it stops being a curiosity and becomes a tool. The most useful things you can do with a downloaded raw data file:
- Re-analyse your ancestry somewhere better. This is the big one for anyone of Indian or South Asian descent. 23andMe and AncestryDNA collapse the subcontinent into a single grey blob, "Broadly South Asian", because their reference panels were built around European and East Asian customers. The signal that separates a Punjabi Jat from a Tamil Brahmin from a Bengali Kayastha is sitting in your file already; it just has nothing to be compared against on a global platform.
- Run health and trait reports. Tools like Promethease cross-reference your SNPs against the SNPedia literature database for a dense, technical health rundown.
- Find relatives across companies. Genealogy databases like GEDmatch let people who tested with different companies match against each other.
- Keep a permanent backup. With 23andMe winding down parts of its consumer business, downloading and safely storing this file is the only way to be sure you keep your own data; see what to do as 23andMe shuts down.
For the ancestry use case specifically, you do not need to spit into another tube. You can upload the exact same file to Helixline, where it is statistically imputed (the millions of markers your chip never read are inferred from South-Asian-rich reference panels) and then compared against 7,600+ curated South Asian reference samples spanning 1,000+ sub-populations. The result is the state- and community-level breakdown, ANI/ASI/AASI proportions, and Y-DNA / mtDNA haplogroups that "Broadly South Asian" never gave you. (More on the underlying method in what DNA imputation is and why it matters, and on the specific fix in "Broadly South Asian" on 23andMe? Here's the fix.)
Get Your Real Indian Ancestry, from $25
You already own the file. Upload your 23andMe or AncestryDNA raw data and get state- and community-level South Asian ancestry in about 7 days: no new kit, your data stored in India.
How to download your raw data file
From 23andMe
Sign in at 23andme.com → click your profile name → Settings → scroll to 23andMe Data → Download. Request the raw data file (not the report PDFs), re-enter your password, and you will be emailed when the .zip (containing a .txt) is ready.
From AncestryDNA
Sign in at ancestry.com → DNA tab → Settings → Download DNA Data. Confirm your identity, click the link in the confirmation email, and download the .zip / .txt. Our step-by-step walkthrough covers it in detail: downloading and uploading AncestryDNA raw data.
From MyHeritage, FamilyTreeDNA, or LivingDNA
Each provider exposes a "Download raw DNA data" option in account settings. Save the file exactly as exported; do not open and re-save it in a spreadsheet.
Frequently Asked Questions
What is the 23andMe raw data file format (rsid, chromosome, position, genotype)?
A 23andMe raw data file is a plain-text, tab-separated file. After a block of comment lines starting with "#", it has four columns in this exact order: rsid (the marker's reference SNP identifier, e.g. rs4988235), chromosome (1–22, X, Y or MT), position (the base-pair coordinate on that chromosome, on the GRCh37 / build 37 reference for the v5 chip), and genotype (your two alleles written as one two-letter string such as AA, AG or CT). One row equals one tested SNP, and a file has roughly 600,000–640,000 rows.
How do I read raw DNA data and what do the columns mean?
Open the file in any text editor. Each data row is one genetic marker. The rsid column is the marker's unique name from the NCBI dbSNP database. The chromosome column says which chromosome it sits on (1–22 for autosomes, plus X, Y and MT for mitochondrial DNA). The position column is the numeric base-pair coordinate of the marker on that chromosome, relative to a reference genome build. The genotype column is the pair of DNA letters you carry at that spot, one inherited from each parent: identical letters (AA, GG) are homozygous, different letters (AG, CT) are heterozygous.
Is the AncestryDNA raw data file format the same as 23andMe?
It is very similar but not identical. Both are tab-separated text on the GRCh37 build, but AncestryDNA splits the genotype into two separate columns (allele1 and allele2) instead of 23andMe's single combined genotype column, and its column names (rsid, chromosome, position, allele1, allele2) appear as a plain header row rather than inside a leading "#"-comment block as in 23andMe. AncestryDNA also tests a slightly different and slightly larger set of about 700,000 SNPs. MyHeritage and FamilyTreeDNA typically use comma-separated (.csv) files. The underlying information (rsid, chromosome, position, genotype) is the same in every case.
What does the genotype column (the two letters) actually mean?
The genotype is the pair of nucleotide bases you carry at that position, one copy from your mother and one from your father. Each base is one of A, C, G or T. So "AG" means one A allele and one G allele. When both letters match (AA, GG) you are homozygous; when they differ (AG, CT) you are heterozygous. A genotype of "--" or "00" means the chip could not read that marker (a no-call). This per-marker allele information is what every ancestry, health and trait report is ultimately computed from.
What can I do with my 23andMe or AncestryDNA raw data file?
The raw file is portable, so you can re-analyse it elsewhere instead of buying a new test. You can upload it to health-report tools like Promethease, to genealogy databases like GEDmatch, or to Helixline for state- and community-level Indian ancestry. Helixline re-imputes the missing SNPs your chip never read and compares your genome against 7,600+ curated South Asian reference samples, returning the granular caste/region ancestry that "Broadly South Asian" on the original test could not. Upload-only ancestry analysis starts at $25 with results in about 7 days.
Now that you can read the file, see what it has been hiding: upload your raw data to Helixline, learn more about raw DNA data basics, or compare full options on the Helixline International page.