The Open Source Y-DNA Project
OPEN Y

Michael Cooley • michael.cooley@ysnp.info

There had been discussion for years within the genetic genealogy community that a hybrid tree should be developed and that all publicly downloadable data across the internet should be accumulated into a single dataset.

Open Y's Facebook presence began during the summer of 2024. Work on the database started in earnest in March 2025 by combining the publicly available data from FTDNA.com and YFull.com. The TheYTree (Chinese) database was added at the beginning of 2026. But please note that all companies (naturally) retain all rights to their data. I present here only the resultant data. Open Y is not intended to be a replacement database but rather a synthesis of existing databases. As such, it aims to expand the conversation over Y-DNA haplotrees. The database's biggest caveat, however, is that it has no means of evaluating the sequenced data. It relies on the integrity of each imported database.

START HERE

Lineage Look Up: Enter a terminal haplogroup, e.g., R-YP4491.

Group Look Up: Enter a haplogroup, e.g., R.

SNP Look Up: Enter a SNP name, e.g., YP4491.

The source data for each database can be downloaded as follows,

The current, bare-bones, plain-text Open Y tree can be viewed and downloaded as OYtree.txt. A zipped version is found at the Download Archive.

Description

Open Y uses a simple and straightforward design philosophy. It avoids graphical interfaces and displays by sticking to HTML and CGI (usually Perl code). Its data is stored in plain text files. Because Open Y is an open source project, much of the data is freely downloadable as well as human readable. In other words, the website fully complies with the K.I.S.S. philosophy. Although it's presently running on a low-resource server, data retrieval is generally fast and efficient.

The interface consists of three principal pages: path information, haplogroup information, and SNP information. A path is shown as a straightforward display from a terminal haplogroup to the tree's root. It can easily be pasted into any document. It's also fully interactive. Click on a haplogroup name to see all pertinent information. Users can display and download a selected haplogroup's vertical descendancy tree. Click on any SNP name to bring up its details.
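The path display can be pictured as a walk up a child-to-parent map. The following is only a minimal sketch, not Open Y's actual Perl code: the `parent_of` dictionary and the `path_to_root` function are hypothetical names, and the toy data is a fragment borrowed from the A-BY36130 example further down this page, so the walk stops at A-Y24715 rather than at the real root.

```python
def path_to_root(terminal, parent_of):
    """Return the haplogroups from `terminal` up to the top of the map."""
    path = [terminal]
    seen = {terminal}
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:  # guard against a loop in malformed data
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
        seen.add(parent)
    return path


# Toy fragment only (assumed layout: child -> parent).
parent_of = {
    "A-BY36130": "A-FGC88929",
    "A-FGC88929": "A-Y24715",
}

print(" > ".join(path_to_root("A-BY36130", parent_of)))
# prints: A-BY36130 > A-FGC88929 > A-Y24715
```

A flat child-to-parent map like this fits the plain-text storage described above, since each line of a text file can carry one child and its parent.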

The primary goal of the project is to provide a more complete Y chromosomal tree. For example, as of this writing, FTDNA has 105,208 branches or haplogroups in its tree. Open Y has 156,566. Daily stats are found at the Project Statistics page.

How It Works

The technical description of a SNP amounts to its position on the chromosome, its ancestral value (the nucleotide, one of the four genetic letters, found at that position in most men), and its derived value (the nucleotide to which it has mutated). That's really the whole picture. But the description isn't particularly convenient or human readable. To that end, the industry has long had a customary way of naming SNPs: generally a letter code representing the lab that first successfully sequenced the SNP, followed by an accession number, as in FGC18226. But once the field became commercialized, a host of companies jumped in and were quick to take credit for the discovery of SNPs. This resulted in a number of naming conflicts. (Thanks to YBrowse.org, SNPs are now regularly registered and double-naming is generally prevented.)

To reconcile all naming conflicts found among the databases, internal standards had to be adopted to allow for compatibility. There are over 200,000 SNPs having two or more names. Open Y has no choice but to select one name over another. For example, SNP 7102835 C T is represented by four names:

A27156, FGC93789, FTD95617, Y201096
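One way to picture a multiply-named SNP is as a single record keyed by position, ancestral value, and derived value, with every known name attached to it. The sketch below is illustrative only: the `Snp` class and `merge_names` function are hypothetical, and it assumes that "C T" above means ancestral C and derived T.

```python
from dataclasses import dataclass, field


@dataclass
class Snp:
    position: int
    ancestral: str
    derived: str
    names: set = field(default_factory=set)


def merge_names(records):
    """Group names by (position, ancestral, derived)."""
    merged = {}
    for position, ancestral, derived, name in records:
        key = (position, ancestral, derived)
        snp = merged.setdefault(key, Snp(position, ancestral, derived))
        snp.names.add(name)
    return merged


# The four names listed above, as they might arrive from different sources.
records = [
    (7102835, "C", "T", "FTD95617"),
    (7102835, "C", "T", "A27156"),
    (7102835, "C", "T", "FGC93789"),
    (7102835, "C", "T", "Y201096"),
]

merged = merge_names(records)
print(sorted(merged[(7102835, "C", "T")].names))
# prints: ['A27156', 'FGC93789', 'FTD95617', 'Y201096']
```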

Using any name ordinarily doesn't matter, except when it comes to merging. The names in each database need to be converted to the same name for the merged database to be usable. In this case, both FTDNA and TheYTree use FTD95617. However, TheYTree also uses A27156 in another path, which points out one of the weaknesses of TheYTree's database. The issue is resolved by Open Y's algorithm. For reasons discussed below, Open Y uses A27156. Likewise, it's selected as the name for haplogroup J-A27156. I would advise that the community at large begin to settle on one name over another. Doing so will aid all discussions across all platforms. To that end, a list of all multiply-named SNPs is available at the Download Archive. The list, like all Open Y data files, is updated every day.

Four methods are employed for resolving SNP names:

  1. If the same name is used in all three databases, it's preserved. Obviously, it's well-accepted in the community. All three databases, for example, are now using L448 ("The Young Scandinavian" SNP) rather than S200.

  2. As shown by the four SNP names above, the Open Y algorithm uses the name having the lowest accession number. This means that A27156 is used over FTD95617. But note that this process is open to discussion and possible reassessment.

  3. All appropriate SNPs/haplogroups are converted from their old style to the new. For example, "A" is converted to A-PR2921. Open Y draws from downloads/YF-translations.txt. The list isn't complete and includes only those instances that still survive at YFull and TheYTree. (Hint: that should change.)

  4. When the databases conflict over a haplogroup's parent, Open Y records the longest path from the terminal haplogroup to the MRCA (Most Recent Common Ancestor). For example, examining all three databases, we find this:

    A-BY36130 > A-FGC88929 > A-Y24715 (FTDNA)
    A-BY36130 > A-Y24715 (YFull)
    A-BY36130 > A-Y24715 (TheYTree)

    Note that A-BY36130 has another name, A-Y90375. (See Open Y's page for SNP BY36130.) It's used by YFull and TheYTree. But here's the point. FTDNA gives A-Y24715 as the parent of A-FGC88929. Because its path to the MRCA is longer, the algorithm selects A-FGC88929 as the parent of A-BY36130. The new haplogroup occurs because a new tester at FTDNA caused a split of the A-Y90375 haplogroup.

    Analyzing SNP paths may be Open Y's greatest strength.
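Methods 1, 2, and 4 above can be sketched in a few lines. This is a rough illustration under stated assumptions, not Open Y's actual algorithm: the function names and the shape of the inputs (`usage` mapping each database to the set of names it uses for one SNP, and paths listed child first) are hypothetical.

```python
import re


def accession_number(name):
    """Trailing number of a SNP name, e.g. 27156 for 'A27156'."""
    match = re.search(r"(\d+)$", name)
    return int(match.group(1)) if match else float("inf")


def choose_name(usage, aliases):
    """Pick one name for a SNP.

    usage:   database -> set of names that database uses for the SNP
    aliases: every known name for the SNP
    """
    used = list(usage.values())
    # Method 1: every database uses the same single name, so keep it.
    if used and len(used[0]) == 1 and all(s == used[0] for s in used):
        return next(iter(used[0]))
    # Method 2: otherwise take the lowest accession number.
    return min(aliases, key=lambda n: (accession_number(n), n))


def choose_path(paths):
    """Method 4: among conflicting paths to the common ancestor, keep the longest."""
    return max(paths, key=len)


print(choose_name(
    {"FTDNA": {"L448"}, "YFull": {"L448"}, "TheYTree": {"L448"}},
    {"L448", "S200"},
))  # prints: L448

print(choose_name(
    {"FTDNA": {"FTD95617"}, "TheYTree": {"FTD95617", "A27156"}},
    {"A27156", "FGC93789", "FTD95617", "Y201096"},
))  # prints: A27156

paths = [
    ["A-BY36130", "A-FGC88929", "A-Y24715"],  # FTDNA
    ["A-BY36130", "A-Y24715"],                # YFull
    ["A-BY36130", "A-Y24715"],                # TheYTree
]
print(choose_path(paths)[1])  # prints: A-FGC88929
```

Checking method 1 first matters in this sketch: S200 has a lower number than L448, yet L448 is kept because all three databases agree on it.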

Nevertheless, most SNP/haplogroup paths to the tree's root (A-PR2921 or "A" in the old style) are unchanged. Yet the merge has added a sizeable number of tree branches, as demonstrated above. In fact, the merge has resulted in about 50% more haplogroups over what FTDNA displays. That's huge.
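The "about 50%" figure can be checked against the branch counts quoted in the Description section. The variable names below are only for illustration, and the counts are the ones given "as of this writing," so the result will drift as the daily statistics change.

```python
ftdna_branches = 105_208   # FTDNA branches, as quoted on this page
open_y_branches = 156_566  # Open Y branches, as quoted on this page

increase = (open_y_branches - ftdna_branches) / ftdna_branches
print(f"{increase:.1%}")  # prints: 48.8%
```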

Open Y's Advantages

As it now stands, Open Y is useful for only those genealogists who have tested the Y chromosome. Generally, the testing website has sufficient information for them. But it is helpful for those who want to dig deeper into the structure of a comprehensive Y-DNA haplogroup tree or to understand the complexities involved in the differing nomenclature and paths among the several databases.

But Open Y can be very useful for group administrators. For one, it displays paths that can easily be pasted into custom-designed trees. I use the database to create reports for some of the FTDNA groups I admin at dna.ancestraldata.com/groups/. (Look especially at R1a-YP4248 Subclade Project.) And with a greater number of haplogroups comes a greater number of haplogroups under each basal (first letter) designation. This can make it highly useful to haplogroup admins. For example, as of today, FTDNA has 1,605 Q haplogroups but Open Y has 2,732. See the Basal Haplogroup Report for a full list.
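A per-letter tally like the one in the Basal Haplogroup Report could be produced along these lines. This is a sketch under assumptions: `basal_counts` is a hypothetical name, the basal designation is taken to be whatever precedes the hyphen (the first letter for the names used on this page), and the sample list is just a handful of haplogroups mentioned here, not real totals.

```python
from collections import Counter


def basal_counts(haplogroups):
    """Count haplogroups by the designation before the hyphen."""
    return Counter(name.split("-", 1)[0] for name in haplogroups)


sample = ["R-YP4491", "R-L448", "J-A27156", "A-PR2921", "A-BY36130"]
counts = basal_counts(sample)
print(counts["R"], counts["A"], counts["J"])  # prints: 2 2 1
```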

"I'd characterize Open Y as a federated phylogenetic layer rather than a competing tree." — ChatGPT

I couldn't have said it better. Of course, ChatGPT isn't a person. It knows only what is fed to it. Open Y is still young and has received no actual reviews. So, I've decided to post to my blog a fuller discussion I had with it on Sunday, May 31, 2026, "I Asked ChatGPT about the Open Source Y-DNA Project." It compares Open Y with FTDNA and YFull and presents, I think, a commendable and fair evaluation. Similar to the above, it begins, "My evaluation of Open Y is that it is one of the most useful independent Y-DNA phylogeny tools currently available, but it should be treated as a tree synthesis and research platform, not as an authoritative source in the same sense as FTDNA or YFull."

Open Y does not intend to replace any entity nor to compete with anyone. After all, it's not a testing platform, uses no original data, processes no BAM files (although it could), and has no "sequence-level quality control." Further, the above "chat" defines its weaknesses well, some of which are being worked on. The ChatGPT discussion even indicates which sites provide the fullest data on specific aspects of the haplotree. I agree. Indeed, ChatGPT's evaluation has given me some ideas.

Open Y's Shortcomings

Recurrent SNPs must first be discussed. They're SNPs found in multiple locations in a database. That's fine. SNP mutations are random and can appear at any time in any tester despite their prior human-made haplogroup designations. They're wholly legit as they're generated from fully sequenced data. At present, FTDNA has 31,725 recurrent SNPs. But due to the merge, a great many accidental recurrent SNPs appear in Open Y that have no bearing on genetic fact. (Open Y lacks the raw sequenced data from which to make accurate determinations.) Still, it's possible to sort them out to a degree. Much of that work, however, has been set aside due to its complexity. Instead, Open Y presents a brief description whenever they show up on a haplogroup page.
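Finding SNPs that sit in more than one place in a tree is simple in principle; telling genuine recurrences from accidents of the merge is the hard part. The sketch below covers only the first step. The `find_recurrent` function and the `branch_snps` layout are hypothetical, and the haplogroup and SNP names are placeholders, not real data.

```python
from collections import defaultdict


def find_recurrent(branch_snps):
    """Return SNPs that appear on more than one haplogroup.

    branch_snps: haplogroup -> iterable of SNP names assigned to it
    """
    locations = defaultdict(set)
    for haplogroup, snps in branch_snps.items():
        for snp in snps:
            locations[snp].add(haplogroup)
    return {
        snp: sorted(groups)
        for snp, groups in locations.items()
        if len(groups) > 1
    }


# Placeholder names for illustration only.
branch_snps = {
    "HG-1": ["SNP1", "SNP2"],
    "HG-2": ["SNP2", "SNP3"],
    "HG-3": ["SNP4"],
}
print(find_recurrent(branch_snps))  # prints: {'SNP2': ['HG-1', 'HG-2']}
```

A list like this cannot say which recurrences are real. As noted above, that would take the raw sequenced data.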

Because of this problem, a timeline has not yet been put into place. Successful genetic timeline calculations depend on the number of SNPs present, and the number of Open Y SNPs is likely overestimated. Once the recurrent SNPs have been sorted out, a timeline will be added. Note, however, that the depth of a haplogroup is shown on the Open Y haplogroup pages. A haplogroup's depth is the interval between the creation of its first SNP and its last. See, for example, R-L448. Open Y is temporarily using an average mutation rate of one SNP every 96 years. (It's actually calculated daily.) That rate is established by dividing the total SNP count by the number of haplogroup branches subtracted from the current presumed age of the root, A-PR2921, of 232,000 BCE (per FTDNA). Other estimates place the age as high as 300,000 years. Because R-L448 has four variant SNPs, the multiplication results in a depth of 490 years.

The current problem with recurrent SNPs and timelines will eventually be worked out.

Another shortcoming is the lack of haplogroup geographic locations. This can be a major annoyance for users since we all want to know where we came from. Only FTDNA's downloadable files contain that information. However, it is generally derived from user-supplied data rather than archaeologically derived data points. Open Y is considering using AI for that information. The needed work hasn't yet commenced.

Open Y Terminal Haplogroup Matching System

Registration for the upcoming Open Y Terminal Haplogroup Matching System opened on December 12, 2025! But a great deal more work needs to be done before it's fully functional. There are presently a mere 9 members!