Thursday, March 19, 2015

Triangulated Small Segments are Identical by Descent

   Autosomal DNA segment matching is a complex issue.  Through testing and observation, it is obvious that some segment matches are false positives.  Computer algorithms will detect any matching allele with no knowledge that the allele is of paternal or maternal origin.


   If we said that the left columns are from the father’s side and the right from the mother’s, we would see that none of the columns match.  Obviously, we can’t just draw a line down the middle and say one side is the mother’s DNA.  To determine which DNA came from mom and which came from dad, the autosomal results would need to be phased.  To phase the results of an autosomal sample, it must be compared to at least one parent’s result.  By difference, the child’s result can be split into its paternal and maternal contributions. 


   If it were possible to phase every sample to be matched, false positives by computer algorithm would be eliminated.  Unfortunately, phasing every sample is not always possible.  A person’s parents may be deceased or even unknown.

   Another method of reducing or eliminating false positives is to triangulate each matching segment.  If a segment from autosomal sample A matches the corresponding segment from sample B and sample B matches sample C and sample C matches the original sample A, then the segment is considered triangulated and identical by descent.  How confident are we that the triangulated matches aren’t just a circular series of false positives? 
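   The A-B, B-C, C-A check described above can be sketched in a few lines of Python.  This is only an illustration under assumptions of mine: the kit names, the positions and the `matches` layout are hypothetical, and a real comparison would also need to confirm the kits match on the same chromosome.

```python
from itertools import combinations

def overlap(seg_a, seg_b):
    """Return the shared (start, end) of two segments, or None if they do not overlap."""
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    return (start, end) if start < end else None

def triangulated_region(matches, kits):
    """Return the region on which every pair of kits matches, or None.

    matches maps frozenset({kit1, kit2}) -> (start, end) for one chromosome.
    The layout is a hypothetical one chosen for this sketch.
    """
    region = None
    for pair in combinations(kits, 2):
        seg = matches.get(frozenset(pair))
        if seg is None:
            return None          # one pair does not match at all
        region = seg if region is None else overlap(region, seg)
        if region is None:
            return None          # the pairwise matches do not share a region
    return region

# Hypothetical pairwise matches between three kits on one chromosome.
matches = {
    frozenset({"A", "B"}): (100, 500),
    frozenset({"B", "C"}): (200, 600),
    frozenset({"A", "C"}): (150, 450),
}
shared = triangulated_region(matches, ["A", "B", "C"])
```

   The same function accepts more than three kits, in which case every pair among them must match on a common region.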

   Let’s look at a segment on chromosome 3 that starts at rs6796502 and is 2.5 cM and 946 SNPs long.  For this exercise, any chromosome segment could be used. 

Table 1.  Allele frequencies of 20 loci on chromosome 3.
   On that segment, there are 20 published locations with allele frequencies (NCBI).  Table 1 shows how often a certain allele combination (AA, AC, AG, etc.) appears in a European population.  Based on allele frequency, the most common combination of alleles in this section of chromosome 3 for a population of European descent is listed in Table 2.  I have artificially selected the most common combination to simulate a large portion of the population with European descent.  About 1 in 3,400, or about 300,000 people, should have this combination. 

Table 2.  Predicted allele combination.
   Imagine for a moment that you roll six dice.  The first die comes up with a one and the second is a two and so on.  The probability of rolling a one on the first die is 1/6 (one side up on a six-sided die).  The probability of rolling a one and then a two is 1/6 times 1/6 or 1/36.  On average, it will happen once every 36 rolls.  The combination illustrated on six dice would happen once in every 46,656 rolls.  Now imagine that is your DNA and we are looking for a match.  The other person would need one through six in the same order.  To calculate that probability we multiply 46,656 by 46,656 and get odds of 1 in 2,176,782,336.  DNA matching actually has a better probability of matching.
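   The dice arithmetic can be checked with exact fractions.  This is a small sketch; the function name is mine and it assumes fair, independent dice.

```python
from fractions import Fraction

def sequence_probability(n_dice, sides=6):
    """Probability of one specific ordered outcome on n fair dice."""
    return Fraction(1, sides) ** n_dice

two_dice = sequence_probability(2)     # a one, then a two
six_dice = sequence_probability(6)     # one through six in order
both_people = six_dice * six_dice      # the post's 46,656 times 46,656

print(two_dice, six_dice, both_people)
```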


   Table 3 lists the most common alleles again along with potential alleles that would generate a half match and the corresponding summed frequency.  The probability of the set of 20 potential combinations existing is equal to the product of the frequencies, 0.759.  This probability has to be extrapolated from 20 loci to 946, giving us 2.45 x 10^-6 or about 1 in 400,000.  There is a 1 in 400,000 chance of a completely random match on this section of chromosome 3 for the alleles with the highest frequency.  It is well within reason to expect false positives for this one-to-one match.

Table 3.  Probability of a half match within a European population.
   In the event of a three-way match (triangulation), we multiply by 2.45 x 10^-6 again, giving us a probability of 1 in 167 billion.  Now we are outside of what is statistically reasonable.

   The most common set of European alleles doesn't produce the highest probability of a random match.  When the alleles are not the same (AC, AG, CT etc.), there is a higher chance of an autosomal half match.  Table 4 shows an actual set of alleles and the corresponding set of alleles to generate a half match.

Table 4.  Probability of a half match within a European population using actual sample.
   This actual sample takes us from a false positive probability of 1 in 400,000 to 1 in 5,900 (0.000169).  A probability of 1 in 5,900 indicates that we should be seeing completely random matches that have no genetic relationship on a regular basis.  Considering a population of about 1.6 million autosomal tests taken, each of us would have 270 false positive matches on a segment similar to the one shown.     

   Triangulated matches exist for this segment of chromosome 3.  For the probability of this triangulated segment, we multiply by 0.000169 again, giving us 2.87 x 10^-8 or about 1 in 35 million.  Considering the number of results available for matching (about 1.6 million), it is not realistic that we are matching randomly.  In fact, most triangulated matches involve more than three test results.  If four test results are triangulated, the probability goes to 1 in 205 billion.  These probabilities indicate that triangulated results cannot be random and are matching due to common genetic descent.
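   The multiplications in this post can be reproduced with a short sketch.  It follows the post's own arithmetic (one extra factor of the single-match probability for each kit beyond the first two) and treats the matches as independent, which is the assumption the post makes.  The function names are mine, and because the inputs are the rounded figures quoted in the text, the results land close to, not exactly on, the published numbers (roughly 1 in 207 billion rather than 1 in 205 billion for four kits).

```python
def triangulated_random_probability(p_single, kits):
    """Chance that `kits` results all match by coincidence.

    Mirrors the post: two kits -> p, three kits -> p * p, four -> p ** 3.
    Assumes each additional match is independent of the others.
    """
    return p_single ** (kits - 1)

def expected_false_positives(p_single, database_size):
    """Expected number of coincidental one-to-one matches in a database."""
    return p_single * database_size

P_COMMON = 2.45e-6     # most common European alleles (Table 3)
P_ACTUAL = 0.000169    # actual sample (Table 4)
DATABASE = 1_600_000   # approximate number of autosomal tests taken

print(expected_false_positives(P_ACTUAL, DATABASE))        # about 270
print(1 / triangulated_random_probability(P_COMMON, 3))    # about 167 billion
print(1 / triangulated_random_probability(P_ACTUAL, 3))    # about 35 million
print(1 / triangulated_random_probability(P_ACTUAL, 4))    # about 207 billion
```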

   I have intentionally used two examples that have a higher probability of having false positive matches.  As soon as we look at matches that don’t have the higher frequency European alleles, the probability of a false positive diminishes. 

Table 5.  Probability of a half match within a European population with a Mediterranean sub-component.
   Table 5 shows a typical set of alleles.  There are two alleles at rs7630053 and rs4558783 that are not typically European and may indicate a Mediterranean ethnicity.  The probability of a one-to-one match on this segment being a false positive calculates to be 1 in 7 quadrillion. 

   Currently, we cannot examine the allele frequency for every SNP in every match we attempt.  When looking for autosomal matches consider phasing or triangulation.  Phasing the data is very valuable, yet the resources are not always available.  I’ve shown that triangulation eliminates false positives and those matches are statistically identical by descent.  Triangulated small segment matching is very valuable in our research.



References:

Maglio, MR (2015) Autosomal DNA and the Triangulation of Small Segments:  A Statistical Approach (Link)

© 2015 Michael Maglio and OriginsDNA.  All Rights Reserved. 

Friday, March 6, 2015

Press Release: Michael Maglio and OriginsConnector Launch Breakthrough Autosomal DNA Service

OriginsConnector is an autosomal DNA service that can connect you to your distant ancestors between 10 and 14 generations ago. 

Boston, MA - 6th Mar, 2015

Autosomal DNA testing has become extremely popular for use within family history.  Major companies, including 23andMe, AncestryDNA and Family Tree DNA have tested nearly 2 million people.  Autosomal DNA is being used every day to find cousins, determine ethnicity and understand health factors.


Typically, autosomal DNA can be reliably analyzed to generate relationship matches only as far back as 5th to 6th cousins.  Michael Maglio and his company OriginsConnector want to change that.  “We’re breaking through the generation barrier by connecting descendants to their distant ancestors using autosomal DNA,” said Maglio.

New research shows that small portions of our distant ancestors’ DNA still exist in our chromosomes.  Maglio has analyzed and catalogued the autosomal DNA for over 700 “gateway” ancestors.  A “gateway” ancestor is the Holy Grail for a family historian.  Once a connection to “gateway” ancestors has been proven, it opens up relationships to royalty and Old World history.

Maglio has launched OriginsConnector to analyze and compare a person’s existing autosomal DNA results to the growing catalogue of historic ancestors.  There are plans to expand the catalogue to include ancestors associated with the Salem Witch Trials, Signers of the Declaration of Independence and other requested ancestors.

At OriginsConnector, a single Ancestor match is $19, four matches are $49 and the complete catalogue of matches is available for a one-year subscription of $99.  Those opting for the one-year service are automatically compared to new Ancestors as they are added to the catalogue.  The subscription option is great for folks who are adopted or who have genealogical brick walls.

To find out more about the science behind the breakthrough, please visit Academia.edu (http://Independent.Academia.edu/MikeMaglio) or OriginsDNA (http://originsdna.blogspot.com/2015/03/breaking-through-autosomal-dna.html)

The official website of this service is http://www.OriginsConnector.com

Michael Maglio can be contacted at MMaglio@OriginsConnector.com


About Michael Maglio:


Michael Maglio is a professional genetic genealogist, writer, and speaker. He graduated from Northeastern University with a B.S. in Chemical Engineering and has spent his career developing collaborative technologies. As a genetic genealogist, Mike advises on the use of DNA as a tool for genealogy through his site OriginsDNA.com. His focus combines science and history to unravel ancestral identities and genetic migrations. 


Thursday, March 5, 2015

Breaking Through the Autosomal DNA Generation Barrier: Connecting to Distant Ancestors

   There has been much debate over the use of small autosomal DNA segments.  It is important to understand where they come from and how they can be used for genetic genealogy.  Small segments are considered noise and false matches.  There are too many small matches to make sense out of, but they are not necessarily false matches.  These segments have been in the population for longer than we thought.  When I match someone at 2 cM it is very likely that they are a 12th cousin, not a 5th cousin.  There is no reason for us to look for small segment matches until we understand where these segments originated.

   When we talk about autosomal DNA, we often oversimplify the process of genetic inheritance.  The simple answer is that we inherit half of our DNA from dad and half from mom.  The common message is that with every generation the DNA contribution from an ancestor is randomized and reduced until it is insignificant.  Genetic inheritance is actually much more complex than that.  Complex in a great way.  There is a tremendous amount of ancestral information that we are just beginning to tap into.

   We inherit DNA from our parents and their ancestors in large sections.  Take a look at the graphic below.  Each example is the comparison of a grandchild to a set of paternal grandparents.  You can see in the first example that the grandchild inherited over two-thirds of their grandfather’s first chromosome intact (blue bars).  The remaining section of the first chromosome is from their grandmother.  In the third example, the grandchild has inherited the entire chromosome 14 from their grandmother.  It is physically possible that this grandchild could someday give one of their children the grandmother’s complete chromosome 14.  


In an effort not to oversimplify, this is just half the story.  That grandchild has an equal contribution from their maternal grandparents. 

   In the examples above, we can visualize what happens when DNA recombines.  The first example shows where one section of the grandfather’s DNA swapped places with the grandmother’s DNA before it was inherited by the grandchild.  This is called crossover.  In the examples, a) is a single crossover, b) is a double crossover and c) has no crossover.  On average, each of our chromosomes experienced 2 or 3 crossovers before we inherited them.

   Where DNA crossover takes place on a chromosome is not random.  There are approximate locations where the chromosome is more likely to split.  These locations are cleavage sites. 


These locations exist because there are groups of genes along a chromosome that have a tendency to stay together.  These groups are part of gene linkage.  These linked genes only allow for chromosome splits at either end of their linked section.  In my research, the minimum size for one of these gene-linked sections is about 2.5 cM.  These small segments then travel in larger groups.


   In the graphic above, the blue bar represents about a 60 cM match.  The intersection between the black and orange ovals is about 2.5 cM and represents a minimum segment.  In this crossover recombination, the large segment actually split to the right of the minimum segment.  In a future crossover, the chromosome could split on the left side of the minimum segment, giving a large segment bound by the orange oval.

   Why are these minimum segments important?  My research shows that these segments stay in the gene pool for dozens of generations.  Over time, naturally occurring SNP mutations take place.  These minimum inherited segments (MIS) can be differentiated into family groups.

   In my research, I started with 28 well known US colonial surnames and 393 autosomal kits.  For each surname, the associated kits were triangulated.  If three or more kits match on the same segment, you can deduce that it came from a common ancestor.  Each of the surnames investigated had 6 to 13 distinct triangulated segments.  Taken together, these triangulated ancestral segments represent an autosomal haplotype that can be used to identify a descendant’s genetic connection to an ancestor.  Across all of the surnames, these distinct segments appear at recurring locations on each chromosome.  I have listed 21 of these ancestral loci in my paper.

   Not all ancestral segments are the same type.  The segments can be categorized into three groups.  The first category is Common to All.  The surnames in this study are predominantly European.  One segment has been identified on chromosome 2 that triangulates across all surnames.  This segment correlates to a Western Atlantic ethnicity and I call it the Western Atlantic Autosomal Haplotype (WAAH).  The Western Atlantic Autosomal Haplotype should not be confused with ancestry informative markers (AIMs).  The WAAH is composed of about 800 SNPs and there are only about 100 AIM SNPs in that same stretch of chromosome 2.

   The next category is Shared.  Some segments can be attributed to two or more surnames.  There was considerable intermarriage between US colonial families.  That period was a bottleneck genealogically and genetically.  As two major families married, their combined DNA segments entered the gene pool and were reinforced as their descendants intermarried. 

   The third category is Unique.  These shared segments cannot be attributed to intermarriage of families.  Yet the resulting familial autosomal haplotypes are not composed of a single surname.  In the case of Benjamin Franklin, the genetic proximity to his wife, Deborah Read and his mother, Abiah Folger, may make it impossible to distinguish between Folger, Franklin and Read DNA.  Therefore, the haplotype represents the combined inheritance.  

   Here is one of my case studies.  Augustine Bearse was born in England in 1618 and died in Barnstable, MA before 1697.  The Bearse family was chosen due to my familiarity with the genealogy and the debate surrounding Augustine’s wife.  His wife Mary was supposedly the granddaughter of the Chief of the Cape Cod Native American tribes.  The goal was twofold: to identify the autosomal haplotype for the Bearse family and to determine whether any of the ancestral segments had Native American ethnicity.

   The Bearse study was composed of 48 autosomal samples.  These samples were collected based on claimed genealogical connections.  The triangulated samples generated 8 ancestral loci and indicated an additional 5 loci that had the potential to triangulate with more samples.  The resulting Bearse autosomal haplotype is found below.

Bearse Autosomal Haplotype

   The Bearse haplotype contains the Western Atlantic Autosomal Haplotype (chromosome 2) which is common to all haplotypes in the study.  The other 12 loci are more valuable for genealogical validation.  One of the Bearse descendants triangulates on six of the ancestral segments.  It is highly unlikely that a descendant would match on all of the segments.  Although ancestral segments survive over the generations, the randomness of their distribution makes it difficult for any one person to have received them all.  Yet, triangulating on just one segment unique to Bearse is enough to indicate and validate a relationship.  Lack of a match could mean that an ancestral segment was not inherited or that a non-familial event (adoption, infidelity, etc.) has occurred and the individual’s family tree is incorrect.

   In order to investigate the origins of Augustine’s wife Mary, each ancestry segment from the haplotype was evaluated for ethnicity.  Only the segment on chromosome six at location 55850885 had any Native American ethnicity.  This ancestral segment had not fully triangulated, yet a few of the samples match exactly on Native American SNPs.  With additional samples, the segment could triangulate.  Once validated, the segment might be shared across multiple surnames or unique to Bearse, indicating Native American genes in the Bearse descendants.

   While the amount of autosomal DNA received by each successive generation is only half from each parent, that does not mean that given enough generations a distant ancestor’s genetic contribution will become negligible.  Through genetic linkage, portions of DNA are inherited intact.  Naturally occurring cleavage sites allow for ancestral segments averaging 2.5 cM to be passed from generation to generation as a minimum inherited segment (MIS). 

   Ancestral segment analysis is invaluable for the identification of distant ancestors.  All of the triangulated ancestral locations combine to become a Familial Autosomal Haplotype (FAH) that can be used to validate family history.

   Since finishing my initial research, I have gone on to identify over 50 ancestral loci and over 700 autosomal haplotypes for US colonial ancestors.  Stay tuned for further advances in autosomal research.

References:

Maglio, MR (2015) Minimum Inherited DNA Segment Size and the Introduction of Familial Autosomal Haplotypes (Link)


© 2015 Michael Maglio and OriginsConnector.  All Rights Reserved.


Monday, February 2, 2015

Before They Sailed: Mayflower DNA


Please share the details about this first-of-its-kind book that will identify the DNA and trace the genetic ancestry of twenty families that sailed aboard the Mayflower. 


The story behind the story….

Who were the Mayflower passengers before they were pilgrims? Where did they come from? England has a long history of migrations and invasions. Were the Pilgrims’ ancestors Anglo-Saxons, Normans or Vikings? This book will use traditional genealogy and DNA to answer those questions and more.

The DNA of the Mayflower descendants will lead us on a path of discovery that will first allow us to confirm the genetic identity of each Pilgrim and then open a door to the past, before they sailed.

Following the DNA

The DNA falls into three Mayflower categories: descendants, genetic cousins and genetic ancestors. These categories will be defined in more detail in the book. The descendants' DNA gives us the ability to confirm the genetic identity of each Pilgrim and gives us deep ethnicity. The identity allows us to find genetic cousins in England, giving us the location of each Pilgrim's ancestral city and village. When we expand the search for cousins beyond England, we get a view of genetic ancestors, tribes and migrations across Europe.

Mayflower Surnames

The book will identify the DNA and trace the genetic ancestry of the following Mayflower families: Alden, Billington, Bradford, Brewster, Chilton, Cooke, Doty, Eaton, Fuller, Hopkins, Howland, Mullins, Rogers, Samson, Soule, Standish, Tilley, Warren, White and Winslow.

Looking for Descendants

For this project to be a success, I'm looking for both direct paternal line and maternal line descendants of the Pilgrims. If you have a solid genealogy back to the Mayflower, I would like to include your story. It would be great if you have already had your Y-DNA or mitochondrial DNA tested.

Contents

In addition to Mayflower ancestry, the book will illustrate the basics of genetic genealogy. How can we tell if two people are related? Where does the ethnicity come from? How do we know where someone's ancestors lived 500, 1,000 or 2,000 years ago? Why do the Anglo-Saxons, Celts, Normans and Vikings matter in a discussion about Pilgrims?

Fieldnotes

There will be a tremendous amount of data and information that gets generated during the research phase of this book. All of the info is valuable, but not all of it will make it into the book. The "members-only" Fieldnotes section of the companion website will contain all of the background and detail the progress for those who want to follow along. Even after the book is published, new information will be added as more is learned about the Mayflower Pilgrims and their DNA.

What will be produced?

This full-color book will be produced in a digital format, a paperback edition and a limited hardcover edition. There will be a companion website with behind the scenes details exclusive to members. A presentation for genealogy and history conferences is also planned.



Wednesday, January 28, 2015

Ghosts of DNA Past: Irish Kings

   In 2006, Laoise T. Moore and the folks at Trinity College in Dublin published a paper famous for identifying the modal haplotype of Irish High King Niall of the Nine Hostages.  In their work, they used seventeen Y-DNA STR markers.  While time to most recent common ancestor (TMRCA) calculations have accuracy issues, having only 17 markers gives a common ancestor over 2,000 years ago.   What the Trinity folks really accomplished was the identification of Niall’s paternal ancestor from over 400 years earlier.  The media in 2006 had a field day in their interpretation that most of Ireland is descended from Niall.  “Niall may be the most prolific male in Irish history.”  Also at 17 markers, there is a very high probability of convergence.  Through normal mutations, haplotypes can change over time to appear similar or identical to other haplotypes.  The lower the number of markers, the higher the chance of convergence.  At that time only high level SNPs were tested to determine haplogroup.  Without terminal SNPs it would have been impossible to recognize convergence, if it existed in the samples.

   In my research on the Kings of Ireland, I have used 67 markers to reduce the chance of convergence and to calculate the age of common ancestors on the descendant side of the target rather than the ancestor side.  I will demonstrate traditional median-joining networks and novel “tribal” markers for the identification of four historic Kings of Ireland.  Did Trinity get Niall’s haplotype correct with the limited data they had at the time?

Ghost:  a manifestation of a dead person

Modal haplotype:  a derived haplotype based on the DNA tests of a group of people

   A modal haplotype is a ghost of a person.  When we look at multiple DNA test results and calculate the mode, by definition we are just taking the values that appear most often.  There is no way to determine if the modal haplotype is the actual haplotype of the historic individual we are researching (short of historic samples).  While the modal is not perfect, it will be close enough at 67 markers for us to determine the genetic “ghost”.
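   Taking the mode marker by marker is simple to sketch.  The sample values below are made up for illustration, and the function name is mine; when two values tie at a marker, this version keeps whichever value appeared first in the data.

```python
from collections import Counter

def modal_haplotype(haplotypes):
    """Return the most frequent repeat count at each marker position.

    haplotypes is a list of equal-length lists of STR repeat counts,
    one list per tested person, with markers in the same order.
    """
    modal = []
    for values_at_marker in zip(*haplotypes):
        most_common_value, _count = Counter(values_at_marker).most_common(1)[0]
        modal.append(most_common_value)
    return modal

# Three made-up results over three markers.
samples = [
    [13, 24, 14],
    [13, 25, 14],
    [12, 24, 14],
]
ghost = modal_haplotype(samples)
```

   Note that the resulting modal need not equal any single person's result, which is the sense in which it is a "ghost".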

   The septs of Ireland provide us an opportunity to develop genetic genealogy techniques and processes.  Irish surnames are typically patronymic.  The surnames generally take the form of Mac Cárthaigh (McCarthy), meaning son of Cárthaigh or Ui Néill (O’Neill), meaning grandson / descendant of Néill.  Irish septs serve as a collective of related families with shared ancestry and patronymic surnames.  Multiple septs then belong to larger dynasties such as the Eóganachta and the Dál gCais.

   If septs are patrilineal, then Y-DNA haplotypes should be consistent across sept surnames.  Research on the Uí Néill haplotype started with a geographical selection and then a subsequent reduction by sept surnames (Moore et al 2006).  For each target sept, affiliated surnames were identified.  In the case of Uí Néill, the following surnames and associated Y-DNA STR records were accessed from Family Tree DNA projects: O’Neill, Gallagher, Doherty and O’Donnell.  The selection includes 600 records and 5 common European haplogroups.

   Median-joining networks have been in use for over a decade for the visualization of genetic relationships.  The use of them at 67 STR markers has been rare, but it should be the norm.  This first image has the central cluster of a median-joining network based on 25 STR markers from the Uí Néill group.  It is just a single cluster with no differentiation.



Figure 1 - Using only 25 STR markers, the Uí Néill network collapses to a single cluster.

When we look at the same group using 67 markers, we get four distinct clusters, each with its own SNP.  The cluster at the far right is predominantly R-L159 and the cluster at the lower right has R-P311/R-L151 nodes.  The cluster at the left contains all of the Uí Néill dynastic surnames, has the majority of nodes and is SNP R-M222, which is consistent with earlier studies.


Figure 2 - View of the Uí Néill network torso showing four distinct clusters.  Three groups on the right are O’Neill only.

As a double check to make sure that I wasn’t seeing some other phenomenon, I analyzed three random Irish surnames: Duffy, Kelly and McCormick.  The random sample produced over ten unique clusters with no surname overlap.  This comparison shows that septs are patrilineal and that Y-DNA haplotypes are consistent across sept surnames. 

Figure 3 - Median-joining network of Y-DNA sampled from three random Irish surnames: Duffy, Kelly and McCormick.  

Re-evaluating the Uí Néill data also shows that Trinity was correct in their identification of a 17-marker Uí Néill haplotype.  New data and new techniques allow us to produce a 67-marker haplotype.


Figure 4 - Sixty-seven STR Uí Néill Modal Haplotype (Niall of the Nine Hostages).

   A different technique that I’d like to illustrate involves the fact that not all STR markers are created equal.  This method takes advantage of “slow” mutating STR markers.  Each marker has its own mutation rate.  By selecting the 15 “slowest” markers with an average mutation rate of 0.00024, a virtual tribal haplotype is created that would be stable within the last 2,000 years (90% probability of 80 generations).  That average rate is an order of magnitude lower than the average rate of 0.0029 used as a constant in typical TMRCA calculations.  The “tribal” markers isolated are DYS426, DYS388, DYS392, DYS455, DYS454, DYS578, DYS590, DYS641, DYS472, DYS594, DYS436, DYS490, DYS450 and DYS640.

   To manipulate the “tribal” haplotype of 15 microsatellites faster, the resulting values are concatenated into a string, e.g. 12121411119168108101212811.  The “tribal” haplotypes are summarized per surname and plotted to illustrate majority and affinity.
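   Here is a minimal sketch of the concatenate-and-summarize step.  The split of the example string into 15 individual values is my assumption (it happens to reproduce the string), and the surname records are invented for illustration.

```python
from collections import Counter

def tribal_key(values):
    """Concatenate the repeat counts of the slow markers into one string."""
    return "".join(str(v) for v in values)

# One way to read the example string as 15 values (an assumption).
example_values = [12, 12, 14, 11, 11, 9, 16, 8, 10, 8, 10, 12, 12, 8, 11]
example_key = tribal_key(example_values)

# Invented records: (surname, values of the slow markers).
variant_values = example_values[:-1] + [12]
records = [
    ("O'Neill", example_values),
    ("Gallagher", example_values),
    ("Doherty", example_values),
    ("O'Neill", variant_values),
]

# Count how often each "tribal" haplotype appears per surname and overall.
per_surname = Counter((surname, tribal_key(v)) for surname, v in records)
overall = Counter(tribal_key(v) for _surname, v in records)
dominant_key, dominant_count = overall.most_common(1)[0]
```

   Because the values are joined without a separator, two different haplotypes could in principle produce the same string when one-digit and two-digit values sit next to each other.  Joining with a character such as "-" would avoid that at the cost of a longer key.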


Figure 5 - Uí Néill dynastic haplotypes converted into 15 marker “tribal” haplotypes and summarized.

   The Uí Néill dataset resolved into 37 unique “tribal” haplotypes.  Figure 5 shows that haplotype 12121411119168108101212811 is the most dominant across the Uí Néill surnames.  As with the median-joining network analysis, this “tribal” haplotype is consistent with SNP R-M222. 

   I repeated these two techniques for the Uí Briúin sept using the following surnames and associated Y-DNA records: O’Brien, Hogan, Kennedy and McMahon.  The selection includes 615 records.  The Mac Cárthaigh dataset has the following surnames: McCarthy, Callaghan, Donovan and Sullivan.  The selection includes 319 records.  The Ua Conchobhair data has the following surnames: O’Connor, McManus, Reilly and Rourke.  The selection includes 352 records.

For more details, see my paper at Academia.edu.



Figure 6 - Sixty-seven STR Uí Briúin Modal Haplotype (Brian Boru).


Figure 7 - Sixty-seven STR Mac Cárthaigh Modal Haplotype (McCarthy Eoganachta Kings).



Figure 8 - Sixty-seven STR Ua Conchobhair Modal Haplotype (Last High King Roderick O'Connor).


   Here are a couple of interesting insights from my research.  Niall Noígíallach was High King of Ireland around 378 CE and founder of the Uí Néill dynasty.  Historically, his half-brother Brión was one of the founders of the Connachta dynasty and an ancestor of the last High King of Ireland, Ruaidrí Ua Conchobair.  If their genealogies are correct, the evidence is in their descendants’ DNA.  The data shows that Uí Néill and Ua Conchobair share the same SNP, R-M222.  The Uí Néill and Ua Conchobair modals are a 6-step match at 67 markers.  There is a 99% probability of a relationship not further than 1,260 years ago.  The results make a strong case for the validity of this historic genealogy.

   Brian Boru, High King of Ireland in 1002 CE, belonged to the Dál gCais dynasty and Tadhg Mac Cárthaigh, the first King of Desmond, belonged to the Eóganachta dynasty.  Ancient genealogies have the Eóganachta and Dál gCais dynasties descended from Ailill Aulom, the son-in-law of legendary king Conn of the Hundred Battles.  The Mac Cárthaighs and Uí Briúins do not share the same SNP (R-L226 vs. R-CTS4466), but by descent they would share a common R-DF13 ancestor.  The Mac Cárthaigh and Uí Briúin modals are an 11-step match at 67 markers.  There is a 99% probability of a relationship not further than 1,920 years ago.  This puts a Mac Cárthaigh-Uí Briúin common ancestor as a contemporary of the legendary Conn.
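   The "step" counts quoted for these modal comparisons can be illustrated with a simple stepwise distance: the sum of the absolute differences in repeat count at each marker.  This is a simplified sketch with made-up values.  Testing companies treat some multi-copy and fast-changing markers differently, so the counts it produces will not always agree with theirs, and it does not attempt the TMRCA probability calculation.

```python
def stepwise_distance(haplotype_a, haplotype_b):
    """Sum of absolute repeat-count differences across shared markers.

    Both inputs are lists of STR repeat counts in the same marker order.
    """
    if len(haplotype_a) != len(haplotype_b):
        raise ValueError("haplotypes must cover the same markers")
    return sum(abs(a - b) for a, b in zip(haplotype_a, haplotype_b))

# Two made-up modal haplotypes over five markers.
modal_one = [13, 25, 14, 11, 12]
modal_two = [13, 24, 14, 12, 14]
steps = stepwise_distance(modal_one, modal_two)
```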

   New and improved genetic genealogy techniques are invaluable for the identification of historic individuals and the reconstruction of distant family trees at the macro level.

Reference:


Maglio, MR (2015) Identifying Y-Chromosome Dynastic Haplotypes: The High Kings of Ireland Revisited (Link)

Monday, December 8, 2014

Atrocities and Assimilation: Crusader DNA in the Near East

   This paper got its start back in February of this year while I was researching R1b-DF100 for my posting, The Third Brother.  Among the data, primarily Western European haplotypes, was a single Armenian record.  The R1b-L11>DF100 group that I was working with had as one of their theories that L11 was a fairly recent, 3,000 to 4,000 years, arrival from the Near East and that the Armenian record was part of that evidence.  I looked at the Armenian record, ran a phylogenetic test on it, the L11 group and some similar Near East records.  The Armenian record fell squarely within a Baltic cluster on the tree with a rough TMRCA of about 1,200 years.  This Armenian was clearly more European than Armenian, at least on the paternal line.  My comment back to the L11 group was that their Armenian was probably the descendant of a Crusader based on the timing and directionality.

   In September, I ran across Pierre Zalloua’s paper, Y-Chromosomal Diversity in Lebanon Is Structured by Recent Historical Events (2008).  He and the other authors had put together a good correlation between Crusader DNA and haplogroup R1b in Lebanon.  The paper also correlated haplogroup J and the Muslim expansion.  The paper received quite a bit of feedback about haplogroup J and little or no mention of haplogroup R1b.  Considering the extent of the Crusaders’ presence in the Near East from 1096 to 1343, if they left DNA behind it would have been spread farther than Lebanon. 

   The real question is not whether they left DNA behind.  There is significant literature that details the atrocities; raping and pillaging were standard operating procedure for the Crusaders.  There are also numerous accounts of assimilation.  During the Crusaders’ 247-year occupation, roughly eight generations, they married local women and raised families.  The real question is whether Crusader DNA survived to modern times. 

Crusader DNA Distribution
   If Crusader DNA survived, it would be spread from Istanbul to Jerusalem and beyond.  The graphic above shows the potential for DNA distribution during the Crusader occupation (red) and the distribution over the past 918 years (gray).  My research focused on the following Near East countries - Armenia, Georgia, Iran, Iraq, Israel, Jordan, Lebanon, Palestine, Saudi Arabia, Syria and Turkey.

   Here is something I found bizarre.  Zalloua and team published their paper in 2008.  Every researcher looking at Near East R1b should be taking a lesson and validating that their data is not of Crusader origin.  Obviously, Crusader DNA wasn’t restricted to Lebanon.  Balaresque et al. in 2010 and Myres et al. in 2011 published papers using Near East R1b data (Turkish).  Forty-two percent of the Turkish R1b haplotypes from Balaresque and Myres were identical to Zalloua’s Lebanese R1b data.  This didn’t seem to raise any flags, as Balaresque and Myres used the Turkish data to suggest a Near East origin and Neolithic expansion for R1b.  These folks must not talk to each other.  Two of Zalloua’s team members went on to work with Balaresque and Myres on their papers.  The first thing I would have said was: “Considering what Zalloua found, we need to validate the origins of the Turkish data further back than one or two generations.”

   When presenting an analysis it is always good to show comparison data.  I collected R1b data and haplogroup G and J data from multiple Family Tree DNA projects.  I have a higher comfort factor that G and J are associated with the Neolithic expansion, so they were used as a basis for comparison.  For each 37-marker Near East record obtained, I used the haplotype to query a larger set of related records from ySearch (I call this haplotype aggregation).  A Near East set and a Western European set of data were developed for each haplogroup.  I then compared each individual Near East haplotype against the entire Near East set and the entire Western European set.  You would expect that the Near East haplotypes would be more closely related to their peers in the Near East set.
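   The comparison described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the post does not say which distance measure was used, so the stepwise genetic distance (sum of absolute differences in repeat counts), the `affinity` score and the toy 4-marker haplotypes below are all hypothetical stand-ins for real 37-marker records.

```python
from statistics import mean

def genetic_distance(a, b):
    """Stepwise distance: sum of absolute repeat-count differences per marker."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same number of markers")
    return sum(abs(x - y) for x, y in zip(a, b))

def mean_distance(haplotype, reference_set):
    """Average distance from one haplotype to every other record in a set."""
    others = [r for r in reference_set if r is not haplotype]
    return mean(genetic_distance(haplotype, r) for r in others)

def affinity(haplotype, near_east_set, western_europe_set):
    """Positive: closer to Near East peers.  Negative: closer to Western Europe."""
    return (mean_distance(haplotype, western_europe_set)
            - mean_distance(haplotype, near_east_set))

# Toy data (assumption): 4 markers per haplotype instead of 37.
near_east = [[13, 24, 14, 11], [13, 24, 15, 11], [12, 23, 14, 10]]
western_europe = [[13, 25, 14, 12], [14, 25, 14, 12]]

for h in near_east:
    print(h, affinity(h, near_east, western_europe))
```

   A score near zero corresponds to the neutral line: the haplotype is about as close to one set as to the other.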

   The haplogroup J data tells the best story.  The results cluster along the J1-M267 and J2-M172 lines.  The neutral line (diagonal triangles) represents zero affinity towards the Near East or Western Europe.  Points falling to the right of neutral show an affinity toward the Near East and to the left of neutral, an affinity towards Western Europe.


   J1 haplotypes (diamonds), which are rare in Europe, are closely related to their peers in the Near East.  The J1 data only shows an affinity toward the Near East.  The trend line for J1 indicates a fairly stationary population pattern with no suggestion of migration to Western Europe.  A trend line that doesn’t cross the neutral represents a strong peer affinity and little or no migration between the Near East and Western Europe.  J2 data (squares) shows a tipping point at which the more distantly related records lean toward the Near East and the closely related records lean toward Western Europe.  That transition shows a TMRCA of about 3,900 ± 800 years.  The tipping point indicates a point in time where the Near East J2 haplotypes became more common in Western Europe, illustrating a migration. 


   Haplogroup G shows results very similar to J2.  Haplogroups J2 and G have been associated with the Neolithic spread of agriculture from the Near East to Western Europe.  Both J2 and G present a consistent distribution from distant relationship (high variance) to closer relationship (low variance).  The trend lines for J2 and G represent migration events from the Near East to Western Europe.  The trend line for J1 represents no migration event.  These results are consistent with other published information.

   Haplogroup R1b does not exhibit either a migration or a non-migration pattern.  The haplotypes cluster in a fairly homogeneous group.  There is a slight lean toward Western Europe and essentially no continuum from high variance to low variance.  The more distantly related haplotypes don’t exist in the Near East.  The Near East individuals are just as related to the Western European individuals as they are to their own peers.  The approximate TMRCA for the R1b Near East – Western European group is 1,800 ± 500 years.


   Through atrocities and assimilation, Western European DNA from Crusaders was permanently introduced into the Near East less than 1,000 years ago.  Western European and Near East R1b haplotypes are highly and recently related.  The data indicates that within the last 2,000 years there was a migration from one geography to the other.  There is no documented migration in the past 2,000 years that would account for Western European R1b populations coming from the Near East and replacing indigenous European populations.  The introduction of Western European DNA into the Near East by Crusaders accounts for the west to east genetic flow.

   The sampling practices of research studies are questionable.  The origin of participants is typically only validated for one or two previous generations.  This is equivalent to not knowing the origin of study participants.  Sampling needs to be undertaken with a genetic genealogy approach and 37 markers or more.  The population genetics approach of fewer than 17 markers, poor origin validation and haplogroup generalization needs to change.

   Previous papers (Balaresque & Myres) that have used Near East R1b data as the basis of their research are suspect.  In light of the introduction of Crusader DNA into the Near East within the past 1,000 years, any theory on a Neolithic origin for haplogroup R1b will have to be re-evaluated.

Reference:

Maglio, MR (2014) Y-Chromosomal Haplogroup R1b Diversity in Near East is Structured by Recent Historical Events (Link)


© Michael R. Maglio

Friday, December 5, 2014

DNA Convergence and Chicken Little

   For me, the topic of convergence in yDNA first came up early in 2014.  I had just posted a paper and one of the comments was – “What about convergence?”  I said to myself, “What convergence?”  I admit I had to look up the topic.

Convergence: A term used in genetic genealogy to describe the process whereby two different haplotypes mutate over time to become identical or near identical resulting in an accidental or coincidental match. - Turner A & Smolenyak M 2004.

My response back to the comment was - “All of the haplotypes in my paper are unique.”  My data did not exhibit convergence. 
Convergence casts a shadow on genetic genealogy
   I started to poke around on the topic of convergence within yDNA STR haplotypes and the immediate impression that I got was that folks were ready to give up on STRs in favor of SNPs and the sky was falling.  Chicken Little was running around in the genetic genealogy circles.  Here is a small sample:

“Y-STRs are effectively dead” - Dienekes Pontikos, 2011

“Convergence of Y chromosome STR haplotypes from different SNP haplogroups compromises accuracy of haplogroup prediction” – Wang, et al, 2013

   Okay, convergence happens, but it’s an illusion.

   Let’s take a big step backwards in this story.  Did you know that most scientific papers relating to genetic genealogy use 17 STR markers or fewer?  Some use as few as 9 or 10.  For any of you who ever took one of the original 12-marker STR tests, you know that the results were essentially useless for anything except deep haplogroup association and history.

   Many researchers in the last couple of years are using the AmpFLSTR® Yfiler® to get their 17 marker results.  This equipment is approved for forensic cases.  Research papers are not forensic cases and researchers don’t need to limit themselves to 17 markers.  Thirty-seven marker yDNA tests have been available since 2004.

   Why does the number of STR markers matter?  I’m going to release my inner math geek to help explain.  If we look at marker DYS19, usually listed first in science papers and third in Family Tree DNA results, it can have a value within the range of 7 to 22 across all haplogroups.  Looking at R1b specifically, DYS19 ranges from 10 to 17 and statistically at two standard deviations (2 sigma) the range of values narrows to 13, 14 and 15.  From a probability point of view, there is a 1 in 3 chance that DYS19 will be any particular one of 13, 14 or 15.  Making the odds even better in our favor, 95% of the time DYS19 for R1b will already be 13, 14 or 15.  This means there is a 1 in 2 chance that DYS19 could change to another value on its way to converging with another haplotype.

   Taking standard deviation into account to determine the possible number of values for the STR markers and then multiplying each probability gives the odds that a haplotype could converge.
STR        # of possible marker values
DYS393     2
DYS390     4
DYS19      2
DYS391     2
DYS385a    2
DYS385b    4
DYS426     1
DYS388     1
DYS439     2
DYS389i    2
DYS392     2
DYS389ii   2

Total      4096

   There is a 1 in 4096 chance that two R1b 12-marker haplotypes could converge.  This is not the probability that one marker will change.  This is the probability that all 12 markers will change enough to match another haplotype.  These are very good odds and the reason why a 12-marker test is practically useless.
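   For anyone who wants to check the arithmetic, here is a minimal Python sketch that multiplies the number of possible values for each of the 12 markers.  The counts are the ones listed above; the variable names are just for illustration.

```python
from math import prod

# Number of possible values per marker for R1b, as listed in the post.
possible_values = {
    "DYS393": 2, "DYS390": 4, "DYS19": 2, "DYS391": 2,
    "DYS385a": 2, "DYS385b": 4, "DYS426": 1, "DYS388": 1,
    "DYS439": 2, "DYS389i": 2, "DYS392": 2, "DYS389ii": 2,
}

# Total number of distinct 12-marker combinations within those ranges.
combinations = prod(possible_values.values())
print(f"1 in {combinations} chance of convergence")
```

   Markers with only one likely value (DYS426 and DYS388) contribute nothing to the product, which is why adding more variable markers lengthens the odds so quickly.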

   With a high probability that 12 STR markers will converge, haplotypes start to blend together.  Two different haplogroups or family lines will appear to be the same.  Converging also means that when we calculate the time to the most recent common ancestor (TMRCA), it will look like less time has passed.  Convergence makes a 12-marker test result unusable for genealogical matching, haplogroup prediction and TMRCA calculations.  The Chicken Littles are correct; we have a problem with 12-marker STR results.

   What about 17 markers, a quasi-industry standard for science papers?  Taking the same approach with statistics and probability, a 17-marker yDNA R1b result has a 1 in 2 million chance of converging with another haplotype.  Each haplogroup has slightly different odds.  There is a 1 in 500,000 chance of an R1a 17-marker haplotype converging.  Those odds are better than any lottery.  Convergence is still a problem at 17 markers.

   When Dienekes Pontikos proclaimed the death of yDNA STRs, he was commenting on the attempt to get good TMRCA dates from 10-marker results.  I agree, you can’t get valid TMRCA dates from 10 markers.  When Wang, et al, determined that convergence compromises haplogroup prediction, they were correct: 17-marker haplotypes can converge to make one haplogroup look like another.

   In a quick analysis of 4,300 unique 37-marker R1b haplotypes, the average genetic distance is 17 steps for 37 markers.  That means there are 17 mutations required for convergence in a 37-marker haplotype.  Nearly half of the markers in the haplotypes would need to change.  When we look at the probability of 25-marker haplotype convergence, the chances are 1 in 84 million.  Considering there are about 3.6 billion men on the planet, one in 84 million is still in the realm of possibility.  By the time we get to 37 markers, the odds are 1 in 49 trillion.
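   To put “the realm of possibility” into rough numbers, the odds can be set against the male population.  The Python sketch below is a back-of-the-envelope illustration only.  It assumes the quoted odds can be read as a per-man chance of coincidentally matching a given haplotype, which is a simplification of my own, and it uses the R1b figures and the 3.6 billion estimate from this post.

```python
MEN_ON_PLANET = 3.6e9  # rough figure used in the post

# Convergence odds quoted in the post (R1b), keyed by number of markers.
odds_one_in = {
    12: 4_096,
    17: 2_000_000,
    25: 84_000_000,
    37: 49_000_000_000_000,
}

def expected_coincidental_matches(one_in, population=MEN_ON_PLANET):
    """Expected count of men matching by chance, assuming a per-man chance of 1/one_in."""
    return population / one_in

for markers, one_in in odds_one_in.items():
    expected = expected_coincidental_matches(one_in)
    print(f"{markers} markers: about {expected:.5f} expected coincidental matches")
```

   Under that assumption, 25 markers still leaves a few dozen expected coincidental matches worldwide, while 37 markers leaves far less than one.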

   There is a 1 in 49 trillion chance that all the necessary mutations will occur in order for two 37-marker haplotypes to converge.  The true odds against convergence are likely even longer.  I’ve only looked at the probable values for each marker and I haven’t taken into account the STR mutation rates, that is, the likelihood that a marker will change over time.

   There is essentially no such thing as convergence when 37 or more markers are tested and researched.  If you eliminate the possibility of convergence by using 37 STR markers, then immediately TMRCA calculations become more accurate and haplotypes from different haplogroups no longer resemble each other.  The reports of the death of yDNA STR results have been greatly exaggerated.


   I can’t tell you why researchers are currently stuck on 17 markers.  I can tell you that any research using fewer than 37 markers runs the risk of convergence in its data, which in turn could lead to the wrong conclusions.  I still consider genetic genealogy to be in its infancy.  Every month new research papers are published and the new concepts introduced are latched onto immediately.  It is understandable that papers from over a decade ago used a dozen STRs and a handful of SNPs; that was the height of technology.  If the latest technology and best data are not being used in today’s research papers, is that equivalent to scientific negligence?  Or, am I missing something and this is a case of scientific ignorance on my part?