DNA Analysis, Part 4 - Cluster Fun

Feb 18, 2022

Ancestry has listed five of Elder's matches as having a 'Common Ancestor'. In practice, this means it has used an algorithm to scour all of the family trees on the site and propose how Elder links to a given individual. These suggestions are usually correct, but must be verified with records. They are a good starting point for tree development, but the first thing I should do is organise the dataset into clusters of related individuals.

From experience, I would expect 4 large clusters (of about 100), 20 smaller clusters (of about 5-10), and 50 individuals/pairs for a total of around 75 groups. I will add a cluster identifier column to the dataset spreadsheet I have just created so that I can subsequently filter it by whichever ancestors I'm investigating.

This exercise is time-consuming, but valuable. In briefly reviewing each and every DNA match, I can understand which matches belong to individuals who are genealogy hobbyists (or rarely, professionals), the surnames common to various clusters, and the links between main and sub clusters. For the first two clusters, I have set aside a few hours to complete the task. Smaller clusters will be much quicker, and I can revisit them with breaks if necessary, but I intend to get each of the first two done in their own whole sessions so that I don't lose track.

I begin by adding a column to the left of the existing records in the dataset as space for a number to be used as a cluster identifier. I also add a column to the right for notes on surnames of ancestors associated with each match.

Typically, the closest relatives at the top of the list will have the most shared matches. This means the lion's share of this exercise will be identifying the first two clusters (the main maternal and paternal groups). It may be slightly more challenging to identify which side the smaller clusters are on, but as the tree grows, many will fall into place.

Starting at the top of Elder's shared matches list, with the closest match, I place a '1' in the cluster ID column. This is my first member of my first cluster. I then open the link to them in a new browser tab and switch to it.

The first thing to do is look at the individual's Ancestry account to gauge whether they are a good candidate for (later) communication with regard to the tree. There's no harm in contacting all matches via the Ancestry messaging system eventually, but experience shows that most will not respond. By clicking on their name, I can take a look at their userpage.

A good candidate can be identified from their short bio. If they openly state 'please send me a message', describe themselves as passionate about genealogy, or list family names and areas they have an interest in, then this is someone who will likely lead to valuable communication. A bad candidate - and these are the majority on the site - will not have entered any further information about themselves or their genealogical interests, and will either not have created a tree at all or have added only the very basic self and parents. This is fine; people take DNA tests for different reasons after all. Some may even be using Ancestry for research purposes only, keeping their main tree and communications on another site such as MyHeritage or myFTDNA, or even offline.

However, my experience shows that collaboration is a key part of family tree development. It can be fun working with people who may be so remotely related that it has become irrelevant, but the passion with which you can both investigate particular shared 6GGs (who are a single pair out of 128 pairs of ancestors at that generation) leads to discoveries you wouldn't previously have found alone. To this end, I differentiate any Ancestry user who seems open to collaboration in the dataset by highlighting their record in a nice light green. In Elder's case, these individuals account for 10-15% of DNA matches.
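The '128 pairs' figure follows from simple doubling: each generation back has twice as many ancestors as the one before it. As a minimal sketch (the function name is my own, and it ignores pedigree collapse, where the same ancestor appears more than once in a tree), the arithmetic can be checked like this:

```python
def ancestor_pairs(greats: int) -> int:
    """Number of ancestral couples at the 'N x great-grandparent' level.

    Parents are generation 1 and grandparents generation 2, so the
    Nx great-grandparents sit at generation N + 2 and number
    2 ** (N + 2) individuals. Assumes no pedigree collapse.
    """
    individuals = 2 ** (greats + 2)
    return individuals // 2  # two individuals per couple


print(ancestor_pairs(6))  # 6x great-grandparents: 256 people, 128 pairs
```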

Elder's closest match is a blank user with no tree attached, so I won't be highlighting him as a potential contact. That's not to say I won't contact him; but the reasons for doing so will be very different to those 10-15% of hobbyists on the site. Then, I click the back button in the browser to bring me back to the page that shows the shared match detail.

Next, I see if there's any family information I can record. If the user has attached their DNA results to a tree, a pedigree will show. In this case, I record up to sixteen of the displayed surnames in the notes column. There's no need to note duplicates: the purpose of this exercise is to gain familiarity with family names and to feel out the names common to several matches, which will highlight the ancestral links. If there is no tree, a private tree, or all members are alive and hence private, I write 'Nothing'. It may be that I can find a name from the individual's bio, where they may have listed their maiden name, or from the name they've given to a private tree. It can be helpful to note those instead if the pedigree is missing.
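For anyone keeping their notes in a script rather than a spreadsheet, the rule above (up to sixteen surnames, no duplicates, 'Nothing' when the pedigree is empty) might be sketched as follows. The function name and the sample surnames are invented for illustration and are not taken from Elder's matches.

```python
def surname_note(pedigree_surnames, limit=16):
    """Build the notes-column text for one match.

    Keeps the first occurrence of each surname (case-insensitive),
    caps the list at `limit` names, and falls back to 'Nothing'
    when no surnames are visible.
    """
    seen = set()
    unique = []
    for name in pedigree_surnames:
        cleaned = name.strip()
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    if not unique:
        return "Nothing"
    return ", ".join(unique[:limit])


# Hypothetical pedigree with a duplicate and a blank entry
print(surname_note(["Murphy", "Kelly", "murphy ", ""]))  # Murphy, Kelly
print(surname_note([]))                                  # Nothing
```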

Finally, I click over to the shared matches tab for the individual. This is a list of individuals in my dataset that share DNA with both Elder and the match whose page I'm on. Most often all three share the same DNA, but the list doesn't distinguish whether the shared segments are the same or completely different pieces of DNA. The latter is rare, but certainly not unheard of - maybe a couple in the dataset will be this way. It would mean that the three individuals involved are related to one another, but not all through the same ancestral line. It can therefore artificially inflate a cluster, but as long as I bear in mind that this is possible in rare cases, I can ignore the concept in my first pass of cluster creation. As an aside, both MyHeritage and GEDmatch have tools to identify the particular DNA segments that match between individuals, which would be very useful in this exercise (though they, too, are not perfect, as we will see later on); hopefully it's something Ancestry will include at some point soon.

Generally, the higher on Elder's DNA matches list an individual is, the more shared matches they will have. As both the dataset and related individuals' shared matches are in the same order, it is reasonably simple to go down this list, putting a '1' in the cluster ID column next to each name. Once that is done, I go down the list again and open the link to each individual on the list in a new tab. When every link is open, I am finished with this first individual and his page can be closed.

On the first new tab I now repeat the process I've just completed. Step one: click through to their bio and identify whether they are a good potential contact, and if so highlight their dataset entry. Step two: record their ancestral surnames in the Notes column of the dataset, or 'Nothing' if there is no further information. Step three: open in new tabs their unviewed shared matches - note that this time, not all shared matches will have the blue dot denoting an unviewed page. As I have opened a number of them in other tabs already, they are considered 'viewed'. Bear in mind that I may end up having 20-30 tabs open at the same time at some points, as I work my way through all of these linked matches. Step four: close the tab of the individual I have finished working on.
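Seen abstractly, this tab-by-tab routine is a breadth-first walk over the shared-match links: the open tabs are the queue, and the blue dot plays the part of the 'not yet visited' flag. Ancestry offers no export I am relying on here, so the sketch below simply assumes the shared-match lists have been typed into a dictionary by hand; the names `assign_cluster`, `cluster_all`, and the single-letter matches are all hypothetical.

```python
from collections import deque


def assign_cluster(start, shared_matches, cluster_ids, cluster_id):
    """Give `cluster_id` to everyone reachable from `start` via shared matches."""
    queue = deque([start])            # the "open tabs"
    cluster_ids[start] = cluster_id
    while queue:
        person = queue.popleft()      # work through one tab, then close it
        for other in shared_matches.get(person, []):
            if other not in cluster_ids:      # still has the "blue dot"
                cluster_ids[other] = cluster_id
                queue.append(other)           # open in a new tab
    return cluster_ids


def cluster_all(match_list, shared_matches):
    """Walk the main match list top-down, starting a new cluster at each
    match that has not been labelled yet."""
    cluster_ids = {}
    next_id = 1
    for person in match_list:         # closest match first
        if person not in cluster_ids:
            assign_cluster(person, shared_matches, cluster_ids, next_id)
            next_id += 1
    return cluster_ids


# Hypothetical data: A-C are linked, D-E are linked, F has no shared matches
matches = ["A", "B", "C", "D", "E", "F"]
shared = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": ["E"], "E": ["D"]}
print(cluster_all(matches, shared))
# {'A': 1, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 3}
```

Like the manual pass, this treats every shared-match link as belonging to the same cluster, so the rare match that links two otherwise separate family lines would merge them into one group.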

After a good few cups of tea, I have completed my first cluster. Now is a good time to create a summary sheet in the spreadsheet to get a feel for the scope of the work I've done. I create a new sheet tab in the dataset spreadsheet file and head four columns with "Cluster", "Members", "% of Matches", and "Notes". The first column is for cluster ID numbers, so going down I can enter 1, 2, 3, etc. to about 50 for now. The second column collates the quantity in the cluster from the dataset using the formula "=COUNTIF(Dataset!A:A,A2)", which can then be copied down. The third column gives me a mathematical representation of that number in context using the formula "=B2/(COUNTA(Dataset!B:B)-1)". This can also be copied down. Finally, the notes column is for my reference. I have noted that cluster 1 is most likely a maternal cluster as there is no link to the paternal surname and a large number of Irish names are present.
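The same summary can be produced outside the spreadsheet. The sketch below mirrors the two formulas: a count per cluster ID (the COUNTIF) divided by the number of match rows (the COUNTA minus the header). The function name is my own, and the sample uses the figures reported for clusters 1 and 2 in this post; the total of 475 rows is inferred from those percentages rather than stated anywhere.

```python
from collections import Counter


def summarise(cluster_ids):
    """Return (cluster, members, % of matches) rows.

    `cluster_ids` holds one entry per match row in the dataset,
    with None for rows that have no cluster ID yet.
    """
    total = len(cluster_ids)                                   # COUNTA(...) - 1
    counts = Counter(c for c in cluster_ids if c is not None)  # COUNTIF per ID
    return [(cid, n, round(100 * n / total, 2))
            for cid, n in sorted(counts.items())]


# 134 rows in cluster 1, 76 in cluster 2, the remaining 265 not yet clustered
# (475 rows in total is an inferred figure, see above)
rows = [1] * 134 + [2] * 76 + [None] * 265
for cluster, members, pct in summarise(rows):
    print(cluster, members, pct)
# 1 134 28.21
# 2 76 16.0
```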

I can see that cluster 1's 134 matches represent 28.21% of the total. Repeating the task for the next cluster, from the next unviewed individual on Elder's main match list, I find that cluster 2 has only 76 matches (16.00%). It is very likely the paternal cluster, as the top match shares Elder's rare surname.

This clustering took me three sit-down sessions of a few hours each to complete. I ended up with four large clusters of 50+, eleven clusters of 3-30, and thirty-one clusters of 1-2. I also had the rare treat of coming across one of Elder's DNA matches who also matched both me and my mother - quite random and unexpected! That's not to say I'm related by DNA to Elder, but there is a family out there which is related to both of us.

With this complete, I can now move on to expanding Elder's family tree proper.
