Using AI to Analyze Your Family Tree with ChatGPT

In the digital age, the pursuit of understanding our roots and ancestry has become more accessible and intriguing than ever before. With the vast resources of genealogy platforms like Ancestry.com, we can unearth fascinating stories of our ancestors and the journeys that have shaped us. However, as our family trees grow, so does the complexity of uncovering meaningful connections and historical context. Let us show you how to explore your family tree with ChatGPT.

This is where artificial intelligence (AI) steps in to revolutionize the way we explore our genealogical heritage. Among the innovative AI tools available, ChatGPT stands out as a cutting-edge language model, capable of comprehending and generating human-like text. Powered by OpenAI’s GPT-4 architecture, ChatGPT is a versatile and intelligent assistant that can lend its capabilities to a myriad of tasks, including genealogy research.

In this blog post, we delve into the powerful synergy between AI and genealogy, focusing on how ChatGPT can be harnessed to analyze and enhance your family tree research. We’ll also explore the significance of utilizing FTAnalyzer, specifically designed to process GEDCOM files exported from platforms like Ancestry.com.

Join us on this exciting journey of exploring the past with the aid of AI, and let’s unravel the untold stories buried within your family tree, one generation at a time. Let’s embark on a transformative quest to uncover the rich tapestry of your ancestry and celebrate the captivating tales of your forebearers.

Getting started with ChatGPT

In order to use the power of ChatGPT, you need to follow a few steps to get the dataset ready to upload.

Obtaining a GEDCOM file

The first thing we need is a GEDCOM file of your family tree. My family tree is stored on Ancestry.com, and I explained how to export it in the following article: How to Use FTAnalyzer to Fix Errors in an Ancestry.com Tree

If you use a different application to store your family history, refer to their documentation on how to export to a GEDCOM file.

Converting your data to CSV format

As I explain later in this article, ChatGPT cannot currently process GEDCOM files directly, so we need another format for uploading the data we want to analyze. One format that ChatGPT does understand is CSV.

Using the tool FTAnalyzer, you can read in your GEDCOM file and export various types of data. In this article I focus only on the individuals. After opening your file, go to the Export menu and select “Individuals to Excel”. This saves a file containing all the individuals from your tree.
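Before uploading, it can help to confirm that the export looks the way you expect. The following is a minimal sketch using only Python's standard library; the sample rows are made up, and apart from “Forenames” and “BirthDate” (which appear in ChatGPT's responses later in this article) the column names are assumptions rather than FTAnalyzer's actual headers.

```python
import csv
import io


def summarize_export(csv_file):
    """Return (column names, number of data rows) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    row_count = sum(1 for _ in reader)
    return reader.fieldnames, row_count


# Tiny stand-in for the exported file; a real export has many more columns.
# "Surname" is a hypothetical header used only for illustration.
sample = io.StringIO(
    "Forenames,Surname,BirthDate\n"
    "Michael,Smith,1950\n"
    "Tina,Jones,UNKNOWN\n"
)
columns, rows = summarize_export(sample)
print(columns, rows)
```

For a real export you would pass an opened file instead of the sample, for example `open("individuals.csv", newline="", encoding="utf-8")`.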

Uploading CSV data to ChatGPT

To upload files to ChatGPT, you need access to GPT-4, which requires a paid subscription.

Once you have access, make sure code interpreter is turned on. Go to your settings, select “Settings and Beta”, and then “Beta Features”. Enable the “Code interpreter” option listed there.

Code interpreter option

Once the above is done, you are ready to upload your file. You will want to make sure you select GPT-4 and also make sure that code interpreter is enabled.

gpt-4 code interpreter

Now you are ready to upload your CSV file. If you look at the ChatGPT prompt, on the left is a circle with the + sign. You click this and will be prompted for a file. Go ahead and select your CSV file.

ChatGPT prompt

Uploading GEDCOM directly to ChatGPT

I first tried uploading my GEDCOM file directly and then providing some prompts, but ChatGPT currently cannot process this type of file. Here is the feedback that was provided:

To provide you with the number of individuals in your data, I’ll need to parse your GEDCOM file. GEDCOM (an acronym for Genealogical Data Communication) is a proprietary and open de facto specification for exchanging genealogical data between different genealogy software.

Let’s load and parse the data to determine the number of individuals in the file. I’ll use the Python library gedcompy for this task. Let’s get started.

I apologize for the confusion, but it appears that the Python environment in our current session does not have the gedcompy library installed. As of my last training cut-off in September 2021, there is no built-in functionality in Python to parse GEDCOM files.

However, there are several third-party libraries available, such as gedcompy, python-gedcom, and ged4py, which are capable of parsing GEDCOM files. Unfortunately, I am not able to install new libraries in this environment.

Is there another way I can assist you, or do you have the data in a different format I might be able to work with?
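For the specific question of how many individuals a GEDCOM file contains, a full parser is not strictly necessary. As a rough sketch, assuming the file follows the usual GEDCOM layout in which each person starts with a level-0 line such as `0 @I1@ INDI`, you can count those lines yourself with the standard library. The sample data below is invented, and this only counts records; it does not interpret names, dates, or relationships.

```python
import io
import re

# A level-0 record that declares an individual looks like: 0 @I1@ INDI
INDI_RECORD = re.compile(r"^0\s+@[^@]+@\s+INDI\b")


def count_individuals(gedcom_lines):
    """Count level-0 INDI records in an iterable of GEDCOM lines."""
    return sum(
        1
        for line in gedcom_lines
        # Drop a possible byte-order mark and surrounding whitespace first.
        if INDI_RECORD.match(line.lstrip("\ufeff").strip())
    )


# Invented miniature GEDCOM file with two individuals and one family.
sample = io.StringIO(
    "0 HEAD\n"
    "1 GEDC\n"
    "2 VERS 5.5.1\n"
    "0 @I1@ INDI\n"
    "1 NAME Michael /Smith/\n"
    "0 @I2@ INDI\n"
    "1 NAME Tina /Jones/\n"
    "0 @F1@ FAM\n"
    "1 HUSB @I1@\n"
    "1 WIFE @I2@\n"
    "0 TRLR\n"
)
individual_count = count_individuals(sample)
print(individual_count)
```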

Using ChatGPT to analyze your family tree

Now you are ready to start asking questions. I start off with some basic prompts, all of which could be answered by working with the CSV file directly in a program like Excel. Even so, a tool like ChatGPT gets you the answers much more quickly, without needing to know how to sort and filter columns, create pivot tables, or generate a graph or chart.

Example basic ChatGPT prompts

How many individuals with name Michael

I started simple and asked, “How many people have the name Michael?”. ChatGPT analyzed my file and provided an answer of 89. It was able to figure out which column to use in this case:

It seems you’ve uploaded a file named “individuals.csv”. I assume this file contains information about individuals including their names. I will load this file to check how many individuals have the name “Michael”.

The data has multiple columns but the relevant one for our task seems to be the “Forenames” column, which appears to contain individuals’ first names. Let’s check how many individuals have “Michael” as their first name.

Please note that this method will consider only individuals named exactly “Michael” and not variations like “Mike” or “Micheal”, nor will it consider individuals with “Michael” as a second or further name. If you want me to consider these cases, please let me know.

There are 89 individuals named “Michael” in the provided dataset.

How many individuals with the name Tina?

My wife’s name is Tina so I wanted to see what would show up with a similar request as above. I used the prompt “How many individuals have a forename containing ‘tina’?”.

There are 4 individuals in the provided dataset whose forenames contain “Tina”. This would include names like “Tina”, “Christina”, “Martina”, etc.

I then asked it to list their forenames and surnames, and sure enough it provided a list of 4 individuals, with my wife included in the results.
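The difference between the two name prompts comes down to exact matching versus substring matching. Here is a minimal standard-library sketch of both, run on invented sample rows. The “Forenames” column is the one ChatGPT reported using; “Surname” is an assumed header, and this is not necessarily the code ChatGPT ran.

```python
import csv
import io


def count_exact(rows, name):
    """Count rows whose whole Forenames value equals `name`."""
    return sum(1 for r in rows if r["Forenames"].strip() == name)


def find_containing(rows, fragment):
    """Return rows whose Forenames contain `fragment`, ignoring case."""
    fragment = fragment.lower()
    return [r for r in rows if fragment in r["Forenames"].lower()]


# Invented sample data; "Surname" is a hypothetical column name.
sample = io.StringIO(
    "Forenames,Surname\n"
    "Michael,Smith\n"
    "John Michael,Smith\n"
    "Tina,Jones\n"
    "Christina,Brown\n"
    "Martina Rose,Clark\n"
)
rows = list(csv.DictReader(sample))

# "John Michael" is not counted, matching the caveat in ChatGPT's answer.
exact_michael = count_exact(rows, "Michael")
tina_matches = [
    (r["Forenames"], r["Surname"]) for r in find_containing(rows, "tina")
]
print(exact_michael, tina_matches)
```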

Top forenames

I asked “Can you list the top 10 forenames?”. Here is what was provided:

Here are the top 10 most common forenames in the provided dataset:

  1. UNKNOWN: 541 individuals
  2. Mary: 361 individuals
  3. John: 351 individuals
  4. Elizabeth: 244 individuals
  5. William: 229 individuals
  6. James: 174 individuals
  7. Sarah: 154 individuals
  8. Joseph: 149 individuals
  9. Thomas: 144 individuals
  10. Samuel: 106 individuals
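A ranking like this is a frequency count. The sketch below shows one way to do it with `collections.Counter`, including an optional exclusion set for placeholder values such as UNKNOWN. The helper name `top_values` and the sample list are made up for illustration.

```python
from collections import Counter


def top_values(values, n, exclude=()):
    """Return the n most common values, skipping anything in `exclude`."""
    counts = Counter(v.strip() for v in values if v.strip() not in exclude)
    return counts.most_common(n)


# Invented sample; in practice this would be the Forenames column.
forenames = [
    "Mary", "John", "UNKNOWN", "Mary", "UNKNOWN",
    "UNKNOWN", "John", "Mary", "Sarah", "UNKNOWN",
]
top_all = top_values(forenames, 2)
top_known = top_values(forenames, 2, exclude={"UNKNOWN"})
print(top_all)
print(top_known)
```

The same helper works for surnames or any other column where a placeholder value would otherwise dominate the list.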

Chart based on number of individuals born by year

I asked “Generate a bar chart based on number of individuals born by year” and here is what I received:

Number of individuals born by year

Chart based on gender

I asked “Generate a pie chart based on gender” and here is the result:

Distribution of individuals by gender

Most common forenames

I asked “Generate a bar chart of the top 20 forenames excluding unknown”:

Top 20 most common forenames

Most common surnames

I asked “Generate a bar chart of the top 20 surnames” and it did but the number one entry was for UNKNOWN. I then asked “can you redo that but exclude unknown” and it produced the following:

Top 20 most common surnames

Distribution of lifespans

I then used the prompt “Generate a chart showing lifespan of individuals”:

Distribution of lifespans chart

Relation to root

I asked ChatGPT to generate a chart based on the “RelationToRoot” and what it produced was just not readable as the labels were too long and overlapped one another. I then asked “Can you generate a table instead of a chart?” and here is what it produced:

Relation To Root               Number of Individuals
4th cousin 3 times removed     142
2nd cousin 4 times removed     134
2nd cousin 5 times removed     124
4th cousin                     123
3rd cousin 3 times removed     120
2nd cousin 8 times removed     107
1st cousin 6 times removed     104
4th cousin twice removed       96
3rd cousin once removed        96
3rd cousin 4 times removed     96
2nd cousin twice removed       96
3rd cousin 6 times removed     95
3rd cousin                     86
4th cousin once removed        84
2nd cousin once removed        81
5th cousin twice removed       76
5th cousin once removed        73
1st cousin 5 times removed     73
1st cousin twice removed       71
1st cousin 9 times removed     70

More advanced prompts

Being able to analyze the data with simple prompts, without needing to be an Excel wizard, was awesome, but I wanted to try some more complex prompts to see what could be done. Here are a few I quickly came up with.

Finding individuals who may have been drafted in WWII

I asked the following prompt:

Analyze the file and count how many individuals may have been 18 to 30 years old during world war 2 based on using their birthdate, deathdate, marriagedates, and lifespan

Here is what it generated:

Let’s start by loading the data from the file and inspecting the first few rows to understand its structure and the types of data it contains. I’ll use pandas, a popular data analysis library in Python, for this.

The dataset contains various columns, some of which are relevant to our task. The relevant columns for our task are:

  • BirthDate: Date of birth of the individual.
  • DeathDate: Date of death of the individual.
  • BirthStart and BirthEnd: The start and end of the birth date range, respectively.
  • DeathStart and DeathEnd: The start and end of the death date range, respectively.

To determine if an individual was between 18 to 30 years old during World War II, we would need to know the years of World War II. World War II took place from 1939 to 1945.

So, we need to find individuals who were born between 1909 (1939-30) and 1927 (1945-18), inclusive.

Let’s perform this calculation. Please note that we need to handle ‘UNKNOWN’ and other non-standard date entries appropriately. Also, it’s necessary to convert ‘BirthStart’ and ‘BirthEnd’ to datetime objects for this analysis. Let’s proceed with this.

Based on the birth dates available in the data, there are 2,323 individuals who would have been between 18 and 30 years old during World War II (1939-1945). This count is based on individuals born between the years 1909 and 1927, inclusive.

Please note that this analysis assumes that the BirthStart and BirthEnd columns accurately represent the range of possible birth dates for each individual. If these ranges are not precise, the count may not be entirely accurate. Additionally, the count could also be affected by any errors or inconsistencies in the data.

As you can see, I didn’t even need to put in the years of the war. The above data also contained both males and females, so I did have a follow up where I asked how many of the above would have a gender of M and it came back with 1,191.
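The arithmetic in that response can be reproduced in a few lines. This sketch follows the same birth-year window (1939 minus 30 through 1945 minus 18) and then applies the gender follow-up. The date formats, the “Gender” column name, and the sample people are assumptions, and like ChatGPT's answer it does not check whether someone had already died before the war.

```python
import re

WAR_START, WAR_END = 1939, 1945
MIN_AGE, MAX_AGE = 18, 30


def birth_year(text):
    """Pull the first four-digit year out of a date string, or None."""
    match = re.search(r"\b(\d{4})\b", text)
    return int(match.group(1)) if match else None


def was_draft_age(year):
    """True if someone born in `year` was 18 to 30 at some point in the war."""
    # Born 1909 -> about 30 in 1939; born 1927 -> about 18 in 1945.
    return year is not None and WAR_START - MAX_AGE <= year <= WAR_END - MIN_AGE


# Invented sample rows; "Gender" is a hypothetical column name.
people = [
    {"Forenames": "Michael", "Gender": "M", "BirthDate": "12 MAR 1915"},
    {"Forenames": "Tina", "Gender": "F", "BirthDate": "ABT 1920"},
    {"Forenames": "John", "Gender": "M", "BirthDate": "1930"},
    {"Forenames": "Mary", "Gender": "F", "BirthDate": "UNKNOWN"},
]
candidates = [p for p in people if was_draft_age(birth_year(p["BirthDate"]))]
men = [p for p in candidates if p["Gender"] == "M"]
print(len(candidates), len(men))
```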

I could then refine this list even further by adding additional prompts and I could also have the list generated for me if I wanted to use it as a guide to do further research on the individuals.

Average number of children over time

I wanted to try something different and wanted to make sure the analysis didn’t pick up from where I left off. Here is the prompt I used for this request:

Can you start a new analysis and analyze how the birthdate relates to the childrencount over time

Here is what was generated:

Sure, we can analyze how the ChildrenCount relates to the BirthDate over time. To do this, we need to convert the BirthDate into a suitable format for plotting (year) and then examine the relationship with ChildrenCount.

First, let’s convert the BirthDate field to datetime and extract the year. We’ll also need to handle ‘UNKNOWN’ and other non-standard date entries appropriately. We’ll then group the data by birth year and calculate the average number of children for each year.

Let’s proceed with this analysis.

We have successfully grouped the data by birth year and calculated the average number of children for each year. Let’s proceed to visualize this data. We’ll create a line plot with the birth year on the x-axis and the average number of children on the y-axis. This will show us how the average number of children changes over time.

Average number of children over time

The line plot displays the average number of children over time, based on the birth year of individuals. The x-axis represents the birth year, and the y-axis represents the average number of children.

From the plot, we can observe trends in the average number of children over time. However, please note that the interpretation of these trends can be complex and may be influenced by many factors, such as societal changes, availability of data for different time periods, and others.

Also, please note that the number of children is likely to be underreported for more recent years, as some individuals may still have more children after the data was collected. This might explain the decrease observed towards the right end of the graph.

The data points before the year 1800 should also be interpreted carefully due to likely limited data availability for these early years, as indicated by the high variability.
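The grouping step ChatGPT describes (extract the birth year, skip unusable values, average ChildrenCount per year) can be sketched without pandas as follows. The column names come from ChatGPT's response; the date formats and sample rows are assumptions.

```python
import re
from collections import defaultdict


def average_children_by_year(rows):
    """Map each birth year to the mean ChildrenCount of people born then."""
    totals = defaultdict(lambda: [0, 0])  # year -> [children total, people]
    for row in rows:
        match = re.search(r"\b(\d{4})\b", row["BirthDate"])
        if not match:
            continue  # e.g. 'UNKNOWN' or an empty date
        try:
            children = int(row["ChildrenCount"])
        except ValueError:
            continue  # blank or non-numeric count
        bucket = totals[int(match.group(1))]
        bucket[0] += children
        bucket[1] += 1
    return {
        year: total / people
        for year, (total, people) in sorted(totals.items())
    }


# Invented sample rows.
rows = [
    {"BirthDate": "4 JUL 1850", "ChildrenCount": "6"},
    {"BirthDate": "ABT 1850", "ChildrenCount": "4"},
    {"BirthDate": "1900", "ChildrenCount": "2"},
    {"BirthDate": "UNKNOWN", "ChildrenCount": "3"},
    {"BirthDate": "1900", "ChildrenCount": ""},
]
averages = average_children_by_year(rows)
print(averages)
```

Years with only one or two people produce noisy averages, which is the same effect ChatGPT warns about for the early part of the chart.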

Individuals without exact death date who may be found in SSDI

The Social Security Death Index (SSDI) is a great tool for figuring out when an individual may have died. I wanted to generate a list of individuals that I could focus my search on using the SSDI to help pinpoint a more specific death date.

I went ahead and asked ChatGPT the following:

Can you start a new analysis and generate a list of individuals who do not have an exact death date that may be able to be located in the social security death index, please only include those individuals that just have a year or year estimate for their death date. I would like the data set in a CSV file that I can download

This actually worked, and I was able to open the CSV in Excel. Here is the response that I was provided:

I have filtered the dataset to include only individuals who:

  1. Do not have an exact death date.
  2. Were born after 1870 (since they would be 66 at the start of the SSDI in 1936, and hence potentially part of it).
  3. Have only a year or year estimate for their death date.

I have saved this filtered dataset to a CSV file. You can download it using the link below:

Download potential_SSDI_year_or_estimate_death.csv

This was great, and I was able to download the file and open it in Excel. The bummer is that it included individuals who had died before 1936. I then provided an updated prompt: “Can you update this and remove any individuals who died prior to 1936”. Sure enough, this worked, and I now have a “to do” list of individuals to focus on. Note that the SSDI lists few individuals before 1962, so I also had ChatGPT generate an updated CSV with the reduced set.
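For readers who want to build a similar “to do” list themselves, here is a hedged sketch of the filter described above: keep people born after 1870 whose death date is only a year or a year estimate, and drop anyone who died before 1936. The date formats (a bare year, or a year with a GEDCOM-style qualifier such as ABT or BEF) are assumptions about how the export writes dates, and the sample rows are invented.

```python
import csv
import io
import re

# Assumed formats: a bare year ("1975") or an estimate ("ABT 1975").
YEAR_ONLY = re.compile(
    r"^(?:(?:ABT|EST|CAL|BEF|AFT)\.?\s+)?(\d{4})$", re.IGNORECASE
)


def year_of(text):
    """Return the first four-digit year in a date string, or None."""
    match = re.search(r"\b(\d{4})\b", text)
    return int(match.group(1)) if match else None


def ssdi_candidates(rows, born_after=1870, died_from=1936):
    """Keep rows with a year-only death date inside the SSDI era."""
    keep = []
    for row in rows:
        death = YEAR_ONLY.match(row["DeathDate"].strip())
        born = year_of(row["BirthDate"])
        if (
            death
            and born is not None
            and born > born_after
            and int(death.group(1)) >= died_from
        ):
            keep.append(row)
    return keep


def to_csv(rows, fieldnames):
    """Render rows as CSV text, ready to be written to a file."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Invented sample rows.
fields = ["Forenames", "BirthDate", "DeathDate"]
people = [
    {"Forenames": "Michael", "BirthDate": "1890", "DeathDate": "1965"},
    {"Forenames": "Tina", "BirthDate": "3 JAN 1900", "DeathDate": "12 MAY 1970"},
    {"Forenames": "John", "BirthDate": "1880", "DeathDate": "ABT 1920"},
    {"Forenames": "Mary", "BirthDate": "1860", "DeathDate": "1940"},
    {"Forenames": "Sarah", "BirthDate": "ABT 1905", "DeathDate": "BEF 1980"},
]
todo = ssdi_candidates(people)
csv_text = to_csv(todo, fields)
print(csv_text)
```

Raising `died_from` to 1962 would give the more conservative list mentioned above.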

Note that this example may not be that useful if your genealogy program already provides hints from online databases, whether free or paid. You can still use this process to generate lists of individuals matching other criteria for external databases or alternative sources that these tools can’t help with: for example, a cemetery database that isn’t yet integrated with any genealogy program, or records on microfilm or microfiche that haven’t been digitized.

Conclusion

In conclusion, embracing Artificial Intelligence, particularly ChatGPT, in your family tree research can truly revolutionize your genealogical journey. By preparing your dataset, utilizing GEDCOM files from Ancestry.com, and prompting ChatGPT with questions, you unlock an unparalleled ease in gaining insights. Within seconds, ChatGPT’s lightning-fast responses lead you through a treasure trove of information, painting a vivid picture of your heritage. The once intricate process of uncovering historical connections is now effortless and efficient, allowing you to delve deeper into your past and cherish the stories of your ancestors. With AI as your ally, the exploration of your family tree with ChatGPT becomes a delightful and enlightening experience, connecting you to the profound tapestry of human history. Embrace the power of ChatGPT and embark on this transformative journey to discover the roots that define you!

Notes

My CSV file contained 21,030 individuals and this seemed to work just fine. Your results may vary if you are using extremely large data files.

Please be aware that you are uploading information into the ChatGPT software and be cautious of any privacy information you do not want to share.
