Advanced Geographic Data Analysis: Lake Huron Shipwreck Analysis

Introduction

In 2014, the Thunder Bay National Marine Sanctuary expanded by over 3,800 square miles to include an additional 100 known and possible historic shipwrecks. The surveyed wrecks date from the 1800s and 1900s. The study area for this project is shown below in Figure 1. Gathering the pertinent data is difficult because each wreck must be surveyed individually to establish its location, cause of sinking, and cargo, along with many other factors. Out of the water, each wreck then needs to be researched, using the survey data, to identify the specific ship. Some shipwrecks are harder to identify because of their condition or a lack of historical records to cross-reference. While a lot can be learned from analyzing historic shipwrecks, finding the initial information takes time and effort. Even when information about a shipwreck is available, it may not be complete, and working around those gaps in an analysis can be challenging.

Figure 1. Overview of the study area using RStudio.

Methodology

For my analysis, I consolidated the available data into a single “shipwreck sites” dataset, which means only one point per shipwreck was used. These points also carry the most complete information for each shipwreck. My initial strategy was to fit a linear model to find relationships between the fields. This was unsuccessful for two reasons. First, many shipwrecks had large amounts of missing data. Second, many of the fields were stored as strings rather than numbers. I therefore switched to chi-square comparisons and k-means clustering. For the chi-square comparisons, missing values did not prevent the tests from running, and I was able to compare a number of fields, including year built, hull type, and loss type, all of which were non-numeric (Figure 2). The k-means clustering, on the other hand, would not run with missing values. Every field passed to it had to be free of nulls, and only some field types were usable, so the clustering drew on a smaller subset of the data, which limited the accuracy of that analysis.

Figure 2. Chi-square code snippet.
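
Separately from the snippet in Figure 2, the way missing values affect the two methods can be sketched in a few lines. The project itself was run in RStudio, so the Python below is only an illustration. The records, the field names (hull, loss, length) and the helper functions are all hypothetical and are not taken from the sanctuary data. A chi-square comparison only needs the records where both fields being compared are present, while a clustering input needs every chosen field to be present.

```python
from collections import Counter

# Hypothetical records; None marks a missing value. Not the sanctuary data.
wrecks = [
    {"hull": "wood", "loss": "stranded", "length": 120.0},
    {"hull": "wood", "loss": "collision", "length": None},
    {"hull": "steel", "loss": "collision", "length": 250.0},
    {"hull": None, "loss": "stranded", "length": 90.0},
    {"hull": "wood", "loss": "stranded", "length": 140.0},
    {"hull": "steel", "loss": "collision", "length": None},
    {"hull": "wood", "loss": None, "length": 110.0},
    {"hull": "steel", "loss": "stranded", "length": 300.0},
]


def contingency(records, row_field, col_field):
    """Count (row, col) pairs, skipping records missing either field."""
    return Counter(
        (r[row_field], r[col_field])
        for r in records
        if r[row_field] is not None and r[col_field] is not None
    )


def chi_square(counts):
    """Return (statistic, degrees of freedom) for a table of pair counts."""
    rows = sorted({a for a, _ in counts})
    cols = sorted({b for _, b in counts})
    n = sum(counts.values())
    row_tot = {a: sum(counts[(a, b)] for b in cols) for a in rows}
    col_tot = {b: sum(counts[(a, b)] for a in rows) for b in cols}
    stat = 0.0
    for a in rows:
        for b in cols:
            expected = row_tot[a] * col_tot[b] / n  # count expected under independence
            stat += (counts[(a, b)] - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(cols) - 1)


def complete_cases(records, fields):
    """Keep only records with a value for every listed field."""
    return [r for r in records if all(r[f] is not None for f in fields)]


table = contingency(wrecks, "hull", "loss")
stat, dof = chi_square(table)
usable = complete_cases(wrecks, ["hull", "loss", "length"])
print(sum(table.values()), "records in the chi-square table")
print(len(usable), "records usable for clustering")
```

In this toy example the chi-square table keeps more records than the clustering input does, which is the same squeeze described above.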

Results

I ran a range of chi-square tests, some producing more significant results than others, and I discuss the most significant ones below. The chi-square test suits this dataset because it has a lot of categorical variables. The most significant relationship, with a p-value of 8.671e-07, was between year built and hull type. I expected these two fields to be strongly related. Having confirmed that expectation, I reviewed the table and saw that only a small number of the vessels were built of a material other than wood. Given such a skewed sample, it makes sense that the result was so strong. If steel and fiberglass ships were more evenly represented, the analysis would be more compelling. That would require a dataset that includes more recent shipwrecks, not simply historic ones. There are some recent wrecks, but not enough to offset all the ships from the 1800s, which are exclusively made of wood.
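
One way to see how a skewed table can affect a chi-square result is to look at the counts expected under independence. A common rule of thumb treats the chi-square approximation as less reliable when expected counts fall below about 5. The sketch below is in Python for illustration only (the analysis was run in RStudio), and the counts and the helper name are hypothetical, not the project's actual table.

```python
def expected_counts(table):
    """Expected cell counts under independence for a list-of-rows table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]


# Hypothetical counts (rows: build era, columns: wood / other hull material).
observed = [[40, 1],
            [10, 6]]

expected = expected_counts(observed)
smallest = min(min(row) for row in expected)
sparse = smallest < 5  # rule-of-thumb threshold, not a hard cutoff
print(round(smallest, 2), sparse)
```

When a check like this flags sparse cells, a very small p-value is better read as "these fields are not independent" than as a measure of how strong the relationship is.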

Another significant relationship, with a p-value of 0.02452, was between loss type and hull type. As with the previous hull type analysis, most of the ships are wood, so the loss types are fairly evenly distributed among them. However, when this result is read alongside the year and loss type table, more can be inferred. All the abandoned shipwrecks were older ships; within this dataset, the last abandonment was in 1937. Interestingly, collisions are common in every period, and they are the most common loss type for steel ships. This makes sense because steel ships are more likely than wooden ships to be salvaged or sold for scrap, so owners would be less likely to simply abandon them. There are more strandings in earlier years, but they persist up to the present day. Across all loss types, stranding is the most common in this dataset.

The final analysis used two cluster validity measures, the Calinski-Harabasz index and the silhouette index. Each was evaluated for solutions ranging from 2 to 20 clusters to find the best one. The Calinski-Harabasz index produced a semi-parabolic curve with no distinct peak, so it did not point to a best-fitting solution (Figure 3). The silhouette index was different: it peaked at 5 clusters, which suggests that 5 clusters best reflect this data (Figure 4).
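
The sweep over cluster counts can be sketched as follows. This is a minimal Python illustration, not the RStudio code used for the project. The points are hypothetical, the range is cut to 2 through 5 because the toy data has only 12 points, and the farthest-point seeding is a choice made here so the run is repeatable.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = centroid(members)
    return labels


def groups(points, labels):
    out = {}
    for p, lab in zip(points, labels):
        out.setdefault(lab, []).append(p)
    return out


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its df."""
    g = groups(points, labels)
    n, k = len(points), len(g)
    overall = centroid(points)
    between = sum(len(m) * dist(centroid(m), overall) ** 2 for m in g.values())
    within = sum(dist(p, centroid(m)) ** 2 for m in g.values() for p in m)
    return (between / (k - 1)) / (within / (n - k))


def silhouette(points, labels):
    """Mean silhouette width over all points."""
    g = groups(points, labels)
    total = 0.0
    for p, lab in zip(points, labels):
        own = g[lab]
        if len(own) == 1:
            continue  # singleton clusters score 0 by convention
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in m) / len(m)
                for other, m in g.items() if other != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 2-D points in three tight groups (not the shipwreck data).
points = [(x + dx, y + dy)
          for x, y in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]

scores = {}
for k in range(2, 6):
    labels = kmeans(points, k)
    scores[k] = (calinski_harabasz(points, labels), silhouette(points, labels))

best_ch = max(scores, key=lambda k: scores[k][0])
best_sil = max(scores, key=lambda k: scores[k][1])
print(best_ch, best_sil)
```

On this cleanly separated toy data both indices favor three clusters. On real data the two can disagree, as they did here, where only the silhouette index showed a clear peak.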

Conclusion

It was really challenging to work with this dataset because of the missing values. In the future, I would like to find an improved dataset that is not missing as many values. For a later project, it would be interesting to analyze the debris pattern of each shipwreck and compare loss type with the area of the debris field. Working with the entire dataset would be a unique challenge, but it could yield some interesting results about shipwreck debris patterns.

Skills

  • Spatial Analysis
  • Spatial Data and Algorithms
  • Communication
