Vertebrate microfossils have broad applications in evolutionary biology and stratigraphy, for example in studies of the evolution of hard tissues and in stratigraphic correlation. Classification is one of the basic tasks in vertebrate microfossil studies. With the development of virtual paleontology techniques, vertebrate microfossils can be classified efficiently based on their 3D volumes. Semantic segmentation of the different fossils and their classes in CT data is a crucial step in reconstructing these 3D volumes. Traditional segmentation methods combine thresholding with manual labeling, which is time-consuming. Our study proposes a deep-learning-based (DL-based) semantic segmentation method for vertebrate microfossils in CT data. To assess the performance of the method, we conducted extensive experiments on nearly 500 fish microfossils. The results show that the intersection over union (IoU) performance metric reached at least 94.39 %, meeting the semantic segmentation requirements of paleontologists. We expect that the DL-based method can also be applied with good performance to CT data of other fossils.
Paleozoic vertebrate microfossils provide important evidence for biostratigraphy, paleobiodiversity, and paleogeography (Zhao and Zhu, 2014; Zhao et al., 2018; Ogg et al., 2016; Märss et al., 1995; Žigaitė et al., 2011; Wang, 2006), as well as for oil and gas exploration (Hackley et al., 2017; Funkhouser and Evitt, 1959). As a subset of vertebrate microfossils, fish microfossils contribute significantly to the study of early vertebrate evolution (Janvier, 1996; Cui et al., 2020; Wang, 1984; Chen et al., 2016; Botella et al., 2007; Cui et al., 2021).
In recent years, thanks to the widespread use of computed tomography (CT) technologies in paleontology, virtual paleontology (VP) has developed rapidly (Lautenschlager, 2016; Sutton et al., 2017; Sutton, 2008; Cunningham et al., 2014). Paleontologists can nondestructively obtain more comprehensive three-dimensional (3D) fossil CT data, including 3D microstructures from the interior to the surface (Fernandez et al., 2012, 2013, 2015). Digital techniques have also been used to classify fish microfossils in 3D volumes (Cui et al., 2020, 2021). Semantic image segmentation is a crucial step in the reconstruction of 3D volumes. However, this task is time-consuming and requires expertise in paleontology.
The purpose of this research paper is to create an effective method for semantic segmentation of vertebrate microfossils in CT data. We chose the deep-learning-based (DL-based) U-Net (Ronneberger et al., 2015) and ResNet34 (He et al., 2016) models for semantic image segmentation. First, we compiled a dataset of CT data containing four types of fish microfossils, which were segmented and labeled by reconstructors. Second, we used ResNet34 as the encoder of the U-Net model. An end-to-end U-Net model with a ResNet34 encoder was designed and trained to solve the semantic segmentation problem. The weights of the best-performing network were saved during training and then used to semantically segment the microfossils in CT data. Finally, the performance of the DL-based segmentation method was compared with that of popular segmentation methods and verified using global intersection over union (IoU) scores.
In practice, paleontologists usually classify microfossils using their 3D virtual models (Andreev et al., 2016; Cui et al., 2020; Qu et al., 2017). They collect microfossil CT data using micro-CT, segment the data, and generate 3D models. This section introduces the popular binary segmentation methods. The quality of CT data may be degraded by various factors, including ring artifacts and background noise. To achieve accurate segmentation, the CT data stack should be optimized, which usually includes noise reduction, image enhancement, and image simplification (Buser et al., 2020). Popular binary segmentation methods include thresholding, morphological filtering, region growing, and boundary detection (Serra and Vincent, 1992; Ziou and Tabbone, 1998; Sahoo et al., 1988), and various image segmentation methods have been applied to fossil images (Ni et al., 2012; Pérez-Ramos and Figueirido, 2020). Thresholding assigns a threshold level to the greyscale values; any part of the CT data above this level is considered a region of interest (ROI) (Goh et al., 2018). Region-growing methods start from seed points manually added by reconstructors; a segmented ROI then spreads from each seed to neighboring voxels that meet certain predefined criteria (Adams and Bischof, 1994). Segmentation based on an edge detector offers an alternative: it detects the boundaries between ROIs and marks the boundary voxels as edges (Bhardwaj and Mittal, 2012).
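The region-growing idea can be illustrated with a minimal sketch on a single 2D slice stored as a list of lists. The 4-connectivity and the acceptance criterion (grey value within a fixed `tolerance` of the seed value) are illustrative assumptions, not the criteria of any particular software package or of Adams and Bischof (1994).

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a region from `seed` = (row, col) over 4-connected pixels whose
    grey value differs from the seed value by at most `tolerance`.
    Returns a binary mask (list of lists of 0/1) of the grown region."""
    rows, cols = len(image), len(image[0])
    r0, c0 = seed
    seed_value = image[r0][c0]
    mask = [[0] * cols for _ in range(rows)]
    mask[r0][c0] = 1
    queue = deque([seed])
    while queue:  # breadth-first spreading from the seed
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]
                    and abs(image[nr][nc] - seed_value) <= tolerance):
                mask[nr][nc] = 1
                queue.append((nr, nc))
    return mask


# Toy slice: a bright "fossil" (about 200) on a dark background (about 10).
slice_ = [
    [10, 12, 11, 10],
    [11, 200, 205, 12],
    [10, 198, 10, 11],
    [12, 11, 10, 10],
]
for row in region_grow(slice_, seed=(1, 1), tolerance=20):
    print(row)
```

Seeding in the bright region recovers the three fossil pixels; seeding in the background with a small tolerance would instead recover the background, which is why reconstructors place seeds by hand.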
In our laboratory, image processing techniques were used to segment fish microfossils in CT data. First, we removed random noise from the data. Based on the excellent performance of median filtering algorithms, we used a
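A median filter can be sketched as follows. The 3 × 3 window and the edge-replication border rule are assumptions made for illustration. The example shows why median filtering suits random, isolated noise: an outlier pixel is replaced by the median of its neighborhood, while a step edge between two flat regions is preserved.

```python
from statistics import median


def median_filter_3x3(image):
    """Apply a 3 x 3 median filter to a 2D slice (list of lists).
    Border pixels are handled by clamping indices (edge replication)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the 9 neighbors, clamping coordinates at the borders.
            window = [
                image[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            ]
            out[r][c] = median(window)
    return out


noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],  # isolated noise pixel
    [10, 10, 10, 10],
]
for row in median_filter_3x3(noisy):
    print(row)  # every value becomes 10: the outlier is removed
```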
Fish microfossils have a much higher mass density than the surrounding air and the transparent plastic wrap used to support the fossils. As a result, the microfossils have significantly different grey values from the surrounding background, and we used this contrast to segment them in CT data. A typical binary thresholding algorithm such as Otsu's method (Otsu, 1979) separates objects of interest from the background, producing binary images of the objects (fish microfossils in our case). Binarization converts the grey value of each pixel to 0 (black) or 1 (white), with black representing the background and white the objects of interest. In this paper, black denotes air and plastic wrap, while white denotes fish microfossils. With an appropriate threshold, most fish microfossils in CT data can be binarized using the Otsu method (see Fig. 2a).
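Otsu's method can be sketched in a few lines: it tries every grey level t, splits the pixels into {v ≤ t} and {v > t}, and keeps the t that maximizes the between-class variance. The sketch assumes 8-bit integer grey values and uses a toy slice with a bright "fossil" on a dark background; it is not necessarily the implementation we used.

```python
def otsu_threshold(values, levels=256):
    """Return the grey level t that maximizes the between-class variance when
    integer grey values in [0, levels) are split into {v <= t} and {v > t}."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_score = 0, -1.0
    weight_bg = sum_bg = 0
    for t in range(levels):
        weight_bg += hist[t]           # pixels with value <= t (background class)
        sum_bg += t * hist[t]
        weight_fg = total - weight_bg  # pixels with value > t (object class)
        if weight_bg == 0 or weight_fg == 0:
            continue
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Proportional to the between-class variance; the 1/total**2 factor
        # is dropped because it does not change the argmax.
        score = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(image, t):
    """Map pixels above t to 1 (white, microfossil) and the rest to 0 (black, background)."""
    return [[1 if v > t else 0 for v in row] for row in image]


slice_ = [
    [10, 12, 11, 10],
    [11, 200, 205, 12],
    [10, 198, 10, 11],
    [12, 11, 10, 10],
]
t = otsu_threshold([v for row in slice_ for v in row])
print(t)  # 12 for this toy slice
for row in binarize(slice_, t):
    print(row)
```

When two bright fossils touch, every threshold keeps them in one white region, which is the digital-connection problem that thresholding alone cannot solve.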
CT data before and after median filtering.
Reconstructing the 3D structure of fish microfossils.
The fish microfossils inevitably touch one another during scanning, so it is difficult to obtain an ideal segmentation of individual objects in some digitally connected areas, and these areas can only be separated manually, one by one. Reconstructors manually segment the fish microfossils, usually determining the assignment or affinity of each image region from its gross morphology (see Fig. 2b). The reconstructors then apply colors to the masks automatically and create multilabel images for training the network (see Fig. 2c). Finally, 3D models of the microfossils are generated with a surface-rendering technique (Racicot, 2017) in different colors, each color associated with a unique type of microfossil (see Fig. 2d). This process is one of the main methods for reconstructing the 3D structure of fish microfossils. Figure 3 shows the workflow of the manual semantic segmentation of fish microfossils in our laboratory.
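Turning colored masks into training targets amounts to mapping each mask color to an integer class index and, for a softmax output, to a one-hot vector. In the sketch below, `PALETTE` and its color assignments are hypothetical placeholders; only the idea of one unique color per microfossil type comes from the text.

```python
# Hypothetical color-to-class palette: background plus four microfossil types.
PALETTE = {
    (0, 0, 0): 0,        # background (air and plastic wrap)
    (255, 255, 0): 1,    # microfossil type 1 (hypothetical color)
    (255, 0, 0): 2,      # microfossil type 2 (hypothetical color)
    (0, 255, 0): 3,      # microfossil type 3 (hypothetical color)
    (0, 0, 255): 4,      # microfossil type 4 (hypothetical color)
}


def rgb_mask_to_labels(mask, palette=PALETTE):
    """Convert an RGB color mask (rows of (r, g, b) tuples) into an integer
    label map; raise ValueError on any color not in the palette."""
    labels = []
    for row in mask:
        out_row = []
        for pixel in row:
            if pixel not in palette:
                raise ValueError(f"unexpected mask color {pixel}")
            out_row.append(palette[pixel])
        labels.append(out_row)
    return labels


def one_hot(labels, num_classes):
    """Expand a label map into per-pixel one-hot vectors."""
    return [[[1 if k == lab else 0 for k in range(num_classes)] for lab in row]
            for row in labels]


mask = [[(0, 0, 0), (255, 255, 0)],
        [(255, 0, 0), (0, 0, 255)]]
labels = rgb_mask_to_labels(mask)
print(labels)
print(one_hot(labels, num_classes=len(PALETTE))[0][1])
```

Rejecting unknown colors, rather than silently mapping them to background, helps catch anti-aliased or mislabeled mask pixels before training.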
Manual semantic segmentation is time-consuming and requires expertise in paleontology, so an alternative technique is needed. Deep learning is a popular research area in machine learning and artificial intelligence that has made great progress in the last decade (LeCun et al., 2015). DL-based techniques have achieved excellent performance in various computer vision tasks such as image denoising (Tian et al., 2020), target detection (Khan et al., 2017), image classification (Xu et al., 2020), and image segmentation (Jin et al., 2018). Paleontologists are using the capabilities of deep neural networks (DNNs) to solve paleontological problems (Marchant et al., 2020; Tetard et al., 2020; Bourel et al., 2020). DNNs can be used not only for the accurate classification of vertebrate fossils from their 3D volumes (Hou et al., 2020), but also for the rapid documentation of discrete fossiliferous levels (Martín-Perea et al., 2020). DNNs have achieved impressive performance and show great potential in paleontology.
Semantic image segmentation, which partitions a single image into different parts, is another important application of deep learning. U-Net is a classic semantic segmentation network that performs well on microfossil images, especially CT data. The U-Net model has been successfully used for planktonic foraminifera recognition (Carvalho et al., 2020; Ge et al., 2020), charcoal particle identification (Rehn et al., 2019), and other micropaleontology tasks. In this paper, we chose an improved U-Net model to semantically segment fish microfossils in CT data. We used CT data with manual semantic segmentation as the training set, in which reconstructors manually labeled the boundaries between touching particles. Through feature extraction, the network can learn these marked boundaries, so training on these data allows it to separate touching particles.
The U-Net model performs semantic segmentation of CT data at the pixel level. The model consists of three parts: encoder, connector, and decoder. The encoder uses convolutional layers to extract features from the input images. Within the encoder, pooling layers decrease the spatial scale, speed up feature detection, and reduce the computational burden. The connector is a copy operation that concatenates features of the same scale extracted on the down-sampling and up-sampling paths. The decoder uses deconvolution layers to restore the feature maps to the size of the input images and predict the results.
Similarly, our network also comprises three parts: encoder, connector, and decoder. A residual module is introduced into the encoder. The residual module is calculated as shown in Eqs. (2) and (3):
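For reference, the generic residual mapping of He et al. (2016) can be written as below, where x and y are the input and output of a residual module, 𝓕 is the residual function learned by the stacked convolutional layers (two layers in this example, biases omitted), σ is the ReLU activation, and W_s is a projection used only when the dimensions of x and 𝓕 differ. The notation is ours and may differ from that of Eqs. (2) and (3).

```latex
\mathbf{y} = \mathcal{F}\left(\mathbf{x}, \{W_i\}\right) + W_s\,\mathbf{x},
\qquad
\mathcal{F}\left(\mathbf{x}, \{W_i\}\right) = W_2\,\sigma\left(W_1 \mathbf{x}\right)
```

When the input and output shapes match, W_s is the identity and the shortcut adds x unchanged, which is what lets gradients bypass the convolutional layers during training.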
Our network uses a pre-trained ResNet34 as the backbone of the encoder. ResNet34 is a residual network based on the convolutional neural network (CNN). Its most significant difference from traditional neural networks is that the input of a convolution block is added to the block's output. Deeper networks can extract more features and better express image semantics, but traditional deep networks suffer from serious vanishing-gradient and network degradation problems (Wu et al., 2019). Adding residual blocks alleviates these problems, and the resulting network is much easier to optimize. In the encoder, ResNet34 extracts features in four stages, each containing several residual modules.
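To make the four encoder stages concrete, the sketch below tracks the feature-map shape through a ResNet34 encoder. It assumes the standard ImageNet layout of ResNet34 (He et al., 2016): a 7 × 7 stride-2 stem convolution, 3 × 3 stride-2 max pooling, then stages of 3, 4, 6, and 3 residual modules with 64, 128, 256, and 512 channels. It also assumes an input whose height and width are divisible by 32. This illustrates the standard backbone, not every layer setting of our network.

```python
# Standard ResNet34 stages (He et al., 2016): (residual modules, channels, stride).
RESNET34_STAGES = [(3, 64, 1), (4, 128, 2), (6, 256, 2), (3, 512, 2)]


def resnet34_encoder_shapes(height, width):
    """Return (name, height, width, channels) for the stem output and each
    encoder stage, assuming height and width are divisible by 32."""
    shapes = []
    h, w = height // 2, width // 2       # 7 x 7 convolution, stride 2
    shapes.append(("stem conv", h, w, 64))
    h, w = h // 2, w // 2                # 3 x 3 max pooling, stride 2
    for i, (blocks, channels, stride) in enumerate(RESNET34_STAGES, start=1):
        h, w = h // stride, w // stride  # first module of a stage may down-sample
        shapes.append((f"stage {i} ({blocks} modules)", h, w, channels))
    return shapes


for name, h, w, c in resnet34_encoder_shapes(256, 256):
    print(f"{name:<22} {h:>3} x {w:<3} x {c}")

# 2 convolutions per residual module + stem convolution + final dense layer = 34.
print(2 * sum(b for b, _, _ in RESNET34_STAGES) + 2)
```

For a 256 × 256 patch, the deepest stage yields an 8 × 8 × 512 feature map.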
The image size in pixels strongly influences the computing cost and the prediction results. To support the training process within the graphics processing unit (GPU) memory limit, the CT data and labels in the datasets are randomly cropped into small patches of 256 × 256 pixels.
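Random cropping must apply the same window to a CT slice and to its label map so that each pixel keeps its class. A minimal sketch, assuming uniformly random window placement; in practice `size` would be 256, while the toy usage below uses a smaller patch for readability.

```python
import random


def random_crop_pair(image, label, size, rng=random):
    """Crop the same size x size window from a CT slice and its label map
    (both lists of lists), so that every pixel keeps its correct class."""
    rows, cols = len(image), len(image[0])
    if len(label) != rows or len(label[0]) != cols:
        raise ValueError("image and label must have the same shape")
    if size > rows or size > cols:
        raise ValueError("patch is larger than the slice")
    top = rng.randint(0, rows - size)    # randint includes both end points
    left = rng.randint(0, cols - size)
    img_patch = [row[left:left + size] for row in image[top:top + size]]
    lab_patch = [row[left:left + size] for row in label[top:top + size]]
    return img_patch, lab_patch


# Toy 10 x 10 slice whose label is derived from the pixel value, so that
# correct alignment of the two crops is easy to check.
img = [[r * 100 + c for c in range(10)] for r in range(10)]
lab = [[(r * 100 + c) % 7 for c in range(10)] for r in range(10)]
patch_img, patch_lab = random_crop_pair(img, lab, 4, random.Random(42))
print(patch_img[0], patch_lab[0])
```

Passing a seeded `random.Random` instance as `rng` makes the crops reproducible between training runs.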
Workflow of the manual semantic segmentation method.
Random cropping of CT data and labels with patches of 256 × 256 pixels.
The input image is designed to be a patch of
The connector, the skip-connection part in the middle of the network inspired by the feature pyramid network (FPN), is designed as a pyramidal hierarchical structure (Lin et al., 2017). It concatenates the feature maps from the encoding units to the decoding units to achieve multiscale feature fusion. The fused feature maps are then passed to the decoder, which semantically segments the fish microfossils in CT data.
The decoder part also consists of four stages. Each stage includes an
up-sampling process that uses a transposed convolutional layer with a
The U-Net
U-Net
The U-Net model can perform semantic segmentation on images of arbitrary size. To fit the network and the GPU memory, we randomly cropped the CT data and labels with a sliding window into patches with a fixed size of 256 × 256 pixels.
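A network trained on fixed-size patches is commonly applied to a whole slice of arbitrary size by sliding the window with a fixed stride and adding a last window flush with the border, so that every pixel is covered. The sketch below computes such window origins; the stride values and this edge rule are assumptions for illustration, not settings reported in the text.

```python
def window_origins(length, window, stride):
    """Start offsets of windows of size `window` along an axis of size
    `length`, stepping by `stride`, plus a final window flush with the end
    so that every pixel is covered."""
    if window > length:
        raise ValueError("window is larger than the axis")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def tile_grid(height, width, window=256, stride=256):
    """All (top, left) origins of window x window tiles covering a slice."""
    return [(top, left)
            for top in window_origins(height, window, stride)
            for left in window_origins(width, window, stride)]


print(window_origins(600, 256, 256))  # [0, 256, 344]
print(len(tile_grid(600, 512)))       # 3 rows x 2 columns = 6 tiles
```

A stride smaller than the window produces overlapping tiles, whose predictions can then be averaged to reduce artifacts at tile borders.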
Semantic segmentation is akin to a multiclass classification problem that assigns a label to each pixel in an image. We chose multiclass cross-entropy as our loss function; it measures the difference between the true labels and the predicted labels, and we used it to update the network weights and improve performance on the training set. To obtain better-performing model parameters, we optimized the weights with the Adam algorithm (Kingma and Ba, 2014). Training ran for a total of 20 iterations, with a batch size of 8 and an initial learning rate of 0.0001. The loss function is given in Eq. (4):
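The standard pixel-averaged multiclass cross-entropy is L = -(1/N) Σ_i Σ_k y_ik log p_ik, where N is the number of pixels, y_ik is 1 if pixel i belongs to class k and 0 otherwise, and p_ik is the predicted probability of class k at pixel i. The sketch below implements this standard form in plain Python. It may differ in notation or normalization from Eq. (4), and clipping probabilities to [eps, 1] is an assumption added to avoid log(0).

```python
import math


def pixelwise_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean multiclass cross-entropy over pixels.

    y_true: per-pixel one-hot vectors (flat list of lists).
    y_pred: per-pixel predicted class probabilities, e.g. softmax outputs.
    Probabilities are clipped to [eps, 1] so that log(0) never occurs."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        total -= sum(tk * math.log(min(max(pk, eps), 1.0)) for tk, pk in zip(t, p))
    return total / len(y_true)


# Two pixels, three classes (background and two microfossil types, for illustration).
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
print(round(pixelwise_cross_entropy(y_true, y_pred), 4))  # 0.2899
```

Only the probability assigned to the true class contributes for each pixel, so the loss falls toward 0 as the network becomes confident and correct.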
We applied the U-Net
The method proposed in this paper was implemented using Keras (Chollet, 2015), which has been widely used in many other tasks such as medical image segmentation and fossil classification (Hou et al., 2020). The computer used has an Nvidia RTX 2080 Ti GPU, 128 GB of memory, and an Intel Xeon Silver 4114 CPU.
We dissolved the matrix surrounding the samples from the Xitun Formation (Early Devonian) (Zhao et al., 2021) in a 3 %–7 % acetic acid solution. The fish microfossils were picked from the processed samples under a microscope (Cui et al., 2020; Li et al., 2021; Cui et al., 2021). All fish microfossils were collected and examined at the Institute of Vertebrate Paleontology and Paleoanthropology (IVPP) of the Chinese Academy of Sciences (CAS). We used plastic wrap and a specially customized plastic tube to fix the specimens. We scanned the fixed microfossils with a 225 kV micro-CT scanner consisting of three main parts: an X-ray tube (Phoenix XS-225D), a detector (Varian 4030CB), and a rotary table (HUBER 410). The scanner was designed by the Institute of High-Energy Physics (IHEP), CAS (Wang et al., 2019). The tube voltage was 100 kV. The target current was set to 100
Process of the data preparation.
A total of six experimental datasets were compiled for this research; the number and placement of microfossils differ between datasets. The details of the datasets are shown in Table 1. All the CT data in the datasets were manually marked with color masks as true labels, such as yellow associated with
Details of experimental datasets.
The multiclass approach was used to semantically segment all the pixels in the CT data. Therefore, we used the multiclass IoU as the evaluation criterion. IoU is a popular evaluation metric for DL-based semantic segmentation, and we used it to evaluate both the popular methods and the DL-based method. The IoU score is defined as the size of the intersection divided by the size of the union of the sample sets and is computed as follows:
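For reference, the standard definition can be written as below in our notation, where P_k and G_k are the sets of pixels predicted as and labeled as class k, TP_k, FP_k, and FN_k are the corresponding true-positive, false-positive, and false-negative pixel counts, and K is the number of classes averaged. The notation may differ from that of the original equation.

```latex
\mathrm{IoU}_k = \frac{\lvert P_k \cap G_k \rvert}{\lvert P_k \cup G_k \rvert}
             = \frac{TP_k}{TP_k + FP_k + FN_k},
\qquad
\overline{\mathrm{IoU}} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{IoU}_k
```

An IoU of 1 means the predicted and true masks coincide exactly; both missed pixels (FN) and spurious pixels (FP) lower the score.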
The predicted labels were evaluated against the manually marked true labels. Because we performed semantic image segmentation of multiple types of fish microfossils, we calculated the IoU score of each type and their average to obtain a global index. IoU
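Per-class IoU and its average can be computed from flattened label maps as sketched below. Which classes enter the average (for example, whether the background is included) is a parameter here because the text does not specify it; the labels in the example are toy values.

```python
def per_class_iou(true_labels, pred_labels, classes):
    """IoU = |intersection| / |union| for each class in `classes`, computed
    over flat, equally long lists of per-pixel labels. Classes that appear
    in neither list are skipped."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    ious = {}
    for k in classes:
        inter = sum(1 for t, p in zip(true_labels, pred_labels) if t == k and p == k)
        union = sum(1 for t, p in zip(true_labels, pred_labels) if t == k or p == k)
        if union:
            ious[k] = inter / union
    return ious


def global_iou(true_labels, pred_labels, classes):
    """Average of the per-class IoU scores over the classes that occur."""
    ious = per_class_iou(true_labels, pred_labels, classes)
    if not ious:
        raise ValueError("none of the requested classes occurs")
    return sum(ious.values()) / len(ious)


# Toy example: 0 = background, 1 and 2 = two microfossil types.
true = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(per_class_iou(true, pred, [0, 1, 2]))
print(global_iou(true, pred, [1, 2]))
```

Because each class is scored separately before averaging, a small microfossil type counts as much as a large one in the global index.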
Comparison between popular methods and the DL-based method.
We compared the DL-based method with the popular method on all six experimental datasets. The popular segmentation method was based on automatic thresholding and watersheds (Roerdink and Meijster, 2000). For example, on dataset SN1, we used the Otsu method for binary image segmentation, and digital connections between fish microfossils appeared, as indicated by the red circles in Fig. 7. We also tried to separate the fossils with the watershed method, but the results were not as expected: it was difficult to isolate individual fossils with the watershed algorithm, and a single fish microfossil was divided into several parts, as indicated by the green circles in Fig. 7. The DL-based method segmented details better than the popular method. Table 2 shows the global IoU
Global IoU
We trained the network on a limited number of tomograms and labels manually segmented by reconstructors. Then, we applied the trained DL program to the full CT dataset to accomplish semantic segmentation. We chose the global IoU
Global IoU
Accuracy and loss function curves on the dataset SN1.
Performance of the DL-based semantic segmentation method on the dataset SN1.
We proposed a DL-based semantic segmentation method for fish microfossils in CT data. We demonstrated that our method is effective and produces results close to those of manual segmentation, and that it compares favorably with the popular segmentation method.
An essential step in the popular segmentation method is selecting the threshold for optimal binarization in the absence of prior knowledge (Nosrati and Hamarneh, 2016). Noise in images manifests in different ways depending on the target application (Sagheer and George, 2020). In the experimental datasets, for instance, the light grey-level noise in the CT data corresponds to the relatively low-density plastic wrap used to fix the microfossils, while the remaining high grey values represent the fish microfossils, mainly scales and teeth. The plastic wrap could not cleanly isolate each microfossil, which led to digital connections between two or more fish microfossils. Experimental results show that the watershed algorithm (Roerdink and Meijster, 2000) cannot automatically detect the boundaries of fish microfossils. Therefore, reconstructors have to manually segment digitally connected fish microfossils, a process that is time-consuming and requires expertise.
In this paper, we used the U-Net
Like the popular method, the DL-based method also faces some challenges. Digitizing microfossils involves a series of tasks, such as selecting specimens, bearing the cost of micro-CT scanning, and having reconstructors label each fossil. We do not have enough data to verify whether our method can be successfully applied to other types of fish microfossils, so the generality of the DL-based method should be tested in follow-up studies.
However, our contribution is a well-established method for semantically segmenting vertebrate microfossils, specifically fish microfossils. Our goal at this research stage is to obtain more CT data on fish microfossils to expand our dataset. We have a professional labeling team that can provide high-quality data. Our proposed method is relatively successful and promising. Currently, we have obtained nearly 500 specimens of four types of fish microfossils, which represents an abundance of samples. There is no other publicly available CT dataset of fish microfossils that is comparable in size, let alone one containing expert-labeled images. We believe that our work may be helpful in processing CT data of fish microfossils and even of other microfossils. We will make the fully labeled CT dataset and the DL-based semantic segmentation method public in a publicly accessible repository (ADMorph) at
In summary, we have provided a labeled CT dataset and proposed a baseline DL-based method for semantically segmenting vertebrate microfossils in CT data. Our preliminary study, based on extensive experiments on nearly 500 fish microfossils, shows that the intersection over union (IoU) performance metric reached at least 94.39 %, meeting the semantic segmentation requirements of paleontologists. In addition to improving our existing hardware and framework, our future work aims to add more types of fossils to our dataset. Further network training could lead to the automatic segmentation of more types of microfossils and add to the knowledge of the distribution of vertebrate microfossils in the strata.
The images and the original material used and published here are stored at the Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences, Beijing, China.
YH performed the experiments, acquired CT data for the datasets, and was the primary author of the paper. MCK performed the experiments and edited the paper. XC acquired and labeled the data for the datasets and edited the paper. RHB edited the paper. MZ organized the project and also wrote the publication.
The contact author has declared that neither they nor their co-authors have any competing interests.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The authors would like to thank Liping Dong and Zhikun Gai for discussions and Liantao Jia, Pengfei Yin, Qiang Li, and Penghe Wang for specimen preparation and data collection. We thank Vincent Fernandez and an anonymous reviewer for their constructive suggestions. This paper was edited by Emanuela Mattioli, who provided additional insights and comments that improved the paper.
This research has been supported by the Chinese Academy of Sciences (grant nos. XDA19050102, XDB26000000, and QYZDJ-SSW-DQC002) and the National Natural Science Foundation of China (grant no. 42130209).
This paper was edited by Emanuela Mattioli and reviewed by two anonymous referees.