DNA data storage is rapidly emerging as a promising solution for long-term data archiving, largely due to its exceptional durability. However, the synthesis of DNA strands remains a significant bottleneck in terms of cost and speed. To address this, new methods have been developed that encode information by concatenating long data-carrying DNA sequences from pre-synthesized DNA subsequences, known as motifs, drawn from a library. Reading back data from DNA storage relies on basecalling, the process of translating raw nanopore sequencing signals into DNA base sequences using machine learning models. These sequences are then decoded back into binary data. However, current basecalling approaches are not optimized for decoding motif-carrying DNA: they first predict individual bases from the raw signal and only afterward attempt to identify higher-level motifs. This two-step, motif-agnostic process is both imprecise and inefficient. In this paper we introduce Motif Caller, a machine learning model designed to directly detect entire motifs from raw nanopore signals, bypassing the need for intermediate basecalling. By targeting motifs directly, Motif Caller leverages richer signal features associated with each motif, resulting in significantly improved accuracy. This direct approach also enhances the efficiency of data retrieval in motif-based DNA storage systems.
Introduction
The volume of archival data is growing exponentially, driven by factors such as future analytics needs and regulatory compliance requirements1. As a result, the world’s archive of stored data continues to expand at a rapid pace. Yet, it is estimated that over 80% of this archived data will either never be accessed again or only accessed very infrequently. This category of rarely accessed information is referred to as “cold data”2.
Traditional cold storage solutions, such as hard drives, tape, and optical devices, have limited lifespans – typically between 5 and 20 years, depending on the medium. This limitation means that long-term storage data must be periodically migrated to new media, increasing both costs and complexity over time3.
In light of these limitations, DNA – nature’s own information storage system, refined over billions of years – has emerged as a promising medium for long-term data preservation4, 5–6. DNA offers unmatched data density and durability. In fact, estimates suggest that all the digital data in the world could theoretically be stored in a space no larger than a shoebox7. However, despite its immense potential, the prohibitive cost and speed of DNA synthesis remains a significant barrier to its widespread adoption8.
To address the high synthesis costs, several approaches have emerged to circumvent the standard base-by-base process. One class of novel synthesis methods writes data to DNA using pre-synthesized motifs, which are short subsequences of DNA9. Doing so simplifies synthesis, reduces its cost and improves downstream decoding. In addition, encoding information over a combination of motifs from a fixed motif library increases the logical information density as well10,11. These are broadly classified as motif-based methods for synthesizing DNA for DNA storage.
To read the data, the motifs must be recovered from the raw signal coming from DNA sequencing devices. Today, high-throughput technologies, such as Nanopore sequencers, are readily accessible, making sequencing more cost-effective8,12,13. Nanopore sequencing involves threading a DNA strand through a Nanopore that has an electrical current running through it14,15. As the DNA strand passes through, it disrupts the current, generating a series of electrical signal deflections that correspond to the individual bases. Machine learning based algorithms are then used to convert the recorded signal trace into a sequence of DNA bases in a process known as basecalling16,17. After basecalling, a motif detection algorithm called Motif Search then finds the motifs in the sequence to decode the data10.
This paper explores an innovative approach to shorten the process and proposes a machine learning model — a Motif Caller — that directly maps the raw signal to motifs. Doing so means that the machine learning model can work with more features (mapping several bases in one go instead of just one) which improves the ability of the model to detect motifs. We aim to increase the percentage of motifs detected per read, which in turn, leads to a lower effective sequencing coverage. We start by breaking down the two methods of motif-based synthesis (Sec. 2), the Helixworks approach10 and the CatalogDNA approach9. Following this, we introduce our motif-inferring methods, the Motif Search pipelines (Sec. 3) and consequently our proposed approach of the Motif Caller (Sec. 4). We then introduce the datasets that we use to compare the methods (Sec. 5) and evaluate the methods on the datasets (Sec. 6). We conclude with a brief discussion of the results, potential further improvements of the Motif Caller (Sec. 7) and its applications outside of DNA storage. By addressing the key challenge of sequencing efficiency, our research brings us one step closer to realizing the immense potential of DNA as a revolutionary data storage medium.
Motif-based synthesis
The practical application of DNA for data storage has been constrained by the high cost and low throughput of conventional phosphoramidite chemistry, which synthesizes DNA strands one nucleotide at a time. To overcome this, motif-based synthesis has emerged as a new paradigm where information is encoded by assembling pre-synthesized DNA subsequences, or “motifs”9,10,18. This shift is driven by write-side gains that conventional base-by-base chemistry cannot deliver. By assembling oligos from a prefabricated and reusable motif library, synthesis cost and throughput are decoupled from sequence length and instead scale with the smaller number of assembly steps, amortizing the initial library across thousands of reactions. Motifs are automation-friendly, enabling parallelism and high-throughput assembly on liquid handlers and inkjet platforms. They standardise sequence blocks so that addresses or checksums can be embedded for random access, adaptive sampling or robust integrity checks. Finally, combinatorial encoding maximises bits per biochemical step, collapsing cycles per bit and raising the overall information density.
Structuring data into motifs rather than individual bases fundamentally changes the nature of the decoding challenge. Firstly, a motif produces a richer and more distinctive electrical signature than any single base, leading to a robust signal for a machine learning model to detect. Secondly, and more critically, a motif-based structure provides powerful resilience against the most challenging sequencing errors - insertion/deletion errors (indel), which cause a loss of synchronization in the base-by-base encoding and lead to downstream data corruption. This can be corrected as a known motif sequence can serve as an anchor to resynchronize the reading frame.
This principle of read-side resilience was effectively demonstrated by the CMOSS system19. While it used phosphoramidite chemistry, its innovative columnar data layout enabled an integrated consensus and decoding pipeline that prevented indel propagation, achieving full data recovery with as little as 4x sequencing coverage. The prevailing trend in scalable systems is to employ the enzymatic ligation of prefabricated motifs, a method that also confers the read-side advantages previously discussed.
Three distinct architectural philosophies have been reported for conducting motif-based storage by Catalog, BISHENG-1, and Helixworks, which differ in their assembly mechanics, encoding strategies, and optimization goals. Two of these strategies, from Catalog and BISHENG-1, prioritize massive parallelism and cost reduction through high-throughput assembly. Though they use different encoding schemes, a Cartesian product of motifs for Catalog9 versus a “movable type” block assembly for BISHENG-120, both are implemented using custom-developed, automated inkjet printers that dispense and ligate prefabricated, double-stranded DNA (dsDNA) fragments with sticky overhangs. The economic potential of this approach is significant; as demonstrated by the BISHENG-1 system, a single synthesis of a motif can support up to 10,000 assembly reactions. This reusability, combined with the potential to scale down droplet volumes from microliters (µL) to nanoliters (nL), indicates that write costs for such systems can be reduced from $122 per megabyte to potentially as low as $0.12 per megabyte20. However, a significant practical limitation of these high-throughput systems is the capital cost associated with both the initial synthesis of the library and the expensive, dedicated inkjet nozzles required for each unique motif.
The strategy used by Helixworks prioritizes maximizing information density per payload slot10,18,21. This system utilizes a different assembly methodology, performing ligations of single-stranded DNA (ssDNA) on a more conventional OpenTrons liquid handling platform, where a “bridge” oligo facilitates the joining of adjacent motifs. This procedure is illustrated in Fig. 1. The resulting oligo architecture is payload dominant, comprising a small number of identifier slots (three address slots) followed by eight payload slots to form an oligo of approximately 625 bp. Information is encoded using a “composite motif” technique where, for each of the eight payload slots, a single combinatorial symbol is represented by a mixture from a subset of motifs (4 chosen from a library of 8). This yields a logical density of $\log_2\binom{8}{4} = \log_2 70 \approx 6.13$ bits per synthesis cycle/slot.
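As a quick check, this density follows directly from counting the composite symbols; the snippet below is an illustrative calculation only, not part of any pipeline.

```python
from math import comb, log2

library_size, chosen = 8, 4
symbols = comb(library_size, chosen)  # C(8, 4) = 70 composite symbols per slot
bits_per_slot = log2(symbols)         # ~6.13 bits per synthesis cycle/slot
print(f"{symbols} symbols -> {bits_per_slot:.2f} bits per slot")
```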
Fig. 1 [Images not available. See PDF.]
Illustration of the typical block-like oligo structure in the Helixworks DNA storage pipeline. Using the bridge oligo, the spacers (S) of two motifs are concatenated via their complementary sequences. Motifs are added to the pool one by one, and once all the motifs are pooled, the complete oligo is generated in a one-step assembly process.
Despite their differences, all the systems are dependent on the accurate identification of their constituent motifs from noisy sequencing data. An error in identifying a single motif can corrupt a data symbol, invalidate an address, or select an incorrect block, leading to data loss. This common point of failure highlights the universal need for a robust motif identification tool that is agnostic to the overarching oligo architecture. The conventional two-step decoding pipeline, basecalling raw signals to nucleotides and then searching for motif sequences, is suboptimal as it discards valuable signal-level information. A tool like Motif Caller, which directly interprets raw signals to identify entire motifs, is therefore broadly applicable and essential for improving the fidelity of any motif-based storage system.
The data generated by the Helixworks multi-slot, composite motif architecture was used to develop and test the Motif Caller. The system’s reliance on sequential combinatorial encoding presents a particularly demanding and relevant use case for direct motif detection. The high logical density achieved through composite motifs amplifies the negative impact of identification errors, and the multi-slot structure requires precise sequential decoding, making it an ideal environment to validate the performance gains of bypassing intermediate basecalling.
Basecalling & motif search - the baseline
Figure 2 (a-d) illustrates the baseline, a motif-based DNA storage sequencing pipeline. A motif-based oligo passes through the nanopore (Fig. 2a) to produce an electrical current readout, called the squiggle (Fig. 2b). This squiggle is the deflection of the current as the bases of the strand pass through the nanopore. The remainder of the pipeline aims to reconstruct the motif sequence from the squiggle in two steps.
Fig. 2 [Images not available. See PDF.]
Figure illustrating the difference in operation between the Motif Search and Motif Caller pipelines for sequence reconstruction. A motif-based oligo containing the symbols passes through the nanopore (a) and produces a trace of the bases in the squiggle (b). The signal is used to predict a base-level sequence using basecalling (c). From this base-level sequence, Motif Search (d) is used to obtain the final motif sequence prediction. The Motif Caller (e) shortcuts the two-step reconstruction by predicting the motif sequence directly from the squiggle.
Basecalling (Fig. 2c) reconstructs the sequence of bases from the squiggle created by the oligo passing through the nanopore. Basecallers learn the representation of individual bases within a squiggle, based on which they make inferences to reconstruct the base-level sequence. In our dataset, Guppy is used to basecall in the first step of the motif search pipelines.
Zero-error (ZE) search
Once a base-level prediction is made from the squiggle, the original stored motif sequence can be reconstructed (Fig. 2d). The challenge is to infer the motifs from a noisy base-level prediction, which is especially prone to deletion errors. Zero-error motif search10 looks for exact matches of motifs, which means that the entire sequence of bases in a motif must be basecalled correctly in order for the motif to be detected.
The performance of Zero-error search is therefore inherently limited by the ability of the basecaller to predict a sequence of bases correctly. This dependence leads to limited performance and a large number of deletions in the prediction.
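As a minimal sketch, assuming the library is a plain name-to-sequence dictionary and the read is a basecalled string (the published pipeline10 additionally handles addresses, spacers and read orientation):

```python
def zero_error_search(read: str, library: dict[str, str]) -> list[tuple[int, str]]:
    """Return (position, motif_name) for every exact motif occurrence in a
    basecalled read. A single basecalling error anywhere in a 25-nt motif
    prevents that motif from being detected, which is why ZE search misses
    so many motifs in practice."""
    hits = []
    for name, seq in library.items():
        start = read.find(seq)
        while start != -1:
            hits.append((start, name))
            start = read.find(seq, start + 1)
    return sorted(hits)
```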
Approximate-Matching (AM) search
Approximate-matching motif search22 overcomes this limitation by making the best guess of the motifs at a certain cycle position. First, during segmentation, the algorithm locates candidate spacer positions by matching short k-mers from the read to a pre-built index of the spacer sequences. To handle sequencing errors, it filters out weak candidates, merges nearby candidates, and corrects spacer positions using randomized embedding and Hamming distance comparisons, producing accurate spacer locations. It then identifies chains of spacers spaced according to the expected motif-spacer structure, allowing for small variations due to insertions or deletions.
Next, in the mapping step, AM search extracts segments between the spacer positions and aligns them to a reference library of the motifs that were used for storing information. This is done using a sequence alignment library (ksw-lib23), which selects the best-matching motif for each segment. Finally, an overlap check ensures that each nucleotide in the read is assigned to only one motif/oligo: overlapping chains are grouped, and only the chain with the highest mapping score in each group is retained. This process enables accurate inference of the motif sequence despite sequencing errors and read variability.
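The mapping step can be approximated in a few lines with a plain edit-distance comparison; the sketch below substitutes a textbook Levenshtein computation for the optimized ksw-lib alignment and omits the spacer chaining and overlap checks of the full pipeline:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic single-row dynamic program."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        diag, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            diag, row[j] = row[j], min(row[j] + 1,         # deletion
                                       row[j - 1] + 1,     # insertion
                                       diag + (ca != cb))  # substitution
    return row[-1]

def map_segment(segment: str, library: dict[str, str]) -> str:
    """Assign a between-spacer segment to the closest motif in the library."""
    return min(library, key=lambda name: edit_distance(segment, library[name]))
```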
Motif caller
The Motif Caller pipeline, illustrated in Fig. 2 (a, b & e), shortens the two-step process of basecalling followed by Motif Search by predicting a motif-level sequence directly from the squiggle. With perfect data, we would expect a motif-level accuracy similar to that of basecallers, or even superior, owing to the larger number of features per motif as compared to a base.
However, generating perfect labels in a real-world scenario is challenging, especially in the combinatorial pipeline, since the synthesis process leads to uncertainty about the motifs within a particular block (as we elaborate further in Sec. 5). We expect the traditional basecalling architecture to generalize well to this use case. The difference from standard basecalling models is the larger target unit (a 25 nt motif rather than a single base) and the larger label alphabet (8 motifs as compared to 4 bases). The Motif Caller is not restricted to a particular library size and, in principle, can be extended to larger and more complex libraries. However, scaling to libraries with a larger number of motifs will require retraining to ensure that the model learns the expanded feature space, along with proportional increases in model capacity. The limiting factor is the availability of sufficient training data to capture the variability of the larger library. We therefore expect the approach to generalize, but with practical trade-offs in model size and training requirements as the library grows.
We use a training strategy similar to that of standard basecalling models such as Dorado, Chiron and Guppy16. The model uses connectionist temporal classification (CTC) loss in order to operate directly on unsegmented raw data24. The model architecture is shown in Fig. 3. The convolutional layers (four layers with filters growing from 32 to 128) extract important features from the raw data and feed them to the sequential layers of the model, which are bidirectional gated recurrent units25 (three layers with a hidden size of 256). These approximate the function that conducts the prediction for each squiggle (current signal), while handling the temporal dependencies of the labels. The CTC loss helps the model learn the alignment between squiggles and motifs during training, without any explicit event segmentation.
Fig. 3 [Images not available. See PDF.]
Model architecture used for the Motif Caller. The architecture is similar to models used for basecalling such as Guppy and Dorado.
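A PyTorch sketch of this architecture is given below, following the layer counts and sizes stated above (four convolutional layers with filters growing from 32 to 128, 64x downsampling, three bidirectional GRU layers of hidden size 256, and a fully connected output). The kernel sizes and the exact placement of strides and pools are our assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class MotifCallerNet(nn.Module):
    """Sketch of the Motif Caller network: convolutional feature extractor,
    bidirectional GRU stack, and a log-softmax output for CTC training."""

    def __init__(self, num_classes: int):  # num_classes = motifs + CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            # Two stride-2 convolutions and two stride-4 max-pools yield the
            # 64x downsampling described in the text (2 * 4 * 2 * 4 = 64).
            nn.Conv1d(1, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.gru = nn.GRU(128, 256, num_layers=3,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 256, num_classes)  # 512-wide output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.conv(x.unsqueeze(1))        # (batch, 128, T)
        feats = feats.transpose(1, 2)            # (batch, T, 128)
        out, _ = self.gru(feats)                 # (batch, T, 512)
        return self.fc(out).log_softmax(dim=-1)  # (batch, T, num_classes)

model = MotifCallerNet(num_classes=9)  # e.g. 8 payload motifs + blank
```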
Training procedure
We aim to predict the motif sequence directly from raw squiggle data without any segmentation steps. Let $M$ denote the original set of motifs used to store information, comprising eight distinct motifs. Each motif is a random sequence of bases of length $l$; in our datasets, all motifs have the same length of 25 bases, without loss of generality. The task is to map a squiggle input to its corresponding motif labels. Let the raw signal input be $x = (x_1, \ldots, x_T)$, which needs to be mapped to its respective motif label sequence $y = (y_1, \ldots, y_U)$ with $y_u \in M$. Training samples $(x, y)$ are drawn from the dataset $\mathcal{D}$. By using the model architecture and loss function described below, the input time series $x$ can be directly translated to the sequence $y$ without any segmentation steps.
For each training sample, the convolutional layers extract local features from the raw signal, generating a sequence of feature maps. These are progressively downsampled by a factor of 64 through a combination of two convolutional layers with stride 2 and two max-pooling layers with stride 4. There are four convolutional layers in total, with filters increasing from 32 to 128; the last layer retains the 128 filters and passes the resulting feature sequence to a stack of Gated Recurrent Units (GRUs), which capture the temporal dependencies present in the signal. There are three hidden GRU layers, each with a hidden size of 256. The output of the GRUs is fed into a fully connected layer of size 512 that maps the temporal representations to a set of label classes through a log-softmax operation.

The model outputs a series of probability distributions: the probability of observing each token at each of $T$ timesteps, where $T$ is about three times the number of targets in the label. CTC introduces an additional “blank” token, denoted $\epsilon$, to the set of possible labels; this token allows the model to learn the transitions between tokens. The model thus outputs a probability distribution of shape $T \times (|M| + 1)$.

From the model output, a sequence of labels $a = (a_1, \ldots, a_T)$ can be produced by selecting a token at each timestep, where each $a_t$ is one of the motifs in $M$ or $\epsilon$. This sequence of labels is called an alignment. Each alignment is mapped to a target sequence by first merging all repeated characters and then dropping the blanks. For example, the alignment $(m_1, m_1, \epsilon, m_2)$ maps to the sequence $(m_1, m_2)$. The probability of an alignment given the input sequence is the product of the probabilities of its labels at each timestep $t$: $p(a \mid x) = \prod_{t=1}^{T} p_t(a_t \mid x)$. The total probability of the target sequence $y$ given the input sequence $x$ is the sum of the probabilities of all possible alignments that map to $y$: $p(y \mid x) = \sum_{a \in \mathcal{A}(y)} p(a \mid x)$, where $\mathcal{A}(y)$ is the set of all alignments that collapse to $y$. The CTC loss is the negative log probability of the correct label sequence: $\mathcal{L}_{\mathrm{CTC}} = -\log p(y \mid x)$. This encourages the model to maximize the probability of the correct target sequence after considering all possible alignments. The CTC criterion computes these alignments using a dynamic programming algorithm26 that handles large sequences without causing an explosion in computational complexity.

For larger and more complex motif libraries, the convolutional encoder may need to be scaled modestly (e.g., by adding a convolutional block), together with larger GRU hidden sizes and less aggressive downsampling.
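As a concrete illustration of this objective, the toy example below evaluates the CTC criterion with PyTorch’s nn.CTCLoss on synthetic model outputs; the shapes (eight motif classes plus the blank, nine targets per read, as in one address plus eight payload motifs) mirror the setup described above, and all tensors are random stand-ins.

```python
import torch
import torch.nn as nn

# Synthetic stand-in for model output: T frames, batch of 2 reads, 9 classes
# (8 motifs + the blank, which nn.CTCLoss places at index 0 by default).
T, batch, num_classes = 30, 2, 9
log_probs = torch.randn(T, batch, num_classes).log_softmax(-1).requires_grad_()

# Target motif sequences use labels 1..8; nine motifs per read as in one block.
targets = torch.randint(1, num_classes, (batch, 9), dtype=torch.long)
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 9, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients are summed over all valid alignments
print(float(loss))
```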
Decoding
During inference, the final sequence is constructed using either a greedy decoder or a beam-search decoder, following standard techniques as described in24. The greedy decoder selects the token of maximum probability at each timestep and collapses the alignment by merging all repeated characters and dropping the blanks. The beam-search decoder (with beam width W) maintains a list of the W most probable sequences at each timestep; for the following timestep, it computes the probabilities of possible extensions of these sequences by collapsing and summing over repeated tokens or blanks that are terminated by non-blanks.
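Greedy decoding with the collapse rule fits in a few lines; the reference implementation below is illustrative rather than the optimized decoder used in the pipeline.

```python
def greedy_ctc_decode(frame_probs, blank=0):
    """Collapse per-timestep argmax predictions into a label sequence:
    merge consecutive repeats, then drop the blank token."""
    decoded, prev = [], None
    for probs in frame_probs:  # probs: per-timestep scores over all tokens
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token != prev and token != blank:
            decoded.append(token)
        prev = token
    return decoded

# The argmax path [1, 1, 0, 2] (0 = blank) collapses to [1, 2].
print(greedy_ctc_decode([[0.1, 0.8, 0.1], [0.1, 0.7, 0.2],
                         [0.9, 0.05, 0.05], [0.2, 0.1, 0.7]]))
```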
In the specific case of this dataset, each payload is separated by unique spacer motifs, which eliminates the need to model transitions between repeated characters. For this reason, the difference between greedy and beam search decoding becomes negligible. Therefore, greedy decoding is used during inference, while accounting for quality thresholds27, which are calculated as:
$$Q = -10 \log_{10}(1 - p) \qquad (1)$$

where $p$ denotes the model confidence of each predicted token. Quality thresholding is applied both at the read level (via the average quality score) and at the token level (by filtering out low-confidence predictions). In the Motif Search pipelines, basecalled reads with quality scores below 10 are excluded. Similarly, in the Motif Caller pipeline, reads with average quality scores below 11 are filtered out to match the primary dataset’s criteria. Additionally, within the Motif Caller, individual predicted motifs with confidence scores below 0.85 are treated as blanks and discarded.
Dataset
The empirical dataset used to train the Motif Caller model was obtained from an experimental run conducted by Helixworks (Sec. 2) who have developed a combinatorial approach to motif-based DNA storage10,21.
The primary dataset comprises blocks of information, with an average of nanopore reads per cycle. Each block consists of one address motif and eight payload (information-carrying) motifs which are separated by position-specific spacer sequences. All motifs are 25 bases in length.
Due to the combinatorial nature of the synthesis pipeline, there is an inherent uncertainty in determining which specific motifs are within a particular information block. This creates a clear distinction between the original information that is stored (i.e., the pre-synthesis ground truth), and the actual motifs present in the resulting DNA strands. As a result, generating accurate labels for training the Motif Caller becomes non-trivial, since the true motif sequence within each read is unknown.
To construct training labels, we combine predictions from the Approximate-matching (AM) Motif Search pipeline (Sec. 3) with the pre-synthesis ground truth. The spacer motifs are used to segregate the payload motifs into their cycle positions. The reads whose detected motifs match the largest proportion of the ground truth are selected; only this top fraction of reads from the AM search pipeline is retained for training. Since the labels are generated in this fashion, the performance of the Motif Caller is bounded by the best per-read detection performance of the Motif Search pipeline. Despite this influence, as we show in Fig. 4, the Motif Caller learns the spacer motifs that are not present within the labels, showing that the model is able to generalize beyond its training labels.
Fig. 4 [Images not available. See PDF.]
The Motif Caller successfully learns to identify spacer motifs that are not explicitly present in its training labels, indicating its ability to generalize beyond the provided supervision even during training.
To further evaluate the model’s generalization capabilities, we test it in two distinct scenarios. First, from the EIC04 sequencing run, four datasets are created with varying DNA concentrations per run, resulting in different sequencing coverage levels per information block. Dataset T1, with the greatest sequencing coverage, is used to generate labels and train the Motif Caller. The model is then tested on datasets T2, T3, and T4, which exhibit progressively lower sequencing coverage, to assess decoding accuracy under limited read conditions. Second, we evaluate the model on data from three independent sequencing runs – 01–04, 01–13, and 01–14. These runs follow the same combinatorial design and constraints; however, differences in sequencing conditions may affect the average quality of the resulting reads. For this evaluation, a subset comprising approximately 10% of the total information blocks is sampled from each run to compare motif inference performance across methods.
Results
To evaluate the Motif Caller pipeline, the performance of the Motif Caller is compared with the Motif Search methods on the dataset on which the model was trained (Sec. 5). Following this, the performance of the Motif Caller on data produced by sequencing runs from separate experiments is evaluated. During basecalling, all predictions with quality scores below 10 are filtered out. To maintain consistency with this filtering level, the Motif Caller excludes all predictions with quality scores below 11. The proportions of reads retained after filtering are presented in Table 3.
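For illustration, the snippet below applies both filtering levels using the Phred-style score of Eq. (1); the function names and the read representation are hypothetical, not taken from the released code.

```python
import math

def phred_quality(p: float) -> float:
    """Phred-style quality from model confidence p (Eq. 1): Q = -10 log10(1 - p)."""
    return -10.0 * math.log10(max(1.0 - p, 1e-10))  # clamp to avoid log10(0)

def filter_read(token_confidences, read_q_min=11.0, token_conf_min=0.85):
    """Read-level filter (mean quality below read_q_min rejects the read) and
    token-level filter (tokens below token_conf_min are treated as blanks).
    Returns the indices of retained tokens, or None if the read is rejected."""
    qualities = [phred_quality(p) for p in token_confidences]
    if sum(qualities) / len(qualities) < read_q_min:
        return None  # whole read rejected
    return [i for i, p in enumerate(token_confidences) if p >= token_conf_min]

# Example: a read with one low-confidence token keeps the other positions.
print(filter_read([0.99, 0.95, 0.60, 0.97]))  # -> [0, 1, 3]
```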
Read-level motif detection and error analysis
The percentage of motifs detected per read is the number of motifs detected across the payloads of a single read, divided by the total number of motifs in an information block (8 such motifs). The spacer motifs are used to segregate the detected motifs into their respective payloads, which are then compared to the pre-synthesis ground truth. Due to the combinatorial nature of the ground truth, there is uncertainty about which payload motifs are present within a particular read; we therefore compare the motifs detected per read against the pre-synthesis ground truth. Fig. 5a compares the percentage of motifs detected by the Motif Caller (red) and the Motif Search methods, Approximate Matching (blue) and Zero-error (orange), over the reads for all the unseen blocks of information in the empirical dataset. The Motif Caller consistently detects a higher percentage of motifs than the Motif Search methods.
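A sketch of this metric, assuming each read has been reduced to one predicted motif (or None) per payload slot and the combinatorial ground truth to a set of admissible motifs per slot (the names and representation are illustrative):

```python
def motifs_detected_per_read(predicted, truth_sets):
    """Fraction of a block's payload motifs recovered from a single read.
    predicted: list of 8 predicted motif IDs, None where nothing was detected;
    truth_sets: list of 8 sets of admissible motifs (pre-synthesis ground truth)."""
    hits = sum(p in ok for p, ok in zip(predicted, truth_sets) if p is not None)
    return hits / len(truth_sets)  # e.g. 6 of 8 slots correct -> 0.75
```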
Fig. 5 [Images not available. See PDF.]
Results on the primary dataset (EIC04). (a) Comparison of the percentage of motifs identified per read by the Motif Caller and the Motif Search pipelines on unseen information blocks. The Motif Caller consistently detects a higher proportion of motifs per read. (b) Correlation of the errors between the AM Search pipeline and the Motif Caller. (c) Information recovery as a function of the number of reads. The threshold for complete recovery is marked at 95%, illustrating the lower coverage requirement of the Motif Caller compared to baseline methods.
Since ZE search is dependent on exact matches of a whole motif at the base-level, it requires 25 bases to be identified in a row without any errors. Due to the error prone nature of basecalling, this is unlikely. In contrast, AM search looks for the best approximate prediction from the base-level inference, and thus performs significantly better than ZE search. Nevertheless, the Motif Caller consistently outperforms both methods in terms of motif detection.
In Table 1, the accuracies of the methods are presented for forward and reverse-oriented reads. All methods perform significantly better on forward reads as compared to reverse reads. Both AM search and the Motif Caller have a higher error rate as compared to ZE search, highlighting a trade-off between detection sensitivity and precision.
Table 1. Performance metrics for different motif detection methods (ZE search, AM search, and Motif Caller) on the EIC04 sequencing run, which served as the training dataset for the Motif Caller. Metrics include the average percentage of motifs detected per read, error rates, and the effective sequencing coverage–defined as the number of reads required to achieve 95% information recovery.
| Method | Orientation | Motifs detected (%) | Motif Error (%) | Coverage (reads) | Reads/second |
|---|---|---|---|---|---|
| ZE Search | Forward | 32.6 | 1.0 | | |
| ZE Search | Reverse | 26.5 | 1.0 | | |
| ZE Search | Combined | 30.9 | 1.0 | 51 | 400 |
| AM Search | Forward | 80.9 | 2.8 | | |
| AM Search | Reverse | 63.8 | 3.3 | | |
| AM Search | Combined | 75.8 | 3.0 | 23 | 38 |
| Motif Caller | Forward | 94.9 | 2.8 | | |
| Motif Caller | Reverse | 77.1 | 2.5 | | |
| Motif Caller | Combined | 91.1 | 2.6 | 14 | 56 |
In Fig. 5b, the correlation of the error rates per read between the Motif Caller and the AM search is shown. The observed correlation of r = 0.67 between the errors of the two methods (p < 1e-3) indicates that the labelling noise largely propagates into the model performance on the test set: the Motif Caller largely inherits its error rate from the noise in its training labels, which are generated by the AM search pipeline. While the Motif Caller’s error rate is higher than that of ZE search, the probabilistic outputs of the model can be passed directly to error-correcting codes as soft information, allowing downstream decoders to exploit uncertainty rather than discard low-confidence reads. This facilitates integration without significantly increasing error-correction costs. Nonetheless, systematic evaluation with specific coding schemes would be valuable future work to quantify the net effect on overall pipeline performance.
Information recovered per block with increasing reads
To recover all the information stored in a block, majority votes are taken over repeated reads until all the voted motifs in all the payloads of a block match the pre-synthesis ground truth. Since the last few motifs take an increasing number of reads to recover (due to particularly erroneous blocks), the methods approach near-complete recovery considerably faster than they converge to the full information. In practice, this tail would be handled by an appropriate error-correcting scheme21.
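A minimal sketch of this vote, under the same per-slot read representation as above:

```python
from collections import Counter

def recover_block(reads, num_slots=8):
    """Majority vote per payload slot over repeated reads of one block.
    reads: list of per-read slot predictions (motif ID or None per slot)."""
    recovered = []
    for slot in range(num_slots):
        votes = Counter(r[slot] for r in reads if r[slot] is not None)
        recovered.append(votes.most_common(1)[0][0] if votes else None)
    return recovered
```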
In Fig. 5c, we compare this information recovery averaged over all the blocks of information, plotted for the Motif Caller (red with circle markers), the AM Search (blue with triangle markers) and ZE Search (orange with square markers). We mark a 95% recovery threshold in order to investigate how many reads each method needs to reach it before converging to the full information.
The Motif Caller reaches the threshold recovery at 14 reads, AM search reaches the threshold at 23 reads and ZE search reaches the threshold at 51 reads. Even with a difference in the motifs detected per read, the overall sequencing coverage required can be reduced by 30%.
We present these metrics in Table 1, where we compare the motifs detected and the effective sequencing coverage between the Motif Caller and Motif Search. By training the Motif Caller on the set of motifs, even with imperfect labels, we can significantly reduce the effective sequencing coverage required over the blocks of information stored.
Comparison of inference speeds
In Table 1, we compare the number of reads processed per second across the motif detection pipelines. For ZE search and AM search, reads are first basecalled from the raw signal before the motif detection methods are applied. Since basecalling is highly optimized (400 reads per second on a single GPU), the bottleneck lies primarily within the motif-search methods.
For ZE search, this overhead is minimal because exact matching is computationally cheap and requires no approximate, character-by-character scoring. In contrast, AM search experiences a larger bottleneck due to the computational cost of approximate matching. The Motif Caller does not rely on such a step and therefore processes reads faster than AM search. Importantly, the Motif Caller has not yet been optimized for speed; with further optimization, its throughput could become comparable to that of modern basecallers.
Decoding percentage for diluted reads
For the EIC04 run, three additional sequencing runs were performed using progressively lower concentrations of DNA (Sec. 5), resulting in varying numbers of reads per information block. Table 2 presents the decoding accuracy (%) of the different motif detection methods–ZE Search, AM Search, and Motif Caller–on the diluted datasets T2, T3, and T4.
As sequencing coverage decreases, all methods show a decline in decoding accuracy; however, the Motif Caller consistently outperforms both ZE Search and AM Search across all datasets. For instance, on dataset T2 (1.07 amol/block, 18.0 reads/block), the Motif Caller achieves 96.9% accuracy compared to 89.9% for AM Search and 71.6% for ZE Search. Even at the lowest coverage in T4 (0.11 amol/block, 2.3 reads/block), the Motif Caller maintains a performance advantage, demonstrating its robustness under low-read conditions.
Generalization of performance to alternative runs
To evaluate the generalization ability of the model, it was tested on data from three independent sequencing runs: 01–04, 01–13, and 01–14. Reads with a basecalling quality score below 10 were filtered out in the Motif Search pipelines, while for consistency with the primary dataset, reads with quality scores below 11 were filtered out in the Motif Caller pipeline.
Figure 6 presents the number of motifs detected across these runs using the different methods. The Motif Caller (red) consistently outperforms both the AM search (blue) and ZE search (orange) across all sequencing runs. Notably, all methods exhibit reduced performance on the 01–04 run relative to the others. Table 3 summarizes the number of motifs detected, error rates, the effective sequencing coverage required, and the percentage of reads retained after applying the filtering criteria. Consistent with the results from the primary dataset, ZE search consistently achieves a significantly lower error rate compared to both AM search and the Motif Caller, which exhibit comparable error rates.
Fig. 6 [Images not available. See PDF.]
Percentage of motifs detected by the different motif detection methods (AM search, ZE search, and Motif Caller) across multiple sequencing runs. The Motif Caller consistently detects more motifs than the baseline methods, demonstrating its strong generalization. EIC04 was the primary dataset used for generating labels for training the Motif Caller.
Table 2. Performance of the different motif detection methods (ZE search, AM search, and Motif Caller) on the dataset with varying concentrations from the EIC04 sequencing run. The Motif Caller achieves higher decoding accuracy across all datasets.
| Method | Decoding accuracy (%) | Dataset | Concentration (amol/block) | Reads/block |
|---|---|---|---|---|
| ZE Search | 71.6 | T2 | 1.07 | 18.0 |
| AM Search | 89.9 | T2 | 1.07 | 18.0 |
| Motif Caller | 96.9 | T2 | 1.07 | 18.0 |
| ZE Search | 41.6 | T3 | 0.44 | 7.1 |
| AM Search | 64.9 | T3 | 0.44 | 7.1 |
| Motif Caller | 71.8 | T3 | 0.44 | 7.1 |
| ZE Search | 17.4 | T4 | 0.11 | 2.3 |
| AM Search | 36.0 | T4 | 0.11 | 2.3 |
| Motif Caller | 38.7 | T4 | 0.11 | 2.3 |
Table 3. Summary of the key performance metrics for the different motif detection methods (ZE search, AM search and Motif Caller) across multiple sequencing runs. Reported metrics include the average percentage of motifs detected per read, error rates, the effective sequencing coverage required for threshold data recovery (95%), and the proportion of reads retained after quality filtering. EIC04 refers to the original sequencing run used to generate labels for the Motif Caller. The Motif Caller achieves higher motif detection rates and lower coverage requirements across all runs, with the exception of 01–04, where fewer reads meet the model’s confidence threshold.
| Method | Sequencing run | Motifs detected (%) | Motif Error (%) | Coverage (reads) | % of total reads |
|---|---|---|---|---|---|
| ZE Search | EIC04 | 30.9 | 1.0 | 51 | 59.8 |
| ZE Search | 01–04 | 29.0 | 1.2 | 46 | 61.3 |
| ZE Search | 01–13 | 38.8 | 1.9 | 37 | 51.9 |
| ZE Search | 01–14 | 39.0 | 1.6 | 36 | 67.6 |
| AM Search | EIC04 | 75.8 | 3.0 | 23 | 45.6 |
| AM Search | 01–04 | 63.8 | 2.3 | 35 | 48.6 |
| AM Search | 01–13 | 78.9 | 3.5 | 28 | 41.6 |
| AM Search | 01–14 | 77.4 | 4.5 | 31 | 54.8 |
| Motif Caller | EIC04 | 91.1 | 2.6 | 14 | 71.1 |
| Motif Caller | 01–04 | 77.0 | 3.6 | 26 | 23.4 |
| Motif Caller | 01–13 | 86.0 | 3.6 | 17 | 68.3 |
| Motif Caller | 01–14 | 89.1 | 4.0 | 18 | 72.5 |
The Motif Caller achieves threshold information recovery using a significantly smaller fraction of the reads. For the 01–13 and 01–14 sequencing runs, a significantly larger proportion of reads meet the Motif Caller’s filtering criteria compared to the Motif Search pipelines, thereby further reducing the effective sequencing coverage required. In contrast, only 23.4% of reads from the 01–04 run pass the Motif Caller’s quality threshold (Table 3), which increases the effective sequencing coverage required for that dataset.
Quality proportions and finetuning
The Motif Caller demonstrates consistent performance on sequencing runs other than the one it was trained on, with the notable exception of the 01–04 run. For this particular dataset, both the motif recovery performance at the quality threshold and the proportion of reads that meet this threshold are substantially lower.
In Fig. 7, the quality proportions of reads from different sequencing runs are shown along with their mean accuracy. The mean accuracy at a particular quality threshold is retained even when the model does not generalize well to a run; instead, poor generalization shifts the quality proportions below the threshold. This suggests that the quality proportions can be used to assess the model’s generalisability to a particular run and to decide whether further training is needed.
Fig. 7 [Images not available. See PDF.]
Quality proportions for different sequencing runs. EIC04 was used as the training dataset for the Motif Caller. In the lower-performing run (01–04), the proportion of reads below the quality threshold increases, while the mean accuracy within each threshold remains consistent.
To investigate whether model performance could be improved through adaptation, the Motif Caller was fine-tuned for 15 epochs on a subset of 30,000 reads from the 01–04 sequencing run, spanning various blocks of information. As shown in Table 4, the fine-tuned model exhibits a marked improvement in performance at the quality threshold. However, the percentage of reads that fall within this threshold does not change. These results suggest that when only a small fraction of reads meet the model’s confidence criteria, retraining the model on a large amount of data from the specific sequencing run may be necessary to achieve comparable coverage to the original training conditions. Fine-tuning effectively enhances performance for high-confidence reads, without shifting the proportion of reads at the different quality thresholds. In these cases, the Motif Caller may still be used in conjunction with baseline methods such as AM search to reduce the effective coverage requirement, leveraging the model’s confidence scores to guide this integration.
Table 4. Comparison of quality threshold distributions and corresponding accuracies for the 01–04 sequencing run, evaluated between the original Motif Caller and a version finetuned on data from the same run. Finetuning improves motif detection and reduces error rates at given quality thresholds; however, it does not increase the proportion of reads that meet the confidence threshold criteria.
| Model | Quality Threshold | % of total reads | Motifs detected (%) | Motif Error (%) |
|---|---|---|---|---|
| Motif Caller | Q0 | 100.0 | 49.8 | 6.0 |
| Motif Caller | Q10 | 48.4 | 70.8 | 5.3 |
| Motif Caller | Q15 | 19.9 | 87.1 | 3.4 |
| Motif Caller | Q20 | 11.2 | 93.6 | 2.6 |
| Finetuned Caller | Q0 | 100.0 | 53.0 | 3.5 |
| Finetuned Caller | Q10 | 45.3 | 76.9 | 3.1 |
| Finetuned Caller | Q15 | 18.0 | 90.3 | 2.5 |
| Finetuned Caller | Q20 | 8.5 | 92.3 | 2.8 |
Discussion
In this paper, we introduced Motif Caller, an approach for sequence reconstruction for motif-based DNA storage directly from the raw current signals (squiggles). Even when trained on imperfect labels generated by the Approximate-matching motif search pipeline, the Motif Caller is able to detect more motifs per read than the current methods. As a result, it significantly reduces the amount of sequencing coverage needed to recover each block of stored information. This was demonstrated by consistently recovering more information than the other methods across all coverage levels.
The Motif Caller demonstrates strong generalization across sequencing runs, performing comparably on alternate datasets despite being trained on a single run. Model confidence scores provide a useful post hoc measure of how well the model generalizes to a particular sequencing run: for instance, fewer reads passed the quality threshold in the 01–04 run, indicating lower compatibility or higher noise in that dataset. These scores can guide decisions on whether retraining the model or repeating the sequencing run is necessary.
Nonetheless, the Motif Caller’s robustness is inherently constrained by the conditions it was trained on. Performance dropped on the 01–04 dataset, and fine-tuning on this set improved accuracy only for high-confidence reads without increasing the proportion of usable reads, suggesting that minor adjustments alone are insufficient. Broader training datasets or transfer learning could enhance generalizability, but effective adaptation generally requires significant data and some degree of retraining. Substantial shifts in chemistry, pore type, or instrumentation may still demand fully specialized models.
Performance could also benefit from synthesis runs with known motif payloads to generate higher-quality labels and strengthen the training signal. As shown in the paper, a large amount of the error of the Motif Caller is derived from the labels produced by the AM Search pipeline. Incorporating structured penalty terms that encode prior knowledge of valid motifs may further reduce prediction errors, even when the exact motifs are unknown. Together, these strategies offer a path to greater accuracy and generalizability across diverse sequencing conditions.
As demonstrated in this study, direct inference of motifs from squiggles can outperform traditional basecalling approaches when supported by high-quality training data. Beyond DNA storage, the Motif Caller holds potential for applications in biological sequence analysis where the problem of motif detection in natural genomic data shares structural similarities, as it involves identifying patterns from noisy, base-level signals28. However, natural genomic motifs exhibit greater variability than synthetic libraries, particularly due to sequence context and degeneracy. The Motif Caller architecture is capable of capturing signal traces over multiple bases, which should provide sufficient discriminatory power, but success depends strongly on the diversity and representativeness of the training data. In practice, robust application to genomics would likely require training on a large, heterogeneous dataset that reflects natural motif variability, as well as the inclusion of background sequences to reduce false positives. Potential applications include barcode and UMI demultiplexing – especially in highly multiplexed or single-cell nanopore sequencing – where direct squiggle-to-barcode/UMI inference could reduce assignment errors and recover low-quality reads as well as chimeric read splitting29, although performance in these contexts has not yet been benchmarked against Dorado.
The Motif Caller represents a significant advancement in the sequence reconstruction pipeline for motif-based DNA storage. Together with ongoing progress in error-correcting codes and DNA synthesis, this work contributes to making DNA-based storage systems more practical, scalable, and efficient.
Author contributions
P.A. developed the model and performed the analysis. N.P. provided the processed data used for training and testing the methods. P.A., N.P., and T.H. contributed to the final version of the manuscript. T.H. supervised the project. All authors reviewed the manuscript.
Funding
This work was supported by the Horizon Europe programme under grant IDs 101115389 and 101115317.
Data availability
The data supporting the findings of this study are available in the Open Science Framework repository at https://osf.io/pcdtj/.
Code availability
The code used to develop, train, and evaluate Motif Caller is publicly available at: https://github.com/Parvfect/Motif-Calling.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Borovica-Gajić, R., Appuswamy, R. & Ailamaki, A. Cheap data analytics using cold storage devices. Proc. VLDB Endow. 9, 1029–1040 (2016). https://doi.org/10.14778/2994509.2994521
2. Intel Corporation. Cold storage in the cloud: Trends, challenges, and solutions. White paper. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/cold-storage-atom-xeon-paper.pdf. Accessed 2024-12-20.
3. Balakrishnan, S. et al. Pelican: A building block for exascale cold data storage. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 351–365 (2014).
4. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012). https://doi.org/10.1126/science.1226355
5. Fan, Q. et al. De novo non-canonical nanopore basecalling enables private communication using heavily-modified DNA data at single-molecule level. Nat. Commun. 16, 4099 (2025). https://doi.org/10.1038/s41467-025-59357-2
6. Zhang, C. et al. Parallel molecular data storage by printing epigenetic bits on DNA. Nature 634, 824–832 (2024). https://doi.org/10.1038/s41586-024-08040-5
7. Erlich, Y. & Zielinski, D. DNA Fountain enables a robust and efficient storage architecture. Science 355, 950–954 (2017). https://doi.org/10.1126/science.aaj2038
8. Ceze, L., Nivala, J. & Strauss, K. Molecular digital data storage using DNA. Nat. Rev. Genet. 20, 456–466 (2019). https://doi.org/10.1038/s41576-019-0125-3
9. Roquet, N. et al. DNA-based data storage via combinatorial assembly. bioRxiv (2021).
10. Yan, Y., Pinnamaneni, N., Chalapati, S., Crosbie, C. & Appuswamy, R. Scaling logical density of DNA storage with enzymatically-ligated composite motifs. Sci. Rep. 13, 15978 (2023).
11. Preuss, I., Rosenberg, M., Yakhini, Z. & Anavy, L. Efficient DNA-based data storage using shortmer combinatorial encoding. Sci. Rep. 14, 7731 (2024).
12. Mardis, E. R. Next-generation sequencing platforms. Annu. Rev. Anal. Chem. 6, 287–303 (2013). https://doi.org/10.1146/annurev-anchem-062012-092628
13. Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 39, 1348–1365 (2021). https://doi.org/10.1038/s41587-021-01108-x
14. Mikheyev, A. S. & Tin, M. M. A first look at the Oxford Nanopore MinION sequencer. Mol. Ecol. Resour. 14, 1097–1102 (2014). https://doi.org/10.1111/1755-0998.12324
15. Deamer, D., Akeson, M. & Branton, D. Three decades of nanopore sequencing. Nat. Biotechnol. 34, 518–524 (2016). https://doi.org/10.1038/nbt.3423
16. Wick, R. R., Judd, L. M. & Holt, K. E. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 1–10 (2019). https://doi.org/10.1186/s13059-019-1727-y
17. Wang, Z. et al. Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts. Nat. Commun. 16, 679 (2025). https://doi.org/10.1038/s41467-025-55974-z
18. Pinnamaneni, N. et al. Prototype of enzymatic DNA synthesis approach. Tech. Rep. Deliverable 4.1, OLIGOARCHIVE Project, Grant agreement ID: 863320, European Commission (2020). https://doi.org/10.3030/863320
19. Marinelli, E., Yan, Y., Magnone, V., Barbry, P. & Appuswamy, R. CMOSS: A reliable, motif-based columnar molecular storage system. In Proceedings of the 2024 ACM SIGMOD/PODS Conference on Management of Data, 1–26 (2024).
20. Wang, C., Wei, D., Wei, Z. et al. Cost-effective DNA storage system with DNA movable type. Adv. Sci. 2411354 (2024).
21. Sokolovskii, R., Agarwal, P., Croquevielle, L. A., Zhou, Z. & Heinis, T. Coding over coupon collector channels for combinatorial motif-based DNA storage. IEEE Trans. Commun. (2024).
22. Yan, Y. Motif search. https://gitlab.eurecom.fr/yan1/motif-search (2022).
23. Suzuki, H. & Kasahara, M. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics 19, 33–47 (2018). https://doi.org/10.1186/s12859-018-2014-8
24. Teng, H. et al. Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning. GigaScience 7 (2018).
25. Schuster, M. & Paliwal, K. K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681 (1997). https://doi.org/10.1109/78.650093
26. Graves, A., Fernández, S., Gomez, F. & Schmidhuber, J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, 369–376 (2006).
27. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194 (1998). https://doi.org/10.1101/gr.8.3.186
28. Das, M. K. & Dai, H.-K. A survey of DNA motif finding algorithms. BMC Bioinformatics 8, 1–13 (2007). https://doi.org/10.1186/1471-2105-8-S7-S21
29. Smith, M. A. et al. Molecular barcoding of native RNAs using nanopore sequencing and deep learning. Genome Res. 30, 1345–1353 (2020). https://doi.org/10.1101/gr.260836.120
© The Author(s) 2025. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).