State Key Laboratory of Digital Medical Engineering, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China

ARTICLE INFO

Keywords: DNA storage; Synthetic biology; Median filter; Image quality; Codebook

ABSTRACT

Limited by uncertain base error rates in DNA storage, additional correction measures may introduce redundancy or even amplify errors, resulting in poorly reconstructed images. DNA-CTMF is proposed to reconstruct high-quality images at high error rates and indel ratios. Firstly, the Pixel-Base codebook and a chaotic system ensure that DNA sequences meet biological constraints. Then, the codebook adjusts base groups offset by indels back to their original positions. Finally, median filtering removes the salt-and-pepper noise caused by base errors. Simulated experiments show that images reconstructed by DNA-CTMF exhibit high quality with minimal variation across different error compositions. Even at a 5 % error rate with indels accounting for 2/3 of errors, DNA-CTMF reconstructs high-quality images with PSNR of approximately 23 and MS-SSIM exceeding 0.9. Tests on 4000 images demonstrate DNA-CTMF's superiority across multiple images. Wet experiments prove that DNA-CTMF can reconstruct images close to the original at low error rates, consistent with the simulated results. Unlike studies that adopt error correction codes, DNA-CTMF addresses base errors through image processing technology, providing a new interdisciplinary solution and perspective for storing images in DNA.
1. Introduction
The total amount of data and information is increasing at an exponential rate as a result of the rapid expansion of modern information technology. It has been projected that global data will reach 1.75 × 10¹⁴ GB by 2025 [1]. DNA, the biomolecule naturally selected to store genetic information [2], stands out as the most attractive medium for in vitro data storage [3-6]. However, due to base errors in DNA synthesis and sequencing, DNA storage may result in information loss without error correction measures. Error rates differ across DNA storage channels: DNA column synthesis and Sanger sequencing can achieve close to 100 % accuracy, while microarray-chip large-scale synthesis and Illumina sequencing have relatively higher error rates [7,8]. Because the error rates and error locations in DNA storage are uncertain, simply adding error correction codes or physical redundancy to ensure data integrity is unstable and wasteful.
In contrast to text data, which must be stored precisely, base errors appear as noise in images, which can be removed by image processing techniques from computer science [9]. Images are therefore naturally compatible with DNA storage, and storing images in DNA has great research potential. To store images in DNA and improve the quality of reconstructed images, researchers have taken two types of approaches.
The first is to treat the image as an ordinary file or a compressed image, such as PNG or JPEG format, and store it in DNA sequences. Pre-compression is necessary because it improves information density and reduces storage cost. In 2022, Song et al. encoded ten digital pictures of Dunhuang murals, zipped into a 6.8 MB file, into 210,000 DNA sequences of 200 nt with 7.8 % strand redundancy by constructing a de Bruijn graph with a greedy path, and guaranteed sufficient reliable data copies and perfect data recovery within a total of 120 PCR cycles [10]. In 2023, Zheng et al. used a quantized ResNet VAE model and LC codes for image compression and error correction. At 0.1 %-2 % error rates, the SSIM of reconstructed images was stable at 0.917 [11]; however, the loss of image information comes from compression rather than base errors. In 2025, Zhou et al. encoded four images of Southeast University using DNA fountain codes with RS codes and used Mf-PCR to achieve perfect image recovery [12]. These methods offer high information density, but they may lead to total data loss at high error rates.
The second is to encode the pixel matrix of the image into DNA sequences. In 2021 and 2022, Li et al. proposed two methods, IMG-DNA and HL-DNA, focused on lossy recovery of images from DNA, encoding pixels into DNA sequences and minimizing the effect of indels on images by using rotating codes and a barrier strategy [13,14]. The SSIM of the reconstructed image reaches 0.55 at a 0.5 % error rate. In 2024, Liu et al. approximated each pixel of an RGB image as triplet codes in a microfluidic pool and read information through melting temperature during decoding to achieve lossy image recovery [15]. Wang et al. employed DNA-base128 to convert image pixels into DNA sequences and performed internal error correction through thresholding and drift comparison [16]. At a 0.5 % sparse error rate, the SSIM of the reconstructed image reached 0.8. Cao et al. adopted a parity encoding and local mean iteration method named PELMI to achieve robust DNA storage of images [17]. The SSIM of the reconstructed image reaches 0.653 at a 1 % error rate. In 2024, to better alleviate the impact of base errors on images and improve storage density, Xu et al. used transposition and random interleaving to convert base errors into salt-and-pepper noise that can be effectively removed by median filtering, greatly improving the quality of reconstructed images [18,19]. Experiments showed that the reconstructed image was close to the original at an error rate of 0.5 %, with SSIM reaching 0.889. Su et al. used deep learning to extract image features into a feature matrix [20], but changes to the feature matrix in DNA storage may introduce noise into the reconstructed image. Yan et al. used unordered index-free oligonucleotides to encode the binary information of image pixels [21], which helps improve storage density, but each image inherits the noise caused by base errors in the previous image. Zhang et al. developed parallel molecular data storage by printing epigenetic bits on DNA, with each methylation site corresponding to one pixel in a tiger-rubbing bitmap image; they also stored a compressed panda image with BCH codes for error correction, and the image was restored perfectly [22]. However, the above methods have high redundancy and poor reconstructed image quality at high indel rates.
Mapping with a codebook can stably ensure that DNA sequences meet biological constraints and provides a certain error correction capability [23-25]. Inspired by the above research, in this paper we propose the DNA-CTMF method for storing images in DNA and improving the quality of reconstructed images at the decode stage. The key idea of DNA-CTMF is to adjust base groups offset by indels back to their original positions. Given that the majority of base groups are accurate, the pixel errors caused by incorrect base groups are discretely distributed in the reconstructed image as salt-and-pepper noise, which can be easily and effectively removed by median filtering, greatly improving the quality of the reconstructed image. As shown in Fig. 1, in the encode stage, DNA-CTMF converts image pixels into DNA sequences using a chaotic system and the Pixel-Base codebook to ensure that sequences satisfy the GC content and homopolymer constraints. In the self-correct stage, insertion and deletion errors are converted into substitutions according to the Pixel-Base codebook. Finally, median filtering is applied to the reconstructed image to remove the noise. The input and output details of DNA-CTMF are shown in Fig. S1. The remainder of this paper is organized as follows: Section 2 introduces the DNA-CTMF method; Section 3 presents the experimental results of DNA-CTMF and other DNA storage methods; Section 4 discusses related aspects of DNA-CTMF; finally, Section 5 concludes the paper.
2. Methods
2.1. Constructing Pixel-Base codebook
Inspired by the codebook in [24] and ASCII-DNA [25], we constructed the Pixel-Base codebook to map the 256 pixel values to 256 5-base groups; the details of the Pixel-Base codebook are shown in Table S1. The Pixel-Base codebook consists of 5-base groups that all satisfy the GC content and homopolymer constraints, and the concatenation of any two groups does not produce a homopolymer longer than 3.
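To make the construction concrete, the following Python sketch enumerates candidate 5-base groups under the stated constraints and greedily collects mutually compatible ones. It illustrates only the constraint checks; the paper's actual selection procedure and ordering may differ (the real codebook is given in Table S1).

```python
from itertools import product

def max_run(s):
    # Length of the longest homopolymer run in s.
    best = run = 1
    for a, b in zip(s, s[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

# Candidate 5-base groups: GC content of 40 % or 60 % (2 or 3 of G/C)
# and no homopolymer longer than 2 within the group.
candidates = []
for p in product("ACGT", repeat=5):
    group = "".join(p)
    if sum(c in "GC" for c in group) in (2, 3) and max_run(group) <= 2:
        candidates.append(group)

# Greedily keep groups whose concatenation with themselves and with
# every already-selected group (in either order) never creates a
# homopolymer longer than 3.
selected = []
for c in candidates:
    if max_run(c + c) <= 3 and all(
        max_run(c + s) <= 3 and max_run(s + c) <= 3 for s in selected
    ):
        selected.append(c)
    if len(selected) == 256:
        break

# Pixel value -> 5-base group (illustrative ordering, not Table S1).
codebook = {pixel: code for pixel, code in enumerate(selected)}
print(len(codebook))  # expect 256 under these constraints
```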
2.2. Encoding method
Compared with coding schemes such as HL-DNA, codebook-based approaches are more susceptible to generating repeats and palindromes in longer DNA sequences, which may hinder DNA synthesis, PCR amplification and DNA sequencing. In particular, for image files, when pixels are mapped directly to DNA sequences via the codebook, a pattern similar to the BMP format appears, which may cause security problems [26].
To solve these potential problems, we first set initial values r and x0 and generate a set of chaotic sequences according to Eq. (1); more details are presented in Fig. S2.
...
Each pixel is processed based on Eqs. (2) and (3), which scrambles the pixel arrangement, reducing the probability of consecutive identical pixels and shortening runs of identical 5-base groups. The related results are shown in Fig. S3 and Fig. S4.
...
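The exact forms of Eqs. (1)-(3) are given above and in the supplementary material; purely as a hedged illustration, the sketch below assumes a logistic map standing in for Eq. (1) and an XOR-style scrambling standing in for Eqs. (2)-(3), which are common choices for chaotic pixel scrambling. The parameters r and x0 are the initial values mentioned above; the paper's actual chaotic system may differ.

```python
import numpy as np

def chaotic_keystream(r, x0, n):
    # Iterate a logistic map (an assumed stand-in for Eq. (1)).
    xs = np.empty(n)
    x = x0
    for i in range(n):
        x = r * x * (1 - x)
        xs[i] = x
    # Quantize chaotic values in (0, 1) to bytes (assumed Eq. (2)).
    return (xs * 256).astype(np.uint8)

def scramble(pixels, r=3.99, x0=0.41):
    # XOR each pixel with the keystream (assumed Eq. (3)); XOR is
    # self-inverse, so the decoder repeats the same operation.
    ks = chaotic_keystream(r, x0, pixels.size)
    return (pixels.ravel().astype(np.uint8) ^ ks).reshape(pixels.shape)
```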
According to the Pixel-Base codebook (Table S1), every processed pixel is mapped to a 5-base group. In this method, every 30 pixels are encoded into one DNA sequence of 150 nt for the subsequent experiments; an example of the DNA-CTMF encode method is shown in Fig. S5.
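Putting the two steps together, a minimal encoder sketch (function names are ours) maps every 30 scrambled pixels to one 150 nt sequence:

```python
def encode_image(pixels, codebook):
    # pixels: flat list/array of scrambled 8-bit pixel values.
    # Each pixel maps to a 5-base group; every 30 pixels form one
    # 150 nt DNA sequence (a short final chunk would be padded
    # upstream in a full implementation).
    seqs = []
    for i in range(0, len(pixels), 30):
        chunk = pixels[i:i + 30]
        seqs.append("".join(codebook[int(p)] for p in chunk))
    return seqs
```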
2.3. Base errors in DNA storage
DNA storage mainly involves four chemical technologies: DNA synthesis, PCR amplification, media storage and DNA sequencing. Due to differences in methods, reagents and other factors, error rates differ across DNA storage channels [27-29]. Although several tools have been developed to simulate base errors in DNA storage [30,31], they are only applicable in a few cases and not compatible with the majority. Random errors, especially those with a high percentage of indels, are difficult and complex to correct, and they reveal the robustness of methods in extreme cases. Therefore, in this experiment, we introduced base errors with different error rates and different error compositions into the encoded sequences to demonstrate the validity and robustness of DNA-CTMF; more details are presented in Fig. S6.
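As a sketch of the error model used here (our own minimal simulator, not the tools of [30,31]), base errors can be injected at a chosen rate and substitution:insertion:deletion ratio:

```python
import random

def introduce_errors(seq, rate, ratio=(1, 1, 1)):
    # Inject substitutions, insertions and deletions at an overall
    # per-base `rate`; `ratio` splits it among the three error types.
    total = sum(ratio)
    p_sub, p_ins, p_del = (rate * x / total for x in ratio)
    out = []
    for base in seq:
        r = random.random()
        if r < p_del:
            continue                                # deletion
        if r < p_del + p_ins:
            out.append(random.choice("ACGT"))       # insertion
            out.append(base)
        elif r < p_del + p_ins + p_sub:
            out.append(random.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)                        # no error
    return "".join(out)
```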
2.4. Self-correcting and decoding method
Base errors refer to nucleotide substitutions, insertions and deletions during DNA synthesis, PCR, storage, and sequencing [32,33]. In contrast to substitutions, insertions and deletions shift all subsequent bases, which significantly influences image reconstruction. Therefore, the key is to locate where a base error occurred and adjust the misplaced base groups. Since the legal codes in Table S1 constitute 25 % of all 5-base groups (256 out of 1024), a base shift caused by an indel may happen to produce a legal code, but it is unlikely that the next 5-base group also exists in the Pixel-Base codebook, and even less likely that the one after that does. We assume that runs of consecutive legal 5-base groups exceeding a threshold are likely to be correct intervals, while the remainder are deemed to contain base errors. It is important to identify the approximate location where a base error occurred and restore the impacted 5-base groups to their original positions.
The general error correction process of DNA-CTMF is shown in Fig. 2. Firstly, a sequence is divided into right and wrong intervals based on the threshold. For the wrong intervals, we search for the operation with the minimum number of edits that adjusts every wrong interval to a multiple of 5 and the total sequence length to 150 nt; if no feasible operation exists, the sequence is discarded, as shown in Fig. S7. For each wrong interval that needs insertion or deletion of bases, bases are randomly inserted or deleted according to the obtained operation; after traversing all possibilities (pruning those that violate the GC content and homopolymer constraints), the codebook combination with the smallest Hamming distance to the interval is selected as the replacement. For each wrong interval that needs substitution, the codebook combination with the smallest Hamming distance is likewise selected as the replacement. To save time, if the processing of a wrong interval exceeds a timeout (set to 0.1 s in this experiment), the interval is replaced with the corresponding number of "NNNNN" placeholders. Finally, the right and corrected intervals are merged sequentially to obtain the corrected sequence. After merging, the base groups (pixels) of the right intervals have a high probability of being in their original positions; even if the corrected intervals are not identical to the original base groups, they appear as relatively isolated changed pixels (noise) in the image. Examples are shown in Fig. S8.
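The interval segmentation at the heart of the self-correct stage can be sketched as follows. This is a simplified illustration: it scans in fixed 5-base steps and flags runs of legal groups, while the full method in Fig. 2 additionally searches edit operations for the wrong intervals.

```python
def segment_intervals(seq, codes, threshold=3):
    # codes: set of legal 5-base groups from the Pixel-Base codebook.
    # Returns (kind, start, end) tuples; runs of >= `threshold`
    # consecutive legal groups are "right", the rest are "wrong".
    # Trailing bases shorter than one group are left to the edit search.
    flags = [seq[i:i + 5] in codes for i in range(0, len(seq) - 4, 5)]
    intervals, j = [], 0
    while j < len(flags):
        k = j
        while k < len(flags) and flags[k] == flags[j]:
            k += 1
        kind = "right" if flags[j] and (k - j) >= threshold else "wrong"
        intervals.append((kind, j * 5, k * 5))
        j = k
    return intervals
```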
When decoding, the DNA sequence is divided into groups of 5 bases. Each base group is first converted to its decoded_pixel according to the codebook (Table S1), and then converted to the corresponding pixel value according to Eq. (4). Owing to the correlation between image pixels, each base group "NNNNN" is replaced by the median of the 30 pixels in its sequence. An example of the DNA-CTMF decode method is shown in Fig. S9.
...
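A corresponding decode sketch is shown below (Eq. (4) inverts the chaotic scrambling, represented here by the same XOR keystream assumed earlier; `inv_codebook`, a name we introduce, maps 5-base groups back to decoded pixels):

```python
from statistics import median

def decode_sequence(seq, inv_codebook):
    # Split a corrected 150 nt sequence into 5-base groups and map
    # them back to decoded pixels; "NNNNN" placeholders are filled
    # with the median of the sequence's known pixels, exploiting
    # the correlation between neighboring image pixels.
    groups = [seq[i:i + 5] for i in range(0, len(seq), 5)]
    pixels = [inv_codebook.get(g) for g in groups]
    known = [p for p in pixels if p is not None]
    fill = int(median(known)) if known else 0
    return [fill if p is None else p for p in pixels]
```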
2.5. Median filtering
Noise, defined as random changes to original pixels, can significantly reduce image quality. A variety of filtering algorithms have been developed to address specific noise characteristics and improve the quality of acquired images [34]. Mean filtering is adopted for Gaussian noise reduction, while median filtering is specifically suited to removing salt-and-pepper noise [35,36]. In a DNA storage channel, the proportion of erroneous bases is much smaller than that of error-free bases [37,38]. After DNA-CTMF error correction, base errors are discretized and converted into salt-and-pepper noise by restoring the error-free bases affected by indels to their original positions. Therefore, median filtering is chosen to improve the quality of reconstructed images in the DNA-CTMF method.
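In implementation terms, a 3×3 median filter such as SciPy's `median_filter` suffices (a sketch; the kernel size follows the 3×3 neighborhood discussed in Section 4.1):

```python
import numpy as np
from scipy.ndimage import median_filter

def denoise(image):
    # Replace each pixel with the median of its 3x3 neighborhood;
    # isolated salt-and-pepper pixels are removed while edges are
    # preserved better than with mean filtering.
    return median_filter(np.asarray(image), size=3)
```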
2.6. Reconstructed image quality indexes
2.6.1. MSE and PSNR

Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR) are two widely used indexes for evaluating the quality of reconstructed images. MSE measures the average squared difference between corresponding pixels of the original and reconstructed images. PSNR, which is derived from MSE, represents the ratio between the maximum possible power of a signal and the power of the corrupting noise. Lower MSE and higher PSNR indicate better image quality. MSE and PSNR focus on pixel-level differences and fail to capture perceptual differences in human vision [39].
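For an original image $I$ and reconstructed image $\hat{I}$ of size $H \times W$ with 8-bit pixels, the standard definitions are:

$$\mathrm{MSE}=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(I(i,j)-\hat{I}(i,j)\bigr)^{2},\qquad \mathrm{PSNR}=10\log_{10}\frac{255^{2}}{\mathrm{MSE}}$$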
2.6.2. SSIM and MS-SSIM
Structural Similarity Index (SSIM) and Multi-Scale Structural Similarity Index (MS-SSIM) are advanced indexes for evaluating the quality of reconstructed images. Compared with MSE and PSNR, SSIM and MS-SSIM results are more consistent with human visual perception. As an extension of SSIM, MS-SSIM adds multi-scale analysis, which better simulates how the human visual system perceives images at different levels of detail. The higher the SSIM and MS-SSIM, the higher the quality of the reconstructed image [40].
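These indexes are available in standard libraries; a minimal sketch using scikit-image follows (MS-SSIM is not in scikit-image and typically requires a separate package such as pytorch-msssim):

```python
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

def quality_indexes(original, reconstructed):
    # Both inputs: uint8 arrays of the same shape (grayscale). For
    # RGB images, the channels are evaluated separately and averaged,
    # as done in Section 3.3.2.
    mse = mean_squared_error(original, reconstructed)
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=255)
    ssim = structural_similarity(original, reconstructed, data_range=255)
    return mse, psnr, ssim
```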
3. Results
3.1. Visualization results of reconstructed images at different error rates
To demonstrate the capability of DNA-CTMF in reconstructing high-quality images, we simulated and visualized the effect of base errors on the reconstructed peppers image at 1 %-5 % error rates, as shown in Fig. 3. The top row presents the reconstructed images without median filtering. As the error rate rises, the noise in the image gradually increases and the quality decreases rapidly, but the noise appears as salt-and-pepper noise and the general outline of the image remains visible, indicating that most pixels (base groups) are adjusted to their original positions after error correction. The noise in the reconstructed image is discretely distributed as salt-and-pepper noise, which is easily and effectively removed by median filtering, as shown in the bottom row of Fig. 3. The quality of the reconstructed images after median filtering, and the corresponding indexes, are much better than those without filtering; the quality of the reconstructed image at a high error rate (5 %) even exceeds that of the undenoised image at a low error rate (1 %). At a 1 % error rate, the SSIM and PSNR of the reconstructed image exceed 0.9 and 30, respectively, indicating that it is extremely close to the original image. Even at a 5 % error rate, MS-SSIM is approximately 0.93 and PSNR exceeds 24, which means that the vast majority of image details are retained despite a little residual noise. We also tested other denoising methods on the reconstructed image, such as least squares filtering, bilateral filtering, Gaussian filtering and wavelet filtering, as shown in Fig. S10 and Fig. S11. Compared with these filtering methods, median filtering most effectively removes the salt-and-pepper noise caused by base errors in the DNA-CTMF method.
3.2. The quality of reconstructed images at different error composition ratios
To explore the differences between images reconstructed by DNA-CTMF at different error composition ratios, we set the substitution:insertion:deletion ratio to 8:1:1, 7:1.5:1.5, 6:2:2, 5:2.5:2.5, 4:3:3 and 1:1:1, respectively, and calculated the MSE, PSNR, SSIM and MS-SSIM of the reconstructed images at 1 %-5 % error rates. Each combination was repeated 1000 times, and the mean and standard deviation were calculated to ensure the reliability and accuracy of the results; the image quality indexes are shown in Fig. 4. At a 1 % error rate, there is little difference in reconstructed image quality across error composition ratios, with all PSNR above 30, SSIM above 0.93, and MS-SSIM above 0.98, indicating that the reconstructed images are close to the original. As the error rate increases, the quality of the reconstructed images decreases for all error composition ratios. In particular, the quality of images reconstructed under high insertion and deletion ratios decreases faster than under high substitution ratios. At a 5 % error rate, there is a measurable difference in image quality across error composition ratios: the quality of reconstructed images is worse at higher insertion and deletion ratios, implying that insertions and deletions affect images more than substitutions. Although a difference exists at the 5 % error rate, it is not significant; the difference in PSNR is approximately 0.1 and that in MS-SSIM is less than 0.002, indicating that the images remain highly similar. This demonstrates that DNA-CTMF is robust against high indel ratios and can reconstruct high-quality images even under high error and indel rates.
3.3. Comparison with other representative works
To demonstrate that DNA-CTMF can significantly improve the quality of reconstructed images while satisfying the relevant biological constraints, we compared it with four representative DNA storage methods: HL-DNA [13], DNA-base128 [16], PELMI [17], and DP-ID [18], all of which store images in DNA with lossy recovery.
3.3.1. General performance analysis
The general performance of these representative DNA storage methods, including the GC content and homopolymer constraints, is shown in Table 1. HL-DNA, DNA-base128, PELMI, and DNA-CTMF satisfy both the GC content and homopolymer constraints; DP-ID employs dynamic programming compression and a direct bits-to-base mapping strategy to increase the coding density to more than 2 bits/nt, but it fails to meet the GC content and homopolymer constraints, which may lead to higher base error rates in theory [41]. HL-DNA converts data to DNA sequences lossily during the encoding phase to make them satisfy biological constraints and obtain high coding density, but this irreversible encoding results in unpredictable data loss before synthesis. DNA-base128 converts image pixels into DNA sequences through statistical frequency and a constrained encoding set; in the decode stage, it performs internal error correction by threshold setting and drift comparison and reconstructs images with information loss. However, DNA-base128 handles sparse errors and may not work well for continuous and uncertain errors. PELMI adopts parity codes and local mean iteration to achieve robust DNA storage: parity encoding converts pixels (binary sequences) into DNA sequences satisfying the common biological constraints, with identifier bits and RS codes as redundancy to label and correct errors; when recovering images, local mean iteration is performed for image enhancement. However, although local mean iteration can improve image quality, it cannot truly remove image noise, resulting in poorly reconstructed images. DP-ID requires RS codes to protect the signal part, so repeated sequencing and multiple sequence alignment are necessary, which may further increase the cost.
3.3.2. Grayscale and color image reconstruction
A grayscale image is a single-channel image containing only brightness information, with pixel values ranging from 0 to 255, where 0 denotes black, 255 denotes white, and intermediate values denote shades of gray. We simulated and visualized the reconstructed grayscale images for the different DNA storage methods at 1 % and 5 % error rates (substitution:insertion:deletion of 8:1:1), as shown in Fig. 5A. At a 1 % error rate, the images reconstructed by HL-DNA, DNA-base128, and PELMI show obvious noise due to base errors. DP-ID and DNA-CTMF adopt random interleaving and a codebook, respectively, to convert base errors into salt-and-pepper noise, which is removed by median filtering, so that the noise is essentially eliminated; their PSNR and MS-SSIM exceed 32 and 0.99, indicating that the reconstructed images are very close to the original. At a 5 % error rate, for HL-DNA, DNA-base128, and PELMI, although the general outline of the image is visible, the reconstructed images show a substantial loss of detail and a significant decrease in quality indexes compared with those at a 1 % error rate. For DP-ID and DNA-CTMF, although a small amount of noise is still present, the quality of the reconstructed images at 1 % and 5 % error rates is relatively close, most details are preserved, and the image quality is significantly better than the others.
Color images in the RGB model are composed of three color channels: red (R), green (G), and blue (B), each carrying brightness information. Compared with grayscale images, color images contain color information and offer more comprehensive visual data. Similar to the grayscale case, we simulated the effect of base errors on color image quality for the different DNA storage methods. When evaluating the quality of an RGB image, the three color channels are calculated separately and averaged. The visualization results of reconstructed color images at 1 % and 5 % error rates for the different DNA storage methods are shown in Fig. 5B. For HL-DNA, lossy encoding is applied during the coding stage to ensure sequence compliance with biological constraints, which loses some information before storage; compared with the grayscale image, the reconstructed color image is worse because of the additional color information. Similar to the grayscale results, noise is randomly and continuously distributed in the images from DNA-base128 and PELMI, while DP-ID and DNA-CTMF convert base errors into salt-and-pepper noise that is easily removed by median filtering, so their image quality is much better than the other methods.
To verify that DNA-CTMF can still reconstruct high-quality images at high indel ratios compared with other DNA storage methods, we simulated the reconstruction of grayscale images by the different methods at different error rates with indels accounting for 2/3 of errors (substitution:insertion:deletion of 1:1:1); each simulation was repeated 1000 times, and the quality indexes of the reconstructed grayscale images are shown in Fig. 6. At a 1 % error rate, the quality of the images reconstructed by DP-ID and DNA-CTMF is significantly better than the other methods. As the error rate increases, the image quality of DP-ID decreases significantly, while the decline of DNA-CTMF is relatively slower. This is because median filtering handles isolated noise points well, but at high indel and error rates, the salt-and-pepper noise produced by DP-ID interleaving spreads throughout the entire image, and median filtering cannot distinguish noise points from real image pixels, resulting in a poor denoising effect. The error correction adopted by DNA-CTMF aims to restore bases displaced by insertions and deletions to their original positions as far as possible, so base errors can still be converted into discrete salt-and-pepper noise at high indel ratios. Median filtering thus exhibits superior effectiveness on images reconstructed by DNA-CTMF, as shown in Fig. S12 and S13. The reconstruction of color images by the different DNA storage methods at different error rates with indels accounting for 2/3 is shown in Figs. S14-S16.
Similar to DP-ID, DNA-CTMF also converts base errors into salt-and-pepper noise and then removes it with median filtering. The difference is that DP-ID uses direct mapping to increase coding density but does not control the GC content and homopolymer constraints, which may lead to higher base error rates. In addition, the direct mapping of DP-ID leaves no correlation between bases, so the sequence length can only be adjusted coarsely; at high error rates and high indel ratios, noise spreads throughout the image, and median filtering can improve image quality only to a limited extent. Although the Pixel-Base codebook adopted by DNA-CTMF reduces coding density, it provides self-error-correction ability and control over biological constraints. At the same error rate, the images reconstructed by DNA-CTMF are significantly better than those by DP-ID on all quality indexes.
3.3.3. Multiple images reconstruction

To provide a more systematic assessment of DNA-CTMF's performance, we expanded the testing to other images. The benchmark image dataset, the Waterloo Exploration Database [42] from the computer vision field, totaling 4000 images (3600 BMP and 400 PNG), was selected; several examples are shown in Fig. S17. We used the above DNA storage methods to conduct simulation tests at a 5 % error rate (substitution:insertion:deletion of 1:1:1), and the results are shown in Fig. 7. The impact of base errors on the images varies greatly due to the large differences in image content, but it is clear that the image quality reconstructed by DP-ID and DNA-CTMF is far better than the other DNA storage methods. In terms of SSIM and MS-SSIM,
which are more consistent with human vision, the images reconstructed by DNA-CTMF are far better than those of DP-ID, with the highest ratios reaching 5.24 and 2.44, respectively. Therefore, DNA-CTMF remains superior to other DNA storage methods across a diverse range of images.
3.4. Wet experimental verification
To validate that the DNA-CTMF method remains effective in practical experiments, we used it to encode three images: peppers (BMP), baboon (JPG) and butterfly (PNG). The size of each image is 256×256 (2185 DNA sequences), giving a total of 6555 DNA sequences of 200 nt; the designed sequences are shown in Fig. S18. These sequences were synthesized by inkjet printing (Dynegene Technologies) and sequenced by Illumina paired-end sequencing.
Fig. 8A shows the distribution of the copy number of each sequence in sequencing, which is approximately Gaussian, with no sequence loss. Fig. 8B shows the error rates of substitutions, insertions and deletions at each position in the index and data parts. The different error types are shown in Fig. S19. Overall, substitutions account for approximately 55 %, deletions for approximately 33 %, and insertions for approximately 12 %. We randomly selected one sequence from each set of copies to reconstruct the three images and repeated this 2000 times; in this way, the error rate approximately conforms to the distribution in Fig. 8B. The MSE, PSNR, SSIM and MS-SSIM of the reconstructed images are shown in Fig. 8C and D, and the visualization of the reconstructed images is shown in Fig. S20. The noise in the images is discretely distributed, and median filtering effectively removes it and improves the image quality, which is consistent with the results of the simulation experiments.
4. Discussion
4.1. Computational efficiency and complexity
DNA-CTMF involves four computational components: 1) encoding the image into DNA sequences, 2) correcting errors in the DNA sequences, 3) decoding the DNA sequences into the reconstructed image, and 4) applying median filtering to the reconstructed image. In the encode stage, the time complexity of encoding an image into DNA sequences is O(n), where n is the number of pixels, defined as height × width × channels; encoding each pixel into a 5-base group is O(1) because it is a direct codebook lookup. In the self-correct stage, the time required for error correction varies with the types, positions, and numbers of errors in each sequence. We limit the error correction time of each wrong interval to 0.1 s; intervals exceeding this limit are replaced by the median value of the sequence, which effectively bounds the correction time. We measured the error correction time of 1,000,000 sequences at different error rates, as shown in Fig. S21A. Even at a 5 % error rate, the error correction time of every sequence is less than 1 s, more than 80 % of sequences are corrected within 0.1 s, and timed-out intervals are directly replaced by the median value to save time. In the decode stage, as the inverse of encoding, the time complexity is also O(n), with each 5-base group converted to its pixel value in O(1). We measured the encode and decode times of 4000 images (3600 BMP, 400 PNG, of varying sizes); as shown in Fig. S21B, most images encode and decode within 1 s. In the median filter stage, a simple and effective way to remove salt-and-pepper noise, replacing each pixel with the median of its 3×3 neighborhood requires a sort whose cost is O(1) due to the fixed neighborhood size, so the overall time complexity is O(n).
4.2. Static codebook and adaptive code
A static codebook adopts pre-defined fixed rules to convert binary data into DNA sequences, which not only stably meets biological constraints but also provides a certain self-correction ability. During synthesis and sequencing, DNA sequences with GC content between 40 % and 60 % and homopolymer length of no more than 3 have lower error rates [17,43]. When designing the Pixel-Base codebook, 256 5-base groups that meet the GC content constraint (40 %-60 %) and the homopolymer constraint (homopolymer length ≤ 2), and whose pairwise concatenations do not produce homopolymers of length 4 or more, were selected as the 5-base part. The GC content of each 5-base group is either 40 % or 60 %. A sequence is composed of m 5-base groups with 40 % GC content and
n 5-base groups with 60 % GC content, so the GC content of the sequence is given by Eq. (5):

GC Content (%) = (2m + 3n) / (5(m + n)) × 100 %    (5)

which always lies between 40 % and 60 %. A large number of DNA sequences with a length of 150 nt were generated through the codebook to quantitatively analyze their GC content and homopolymer counts, as shown in Fig. S22. The GC content of all sequences is between 40 % and 60 %, and no homopolymer of length 4 or more appears in any sequence.
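As a quick sanity check of Eq. (5) (a sketch; the variable names are ours), every mixture of 40 % and 60 % groups in a 150 nt sequence (m + n = 30) stays within the constraint:

```python
# Verify Eq. (5): for a 150 nt sequence built from m groups at 40 % GC
# and n groups at 60 % GC (m + n = 30), the overall GC content is
# (2m + 3n) / (5(m + n)) * 100, which stays within [40 %, 60 %].
for m in range(31):
    n = 30 - m
    gc = (2 * m + 3 * n) / (5 * (m + n)) * 100
    assert 40.0 <= gc <= 60.0
print("GC content bounded in [40 %, 60 %] for all m, n")
```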
Adaptive coding schemes, such as DNA fountain codes and compression before storage based on file type, can improve storage density, but because base error rates are unknown, error correction codes are required to handle base errors, which introduces redundancy. The static codebook directly maps image pixels to sequences, so partial information can be recovered even at high error rates. Moreover, the static codebook enables detection of base groups that do not exist in the codebook, which helps to correct errors and prevent error propagation.
4.3. Image file and Pixel Coding
Whether the image format is BMP, PNG, JPEG, or another format, images are all made up of a pixel matrix. DNA-CTMF encodes the pixels into DNA sequences regardless of the image format. Different image formats are designed for computers; for example, PNG is a lossless compression format, while JPEG is a lossy compression format, and both are stored in the computer as a special structure. In other words, they are stored as a whole. If the image is stored as a file structure in DNA sequences, base errors may cause irreversible damage to the image. Lu et al. found that if 0.001 % of the bytes in a JPEG file change, the image may exhibit visible pixel loss, block damage, or misalignment; if 0.01 % of the bytes change, nearly all images suffer irreversible damage, and in the worst case the file may become unreadable [44]. Yan et al. argued that compressing the raw information before encoding, which can lead to complete failure of data recovery even with a minor error, is not suitable for scenarios with high error rates [21]. If the pixel matrix of the image is stored instead, base errors affect individual pixels and appear in the image as pixel variation, that is, noise, so part of the image information can be recovered even at high error rates. Given that DNA storage is an error-prone process, the inherent fault tolerance of images makes them highly compatible with DNA storage. Therefore, storing images in DNA has the potential to be a breakthrough for DNA storage at scale.
4.4. DNA-CTMF and universal error correction codes
Compared with universal error correction codes such as RS and LDPC, the error correction function of the Pixel-Base codebook in DNA-CTMF is to convert base errors into removable noise rather than to accurately correct every error. Moreover, DNA-CTMF is compatible with error correction codes, which can be added in the DNA-CTMF encode stage to correct errors in the self-correct stage. At low error rates, DNA-CTMF can reconstruct high-quality images close to the original; at high error rates, numerous errors may exceed the correction capability of RS and LC codes, which may lead to high redundancy and cost. For lossy image storage, this is not cost-effective. In short, for texts and compressed files, precise storage is essential; DNA-CTMF is not suitable for that scenario, where error correction codes and high copy numbers are the appropriate solutions. For images, we exploit their characteristics, converting base errors into salt-and-pepper noise and removing it with median filtering, so that DNA-CTMF can recover part of the image information even at high error rates.
5. Conclusion
In this paper, we propose the DNA-CTMF method to reconstruct high-quality images from lossy DNA storage even at high error rates and high indel ratios. In the encode stage, the image pixels are converted into DNA sequences according to a pre-built Pixel-Base codebook, which ensures the sequences satisfy biological constraints and possess a certain degree of self-error-correction ability. In the self-correct and decode stages, error correction is performed according to the codebook, with the aim of adjusting the sequence to the correct length and restoring base groups offset by indels to their original positions; the DNA sequences are then converted into the reconstructed image, where the noise caused by base errors is discretely distributed and can be largely removed by median filtering. Instead of achieving error-free correction, the key idea of DNA-CTMF is to restore base groups offset by indels to their original positions as far as possible. Given that the majority of base groups are error-free, those with errors are sparsely distributed and appear as salt-and-pepper noise in the image. Repeated simulation experiments demonstrate that the images reconstructed by DNA-CTMF at 1 %-5 % error rates significantly outperform other DNA storage methods and remain robust across indel ratios. Large-scale image simulations demonstrate that DNA-CTMF obtains outstanding results on a wide range of images. The results of the wet experiment are consistent with the simulations, revealing the potential of DNA-CTMF for storing images in DNA. Unlike prior studies that utilized error correction codes to correct base errors, DNA-CTMF addresses base errors through median filtering, providing a new insight for storing images in DNA.
CRediT authorship contribution statement
Qi Xu: Writing - original draft, Methodology. Ying Zhou: Resources. Qingjiang Sun: Investigation. Xiangwei Zhao: Investigation. Zuhong Lu: Project administration. Kun Bi: Supervision, Data curation.
Ethics statement

Not applicable.

Data and code availability
The data and source code are publicly available on GitHub (https://github.com/minghong/DNA-CTMF).
Declaration of competing interest

All authors declare no financial or non-financial competing interests.

Acknowledgements
This work was funded by the National Key Research and Development Program (2020YFA0712104, 2023YFF1206102), the National Natural Science Foundation of China (62201141), and the Support Plan for Outstanding Young Scholars at Southeast University.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi. org/10.1016/j.synbio.2025.04.015.
References
[1] Tan X, Ge L, Zhang T, et al. Preservation of DNA for data storage. Russ Chem Rev 2021;90(2):280.
[2] Bencurova E, Akash A, Dobson RC, et al. DNA storage-from natural biology to synthetic biology. Comput Struct Biotechnol J 2023;21:1227-35.
[3] Hao Y, Li Q, Fan C, et al. Data storage based on DNA. Small Struct 2021;2(2):2000046.
[4] Ezekannagha C, Becker A, Heider D, et al. Design considerations for advancing data storage with synthetic DNA for long-term archiving. Mater Today Bio 2022;15: 100306.
[5] Bi K, Xu Q, Lai X, et al. Multi-file dynamic compression method based on classification algorithm in DNA storage. Med Biol Eng Comput 2024;62(12): 3623-35.
[6] Shen P, Zheng Y, Zhang C, et al. DNA storage: the future direction for medical cold data storage. Synth Sys Biotechnol 2025;10(2):677-95.
[7] Cao B, Zheng Y, Shao Q, et al. Efficient data reconstruction: the bottleneck of large-scale application of DNA storage. Cell Rep 2024;43(4):113699.
[8] Yu M, Tang X, Li Z, et al. High-throughput DNA synthesis for data storage. Chem Soc Rev 2024;53(9):4463-89.
[9] Ruan C, Yang L, Han R, et al. Robust DNA image storage decoding with residual CNN. In: 2024 IEEE international symposium on circuits and systems (ISCAS); 2024. p. 1-5.
[10] Song L, Geng F, Gong Z-Y, et al. Robust data storage in DNA by de Bruijn graphbased de novo strand assembly. Nat Commun 2022;13(1):5361.
[11] Zheng Y, Cao B, Zhang X, et al. DNA-QLC: an efficient and reliable image encoding scheme for DNA storage. BMC Genom 2024;25(1):266.
[12] Zhou Y, Bi K, Xu Q, et al. Ultrafast and accurate DNA storage and reading integrated system via microfluidic magnetic beads polymerase chain reaction. ACS Nano 2025;19(7):7306-16.
[13] Li Y, Du DH, Ou L, et al. HL-DNA: a hybrid lossy/lossless encoding scheme to enhance DNA storage density and robustness for images. In: 2022 IEEE 40th international conference on computer design (ICCD); 2022. p. 434-42.
[14] Li B, Ou L, Du D. IMG-DNA: approximate DNA storage for images. In: Proceedings of the 14th ACM international conference on systems and storage; 2021. p. 1-9.
[15] Liu DD, Cheow LF. Rapid information retrieval from DNA storage with microfluidic very large-scale integration platform. Small 2024;20(17):e2309867.
[16] Wang K, Cao B, Ma T, et al. Storing images in DNA via base128 encoding. J Chem Inf Model 2024;64(5):1719-29.
[17] Cao B, Wang K, Xie L, et al. PELMI: realize robust DNA image storage under general errors via parity encoding and local mean iteration. Briefings Bioinf 2024;25(5):bbae463.
[18] Xu Q, Ma Y, Lu Z, et al. DP-ID: interleaving and denoising to improve the quality of DNA storage image. Interdiscip Sci 2024. https://doi.org/10.1007/s12539-024-00671-6.
[19] Xu Q, Lu Z, Bi K. DNA-LSIED: DNA lossy storage for images by encryption and corrective denoising method. Signal, Image Video Process 2025;19(1):11.
[20] Su Y, Chu L, Lin W, et al. Robust and efficient representation-based DNA storage architecture by deep learning. Small Methods 2024;9(3):e2400959.
[21] Yan Z, Zhang H, Lu B, et al. DNA palette code for time-series archival data storage. Natl Sci Rev 2025;12(1):nwae321.
[22] Zhang C, Wu R, Sun F, et al. Parallel molecular data storage by printing epigenetic bits on DNA. Nature 2024;634(8035):824-32.
[23] Rasool A, Hong J, Hong Z, et al. An effective DNA-based file storage system for practical archiving and retrieval of medical MRI data. Small Methods 2024;8(10):e2301585.
[24] Zan X, Yao X, Xu P, et al. Hierarchical error correction strategy for text DNA storage. Interdiscipl Sci 2022;14(1):141-50.
[25] Zhang X, Zhou F. An encoding table corresponding to ASCII codes for DNA data storage and a new error correction method HMSA. IEEE Trans NanoBioscience 2024;23(2):344-54.
[26] Bi K, Lu Z, Ge Q, et al. Extended XOR algorithm with biotechnology constraints for data security in DNA storage. Curr Bioinf 2022;17(5):401-10.
[27] Ping Z, Chen S, Zhou G, et al. Towards practical and robust DNA-based data archiving using the yin-yang codec system. Nature Comput Sci 2022;2(4):234-42.
[28] Jo S, Shin H, Joe SY, et al. Recent progress in DNA data storage based on highthroughput DNA synthesis. Biomed Eng Lett 2024;14(5):993-1009.
[29] Dong Y, Sun F, Ping Z, et al. DNA storage: research landscape and future prospects. Natl Sci Rev 2020;7(6):1092-107.
[30] Yuan L, Xie Z, Wang Y, et al. DeSP: a systematic DNA storage error simulation pipeline. BMC Bioinf 2022;23(1):185.
[31] Schwarz M, Welzel M, Kabdullayeva T, et al. MESA: automated assessment of synthetic DNA fragments and simulation of DNA synthesis, storage, sequencing and PCR errors. Bioinformatics 2020;36(11):3322-6.
[32] Yao X, Xie B, Zan X, et al. A novel image encryption scheme for DNA storage systems based on DNA hybridization and gene mutation. Interdiscipl Sci Comput Life Sci 2023;15(3):419-32.
[33] Zhou Y, Bi K, Ge Q, et al. Advances and challenges in random access techniques for in vitro DNA data storage. ACS Appl Mater Interfaces 2024;16(33):43102-13.
[34] Sagheer SVM, George SN. A review on medical image denoising algorithms. Biomed Signal Process Control 2020;61:102036.
[35] Fan L, Zhang F, Fan H, et al. Brief review of image denoising techniques. Vis Comput Ind Biomed Art 2019;2(1):7.
[36] Dey S, Tibarewala DN, Maity SP, et al. Automated detection of early oral cancer trends in habitual smokers. In: Soft computing based medical image analysis. Elsevier; 2018. p. 83-107.
[37] Doricchi A, Platnich CM, Gimpel A, et al. Emerging approaches to DNA data storage: challenges and prospects. ACS Nano 2022;16(11):17552-71.
[38] Gimpel AL, Stark WJ, Heckel R, et al. A digital twin for DNA data storage based on comprehensive quantification of errors and biases. Nat Commun 2023;14(1):6026.
[39] Hore A, Ziou D. Image quality metrics: PSNR vs. SSIM. In: 2010 20th international conference on pattern recognition; 2010. p. 2366-9.
[40] Wang Z, Bovik AC, Sheikh HR, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 2004;13(4):600-12.
[41] Dimopoulou M, Antonini M. Data and image storage on synthetic DNA: existing solutions and challenges. EURASIP J Image Video Process 2022;2022(1):23.
[42] Ma K, Duanmu Z, Wu Q, et al. Waterloo exploration database: new challenges for image quality assessment models. IEEE Trans Image Process 2016;26(2):1004-16.
[43] Rasool A, Qu Q, Jiang Q, et al. A strategy-based optimization algorithm to design codes for DNA data storage system. In: International conference on algorithms and architectures for parallel processing; 2021. p. 284-99.
[44] Lu Y, Zhang Z, Yang J, et al. High fault-tolerant DNA image storage system based on VAE. IEEE Trans NanoBioscience 2025. https://doi.org/10.1109/TNB.2025.3544401.