NLSTseg: A Pixel-level Lung Cancer Dataset Based

Background & Summary

High mortality rate of lung cancer

Lung cancer is the second most commonly diagnosed cancer in both men and women and the leading cause of cancer death¹. Although lung cancer mortality rates have decreased, they remain high because most lung cancers cause no obvious symptoms in the early stage. By the time symptoms become apparent, the cancer has usually reached an advanced stage, leaving limited treatment options and low cure rates.

According to the American Joint Committee on Cancer 8th edition (AJCC 8th) staging system², which defines the TNM clinical stages of lung cancer, the 5-year survival rate of early-stage (stage I and stage II) Non-Small Cell Lung Cancer (NSCLC) is around 76%, whereas that of advanced-stage (stage III and stage IV) disease is notably lower, at around 30%. Given this substantial difference, early detection of lung cancer is crucial for initiating prompt treatment.

Low-dose computed tomography in lung cancer screening

The two largest randomized lung cancer screening trials, the National Lung Screening Trial (NLST)³ and the Nederlands Leuvens Longkanker Screenings Onderzoek trial (NELSON)^4,5, found that low-dose computed tomography (LDCT) screening reduces lung cancer mortality by 20% to 24%. Accordingly, in 2013 the US Preventive Services Task Force (USPSTF)⁶ recommended annual screening for high-risk adults aged 55 to 80 years who have a smoking history of at least 30 pack-years and either currently smoke or have quit within the past 15 years⁷.

In 2021 to 2022, the USPSTF widened the eligibility criteria for annual LDCT screening, changing the age range from 55–80 to 50–80 years and the smoking history threshold from 30 to 20 pack-years^7,8. As a result, the volume of lung cancer screening examinations will continue to rise, requiring radiologists to spend more time interpreting images and consuming more healthcare resources.

Lung cancer / nodule segmentation datasets

In the field of medical image analysis, datasets are crucial for the development and evaluation of deep learning models. Computed tomography (CT) is one of the essential tools for diagnosing lung cancer, and numerous lung cancer CT datasets are publicly available, such as LIDC-IDRI⁹, LNDb^10–12, LUNA16¹³, NLST³, and NSCLC-Radiomics^14,15. While these publicly accessible datasets provide a substantial amount of lung cancer imaging data, the number and availability of annotation files remain relatively limited.

The currently known lung cancer CT datasets with annotated files include:

LIDC-IDRI⁹: The largest annotated lung cancer dataset, initiated by the National Cancer Institute (NCI) and established through collaboration among seven academic centers and eight medical imaging companies. It contains CT scans and lesion annotations for 1,080 cases, annotated by four chest radiologists, providing detailed pixel-level annotation. Owing to its high-quality imaging data, it has been widely used in various studies^13,16–19.
LNDb^10–12: A collection of CT images from 294 patients. The dataset was annotated by two radiologists and offers annotations of pulmonary nodules from clinical reports along with detailed pixel-level annotation.
LUNA16¹³: The LUNA16 (LUng Nodule Analysis) dataset is a dataset for pulmonary nodule detection. It consists of 1,186 lung nodules annotated in 888 CT scans.

The National Lung Screening Trial (NLST)³ was a randomized controlled clinical trial of lung cancer screening tests that compared whether chest X-rays or LDCT could effectively reduce lung cancer mortality. The study enrolled 53,454 high-risk individuals between August 2002 and April 2004. The LDCT group had 26,722 participants, and the chest radiography group had 26,732 participants. Each participant underwent annual screening over a three-year period.

The NLST dataset has been used in multiple studies to develop lung cancer risk prediction models. Ardila et al.^20,21 collected 14,851 CT images from the NLST to develop a model that predicts the risk of malignancy using bounding box annotations for pulmonary nodules or tumors. Balagurunathan et al.²² used NLST data from 100 cases with pulmonary nodules detected over two consecutive years and applied semi-automated annotation to select cases with nodules ≥4 mm. However, the semi-automated annotation process struggles with non-solid nodules, poorly defined margins, and nodules attached to vessels or pleura. Masquelin et al.²³ used the NLST dataset to predict the malignancy potential of indeterminate nodules smaller than 20 mm, focusing on solitary solid nodules, but performance was constrained for ground-glass and part-solid nodules.

Mikhael et al.²⁴ developed the Sybil model using the NLST dataset to predict lung cancer risk for up to six years. The annotations from this study were made publicly available; however, they are bounding boxes, which are effective for object detection yet do not fully represent the shape of the tumor. To date, no segmentation annotations of NLST have been released.

Given the significant role of LDCT in early lung cancer screening due to its lower radiation dose, the lack of annotated datasets has limited further progress in research.

Research objective

To the best of our knowledge, there is currently no publicly available dataset with pixel-level annotations of pulmonary lesions on LDCT images. To enhance the availability of annotated lung cancer CT images, we used the NLST dataset to develop a set of annotated LDCT lung cancer images. We collected LDCT scans from 605 patients who had positive screening results and were later confirmed to have lung cancer during the years of screening. Radiologists manually reviewed each scan and provided pixel-level annotations of each lung lesion, resulting in 662 segmented tumors and 53 segmented nodules. We release the annotated dataset, LDCT images, and related documentation for future use by researchers. To validate the usability of our annotated dataset, we conducted an initial training of a 2D U-Net model to predict pulmonary lesions.

Methods

Data collection

Our dataset was built from the open dataset of the National Lung Screening Trial (NLST)²⁵. The trial enrolled 53,454 high-risk individuals between August 2002 and April 2004, whose ages ranged between 55 and 78 years, and its LDCT arm comprised 26,722 cases. In the LDCT arm, all cases were scheduled to receive annual LDCT scans for three consecutive years unless lung cancer was diagnosed during the study period. We excluded 26,073 cases that were not diagnosed with lung cancer during the study period. In total, 649 patients were diagnosed with lung cancer after a positive screening result (T0: 270; T1: 168; T2: 211). Imaging data were downloaded from the public database of The Cancer Imaging Archive (TCIA)²⁶. Nine patients had no imaging records, and 21 patients had incomplete imaging. Images from the remaining 619 patients were reviewed, and 14 patients with undetectable lesions on imaging were excluded. In total, images from 605 patients were annotated (Fig. 1). The NLST has publicly released clinical records, including tumor stage, histology type, tumor location, and patient characteristics, together with the LDCT image datasets. The NLST dataset and clinical data files are freely available from TCIA²⁶ under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). All data were anonymized.

Fig. 1 [Images not available. See PDF.]

Flowchart of patient inclusion and exclusion. This flowchart illustrates the inclusion and exclusion criteria applied to select patients for low-dose CT (LDCT) data analysis.

During the imaging selection phase, each of the 605 patients had 1 to 3 different series per year. We selected only the series with the highest number of slices. If several series had the same slice count, priority was given according to the reconstruction kernel. Because the NLST data come from 33 different institutions, scanner parameters differ by manufacturer, and the kernel priority was defined per manufacturer: for GE MEDICAL SYSTEMS, (Lung > Standard > bone); for Philips, (B > D > C > A); for SIEMENS, (30 F > 50 F > 80 F); and for TOSHIBA, (F30 > FC10 > FC51). Each patient included in the study had exactly one selected series. Information about the selected series, kVp, and other imaging parameters can be verified in the file “Image.xlsx”.
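The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the dictionary keys (`manufacturer`, `kernel`, `n_slices`, `series_uid`) and the function name `select_series` are hypothetical, and the kernel strings are copied from the text, so they may not match the exact values stored in the DICOM Convolution Kernel tag.

```python
# Kernel priority per manufacturer, copied from the text (earlier = preferred).
# The exact strings found in real DICOM headers may differ (assumption).
KERNEL_PRIORITY = {
    "GE MEDICAL SYSTEMS": ["Lung", "Standard", "bone"],
    "Philips": ["B", "D", "C", "A"],
    "SIEMENS": ["30 F", "50 F", "80 F"],
    "TOSHIBA": ["F30", "FC10", "FC51"],
}


def select_series(series_list):
    """Pick one series: most slices first, kernel priority as tie-breaker."""
    def sort_key(series):
        order = KERNEL_PRIORITY.get(series["manufacturer"], [])
        kernel = series["kernel"]
        # Kernels not listed for the manufacturer rank after all listed ones.
        rank = order.index(kernel) if kernel in order else len(order)
        # Negative slice count so that min() prefers the largest series.
        return (-series["n_slices"], rank)

    return min(series_list, key=sort_key)


# Hypothetical candidates for one patient and one screening year.
candidates = [
    {"series_uid": "a", "manufacturer": "GE MEDICAL SYSTEMS",
     "kernel": "Standard", "n_slices": 160},
    {"series_uid": "b", "manufacturer": "GE MEDICAL SYSTEMS",
     "kernel": "Lung", "n_slices": 160},
    {"series_uid": "c", "manufacturer": "GE MEDICAL SYSTEMS",
     "kernel": "Lung", "n_slices": 120},
]
chosen = select_series(candidates)
print(chosen["series_uid"])
```

Here series "a" and "b" tie on slice count, so the kernel order decides between them, while series "c" is never considered ahead of them because it has fewer slices.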

Tumor annotation

We used the Segment Editor module of the open-source software 3D Slicer^27,28 (https://www.slicer.org/), version 5.6.1, to segment lung lesions. During segmentation, the ‘Editable area’ option was set to ‘Everywhere’, and the ‘Paint’ effect of the Segment Editor was our main tool.

Each CT image was annotated manually with pixel-level precision along the tumor outlines by a researcher with two years of imaging research experience and a radiologist specializing in oncology with five years of experience. Annotations were guided by the lesion location provided in the NLST clinical data (right upper lobe, right middle lobe, right lower lobe, left upper lobe, left lower lobe, etc.). A radiation oncologist with ten years of experience and a radiologist with ten years of experience cross-verified each annotation. If discrepancies occurred between the image and the clinical data, we adjusted the clinical lesion location based on the pathology findings. We also marked suspicious nodules. The location of each lesion was recorded in the file “Lable.xlsx”. Two cases had no obvious tumors on imaging, but suspected lung nodules were found. In total, we annotated 662 tumors and 53 nodules, for a total of 715 lesions. CT images and annotation files were saved in NIFTI format.

During the annotation process, we encountered some limitations. None of the images were acquired with contrast agents, which made it challenging to distinguish the boundaries of central lung tumors from mediastinal structures. Consequently, we adjusted the window level while annotating to delineate the recognizable areas (Fig. 2). Examples include tumors located near the hilum, main stem bronchus, blood vessels, and the heart. For diffuse lung tumors with more than five lesions, we annotated the five most prominent. Tumors with clear boundaries between them were segmented as separate annotations. Fig. 3 shows different lesion types: diffuse malignancy, ground-glass opacities (GGO), tumors close to an organ boundary, and so on.

Fig. 2 [Images not available. See PDF.]

Solutions to challenges in image annotation. This figure illustrates the stepwise process used to enhance the visibility of tumor boundaries in CT scans. Step 1: The tumor presents with indistinct boundaries (the red box indicates the tumor location). Step 2: Adjust the image window level (the red box shows the tumor location, where a faint boundary becomes visible). Step 3: Adjust back to the lung window level.

Fig. 3 [Images not available. See PDF.]

Lesion types. Tumors differ in appearance; if a case contains more than one tumor, the tumors are segmented in different colors. (a) Diffuse malignancy. Green shows tumor number one and red shows tumor number two. (b) Ground-glass opacity. (c) Tumor close to the organ boundary. (d) Early-stage lung cancer with dimensions exceeding the normal range. Left: Stage IA; Right: Stage IB.

Lesion volume calculation

In this study, we measured lesion size based on volume rather than diameter. This approach provides a more detailed description of the lesion’s actual size, particularly for irregularly shaped or complex lesions, as volume measurements offer more comprehensive information compared to data consisting of just the diameter of tumors.

Each case folder contains two files: one LDCT image in NIFTI format and one annotation file. The NIFTI header includes metadata of the image, where srow_x, srow_y, and srow_z correspond to the pixel spacing in the x and y directions and the slice thickness, respectively. These values describe the spatial resolution of the images. Lesion volumes were calculated by multiplying the number of annotated voxels by the physical volume occupied by a single voxel. The lesion volumes were verified with the “Segment Statistics” module in 3D Slicer.
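The volume calculation described above (voxel count times the physical volume of one voxel) can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the mask is assumed to store one integer code per lesion with 0 as background, and the affine is assumed to be the 4×4 matrix whose first three rows are the header's srow_x, srow_y, and srow_z. Loading the files (for example with nibabel) is left out so the example stays self-contained.

```python
import numpy as np


def voxel_volume_mm3(affine):
    """Physical volume of one voxel from a 4x4 voxel-to-world affine.

    The absolute determinant of the upper-left 3x3 block equals
    dx * dy * dz for an axis-aligned scan and stays correct for
    rotated or flipped axes.
    """
    return float(abs(np.linalg.det(np.asarray(affine, dtype=float)[:3, :3])))


def lesion_volumes_cm3(mask, voxel_mm3):
    """Return {label: volume in cm^3} for every non-zero label in `mask`."""
    labels, counts = np.unique(mask[mask > 0], return_counts=True)
    # 1 cm^3 = 1000 mm^3
    return {int(lab): int(cnt) * voxel_mm3 / 1000.0
            for lab, cnt in zip(labels, counts)}


# Synthetic example: 0.7 mm x 0.7 mm pixels, 2.5 mm slices (made-up values).
affine = np.diag([-0.7, 0.7, 2.5, 1.0])
mask = np.zeros((10, 10, 10), dtype=np.uint8)
mask[0:5, 0:4, 0:2] = 1      # lesion 1: 40 voxels
mask[9, 9, 9] = 2            # lesion 2: 1 voxel

vox = voxel_volume_mm3(affine)
volumes = lesion_volumes_cm3(mask, vox)
print(vox, volumes)
```

Using the determinant rather than reading three spacing values directly is a design choice made here so that the sketch does not depend on the scan being axis-aligned.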

Data preprocessing

In this study, NIFTI format images were processed by adjusting the pixel intensity values to the lung window of −500 to 1400 Hounsfield Units (HU). These images were then converted to 2D Portable Network Graphics (PNG) format, with each image sized at 512×512 pixels. The training dataset comprised 86,731 images, among which 5,726 images contained tumor segmentations (6.6%). The test dataset included 20,688 2D LDCT images.
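A minimal sketch of the intensity windowing step is shown below. It assumes that the stated window of −500 to 1400 HU can be read as a lower and an upper clipping bound and that the clipped values are scaled linearly to 8-bit grey levels before being written as PNG; both readings are assumptions, and the function name `window_to_uint8` is hypothetical. Writing the PNG itself (for example with Pillow) is omitted.

```python
import numpy as np


def window_to_uint8(hu, lower=-500.0, upper=1400.0):
    """Clip HU values to [lower, upper] and rescale linearly to 0..255.

    The default bounds are one reading of the window given in the text
    (assumption); pass other values for a different window.
    """
    hu = np.asarray(hu, dtype=np.float32)
    clipped = np.clip(hu, lower, upper)
    scaled = (clipped - lower) / (upper - lower)   # now in 0..1
    return np.round(scaled * 255.0).astype(np.uint8)


# Synthetic 512 x 512 slice filled with made-up HU values.
rng = np.random.default_rng(0)
slice_hu = rng.integers(-1024, 3000, size=(512, 512))
slice_png = window_to_uint8(slice_hu)
print(slice_png.shape, slice_png.dtype, slice_png.min(), slice_png.max())
```

Each 2D array produced this way corresponds to one slice of the 3D volume; how slices were ordered and named on disk is not specified here.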

Model training process

The model input consisted of 512×512 2D LDCT images and their corresponding annotations. Fig. 4 presents the different lesion sizes on LDCT. In our dataset, the total number of non-tumor-annotated pixels was 22,730,946,644, while the total number of tumor-annotated pixels was 5,064,620 (0.022%). This severe imbalance between lesion and non-lesion pixels restricted the model’s ability to learn tumor features. To address this issue, we implemented several strategies to enhance the model’s learning capability and generalization performance.
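The imbalance figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative. It also shows that the two pixel totals add up to exactly the number of pixels in the 86,731 training images of 512×512 pixels.

```python
tumor_px = 5_064_620
non_tumor_px = 22_730_946_644
total_px = tumor_px + non_tumor_px

# How many 512 x 512 slices do these pixels correspond to?
n_slices, remainder = divmod(total_px, 512 * 512)
print(n_slices, remainder)                     # 86731 0

# Share of tumor pixels among all pixels.
pixel_share = 100 * tumor_px / total_px
print(f"{pixel_share:.3f}%")                   # 0.022%

# Share of slices that contain any tumor annotation.
slice_share = 100 * 5_726 / 86_731
print(f"{slice_share:.1f}%")                   # 6.6%
```

The gap between the slice-level share (6.6%) and the pixel-level share (0.022%) illustrates why the imbalance is much more severe at the pixel level than at the image level.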

Fig. 4 [Images not available. See PDF.]

The different lesion sizes on CT. This figure illustrates the appearance of lung lesions with varying volumes on CT images, classified into six categories based on lesion size, ranging from less than 0.5 cm³ to greater than 50 cm³. The volume differences result in significant variations in the area occupied by the lesions on the lung images.

To enable the model to learn subtle features in the images, we incorporated ResBlock²⁹ into the model architecture and added a dropout layer. By learning residuals, ResBlock helps prevent the degradation seen in deeper networks and alleviates potential vanishing-gradient problems during training. Additionally, dropout, a regularization technique that can be viewed as implicit ensemble learning, helps prevent overfitting, further enhancing the model’s performance and generalization ability.
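As a rough illustration of the idea, a residual block with dropout might look like the following PyTorch sketch. It is not the authors' architecture: the layer ordering, the use of batch normalization, the 1×1 projection on the skip path, the channel counts, and the dropout rate are all assumptions chosen for a compact example.

```python
import torch
from torch import nn


class ResBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection and spatial dropout."""

    def __init__(self, in_ch, out_ch, p_drop=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop),          # drops whole feature maps
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Match channel counts on the skip path when they differ.
        if in_ch == out_ch:
            self.skip = nn.Identity()
        else:
            self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The block learns a residual that is added to the (projected) input.
        return self.act(self.body(x) + self.skip(x))


block = ResBlock(in_ch=1, out_ch=8)
features = block(torch.zeros(2, 1, 64, 64))   # small dummy batch
print(features.shape)
```

Padding of 1 with 3×3 kernels keeps the spatial size unchanged, so a block like this could in principle replace the plain convolution pairs of a U-Net encoder or decoder stage.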

To deal with the pixel imbalance problem between positive and negative samples, we employed an oversampling strategy, providing positive samples with tumors a higher chance of selection compared to negative samples. This approach helped the model to learn the features of minority lesions. The training process utilized 8 NVIDIA GTX 1080 Ti GPUs to fully leverage computational resources and accelerate the model training.
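One simple way to give tumor-containing slices a higher chance of selection is weighted sampling with replacement, sketched below in plain Python. The target share of positive slices (`positive_fraction`), the function names, and the toy data are assumptions; the text does not state which sampling ratio or implementation was used.

```python
import random


def oversampling_weights(has_tumor, positive_fraction=0.5):
    """Per-slice sampling weights so positives make up `positive_fraction`
    of the draws on average."""
    n_pos = sum(has_tumor)
    n_neg = len(has_tumor) - n_pos
    if n_pos == 0 or n_neg == 0:
        return [1.0] * len(has_tumor)      # nothing to rebalance
    w_pos = positive_fraction / n_pos
    w_neg = (1.0 - positive_fraction) / n_neg
    return [w_pos if flag else w_neg for flag in has_tumor]


def sample_epoch(has_tumor, n_draws, positive_fraction=0.5, seed=0):
    """Draw slice indices with replacement according to the weights."""
    rng = random.Random(seed)
    weights = oversampling_weights(has_tumor, positive_fraction)
    return rng.choices(range(len(has_tumor)), weights=weights, k=n_draws)


# Toy dataset with about 6.6% positive slices, similar to the training set.
flags = [True] * 66 + [False] * 934
indices = sample_epoch(flags, n_draws=20_000)
observed = sum(flags[i] for i in indices) / len(indices)
print(f"positive share after oversampling: {observed:.2f}")
```

In a PyTorch training loop, the same weights could be passed to `torch.utils.data.WeightedRandomSampler`, which performs this kind of weighted draw inside the data loader.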

Data Records

Our dataset is publicly available and can be accessed through the open access repository on Zenodo (https://doi.org/10.5281/zenodo.14838349)³⁰.

The dataset includes 605 patient ID folders, divided into 6 directories of roughly 100 patient folders each. Each patient folder comprises an LDCT image and an annotation file, with filenames formatted as ID_CT.nii.gz and ID_tumor.nii.gz, respectively (Fig. 5). The file “Label.xlsx” records lesion annotation information, with the following column descriptions: (Mark_labels) denotes the annotation code, where code 1 represents the first tumor, and so on. (Labels_type) indicates the type of annotation, where 1 represents a tumor and 2 represents a nodule. (Tumor_V(cm³)) specifies the volume of each lesion. (Lung_loc) denotes the location of each lesion within the lung, for example: right upper lobe, right middle lobe, right lower lobe, left upper lobe, left lower lobe, and so on.
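For readers who want to iterate over the cases, the naming convention above can be turned into a small lookup helper. The sketch below relies only on the ID_CT.nii.gz / ID_tumor.nii.gz pattern and the Labels_type codes given in the text; the directory names, the example patient ID, and the function name `find_case_files` are hypothetical, and the files created here are empty placeholders rather than real images.

```python
import tempfile
from pathlib import Path

# Labels_type codes as described in the text.
LABEL_TYPES = {1: "tumor", 2: "nodule"}


def find_case_files(root, patient_id):
    """Return (CT path, annotation path) for one patient ID.

    Searches recursively, so it does not depend on how the patient
    folders are grouped into top-level directories.
    """
    root = Path(root)
    for ct_path in root.rglob(f"{patient_id}_CT.nii.gz"):
        seg_path = ct_path.with_name(f"{patient_id}_tumor.nii.gz")
        if seg_path.exists():
            return ct_path, seg_path
    raise FileNotFoundError(f"no CT/annotation pair found for {patient_id}")


# Build a throwaway directory tree that mimics the described layout.
with tempfile.TemporaryDirectory() as tmp:
    case_dir = Path(tmp) / "group_1" / "100012"    # made-up names
    case_dir.mkdir(parents=True)
    (case_dir / "100012_CT.nii.gz").touch()
    (case_dir / "100012_tumor.nii.gz").touch()

    ct_file, seg_file = find_case_files(tmp, "100012")
    found_names = (ct_file.name, seg_file.name)
    print(found_names, LABEL_TYPES[1], LABEL_TYPES[2])
```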

Fig. 5 [Images not available. See PDF.]

Folder Hierarchy Description. This figure represents the structural composition of the dataset.

The file “Patient.xlsx” contains demographic data for the patients. (cancyr) indicates the year of lung cancer diagnosis, with 0 representing the first year, 1 the second year, and 2 the third year. (Cigsmok) denotes smoking history, where 0 indicates former smokers and 1 indicates current smokers. (Gender) represents gender, with 1 for male and 2 for female. In cancer staging, 16 patients were initially classified according to the AJCC 6th edition standards. After cross-verification by three physicians using the AJCC 7th edition standards, as well as pathology, clinical, and imaging data provided by the NLST, these patients were reclassified according to the AJCC 7th edition. The items marked in bold font were initially staged according to the AJCC 6th edition.

The file “Image.xlsx” records image-related parameters, with columns derived from DICOM tags, including ID, SeriesInstanceUID, Study Date, StudyInstanceUID, Series Description, Manufacturer, Convolution Kernel, and kvp.

Technical Validation

Statistical description of clinical data

For the statistical description of the annotated data, we used the NLST clinical data records of the 605 patients, including ID, Gender, Age, Stage (AJCC 7th), and Histology type. Males and females accounted for 351 (58.0%) and 254 (41.9%) patients, respectively. Former smokers (quit within 15 years) and current smokers accounted for 268 (44.2%) and 337 (55.7%) patients, respectively, and most of the patients were White (561, 92.7%). The average age was 64 years, ranging from 55 to 74 years (Table 1).

Table 1. Characteristics of lung cancer diagnosis.

Patient characteristics (N = 605)
Years of diagnosis	Total (%)
Gender
Male	351 (58.0)
Female	254 (41.9)
Smoking status
Former	268 (44.2)
Current	337 (55.7)
Race
White	561 (92.7)
Black or African-American	23 (3.8)
Asian	10 (1.7)
American Indian or Alaskan Native	4 (0.7)
Native Hawaiian or Other Pacific Islander	2 (0.3)
More than one race	4 (0.7)
Unknown / declined to answer	1 (0.2)
Lung cancer stage (AJCC 7th)
Stage IA	315 (52.1)
Stage IB	63 (10.4)
Stage IIA	43 (7.1)
Stage IIB	18 (3.0)
Stage IIIA	73 (12.1)
Stage IIIB	15 (2.5)
Stage IV	70 (11.6)
Carcinoid, cannot be assessed	2 (0.3)
Unknown, cannot be assessed	6 (1.0)
	Mean (Range)
Age	64 (55–74)

Stage IA lung cancer comprised 315 (52.1%) patients. The most common histology type was adenocarcinoma (320, 52.8%), followed by squamous cell carcinoma (122, 20.1%) (Table 2). Further details can be found in the file “Patient.xlsx”.

Table 2. Stage and Histology type.

Stage	Stage IA	Stage IB	Stage IIA	Stage IIB	Stage IIIA	Stage IIIB	Stage IV	Carcinoid, cannot be assessed	Unknown, cannot be assessed	Total (%)
Histology type
Non-small cell carcinoma
Adenocarcinoma	190	36	25	11	28	2	27	0	1	320 (52.8)
Squamous cell carcinoma	63	17	10	1	19	5	7	0	0	122 (20.1)
Other	57	8	5	5	14	4	17	2	2	114 (18.8)
Small cell carcinoma	2	0	3	0	10	4	17	0	3	39 (6.4)
Other	1	2	0	0	2	0	1	0	0	6 (1.0)
Unknown	2	0	0	1	0	0	1	0	0	4 (0.7)
Total (%)	315 (52.1)	63 (10.4)	43 (7.1)	18 (3.0)	73 (12.1)	15 (2.5)	70 (11.6)	2 (0.3)	6 (1.0)	605

Statistical description of image annotated data

In addition, within the statistical description of the annotated data, we incorporated tumor location information from NLST clinical data²⁵, along with other details such as tumor volume, the number of annotated lesions per patient, and lesion type (tumor or nodule). Each patient had between 79 and 545 LDCT slices, with slice thickness ranging from 0.625 mm to 3.5 mm, and pixel spacing in the X and Y dimensions ranging from 0.49 mm to 0.87 mm. Further details can be found in the file “Lable.xlsx”.

A total of 715 lesions were annotated, including 662 tumors and 53 nodules. As a share of all lesions, 239 tumors (33.4%) were 1 cm³ to 5 cm³ in volume, 149 (20.8%) were smaller than 0.5 cm³, and 112 (15.7%) were 0.5 cm³ to 1 cm³. Of the 53 nodules, 28 (52.8%) were smaller than 0.5 cm³. There were 265 lesions (37.1%) annotated in the right upper lobe, followed by 171 lesions (23.9%) in the left upper lobe (Table 3).

Table 3. Lung location number and lung lesion segment volume number.

Lung lesions (N = 715)
Segment volume
Tumor	Number (%)
<0.5 (cm³)	149 (20.8)
0.5~1 (cm³)	112 (15.7)
1~5 (cm³)	239 (33.4)
5~10 (cm³)	77 (10.8)
10~50 (cm³)	65 (9.1)
>50 (cm³)	15 (2.1)
Nodule	53 (7.4)
Total, n; mean ± SD (range), cm³	715; 7.64 ± 27.28 (0.03–372.21)
Lung cancer location	Number (%)
Left lower lobe	93 (13.0)
Left upper lobe	171 (23.9)
Left Hilum	10 (1.4)
Right lower lobe	124 (17.3)
Right middle lobe	33 (4.6)
Right upper lobe	265 (37.1)
Right Hilum	13 (1.8)
Mediastinum	2 (0.3)
Other	4 (0.6)

Among the 715 lesions, the largest had a volume of 372.21 cm³, while the smallest was 0.03 cm³. Of the 605 patients, 71 had two or more lesions annotated: 49 patients (69%) had two lesions, and 22 patients (31%) had three or more.

In Stage IA, 343 lesions were annotated, with a median volume of 0.9 cm³ (Fig. 6) and a minimum volume of 0.03 cm³, which was the smallest lesion among all stages. Unusually, some early-stage lesions were large and exceeded the size range expected for their stage (Fig. 3d). For example, the maximum lesion volume reached 252.8 cm³ in stage IB (Table 4) and 15.66 cm³ in stage IA. We observed that the lesion sizes in the pathology reports of surgically resected lesions in the NLST clinical data did not correspond to the lesion sizes observed on imaging, as a lesion may shrink after being resected and embedded in the specimen material. The final stage is based on the pathology report.

Fig. 6 [Images not available. See PDF.]

Distribution of Tumor Staging and Median Values. This figure represents the structural composition of the dataset.

Table 4. Lung stage and statistical variables.

Tumor Volume (cm³)	n	Median	Min-Max (Range)
Stage
Stage IA	343	0.9	(0.03–15.7)
Stage IB	80	2.0	(0.11–252.8)
Stage IIA	44	2.1	(0.11–80.3)
Stage IIB	20	1.5	(0.1–97.4)
Stage IIIA	86	3.2	(0.07–132.5)
Stage IIIB	16	4.7	(0.21–37.3)
Stage IV	116	2.6	(0.04–372.2)
Carcinoid, cannot be assessed	3	2.4	(0.86–5)
Unknown, cannot be assessed	7	1.7	(0.41–5.4)

Kruskal-Wallis test.

Based on the tumor volume and cancer staging results, our dataset focuses on the annotation of small lesions in early-stage lung cancer, where predicting small lesions still presents significant challenges. Our dataset provides comprehensive data that may be valuable in lung cancer research and has the potential to further advance studies in this field.

Validation of the dataset with a 2D U-Net model

To validate the usability of our annotated dataset, we conducted a training of a 2D U-Net³¹ model to predict pulmonary lesions. All annotated lesions were cross-verified by two radiation oncologists and one radiologist to ensure the accuracy of the annotations.

The model achieved an Intersection over Union (IoU) of 0.95 on the training set, demonstrating strong learning performance within the training data. On the testing set, however, the IoU was 0.42. The high performance on the training set and low performance on the testing set imply overfitting to the training data. In addition, since most of the annotated lesions were smaller than 5 cm³, the model sometimes misclassified small blood vessels as lung lesions. Example predictions on the testing set are shown in Fig. 7.
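For reference, the IoU between a predicted mask and a ground-truth mask can be computed as in the sketch below. How the reported scores were aggregated (per slice, per case, or over the whole set) is not stated, so this example only shows the per-mask definition; the function name and the convention of returning 1.0 when both masks are empty are assumptions.

```python
import numpy as np


def binary_iou(pred, target):
    """Intersection over Union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0          # both masks empty: treated as a perfect match
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


# Toy 2 x 2 lesion and a prediction shifted by one column.
target = np.zeros((4, 4), dtype=np.uint8)
target[1:3, 1:3] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 2:4] = 1

iou = binary_iou(pred, target)
print(round(iou, 3))        # 2 shared pixels / 6 pixels in the union
```

The toy case shows how sensitive IoU is for small lesions: a one-pixel shift of a 2×2 region already lowers the score to one third.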

Fig. 7 [Images not available. See PDF.]

Model prediction result. The figure illustrates the use of CT images from three different patients for predictions made by the model. The left side represents the ground truth, indicated in green, while the right side shows the predictions, represented in blue or red. The prediction results indicate that the model can effectively predict pulmonary lesions.

These findings indicate that more well-annotated datasets are needed in the lung cancer segmentation area and suggest that our annotated dataset holds potential for training lung cancer prediction models.

Acknowledgements

The authors thank the National Cancer Institute for access to NCI’s data collected by the National Lung Screening Trial (NLST). This work was supported by Taichung Veterans General Hospital, Taiwan, under Grants of TCVGH-YM1130107 and TCVGH-YM1120113. Thanks to the NLST for providing the free open-access database, and to TCIA for providing the platform to download the dataset. Special thanks to the Biostatistics Group, Department of Medical Research, Taichung Veterans General Hospital for their assistance with the statistical analysis.

Author contributions

The division of work among the contributors to this article was as follows: Kun-Hui Chen: Research design and conceptualization, manuscript editing. Yi-Hui Lin: Research design and conceptualization, dataset annotation and review, manuscript editing. Shawn Wu: dataset annotation and review. Nai-Wen Shih: data collection, statistical analysis, manuscript writing, dataset annotation, and model testing. Hsing-Chen Meng: Data preprocessing, model training. Yen-Yu Lin: Model guidance and revisions. Chun-Rong Huang: Model guidance and revisions. Jing-Wen Huang: Dataset annotation review, clinical expertise and guidance.

Code availability

The model used in the Technical Validation section has been uploaded to a GitHub repository (https://github.com/irene2023study/NLSTseg). The model was trained with PyTorch 1.13.1, and Pillow 9.3.0 was used for the 3D-to-2D image conversion.

Competing interests

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Kratzer, T. B. et al. Lung cancer statistics, 2023. Cancer. 1-19. https://doi.org/10.1002/cncr.35128 (2024).

2. Goldstraw, P et al. The IASLC Lung Cancer Staging Project: Proposals for Revision of the TNM Stage Groupings in the Forthcoming (Eighth) Edition of the TNM Classification for Lung Cancer. Journal of thoracic oncology: official publication of the International Association for the Study of Lung Cancer; 2016; 11, 1 pp. 39-51. [DOI: https://dx.doi.org/10.1016/j.jtho.2015.09.009] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26762738]

3. National Lung Screening Trial Research Team et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. The New England journal of medicine; 2011; 365, 5 pp. 395-409. [DOI: https://dx.doi.org/10.1056/NEJMoa1102873]

4. de Koning, HJ et al. Reduced Lung-Cancer Mortality with Volume CT Screening in a Randomized Trial. The New England journal of medicine; 2020; 382, 6 pp. 503-513. [DOI: https://dx.doi.org/10.1056/NEJMoa1911793] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31995683]

5. Horeweg, N et al. Characteristics of lung cancers detected by computer tomography screening in the randomized NELSON trial. American journal of respiratory and critical care medicine; 2013; 187, 8 pp. 848-54. [DOI: https://dx.doi.org/10.1164/rccm.201209-1651OC] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23348977]

6. Moyer, VA U.S. Preventive Services Task Force. Screening for lung cancer: U.S. Preventive Services Task Force recommendation statement. Annals of internal medicine; 2014; 160, 5 pp. 330-8. [DOI: https://dx.doi.org/10.7326/M13-2771] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24378917]

7. Wood, DE et al. Lung Cancer Screening, Version 3.2018, NCCN Clinical Practice Guidelines in Oncology. Journal of the National Comprehensive Cancer Network: JNCCN; 2018; 16, 4 pp. 412-441. [DOI: https://dx.doi.org/10.6004/jnccn.2018.0020] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29632061]

8. Meza, R et al. Evaluation of the Benefits and Harms of Lung Cancer Screening With Low-Dose Computed Tomography: Modeling Study for the US Preventive Services Task Force. JAMA; 2021; 325, 10 pp. 988-997. [DOI: https://dx.doi.org/10.1001/jama.2021.1077] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33687469][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9208912]

9. Armato, SG, 3rd et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical physics; 2011; 38, 2 pp. 915-31. [DOI: https://dx.doi.org/10.1118/1.3528204] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21452728][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041807]

10. Ferreira, CA et al. LNDb v4: pulmonary nodule annotation from medical reports. Scientific data; 2024; 11, 1 512. [DOI: https://dx.doi.org/10.1038/s41597-024-03345-6] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38760418][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11101445]

11. Pedrosa, J. et al. LNDb: A Lung Nodule Database on Computed Tomography. ArXiv abs/1911.08434: n. pag (2019).

12. Pedrosa, J et al. LNDb challenge on automatic lung cancer patient management. Medical image analysis; 2021; 70, 102027. [DOI: https://dx.doi.org/10.1016/j.media.2021.102027] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33740739]

13. Setio, AAA et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med Image Anal; 2017; 42, pp. 1-13. [DOI: https://dx.doi.org/10.1016/j.media.2017.06.015] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28732268]

14. Aerts, H. J. W. L. et al. Data From NSCLC-Radiomics (version 4) [Data set]. The Cancer Imaging Archive, https://doi.org/10.7937/K9/TCIA.2015.PF0M9REI (2014).

15. Aerts, H et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun; 2014; 5, 4006. [DOI: https://dx.doi.org/10.1038/ncomms5006] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24892406]

16. Sankar, SP; George, DE. RETRACTED ARTICLE: Regression Neural Network segmentation approach with LIDC-IDRI for lung lesion. J Ambient Intell Human Comput; 2021; 12, pp. 5571-5580. [DOI: https://dx.doi.org/10.1007/s12652-020-02069-w]

17. Choi, W et al. CIRDataset: A large-scale Dataset for Clinically-Interpretable lung nodule Radiomics and malignancy prediction. Medical image computing and computer-assisted intervention: MICCAI… International Conference on Medical Image Computing and Computer-Assisted Intervention; 2022; 2022, pp. 13-22. [DOI: https://dx.doi.org/10.1007/978-3-031-16443-9_2] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36198166]

18. Dong, X et al. Multi-view secondary input collaborative deep learning for lung nodule 3D segmentation. Cancer imaging: the official publication of the International Cancer Imaging Society; 2020; 20, 1 53. [DOI: https://dx.doi.org/10.1186/s40644-020-00331-0] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32738913]

19. Suji, RJ et al. Optical Flow Methods for Lung Nodule Segmentation on LIDC-IDRI Images. Journal of digital imaging; 2020; 33, 5 pp. 1306-1324. [DOI: https://dx.doi.org/10.1007/s10278-020-00346-w] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/32556911][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7572960]

20. Ardila, D et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature medicine; 2019; 25, 6 pp. 954-961. [DOI: https://dx.doi.org/10.1038/s41591-019-0447-x] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31110349]

21. Morgan, M. et al. Lung-RADS. Reference article, Radiopaedia.org (Accessed on 10 May 2024) https://doi.org/10.53347/rID-32681.

22. Balagurunathan, Y et al. Semi-automated pulmonary nodule interval segmentation using the NLST data. Medical physics; 2018; 45, 3 pp. 1093-1107. [DOI: https://dx.doi.org/10.1002/mp.12766] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29363773][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5952359]

23. Masquelin, A. H et al. LDCT image biomarkers that matter most for the deep learning classification of indeterminate pulmonary nodules. Cancer biomarkers: section A of Disease markers, https://doi.org/10.3233/CBM-230444 (2024).

24. Mikhael, PG et al. Sybil: A Validated Deep Learning Model to Predict Future Lung Cancer Risk From a Single Low-Dose Chest Computed Tomography. Journal of clinical oncology: official journal of the American Society of Clinical Oncology; 2023; 41, 12 pp. 2191-2200. [DOI: https://dx.doi.org/10.1200/JCO.22.01345] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36634294]

25. National Lung Screening Trial Research Team. Data from the National Lung Screening Trial (NLST) [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.HMQ8-J677 (2013).

26. Clark, K et al. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging; 2013; 26, 6 pp. 1045-1057. [DOI: https://dx.doi.org/10.1007/s10278-013-9622-7] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23884657][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3824915]

27. Pinter, C et al. Polymorph segmentation representation for medical image computing. Computer methods and programs in biomedicine; 2019; 171, pp. 19-26. [DOI: https://dx.doi.org/10.1016/j.cmpb.2019.02.011] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30902247]

28. Fedorov, A et al. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magnetic resonance imaging; 2012; 30, 9 pp. 1323-41. [DOI: https://dx.doi.org/10.1016/j.mri.2012.05.001] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/22770690][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3466397]

29. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778, https://doi.org/10.1109/CVPR.2016.90 (2016).

30. Taichung Veterans General Hospital. NLSTseg: A Pixel-level Lung Cancer Dataset Based on NLST LDCT Images [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14838349 (2024).

© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Low-dose computed tomography (LDCT) is one of the most effective tools for early detection of lung cancer. With advancements in artificial intelligence, various Computer-Aided Diagnosis (CAD) systems are now available to support clinical practice. These systems help radiologists who must interpret a large volume of CT scans. However, their development depends on precisely annotated datasets, which are currently limited. Although several lung imaging datasets exist, only a few publicly available datasets provide segmentation annotations on LDCT images. To address this problem, we developed a dataset based on NLST LDCT images with pixel-level annotations of lung lesions. The dataset includes LDCT scans from 605 patients and 715 annotated lesions, including 662 lung tumors and 53 lung nodules. Lesion volumes range from 0.03 cm³ to 372.21 cm³, with 500 lesions smaller than 5 cm³, mostly located in the right upper lung. A 2D U-Net model trained on the dataset achieved an IoU of 0.95 on the training dataset. This dataset enhances the diversity and usability of lung cancer annotation resources.

Details

Title

NLSTseg: A Pixel-level Lung Cancer Dataset Based on NLST LDCT Images

Author

Chen, Kun-Hui¹; Lin, Yi-Hui²; Wu, Shawn³; Shih, Nai-Wen²; Meng, Hsing-Chen⁴; Lin, Yen-Yu⁵; Huang, Chun-Rong⁵; Huang, Jing-Wen⁶

¹ Department of Orthopedic Surgery, Taichung Veterans General Hospital, Taichung, Taiwan (ROR: https://ror.org/00e87hq62) (GRID: grid.410764.0) (ISNI: 0000 0004 0573 0731); Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan (ROR: https://ror.org/05vn3ca78) (GRID: grid.260542.7) (ISNI: 0000 0004 0532 3749); Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan (ROR: https://ror.org/03fcpsq87) (GRID: grid.412550.7) (ISNI: 0000 0000 9012 9465)
² Department of Radiation Oncology, Pingtung Veterans General Hospital, Pingtung City, Taiwan (ROR: https://ror.org/04jedda80) (GRID: grid.415011.0) (ISNI: 0000 0004 0572 9992)
³ Department of Diagnostic Imaging, SY Research Institute, Dallas, USA (ROR: https://ror.org/05kjf3v93) (GRID: grid.477883.7)
⁴ Graduate Degree Program of AI, National Yang Ming Chiao Tung University, Taichung, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017)
⁵ Department of Computer Science, National Yang Ming Chiao Tung University, Taichung, Taiwan (ROR: https://ror.org/00se2k293) (GRID: grid.260539.b) (ISNI: 0000 0001 2059 7017)
⁶ Department of Radiation Oncology, Taichung Veterans General Hospital, Taichung, Taiwan (ROR: https://ror.org/00e87hq62) (GRID: grid.410764.0) (ISNI: 0000 0004 0573 0731)

Pages

1475

Section

Data Descriptor

Publication year

2025

Publication date

2025

Publisher

Nature Publishing Group

e-ISSN

20524463

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/s41597-025-05742-x

ProQuest document ID

3242490495
