Abstract

Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation by large language models (LLMs) on the Google Earth Engine (GEE) platform. Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline—from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs—including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models—revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.

1. Introduction

General-purpose code refers to a set of program instructions written in languages like Python, C++, or Java, applied to tasks such as data processing, network communication, and algorithm implementation [1,2]. Programming allows users to translate logical intentions into executable tasks on computers [3]. With the rise of Transformer-based large language models (LLMs), models like GPT-4o, DeepSeek, Claude, and LLaMA have shown exceptional performance in code generation, benefiting from exposure to large-scale training data and advanced contextual understanding and generative capabilities [4]. These models enable users to generate code from natural language instructions, lowering the programming barrier [5]. Building on this foundation, domain-specific code generation models—such as DeepSeek Coder [6], Qwen2.5-Coder [7], and Code LLaMA [8]—have further improved accuracy and robustness through targeted training. However, model-generated code often faces issues like syntax errors, incorrect function calls, and missing dependencies, which impact its executability and logical consistency. This issue, known as “code hallucination,” remains a challenge [9]. To quantify model performance and guide iterative improvement, researchers have developed benchmark suites such as HumanEval [10], MBPP [11], and LiveCodeBench [12], which enable automated evaluation based on execution success rates and related metrics [13,14,15].

With the rapid expansion of high-resolution remote sensing and spatiotemporal data, the demand for customized geospatial analysis tools in the geosciences has significantly increased. In response to this demand, cloud platforms such as Google Earth Engine (GEE) have emerged, offering geoscientific functionalities through JavaScript and Python interfaces [16,17]. Unlike traditional GIS tools that rely on graphical user interfaces, GEE utilizes concise coding to automate complex workflows, supporting tasks such as remote sensing preprocessing, index computation, and time-series change detection [18]. Its “copy-paste-run” sharing mechanism has facilitated the widespread adoption of geospatial analytical methods, signifying the specialization of general-purpose code in the geosciences and gradually shaping the modern paradigm of “geospatial code as analysis” [19,20].

However, writing code on the GEE platform requires not only basic programming skills but also solid knowledge of geospatial analysis. This includes familiarity with core objects and operators (such as ‘ee.Image’ and ‘ee.FeatureCollection’), remote sensing datasets (e.g., Landsat, MODIS), spatial information concepts (e.g., coordinate systems and geographic projections), and methods for processing and integrating multisource data. As a result, the learning curve for GEE programming is significantly steeper than for general-purpose coding, and users without a geospatial background often encounter substantial barriers in practice [21,22]. In this context, leveraging LLMs to generate GEE code has emerged as a promising approach to lowering the entry barrier and enhancing development efficiency [23]. It was not until October 2024, with the publication of two systematic evaluation papers, that “geospatial code generation” was formally proposed as an independent research task. These studies extended the general NL2Code paradigm into natural language to geospatial code (NL2GeospatialCode), providing a theoretical foundation for the field [24,25]. Since then, research in this area has advanced rapidly, with several optimization strategies emerging: the CoP strategy uses prompt chaining to guide task decomposition and generation [26]. Geo-FuB [27] and GEE-OPs [22] build functional semantic and function invocation knowledge bases, respectively, enhancing accuracy through retrieval-augmented generation (RAG); GeoCode-GPT, a Codellama variant fine-tuned on geoscientific code corpora, became the first LLM dedicated to this task [21]. However, due to limited training resources, geospatial code accounts for only a small fraction of pretraining data. As a result, models are more prone to “code hallucination” in geospatial code generation tasks than in general domains [24,25]. Typical issues include function invocation errors, object type confusion, missing filter conditions, loop structure errors, semantic band mapping errors, type mismatch errors, invalid type conversions, and missing required parameters. Figure 1 illustrates some of these error types, with additional error types provided in Appendix A. These issues severely compromise code executability and the reliability of analytical results. Therefore, establishing a systematic evaluation framework for geospatial code generation is essential. It not only helps to clarify the performance boundaries of current models in geospatial tasks but also provides theoretical and practical support for developing future high-performance, low-barrier geospatial code generation models [26].

At present, a few studies have begun exploring evaluation mechanisms for geospatial code generation tasks. Notable efforts include GeoCode-Bench [24] and GeoCode-Eval [21] proposed by Wuhan University, and the GeoSpatial-Code-LLMs dataset developed by Wrocław University of Science and Technology [25]. GeoCode-Bench primarily relies on multiple-choice, true/false, and open-ended questions, focusing mainly on the understanding of textual knowledge required for code construction. Code-related tasks rely on expert manual scoring, which increases evaluation costs, introduces subjectivity, and limits reproducibility. Similarly, GeoCode-Eval depends on human evaluation and emphasizes complex test cases, but lacks systematic testing of basic functions and common logical combinations, which hinders a fine-grained analysis of model capabilities. The GeoSpatial-Code-LLMs dataset attempts to introduce automated evaluation mechanisms, but currently supports only limited data types, excluding multimodal data such as imagery, vector, and raster formats, and contains only about 40 samples. There is an urgent need to develop an end-to-end, reproducible, and unit-level evaluation benchmark that supports automated assessment and encompasses diverse multimodal geospatial data types.

In response to the aforementioned needs and challenges, this study proposes AutoGEEval, an automated evaluation framework for GEE geospatial code generation tasks based on LLMs, as shown in Figure 2. The framework comprises three key components: the AutoGEEval-Bench test suite (Figure 2a, Section 2), the Submission Program (Figure 2b, Section 3.1), and the Judge Program (Figure 2c, Section 3.2). It supports multimodal data types and unit-level assessment, implemented via GEE’s Python API (earthengine-api, version 1.3.1). The Python interface, running locally in environments such as Jupyter Notebook (version 6.5.4) and PyCharm (version 2023.1.2), removes reliance on the GEE web editor, aligns better with real-world practices, and is more widely adopted than the JavaScript version. It also enables automated error detection and feedback by capturing console outputs and runtime exceptions, facilitating a complete evaluation workflow. In contrast, the JavaScript interface is confined to GEE’s online platform, which restricts automation.

The main contributions of this study are summarized as follows:

We design, implement, and open-source AutoGEEval, the first automated evaluation framework for geospatial code generation on GEE using LLMs. The framework supports end-to-end automation of test execution, result verification, and error type analysis across multimodal data types at the unit level.

We construct and release AutoGEEval-Bench, a geospatial code benchmark comprising 1325 unit-level test cases spanning 26 distinct GEE data types.

We conduct a comprehensive evaluation of 18 representative LLMs across four categories—including GPT-4o, DeepSeek-R1, Qwen2.5-Coder, and GeoCode-GPT—by measuring execution pass rates for geospatial code generation tasks. In addition, we analyze model accuracy, resource consumption, execution efficiency, and error type distributions, providing insights into current limitations and future optimization directions.

The remainder of this paper is structured as follows: Section 2 describes the construction of the AutoGEEval-Bench test suite. Section 3 outlines the AutoGEEval evaluation framework, detailing the design and implementation of the Submission and Judge Programs. Section 4 describes the evaluated models, experimental setup, and evaluation metrics. Section 5 presents a systematic analysis of the results, and Section 6 discusses the experimental findings on geospatial code generation by large language models. Section 7 concludes by summarizing the contributions, identifying current limitations, and proposing directions for future research.

2. AutoGEEval-Bench

The AutoGEEval-Bench (Figure 2a) is built from the official GEE function documentation and contains 1325 unit test cases, automatically generated via the LLM-based Self-Design framework and covering 26 GEE data types. This section details the test case definition, design approach, construction method, and final results.

2.1. Task Definition

Unit-level testing (Tunit) is designed to evaluate a model’s ability to understand the invocation semantics, parameter structure, and input–output specifications of each API function provided by the platform. The goal is to assess whether the model can generate a syntactically correct and semantically valid function call based on structured function information, such that the code executes successfully and produces the expected result. This task simulates one of the most common workflows for developers—“consulting documentation and writing function calls”—and serves as a capability check at the finest behavioral granularity. Each test case corresponds to a single, independent API function and requires the model to generate executable code that correctly invokes the function with appropriate inputs and yields the expected output.

Let F denote the set of functions provided in the public documentation of the Earth Engine platform.

(1) $F = \{ f_1, f_2, \ldots, f_N \}, \quad f_i \in \mathrm{GEE\_API}$

The task of each model under evaluation is to generate a syntactically correct and executable code snippet Ci within the Earth Engine Python environment.

(2) $T_{\mathrm{unit}} : f_i \mapsto C_i$

Define a code executor Exec(·), where yi denotes the result object returned after executing the code snippet Ci.

(3) $\mathrm{Exec}(C_i) = y_i$

Let Ai denote the expected output (ground-truth answer). The evaluation metric is defined based on the comparison between yi and Ai, where the symbol “=” may represent strict equality, approximate equality for floating-point values, set containment, or other forms of semantic equivalence.

(4) $\mathrm{unit}(y_i, A_i) = \begin{cases} 0, & \text{if } \mathrm{Exec}(C_i) = A_i \\ 1, & \text{otherwise} \end{cases}$

2.2. Structural Design

All test cases are generated by the flagship LLM Qwen2.5-Max, developed by Alibaba, using predefined prompts and reference data, and subsequently verified by human experts (see Section 2.3 for details). Each complete test case consists of six components: the function header, reference code snippet (Reference_code), parameter list (Parameters_list), output type (Output_type), output path (Output_path), and the expected answer (Expected_answer). Let the set of unit test cases be denoted as

(5) $Q = \{ q_1, q_2, \ldots, q_n \}$

Each test case qi is defined as a six-tuple:

(6) $q_i = \langle H_i, R_i, P_i, T_i, O_i, A_i \rangle$

The meaning of each component is defined as follows:

Hi (Function Header): Function declaration, including the ‘def’ statement, function name, parameter list, and a natural language description of the function’s purpose. It serves as the semantic prompt that guides the language model in generating the complete function body.

Ri (Reference Code): Reference code snippet representing the intended logic of the function. It is generated by Qwen2.5-Max based on a predefined prompt and executed by human experts to obtain the standard answer. During the testing phase, this component is completely hidden from the model, which must independently complete a functionally equivalent implementation based solely on Hi.

Pi (Parameter List): Parameter list specifying the concrete values to be injected into the function during testing, thereby constructing a runnable execution environment.

Ti (Output Type): Output type indicating the expected data type returned by the function, used to enforce format constraints on the model’s output. Examples include numeric values, Boolean values, dictionaries, or layer objects.

Oi (Output Path): Output path specifying where the execution result of the generated code will be stored. The testing system retrieves the model’s output from this path.

Ai (Expected Answer): Expected answer, the correct output obtained by executing the reference code with the given parameters. It serves as the ground-truth reference for evaluating the accuracy of the model’s output.
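
For illustration, the sketch below shows how such a six-tuple might be serialized as a single benchmark record in Python. The field names mirror the six components above, while the concrete function, parameter values, and paths are hypothetical examples rather than actual AutoGEEval-Bench entries.

```python
# Hypothetical unit test case record following the six-tuple
# (Hi, Ri, Pi, Ti, Oi, Ai); all concrete values are illustrative only.
test_case = {
    "Function_header": (
        "def add_numbers(a, b):\n"
        '    """Return the sum of two numbers as an ee.Number object."""'
    ),
    "Reference_code": (
        "def add_numbers(a, b):\n"
        "    return ee.Number(a).add(ee.Number(b))"
    ),
    "Parameters_list": {"a": 3, "b": 4},
    "Output_type": "ee.Number",
    "Output_path": "./answers/add_numbers.json",
    "Expected_answer": 7,
}
```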

2.3. Construction Methodology

The unit test cases are constructed based on the official GEE reference documentation, specifically the Client Libraries section, available at https://developers.google.com/earth-engine/apidocs (accessed on 29 June 2025), which includes a total of 1374 functions. Each function page provides the full function name, a description of its functionality, usage examples, return type, parameter names and types, and parameter descriptions. Some pages include sample code demonstrating function usage, while others do not. Prior to constructing the test cases, we manually executed all functions to validate their operability. This process revealed that 49 functions were deprecated or non-functional due to version updates, and were thus excluded. The final set of valid functions incorporated into the unit test suite includes 1325 functions. We extracted relevant information from each function page and organized it into a JSON structure. A corresponding prompt template was then designed (see Figure 3) to guide the LLM in parsing the structured documentation and automatically generating unit-level test items.

After initial generation, all test cases were manually verified by a panel of five experts with extensive experience in GEE usage and geospatial code development. The verification process ensured that each test task reflects a valid geospatial analysis need, has a clear and accurate problem definition, and is configured with appropriate test inputs. Any test case exhibiting execution errors or incomplete logic was revised and corrected by the experts based on domain knowledge. For test cases that execute successfully and produce the expected results, the output is stored at the specified ‘output_path’ and serves as the ground-truth answer for that item. During the testing phase, the Judge Program retrieves the reference result from this path and compares it against the model-generated output to compute consistency-based accuracy metrics.

2.4. Construction Results

The distribution and proportion of each data type in AutoGEEval-Bench are detailed in Table 1.

The 26 GEE data types covered in AutoGEEval-Bench can be broadly categorized into two groups. The first group consists of text-based formats, such as dictionaries, arrays, lists, strings, and floating-point numbers. The second group includes topology-based formats, such as geometries, imagery, and GeoJSON structures. Representative unit test cases from AutoGEEval-Bench are presented in this paper. Figure 4 showcases a typical test case involving text-based GEE data types using ‘ee.Array’, while Figure 5 illustrates a task related to topology-based data types using ‘ee.Image’. Additional test cases are provided in Appendix B.

3. Submission and Judge Programs

The AutoGEEval framework relies on two main components during evaluation: the Submission Program, which generates and executes code based on tasks in AutoGEEval-Bench, and the Judge Program, which compares the model’s output to the correct answers. This section outlines the workflow design of both programs.

3.1. Submission Program

The overall workflow of the Submission Program is illustrated in Figure 2b and consists of three main tasks: answer generation, execution, and result saving. In the answer generation stage, the system utilizes a prompt template to guide the target LLM to respond to each item in AutoGEEval-Bench sequentially. The model generates code based solely on the function header, from which it constructs the corresponding function body. During the execution stage, the execution module reads the parameter list and substitutes the specified values into the formal parameters of the generated code. The code is then executed within the Earth Engine environment. Finally, the execution result is saved to the specified location and file name, as defined by the output path. It is important to note that the prompt is carefully designed to instruct the model to output only the final answer, avoiding any extraneous or irrelevant content. The detailed prompt design is shown in Figure 6.
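
A minimal sketch of this three-stage loop is given below. It assumes a generic generate_code callable that wraps the model under test and returns a complete function definition, and it reuses the field names from the hypothetical record in Section 2.2; the actual Submission Program and its prompt template follow Figure 6.

```python
import json

import ee

ee.Initialize()  # assumes Earth Engine credentials are already configured


def run_test_case(case, generate_code):
    """Generate, execute, and save the answer for one AutoGEEval-Bench item.

    `generate_code` stands in for the LLM call: it receives the function
    header and is expected to return a complete function definition.
    """
    # 1. Answer generation: the model sees only the function header.
    code = generate_code(case["Function_header"])

    # 2. Execution: define the generated function and call it with the
    #    concrete values from the parameter list.
    namespace = {"ee": ee}
    exec(code, namespace)
    func_name = case["Function_header"].split("(")[0].replace("def", "").strip()
    result = namespace[func_name](**case["Parameters_list"])

    # 3. Result saving: materialize the server-side object and store it at
    #    the output path that the Judge Program will later read.
    value = result.getInfo() if hasattr(result, "getInfo") else result
    with open(case["Output_path"], "w") as f:
        json.dump(value, f)
```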

3.2. Judge Program

The overall workflow of the Judge Program is illustrated in Figure 2c. Its primary function is to read the execution results from the specified ‘Output_path’, select the appropriate evaluation logic based on the declared ‘Output_type’, and compare the model’s output against the ‘Expected_answer’. The core challenge of the Judge Program lies in accurately assessing correctness across different output data types. As shown in Table 1, AutoGEEval-Bench defines 26 categories of GEE data types. However, many of these types share overlapping numerical representations. For example, although ‘ee.Array’, ‘ee.ConfusionMatrix’, and ‘ee.ArrayImage’ are different in type, they are all expressed as arrays in output. Similarly, ‘ee.Dictionary’, ‘ee.Blob’, and ‘ee.Reducer’ are represented as dictionary-like structures at runtime. Furthermore, ‘ee.Geometry’, ‘ee.Feature’, and ‘ee.FeatureCollection’ all serialize to the GeoJSON format, while both ‘ee.String’ and ‘ee.Boolean’ are represented as strings. Given these overlaps, the Judge Program performs unified categorization based on the actual value representation—such as arrays, dictionaries, GeoJSON, or strings—and applies corresponding matching strategies to ensure accurate and fair evaluation across diverse GEE data types. AutoGEEval summarizes the value representations and matching strategies for each GEE data type in Table 2.
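
The type-dependent matching described above can be sketched as follows. This is a simplified illustration of the comparison logic under the value-representation groups of Table 2, not the framework’s actual implementation; the tolerance values and the handling of collection types are assumptions.

```python
import json
import math

import numpy as np
from shapely.geometry import shape


def judge(output_path, output_type, expected_answer, tol=1e-3):
    """Compare a saved model output against the expected answer.

    Dispatches on the value representation of the declared GEE output type,
    mirroring the grouping in Table 2. Tolerances are illustrative, and
    ee.ImageCollection / ee.FeatureCollection would need per-element handling.
    """
    with open(output_path) as f:
        actual = json.load(f)

    array_like = {"ee.Array", "ee.ConfusionMatrix", "ee.ArrayImage"}
    geojson_like = {"ee.Geometry", "ee.Feature"}
    string_like = {"ee.String", "ee.BOOL"}

    if output_type in array_like:
        return np.allclose(np.array(actual), np.array(expected_answer), atol=tol)
    if output_type in geojson_like:
        # For Features, compare only the geometry member.
        geom_a = actual.get("geometry", actual)
        geom_e = expected_answer.get("geometry", expected_answer)
        return shape(geom_a).equals(shape(geom_e))
    if output_type in string_like:
        return str(actual) == str(expected_answer)
    if output_type == "ee.Number":
        return math.isclose(float(actual), float(expected_answer), abs_tol=tol)
    if output_type in {"ee.Date", "ee.DateRange"}:
        # Table 2: compare the millisecond timestamp stored under 'value'.
        return actual["value"] == expected_answer["value"]
    # Dictionary-like types (ee.Dictionary, ee.Reducer, ee.Filter, ...).
    return actual == expected_answer
```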

4. Experiments

The framework also supports the automated monitoring of resource usage and execution efficiency. In the experimental evaluation, we assessed various models, including general-purpose, reasoning-enhanced, code generation, and geospatial-specific models. This section covers the model selection, experimental setup, evaluation metrics, and runtime cost considerations.

4.1. Evaluated Models

The models evaluated in this study are selected from among the most advanced and widely adopted LLMs as of April 2025. All selected models have either undergone peer review or have been publicly released through open-source or open-access channels. The aim is to align with the growing user preference for end-to-end, easy-to-use models and to provide informative references for both practical application and academic research. It is important to note that optimization strategies such as prompt engineering, RAG, and agent-based orchestration are not included in this evaluation. These strategies do not alter the core model architecture, and their effectiveness is highly dependent on specific design choices, often resulting in unstable performance. Moreover, they are typically tailored for specific downstream tasks and were not originally intended for unit-level testing, making their inclusion in this benchmark neither targeted nor meaningful. Additionally, such strategies often involve complex prompts that consume a large number of tokens, thereby compromising the fairness and efficiency of the evaluation process.

The evaluated models span four categories: (1) general-purpose non-reasoning LLMs, (2) general-purpose reasoning-enhanced LLMs, (3) general-domain code generation models, and (4) task-specific code generation models tailored for geospatial applications. For some models, multiple publicly available parameter configurations are evaluated. Counting different parameter versions as independent models, a total of 18 models are assessed. Detailed specifications of the evaluated models are provided in Table 3.

4.2. Experimental Setup

In terms of hardware configuration and parameter settings, a local computing device equipped with 32 GB RAM and an RTX 4090 GPU was used. During model inference, open-source models with parameter sizes not exceeding 16 B were deployed locally using the Ollama tool; for larger open-source models and proprietary models, inference was conducted via their official API interfaces to access cloud-hosted versions.

For parameter settings, the generation temperature was set to 0.2 for non-reasoning models to enhance the determinism and stability of outputs. For reasoning-enhanced models, following existing research practices, no temperature was specified, preserving the models’ native inference capabilities. In addition, the maximum output token length for all models was uniformly set to 4096 to ensure complete responses and prevent truncation due to excessive length.
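
As an illustration of these settings, the sketch below sends a single generation request to a locally served open-source model through Ollama’s HTTP endpoint, applying the temperature of 0.2 and the 4096-token output cap described above. The model tag, endpoint, and timeout are assumptions, not the exact scripts used in the study.

```python
import requests


def query_local_model(prompt, model="qwen2.5-coder:7b"):
    """Query a locally deployed model served by Ollama.

    Temperature 0.2 and a 4096-token output limit follow the experimental
    settings in this section; the model tag and endpoint are placeholders.
    """
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.2, "num_predict": 4096},
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]
```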

Time consumption and task descriptions for each phase are provided in Table 4.

4.3. Evaluation Metrics

This study evaluates the performance of LLMs in geospatial code generation tasks along six dimensions: accuracy metrics, image metrics, resource consumption metrics, operational efficiency metrics, ranking schemes, and error type logs.

4.3.1. Accuracy Metrics

This study adopts pass@n as the primary accuracy metric [38]. It measures the probability that a correct answer is generated at least once within n independent attempts for the same test case. This is a widely used standard for evaluating both the correctness and stability of model outputs. Given the known hallucination issue in LLMs—where inconsistent or unreliable results may be produced for identical inputs—a single generation may not be representative. Therefore, we evaluate the models under three configurations, n = 1, 3, 5, to enhance the robustness and credibility of the assessment.

(7) $\text{pass@}n = 1 - \frac{C_n}{N}$

where N is the total number of generated samples and Cn is the number of incorrect samples among them.

In addition, we introduce the coefficient of variation (CV) to assess the stability of the pass@1, pass@3, and pass@5 scores. This metric helps to evaluate the variability in model performance across multiple generations, serving as an indirect indicator of the severity of hallucination.

(8) $\mathrm{CV} = \frac{\sigma}{\mu}$

where σ is the standard deviation and μ is the mean. A smaller CV indicates higher stability in model performance.

To more comprehensively evaluate model behavior, we further introduce the stability-adjusted accuracy (SA), which integrates both accuracy and stability into a single metric. Specifically, a higher pass@5 score (accuracy) and a lower CV score (stability) result in a higher SA score. The calculation is defined as

(9) $\mathrm{SA} = \frac{\text{pass@}5}{1 + \mathrm{CV}}$
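
The three accuracy-related quantities can be computed as below, interpreting Equation (7) as the fraction of test cases solved at least once within the first n attempts. The sketch assumes per-attempt pass/fail records and is illustrative rather than the framework’s evaluation script.

```python
import statistics


def pass_at_n(results, n):
    """Fraction of test cases solved at least once within the first n attempts.

    `results` maps each test case id to a list of booleans, one per attempt.
    """
    solved = sum(1 for attempts in results.values() if any(attempts[:n]))
    return solved / len(results)


def accuracy_metrics(results):
    """Compute pass@1/3/5, their coefficient of variation (CV), and SA."""
    scores = [pass_at_n(results, n) for n in (1, 3, 5)]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # population std; Eq. (8) does not specify
    cv = sigma / mu if mu > 0 else 0.0
    sa = scores[-1] / (1 + cv)         # Eq. (9): SA = pass@5 / (1 + CV)
    return {"pass@1": scores[0], "pass@3": scores[1], "pass@5": scores[2],
            "CV": cv, "SA": sa}
```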

4.3.2. Image Metrics

Among the GEE output types, ee.Image and ee.ImageCollection represent raster data structured on a per-pixel basis. For these data types, the outputs of model-generated code are evaluated against ground-truth reference images using pixel-wise comparison. This study employs three quantitative metrics—mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM)—to assess image similarity.

MAE [39] measures the average absolute difference between corresponding pixel values in the generated and reference images, providing a direct indication of pixel-level reconstruction accuracy.

(10) $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| I_{\mathrm{pred}}(i) - I_{\mathrm{true}}(i) \right|$

Here, I_pred(i) and I_true(i) denote the pixel values at location i in the predicted image and the ground-truth reference image, respectively, and N represents the total number of pixels in the image.

PSNR [40] quantifies the level of distortion between two images and is commonly used to assess the quality of image compression or reconstruction. A higher PSNR value indicates that the predicted image is closer in quality to the original reference. The PSNR is computed as follows:

(11) $\mathrm{PSNR} = 10 \log_{10} \left( \frac{R^2}{\mathrm{MSE}} \right)$

Here, R denotes the maximum possible pixel value in the image, typically 255 for 8-bit images. MSE refers to the mean squared error, defined as follows:

(12) $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( I_{\mathrm{pred}}(i) - I_{\mathrm{true}}(i) \right)^2$

SSIM [41] evaluates image quality by comparing luminance, contrast, and structural information between images, providing a perceptual similarity measure that aligns more closely with human visual perception. The SSIM is computed as follows:

(13) $\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$

Here, μx and μy denote the mean pixel values of images x and y, respectively; σx² and σy² represent the variances; σxy is the covariance between x and y. Constants C1 and C2 are used to stabilize the division and prevent numerical instability.

A test case is considered passed only if all three criteria are simultaneously satisfied: MAE ≤ 0.01, PSNR ≥ 50, and SSIM ≥ 0.99.
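
The combined pass rule can be sketched with NumPy and scikit-image as below; the use of scikit-image, the default data range, and the assumption of single-band inputs of identical shape are ours, not necessarily the framework’s implementation.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def image_passes(pred, true, data_range=255.0):
    """Apply the MAE/PSNR/SSIM pass rule to a predicted vs. reference image.

    `pred` and `true` are single-band NumPy arrays of identical shape;
    `data_range` is the value range R used in Equations (11) and (13).
    """
    pred = pred.astype(np.float64)
    true = true.astype(np.float64)

    mae = float(np.mean(np.abs(pred - true)))                          # Eq. (10)
    psnr = peak_signal_noise_ratio(true, pred, data_range=data_range)  # Eq. (11)
    ssim = structural_similarity(true, pred, data_range=data_range)    # Eq. (13)

    # A test case passes only if all three thresholds hold simultaneously.
    return mae <= 0.01 and psnr >= 50 and ssim >= 0.99
```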

4.3.3. Resource Consumption Metrics

Resource consumption metrics measure the computational resources and cost required for a model to complete the testing tasks. This study considers three key metrics:

Token Consumption (Tok.): Refers to the average number of tokens required to complete each unit test case. For locally deployed models, this metric reflects hardware resource usage; for commercial models, token consumption directly correlates with monetary cost. Most mainstream APIs charge based on the number of tokens processed (typically per 1 million tokens), and pricing varies significantly across models. As of April 2025, GPT-4 Turbo is priced at USD 10.00/1M tokens, Claude 3 Opus at USD 15.00/1M tokens, DeepSeek-Coder at USD 0.60/1M tokens, and Qwen2-72B at USD 0.80/1M tokens. Therefore, token usage is a critical indicator of both inference cost and model accessibility.

Inference Time (In.T): Refers to the average response time (in seconds) required by the model to generate each test case. This metric reflects latency and response efficiency, both of which directly impact user experience.

Code Lines (Co.L): Measures the number of core executable lines of code generated by the model, excluding comments, natural language explanations, and auxiliary prompts. Compared to token count, code line count provides a more accurate assessment of the model’s actual code generation capability, filtering out token inflation caused by unnecessary text in the reasoning process.

4.3.4. Operational Efficiency Metrics

Operational efficiency metrics are used to assess a model’s accuracy per unit of resource consumption, thereby reflecting its cost-effectiveness. This study defines inference efficiency, token efficiency, and code line efficiency based on three resource dimensions: time, token usage, and code structure. It is important to note that, to ensure comparability and fairness across models in terms of generation attempts and to reduce the variance caused by random sampling, all resource consumption metrics reported in this study are averaged over five generations. Therefore, pass@5 is uniformly adopted as the reference accuracy metric in all efficiency calculations.

Inference Efficiency (In.T-E): Inference efficiency refers to the average accuracy achieved by a model per unit time, calculated as the ratio of accuracy to average inference time (in seconds). This metric evaluates the model’s ability to balance response speed and output quality. The shorter the inference time, the higher the accuracy achieved per unit time, indicating a more efficient utilization of computational resources and better interactive performance.

(14) $\text{Inference Efficiency} = \frac{\text{pass@}5}{\text{Inference Time}}$

Token Efficiency (Tok.-E): Token efficiency measures the accuracy achieved per unit of token consumption, calculated as the ratio of accuracy to the average number of tokens used. This metric reflects the economic efficiency of the generation process and effectively supports cross-model comparisons in terms of cost–performance.

(15) $\text{Token Efficiency} = \frac{\text{pass@}5}{\text{Token Consumption}}$

Code Line Efficiency (Co.L-E): Code line efficiency refers to the accuracy achieved per line of core executable code, emphasizing the structural compactness and effectiveness of the generated logic. Unlike tokens, code lines exclude natural language explanations and prompt-related content, offering a more direct reflection of the model’s ability to produce high-quality, executable code for geospatial tasks. This metric is of particular value to developers, especially when evaluating code generation efficiency in practical engineering deployments.

(16) $\text{Code Line Efficiency} = \frac{\text{pass@}5}{\text{Code Lines}}$
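
The three ratios reduce to simple divisions over per-model averages, as in the sketch below; the numbers in the usage line are hypothetical.

```python
def efficiency_metrics(pass_at_5, inference_time_s, tokens, code_lines):
    """Compute the efficiency ratios of Equations (14)-(16)."""
    return {
        "In.T-E": pass_at_5 / inference_time_s,  # accuracy per second
        "Tok.-E": pass_at_5 / tokens,            # accuracy per token
        "Co.L-E": pass_at_5 / code_lines,        # accuracy per code line
    }


# Hypothetical example: pass@5 = 0.80, 5 s inference, 300 tokens, 8 code lines.
print(efficiency_metrics(0.80, 5.0, 300, 8))
```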

4.3.5. Rank

To facilitate systematic evaluation across multiple dimensions, we introduce a unified ranking scheme for all key performance indicators and apply it consistently in the subsequent result analysis.

For accuracy metrics, we adopt the pass@n metric (n = 1, 3, 5) as the primary indicator of model accuracy. The corresponding ranking is denoted as P_Rank, with higher scores indicating better performance and thus a higher rank. We further compute the coefficient of variation (CV) across pass@1, pass@3, and pass@5 to capture the stability of accuracy. The ranking based on CV is denoted as C_Rank, where lower CV values correspond to higher rankings. For the stability-adjusted accuracy metric (SA), the ranking is denoted as S_Rank, and a higher SA value leads to a higher rank. These three rankings are utilized in Table 6 (Section 5.1) and Table 9 (Section 5.3) to support comparative analysis.

For operational efficiency metrics, we independently rank three dimensions: token efficiency (Tok.-E), inference time efficiency (In.T-E), and code line efficiency (Co.L-E), denoted as T_Rank, I_Rank, and Co_Rank, respectively. The overall efficiency rank, E_Rank, is derived by computing the average of these three ranks and ranking the result. These efficiency-related rankings are also presented in Table 9 (Section 5.3).

Finally, to evaluate overall model performance, we define a composite ranking metric, Total_Rank, which integrates accuracy, efficiency, and stability. It is calculated by averaging the rankings of P_Rank, S_Rank, and E_Rank, followed by ranking the averaged score. This comprehensive ranking is used in Table 9 (Section 5.3) to compare models across all performance dimensions.
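
The ranking scheme can be reproduced with a small pandas sketch. Column names follow the metric abbreviations above; computing P_Rank as the re-ranked average of the pass@1/3/5 ranks and breaking ties with the minimum rank are our assumptions, since the exact tie-handling is not spelled out.

```python
import pandas as pd


def add_rankings(df):
    """Derive P_Rank, C_Rank, S_Rank, E_Rank, and Total_Rank (rank 1 is best).

    `df` must contain the columns pass@1, pass@3, pass@5, CV, SA,
    Tok.-E, In.T-E, and Co.L-E, with one row per model.
    """
    df = df.copy()
    pass_ranks = pd.concat(
        [df[c].rank(ascending=False, method="min") for c in ("pass@1", "pass@3", "pass@5")],
        axis=1,
    )
    df["P_Rank"] = pass_ranks.mean(axis=1).rank(method="min")
    df["C_Rank"] = df["CV"].rank(ascending=True, method="min")   # lower CV ranks higher
    df["S_Rank"] = df["SA"].rank(ascending=False, method="min")
    df["T_Rank"] = df["Tok.-E"].rank(ascending=False, method="min")
    df["I_Rank"] = df["In.T-E"].rank(ascending=False, method="min")
    df["Co_Rank"] = df["Co.L-E"].rank(ascending=False, method="min")
    df["E_Rank"] = df[["T_Rank", "I_Rank", "Co_Rank"]].mean(axis=1).rank(method="min")
    df["Total_Rank"] = df[["P_Rank", "S_Rank", "E_Rank"]].mean(axis=1).rank(method="min")
    return df
```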

4.3.6. Error Type

To support qualitative analysis of model performance, AutoGEEval incorporates an automated error detection mechanism based on GEE runtime errors, designed to record the types of errors present in generated code. These errors are as follows:

Syntax Errors: These refer to issues in the syntactic structure of the code that prevent successful compilation, such as missing parentheses, misspellings, or missing module imports. Such errors are typically flagged in the GEE console as ‘SyntaxError’.

Parameter Errors: These occur when the code is syntactically correct but fails to execute due to incorrect or missing parameters. Parameters often involve references to built-in datasets, band names, or other domain-specific knowledge in geosciences. Common error messages include phrases like “xxx has no attribute xx”, “xxx not found”, or prompts indicating missing required arguments. These errors often arise during parameter concatenation or variable assignment.

Invalid Answers: These refer to cases where the code executes successfully, but the output is inconsistent with the expected answer or the returned data type does not match the predefined specification.

Runtime Errors: Timeouts often result from infinite loops or large datasets, causing the computation to exceed 180 s and be terminated by the testing framework. These errors are usually due to logical flaws, such as incorrect conditionals or abnormal loops. On the GEE platform, they are displayed as “timeout 180 s”.

Network Errors: These occur when the GEE system returns an Internal Server Error, persisting after three retries under stable network conditions. Such errors are caused by Google server rate limits or backend timeouts, not by model code syntax or logic. On the GEE platform, these are displayed as HTTP 500 errors, while client errors are shown as HTTP 400 codes.
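
One way to automate this classification is to map each test run’s captured outcome onto the categories by keyword, as sketched below; the keyword lists and the fallback bucket are assumptions derived from the descriptions above, not the framework’s exact rules.

```python
def classify_error(exc_message=None, timed_out=False, output_mismatch=False):
    """Assign one test run to an error category based on its captured outcome.

    The keyword heuristics are illustrative assumptions based on the error
    descriptions in this section and are not exhaustive.
    """
    if timed_out:
        return "Runtime Error"                 # e.g., "timeout 180 s"
    if exc_message is not None:
        msg = exc_message.lower()
        if "syntaxerror" in msg:
            return "Syntax Error"
        if "internal server error" in msg:
            return "Network Error"
        if ("has no attribute" in msg or "not found" in msg
                or "missing" in msg or "required" in msg):
            return "Parameter Error"
        return "Unclassified"                  # execution failures not matched above
    if output_mismatch:
        return "Invalid Answer"                # executed, but wrong value or type
    return "Pass"
```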

5. Results

Building on the evaluation metrics outlined in Section 4.3, this section presents a systematic analysis of the evaluation results based on the AutoGEEval framework and the AutoGEEval-Bench test suite. The analysis focuses on four key dimensions: accuracy metrics, resource consumption metrics, operational efficiency metrics, and error type logs.

5.1. Accuracy

The evaluation results for accuracy-related metrics across all models are presented in Table 5.

The stacked bar chart of execution accuracy across models is shown in Figure 7. As observed, increasing the number of generation attempts generally improves accuracy, indicating that multiple generations can partially mitigate hallucination in model-generated code. However, a visual analysis reveals that although both pass@3 and pass@5 increase the number of generations by two rounds compared to the previous level, the green segment (representing the improvement from pass@3 to pass@5) is noticeably shorter than the orange segment (representing the improvement from pass@1 to pass@3). This suggests a significant diminishing return in accuracy gains with additional generations. Quantitative analysis results are presented in Figure 8. The average improvement in pass@3 is 12.88%, ranging from 4.38% to 21.37%. In contrast, the average improvement in pass@5 is only 3.81%, with a range of 1.24% to 6.93%. This pattern highlights a clear diminishing marginal effect in improving accuracy through additional generations. It suggests that while early rounds of generation can substantially correct errors and enhance accuracy, the potential for improvement gradually tapers off in later rounds, reducing the value of further sampling. Therefore, future research should focus on enhancing performance during the initial generation rounds, rather than relying on incremental gains from additional sampling, in order to improve generation efficiency and accuracy more effectively.

The bubble chart displaying the pass@n scores and relative rankings of all models is shown in Figure 9. Several key observations can be made:

Model performance ranking: The dark blue bubbles, representing general-purpose non-reasoning models, generally occupy higher ranks, outperforming the red general-purpose reasoning models and pink general-purpose code generation models. The light blue bubble representing the geospatial code generation model GeoCode-GPT is positioned in the upper-middle tier, with an average rank of 7.33 among the 18 evaluated models.

Performance variation within the DeepSeek family: DeepSeek-V3 (average rank 1.33), DeepSeek-V3-0324 (average rank 3.67), and DeepSeek-R1 (average rank 4.00) all rank among the top-performing models, demonstrating strong performance. However, DeepSeek-Coder-V2 performs poorly, ranking last (average rank 18.00), indicating that it lacks sufficient capability for GEE code generation tasks.

Inconsistent performance across model versions: Surprisingly, DeepSeek-V3-0324, an optimized version of DeepSeek-V3, performs worse in GEE code generation, suggesting that later updates may not have specifically targeted improvements in this domain, potentially leading to performance degradation.

Performance of different parameter versions within the same model: Significant differences are observed across parameter configurations of the same model. For instance, Qwen-2.5-Coder-32B (average rank 8.33) outperforms its 7B (rank 14.00) and 3B (rank 15.67) variants. Similarly, within the Qwen-2.5 family, the 32B version (rank 12.33) ranks notably higher than the 7B (rank 15.33) and 3B (rank 17.00) versions. In addition, GPT-4o (rank 9.33) also outperforms GPT-4o-mini (rank 12.00).

Performance gain of GeoCode-GPT-7B: GeoCode-GPT-7B (average rank 7.33) outperforms its base model Code-Llama-7B (rank 9.50), indicating effective fine-tuning for GEE code generation tasks. However, the improvement is modest, possibly due to GeoCode-GPT’s training covering a broad range of geospatial code types (e.g., ARCPY, GDAL), thus diluting its specialization in the GEE-specific domain.

Category-wise performance analysis: Among the categories, the best-performing general-purpose non-reasoning LLM is DeepSeek-V3 (rank 1.33), the top general-purpose reasoning model is DeepSeek-R1 (rank 4.00), and the best general-purpose code generation model is Qwen-2.5-Coder-32B (rank 8.33).

Underwhelming performance of the GPT series: The GPT series shows relatively weak performance. Specifically, GPT-4o (rank 9.33) and GPT-4o-mini (rank 12.00) are both outperformed by models from the DeepSeek, Claude, and Gemini families, as well as by GeoCode-GPT-7B. Even the GPT-series reasoning model o3-mini only marginally surpasses GeoCode-GPT-7B by less than one rank.

Figure 9 LLM pass@n ranking bubble chart. The x-axis represents the pass@1 scores, the y-axis represents the pass@3 scores, and the size of the bubbles corresponds to the pass@5 scores. Different colors represent different LLM types, as shown in the legend. The bold and underlined numbers beside the model names indicate the average ranking of the model under the pass@1, pass@3, and pass@5 metrics. The bold and underlined numbers in red represent the highest-ranking model within each LLM category.

To assess the stability of accuracy for the evaluated LLMs, we performed metric slicing, and summarize the results in Table 6. Models with green shading indicate that both P_Rank and C_Rank are higher than S_Rank, suggesting that these models exhibit strong stability, with high overall rankings and robust consistency. Examples include DeepSeek-V3 and DeepSeek-V3-0324. Models with orange shading indicate that P_Rank is lower than both S_Rank and C_Rank. Although these models achieve high P_Rank, their poor stability leads to lower S_Rank scores. Typical examples include Gemini-2.0-pro, DeepSeek-R1, o3-mini, and QwQ-32B. Most of these are reasoning models, reflecting that poor stability is one of the current performance bottlenecks for reasoning-oriented LLMs. Models with blue shading indicate that P_Rank is higher than both S_Rank and C_Rank. Although P_Rank is not particularly high, these models demonstrate good stability and achieve relatively better rankings, making them more robust in scenarios where stability is crucial. Representative models include Claude3.7-Sonnet, Qwen2.5-Coder-32B, GPT-4o, GPT-4o-mini, and Qwen-2.5-7B.

5.2. Resource Consumption

The evaluation results for resource consumption are presented in Table 7. This study provides visual analyses of token consumption, inference time, and the number of core generated code lines.

The bar chart of the average token consumption for GEE code generation across all LLMs is shown in Figure 10. The results show that the general non-reasoning, general code generation, and geospatial code generation model categories exhibit relatively similar levels of token consumption, while the general reasoning models consume significantly more tokens—approximately 6 to 7 times higher on average than the other three categories. This finding provides a useful reference for users in estimating token-based billing costs when selecting a model. It suggests that, for the same GEE code generation task, general reasoning models will incur 6 to 7 times the cost compared to general non-reasoning, general code generation, and geospatial code generation models.

The lollipop chart of inference time consumption for GEE code generation across LLMs is shown in Figure 11. In terms of inference methods, models using the API call approach (circles) exhibit longer inference times compared to those using local deployment (squares). This may be due to network latency and limitations in the computing resources of remote servers. From a model category perspective, general reasoning models (orange) generally require more inference time than other types. However, o3-mini is an exception—its inference latency is even lower than that of the locally deployed DeepSeek-Coder-V2, indicating that its server-side computational resources may have been optimized accordingly. In addition, the average inference time per unit test case for DeepSeek-R1 and QwQ-32B reaches as high as 78.3 s and 44.68 s, respectively—2 to 40 times longer than other models—indicating that these two models are in urgent need of targeted optimization for inference latency.

The token consumption metric reflects not only the length of the generated code but also includes the model’s reasoning output and the length of the prompt template, thereby representing more of the reasoning cost than the actual size of the generated code itself. To more accurately measure the structural length of the model’s output code, we excluded the influence of prompt- and reasoning-related content and used the total number of generated lines of code (including both comments and executable lines) as the evaluation metric. The results are shown in Figure 12. As observed, GeoCode-GPT-7B (average: 11.79 lines), DeepSeek-Coder-V2 (10.06), Qwen2.5-Coder-3B (9.11), and Claude3.7-Sonnet (8.98) rank among the highest in terms of code length. This may be attributed to excessive generated comments or more standardized code structures that automatically include formal comment templates, thereby increasing the overall line count. Additionally, a noteworthy phenomenon is observed within the Qwen2.5-Coder family: models with larger parameter sizes tend to generate shorter code. For example, the Qwen2.5-Coder-32B model has an average code length of 5.79 lines, which is significantly shorter than its 7B (7.06) and 3B (9.11) versions. This result contradicts conventional expectations and may suggest that larger models possess stronger capabilities in code compression and refinement, or that their output formatting is subject to stricter constraints and optimizations during training.

5.3. Operational Efficiency

The operational efficiency results for each model are presented in Table 8.

According to the results shown in Table 9, DeepSeek-V3, Gemini-2.0-pro, and DeepSeek-V3-0324 consistently rank at the top across all three dimensions and demonstrate excellent overall performance. All three are commercial models, making them suitable for API-based deployment. In contrast, models such as Code-Llama-7B, Qwen2.5-Coder-32B, and GPT-4o do not rank as highly in terms of P_Rank and S_Rank, but their strong performance in E_Rank makes them well-suited for local deployment (the first two) or for scenarios requiring high generation efficiency (GPT-4o). By comparison, although models like DeepSeek-R1, GeoCode-GPT-7B, o3-mini, and Claude3.7-Sonnet perform well in terms of accuracy and stability, their low E_Rank scores lead to less favorable overall rankings, indicating a need to improve generation efficiency in order to optimize their total performance.

5.4. Error Type Logs

The types of errors encountered by each model during GEE code generation are summarized in Table 10, revealing an overall consistent error pattern across models. Parameter errors occur at a significantly higher rate than invalid answers, while syntax errors, runtime errors, and network errors appear only sporadically and at extremely low frequencies. This suggests that the core challenge currently faced by models in GEE-based geospatial code generation lies in the lack of domain-specific parameter knowledge, including references to platform-integrated datasets, band names, coordinate formats, and other geoscientific details. As such, there is an urgent need to augment training data with domain-relevant knowledge specific to the GEE platform and to implement targeted fine-tuning. Meanwhile, the models have demonstrated strong stability in terms of basic syntax, code structure, and loop control, with related errors being extremely rare. This indicates that their foundational programming capabilities are largely mature. Therefore, future optimization efforts should shift toward enhancing domain knowledge rather than further reinforcing general coding skills.

6. Discussion

Using the AutoGEEval framework, this study evaluates 18 large language models (LLMs) in GEE code generation across four key dimensions: accuracy, resource consumption, operational efficiency, and error types. This section summarizes the findings and explores future research directions for LLMs in geospatial code generation.

The results show that multi-round generation mitigates hallucinations and improves stability, but the marginal gains diminish with more rounds, particularly from pass@3 to pass@5. This highlights the need to prioritize early-stage generation quality over additional iterations. Future work should focus on enhancing initial code generation accuracy to prevent error propagation. A promising approach is incorporating reinforcement learning-based adaptive mechanisms with early feedback to optimize early outputs and reduce reliance on post-correction. Cross-round information sharing may also enhance stability.

Our study shows that general-purpose reasoning models consume significantly more resources, leading to higher computational costs and slower response times, averaging 2 to 40 times longer than non-reasoning models. Despite this, their performance does not exceed, and in some cases is inferior to, non-reasoning models, resulting in low cost-efficiency. Future research on geospatial code generation with reasoning models should focus on integrating model compression and optimization techniques, such as quantization, distillation, and hardware acceleration (e.g., GPU/TPU), to improve inference speed and efficiency, particularly in resource-limited edge computing environments.

Our analysis reveals that parameter errors are the most common, with syntax and network errors being relatively rare. This indicates that most models have achieved maturity in basic syntax and code execution. However, the lack of domain-specific knowledge required by the GEE platform (e.g., dataset paths, band names, coordinate formats) remains a key limitation, highlighting the need for domain-specific fine-tuning to improve model performance in geospatial tasks.

We observed that models from the same company can show significant performance variability. For example, while OpenAI’s GPT series consistently maintains stability, DeepSeek models vary widely—DeepSeek-V3 excels in accuracy and stability, while DeepSeek-Coder-V2 ranks lowest. This underscores the importance of data-driven model selection, emphasizing that model choice should rely on rigorous testing and comparative analysis, not brand reputation. For model selection, the overall ranking indicator (Total_Rank), which combines accuracy (P_Rank), stability (S_Rank), and efficiency (E_Rank), is recommended. Models such as DeepSeek-V3, offering high accuracy and efficiency, are well-suited for high-performance, high-frequency API deployment. In contrast, models like Claude3.7-Sonnet, with a focus on accuracy and stability, are better suited for scientific and engineering tasks requiring consistent outputs.

Model size alone does not determine performance. For example, Qwen2.5-Coder-32B excels in accuracy and efficiency compared to its 7B and 3B counterparts but underperforms in code simplicity and stability. This indicates that, for specific tasks, fine-tuning and output formatting are more crucial than model size. Future research should focus on task-specific adaptation and fine-tuning, integrating model size, task alignment, and output formatting to optimize efficiency.

7. Conclusions

This study presents AutoGEEval, the first automated evaluation framework designed for geospatial code generation tasks on the GEE platform. Implemented via the Python API, the framework supports unit-level, multimodal, and end-to-end evaluation across 26 GEE data types. It consists of three core components: the constructed benchmark (AutoGEEval-Bench) with 1325 unit test cases; the Submission Program, which guides LLMs to generate executable code via prompts; and the Judge Program, which automatically verifies output correctness, resource consumption, and error types. Using this framework, we conducted a comprehensive evaluation of 18 representative LLMs, spanning general-purpose, reasoning-enhanced, code generation, and geospatial-specialized models. The results reveal performance gaps, code hallucination phenomena, and trade-offs in code quality, offering valuable insights for the optimization of future geospatial code generation technologies.

7.1. Significance and Contributions

This study is the first to establish a dedicated evaluation system for LLMs in geospatial code generation tasks, addressing key gaps in current tools that lack geospatial coverage, granularity, and automation. Through the proposed AutoGEEval framework, we achieved the systematic evaluation of multimodal GEE data types and API function call capabilities, advancing the automated transformation from natural language to geospatial code. Compared to existing methods that rely heavily on manual scoring, AutoGEEval offers high automation, standardization, and reproducibility, substantially reducing evaluation costs and improving efficiency. The framework supports comprehensive tracking and quantitative analysis of code correctness, inference efficiency, resource consumption, and error types, providing a clear indicator system and real-world entry points for model refinement. Moreover, the constructed benchmark AutoGEEval-Bench, covering 1325 test cases and 26 GEE data types, is both scalable and representative, serving as a valuable public resource for future research on intelligent geospatial code generation. Overall, this work advances the transformation of geospatial code generation from an engineering tool into a quantifiable scientific problem, and provides a methodological reference and practical blueprint for interdisciplinary AI model evaluation paradigms.

7.2. Limitations and Future Work

Despite the representativeness of the proposed unit-level evaluation framework, several limitations remain, and future work can explore multiple directions for further enhancement. Currently, the evaluation tasks focus on single-function unit tests, and although 1325 use cases are included, coverage remains limited. Future expansions could include a broader test set, especially under boundary conditions and abnormal inputs, to evaluate model robustness under extreme scenarios. Additionally, introducing function composition and cross-API test cases will allow for the assessment of model capabilities in handling complex logical structures. The current 26 GEE data types could also be expanded using modality-based classification strategies to achieve a more balanced and comprehensive benchmark. In terms of evaluation metrics, the current system primarily centers on execution correctness. Future extensions could incorporate multi-dimensional evaluation criteria, including code structural complexity, runtime efficiency, and resource usage. Given the continual evolution of LLMs, a valuable next step would be to build an open, continuous evaluation platform that includes economic cost dimensions and releases “cost-effectiveness leaderboards”, thereby driving community development and enhancing the visibility and influence of geospatial code generation research.

Author Contributions

Conceptualization, Huayi Wu and Shuyang Hou; methodology, Huayi Wu and Shuyang Hou; software, Zhangxiao Shen, Haoyue Jiao and Shuyang Hou; validation, Zhangxiao Shen, Jianyuan Liang and Yaxian Qing; formal analysis, Shuyang Hou and Zhangxiao Shen; investigation, Xu Li, Xiaopu Zhang and Shuyang Hou; resources, Jianyuan Liang, Huayi Wu, Zhipeng Gui, Xuefeng Guan and Longgang Xiang; data curation, Jianyuan Liang and Shuyang Hou; writing—original draft preparation, Shuyang Hou and Zhangxiao Shen; writing—review and editing, Shuyang Hou, Zhangxiao Shen and Huayi Wu; visualization, Shuyang Hou; supervision, Huayi Wu, Zhipeng Gui, Xuefeng Guan and Longgang Xiang; project administration, Shuyang Hou; funding acquisition, Zhipeng Gui. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The experimental data used in this study can be downloaded from https://github.com/szx-0633/AutoGEEval (accessed on 29 June 2025).

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1 Common error types in geospatial code generation with LLMs. The figure categorizes typical errors into three components: the error type, the user’s geospatial code prompt to the LLM, and the erroneous code generated by the model. In each example, red highlights the incorrect code while green indicates the corrected version.

Figure 2 AutoGEEval framework structure. The diagram highlights AutoGEEval-Bench (a), the Submission Program (b), and the Judge Program (c). Blue represents documentation, orange denotes language models, green represents prompts, and purple indicates evaluation methods and metrics. The black solid arrows represent the main workflow of each stage while the gray dashed arrows indicate the primary data flow.

Figure 3 Prompt for unit test construction.

Figure 4 Unit test example with text-based type ‘ee.Array’.

Figure 5 Unit test example with topology-based type ‘ee.Image’.

Figure 6 Prompt for submission program.

Figure 7 Stacked bar chart of pass@n metrics. The blue represents the pass@1 value, the orange represents the improvement of pass@3 over pass@1, and the green represents the improvement of pass@5 over pass@3. The text on the bars indicates the absolute scores for pass@1, pass@3, and pass@5, respectively.

Figure 8 Stacked bar chart of pass@3 and pass@5 improvement ratios.

Figure 10 Average token consumption across LLMs. Blue indicates general non-reasoning models, orange indicates general reasoning models, green represents code generation models, and yellow represents geospatial code generation models.

Figure 11 Average inference time comparison of LLMs.

Figure 12 Average lines of generated GEE code per model.

Table 1 Distribution of GEE output types in AutoGEEval-Bench.

Output_Type | Description | Count | Percentage
ee.Array | Multi-dimensional array for numbers and pixels | 118 | 8.91%
ee.ArrayImage | Image constructed from multidimensional arrays | 30 | 2.26%
ee.Blob | Binary large object storage (e.g., files/models) | 1 | 0.08%
ee.BOOL | Boolean logic value (True/False) | 38 | 2.87%
ee.Classifier | Machine learning classifier object | 12 | 0.91%
ee.Clusterer | Clustering algorithm processor | 6 | 0.45%
ee.ConfusionMatrix | Confusion matrix of classification results | 4 | 0.30%
ee.Date | Date and time format data | 9 | 0.68%
ee.DateRange | Object representing a range of dates | 5 | 0.38%
ee.Dictionary | Key-value data structure | 63 | 4.75%
ee.Element | Fundamental unit of a geographic feature | 3 | 0.23%
ee.ErrorMargin | Statistical object for error margins | 1 | 0.08%
ee.Feature | Single feature with properties and shape | 21 | 1.58%
ee.FeatureCollection | Collection of geographic features | 41 | 3.09%
ee.Filter | Object representing data filtering conditions | 37 | 2.79%
ee.Geometry | Geometric shapes (point, line, polygon, etc.) | 146 | 11.02%
ee.Image | Single raster image data | 224 | 16.91%
ee.ImageCollection | Collection of image data objects | 17 | 1.28%
ee.Join | Method for joining datasets | 6 | 0.45%
ee.Kernel | Convolution kernel for spatial analysis | 22 | 1.66%
ee.List | Ordered list data structure | 68 | 5.13%
ee.Number | Numeric data | 194 | 14.64%
ee.PixelType | Pixel type definition | 10 | 0.75%
ee.Projection | Coordinate system projection information | 15 | 1.13%
ee.Reducer | Aggregation and reduction functions | 60 | 4.53%
ee.String | String-type data | 174 | 13.13%
Overall | Total | 1325 | 100.00%

Summary of value representations and evaluation strategies for GEE data types. An illustrative verification sketch in Python is provided after the table.

GEE Data Type(s) Value Representation Testing Logic
ee.Array, ee.ConfusionMatrix, ee.ArrayImage Small-scale array Use getInfo to convert to a NumPy array and compare each element with expected_answer.
ee.Image, ee.ImageCollection Large-scale array Download the image as a NumPy array and perform pixel-wise comparison; for large images, apply center sampling with a tolerance of 0.001. For image collections, merge all images into one and evaluate as a single image.
ee.List List Convert to a Python list via getInfo and compare each element.
ee.String, ee.BOOL String Convert to a Python string via getInfo and compare directly; Boolean values are also treated as strings.
ee.Number Floating-point number Convert to a Python float via getInfo and compare with the expected answer.
ee.Dictionary, ee.Blob, ee.Reducer, ee.Filter, ee.Classifier, ee.Clusterer, ee.PixelType, ee.Join, ee.Kernel, ee.ErrorMargin, ee.Element, ee.Projection All dictionary keys Convert to a Python dictionary via getInfo and compare key-value pairs.
ee.Date, ee.DateRange Dictionary ‘value’ field Use getInfo to obtain a dictionary, extract the ‘value’ field (timestamp in milliseconds), and compare numerically.
ee.Geometry, ee.Feature, ee.FeatureCollection GeoJSON Convert to GeoJSON using getInfo and compare geometric consistency with Shapely; for Features, extract the geometry before comparison.
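
To make the testing logic above concrete, the following is a minimal sketch in Python of how per-type answer verification could be implemented with the GEE Python API. The function names, tolerance constant, and control flow are illustrative assumptions, not the actual AutoGEEval judge program.

```python
# Minimal sketch of per-type answer verification with the GEE Python API.
# Function names, the tolerance constant, and the control flow are
# illustrative assumptions, not the actual AutoGEEval judge program.
import ee
import numpy as np
from shapely.geometry import shape

TOLERANCE = 1e-3  # assumed numeric tolerance for floating-point comparison


def check_number(result, expected):
    # ee.Number -> Python float via getInfo, compared within tolerance
    return abs(float(result.getInfo()) - float(expected)) <= TOLERANCE


def check_array(result, expected):
    # ee.Array / ee.ConfusionMatrix -> NumPy array via getInfo, element-wise comparison
    return np.allclose(np.array(result.getInfo()), np.array(expected), atol=TOLERANCE)


def check_list(result, expected):
    # ee.List -> Python list via getInfo, element-by-element comparison
    return list(result.getInfo()) == list(expected)


def check_string(result, expected):
    # ee.String (Booleans are also treated as strings) -> direct comparison
    return str(result.getInfo()) == str(expected)


def check_date(result, expected_millis):
    # ee.Date -> dictionary via getInfo; compare the 'value' timestamp (ms)
    return result.getInfo()["value"] == expected_millis


def check_geometry(result, expected_geojson):
    # ee.Geometry -> GeoJSON via getInfo; geometric comparison with Shapely
    return shape(result.getInfo()).equals(shape(expected_geojson))


if __name__ == "__main__":
    ee.Initialize()  # assumes GEE credentials are already configured
    print(check_number(ee.Number(1).add(2), 3))            # True
    print(check_list(ee.List([1, 2]).add(3), [1, 2, 3]))   # True
    print(check_geometry(ee.Geometry.Point([0, 0]),
                         {"type": "Point", "coordinates": [0, 0]}))  # True
```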

Information on the evaluated LLMs. “N/A” indicates that the model’s parameter size had not been publicly released at the time of publication.

Model Type Model Name Developer Size Year
General Non-Reasoning GPT-4o [28] OpenAI N/A 2024
GPT-4o-mini [29] OpenAI N/A 2024
Claude3.7-Sonnet [30] Anthropic N/A 2025
Gemini-2.0-pro [31] Google N/A 2025
DeepSeek-V3 [32] DeepSeek 671B 2024
DeepSeek-V3-0324 [32] DeepSeek 685B 2025
Qwen-2.5 [33] Alibaba 3B, 7B, 32B 2024
General Reasoning o3-mini [34] OpenAI N/A 2025
QwQ-32B [35] Alibaba 32B 2025
DeepSeek-R1 [36] DeepSeek 671B 2025
General Code Generation DeepSeek-Coder-V2 [37] DeepSeek 16B 2024
Qwen2.5-Coder [7] Alibaba 3B, 7B, 32B 2024
Code-Llama-7B [8] Meta 7B 2023
Geospatial Code Generation GeoCode-GPT-7B [21] Wuhan University 7B 2024

Time allocation across experimental stages.

Stages Time Spent (hours)
AutoGEEval-Bench Construction 35
Expert Manual Revision 50
Model Inference and Code Execution 445
Evaluation of Model Responses 270
Total (All Stages) 800

Accuracy evaluation results. Values in parentheses under pass@3 indicate the improvement over pass@1, and values in parentheses under pass@5 indicate the improvement over pass@3. A brief illustration of the pass@k estimator is provided after the table.

Model pass@1 (%) pass@3 (%) pass@5 (%) CV SA
General Non-Reasoning
GPT-4o 59.02 63.62 (+4.60) 65.36 (+1.74) 0.097 59.58
GPT-4o-mini 55.02 60.68 (+4.66) 61.43 (+0.75) 0.104 55.63
Claude3.7-Sonnet 63.92 66.72 (+2.80) 67.92 (+1.20) 0.059 64.14
Gemini-2.0-pro 65.36 75.09 (+9.73) 77.28 (+2.19) 0.154 66.95
DeepSeek-V3 71.55 75.25 (+3.70) 76.91 (+1.66) 0.070 71.90
DeepSeek-V3-0324 65.28 71.92 (+6.64) 73.51 (+1.59) 0.112 66.11
Qwen-2.5-3B 33.58 39.32 (+5.74) 41.43 (+2.11) 0.189 34.83
Qwen-2.5-7B 49.36 54.49 (+5.13) 56.38 (+1.89) 0.125 50.14
Qwen-2.5-32B 54.42 60.00 (+5.58) 62.04 (+2.04) 0.123 55.25
General Reasoning
o3-mini 56.98 68.91 (+11.93) 71.02 (+2.11) 0.198 59.30
QwQ-32B 53.74 64.83 (+9.09) 68.83 (+4.00) 0.219 56.45
DeepSeek-R1 60.23 72.68 (+12.45) 76.68 (+4.00) 0.215 63.14
General Code Generation
DeepSeek-Coder-V2 31.40 38.11 (+6.71) 40.75 (+2.64) 0.229 33.14
Qwen2.5-Coder-3B 46.49 54.34 (+7.85) 57.36 (+3.02) 0.190 48.22
Qwen2.5-Coder-7B 51.25 57.66 (+6.41) 60.91 (+3.25) 0.159 52.57
Qwen2.5-Coder-32B 61.28 64.08 (+2.80) 65.21 (+1.13) 0.060 61.50
Code-Llama-7B 56.98 64.00 (+7.02) 66.42 (+2.42) 0.142 58.15
Geospatial Code Generation
GeoCode-GPT-7B 58.58 65.34 (+6.76) 68.53 (+3.19) 0.145 59.84
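
The pass@1, pass@3, and pass@5 columns can be read against the standard unbiased pass@k estimator introduced with HumanEval by Chen et al. [38]. The snippet below is a minimal sketch of that estimator; whether AutoGEEval computes pass@k in exactly this way is an assumption.

```python
# Unbiased pass@k estimator from Chen et al. [38] (HumanEval). Whether
# AutoGEEval uses exactly this estimator is an assumption; the snippet only
# illustrates how pass@1/pass@3/pass@5 relate to repeated generations.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """n = generations per task, c = generations that pass, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 5 generations for one task, 2 of which pass the unit test.
print(pass_at_k(5, 2, 1))  # 0.4
print(pass_at_k(5, 2, 3))  # 0.9
print(pass_at_k(5, 2, 5))  # 1.0
```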

Ranking of the models under the pass@5, CV, and SA metrics. The table is sorted by S_Rank, which ranks models by accuracy with stability factors taken into account rather than by accuracy alone. Categories 1, 2, 3, and 4 correspond to general non-reasoning models, general reasoning models, general code generation models, and geospatial code generation models, respectively.

Category Model pass@5 CV SA P_Rank C_Rank S_Rank
1 DeepSeek-V3 76.91 0.07 71.9 2 3 1
1 Gemini-2.0-pro 77.28 0.154 66.95 1 11 2
1 DeepSeek-V3-0324 73.51 0.112 66.11 4 6 3
1 Claude3.7-Sonnet 67.92 0.059 64.14 8 1 4
2 DeepSeek-R1 76.68 0.215 63.14 3 16 5
3 Qwen2.5-Coder-32B 65.21 0.06 61.5 11 2 6
4 GeoCode-GPT-7B 68.53 0.145 59.84 7 10 7
1 GPT-4o 65.36 0.097 59.58 10 4 8
2 o3-mini 71.02 0.198 59.3 5 15 9
3 Code-Llama-7B 66.42 0.142 58.15 9 9 10
2 QwQ-32B 68.83 0.219 56.45 6 17 11
1 GPT-4o-mini 61.43 0.104 55.63 13 5 12
1 Qwen-2.5-32B 62.04 0.123 55.25 12 7 13
3 Qwen2.5-Coder-7B 60.91 0.159 52.57 14 12 14
1 Qwen-2.5-7B 56.38 0.125 50.14 16 8 15
3 Qwen2.5-Coder-3B 57.36 0.19 48.22 15 14 16
1 Qwen-2.5-3B 41.43 0.189 34.83 17 13 17
3 DeepSeek-Coder-V2 40.75 0.229 33.14 18 18 18

Evaluation results for resource consumption. Tok. denotes average token consumption, In.T average inference time in seconds, and Co.L average lines of generated code. For QwQ-32B, the API provider supports only streaming calls; in this mode, token consumption cannot be tracked, so it is marked as N/A.

Model Inference Method Tok. (tokens) In.T (s) Co.L (lines)
General Non-Reasoning
GPT-4o API call 210 3.31 7.77
GPT-4o-mini API call 208 7.63 5.86
Claude3.7-Sonnet API call 265 11.72 8.98
Gemini-2.0-pro API call 223 24.55 5.2
DeepSeek-V3 API call 190 8.87 4.86
DeepSeek-V3-0324 API call 204 16.32 6.82
Qwen-2.5-3B Local deployment 186 2.58 4.12
Qwen-2.5-7B Local deployment 197 3.88 6.28
Qwen-2.5-32B API call 205 5.63 6.6
General Reasoning
o3-mini API call 1083 7.40 6.93
QwQ-32B API call N/A 44.68 5.64
DeepSeek-R1 API call 1557 78.30 5.32
General Code Generation
DeepSeek-Coder-V2 Local deployment 285 8.39 10.06
Qwen2.5-Coder-3B Local deployment 240 2.51 9.11
Qwen2.5-Coder-7B Local deployment 224 3.76 7.06
Qwen2.5-Coder-32B API call 198 5.50 5.79
Code-Llama-7B Local deployment 256 3.05 3.58
Geospatial Code Generation
GeoCode-GPT-7B Local deployment 253 4.05 11.79

Evaluation results for operational efficiency. Tok.-E, In.T-E, and Co.L-E are the efficiency counterparts of the Tok., In.T, and Co.L metrics reported in the previous table. For QwQ-32B, the API provider supports only streaming calls; token consumption cannot be tracked, so the token-efficiency value is marked as N/A. A worked check of the efficiency values is provided after the table.

Model Inference Method Tok.-E In.T-E Co.L-E
General Non-Reasoning
GPT-4o API call 0.311 19.746 7.77
GPT-4o-mini API call 0.295 8.052 5.86
Claude3.7-Sonnet API call 0.256 5.796 8.98
Gemini-2.0-pro API call 0.347 3.148 5.2
DeepSeek-V3 API call 0.405 8.670 4.86
DeepSeek-V3-0324 API call 0.360 4.504 6.82
Qwen-2.5-3B Local deployment 0.223 16.060 4.12
Qwen-2.5-7B Local deployment 0.286 14.530 6.28
Qwen-2.5-32B API call 0.303 11.019 6.6
General Reasoning
o3-mini API call 0.066 9.597 6.93
QwQ-32B API call N/A 1.541 5.64
DeepSeek-R1 API call 0.049 0.979 5.32
General Code Generation
DeepSeek-Coder-V2 Local deployment 0.143 8.39 10.06
Qwen2.5-Coder-3B Local deployment 0.239 2.51 9.11
Qwen2.5-Coder-7B Local deployment 0.272 3.76 7.06
Qwen2.5-Coder-32B API call 0.329 5.50 5.79
Code-Llama-7B Local deployment 0.259 3.05 3.58
Geospatial Code Generation
GeoCode-GPT-7B Local deployment 0.271 4.05 11.79
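
As a worked check of the arithmetic, the tabulated efficiency values appear consistent with dividing a model's pass@5 score by the corresponding average resource figure from the previous table; this reading is inferred from the numbers, not a quoted formula. A minimal sketch using the GPT-4o row:

```python
# Worked check (assumption: efficiency = pass@5 / average resource value).
pass_at_5 = 65.36   # GPT-4o pass@5 (%), from the accuracy table
avg_tokens = 210    # GPT-4o Tok. (tokens)
avg_time_s = 3.31   # GPT-4o In.T (s)

print(round(pass_at_5 / avg_tokens, 3))  # 0.311 -> matches Tok.-E for GPT-4o
print(round(pass_at_5 / avg_time_s, 3))  # 19.746 -> matches In.T-E for GPT-4o
```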

Rank-based comparative evaluation of models. The table is sorted by Total_Rank in ascending order. If models share the same average rank, they are assigned the same ranking (e.g., both DeepSeek-V3 and Code-Llama-7B are ranked 1 in E_Rank). Yellow highlights indicate the top 12 models in E_Rank, green for the top 12 in S_Rank, and orange for the top 12 in P_Rank. Gray highlights mark the bottom 6 models across E_Rank, S_Rank, and P_Rank. Categories 1, 2, 3, and 4 correspond to general non-reasoning models, general reasoning models, general code generation models, and geospatial code generation models, respectively.

Category Model T_Rank I_Rank Co_Rank E_Rank S_Rank P_Rank Total_Rank
1 DeepSeek-V3 1 11 2 1 1 2 1
1 Gemini-2.0-pro 3 16 3 4 2 1 2
1 DeepSeek-V3-0324 2 15 7 6 3 4 3
3 Code-Llama-7B 11 2 1 1 10 9 4
3 Qwen2.5-Coder-32B 4 8 6 3 6 11 4
1 GPT-4o 5 3 14 5 8 10 6
2 DeepSeek-R1 17 18 4 15 5 3 6
4 GeoCode-GPT-7B 10 4 17 13 7 7 8
2 o3-mini 16 10 9 14 9 5 9
1 Claude3.7-Sonnet 12 13 15 17 4 8 10
1 Qwen-2.5-32B 6 9 11 7 13 12 11
1 GPT-4o-mini 7 12 8 8 12 13 12
2 QwQ-32B 18 17 5 16 11 6 12
3 Qwen2.5-Coder-7B 9 5 13 9 14 14 14
1 Qwen-2.5-7B 8 7 12 10 15 16 15
3 Qwen2.5-Coder-3B 13 1 16 11 16 15 16
1 Qwen-2.5-3B 14 6 10 12 17 17 17
3 DeepSeek-Coder-V2 15 14 18 18 18 18 18

Error type distribution in GEE code generation across models.

Model ParameterError (%) InvalidAnswer (%) SyntaxError (%) RuntimeError (%) NetworkError (%)
General Non-Reasoning
GPT-4o 72.21 26.58 1.02 0.19 0.00
GPT-4o-mini 75.88 22.29 1.49 0.30 0.04
Claude3.7-Sonnet 65.81 31.92 1.76 0.22 0.29
Gemini-2.0-pro 55.71 37.15 7.01 0.02 0.11
DeepSeek-V3 72.75 26.29 0.37 0.14 0.45
DeepSeek-V3-0324 79.40 19.86 0.43 0.08 0.23
Qwen-2.5-3B 83.72 8.38 7.90 0.00 0.00
Qwen-2.5-7B 83.44 12.60 3.96 0.00 0.00
Qwen-2.5-32B 78.47 18.65 2.75 0.11 0.02
General Reasoning
o3-mini 67.79 30.02 1.84 0.09 0.26
QwQ-32B 85.68 13.01 1.11 0.01 0.19
DeepSeek-R1 85.04 14.62 0.19 0.00 0.15
General Code Generation
DeepSeek-Coder-V2 84.47 10.62 4.78 0.00 0.13
Qwen2.5-Coder-3B 75.26 12.54 12.20 0.00 0.00
Qwen2.5-Coder-7B 84.76 14.42 0.63 0.03 0.16
Qwen2.5-Coder-32B 79.19 19.96 0.43 0.19 0.23
Code-Llama-7B 80.01 18.47 1.37 0.01 0.14
Geospatial Code Generation
GeoCode-GPT-7B 77.21 9.54 13.14 0.03 0.08

Appendix A

The error types that may occur when large language models generate GEE platform code are shown in Figure A1.

Figure A1 Common error types in geospatial code generation with LLMs.

Appendix B

Figure A2 and Figure A3 present representative test cases from AutoGEEval-Bench, illustrating tasks involving text-based and topology-based GEE data types, respectively.

Figure A2 Unit test example with text-based type.

Figure A3 Unit test example with topology-based type. The non-English characters in the imagery are determined by the map API used. When retrieving a specific region, the place names are displayed in the official language of that region. This does not affect the readability of the figure.

References

1. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A. Competition-level code generation with alphacode. Science; 2022; 378, pp. 1092-1097. [DOI: https://dx.doi.org/10.1126/science.abq1158] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36480631]

2. Popat, S.; Starkey, L. Learning to code or coding to learn? A systematic review. Comput. Educ.; 2019; 128, pp. 365-376. [DOI: https://dx.doi.org/10.1016/j.compedu.2018.10.005]

3. Bonner, A.J.; Kifer, M. An overview of transaction logic. Theor. Comput. Sci.; 1994; 133, pp. 205-265. [DOI: https://dx.doi.org/10.1016/0304-3975(94)90190-2]

4. Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A survey on large language models for code generation. arXiv; 2024; arXiv: 2406.00515

5. Wang, J.; Chen, Y. A review on code generation with llms: Application and evaluation. Proceedings of the 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI); Beijing, China, 18–19 November 2023; pp. 284-289.

6. Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.K. DeepSeek-Coder: When the Large Language Model Meets Programming—The Rise of Code Intelligence. arXiv; 2024; arXiv: 2401.14196

7. Hui, B.; Yang, J.; Cui, Z.; Yang, J.; Liu, D.; Zhang, L.; Liu, T.; Zhang, J.; Yu, B.; Lu, K. Qwen2.5-Coder technical report. arXiv; 2024; arXiv: 2409.12186

8. Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T. Code llama: Open foundation models for code. arXiv; 2023; arXiv: 2308.12950

9. Rahman, M.M.; Kundu, A. Code Hallucination. arXiv; 2024; arXiv: 2407.04831

10. Li, D.; Murr, L. HumanEval on Latest GPT Models—2024. arXiv; 2024; arXiv: 2402.14852

11. Yu, Z.; Zhao, Y.; Cohan, A.; Zhang, X.-P. HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation. arXiv; 2024; arXiv: 2412.21199

12. Jain, N.; Han, K.; Gu, A.; Li, W.-D.; Yan, F.; Zhang, T.; Wang, S.; Solar-Lezama, A.; Sen, K.; Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv; 2024; arXiv: 2403.07974

13. Evtikhiev, M.; Bogomolov, E.; Sokolov, Y.; Bryksin, T. Out of the BLEU: How should we assess quality of the code generation models? J. Syst. Softw.; 2023; 203, 111741. [DOI: https://dx.doi.org/10.1016/j.jss.2023.111741]

14. Liu, J.; Xie, S.; Wang, J.; Wei, Y.; Ding, Y.; Zhang, L. Evaluating language models for efficient code generation. arXiv; 2024; arXiv: 2408.06450

15. Zhou, S.; Alon, U.; Agarwal, S.; Neubig, G. Codebertscore: Evaluating code generation with pretrained models of code. arXiv; 2023; arXiv: 2302.05527

16. Capolupo, A.; Monterisi, C.; Caporusso, G.; Tarantino, E. Extracting land cover data using GEE: A review of the classification indices. Proceedings of the Computational Science and Its Applications—ICCSA 2020; Cagliari, Italy, 1–4 July 2020; pp. 782-796.

17. Tamiminia, H.; Salehi, B.; Mahdianpari, M.; Quackenbush, L.; Adeli, S.; Brisco, B. Google Earth Engine for geo-big data applications: A meta-analysis and systematic review. ISPRS J. Photogramm. Remote Sens.; 2020; 164, pp. 152-170. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2020.04.001]

18. Ratti, C.; Wang, Y.; Ishii, H.; Piper, B.; Frenchman, D. Tangible User Interfaces (TUIs): A novel paradigm for GIS. Trans. GIS; 2004; 8, pp. 407-421. [DOI: https://dx.doi.org/10.1111/j.1467-9671.2004.00193.x]

19. Zhao, Q.; Yu, L.; Li, X.; Peng, D.; Zhang, Y.; Gong, P. Progress and trends in the application of Google Earth and Google Earth Engine. Remote Sens.; 2021; 13, 3778. [DOI: https://dx.doi.org/10.3390/rs13183778]

20. Mutanga, O.; Kumar, L. Google earth engine applications. Remote Sens.; 2019; 11, 591. [DOI: https://dx.doi.org/10.3390/rs11050591]

21. Hou, S.; Shen, Z.; Zhao, A.; Liang, J.; Gui, Z.; Guan, X.; Li, R.; Wu, H. GeoCode-GPT: A large language model for geospatial code generation. Int. J. Appl. Earth Obs. Geoinf.; 2025; 104456. [DOI: https://dx.doi.org/10.1016/j.jag.2025.104456]

22. Hou, S.; Liang, J.; Zhao, A.; Wu, H. GEE-OPs: An operator knowledge base for geospatial code generation on the Google Earth Engine platform powered by large language models. Geo-Spat. Inf. Sci.; 2025; pp. 1-22. [DOI: https://dx.doi.org/10.1080/10095020.2025.2505556]

23. Yang, L.; Driscol, J.; Sarigai, S.; Wu, Q.; Chen, H.; Lippitt, C.D. Google Earth Engine and artificial intelligence (AI): A comprehensive review. Remote Sens.; 2022; 14, 3253. [DOI: https://dx.doi.org/10.3390/rs14143253]

24. Hou, S.; Shen, Z.; Liang, J.; Zhao, A.; Gui, Z.; Li, R.; Wu, H. Can large language models generate geospatial code?. arXiv; 2024; arXiv: 2410.09738

25. Gramacki, P.; Martins, B.; Szymański, P. Evaluation of Code LLMs on Geospatial Code Generation. arXiv; 2024; arXiv: 2410.04617

26. Hou, S.; Jiao, H.; Shen, Z.; Liang, J.; Zhao, A.; Zhang, X.; Wang, J.; Wu, H. Chain-of-Programming (CoP): Empowering Large Language Models for Geospatial Code Generation. arXiv; 2024; arXiv: 2411.10753 [DOI: https://dx.doi.org/10.1080/17538947.2025.2509812]

27. Hou, S.; Zhao, A.; Liang, J.; Shen, Z.; Wu, H. Geo-FuB: A method for constructing an Operator-Function knowledge base for geospatial code generation with large language models. Knowl.-Based Syst.; 2025; 319, 113624. [DOI: https://dx.doi.org/10.1016/j.knosys.2025.113624]

28. Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.J.; Welihinda, A.; Hayes, A.; Radford, A. Gpt-4o system card. arXiv; 2024; arXiv: 2410.21276

29. Menick, J.; Lu, K.; Zhao, S.; Wallace, E.; Ren, H.; Hu, H.; Stathas, N.; Such, F.P. GPT-4o Mini: Advancing Cost-efficient Intelligence; OpenAI: San Francisco, CA, USA, 2024.

30. Anderson, I. Comparative Analysis Between Industrial Design Methodologies Versus the Scientific Method: AI: Claude 3.7 Sonnet. Preprints; 2025.

31. Team, G.R.; Abeyruwan, S.; Ainslie, J.; Alayrac, J.-B.; Arenas, M.G.; Armstrong, T.; Balakrishna, A.; Baruch, R.; Bauza, M.; Blokzijl, M. Gemini robotics: Bringing ai into the physical world. arXiv; 2025; arXiv: 2503.20020

32. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C. Deepseek-v3 technical report. arXiv; 2024; arXiv: 2412.19437

33. Yang, A.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Huang, H.; Jiang, J.; Tu, J.; Zhang, J.; Zhou, J. Qwen2.5-1M Technical Report. arXiv; 2025; arXiv: 2501.15383

34. Arrieta, A.; Ugarte, M.; Valle, P.; Parejo, J.A.; Segura, S. o3-mini vs DeepSeek-R1: Which One is Safer? arXiv; 2025; arXiv: 2501.18438

35. Zheng, C.; Zhang, Z.; Zhang, B.; Lin, R.; Lu, K.; Yu, B.; Liu, D.; Zhou, J.; Lin, J. Processbench: Identifying process errors in mathematical reasoning. arXiv; 2024; arXiv: 2412.06559

36. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv; 2025; arXiv: 2501.12948

37. Zhu, Q.; Guo, D.; Shao, Z.; Yang, D.; Wang, P.; Xu, R.; Wu, Y.; Li, Y.; Gao, H.; Ma, S. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv; 2024; arXiv: 2406.11931

38. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.D.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G. Evaluating large language models trained on code. arXiv; 2021; arXiv: 2107.03374

39. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging; 2016; 3, pp. 47-57. [DOI: https://dx.doi.org/10.1109/TCI.2016.2644865]

40. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process.; 2004; 13, pp. 600-612. [DOI: https://dx.doi.org/10.1109/TIP.2003.819861]

41. Wang, Z.; Wu, G.; Sheikh, H.R.; Simoncelli, E.P.; Yang, E.-H.; Bovik, A.C. Quality-aware images. IEEE Trans. Image Process.; 2006; 15, pp. 1680-1689. [DOI: https://dx.doi.org/10.1109/TIP.2005.864165]

© 2025 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).