Imaging applications involving outdoor scenes and fast motion require sensing and processing of high-dynamic-range images at video rates. In turn, image signal processing pipelines that serve low-dynamic-range displays require tone mapping operators (TMOs). For high-speed and low-power applications with low-cost field-programmable gate arrays (FPGAs), global TMOs that employ contrast-limited histogram equalization prove ideal. To develop such TMOs, this work proposes a MATLAB–Simulink–Vivado design flow. A realized design capable of megapixel video rates using milliwatts of power requires only a fraction of the resources available in the lowest-cost Artix-7 device from Xilinx (now Advanced Micro Devices). Unlike histogram-based TMO approaches for nonlinear sensors in the literature, this work exploits Simulink modeling to reduce the total required FPGA memory by orders of magnitude with minimal impact on video output. After refactoring an approach from the literature that incorporates two subsystems (Base Histograms and Tone Mapping) to one incorporating four subsystems (Scene Histogram, Perceived Histogram, Tone Function, and Global Mapping), memory is exponentially reduced by introducing a fifth subsystem (Interpolation). As a crucial stepping stone between MATLAB algorithm abstraction and Vivado circuit realization, the Simulink modeling facilitated a bit-true design flow.
1. Introduction
Automotive, smartphone, and other applications have motivated research into high-dynamic-range (HDR) image capture and processing [1,2]. Takayanagi and Kuroda [1] reviewed technologies to capture HDR images, which represent scene luminance from dim shadows to bright highlights without saturation. Such technologies include nonlinear response HDR, linear response HDR with multiple exposures, linear response HDR with single exposures, frame oversampling and photon counting, and linear response HDR with multiple photodiodes. Hajisharif et al. [3] presented a linear-response, single-exposure HDR apparatus, where a complementary metal-oxide-semiconductor (CMOS) active pixel sensor (APS) array uses alternating pixels with low and high gains. Compared to multiple-exposure methods, single exposures reduce motion artifacts in video applications. Brunetti and Choubey [2] elaborated on a recent development with a nonlinear response apparatus. With their experimental results, they demonstrated a logarithmic (log) APS capable of over 160 decibels (dB) of dynamic range with single exposures at video rates.
Regarding image processing that back-ends image capture, there are many possible stages. This paper focuses on ones called tone-mapping operators (TMOs), especially for nonlinear response HDR sensors. As Khan et al. [4] explain, because low-dynamic-range (LDR) images are more suitable for conventional display devices, a TMO serves to map HDR images to LDR images while preserving salient features.
Khan et al. [4] classified TMOs into only two kinds. Global TMOs apply one function to each pixel of an image, although the function may vary with time. By allowing a spatially varying mapping, local TMOs can enhance local details. Like Khan et al. [4], Völgyes et al. [5] favour a histogram-based TMO. Whereas the former adopt a local method for medical X-ray images, the latter prefer a global method, inspired by classic work on contrast-limited histogram equalization from Larson et al. [6] for visible-light images. Whereas Larson et al.’s method determines contrast limits to mimic a human visual system (HVS) model that is agnostic with respect to sensor details, Li et al. [7] modified this approach to actually prevent sensor noise in nonlinear HDR inputs from becoming visible in LDR outputs. While demonstrated only for a visible-band application, the latter approach is equally applicable to invisible-band modalities. Other authors such as Rana et al. [8] have investigated learning-based approaches for TMOs, where the LDR output is meant for a machine as opposed to a human observer.
In scenarios where only an LDR output is retained, after applying a TMO to an HDR input, inverse-TMO mappings can recover HDR images for subsequent processing by machine observers. Gunawan et al. [9] and Tade et al. [10] elaborate on image quality assessments that are especially suited for such applications. These assessments, which fall into full reference, no reference, and reduced reference categories, quantify the impact of visual artifacts that are more likely to be produced by certain TMOs than others. With test cases, visual artifacts may also or instead be evaluated by human observers. Both Gunawan et al. and Tade et al. have discussed the tension between objective and subjective methods. Each has advantages and disadvantages. Nevertheless, in an image processing pipeline, producing an LDR output to be displayed to a human observer may not prevent retention of an HDR input for subsequent processing by a machine observer.
For real-time tone mapping of video, there is a difference between a method and an apparatus. The latter emphasizes hardware. Ou et al. [11] reviewed 60 methods for real-time tone mapping realized as apparatus. New figures of merit beyond image quality assessments become important. These include apparatus complexity and cost, supported frame rates, and power consumption. Whereas graphics processing units (GPUs) support complex local TMOs at video rates, they have high unit costs and high power consumption. Field-programmable gate arrays (FPGAs) have low unit costs, demand less power, and are reconfigurable, making them ideal for global and simpler local TMOs at video rates. More recent works by Kashyap et al. [12] and Muneer et al. [13] make the same points.
In a previous apparatus-focused publication [14], we presented a bit-true design flow to realize the contrast-limited histogram equalization method of Li et al. [7] in Xilinx and Altera FPGAs. Realized designs for megapixel resolutions consuming only tens of milliwatts (mW) at video rates fit within the fifth simplest device of the Spartan-6 family, the lowest-cost family in production by Xilinx at the time of the research. Our model-based design flow involved MATLAB and Xilinx’s Integrated Synthesis Environment (ISE).
Xilinx is now owned by Advanced Micro Devices (AMD). Today, AMD offers four families of Xilinx 7-Series FPGA devices [15]. In order of their increasing levels of sophistication and unit cost, these are Spartan-7, Artix-7, Kintex-7, and Virtex-7. Integrated within AMD Zynq-7000 platforms [16], Artix-7 devices are the lowest-cost FPGAs suitable for such system-on-chip (SoC) platforms. Moreover, ISE has been replaced by Vivado, also from AMD, to design, build, and test circuits expressed in a hardware-description language (HDL), e.g., very-high-speed integrated-circuit HDL (VHDL). For this work, we used Vivado 2022.2, together with MATLAB & Simulink R2023a.
In this paper, we propose and validate a new design flow for a contrast-limited histogram equalization apparatus, following Li et al.’s method [7] for nonlinear response HDR sensors. The flow results in circuit designs, called TMO 2025 systems, that significantly outperform those which can be obtained with our previous design flow [14]. We verify and evaluate designs using the
Our new design flow leverages Simulink modeling to reduce the number of random-access memory (RAM) bits required by an exponential factor without compromising the number of logic cells required, the maximum frequency of operation, or the power consumption at megapixel video rates. We target the simplest device of the Artix-7 family and remodel our previous realizations, called TMO 2021 systems, with the new design flow to facilitate explanation and for comparative evaluation. The results show, consistent with structural similarity (SSim) scores [19], negligible impact on video output of a lower-complexity TMO 2025 system with respect to the TMO 2021 reference [14].
Although Hai et al. [20] have advocated using Simulink, which accompanies MATLAB, and its HDL Coder blockset to develop an FPGA circuit, we used neither in our previously published TMO research [14]. After summarizing the productivity gains of their design flow, Hai et al. commented on how Simulink models can facilitate classroom explanations of a system for edge detection and human height estimation.
The rest of this paper is structured as follows. Section 2 starts with a design overview of TMO 2021 and TMO 2025 systems, then continues with design details (featuring Simulink models) of five subsystems, called Scene Histogram, Perceived Histogram, Tone Function, Global Mapping, and Interpolation. Section 3 addresses bit-true verification and timing constraint analysis of VHDL models developed from the Simulink ones using Vivado. The same section evaluates circuit realizations for a variety of video formats, focusing on complexity, speed, and power. Finally, Section 4 summarizes the motivation, scope, and main contributions of this work.
2. Apparatus and Methods
The remodeled TMO 2021 and novel TMO 2025 systems have four main parameters: the number of pixels, n, per frame; the video rate, f, in frames per second (fps); the histogram bin size, , in bits; and the contrast limit, , which is a histogram ceiling. Starting with a design overview, this section presents subsystem designs in sequence for 16-bit unsigned integer HDR input and 8-bit unsigned integer LDR output.
2.1. Design Overview
Figure 1 presents Simulink models of the TMO 2021 and TMO 2025 systems. Using a hierarchical design with multiple subsystems, which is an ideal way to organize complex circuits, these are top-level models showing system inputs and outputs plus subsystem interconnects, all of which carry sample-based signals.
In the sample-based approach, an HDR video input,
With the Simulink models, this work flows naturally from algorithm to circuit design, capturing novel and significant aspects of the latter without requiring details of VHDL representations to be presented. Frame-based explanations of MATLAB classes define “methods” of the TMO 2021 and TMO 2025 systems, whereas sample-based illustrations of Simulink models define “apparatuses” of the same.
The TMO 2021 design has, after remodeling, the four subsystems shown and named in Figure 1. In contrast, the TMO 2025 design, while having the same four subsystems in the same sequence, adds a new subsystem, Interpolation, and a second output to a subsystem, Global Mapping. The presented models indicate the data types of all signals. Compared to the TMO 2021, the TMO 2025 system has
The Scene Histogram subsystem computes the current frame’s histogram while reading out the previous frame’s histogram. The Perceived Histogram subsystem applies a frame-based low-pass filter (LPF) to scene histogram inputs. The Tone Function subsystem outputs a normalized cumulative sum of a modified histogram, computed from a perceived histogram and the contrast limit. The Global Mapping subsystem stores the tone function currently being computed, while applying a previously computed tone function to the HDR input. Using least-significant bits (LSBs) of the HDR input to refine LDR output from the Global Mapping, the Interpolation subsystem produces the LDR output of the TMO 2025 design. Consistent with a global TMO, the Interpolation subsystem does not employ neighbouring pixels.
The Scene and Perceived Histogram subsystems require three RAMs, each of a wordlength, , that depends on the number of pixels. For the TMO 2021 and TMO 2025 designs, the Global Mapping subsystem requires two RAMs that have single-width and double-width wordlengths: 8 and 16, respectively. Considering the number of RAM words, the Simulink modeling predicts the memory in kilobits (Kb), and , required by TMO 2021 and TMO 2025 realizations as follows:
(1)
(2)
Before remodeling, the TMO 2021 system had multiple control signals. These are now represented by one multi-bit unsigned integer (
2.2. Scene Histogram
Shown in Figure 2, the Scene Histogram subsystem computes and writes, to a RAM, the histogram of the current video frame, called frame k, and simultaneously reads out, from another RAM, the histogram of the previous video frame, called frame . Each RAM with additional circuitry defines a RAM system, which is technically a sub-subsystem.
When one control bit,
A histogram of one video frame, an image, is a count of how many pixels have values that fall within one of multiple non-overlapping bins, where the bins cover all possible pixel values. We can model the sample-based scalar input,
(3)
The range of the MSBs,
The sample-based histogram output,
(4)
Each RAM system, which counts or reads out a histogram, receives two control bits,
In the read-out mode, when
In Figure 2, the Scene Histogram subsystem’s first output,
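The binning scheme above can be sketched in Python (an illustrative model, not the MATLAB or VHDL of the actual design flow); the function name and default parameters are ours, assuming a 16-bit HDR input binned by its MSBs:

```python
def scene_histogram(pixels, bin_bits=8, pixel_bits=16):
    """Count how many pixels fall into each bin; the bin address is the
    pixel's most-significant bits, obtained by dropping bin_bits LSBs."""
    num_bins = 1 << (pixel_bits - bin_bits)
    hist = [0] * num_bins
    for p in pixels:
        hist[p >> bin_bits] += 1  # MSBs select the bin
    return hist
```

Each extra bit of bin size halves the number of bins, which is why the histogram RAM shrinks exponentially with the bin size in bits.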
2.3. Perceived Histogram
Figure 3 presents the Perceived Histogram subsystem, which computes a perceived histogram,
Although designed in sample-based fashion, the subsystem is easier to explain with equations using frame-based histogram signals, for
(5)
where
(6)
(7)
(8)
In the case of negligible round-off error, i.e., given a sufficiently large scale factor, , with respect to parameter ratios, the model simplifies as follows:
(9)
(10)
After the first two frames, representing an internal-state initialization where the perceived histogram, , equals the scene histogram, , the model further simplifies into a frame-based difference equation in forward recursion form as follows:
(11)
Like the TMO 2021 system, the TMO 2025 system adopts established values, s and 8, for the aforementioned constants, and s. The frame period, , derives from a frame rate, f, of popular video formats. The LPF coefficients, and , have
The Perceived Histogram subsystem employs one dual-port RAM. The wordlength, , of each histogram sample on paths from input to output, i.e., from
Scene histogram bin values,
The weighted-sum sub-subsystem has two parallel multiplications with constants followed by an addition. Favouring high-speed operation via sequential logic, a unit delay follows each arithmetic operation. One product takes the scene histogram as an input and its filter coefficient, , as an argument. The other product takes the previous perceived histogram as an input and its filter coefficient, , as an argument. The previous perceived histogram comes from the read port,
When a control bit,
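The frame-to-frame low-pass filtering can be sketched as follows, an illustrative Python model rather than the sample-based Simulink realization; here alpha stands in for the filter coefficient that the design derives from the time constant and frame period:

```python
def perceived_histogram(scene_hists, alpha):
    """Bin-wise first-order low-pass filter across frames:
    P[k] = alpha * S[k] + (1 - alpha) * P[k-1], initialized with P = S."""
    perceived = list(scene_hists[0])  # internal state starts at the scene histogram
    out = [list(perceived)]
    for s in scene_hists[1:]:
        perceived = [alpha * sb + (1 - alpha) * pb
                     for sb, pb in zip(s, perceived)]
        out.append(list(perceived))
    return out
```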
2.4. Tone Function
Figure 4 presents the Tone Function of the TMO 2025 system. The subsystem outputs a
The single-width tone function,
A normalized cumulative sum of a scene or perceived histogram would yield a tone function for histogram equalization. Because histogram values specify slopes of the tone function, we realize contrast-limited histogram equalization by clipping values to a ceiling,
Ideally, we divide the cumulative sum by its maximum, the end value in a sample-based approach, and round off after multiplying by 255. By normalizing instead with the maximum value of the previous cumulative sum, we gain considerable efficiency and sacrifice little accuracy. The modified histogram follows the perceived histogram, which, unlike the scene histogram, always changes slowly from frame to frame.
Consider a frame-based version, , of the sample-based histogram,
(12)
(13)
(14)
(15)
where, to help express a required
(16)
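A frame-based Python sketch of the contrast-limited equalization follows; note that the hardware instead streams bin values from high to low addresses (yielding an inverting map) and normalizes by the previous frame's maximum, whereas this sketch uses the ideal end-of-frame division:

```python
def tone_function(perceived, ceiling):
    """Contrast-limited histogram equalization: clip bin values to the
    ceiling, accumulate, then normalize the cumulative sum to 8 bits."""
    modified = [min(v, ceiling) for v in perceived]  # limit tone-curve slope
    cum, total = [], 0
    for v in modified:
        total += v
        cum.append(total)
    return [round(255 * c / cum[-1]) for c in cum]   # ideal division shown here
```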
In Figure 4, the Tone Function employs a sub-subsystem to realize a normalization that computes without division. Two nonidealities, a previous rather than current divisor and an approximate rather than exact division, mean that the normalization requires a 1-to-256 saturation before minus-one blocks. After the minus-one blocks, we obtain the single-width tone function,
Whether rounding up, down, or off, after scaling by a positive power of two, division of non-constant signals would demand significant FPGA resources. We replace the division in the Tone Function model by multiplication with a non-constant multiplier, A, and with scaling by a negative power of two, , a right-shift operation:
(17)
(18)
Starting from an initial guess, , we compute the multiplier, A, through a feedback process, observing that an expression, , should equal 256 at the end of each frame period. If the expression is less than or equal to 128, we double the multiplier. If it is greater than or equal to 512, we halve the multiplier. There are 383 possible values in between these limits that we handle with an LUT. The expression of interest is
(19)
To update the multiplier, A, we designed a 385-word LUT to retrieve a positive integer, R, that with a multiplication, a bit-shift, and a rounding-off results in a good prediction of the best multiplier, A, to use during the next frame period:
(20)
(21)
where, to facilitate a realization of the constant LUT,
(22)
Moreover, the multiplier computation entails a saturation, indicated in Figure 4, whereby any result less than a minimum, , saturates to the minimum and any result greater than a maximum, , saturates to the maximum. These values are
(23)
(24)
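The feedback update can be sketched as follows; in the in-range branch this sketch uses an exact ratio where the hardware instead retrieves a precomputed integer R from the 385-word LUT, and the additional saturation to the limits of Equations (23) and (24) is omitted:

```python
def update_multiplier(A, S):
    """Update the division-replacing multiplier A, given the end-of-frame
    check expression S, which should ideally equal 256."""
    if S <= 128:
        return 2 * A            # result too small: double the multiplier
    if S >= 512:
        return A // 2           # result too large: halve the multiplier
    return round(A * 256 / S)   # in-range correction (LUT-based in hardware)
```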
As shown in Figure 4, the Tone Function does not use the video input,
2.5. Global Mapping
The Global Mapping subsystem, presented in Figure 5, employs two single-port RAMs in ping-pong fashion. During each frame period, the ping RAM exclusively performs read operations and the pong RAM exclusively performs write operations or no operations at all. Roles reverse each frame. Because neither RAM must perform read and write operations at the same time, we do not employ dual-port RAMs.
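The ping-pong scheme amounts to swapping read and write roles at every frame boundary, as the following Python sketch illustrates (the names are ours, not the design's):

```python
def ping_pong_frames(num_frames):
    """Assign (read-only, write-only) roles to two RAMs, swapping the
    roles at each frame boundary as a toggling control bit would."""
    roles = []
    ping, pong = "RAM A", "RAM B"
    for _ in range(num_frames):
        roles.append((ping, pong))  # ping reads, pong writes this frame
        ping, pong = pong, ping     # swap roles for the next frame
    return roles
```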
The subsystem takes an HDR input,
(25)
(26)
where the function, , is a nonlinear mapping from two vector signals, and , with the latter delayed by one frame, meaning stored in RAM memory, to one vector signal, . Frame-based signals, and , represent sample-based signals. We model the vector-to-vector tone function, , as the entrywise (global) application of a scalar tone function, , to every entry, y, of one vector argument, , as follows:
(27)
Each entry, y, is an unsigned integer with well-defined limits, where m equals . The scalar function, , is a piecewise function defined by its second argument, . The form of this function is that of an input-dependent LUT:
(28)
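The input-dependent LUT behaviour can be sketched in Python, assuming for illustration that the stored tone function is a 256-entry table addressed by the binned pixel:

```python
def global_mapping(hdr_pixels, tone_lut, bin_bits=8):
    """Entrywise global mapping: the pixel's MSBs address the stored
    tone function, whose entry becomes the LDR output code."""
    return [tone_lut[p >> bin_bits] for p in hdr_pixels]
```

For example, with tone_lut = list(range(256)), the mapping reduces to the identity staircase on the binned input.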
Global mapping from the HDR input,
Returning to Figure 5, simple bit extraction, which is practically a zero-cost operation in an FPGA circuit, yields the binned video input,
While a control bit,
The pong RAM stores a vector of data, using addresses supplied by MSBs of the control input,
The TMO 2025 system takes advantage of the LSBs of the video input,
(29)
Figure 5 presents the TMO 2025’s subsystem only. Three differences specify TMO 2021’s corresponding design. First,
2.6. Interpolation
Figure 6 presents the Interpolation subsystem of the TMO 2025 system. Interpolation provides a contribution to the final LDR output,
In the figure, “
Consider a scalar function, T, that defines an entrywise vector-to-vector mapping, , of the HDR input, , to the TMO 2021’s LDR output, denoted below, where a single-width tone function, , defines the scalar function:
(30)
(31)
(32)
(33)
Because the low byte of the double-width tone function, , exactly equals the single-width tone function, , the TMO 2021’s single-width LDR output, , exactly equals the low byte of TMO 2025’s double-width LDR output, :
(34)
What the TMO 2025 system does, by virtue of the Interpolation subsystem, is to take advantage of the high-byte, , of the double-width LDR output, , with the help of another input, , the LSB remainder of the original HDR input, :
(35)
(36)
The LSBs, , are a nonnegative fractional part of the binned HDR input, . Entries of the fractional part fall in an open interval, . If every entry were exactly 1, we could apply the vector-to-vector mapping, , to the binned HDR input plus one, . The result would exactly equal the high byte, , of the double-width LDR output:
(37)
Our model requires a small correction, a saturation, to the scalar function, T. With this correction, as follows, the scalar function can represent the incremental output, , for a special case input, y, corresponding to a saturation value, :
(38)
The TMO 2025 system takes advantage of the LSBs, , by entrywise linear interpolation between the low-byte output, , and the high-byte output, , as follows:
(39)
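A Python sketch of this weighting follows, with integer arithmetic and truncation standing in for the exact rounding of Equation (39):

```python
def interpolate(low, high, remainder, bin_bits=8):
    """Linearly interpolate between the low-byte and high-byte LDR codes,
    weighted by the LSB remainder of the HDR input."""
    scale = 1 << bin_bits  # remainder ranges over [0, scale)
    return ((scale - remainder) * low + remainder * high) // scale
```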
This formula guarantees the Interpolation subsystem output, , has the same range, , as the low-byte result, . Where remainder bits, , approach zero, the output approaches the low-byte result. Where they approach their maximum, , the output approaches the high-byte result, . In between, a linearly weighted sum is obtained.
3. Results and Discussion
This section begins by presenting the general approach and example results by which we verified Simulink models and corresponding FPGA realizations in bit-true fashion. Then, it compares the visual quality achieved by three designs for one video format. The evaluation focuses on circuit specifications achieved by three designs for five video formats.
3.1. Verification
Following the Simulink models, we developed the TMO 2021 and TMO 2025 systems as VHDL designs suitable for FPGA implementation. Using Vivado, we synthesized circuit realizations for Artix-7 target devices. Given test cases of video and control inputs,
After the functional verification, synthesized TMO 2021 and TMO 2025 designs underwent translation and mapping and place and route, using Vivado, for the three simplest devices of the Artix-7 FPGA family from Xilinx, now AMD. The simplest device, XC7A12T, has up to 12,800 logic cells and up to 912,384 memory bits available for design synthesis. To fit this device, therefore, the memory required by the five RAMs must not exceed 891 Kb for a given design.
We synthesize a specific FPGA circuit from a generic VHDL design by choosing specific parameters, like the number of pixels, n, the frame rate, f, the histogram bin size, , and the contrast limit, . Given a synthesized circuit, Vivado tools report data that can be used to summarize the circuit’s complexity, in particular the number of logic cells and memory bits utilized. The smaller these are, relative to the available resources, the more feasible it becomes to include other circuits, besides the TMO, on the same FPGA to complete a multi-stage image processing pipeline.
After translation and mapping and place and route, we used static timing analysis (STA) to predict the maximum frequency of operation in a target device. A synthesized circuit, either for the TMO 2021 or TMO 2025 system, requires one clock frequency, i.e., the sample rate (not to be confused with the frame rate). During STA, we varied the clock frequency from low to high—actually, we varied the clock period from high to low in nanosecond (ns) decrements—until Vivado predicted a timing violation.
Examples of timing violations include failures to meet the setup-and-hold requirements of a logic cell on a critical circuit path. Violations could occur due to propagation delays of combinational logic subcircuits, considering place-and-route details. We designed the TMO 2021 and TMO 2025 systems, using Simulink, to intersperse combinational logic with delays, e.g., unit-delay blocks. Using Vivado, these map to FPGA resources configured for sequential logic, i.e., pipelined elements that facilitate high-frequency operation.
The maximum frequency that ensures functional correctness, as predicted by Vivado after the place and route, may be compared to the sample rate associated with the number of pixels and the required frame rate. If the maximum frequency is less than the sample rate then the circuit will not work in the target device at the required frame rate. Targeting a more complex device in the same Artix-7 family could solve the problem. Alternately, another device family may be required. This work limits its scope to the simplest devices of the lowest-cost AMD family, i.e., Artix-7, suitable for SoC platforms.
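The feasibility check reduces to comparing the post-route maximum frequency with the required sample rate; the sketch below assumes one pixel per clock cycle:

```python
def supports_format(n_pixels, frame_rate_fps, f_max_hz):
    """A device supports a format when the maximum frequency from STA
    meets or exceeds the sample rate, i.e., pixels per frame times fps."""
    sample_rate_hz = n_pixels * frame_rate_fps
    return f_max_hz >= sample_rate_hz
```

For instance, FHD at 30 fps needs roughly 62 MHz, while 4KUHD at 30 fps needs roughly 249 MHz.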
The TMO 2021 and TMO 2025 systems require input and control signals to function. For testing purposes, whether in Simulink or Vivado, we produced signals in MATLAB from a video file called
Table 1 lists the video formats of interest. Two of them, the full HD (FHD) and 4K ultra HD (4KUHD) ones, are megapixel formats. Two of them, the video-graphics-array (VGA) and half-quarter VGA (HQVGA) ones, are sub-megapixel formats. The VGA format may also be called the standard-definition (SD) format in the literature.
The
(40)
In this frame-based model, where k is the frame number, F is an entrywise monotonic function that converts a frame of scene luminances, , to a frame of sensor responses, , that define the HDR input. The model incorporates zero-mean Gaussian noise, , that is independently and identically distributed from sample to sample.
Figure 7 plots the sensor function, F, as well as the noise standard deviation, . The sensor file specifies the function and the noise standard deviation for a log APS and a log DPS image sensor based on experimentally obtained data. This work uses only the latter. Although we mathematically model the video input in frame-based fashion, during simulation, it streams through Simulink models and Vivado realizations of TMO 2021 and TMO 2025 systems in a sample-based fashion, the same as the control input.
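An illustrative Python model of such an inverting log sensor follows; the gain, offset, and noise values are placeholders of our own, not the experimentally measured data in the sensor file:

```python
import math
import random

def log_sensor_response(luminance, gain=-3000.0, offset=60000.0, sigma=20.0):
    """Map scene luminance to a noisy 16-bit sensor response using an
    inverting logarithmic curve plus zero-mean Gaussian noise."""
    response = offset + gain * math.log10(luminance)
    noisy = response + random.gauss(0.0, sigma)
    return max(0, min(65535, round(noisy)))  # clamp to 16-bit unsigned range
```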
Figure 8 illustrates the control signalling of the TMO 2025 system juxtaposed with example video input and output signals. The remodeled TMO 2021 design adopts exactly the same control strategy and takes the same video input. We partially automated the bit-true verification. Invoking Simulink, MATLAB scripts verified frame-based models automatically. Additionally, the Simulink model and Vivado circuit designers manually shared and compared sample-based files of input, output, and selected intermediate signals. The results were identical, whether produced using Simulink or Vivado.
The control input,
System designs employ RAMs to store all histograms and tone functions. The designs configure pairs of RAMs, some dual-port, in ping-pong fashion. One RAM, the ping RAM, performs reads alone while the other, the pong RAM, performs writes only or reads and writes. In each pair, ping and pong RAMs operate in parallel without interference. Each frame period, the RAMs of each pair swap roles. What was the ping RAM becomes the pong RAM and vice versa. One control bit,
3.2. Evaluation
Figure 9 presents the LDR output of two TMO 2021 systems and one TMO 2025 system for the same HDR input,
The figure primarily demonstrates a qualitative improvement in tone mapping, from the TMO 2021 to the TMO 2025 system, given an equal bin size in bits that corresponds to reduced memory requirements. Especially visible as blotchy textures on upper window regions, mapping artifacts appear in the TMO 2021 results where the bin size in bits equals 8. They are absent in the TMO 2025 results where the bin size in bits likewise equals 8. The same figure illustrates, whether by the TMO 2021 or the TMO 2025 system, the usefulness of a contrast limit to histogram equalization. Without it, window regions exaggerate sensor noise and too little contrast remains to show bench textures clearly. These subjective observations agree with objective assessments, i.e., SSim scores computed via MATLAB’s
Figure 10 illustrates key intermediate signals, namely histograms and tone functions, corresponding to a processing of the
Whereas the scene and perceived histograms stream through an FPGA implementation one sample at a time in real time, Figure 10 plots bin values, i.e., counts of particular ranges of pixel intensities, on a y-axis versus pixel intensity on the x-axis. With a second y-axis, the figure also shows the effective tone functions of the TMO 2021 and TMO 2025 systems applied to the video input for the given frame.
Tone functions equate to a normalized cumulative sum of a modified histogram, derived from the perceived histogram and a constant, . The systems compute the normalized cumulative sum in real time using histogram bin values that stream from high bin addresses to low bin addresses, i.e., from right to left in Figure 10. As a result, the mapping functions are inverting, an inversion that compensates for the inverting response, shown in Figure 7, of the chosen log DPS sensor.
Mapping functions are contrast-limited; they have a maximum possible (negative) slope. We achieve this by utilizing a modified histogram that equals the perceived histogram only when it is lower than a “contrast limit” threshold, illustrated in Figure 10. Otherwise, the modified histogram equals the threshold, a ceiling calculated as follows:
(41)
Here, the number of pixels, n, the histogram bin size, , and the noise standard deviation, , are parameters. With this ceiling, tone functions determine LDR outputs while limiting the visibility after tone mapping of the sensor noise present in HDR inputs.
Figure 10 shows that the mapping function of the TMO 2021 system has a staircase shape, whereas that of the TMO 2025 system does not. By virtue of its final Interpolation subsystem, the latter brings back discarded bits as remainder bits in a real-time computation that smooths the staircase. One way to reduce the staircase effect of the TMO 2021 system is to reduce the number of discarded bits, equal to the histogram bin size in bits, . However, doing so leads to an exponential increase in the RAM memory required to store the histograms and mapping functions.
Table 2 presents results obtained with Vivado. These include circuit complexity and maximum frequency for five video formats, HQVGA to 4KUHD. For each format, three designs underwent circuit synthesis, translation and mapping, place and route, and STA. As with the examples in Figure 9, contrast limits correspond to the log DPS sensor.
In Table 2, the first two designs realize the TMO 2021 system for two values, 2 and 8, of the histogram bin size in bits, . Although the lower bin size yielded the reference results, shown in Figure 9, it required a significantly more complex circuit in terms of memory utilization for all video formats. With Artix-7 devices, Vivado can allocate block RAM memory in 18 Kb increments. Apart from five such blocks, one for each histogram and mapping RAM, Vivado allocated some non-block or distributed RAM memory to support, for example, the LUT of the Tone Function subsystem.
For equal parameters, the TMO 2021 and the TMO 2025 systems have comparable complexity, as shown in Table 2. With the latter design, we consider only one value, 8, of the bin size in bits, . As shown in Figure 9, with this bin size for the HQVGA format, the TMO 2025 system produces video output comparable to that of the TMO 2021 system with a smaller bin size, 2. However, Table 2 shows that the TMO 2025 design of similar visual quality enjoys an order of magnitude reduction in required memory.
Compared to the TMO 2021, the TMO 2025 design requires a bit more logic, mainly because it involves an extra subsystem: Interpolation. It also doubles the width of some buses and RAMs. Required memory grows with the logarithm of the number of pixels. Because read/write logic synthesizes in proportion to RAM sizes, we find a weak dependence of logic on the number of pixels. Relative to resources available in the simplest Artix-7 device, logic requirements are low and nearly independent of video format.
Table 2 shows that simple Artix-7 devices support all video formats except 4KUHD for all designs. A device supports a video format when the maximum frequency, as determined by STA, exceeds the required sample rate for the chosen frame rate. For all supported cases, the table reports the static and dynamic power of the circuit at 30 fps, as determined by Vivado. Static power, on the order of 100 mW, is approximately constant. Maximum frequency tends to decrease as the number of pixels increases, most likely because timing constraints are harder to satisfy when larger RAMs are placed and routed.
Dynamic power depends on the circuit, its sample rate, and the device. Given equal parameters, power consumption hardly increases from the TMO 2021 to the TMO 2025 design, despite the extra circuit complexity. Given a fixed frame rate, the dynamic power increases with the number of pixels. In all reported cases, the dynamic power is less than 100 mW, the order of the static power. By this measure, circuit designs are power-efficient. However, due to an exponential increase in RAM memory, the TMO 2021 design requires more dynamic power than the TMO 2025 design for artifact-free tone mapping.
4. Conclusions
Automotive, smartphone, and other applications have motivated research into TMOs, which are useful stages in image processing pipelines for HDR video. Given an HDR input, a TMO produces an LDR representation. Compared to local TMOs, global TMOs are especially suitable for FPGA realizations. This work does not investigate local TMOs and takes a model-based, rather than learning-based, approach to a class of global TMOs, contrast-limited histogram equalization, as tailored to nonlinear HDR sensors. The work presents an in-depth comparison to one previously published design, the TMO 2021.
To realize an exponential improvement, the TMO 2025 design, this work developed a MATLAB–Simulink–Vivado design flow, applying it first to remodel the TMO 2021 design. With MATLAB, frame-based algorithms abstract key parts of the TMO. With Simulink, sample-based models incorporate blocks that capture essential features of circuits and that prove convenient for debugging. With the design flow, we productively realized functionally verified systems that meet timing constraints for targeted FPGAs. Vivado simulations of developed VHDL models matched bit for bit with Simulink model results and transitively with MATLAB algorithm results.
After presenting a design overview, this paper detailed the TMO 2021 and TMO 2025 designs, one subsystem at a time. Thanks to the remodeling of the TMO 2021 design with Simulink, small changes to the third and fourth subsystems, called Tone Function and Global Mapping, together with the addition of a fifth subsystem, called Interpolation, yielded the TMO 2025 design. Both designs have identical first and second subsystems, called Scene Histogram and Perceived Histogram. By specifying the number of pixels, the video rate, the histogram bin size, and a contrast limit, the generic designs yield specific systems.
This paper accompanies the sample-based Simulink models, illustrated in figures, with frame-based equations and text for explanatory purposes. In this fashion, we elaborated on the novel TMO 2025 design while compactly articulating exactly how it differs from the reference TMO 2021 design. With the Interpolation subsystem, the tone function maps HDR input to LDR output not via a staircase mapping but via a linearly interpolated refinement of that staircase.
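The refinement can be sketched frame-based as follows (a toy illustration with made-up names, not the paper's sample-based circuit): the staircase assigns every HDR code in a bin the same tone value, whereas the interpolated variant blends linearly between consecutive tone values using the low-order bits of the HDR code.

```python
def staircase(tone, x, bin_bits):
    """Staircase mapping: every HDR code in a bin gets that bin's tone value."""
    return tone[x >> bin_bits]

def interpolated(tone, x, bin_bits):
    """Linearly interpolated refinement of the staircase mapping."""
    i = x >> bin_bits                 # bin index, from the MSBs of the HDR code
    frac = x & ((1 << bin_bits) - 1)  # position within the bin, from the LSBs
    lo = tone[i]
    hi = tone[i + 1] if i + 1 < len(tone) else tone[i]  # clamp at the top bin
    return lo + ((hi - lo) * frac >> bin_bits)

# Toy 4-entry tone function over a 4-bit HDR input with 2-bit bins:
tone = [0, 8, 12, 15]
print([staircase(tone, x, 2) for x in range(16)])
print([interpolated(tone, x, 2) for x in range(16)])
```

Note the interpolated output climbs within each bin instead of holding flat, which removes the banding (staircase) artifacts at coarse bin sizes.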
We realized the TMO 2021 and TMO 2025 designs as systems for five video formats and tested them with an 11 s video. Panning from a shadowed interior to a sunlit hallway, the clip incorporates the response and noise of a nonlinear sensor. Following the Simulink models, we developed FPGA circuits as VHDL models that, with parameter values, underwent synthesis, translation and mapping, and place and route using Vivado, for the simplest Artix-7 devices from Xilinx. We presented circuit results for fifteen realizations—test cases that included megapixel video at 30 fps.
The paper features sample-based and frame-based results from HQVGA realizations of the TMO 2021 design with two values and the TMO 2025 design with one value of a parameter that exponentially affects required memory. The presented results explain a simple control signalling scheme, elaborate on bit-true verification, and illustrate key internal signals, namely scene and perceived histograms as well as tone functions without and with interpolation. Frame-based results also provide a comparison of histogram equalization to contrast-limited histogram equalization, indicating the capability of the latter to limit the visibility of sensor noise in video output.
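That frame-based comparison can be illustrated with a minimal sketch (hypothetical names and a simplified flow; the circuits work sample by sample): plain histogram equalization maps HDR codes through the cumulative histogram, while the contrast-limited variant first clips each bin at a limit, bounding the slope of the tone function and hence the amplification of sensor noise in large, nearly flat regions.

```python
from itertools import accumulate

def tone_function(hist, ldr_levels, limit=None):
    """Tone function from a histogram via (contrast-limited) equalization.

    With limit=None this is plain histogram equalization; otherwise each
    bin is clipped at `limit` before the cumulative sum, which caps the
    slope of the mapping and thus the visibility of noise.
    """
    clipped = [min(h, limit) if limit is not None else h for h in hist]
    cumsum = list(accumulate(clipped))
    total = cumsum[-1]
    return [c * (ldr_levels - 1) // total for c in cumsum]

# A histogram dominated by one noisy, nearly uniform region:
hist = [1, 2, 90, 3, 2, 1, 1, 0]
print(tone_function(hist, 256))            # steep jump at the dominant bin
print(tone_function(hist, 256, limit=10))  # slope capped by the limit
```

Without the limit, the dominant bin consumes most of the LDR range, stretching its noise across many output levels; with the limit, the range is shared more evenly.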
Although the TMO 2021 design enables high-speed, low-power systems in FPGAs, its realizations, when configured for artifact-free video output, require too much RAM to fit the simplest Artix-7 device, the lowest-cost FPGA from Xilinx, now AMD, suitable for SoC platforms. Thanks to an exponential reduction in required memory, with negligible impact on required logic, maximum frequency, and power consumption, the novel TMO 2025 design fits easily.
Conceptualization, M.N. and D.J.; methodology, W.D., M.N. and D.J.; software, W.D., M.N. and D.J.; validation, W.D., M.N. and D.J.; formal analysis, W.D., M.N. and D.J.; investigation, W.D., M.N. and D.J.; resources, M.N. and D.J.; data curation, M.N. and D.J.; writing—original draft preparation, W.D. and D.J.; writing—review and editing, M.N. and D.J.; visualization, M.N. and D.J.; supervision, D.J.; project administration, D.J.; and funding acquisition, D.J. All authors have read and agreed to the published version of the manuscript.
MATLAB code and Simulink models developed for this work, using MATLAB & Simulink R2023a, are archived in a public GitHub repository,
Maikon Nascimento and Dileepan Joseph would like to thank Rui (Rachel) Sun for her valued contributions, via a Mitacs Globalink Research Internship, to an earlier concept and method for tone mapping not used in this paper nor developed as an electronic apparatus.
The authors declare no conflicts of interest.
The following abbreviations are used in this manuscript:
| 4KUHD | 4K ultra HD |
| AMD | Advanced Micro Devices |
| APS | active pixel sensor |
| cd/m² | candelas per metre squared |
| CMOS | complementary metal-oxide-semiconductor |
| dB | decibels |
| DPS | digital pixel sensor |
| FHD | full HD |
| FPGA | field-programmable gate array |
| fps | frames per second |
| GPU | graphics processing unit |
| HD | high-definition |
| HDL | hardware-description language |
| HDR | high-dynamic-range |
| HQVGA | half-quarter VGA |
| HVS | human visual system |
| ISE | Integrated Synthesis Environment |
| Kb | kilobits |
| LDR | low-dynamic-range |
| log | logarithmic |
| LPF | low-pass filter |
| LSB | least-significant bit |
| LUT | look-up table |
| MHz | megahertz |
| MSB | most-significant bit |
| mW | milliwatts |
| ns | nanoseconds |
| RAM | random-access memory |
| s | seconds |
| SD | standard-definition |
| SoC | system-on-chip |
| SSim | structural similarity |
| STA | static timing analysis |
| TMO | tone-mapping operator |
| VGA | video-graphics-array |
| VHDL | very-high-speed integrated-circuit HDL |
Figure 1 Simulink models of tone-mapping operator (TMO) systems. With Simulink, we refactored our 2021 histogram-based TMO design (top) into four subsystems. With a fifth subsystem, Interpolation, our improved TMO 2025 design, bottom, yields an exponential reduction in field-programmable gate array (FPGA) memory requirements. Both designs require a control input,
Figure 2 Scene Histogram of TMO 2021 and TMO 2025 systems. A ping-pong configuration allows the subsystem, top, with two instances of a random-access memory (RAM) sub-subsystem, bottom, and three switches to compute the histogram of a current video frame while, in parallel, reading out a computed histogram of the previous frame. When one control bit,
Figure 3 Perceived Histogram of TMO 2021 and TMO 2025 systems. This subsystem, top, employs a dual-port RAM and a weighted-sum sub-subsystem, bottom, to implement a frame-based low-pass filter (LPF) in sample-based fashion. The LPF smooths, consistent with human perception, sudden changes in light-intensity distribution from a scene histogram input,
Figure 4 The Tone Function subsystem of the TMO 2025 system. The cumulative sum, top, of a modified histogram,
Figure 5 The Global Mapping subsystem of the TMO 2025 system. Using two switches and a control bit,
Figure 6 Interpolation subsystem, in the TMO 2025 system only. This subsystem takes a double-width LDR input,
Figure 7 Simulated logarithmic (log) digital pixel sensor (DPS) array. A log DPS image sensor model, adapted from Nascimento et al. [
Figure 8 Sample-based input and output of the TMO 2025 system. Three LSBs of the control input,
Figure 9 Frame-based output of TMO 2021 and TMO 2025 systems. Enlarged, top, about 8 s after the simulation start for an HDR input,
Figure 10 Frame-based histogram and tone function examples. The left y-axis plots scene and perceived histograms for the 240th frame (8 s mark) of the
Video formats investigated with Simulink and Vivado. The number of pixels and frame rate, in frames per second (fps), determine the sample rate, in megahertz (MHz). They are parameters that influence the logic, memory, and power required by specific field-programmable gate array (FPGA) realizations of generic TMO 2021 and TMO 2025 system designs. Scene and perceived histogram RAMs have wordlengths that depend weakly, as shown, on the number of pixels.
| Video Format Acronym (Acronym Definition) | Number of Pixels, | Frame Rate, | Sample Rate, | Width of Histogram RAMs, |
|---|---|---|---|---|
| HQVGA (half-quarter VGA) | | 30 fps | 1 MHz | 16 |
| VGA (video-graphics-array) | | 30 fps | 9 MHz | 19 |
| HD (high-definition) | | 30 fps | 28 MHz | 20 |
| FHD (full HD) | | 30 fps | 62 MHz | 21 |
| 4KUHD (4K ultra HD) | | 30 fps | 249 MHz | 23 |
Specifications of FPGA circuits, obtained with Vivado. Percentages are with respect to logic and memory available in the simplest Artix-7 device. At
| Video Format | System Design (Bin Size, Bits) | Logic Cells (Utilization) | Memory Bits (Utilization) | Maximum Frequency | Static Power | Dynamic Power |
|---|---|---|---|---|---|---|
| HQVGA | TMO 2021 (2) | 376 ( | 1154 K ( | 182 MHz | 61 mW | 1 mW |
| TMO 2021 (8) | 306 ( | 91 K ( | 164 MHz | 60 mW | 1 mW | |
| TMO 2025 (8) | 442 ( | 92 K ( | 120 MHz | 58 mW | 1 mW | |
| VGA | TMO 2021 (2) | 456 ( | 1316 K ( | 172 MHz | 61 mW | 12 mW |
| TMO 2021 (8) | 395 ( | 92 K ( | 164 MHz | 60 mW | 2 mW | |
| TMO 2025 (8) | 530 ( | 92 K ( | 119 MHz | 58 mW | 2 mW | |
| HD | TMO 2021 (2) | 539 ( | 1424 K ( | 169 MHz | 61 mW | 40 mW |
| TMO 2021 (8) | 467 ( | 92 K ( | 164 MHz | 60 mW | 7 mW | |
| TMO 2025 (8) | 603 ( | 92 K ( | 119 MHz | 58 mW | 7 mW | |
| FHD | TMO 2021 (2) | 525 ( | 1478 K ( | 169 MHz | 61 mW | 94 mW |
| TMO 2021 (8) | 501 ( | 92 K ( | 161 MHz | 60 mW | 15 mW | |
| TMO 2025 (8) | 637 ( | 92 K ( | 119 MHz | 58 mW | 17 mW | |
| 4KUHD | TMO 2021 (2) | 637 ( | 1586 K ( | 167 MHz | Max freq. | |
| TMO 2021 (8) | 565 ( | 92 K ( | 161 MHz | Max freq. | ||
| TMO 2025 (8) | 741 ( | 92 K ( | 119 MHz | Max freq. | ||
1. Takayanagi, I.; Kuroda, R. HDR CMOS Image Sensors for Automotive Applications. IEEE Trans. Electron Devices; 2022; 69, pp. 2815-2823. [DOI: https://dx.doi.org/10.1109/TED.2022.3164370]
2. Brunetti, A.M.; Choubey, B. A Low Dark Current 160 dB Logarithmic Pixel with Low Voltage Photodiode Biasing. Electronics; 2021; 10, 1096. [DOI: https://dx.doi.org/10.3390/electronics10091096]
3. Hajisharif, S.; Kronander, J.; Unger, J. Adaptive dualISO HDR reconstruction. EURASIP J. Image Video Process.; 2015; 2015, 41. [DOI: https://dx.doi.org/10.1186/s13640-015-0095-0]
4. Khan, I.R.; Rahardja, S.; Khan, M.M.; Movania, M.M.; Abed, F. A Tone-Mapping Technique Based on Histogram Using a Sensitivity Model of the Human Visual System. IEEE Trans. Ind. Electron.; 2018; 65, pp. 3469-3479. [DOI: https://dx.doi.org/10.1109/TIE.2017.2760247]
5. Völgyes, D.; Martinsen, A.C.T.; Stray-Pedersen, A.; Waaler, D.; Pedersen, M. A Weighted Histogram-Based Tone Mapping Algorithm for CT Images. Algorithms; 2018; 11, 111. [DOI: https://dx.doi.org/10.3390/a11080111]
6. Larson, G.W.; Rushmeier, H.; Piatko, C. A Visibility Matching Tone Reproduction Operator for High Dynamic Range Scenes. IEEE Trans. Vis. Comput. Graph.; 1997; 3, pp. 291-306. [DOI: https://dx.doi.org/10.1109/2945.646233]
7. Li, J.; Skorka, O.; Ranaweera, K.; Joseph, D. Novel Real-Time Tone-Mapping Operator for Noisy Logarithmic CMOS Image Sensors. J. Imaging Sci. Technol.; 2016; 60, 020404. [DOI: https://dx.doi.org/10.2352/J.ImagingSci.Technol.2016.60.2.020404]
8. Rana, A.; Valenzise, G.; Dufaux, F. Learning-Based Tone Mapping Operator for Efficient Image Matching. IEEE Trans. Multimed.; 2019; 21, pp. 256-268. [DOI: https://dx.doi.org/10.1109/TMM.2018.2839885]
9. Gunawan, I.P.; Cloramidina, O.; Syafa’ah, S.B.; Febriani, R.H.; Kuntarto, G.P.; Santoso, B.I. A review on high dynamic range (HDR) image quality assessment. Int. J. Smart Sens. Intell. Syst.; 2021; 14, pp. 1-17. [DOI: https://dx.doi.org/10.21307/ijssis-2021-010]
10. Tade, S.L.; Vyas, V. Tone Mapped High Dynamic Range Image Quality Assessment Techniques: Survey and Analysis. Arch. Comput. Methods Eng.; 2021; 28, pp. 1561-1574. [DOI: https://dx.doi.org/10.1007/s11831-020-09428-y]
11. Ou, Y.; Ambalathankandy, P.; Takamaeda, S.; Motomura, M.; Asai, T.; Ikebe, M. Real-Time Tone Mapping: A Survey and Cross-Implementation Hardware Benchmark. IEEE Trans. Circuits Syst. Video Technol.; 2022; 32, pp. 2666-2686. [DOI: https://dx.doi.org/10.1109/TCSVT.2021.3060143]
12. Kashyap, S.; Giri, P.; Bhandari, A.K. Logarithmically Optimized Real-Time HDR Tone Mapping With Hardware Implementation. IEEE Trans. Circuits Syst. II Express Briefs; 2024; 71, pp. 1426-1430. [DOI: https://dx.doi.org/10.1109/TCSII.2023.3325942]
13. Muneer, M.H.; Pasha, M.A.; Khan, I.R. Hardware-friendly tone-mapping operator design and implementation for real-time embedded vision applications. Comput. Electr. Eng.; 2023; 110, 108892. [DOI: https://dx.doi.org/10.1016/j.compeleceng.2023.108892]
14. Nascimento, M.; Li, J.; Joseph, D. Efficient Pipelined Circuits for Histogram-based Tone Mapping of Nonlinear CMOS Image Sensors. J. Imaging Sci. Technol.; 2021; 65, 040503. [DOI: https://dx.doi.org/10.2352/J.ImagingSci.Technol.2021.65.4.040503]
15. Xilinx. 7 Series: Product Selection Guide. Tech. Rep., Advanced Micro Devices, 2021. Available online: https://docs.amd.com/v/u/en-US/7-series-product-selection-guide (accessed on 19 March 2025).
16. Xilinx. Zynq-7000 SoC: Product Selection Guide. Tech. Rep., Advanced Micro Devices, 2019. Available online: https://docs.amd.com/v/u/en-US/zynq-7000-product-selection-guide (accessed on 26 March 2025).
17. Kronander, J.; Gustavson, S.; Bonnet, G.; Unger, J. Unified HDR reconstruction from raw CFA data. Proceedings of the IEEE International Conference on Computational Photography; Cambridge, MA, USA, 19–21 April 2013; pp. 1-9. [DOI: https://dx.doi.org/10.1109/ICCPhot.2013.6528315]
18. Mahmoodi, A.; Li, J.; Joseph, D. Digital Pixel Sensor Array with Logarithmic Delta-Sigma Architecture. Sensors; 2013; 13, pp. 10765-10782. [DOI: https://dx.doi.org/10.3390/s130810765] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/23959239]
19. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process.; 2004; 13, pp. 600-612. [DOI: https://dx.doi.org/10.1109/TIP.2003.819861] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/15376593]
20. Hai, J.C.T.; Pun, O.C.; Haw, T.W. Accelerating Video and Image Processing Design for FPGA using HDL Coder and Simulink. Proceedings of the IEEE Conference on Sustainable Utilization and Development in Engineering and Technology; Selangor, Malaysia, 15–17 October 2015; pp. 28-32. [DOI: https://dx.doi.org/10.1109/CSUDET.2015.7446221]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).