This thesis addresses a fundamental problem in contemporary computer vision: perceptual alignment, that is, the extent to which deep neural networks perceive and interpret the visual world as human observers do. Deep learning has driven revolutionary advances across a wide array of tasks, including image segmentation, object classification, and complex multimodal reasoning, yet the connection between these artificial systems and the mechanisms underlying human visual perception remains tenuous. Despite impressive accuracy metrics, deep neural networks often behave in ways that diverge from human perception, particularly under visual ambiguity or in the presence of subtle contextual cues. This work investigates the interplay between biological plausibility, computational performance, and perceptual alignment, combining bio-inspired architectural modifications with psychophysically grounded evaluation protocols to rigorously quantify the similarities and divergences between machine and human vision.
The first part of the thesis focuses on biologically inspired computational mechanisms, examining their capacity to enhance robustness while maintaining computational efficiency. Specifically, we investigate the integration of divisive normalization (DN), a canonical computation observed in the early visual cortex, into state-of-the-art segmentation architectures, including variants of the widely used U-Net. Divisive normalization is a gain-control mechanism that modulates neural responses based on local contrast, and it has been implicated in numerous low-level perceptual phenomena observed in human vision. Experimental results demonstrate that models incorporating DN exhibit increased robustness under adverse environmental conditions, such as fog, low lighting, or reduced contrast, achieving improved segmentation performance with only minimal increases in model complexity or parameter count. However, when these biologically inspired models are evaluated using the Decalogue, a battery of psychophysical tests assessing low-level visual phenomena, including contrast sensitivity and contextual masking, they fail to reproduce human-like perceptual behaviors. These findings suggest a nuanced conclusion: while biologically inspired computations can improve robustness and generalization, they do not inherently induce human-like perceptual characteristics. This underscores the distinction between improving task performance and achieving genuine perceptual alignment with human observers.
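The gain-control computation described above can be sketched in a few lines. This is a minimal, illustrative formulation of spatial divisive normalization, not the specific layer integrated into the segmentation models: it assumes nonnegative (e.g., post-ReLU) activations, a uniform square pooling kernel, and hand-set exponent and semisaturation constants, whereas practical DN layers typically use learned or Gaussian pooling over space and feature channels.

```python
import numpy as np

def divisive_normalization(x, sigma=0.1, n=2.0, pool_size=3):
    """Spatial divisive normalization of a nonnegative feature map.

    Each response is divided by pooled activity in its neighborhood:
        y_i = x_i**n / (sigma**n + mean_j x_j**n)
    so strong local contrast suppresses individual responses (gain control).
    """
    xn = x ** n
    pad = pool_size // 2
    # edge padding keeps the output the same shape as the input
    padded = np.pad(xn, pad, mode="edge")
    pooled = np.zeros_like(xn)
    for di in range(pool_size):
        for dj in range(pool_size):
            pooled += padded[di:di + x.shape[0], dj:dj + x.shape[1]]
    pooled /= pool_size ** 2  # uniform pooling kernel (illustrative choice)
    return xn / (sigma ** n + pooled)
```

The characteristic behavior is response saturation: multiplying the input by a large factor increases the output only sublinearly, which is the property linked to contrast-dependent perceptual phenomena.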
The second part of the thesis develops systematic methodologies to measure perceptual alignment beyond traditional accuracy metrics, moving toward behaviorally and psychophysically informed evaluation frameworks. This includes developing the Decalogue for low-level phenomena, devising novel procedures for assessing chromatic discrimination via MacAdam ellipses, evaluating contrast sensitivity function (CSF) responses in multimodal language models (MLLMs), and establishing a framework to quantify abstraction levels in vision-language models such as CLIP. The empirical results obtained through these methodologies reveal critical insights into the factors that influence alignment. For instance, neural networks trained on richer chromatic distributions generate discrimination ellipses that more closely approximate human color perception, highlighting the importance of the visual environment and data diversity. CSF evaluations reveal that even advanced MLLMs exhibit marked limitations in reproducing basic human sensitivities to spatial frequency, suggesting persistent gaps in low-level perceptual fidelity. In CLIP, alignment is found to vary across network layers: early layers, which encode primarily texture-based information, exhibit moderate alignment with human perception, whereas later layers, influenced by linguistic supervision, increasingly abstract visual representations toward semantic concepts. This abstraction enhances model robustness and task generalization but diminishes alignment with low-level human perceptual behaviors.
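The MacAdam-ellipse methodology reduces, at its core, to fitting an ellipse to discrimination thresholds measured in several directions around a reference chromaticity. The sketch below shows only that fitting step, under simplifying assumptions: the function name is hypothetical, the input is a set of threshold offsets in a 2D chromaticity plane, and the ellipse is assumed centered on the reference point so it can be fit as a quadratic form by least squares.

```python
import numpy as np

def fit_discrimination_ellipse(points):
    """Fit a centered ellipse a*x^2 + 2*b*x*y + c*y^2 = 1 to threshold points.

    points: (N, 2) array of discrimination-threshold offsets measured in
    N directions around a reference chromaticity (N >= 3).
    Returns the coefficients (a, b, c) of the quadratic form.
    """
    x, y = points[:, 0], points[:, 1]
    # each row of M evaluates the quadratic-form basis at one threshold point
    M = np.stack([x ** 2, 2 * x * y, y ** 2], axis=1)
    coeffs, *_ = np.linalg.lstsq(M, np.ones(len(points)), rcond=None)
    return coeffs
```

Comparing the fitted coefficients (or the derived axis lengths and orientation) between a model's thresholds and human MacAdam data then gives a direct, geometry-based alignment measure.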
The third part of the thesis investigates the broader determinants of perceptual alignment. Through systematic analyses across convolutional neural networks (CNNs), Vision Transformers, CLIP, and multimodal language models, the work demonstrates that alignment is a multifactorial property emerging from complex interactions between architectural design, optimization objectives, statistical properties of the training data, duration of training, and readout strategies. Interestingly, the relationship between task performance and perceptual alignment is non-monotonic: increasing model capacity or optimizing solely for accuracy can paradoxically reduce alignment with human perception, resulting in an inverted U-shaped relationship between accuracy and perceptual similarity. Additionally, linguistic supervision biases models toward global shape representations at the expense of local texture information, emphasizing that the type of task and supervision can play a more substantial role than architectural choices alone. These findings suggest that perceptual alignment is more strongly constrained by the combination of data, supervision, and task demands than by modifications to network architecture.
Conceptually, this thesis contributes to a deeper understanding of the interplay between performance, biological inspiration, and perceptual alignment, highlighting that improvements in accuracy or biologically motivated design do not necessarily translate to human-like perceptual behavior. Methodologically, it introduces systematic evaluation frameworks inspired by psychophysics, which can be applied to both vision-only and multimodal models to assess alignment rigorously. Empirically, it clarifies how factors such as early visual computations, chromatic environmental richness, optimization regimes, and language-based supervision interact to influence the degree of similarity between artificial and human perception.
In conclusion, this thesis advances the understanding of how artificial neural networks perceive visual stimuli and delineates the conditions under which they diverge from human visual experience. It provides strong evidence that bridging the gap between computational performance and perceptual alignment requires moving beyond architectural inspiration, toward evaluation frameworks and design principles that are explicitly informed by human behavioral and psychophysical data. These contributions lay the foundation for future research on biologically inspired, robust architectures and establish perceptual alignment as a critical, complementary objective to accuracy in the development and evaluation of computer vision systems. By integrating insights from neuroscience, psychophysics, and machine learning, the work positions perceptual alignment as a central consideration for designing artificial vision systems capable of functioning in real-world, human-centered environments.