Using the NRP, a saliency computation model drives visual segmentation in the Laminart model

Recently, a cortical model for visual grouping and segmentation (the Laminart model) was integrated into the NRP. From there, the goal is to build a whole visual system on the NRP, connecting models for different functions of vision (retinal processing, saliency computation, saccade generation, predictive coding, …) in a single virtual experiment, with the Laminart serving as a model for early visual processing. While this work is still ongoing (see here), some scientifically relevant progress has already emerged from the early stages of this implementation. This is what we describe below.

The Laminart is one of the few models able to satisfactorily explain how crowding occurs in the visual system. Crowding is a visual phenomenon in which the perception of a target deteriorates in the presence of nearby elements. Crowding occurs in real life (for example when driving in the street, see fig. 1a) and is widely studied in psychophysical experiments (see fig. 1b). Because crowding is ubiquitous in the visual system, any complete model of the visual system must account for it.

While crowding was long believed to be driven by local interactions in the brain (e.g. decremental feed-forward pooling of different receptive fields along the hierarchy of the visual system, jumbling the target's visual features with those of nearby elements), it recently turned out that adding remote contextual elements can still modulate crowding (see fig. 1c). The entire visual configuration can determine what happens at the very small scale of the target!

Fig. 1: a) Crowding in real life. If you look at the bull's eye, the kid on the right is easy to identify. The one on the left, however, is harder to identify, because the nearby elements share similar features (yellow colour, human shape). b) Crowding in psychophysical experiments. Top: the goal is to identify the letter in the centre while looking at the fixation cross. Neighbouring letters make the task more difficult, especially when they are very close to the target. Centre and bottom: the goal here is to identify the offset of the target (small tilted lines). Again, the neighbouring lines make the task more difficult. c) The task is the same as before (the visual stimuli on the x-axis are presented in the periphery of the visual field, and observers must report the offset of the target); this time, squares are added around the target to degrade performance. The y-axis shows the target offset at which observers give 75% correct answers (low values indicate good performance). When the target is alone (dashed line), performance is very good. When a single square flanks the target, performance drops dramatically. However, as more squares are added, the task becomes easier and easier.

To account for this exciting phenomenon (named uncrowding), Francis et al. (2017) proposed a model that parses the visual stimulus into several groups, using low-level cortical dynamics (arising from a biologically plausible, laminarly structured network of spiking neurons with fixed connectivity). Crucially, the Laminart is a two-stage model in which the input image is segmented into different groups before any decremental interaction can occur between the target and nearby elements. In other words, how elements are grouped in the visual field determines how crowding occurs, making crowding a simple, behaviourally measurable phenomenon that unambiguously probes a central feature of human vision (grouping). In fig. 1c (right), the 7 squares form a group that frames the target instead of interfering with it, hence enhancing performance. In the Laminart model, the 7 squares are grouped together by illusory contours and are segmented out, leaving privileged access to the target, now standing alone. However, in order to work, the Laminart model needs to start the segmentation spreading process somewhere (see fig. 2).
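The two-stage logic can be sketched in a few lines of toy Python. This is only an illustration of the principle, not the actual Laminart network: stage 1 (grouping) is assumed to have already assigned each element a group label, and stage 2 pools (i.e. interferes with) the target only from elements that ended up in the same group and lie nearby. The function name, radius parameter, and label arrays below are all hypothetical.

```python
import numpy as np

def crowding_strength(positions, labels, target_idx, radius=2.0):
    """Toy two-stage sketch (not the real Laminart implementation).

    Stage 1 is assumed done: labels[i] gives each element's group.
    Stage 2 counts interfering elements: those in the *same* group
    as the target and within `radius` of it.
    """
    target = positions[target_idx]
    same_group = labels == labels[target_idx]
    dists = np.linalg.norm(positions - target, axis=1)
    interferers = same_group & (dists < radius)
    interferers[target_idx] = False  # the target does not crowd itself
    return int(interferers.sum())

# Target at the origin, two flankers to its right.
positions = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 0.0]])
labels_grouped = np.array([0, 1, 1])  # flankers segmented away: no crowding
labels_merged = np.array([0, 0, 0])   # everything in one group: crowding
```

With `labels_merged` both flankers interfere with the target, while with `labels_grouped` none do, even though the geometry is identical: in this picture, segmentation alone decides whether crowding occurs.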

Grouping
Fig. 2: Dynamics of layer 2/3 of area V2 of the Laminart model, for two different stimuli. The red/green lines correspond to the activity of neurons that detect a vertical/horizontal contrast contour. The three columns for each stimulus correspond to three segmentation layers, in which the visual stimulus is parsed into different groups. The blue circles are spreading signals that start the segmentation process (one for each segmentation layer other than SL0). Left: the flanker is close to the target, so it is hard for the spreading signals to segment the flanking square from the target. Right: the flankers extend further and are linked by illusory contours, so it is easier for the signals to segment them from the target. This condition therefore produces less crowding than the other.

Up to now, the model relied on ad-hoc top-down signals, lacking an explicit process to generate them. Here, using the NRP, we could easily connect it to a model for saliency computation that had just been integrated into the platform. The saliency computation model feeds the Laminart, delivering its output as a bottom-up bias on where segmentation signals should arise. On the NRP, we created the stimuli used in the experimental results shown in fig. 1c and presented them to the iCub robot. In this experiment, each time a segmentation signal is sent, its location is sampled from the saliency map, linking both models in an elegant manner. Notably, when only one square flanks the target, the saliency map is merely a big blob around the target, whereas when 7 squares flank the target, the saliency map peaks around the outer squares (see fig. 3). Consequently, the more squares there are, the more probable it is that the segmentation signals succeed in splitting the flankers and the target into 2 groups, releasing the target from crowding. This fits very well with the results of fig. 1c. The next step for this project is to reproduce these results quantitatively on the NRP.
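Sampling a signal location from a saliency map amounts to treating the (normalised) map as a 2D probability distribution. A minimal sketch of that step, assuming the saliency map is available as a non-negative NumPy array (the function name and array shape are illustrative, not the NRP's actual API):

```python
import numpy as np

def sample_signal_location(saliency_map, rng=None):
    """Sample a (row, col) location with probability proportional
    to the saliency value at that pixel."""
    rng = rng or np.random.default_rng()
    p = saliency_map.ravel().astype(float)
    p /= p.sum()  # normalise to a probability distribution
    idx = rng.choice(p.size, p=p)
    return np.unravel_index(idx, saliency_map.shape)

# Toy saliency map: a single blob of high saliency, standing in for
# the "one flanker" condition where the map is one blob on the target.
salmap = np.zeros((10, 10))
salmap[4:6, 4:6] = 1.0
row, col = sample_signal_location(salmap)
```

Because the map is all zeros outside the blob, the sampled location always falls inside it; with a multi-peaked map (the 7-square condition), signals would instead land on the outer squares most of the time, which is exactly what makes segmenting the flankers away from the target more likely.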

Fig. 3: Coexistence of the Laminart network and the saliency network. Top: crowded condition. Bottom: uncrowded condition. In both conditions, the saliency computation model drives the location of the segmentation signals in the Laminart model and explains very well how crowding and uncrowding can occur. The windows on the left display the saliency model; the ones on the right display the output of the Laminart model (top: V2 contrast border activity; bottom: V4 surface activity).

To sum up, by building a visual system on the NRP, we could easily connect a saliency computation model to our Laminart model. This connection greatly enhances the latter, enabling it to explain how uncrowding occurs in human vision and which low-level mechanisms underlie visual grouping. In the future, we will run psychophysical experiments in our lab that can disentangle top-down from bottom-up influences on uncrowding, to test whether a strong influence of saliency computation on visual grouping holds up.
