Program of the Workshop
The workshop will run from 8:45am to 6:00pm.
There will be a 4-hour boat cruise on the "MS Regensburg" on the evening of Friday 28 August. Tickets will cost 25 Euros each, including the cruise, a traditional Bavarian dinner, and dancing. The boat will depart from, and return to, the quayside close to the meeting venue.
A limited number of tickets will be available. To reserve, please email
stating how many tickets you wish to reserve. In the subject line write "boat trip", so that it is easy to sort the emails. Payment will be required at the ECVP registration desk.
Abstracts of the Workshop
Jan-Olof Eklundh
Computational approaches to shape in object recognition and classification: A personal view
Shape plays an important role in human perception of objects and object properties. As noted in the motivation for the workshop, the use of shape in computer vision has waxed and waned over the years. While mathematics offers a rich language for describing local as well as global shape, and today’s computers provide means of processing these notions, it is not straightforward to apply this wealth of knowledge in computational processing of real scenes. It is also an open question which representations are appropriate. Considerable advances have been made on, e.g., reconstructing 3D shapes and extracting local shape information. However, bridging the gaps between the extraction of local shape and describing global shape, and between processing isolated shapes and the shape of objects in cluttered surroundings, has turned out to be more difficult than at least the early practitioners of computer vision foresaw. In the talk I will give a personal view of what could constitute these difficulties, with earlier research in the field and presentations at this workshop as a background.
Andrew Glennerster
3D shape perception: the case for a primal sketch
The idea of a primal sketch, for example as described by Marr and Hildreth (1980) or Watt and Morgan (1985), has been applied to representing 2D images rather than 3D shapes. Watt (1988) extended the MIRAGE primal sketch to incorporate eye movements but not translations of the optic centre (binocular viewing or a moving observer) which allow for a 3D interpretation of the scene. I will argue that the development of a hierarchical, scale-space primal sketch of the set of images that can be observed in a scene (including arbitrary rotations of the eye/camera and translations of the optic centre) is highly relevant to explaining human 3D perception. Recent work on light fields and epitomic location recognition are helpful steps in this direction. In this context, I will discuss psychophysical evidence from my lab on sensitivity to depth relief with respect to surfaces. I will argue that the data are compatible with hierarchical encoding of position and disparity in a primal sketch, similar to the affine model of Koenderink and van Doorn (1991). I will also discuss two examples of experiments showing how changing the observer's task changes their performance in a way that is incompatible with the visual system storing a 3D model of the shape (Glennerster, Rogers and Bradshaw, 1996) or location (Svarverud, Gilson and Glennerster, VSS 2009) of objects. Such task-dependency indicates that the visual system maintains information in a more 'raw', or 'primal', form than a 3D model.
Ales Leonardis
Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation
Hierarchies allow feature sharing between objects at multiple levels of representation, can code exponential variability in a very compact way, and enable fast inference. This makes them potentially suitable for learning and recognizing a larger number of object classes. However, the success of hierarchical approaches so far has been hindered by the use of hand-crafted features or predetermined grouping rules. The goal of this talk is to present a framework for unsupervised learning of a hierarchical compositional shape vocabulary for representing multiple object classes. The approach takes simple contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly complex and class-specific shape compositions, each exhibiting a high degree of shape variability. At the top level of the vocabulary, the compositions are sufficiently large and complex to represent the whole shapes of the objects. We learn the vocabulary layer after layer, by gradually increasing the size of the window of analysis and the spatial resolution at which the shape configurations are learned. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. The experimental results show that the learned multi-class object representation scales logarithmically with the number of object classes and achieves state-of-the-art detection performance, with both faster inference and shorter training times.
This is joint work with Sanja Fidler and Marko Boben.
Jitendra Malik
The interaction of bottom-up and top-down processing in visual recognition
More than a decade ago, researchers in computer vision came up with a brute-force, yet effective, strategy for face detection. One simply scans over the image with a "sliding window", testing each one with a pattern classifier, and declaring a detection wherever the classifier responds "yes". This process is repeated at multiple scales to detect bigger or smaller faces. The approach works, to some extent, for other object categories, but is quite unsatisfactory as a model for human visual recognition, and perhaps even for computer vision. It completely denies any role for bottom-up grouping processes, in contrast to the dominant paradigms twenty or thirty years ago. There was a reason for the change: representations such as the 2.5D sketch or geons, which were supposed to be obtained purely bottom-up, proved quite difficult to compute reliably on real-world images. But did we swing too much to the other extreme? In my group, we have been developing an approach that builds significantly on the Gestalt tradition in perceptual organization. We start with a local process of marking contours in images based on local differences in brightness, color, texture, etc., and move on to a more global framework for extracting coherent regions. So far, everything is generic and independent of knowledge of specific objects. These regions or extended curves serve as triggers for object hypotheses (we use a variant of the Hough transform), and machine learning techniques enable us to assign weights to different regions based on repeatability and distinctiveness for a category. Once particular object hypotheses have been activated, one can go back and refine the groupings, e.g., to mark low-contrast contours, discount shadow edges, etc.
This is a concrete instantiation of a model of visual recognition which is "mostly feedforward", and thus consistent with some of the timing data on rapid visual detection, yet permits feedback to deal with ambiguities in the low-level signal that are practically impossible for a bottom-up system to resolve correctly. We have demonstrated the approach successfully on a variety of datasets.
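The brute-force sliding-window strategy described at the start of this abstract can be illustrated with a minimal sketch. This is not the speaker's system: the function names, the list-of-lists image format, and the brightness-threshold stand-in classifier are all assumptions made for illustration, and multiple scales are approximated here by varying the window size rather than by rescaling the image.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # row-major grayscale image (assumed format)
Box = Tuple[int, int, int]         # (top row, left column, window size)


def sliding_window_detect(
    image: Image,
    classifier: Callable[[List[Sequence[float]]], bool],
    window_sizes: Sequence[int],
    stride: int = 1,
) -> List[Box]:
    """Scan square windows of each size over the image and keep every
    window for which the classifier responds "yes"."""
    height, width = len(image), len(image[0])
    detections: List[Box] = []
    for size in window_sizes:  # one pass per scale
        for row in range(0, height - size + 1, stride):
            for col in range(0, width - size + 1, stride):
                patch = [r[col:col + size] for r in image[row:row + size]]
                if classifier(patch):
                    detections.append((row, col, size))
    return detections


def bright_patch(patch: List[Sequence[float]]) -> bool:
    """Hypothetical stand-in for a trained pattern classifier:
    accepts a patch whose mean intensity exceeds 0.5."""
    total = sum(sum(r) for r in patch)
    return total / (len(patch) * len(patch[0])) > 0.5


image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(sliding_window_detect(image, bright_patch, window_sizes=[2, 4]))
```

Every window is classified independently of image structure, which is the property the abstract criticizes: no bottom-up grouping informs where the classifier looks.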
This talk is based on joint work with a number of collaborators, most recently with P. Arbelaez, C. Gu, and J. Lim.
Thomas Papathomas
Bottom-up and top-down processes in 3-D shape representation - Lessons from 3-D shape illusions
Converging evidence from psychophysics and neurophysiology points to top-down influences in visual perception, especially in object recognition. In parallel, there have been recent efforts in computer vision to employ top-down influences in object recognition, especially in the area of context-based scene analysis. The goal of this talk is to present a framework for three-dimensional (3-D) shape representation that takes into account both bottom-up and top-down processes. The framework is based on the premise that, when we see an object, we build a top-down 3-D representation, based on expectation and prior experience, that is "projected" onto the 3-D physical world, much in the manner that we perceive all objects projected out in physical space. We are not aware of this projection because it is automatic and subconscious. However, for certain 3-D objects (hollow masks and reverspectives, among others), we misperceive the stimulus by inverting depth relationships; as a result of this depth inversion, when viewers move in front of these objects, they perceive the objects to move vividly. This framework offers an explanation for the illusory motion, in which the projection of the 3-D representation plays a pivotal role. The same framework explains illusory motion of stereograms, as well as the “following eyes” or the “pointing out of the picture” illusions experienced by viewers who move past a full-face portrait or a painting with strong perspective cues, respectively. It also offers a reason why schizophrenics do not experience the hollow mask illusion as strongly as non-schizophrenics. Finally, I will present the results of a face-tracking algorithm in computer vision that “experiences” the hollow mask illusion as humans do, precisely because it uses a top-down 3-D representation model of human faces.
Nikos Paragios
Hierarchical Shape Representation Using Sparse Graphs and Unsupervised Clustering
Manifold construction in high-dimensional spaces is a challenging problem with numerous applications to medical image analysis and computer vision. The main challenge is how to construct a manifold from a small number of samples. In this talk we will present a hierarchical representation that embeds manifold construction in pair-wise interactions between measurements through a sparse graph. These pairs aim to account for co-dependencies of appearances between the samples and can be determined through an unsupervised clustering approach. The resulting framework consists of clusters, in which cluster centers are connected with cluster elements of similar statistical behavior, and of interconnections between cluster centers that aim to capture the global manifold properties. Such a representation is sparse, can be determined using a small training set, and can be used in a number of problems in computer vision. Knowledge-based segmentation and dense registration with priors are two examples that will be presented.
Joint work with A. Besbes, B. Glocker & N. Komodakis.
Mary Peterson
High-level and Contextual Influences on Figure-Ground Perception: A Case for Recurrent Processing
A fundamental aspect of perceptual organization entails segregating the visual input into separate shapes. In this process, a border shared by two contiguous regions is typically assigned to one region. That region "owns" the border and is perceived as a shaped entity (a figure); the other region seems locally to be an unshaped background to the figure. A traditional assumption was that figures were segregated from grounds before memories of familiar shapes are accessed. Our research showed that, instead, memories of familiar shapes play a role in determining which regions of the visual field are perceived as shaped figures. Effects of familiarity are probabilistic rather than deterministic. In recent experiments we found that convexity also exerts a probabilistic influence on figure-ground perception; moreover, its influence varies substantially across contexts affording different global resolutions. We submit that our results are best understood within a framework that involves access to high-level representations in a first stage of processing, and iterative interactions among representations at multiple levels in a second stage during which perceptual organization occurs.
Andrew Schofield
Segmentation into layers: Why humans don’t try to pick up shadows
In order to determine the 3D shape of an object it is first necessary to determine which parts of the scene belong to the object and which are part of the background: figure-ground segmentation. One problem with segmenting natural images is that luminance is ambiguous: it can change for a variety of reasons including, critically, changes in both surface reflectance and the illumination field. It may therefore be beneficial to segment the image into layers representing intrinsic characteristics of the scene. Such characteristic images might include a reflectance map of the scene and the illumination field in which the scene is bathed. There is some evidence that humans can divide the visual input into such layers. Intuitively, we don’t try to avoid shadows, we don’t treat them as objects, and we don’t incorporate attached shadows into the body of an object. The reflectance map may form the basis of figure-ground segmentation, object recognition and object-level shape processing. The illumination field might yield information about shadows, surface undulations and lighting direction. I will discuss recent evidence supporting the notion that humans separate shadows and illumination from reflectance changes early in the processing stream and compare this with analogous machine vision methods for intrinsic image extraction.
Jianbo Shi
Contour Packing for Shape Recognition
We introduce a method for 'packing' salient but fragmented image contours/segments into recognizable object shapes. This method requires few training examples, and is resistant to image clutter. In total, our method addresses three challenges: 1) object shape variation, 2) learning a discriminative score function for detection, and 3) unpredictable fragmentation of segments or contours. Previous works have addressed either both object shape variation and discriminative training, or both unpredictable fragmentation and object shape variation, but not all three. Our approach uses salient contours as integral tokens for shape matching. We seek a maximal, holistic matching of shapes. Shape features are extracted from a large spatial extent, together with long-range contextual relationships among object parts. Our approach allows imperfect image segments to be 'glued', to achieve one(object)-to-many(segments), or many(object parts)-to-many(segments) matching. We demonstrate that many-to-many shape matching can be trained discriminatively, using a simple bounding box around objects as feedback. There are several computational implementations, using Linear Programming (LP) or Semi-Definite Programming (SDP). We evaluate our method on the challenging task of detecting humans, bottles, swans and other objects in cluttered images.
This is joint work with Qihui Zhu, Praveen Srinivasan, Liming Wang, and Yang Wu.
Kaleem Siddiqi
Medial Models for Vision
Since their introduction by Blum in the 1960s, medial models have played an important role in both human perception of visual form and in computer vision approaches to form recognition and categorization. These representations simultaneously capture geometric information about an object's boundary as well as information about its interior. Often distinct medial branches suggest decompositions of object outlines into intuitive parts. In computer vision this has led to graph-based abstractions for object recognition, whereas in human vision medial representations have been implicated in many tasks, including shape-bisection, contour grouping and motion perception. In this talk I review the geometry of medial loci and then explore their association with human and computer vision. A common thread is the idea that certain medial locations within the outline of a form, characterized by a type of flux calculation, play a special role in both human and computer vision.
Manish Singh
Perceptual organization of shape: the role of parts and axes
A basic problem in shape perception concerns the organization of local estimates of position / orientation (whether in 2D or 3D) into the global representation of shape. A key idea is that shape representation must be structural, such that the representation of spatial relationships between basic units or parts is separated from the representation of the parts themselves. This approach is especially useful in dealing with biological shapes, which can articulate their limbs and change their spatial configuration. This talk will review recent evidence for the part-based nature of the visual representation of shape, and the role of skeletal axes in organizing shape representation. It will also summarize a recent probabilistic model for computing skeletal axes, which unifies the computation of parts and axes within a common formal framework.
Joint work with Jacob Feldman.
Christian Wallraven
Beyond vision: multi-sensory processing in humans and machines
The question of how humans learn to categorize objects and events has been at the heart of cognitive and neuroscience research for decades. In recent years, much work in computer vision has also focused on this topic and has by now generated multiple challenges, databases, and novel approaches. In this talk, I will argue that there is more to "vision" than "bags of words". Recent work in our lab has focused on using state-of-the-art computer graphics and simulation technology in order to advance our understanding of the role vision plays in the "ultimate cognitive system": the human. In particular, I will discuss the need for spatio-temporal object representations, as well as why we need a notion of shape and material properties in object interpretation that goes far beyond most current computer vision approaches. Most importantly, however, I will focus on multi-modal/multi-sensory aspects of object processing as one of the key elements of learning about the world through interaction. Evidence from several studies of haptic object processing, for example, has shown that the sense of touch is sometimes surprisingly acute in representing complex shape spaces. I will finish by showing how some of these perceptual and cognitive results can be integrated into novel, more efficient and effective vision systems.