tceic.com

学霸学习网 这下你爽了

学霸学习网 这下你爽了

- 译文A Discriminatively Trained, Multiscale, Deformable Part Model
- A Discriminatively Trained, Multiscale, Deformable Part Model【中文译】【转】
- Discriminatively Trained Deformable Part Models模型和训练matlab程序20131217
- 目标检测 Discriminatively Trained Deformable Part Models 20131208
- Deformable Part Models Revisited部件模型的回顾
- Segmentation of medical images using a geometric deformable model and its visualization
- A versatile and robust model for geometrically complex deformable solids
- Object Detection with Discriminatively Trained Part Based Models
- A Multiscale Random Field Model
- A Component Based Deformable Model for Generalized Face Alignment

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb University of Chicago pff@cs.uchicago.edu Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on dif?cult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is speci?ed for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

David McAllester Toyota Technological Institute at Chicago mcallester@tti-c.org

Deva Ramanan TTI-C and UC Irvine dramanan@ics.uci.edu

Figure 1. Example detection obtained with the person model. The model is de?ned by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly ef?cient and accurate, processing an image in about 2 seconds and achieving recognition rates that are signi?cantly better than previous systems. Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model. 1

The notion that objects can be modeled by parts in a deformable con?guration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been dif?cult to establish their value in practice. On dif?cult datasets, deformable models are often outperformed by “conceptually weaker” models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap. Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a maxmargin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU. Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining “hard negative” examples during training. We believe that handling partially labeled data is a signi?cant issue in machine learning for computer vision. For example, the PASCAL dataset only speci?es a bounding box for each positive example of an object. We treat the position of each object part as a latent variable. We

also treat the exact location of the object as a latent variable, requiring only that our classi?er select a window that has large overlap with the labeled bounding box. A latent SVM, like a hidden CRF [19], leads to a nonconvex training problem. However, unlike a hidden CRF, a latent SVM is semi-convex and the training problem becomes convex once latent information is speci?ed for the positive training examples. This leads to a general coordinate descent algorithm for latent SVMs. System Overview Our system uses a scanning window approach. A model for an object consists of a global “root” ?lter and several part models. Each part model speci?es a spatial model and a part ?lter. The spatial model de?nes a set of allowed placements for a part relative to a detection window, and a deformation cost for each placement. The score of a detection window is the score of the root ?lter on the window plus the sum over parts, of the maximum over placements of that part, of the part ?lter score on the resulting subwindow minus the deformation cost. This is similar to classical part-based models [10, 13]. Both root and part ?lters are scored by computing the dot product between a set of weights and histogram of gradient (HOG) features within a window. The root ?lter is equivalent to a Dalal-Triggs model [5]. The features for the part ?lters are computed at twice the spatial resolution of the root ?lter. Our model is de?ned at a ?xed scale, and we detect objects by searching over an image pyramid. In training we are given a set of images annotated with bounding boxes around each instance of an object. We reduce the detection problem to a binary classi?cation problem. Each example x is scored by a function of the form, fβ (x) = maxz β · Φ(x, z). Here β is a vector of model parameters and z are latent values (e.g. the part placements). To learn a model we de?ne a generalization of SVMs that we call latent variable SVM (LSVM). An important property of LSVMs is that the training problem becomes convex if we ?x the latent values for positive examples. This can be used in a coordinate descent algorithm. In practice we iteratively apply classical SVM training to triples ( x1 , z1 , y1 , . . ., xn , zn , yn ) where zi is selected to be the best scoring latent label for xi under the model learned in the previous iteration. An initial root ?lter is generated from the bounding boxes in the PASCAL dataset. The parts are initialized from this root ?lter.

Image pyramid

HOG feature pyramid

Figure 2. The HOG feature pyramid and an object hypothesis de?ned in terms of a placement of the root ?lter (near the top of the pyramid) and the part ?lters (near the bottom of the pyramid).

templates that can be moved with respect to the detection window. The spatial model for the part locations is equivalent to a star graph or 1-fan [3] where the coarse template serves as a reference position.

2.1. HOG Representation

We follow the construction in [5] to de?ne a dense representation of an image at a particular resolution. The image is ?rst divided into 8x8 non-overlapping pixel regions, or cells. For each cell we accumulate a 1D histogram of gradient orientations over pixels in that cell. These histograms capture local shape properties but are also somewhat invariant to small deformations. The gradient at each pixel is discretized into one of nine orientation bins, and each pixel “votes” for the orientation of its gradient, with a strength that depends on the gradient magnitude at that pixel. For color images, we compute the gradient of each color channel and pick the channel with highest gradient magnitude at each pixel. Finally, the histogram of each cell is normalized with respect to the gradient energy in a neighborhood around it. We look at the four 2 × 2 blocks of cells that contain a particular cell and normalize the histogram of the given cell with respect to the total energy in each of these blocks. This leads to a 9 × 4 dimensional vector representing the local gradient information inside a cell. We de?ne a HOG feature pyramid by computing HOG features of each level of a standard image pyramid (see Figure 2). Features at the top of this pyramid capture coarse gradients histogrammed over fairly large areas of the input image while features at the bottom of the pyramid capture ?ner gradients histogrammed over small areas. 2

2. Model

The underlying building blocks for our models are the Histogram of Oriented Gradient (HOG) features from [5]. We represent HOG features at two different scales. Coarse features are captured by a rigid template covering an entire detection window. Finer scale features are captured by part

2.2. Filters

Filters are rectangular templates specifying weights for subwindows of a HOG pyramid. A w by h ?lter F is a vector with w × h × 9 × 4 weights. The score of a ?lter is de?ned by taking the dot product of the weight vector and the features in a w × h subwindow of a HOG pyramid. The system in [5] uses a single ?lter to de?ne an object model. That system detects objects from a particular class by scoring every w × h subwindow of a HOG pyramid and thresholding the scores. Let H be a HOG pyramid and p = (x, y, l) be a cell in the l-th level of the pyramid. Let φ(H, p, w, h) denote the vector obtained by concatenating the HOG features in the w × h subwindow of H with top-left corner at p. The score of F on this detection window is F · φ(H, p, w, h). Below we use φ(H, p) to denote φ(H, p, w, h) when the dimensions are clear from context.

of each part relative to the root (the spatial term),

n i=0 n

Fi · φ(H, pi ) +

i=1

ai · (?i , yi ) + bi · (?2 , yi ), x ? xi ?2

(1)

2.3. Deformable Parts

Here we consider models de?ned by a coarse root ?lter that covers the entire object and higher resolution part ?lters covering smaller parts of the object. Figure 2 illustrates a placement of such a model in a HOG pyramid. The root ?lter location de?nes the detection window (the pixels inside the cells covered by the ?lter). The part ?lters are placed several levels down in the pyramid, so the HOG cells at that level have half the size of cells in the root ?lter level. We have found that using higher resolution features for de?ning part ?lters is essential for obtaining high recognition performance. With this approach the part ?lters represent ?ner resolution edges that are localized to greater accuracy when compared to the edges represented in the root ?lter. For example, consider building a model for a face. The root ?lter could capture coarse resolution edges such as the face boundary while the part ?lters could capture details such as eyes, nose and mouth. The model for an object with n parts is formally de?ned by a root ?lter F0 and a set of part models (P1 , . . . , Pn ) where Pi = (Fi , vi , si , ai , bi ). Here Fi is a ?lter for the i-th part, vi is a two-dimensional vector specifying the center for a box of possible positions for part i relative to the root position, si gives the size of this box, while ai and bi are twodimensional vectors specifying coef?cients of a quadratic function measuring a score for each possible placement of the i-th part. Figure 1 illustrates a person model. A placement of a model in a HOG pyramid is given by z = (p0 , . . . , pn ), where pi = (xi , yi , li ) is the location of the root ?lter when i = 0 and the location of the i-th part when i > 0. We assume the level of each part is such that a HOG cell at that level has half the size of a HOG cell at the root level. The score of a placement is given by the scores of each ?lter (the data term) plus a score of the placement 3

where (?i , yi ) = ((xi , yi ) ? 2(x, y) + vi )/si gives the lox ? cation of the i-th part relative to the root location. Both xi ? and yi should be between ?1 and 1. ? There is a large (exponential) number of placements for a model in a HOG pyramid. We use dynamic programming and distance transforms techniques [9, 10] to compute the best location for the parts of a model as a function of the root location. This takes O(nk) time, where n is the number of parts in the model and k is the number of cells in the HOG pyramid. To detect objects in an image we score root locations according to the best possible placement of the parts and threshold this score. The score of a placement z can be expressed in terms of the dot product, β · ψ(H, z), between a vector of model parameters β and a vector ψ(H, z), β = (F0 , . . . , Fn , a1 , b1 . . . , an , bn ). ψ(H, z) = (φ(H, p0 ), φ(H, p1 ), . . . φ(H, pn ), x1 , y1 , x2 , y1 , . . . , xn , yn , x2 , yn , ). ? ? ?1 ?2 ? ? ?n ?2 We use this representation for learning the model parameters as it makes a connection between our deformable models and linear classi?ers. On interesting aspect of the spatial models de?ned here is that we allow for the coef?cients (ai , bi ) to be negative. This is more general than the quadratic “spring” cost that has been used in previous work.

3. Learning

The PASCAL training data consists of a large set of images with bounding boxes around each instance of an object. We reduce the problem of learning a deformable part model with this data to a binary classi?cation problem. Let D = ( x1 , y1 , . . . , xn , yn ) be a set of labeled examples where yi ∈ {?1, 1} and xi speci?es a HOG pyramid, H(xi ), together with a range, Z(xi ), of valid placements for the root and part ?lters. We construct a positive example from each bounding box in the training set. For these examples we de?ne Z(xi ) so the root ?lter must be placed to overlap the bounding box by at least 50%. Negative examples come from images that do not contain the target object. Each placement of the root ?lter in such an image yields a negative training example. Note that for the positive examples we treat both the part locations and the exact location of the root ?lter as latent variables. We have found that allowing uncertainty in the root location during training signi?cantly improves the performance of the system (see Section 4).

3.1. Latent SVMs

A latent SVM is de?ned as follows. We assume that each example x is scored by a function of the form, fβ (x) = max β · Φ(x, z),

z∈Z(x)

(2)

where β is a vector of model parameters and z is a set of latent values. For our deformable models we de?ne Φ(x, z) = ψ(H(x), z) so that β · Φ(x, z) is the score of placing the model according to z. In analogy to classical SVMs we would like to train β from labeled examples D = ( x1 , y1 , . . . , xn , yn ) by optimizing the following objective function,

n

β ? (D) = argmin λ||β||2 +

β i=1

max(0, 1 ? yi fβ (xi )). (3)

negative examples at a time. Instead, it is common to construct training data consisting of the positive instances and “hard negative” instances, where the hard negatives are data mined from the very large set of possible negative examples. Here we describe a general method for data mining examples for SVMs and latent SVMs. The method iteratively solves subproblems using only hard instances. The innovation of our approach is a theoretical guarantee that it leads to the exact solution of the training problem de?ned using the complete training set. Our results require the use of a margin-sensitive de?nition of hard examples. The results described here apply both to classical SVMs and to the problem de?ned by Step 2 of the coordinate descent algorithm for latent SVMs. We omit the proofs of the theorems due to lack of space. These results are related to working set methods [17]. We de?ne the hard instances of D relative to β as, M (β, D) = { x, y ∈ D | yfβ (x) ≤ 1}. (4)

By restricting the latent domains Z(xi ) to a single choice, fβ becomes linear in β, and we obtain linear SVMs as a special case of latent SVMs. Latent SVMs are instances of the general class of energy-based models [18].

3.2. Semi-Convexity

Note that fβ (x) as de?ned in (2) is a maximum of functions each of which is linear in β. Hence fβ (x) is convex in β. This implies that the hinge loss max(0, 1 ? yi fβ (xi )) is convex in β when yi = ?1. That is, the loss function is convex in β for negative examples. We call this property of the loss function semi-convexity. Consider an LSVM where the latent domains Z(xi ) for the positive examples are restricted to a single choice. The loss due to each positive example is now convex. Combined with the semi-convexity property, (3) becomes convex in β. If the labels for the positive examples are not ?xed we can compute a local optimum of (3) using a coordinate descent algorithm: 1. Holding β ?xed, optimize the latent values for the positive examples zi = argmaxz∈Z(xi ) β · Φ(x, z). 2. Holding {zi } ?xed for positive examples, optimize β by solving the convex problem de?ned above. It can be shown that both steps always improve or maintain the value of the objective function in (3). If both steps maintain the value we have a strong local optimum of (3), in the sense that Step 1 searches over an exponentially large space of latent labels for positive examples while Step 2 simultaneously searches over weight vectors and an exponentially large space of latent labels for negative examples.

That is, M (β, D) are training examples that are incorrectly classi?ed or near the margin of the classi?er de?ned by β. We can show that β ? (D) only depends on hard instances. Theorem 1. Let C be a subset of the examples in D. If M (β ? (D), D) ? C then β ? (C) = β ? (D).

This implies that in principle we could train a model using a small set of examples. However, this set is de?ned in terms of the optimal model β ? (D). Given a ?xed β we can use M (β, D) to approximate M (β ? (D), D). This suggests an iterative algorithm where we repeatedly compute a model from the hard instances de?ned by the model from the last iteration. This is further justi?ed by the following ?xed-point theorem. Theorem 2. If β ? (M (β, D)) = β then β = β ? (D). Let C be an initial “cache” of examples. In practice we can take the positive examples together with random negative examples. Consider the following iterative algorithm: 1. Let β := β ? (C). 2. Shrink C by letting C := M (β, C). 3. Grow C by adding examples from M (β, D) up to a memory limit L. Theorem 3. If |C| < L after each iteration of Step 2, the algorithm will converge to β = β ? (D) in ?nite time.

3.4. Implementation details

Many of the ideas discussed here are only approximately implemented in our current system. In practice, when training a latent SVM we iteratively apply classical SVM training to triples x1 , z1 , y1 , . . ., xn , zn , yn where zi is selected to be the best scoring latent label for xi under the 4

3.3. Data Mining Hard Negatives

In object detection the vast majority of training examples are negative. This makes it infeasible to consider all

model trained in the previous iteration. Each of these triples leads to an example Φ(xi , zi ), yi for training a linear classi?er. This allows us to use a highly optimized SVM package (SVMLight [17]). On a single CPU, the entire training process takes 3 to 4 hours per object class in the PASCAL datasets, including initialization of the parts. Root Filter Initialization: For each category, we automatically select the dimensions of the root ?lter by looking at statistics of the bounding boxes in the training data.1 We train an initial root ?lter F0 using an SVM with no latent variables. The positive examples are constructed from the unoccluded training examples (as labeled in the PASCAL data). These examples are anisotropically scaled to the size and aspect ratio of the ?lter. We use random subwindows from negative images to generate negative examples. Root Filter Update: Given the initial root ?lter trained as above, for each bounding box in the training set we ?nd the best-scoring placement for the ?lter that signi?cantly overlaps with the bounding box. We do this using the original, un-scaled images. We retrain F0 with the new positive set and the original random negative set, iterating twice. Part Initialization: We employ a simple heuristic to initialize six parts from the root ?lter trained above. First, we select an area a such that 6a equals 80% of the area of the root ?lter. We greedily select the rectangular region of area a from the root ?lter that has the most positive energy. We zero out the weights in this region and repeat until six parts are selected. The part ?lters are initialized from the root ?lter values in the subwindow selected for the part, but ?lled in to handle the higher spatial resolution of the part. The initial deformation costs measure the squared norm of a displacement with ai = (0, 0) and bi = ?(1, 1). Model Update: To update a model we construct new training data triples. For each positive bounding box in the training data, we apply the existing detector at all positions and scales with at least a 50% overlap with the given bounding box. Among these we select the highest scoring placement as the positive example corresponding to this training bounding box (Figure 3). Negative examples are selected by ?nding high scoring detections in images not containing the target object. We add negative examples to a cache until we encounter ?le size limits. A new model is trained by running SVMLight on the positive and negative examples, each labeled with part placements. We update the model 10 times using the cache scheme described above. In each iteration we keep the hard instances from the previous cache and add as many new hard instances as possible within the memory limit. Toward the ?nal iterations, we are able to include all hard instances, M (β, D), in the cache.

1 We picked a simple heuristic by cross-validating over 5 object classes. We set the model aspect to be the most common (mode) aspect in the data. We set the model size to be the largest size not larger than 80% of the data.

Figure 3. The image on the left shows the optimization of the latent variables for a positive example. The dotted box is the bounding box label provided in the PASCAL training set. The large solid box shows the placement of the detection window while the smaller solid boxes show the placements of the parts. The image on the right shows a hard-negative example.

4. Results

We evaluated our system using the PASCAL VOC 2006 and 2007 comp3 challenge datasets and protocol. We refer to [7, 8] for details, but emphasize that both challenges are widely acknowledged as dif?cult testbeds for object detection. Each dataset contains several thousand images of realworld scenes. The datasets specify ground-truth bounding boxes for several object classes, and a detection is considered correct when it overlaps more than 50% with a groundtruth bounding box. One scores a system by the average precision (AP) of its precision-recall curve across a testset. Recent work in pedestrian detection has tended to report detection rates versus false positives per window, measured with cropped positive examples and negative images without objects of interest. These scores are tied to the resolution of the scanning window search and ignore effects of non-maximum suppression, making it dif?cult to compare different systems. We believe the PASCAL scoring method gives a more reliable measure of performance. The 2007 challenge has 20 object categories. We entered a preliminary version of our system in the of?cial competition, and obtained the best score in 6 categories. Our current system obtains the highest score in 10 categories, and the second highest score in 6 categories. Table 1 summarizes the results. Our system performs well on rigid objects such as cars and sofas as well as highly deformable objects such as persons and horses. We also note that our system is successful when given a large or small amount of training data. There are roughly 4700 positive training examples in the person category but only 250 in the sofa category. Figure 4 shows some of the models we learned. Figure 5 shows some example detections. We evaluated different components of our system on the longer-established 2006 person dataset. The top AP score 5

aero

bike

bird

boat

bottle

bus

car

cat

chair

cow

table

dog

horse mbike person plant sheep

sofa

train

tv

Our rank Our score Darmstadt INRIA Normal INRIA Plus IRISA MPI Center MPI ESSOL Oxford TKK

3 1 2 1 1 2 2 .180 .411 .092 .098 .249 .349 .396 .301 .092 .246 .012 .002 .068 .197 .265 .136 .287 .041 .025 .077 .279 .294 .281 .318 .060 .110 .028 .031 .000 .164 .172 .152 .157 .098 .016 .001 .186 .120 .262 .409 .393 .432 .186 .078 .043 .072 .002 .116 .184

4 1 1 1 4 2 2 1 1 2 1 4 1 .110 .155 .165 .110 .062 .301 .337 .267 .140 .141 .156 .206 .336 .018 .132 .026 .208 .240 .153 .249 .227 .170 .208 .375 .050 .028 .100 .086 .126 .186 .135 .097 .106 .097 .002 .007 .039 .127 .119 .044 .061 .017 .016 .225 .067 .071 .335 .289 .049 .141 .198 .098 .162 .034 .121 .092 .221 .091 .117 .157 .242 .242 .275 .253 .237 .051 .110 .054 .334 .061 .019 .036 .058 .067 .090 .093 .002 .102 .072 .011 .092 .175 .004 .091 .034 .002 .046 .147

Table 1. PASCAL VOC 2007 results. Average precision scores of our system and other systems that entered the competition [7]. Empty boxes indicate that a method was not tested in the corresponding class. The best score in each class is shown in bold. Our current system ranks ?rst in 10 out of 20 classes. A preliminary version of our system ranked ?rst in 6 classes in the of?cial competition.

Bicycle Sofa Car Bottle

Figure 4. Some models learned from the PASCAL VOC 2007 dataset. We show the total energy in each orientation of the HOG cells in the root and part ?lters, with the part ?lters placed at the center of the allowable displacements. We also show the spatial model for each part, where bright values represent “cheap” placements, and dark values represent “expensive” placements.

in the PASCAL competition was .16, obtained using a rigid template model of HOG features [5]. The best previous result of .19 adds a segmentation-based veri?cation step [20]. Figure 6 summarizes the performance of several models we trained. Our root-only model is equivalent to the model from [5] and it scores slightly higher at .18. Performance jumps to .24 when the model is trained with a LSVM that selects a latent position and scale for each positive example. This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection window in the training examples. Adding deformable parts increases performance to .34 AP — a factor of two above the best previous score. Finally, we trained a model with parts 6

but no root ?lter and obtained .29 AP. This illustrates the advantage of using a multiscale representation. We also investigated the effect of the spatial model and allowable deformations on the 2006 person dataset. Recall that si is the allowable displacement of a part, measured in HOG cells. We trained a rigid model with high-resolution parts by setting si to 0. This model outperforms the rootonly system by .27 to .24. If we increase the amount of allowable displacements without using a deformation cost, we start to approach a bag-of-features. Performance peaks at si = 1, suggesting it is useful to constrain the part displacements. The optimal strategy allows for larger displacements while using an explicit deformation cost. The follow-

Figure 5. Some results from the PASCAL 2007 dataset. Each row shows detections using a model for a speci?c class (Person, Bottle, Car, Sofa, Bicycle, Horse). The ?rst three columns show correct detections while the last column shows false positives. Our system is able to detect objects over a wide range of scales (such as the cars) and poses (such as the horses). The system can also detect partially occluded objects such as a person behind a bush. Note how the false detections are often quite reasonable, for example detecting a bus with the car model, a bicycle sign with the bicycle model, or a dog with the horse model. In general the part ?lters represent meaningful object parts that are well localized in each detection such as the head in the person model.

7

PASCAL2006 Person 1 0.9 0.8 0.7 precision 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 recall 1 Root (0.18) Root+Latent (0.24) Parts+Latent (0.29) Root+Parts+Latent (0.34)

Figure 6. Evaluation of our system on the PASCAL VOC 2006 person dataset. Root uses only a root ?lter and no latent placement of the detection windows on positive examples. Root+Latent uses a root ?lter with latent placement of the detection windows. Parts+Latent is a part-based system with latent detection windows but no root ?lter. Root+Parts+Latent includes both root and part ?lters, and latent placement of the detection windows.

ing table shows AP as a function of freely allowable deformation in the ?rst three columns. The last column gives the performance when using a quadratic deformation cost and an allowable displacement of 2 HOG cells. si AP 0 .27 1 .33 2 .31 3 .31 2 + quadratic cost .34

5. Discussion

We introduced a general framework for training SVMs with latent structure. We used it to build a recognition system based on multiscale, deformable models. Experimental results on dif?cult benchmark data suggests our system is the current state-of-the-art in object detection. LSVMs allow for exploration of additional latent structure for recognition. One can consider deeper part hierarchies (parts with parts), mixture models (frontal vs. side cars), and three-dimensional pose. We would like to train and detect multiple classes together using a shared vocabulary of parts (perhaps visual words). We also plan to use A* search [11] to ef?ciently search over latent parameters during detection.

References

[1] Y. Amit and A. Trouve. POP: Patchwork of parts models for object recognition. IJCV, 75(2):267–282, November 2007. [2] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In ECCV, pages II:628–641, 1998.

[3] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In CVPR, pages 10–17, 2005. [4] D. Crandall and D. Huttenlocher. Weakly supervised learning of part-based spatial models for visual object recognition. In ECCV, pages I: 16–29, 2006. [5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages I: 886–893, 2005. [6] B. Epshtein and S. Ullman. Semantic hierarchies for recognizing objects and parts. In CVPR, 2007. [7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop. [8] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascalnetwork.org/challenges/VOC/voc2006/results.pdf. [9] P. Felzenszwalb and D. Huttenlocher. Distance transforms of sampled functions. Cornell Computing and Information Science Technical Report TR2004-1963, September 2004. [10] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1), 2005. [11] P. Felzenszwalb and D. McAllester. The generalized A* architecture. JAIR, 29:153–190, 2007. [12] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003. [13] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computer, 22(1):67–92, January 1973. [14] A. Holub and P. Perona. A discriminative framework for modelling object classes. In CVPR, pages I: 664–671, 2005. [15] S. Ioffe and D. Forsyth. Probabilistic methods for ?nding people. IJCV, 43(1):45–68, June 2001. [16] Y. Jin and S. Geman. Context and hierarchy in a probabilistic image model. In CVPR, pages II: 2145–2152, 2006. [17] T. Joachims. Making large-scale svm learning practical. In B. Sch¨ lkopf, C. Burges, and A. Smola, editors, Advances in o Kernel Methods - Support Vector Learning. MIT Press, 1999. [18] Y. LeCun, S. Chopra, R. Hadsell, R. Marc’Aurelio, and F. Huang. A tutorial on energy-based learning. In G. Bakir, T. Hofman, B. Sch¨ lkopf, A. Smola, and B. Taskar, editors, o Predicting Structured Data. MIT Press, 2006. [19] A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell. Hidden conditional random ?elds. PAMI, 29(10):1848– 1852, October 2007. [20] D. Ramanan. Using segmentation to verify object hypotheses. In CVPR, pages 1–8, 2007. [21] D. Ramanan and C. Sminchisescu. Training deformable models for localization. In CVPR, pages I: 206–213, 2006. [22] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151–177, February 2004. [23] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classi?cation of texture and object categories: A comprehensive study. IJCV, 73(2):213–238, June 2007.

8