博文

[论文阅读]通过Dictionary Learning的视觉跟踪和运动估计

已有 3239 次阅读 2013-5-25 12:16 |个人分类:论文阅读|系统分类:论文交流

.伊朗著名的谢里夫理工大学有篇关于Sparse Coding来进行视觉跟踪和运动估计的论文。希望能与阅读这篇文章或者对这篇文章有兴趣的同仁交流探讨,共同学习。

摘要

这篇文章中，我们给出了一种新的方法来解决tracking问题。这种方法结合了sparse representation和motion estimation来跟踪object。最近，sparse representation在signal processing和computer vision领域获得很大的关注。Sparse representation可以用来作为classifier但是有很高的time complexity。因此，我们利用motion information局部计算，来减少computation time。实验表明实现的结果accurate enough，并且比sparse classifier拥有更少的computation time。

简介

Tracking是在computer vision中一个非常著名问题。应用领域包括security survelliance system，augmented reality等等。很多tracking algorithm都被developed，但是还没有任何一个方法可以在任意的环境中有satisfying result。因此，新的tracking method仍然不断的develop。其中主要的challenge包括occlusion，illumination change，scale change，varying viewpoint，和deterioration。

Visual tracking中的早期工作使用sum of squared difference(SSD)作为cost function。另外一种approach是particle filter framework，其把tracking作为time series上的state space上进行的estimation。有论文是将Appearance-adaptive model和partile filter进行融合来实现robust tracking。Covariance tracker是另外一种方法，使用基于covariance的object description。还有论文是使用不同类型的feature和modality结合来有效地track nonrigid object。

在上一个十年，很多来自machine learning community的researcher尝试这种方法solve classification problem。这种classification方法主要有object of interest类的分类和background类的分类。在本case中，一个classifier提前训练来区分object和background。一个common的方法用于incorporate分类器到tracking problem是，顺序的使用classifier和tracker。Tracker发现object去哪里而classifier会给出一个score。这种方法会不停重复直到classification score达到一个满意的分数。最频繁使用的classifier是 Support Vector Machine(SVM)。这种方法的disadvantage是tracker并不能保证move到best location而是best maching image region。基于Kernel的density estimation是另外一种用在visual tracking中的常用的machine learning method。在这些方法中，区域内的visual feature的distribution是观测的目标。

一些新的方法使用sparse representation。Sparse representation首先是用来作为reconstruction method，例如利用最少数量的predefined block来reproduce signal。信号通过一系列sparse set的basis funtion来represent。Sparse representation在computer vision中的使用，主要由neuroscience field的工作inspire的。Motion estimation是另外一个在computer vision领域中well-studied的field。一个motion estimator表现为一个对certain area的confidence measure。同时，candidate block的发现是一个在learning process非常重要的。这里工作的主要contribution是呈现一种方法，把sparse code和motion estimation联合起来去tackle这个tracking problem。我们在包含了heavy occlusion, large illumination和pose changes的video sequence上进行test。提出的方法与三个其他的IVT tracker，fragment based tracker，L1 tracker相比较在accuracy上拥有excellent performance。在section II，给出了sparse coding相关工作short review。在section III，给出了用到的方法。在section IV，给出了实验的结果。

相关工作

这里，我们简要的介绍在sparse representation field上的work，来发掘如何利用这种方法来解决tracking problem。一种综合的图像去噪方法用sparse representation来实现。Author使用了两种不同类型的dictionary，一个pre-defined dictionary和一个learned dictionary来图像去噪。Learned dictionary给一系列corrupted image带来了非常令人满意的结果。这一成就给了其他人incentive来learning和update这个dictionary。其他也产生了一些online的update dictionary的方法。

A problem formulation

最近十年，sparse representation field在signal processing community有极大的attention. 这种attention是由于很多natural signal本质上就是sparse的，因此可以approximate甚至fully recover这种sparse code。这种reconstruction可以通过来自dictionary based的linear combination来实现。The dictionary可以定义为 $D=\left [ d_{1},...d_{k} \right ]\in R^{n\times k}$ ，其中 $d_{i}$ 是第 $i^{th}$ 的basis并且 $n$ 是每个basis的dimension。我们想要发现一些来自dictionary的bases，以至于reconstructed signal $\widehat{X}$ ，拥有和original signal $X$ 极大的相似性。下面问题的formulation可以通过下面的公式进行demonstrate。

$\widehat{X}=D\alpha$

而vector $\alpha$ 选择哪个bases来进行reconstruction。 $T$ 是一个predefined的threshold， $\left \| \cdot \right \|$ 是 l0-pseudo norm，定义为

$\left \| x \right \|_{0}=\begin{Bmatrix}s.t. x_{j}\neq 0 \end{Bmatrix}=\lim{\sum \left |x_{j} \right |^q}$

注意 l0-pseudo norm意味着输入vector中non-zero elements的count。不幸的是，因为问题的NP-hard Nature， $\widehat{\alpha }$ 还没有发现有效的方法，但是已经证明 $l_{1}$ norm 可以用来产生sparse solution。因此，reformulated problem是

$\widehat{\alpha }=argmin\begin{Vmatrix}X-D\alpha \end{Vmatrix}+\lambda \left \|\alpha \right \|$

通过这个公式，sparseness和minimum energy difference都可以被实现。 $\lambda$ 是regularization term，用来控制sparseness和reconstruction error之间的tradeoff。据我所知，没有一个在 $\lambda$ 与模型的稀疏性之间的theoretical link。使用l1 norm而不是l0-pseudo norm主要有三个方面的benefits。首先，如果问题的solution是足够稀疏的， $l_{1}$ norm拥有和 $l_{0}$ norm一样的结果。第二， $l_{0}$ pseudo norm拥有更高的time complexity，并且不能在reasonable amount的时间内完成。另一方面， $l_{1}$ norm的representation可以转换non-convex问题到convex问题。Convex问题是well-studied problem，可以通过numerous method来解决。第三个至关重要的property是 $l_{1}$ norm的结果 $l_{0}$ norm的更要stable。这说明dictionary对small perturbation的检测更加鲁棒。如果dictionary不是一个pre-defined，我们需要发现sparse code以及proper dictionary，因此优化问题的general form变成了

$argmin_{D,\alpha }\left \| X-D\alpha \right \|+\lambda \left \| \alpha \right \|$

这不是fully convex但是通过固定 $D$ 和 $\alpha$ ，并且最小化另外一个参数，这个问题可以被treat为convex的问题。K-SVD和MOD可以用来同时发现proper dictionary和sparse code。这些方法是iterative approaches，并且并且可以最小化reconstruction error。在这些方法中，sparse code $\alpha$ 和dictionary $D$ 在两个phases 中被updated。第一phase是当 $D$ 是固定的，sparse code明显的updated。这种task可以通过类似greedy orthogonal matching pursuit来实现。另外一个phase是当dictionary的atom被updated。在MOD， $\alpha$ 是固定的，而atom of dictionary使用least square criteria进行更新。在K-SVD中，不仅atom of dictionary被更新，而且non-zero coefficient的sparse code也同时被更新。

提出方法

Sparse representation的bottlenect主要是computing sparse code的很高的计算量。为了解决这个问题，我们呈现了estimate这些sparse code的新的方法。同时，通过incorporate一个motion estimator,计算时间被significantly减少了。这个方法中，我们计算了第i幅图像的sparse code并且estimate第i+1图像的sparse code。这种方法的假设是两幅successive frame是similar，因此他们的sparse code必须相同。

我们考虑tracking problem作为一个two class的分类问题。我create一个针对object class的dictionary和针对background class的dictionary。在第一个frame，dictionary基于object和background。对于其他frames，或者sparse code包含完全计算的或者approximted。提出的方法包含三个主要步骤，精确的sparse coding，approximate sparse coding和dictionary learning。

A Extract Sparse Coding

精确的sparse code通过solve

$argmin_{\alpha }\left \| X -D\alpha \right \|+\lambda \left \| \alpha \right \|$

其中 $\lambda$ 是控制sparseness和reconstruction error之间tradeoff的因子。这个model会通过 $l_{1}$ regularized least square method 来进行解决。 $l_{1}$ norm 拥有一个针对coefficient vector $\alpha$ 的sparse solution。在 $\lambda$ 和sparseness之间并没有理论上的关系。

B Approximate Sparse Coding

一个可以用于motion estimation的假设是一帧上的pixel和下一帧上的correspondence拥有同样的intensity value。Block region estimator通过定义support region来解决ill-posedness。这个support region定义为pixel of interest的neighborhood。最佳的motion vector是最小化像素间的intensity difference。

预测的motion vector可以预测为：

$v_{i}=argmin_{v}AD_{i}(v)$

$AD_{i}(v)=\sum \left | f(x,y)-f^{-}(x-v_{x},y-v_{y}]) \right |$

在其中， $AD_{i}(v)$ 是block的像素强度的绝对值差， $f$ 和 $f^{-}$ 分别是current和previous的frame data。

近似当前帧的sparse code的需要之前帧的sparse code。所以，假设dictionary是固定的，之前帧的每个patch的精确sparse code已经计算出来，每个patch的当前帧的sparse code由之前帧的类似patch。为了拥有一个more robust的approximation，我们也考虑这种reconstruction error。所以，sparse code在approximation phase可表示为

$\alpha _{i,t}=argmin_{\alpha }\left \| x_{i,t}-D\alpha \right \|+\lambda \left \| \alpha -\alpha _{i,t-1} \right \|$

$x_{i,t}$ 是 $i$ th patch的frame $t$ ， $x_{i,t}$ 是 $i$ th的patch和frame $t$ ， $\alpha _{i%uFF0Ct}$ 是近似的sparse code。右边的第一个项是reconstruction error，而第二项是当前帧和之前帧的相似性。

C Dictionary learning/Updating

Dictionary learning method发现两个分别针对object class和background class的dictionary。 $z$ 是label of classes，并且 $\eta$ 控制两个dictionary的correlation。Dictionary learning的method使用warm restart来更新bases，因此拥有更快的计算复杂度。计算sparse code method。

实验结果

在这一部分，我们report该算法的性能并且和其他三个tracker进行比较，包括IVT，Fragment based tracker，和L1 tracker，其中L1 tracker是基于稀疏编码的tracker。这些算法都在一个很well-known的叫做“Red Team"的video sequence。我们的方法是implemented的是在2.0GHZ的Intel Corei7处理器上的MATLAB软件上实现。

我们这个tracker的initial condition可以定义如下：第一帧的object和background通过two bounding boxes来给出。每个patch的size是9 $\times$ 9，每个dictionary拥有20bases。每个region上的patches都有50%的overlap。我们使用RGB和HOGfeature。我们使用了两个dictionaries，一个是为了background，另一个是object。参数 $\lambda$ 是设定为 $\frac{1}{2}$ 。

我们同时mention我们feed我们的tracker只是even frame，而不是所有的frames。Figure 2和figure 3给出了比较其他方法的qualitative和quantative results。我们比较了我们的方法和其他方法之间的Euclidian error。在left pannel，我们conducted一种叫做“David outdoor“的序列。我们的方法显示了一种promising accuracy，从了frame i=44到frame i=60。Poor performance根源于我们的方法尝试在small neighborhood去发现similiar patches。在右面板，我们的方法拥有superior performance，因为我们的方法使用patch-based representation，并且update这个dictionary。因此，它可以adapt序列中恶劣的occlusion。因为L1 tracker也是基于sparse coding的方法，我们在计算时间上与我们自己的方法进行比较，表二给出了我们的方法和 L1 tracker方法的比较。表二显示我们的方法比L1 tracker方法更快，因为我们的方法进行了优化。

结论

该方法的novelty是结合了motion estimator和sparse code classifier来实现结果。同时，dictionary learning process拥有两个properties，他是online的并且supervised，比pure supervised dictionary learning method可以实现更好的结果

转载本文请联系原作者获取授权，同时请注明本文来自刘小邦科学网博客。
链接地址：https://blog.sciencenet.cn/blog-942948-693335.html

上一篇：[机器学习]Standford Machine Learning Course by Andrew Ng
下一篇：[科研方法]科研中的个人时间管理

收藏 IP: 210.75.252.*| 热度|

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

数据加载中...

返回顶部

博文发布时间已经超过87600小时，评论已关闭。

刘小邦

扫一扫，分享此博文

刘小邦的个人博客分享 http://blog.sciencenet.cn/u/iamliuzhiyong 浮生浪迹笑明月千愁散尽一剑轻

博文

[论文阅读]通过Dictionary Learning的视觉跟踪和运动估计

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

刘小邦

全部作者的其他最新博文

全部精选博文导读

相关博文

刘小邦的个人博客分享 http://blog.sciencenet.cn/u/iamliuzhiyong 浮生浪迹笑明月 千愁散尽一剑轻

博文

[论文阅读]通过Dictionary Learning的视觉跟踪和运动估计

当前推荐数：0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

刘小邦

全部作者的其他最新博文

全部精选博文导读

相关博文

刘小邦的个人博客分享 http://blog.sciencenet.cn/u/iamliuzhiyong 浮生浪迹笑明月千愁散尽一剑轻

该博文允许注册用户评论请点击登录评论 (0 个评论)