Image Segmentation Using Hardware Forest Classifiers

3251 reads | 2013-7-8 19:46 | Personal category: Paper reading | System category: Paper exchange

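Abstract

Image segmentation partitions an image into segments, or subsets, for further analysis, separating the interesting objects in the foreground from the uninteresting ones in the background. In many image processing applications this process requires a sequence of computational steps on a per-pixel basis, which binds system performance to the size and resolution of the image. As applications demand higher resolutions and larger images, the required computation often exceeds what a CPU can deliver, especially in consumer electronics, where power and thermal constraints apply. In this paper we use a hardware tree-based classifier to solve the image segmentation problem. The application is background removal on depth images acquired from a Kinect sensor. Once the image is segmented, the subsequent step classifies the objects in the scene. The approach is flexible: to target a different application domain, we only need to change the trees used by the classifier. We describe two distinct approaches and evaluate their performance in a commercial-grade testing environment.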

Introduction

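Microsoft Kinect is a Natural User Interface (NUI) that lets users interact with a computer system using their bodies. The technology comprises a depth-sensing camera running at 30 fps and image processing software. The application tracks multiple human participants and identifies their individual body parts. This is a computationally complex process that consumes a tremendous amount of computing resources on the host system, even with GPU acceleration. These computational requirements limit the applications and form factors, so the technology currently cannot be used on mobile phones or tablet computers. All of this calls for a lower-power solution; dedicated hardware can handle the load within an acceptable power budget.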

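The current Kinect software pipeline has four stages: background removal (BGR), body part classification, centroid calculation, and model fitting. Background removal essentially tags each pixel of the depth map as belonging to a player or to the background. Body part classification further refines the player classification, assigning each pixel to one of 31 body parts. This step has previously been demonstrated in hardware. The third step, centroid calculation, aggregates the probability maps into centroids, for example the position of the center of a given body part. The last step, model fitting, aggregates the centroids into a human skeleton, resolving noise and occlusion. Model fitting is currently not computationally intensive and therefore runs on the CPU. Figure 2 shows how the stages are mapped on the Microsoft Xbox console: the first and fourth steps run on the CPU, the second and third on the GPU. The new alternative pipeline is shown in Figure 2b. The BGR stage simplifies body part classification by identifying islands, or subsets, of pixels. This process is highly sequential and involves comparing each active pixel with its neighbors. In effect, BGR decides whether each pixel belongs to part of a human player. If a tree-based classifier can identify the parts of a body, it should also be able to separate a human body from the other objects in the scene. This is the basic hypothesis that we successfully investigated.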

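To use a Natural User Interface in a power-constrained environment, we prefer to connect the camera directly to the pipeline (connecting the camera straight to the pipeline is an idea worth noting), perform most of the computation within the device, and send only the results to the host CPU. This calls for a hardware implementation. The existing Microsoft Xbox software implementation is very complex, and porting it to hardware is extremely challenging. In this work we propose an alternative, novel method that uses the Forest Fire algorithm. The input depth image is sent to a first classifier, which separates the interesting portion from the background. The regular second stage then operates directly on the filtered image. Both stages are implemented in hardware, and the two classifiers are in fact replicas, each trained on a different dataset. We evaluate two approaches: one chains the first and second phases, and the other uses a single classifier that performs BGR and body part classification at the same time. The second approach is simpler, but results in lower quality and worse performance. Section II presents background material, including related work. Section III describes the system we implemented and extends it to specific problems, including floor detection, player tagging, and forest training. Section IV reports results, and Section V concludes.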

Background

A. Previous Work on Image Segmentation

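Work on hardware acceleration of image segmentation is quite limited; most efforts focus on accelerating software algorithms. Application domains include image quality, general vision applications, player identification and body tracking, medical imaging, and even 3D world reconstruction. Our work builds on Criminisi, who applies decision forest classifiers across a range of computer vision applications, including identifying body parts on Microsoft Kinect. We now extend this to image segmentation in hardware. Yin uses a software classification forest for image segmentation with a non-depth-sensing webcam. The method compares well to a stereo camera, but its peak performance is only 7.7 fps, short of real time. Kinsella improves webcam image quality, implementing several image segmentation algorithms on a Digilent Spartan 3 evaluation board. Segmentation for object classification requires a different approach; the most successful so far is the statistically trained tree classifier. Yang uses a GPU to accelerate image segmentation for background removal. The GPU computes the squared distance for all pixels, and a threshold then separates the objects in the scene. Other image morphology operations, such as erosion and dilation of object edges, are also used for segmentation. Yang reports a 30% performance improvement. While this work could conceivably be used for our segmentation, its power requirements and limited speedup are major concerns. MacLean provides an overview of the field and motivates the suitability of FPGAs for computer vision.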

B. Software BGR

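Using motion between frames, the software version of BGR identifies pixels as candidates for active player tagging. Using a connected component algorithm with a gradient descent approach, these active pixels are merged with nearby pixels into pixel islands. Ideally, a player mask would emerge as a single island. In practice, however, a player mask often has to be assembled from several islands. This is done by combining and splitting islands using motion, history from previous frames, and fairly complex manually designed rules. The result is complex and completely sequential software, which makes it undesirable to implement on an FPGA. In this paper we develop alternative approaches that solve the problem with acceptable quality and better performance. Note that BGR does not simply separate players from the background. Players are individually tagged, consistently from frame to frame, and shapes that only vaguely resemble a human must be rejected. In addition, the model fitting stage requires identification of the floor plane in order to locate the players precisely in world space. A hardware BGR implementation has to provide information of the same or better quality.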

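We tested one possibility: running BGR in software on an embedded processor. In the first test we used an Atom processor at 1.6 GHz, which yields a frame rate of about 14.3 fps. In the second test we used the ARM processor of a Microsoft Surface tablet running at 1.2 GHz, which yields about 7 fps. The ARM on a Xilinx Zynq is similar, but runs at only half the speed. Although neither implementation is well tuned, both fall well short of the minimum acceptable frame rate of 30 fps, and this does not account for the remaining phases of the pipeline. Note that in practice the required frame rate may be higher, perhaps as high as 90 fps, so that end-user applications can run concurrently.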

C. FPGA Forest Fire

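Forest Fire is the random-tree-based classification algorithm used on Microsoft Kinect to classify the pixels of a depth image as human body parts. Each active pixel traverses multiple binary trees. Starting at the root, a decision based on an evaluation function determines whether to proceed to the left or the right child. Eventually the traversal reaches a leaf node, which holds the probabilities that the current pixel belongs to each particular body part. The results from the different trees are finally aggregated together.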
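The traversal described above can be sketched in a few lines of Python. This is a minimal illustration, not the Kinect implementation: the `Node` layout, the depth-difference feature used inside `classify_pixel`, and plain averaging as the aggregation rule are all assumptions made for the example. The text only says that each node applies an evaluation function and that the per-tree results are aggregated.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One tree node; a node is a leaf when `probs` is set (hypothetical layout)."""
    offset: Tuple[int, int] = (0, 0)      # assumed: pixel offset probed by the split
    threshold: float = 0.0                # assumed: depth-difference threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    probs: Optional[List[float]] = None   # per-class probabilities, leaves only


def depth_at(depth, x, y, far=10_000.0):
    """Depth at (x, y); pixels outside the image read as a large 'far' value."""
    if 0 <= y < len(depth) and 0 <= x < len(depth[0]):
        return depth[y][x]
    return far


def classify_pixel(tree, depth, x, y):
    """Walk one binary tree from the root to a leaf and return the leaf's probabilities."""
    node = tree
    while node.probs is None:
        dx, dy = node.offset
        # Assumed evaluation function: depth of a probe pixel minus depth of this pixel.
        feature = depth_at(depth, x + dx, y + dy) - depth_at(depth, x, y)
        node = node.left if feature < node.threshold else node.right
    return node.probs


def classify_with_forest(forest, depth, x, y):
    """Aggregate the trees by averaging the leaf distributions they reach."""
    results = [classify_pixel(tree, depth, x, y) for tree in forest]
    return [sum(column) / len(results) for column in zip(*results)]
```

Every pixel follows the same short loop, but each step reads a different node and a different probe pixel, which fits the later remark that memory access is the main bottleneck of the hardware core.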

Oberg produced a high-performance hardware implementation of the Forest Fire classifier for body part classification. Memory access is the main bottleneck, and the tree traversal together with a sorting FIFO produces an optimal memory access sequence. In this work we reuse that core with minor modifications. The system is described as similar to the Microsoft Xbox platform, splitting the whole pipeline into four stages.

The implementation is described in the paper. Our intended target is similar to the Xilinx Zynq. We measured the model fitting stage on a Microsoft Surface tablet at less than 1 ms per frame, so we are confident in our estimate that this stage does not affect performance. Given the sequential nature of the model fitting code, a hardware implementation would give no performance benefit, while its area cost would be noticeable. The system described in the paper is therefore preferable.

Hardware BGR

Fig. 3 is a composite block diagram of the three solutions. The baseline uses the software BGR step, which applies a connected component algorithm to produce the foreground map and RANSAC to compute the floor. In the one-stage solution, we feed the input depth image into a single classifier trained with an augmented forest whose labels extend the original 31 body parts. An additional element takes the floor data and solves for the exact equation of the floor plane. The two-stage solution uses two instantiations of the Forest Fire classifier. The first instance separates floor, human, and everything else. We then feed the human foreground map to the original Microsoft Xbox body-part classifier for further identification of the various body parts. Note that the baseline solves the player separation problem and consistently assigns a given body part to a particular player. The other two solutions require an additional pre-model-fit module.

A. One Stage Classifier

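The one-stage forest classifier is very simple and requires no hardware beyond what body part classification already uses. The only difference is that we augment the set of labels of interest. In addition, using a single classifier minimizes the external memory requirement. The main disadvantage is that the classifier must process every pixel, without filtering. By contrast, image segmentation filters out the background pixels, so the classifier works only on the active foreground map. In practice the ratio of background to foreground pixels is about 4:1. Even though the classifier itself is unchanged, this at least triples the processing time. In other words, the classifier operates constantly in "stress performance" mode, at up to 56 fps. Clearly, the classifier trees must grow to accommodate the floor while still producing results as accurate as the original. Nevertheless, for a fairer comparison, the new forest also has three trees, each 20 levels deep, for a total forest size of 24 MB. The only difference is the additional class label. The system reaches 56 fps on all inputs.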
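As a sanity check on the 24 MB figure: it matches three complete binary trees of 20 levels if each node occupies 8 bytes. The 8-byte node size is not stated in the text. It is inferred here because the same assumption also reproduces the 24 KB (3 trees, 10 levels) and 1.5 MB (3 trees, 16 levels) forests quoted in the two-stage section. The helper name `forest_bytes` is made up for this sketch.

```python
def forest_bytes(trees, levels, node_bytes=8):
    """Size of a forest of complete binary trees; node_bytes=8 is an inferred assumption."""
    nodes_per_tree = 2 ** levels - 1   # a complete binary tree with `levels` levels
    return trees * nodes_per_tree * node_bytes


for trees, levels in [(3, 20), (3, 16), (3, 10)]:
    size = forest_bytes(trees, levels)
    print(f"{trees} trees x {levels} levels: {size} bytes ({size / 2**20:.3f} MiB)")
```

Because the node count doubles with every extra level, trimming the depth from 20 to 16 or 10 levels is what brings a forest down to a size that can plausibly live in on-chip memory.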

B. Two Stage Classifier

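Several considerations led us to investigate the two-stage approach, also shown in Fig. 3, in which a coarse-grained classification into foreground, background, and floor is followed by a more fine-grained classification of the foreground pixels. Performance is the first reason: we use a smaller, faster forest for the first-stage filter and apply the heavy-duty body part classification to only about 1/4 or 1/5 of the image area. This raises performance to more than 200 fps. The second reason is that the hardware resources needed to implement the classifier are only a small percentage of the device, so replicating the whole unit is quite feasible. The third consideration is that the second stage reuses the existing forest, which was trained on millions of images, whereas the first forest does not need much training data. In addition, the small number of classes can be extended with additional classes in the future. In practice, separating background from foreground can be done at a lower resolution. To take advantage of this, we subsample the images in the background removal stage and then up-sample the resulting mask for the second-stage body part classification step. The most important property is performance. Good classification accuracy on fewer classes is achievable with a smaller forest depth. If the forest is small enough, it can be stored on chip rather than in external memory. We evaluated a small forest of 24 KB (3 trees, 10 depth levels) and a larger 1.5 MB forest (3 trees, 16 levels). Both fit in internal memory. Eliminating the external memory accesses significantly reduces the computational cost of the step.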
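The subsample, classify, up-sample flow of the two-stage approach can be sketched as follows. This is an illustration under assumptions: the two classifiers are passed in as plain callables (`coarse_classifier` and `body_part_classifier` are hypothetical stand-ins for the two forests), the subsampling factor of 2 is arbitrary, and nearest-neighbour up-sampling is one simple choice that the text does not specify.

```python
def subsample(image, factor):
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in image[::factor]]


def upsample_mask(mask, factor, height, width):
    """Nearest-neighbour up-sampling of a low-resolution mask to height x width."""
    return [[mask[y // factor][x // factor] for x in range(width)]
            for y in range(height)]


def two_stage(depth, coarse_classifier, body_part_classifier, factor=2):
    """Stage 1 finds foreground at low resolution; stage 2 labels only those pixels."""
    height, width = len(depth), len(depth[0])
    small = subsample(depth, factor)
    small_mask = [[coarse_classifier(small, x, y) == "foreground"
                   for x in range(len(small[0]))]
                  for y in range(len(small))]
    mask = upsample_mask(small_mask, factor, height, width)
    labels = {}
    for y in range(height):
        for x in range(width):
            if mask[y][x]:  # background and floor pixels never reach stage 2
                labels[(x, y)] = body_part_classifier(depth, x, y)
    return mask, labels
```

The expensive second classifier runs only on the pixels that survive the mask, which is where the reduction to roughly a quarter or a fifth of the image area comes from.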

C. Floor Computation

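The baseline system uses RANSAC to find the equation of the plane with the minimum distance from all points in the bottom 10% of the image. An additional step refines the floor candidate to the one with the smallest eigenvalue. The algorithm checks that the Microsoft Kinect sensor is tilted no more than 20 degrees from the floor normal. If we apply other forms of filtering, we can relax this vertical-orientation constraint and perform RANSAC on all floor pixels. Note that on a mobile device the orientation restriction cannot be enforced, and floor pixels must be classified regardless of camera orientation. All the hardware approaches can do this.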
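For readers unfamiliar with RANSAC plane fitting, here is a minimal generic sketch. It is not the baseline's code: it omits the bottom-10% selection, the eigenvalue refinement, and the 20-degree tilt check described above, and the iteration count, inlier tolerance, and function names are arbitrary choices for the example.

```python
import math
import random


def plane_from_points(p1, p2, p3):
    """Return (unit normal, d) with normal . p + d = 0, or None if the points are collinear."""
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    length = math.sqrt(sum(c * c for c in n))
    if length < 1e-9:
        return None
    n = tuple(c / length for c in n)
    d = -sum(n[i] * p1[i] for i in range(3))
    return n, d


def ransac_plane(points, iterations=200, tolerance=0.02, seed=0):
    """Keep the plane through three random points that has the most inliers."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, -1
    for _ in range(iterations):
        plane = plane_from_points(*rng.sample(points, 3))
        if plane is None:
            continue  # degenerate sample, try again
        n, d = plane
        inliers = sum(1 for p in points
                      if abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d) <= tolerance)
        if inliers > best_inliers:
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

Note that the sign of the returned normal depends on the order in which the three points happen to be sampled, so the same floor can come back facing up or facing down.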

The RANSAC algorithm provides a high degree of noise reduction; mathematically, any three noise-free floor points yield the same plane equation. Classification and centroid computation are alternative forms of noise filtering. This led us to consider a simpler and more efficient way to compute the floor plane equation. As with body part pixels, we use a hardware streaming k-means algorithm to compute centroids for the floor pixels. This yields a relatively large number of floor centroids, but far fewer than the total number of floor pixels (in practice, a hundred or so against a few thousand). We then identify a bounding cube for the floor, which defines six centroids. From combinations of these points we obtain 21 candidate planes whose normals point toward the floor. Each candidate is tested against the remaining centroids to find the plane with the best fit. By construction, each candidate has a well-known orientation with respect to the coordinate system. With RANSAC, all points are chosen at random, so the up-facing and down-facing planes are equally valid.
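The candidate-plane idea can be illustrated with a short sketch. Several details are assumptions: the six points are taken to be the centroids with the minimum and maximum coordinate along each axis, candidates are all triples of those points, the fit score is the summed distance of the remaining centroids, and the normal is oriented along +y as a stand-in for the fixed orientation the text mentions. The text quotes 21 candidate planes; taking every triple of six points, as done here, gives at most 20, so the original construction presumably differs in a detail that is not given.

```python
import math
from itertools import combinations


def plane_through(p1, p2, p3, up=(0.0, 1.0, 0.0)):
    """Plane through three points as (unit normal, d), with the normal flipped to agree with `up`."""
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = math.sqrt(sum(c * c for c in n))
    if length < 1e-9:
        return None  # collinear points define no plane
    n = [c / length for c in n]
    if sum(a * b for a, b in zip(n, up)) < 0:
        n = [-c for c in n]  # fixed orientation: no up/down ambiguity
    d = -sum(n[i] * p1[i] for i in range(3))
    return tuple(n), d


def extreme_centroids(centroids):
    """Centroids touching the axis-aligned bounding box (min and max along x, y, z)."""
    extremes = []
    for axis in range(3):
        for pick in (min, max):
            c = pick(centroids, key=lambda p: p[axis])
            if c not in extremes:
                extremes.append(c)
    return extremes


def fit_floor(centroids):
    """Try every plane through three extreme centroids; keep the best-fitting one."""
    best_plane, best_error = None, math.inf
    for triple in combinations(extreme_centroids(centroids), 3):
        plane = plane_through(*triple)
        if plane is None:
            continue
        n, d = plane
        error = sum(abs(n[0] * p[0] + n[1] * p[1] + n[2] * p[2] + d)
                    for p in centroids if p not in triple)
        if error < best_error:
            best_plane, best_error = plane, error
    return best_plane, best_error
```

Unlike the RANSAC loop, this search is deterministic and evaluates a small fixed number of candidates over about a hundred centroids rather than thousands of pixels.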

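Both modules can be used in our system. RANSAC provides better noise reduction; k-means is more efficient and eliminates any up/down uncertainty. Note that RANSAC can compensate for the efficiency advantage of k-means, because its computation can run in parallel with, and asynchronously to, the frame stream. In practice the floor orientation does not change often and is unlikely to change every frame.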

D. Player Tagging and Model Fitting

The software BGR provides the functions needed for player labelling and player tracking. Different players must receive different labels within a frame, and the same player must be assigned the same ID from frame to frame. The connected-component-based approach naturally leads to player separation, and remembering the center of body mass from frame to frame achieves the required consistency in tagging. Because the Forest Fire classifier does not provide this function, it must be implemented elsewhere to maintain backward compatibility. This module is designated pre-model-fit. Through experimentation we verified that the original Xbox body part forest, running on the Forest Fire hardware, accurately classifies pixels into body parts. The probability maps simply show multiple hot spots corresponding to the various instances of a body part. Therefore, in the two-stage classifier, all foreground pixels can be labeled as Player 1 and passed on to the second stage.

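Unfortunately, this only moves the problem to the next step. The Microsoft Xbox model fitting algorithm still needs partitioned centroids in order to assemble skeletons effectively. The pre-model-fit step takes a single list of centroid candidates and splits it into multiple per-player lists. The fine details of this step are beyond the scope of the study, but intuitively the problem must be solvable when handling 4-8 candidate centroids per body part. One approach is to use the same island-based gradient descent algorithm, only at the centroid level. A simpler approach is as follows: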

First partition the head centroids: looking at the clusters of head centroids, estimate the number and location of the heads in the scene. Use the neck and left/right torso candidates to verify the number of players. Double-check the new estimate against the previous frame, assuming that the addition or deletion of a player is an infrequent event. Use a combination of proximity and connectivity tests to connect head, neck, shoulders, and torso. Finally, add the limbs to each player's centroid set. Note that we should err on the side of caution: we can add the same weak centroid to all the player sets and let model fitting sort it out, but a strong candidate must be assigned to the one correct list. Consistent tagging can be achieved by tracking the head or torso centroids from frame to frame. In our observations, the two combined model fitting steps still take 1 ms per frame.
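The first and last steps of this procedure, grouping head centroids and keeping player IDs stable across frames, can be sketched as below. The paper leaves the fine details open, so everything here is a hypothetical illustration: the greedy proximity clustering, the `radius` and `max_jump` thresholds (in metres), and the function names are assumptions, and the neck/torso verification and limb assignment steps are not shown.

```python
import math


def mean_point(points):
    """Component-wise mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def cluster_heads(candidates, radius=0.3):
    """Greedy proximity clustering: one cluster per estimated head."""
    clusters = []
    for p in candidates:
        for cluster in clusters:
            if math.dist(p, mean_point(cluster)) <= radius:
                cluster.append(p)
                break
        else:
            clusters.append([p])  # no cluster close enough: a new head
    return [mean_point(cluster) for cluster in clusters]


def assign_ids(heads, previous, max_jump=0.5):
    """Reuse the ID of the nearest head from the previous frame, else issue a new ID.

    `previous` maps player ID -> head position in the previous frame.
    """
    current = {}
    unused = dict(previous)
    next_id = max(previous, default=0) + 1
    for head in heads:
        nearest = min(unused, key=lambda pid: math.dist(head, unused[pid]), default=None)
        if nearest is not None and math.dist(head, unused[nearest]) <= max_jump:
            current[nearest] = head
            del unused[nearest]
        else:
            current[next_id] = head
            next_id += 1
    return current
```

Matching against the previous frame encodes the stated assumption that players rarely appear or disappear: a head keeps its ID as long as it has not moved further than `max_jump` between frames.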

E. Forest Training

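The Microsoft Xbox product is routinely tested on a very large repository of clips, each a short depth-image movie of about 200-500 frames. We found a way to leverage this data to generate training sets for both forests. We use the baseline Microsoft Xbox pipeline to identify and eliminate the players from the scenes, leaving only background depth images. We then insert computer-generated human models of different sizes, body types, and poses. The method is automatic and can produce a large number of training images, and it gives us the required variety in player poses, which is missing to some extent from the original clips. Because the CG players are computer generated, we know in advance where the body parts are and can therefore generate the ground-truth labels automatically. Floor pixels are labeled using the computed floor equation.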

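Additional work is needed to refine the forest training process and to improve the size and quality of the training sets. We used a small training set of about 4000 images for the roughly 32-class forest of the one-stage version, which takes a long time to train. We used 1000 images to train the first stage of the two-stage version. With these set sizes we can produce a new forest within 24 hours on an Intel i7 hex-core PC at 3.2 GHz with 64 GB of memory. The quality of the current training data can be improved: inserting computer-generated human figures into existing scenes produces artifacts, especially at body parts such as hands, feet, wrists, and knees.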

Results

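We used the following setup for testing. A host PC provides the feed, either from live cameras or from disk, using a simple interface for reconfigurable computing. The FPGA returns the centroids and the plane coefficients of the floor to the PC, which then performs the final model fit stage. We use a Xilinx Virtex-6 evaluation board. We evaluate the two hardware BGR approaches on three test suites of clips and compare them with the original baseline system and with ground truth. Each suite is based on different video clips. Suite 94 is mostly standing humans with similar backgrounds, while Suite 11 is mostly seated humans with furniture. The suites contain 193, 533, and 155 clips respectively, and each clip contains hundreds of frames. The baseline is the existing Microsoft Kinect for Windows SDK. One-stage is the single-classifier system, and two-stage is the two-step classifier system. For each frame we compute the joint positions of the final skeleton from model fitting. Figure 4 shows the results obtained from a single frame. The first column is the depth map, the second is the players tagged by the baseline software, the third is the foreground probability map from the two-stage system, the fourth is the floor probability map, and the fifth is the final skeleton. Visual comparison is not very practical and yields no quantitative result.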

A. Skeletal Tracking Results

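A quantitative approach is to compare each joint location and then average and aggregate the results. This is a probabilistic task, however, and 100% accuracy is impossible. Moreover, a large location error does not necessarily correspond to a large visual error. A more effective approach is to look at the length and especially the orientation of the skeleton's limbs.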
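The difference between the two metrics is easy to see in code. The sketch below is an assumed formulation, not the paper's exact metric: a skeleton is a dictionary from joint name to 3D position, a limb is a pair of joint names, and the limb error is reported as a length difference plus the angle between the predicted and ground-truth limb vectors.

```python
import math


def joint_error(predicted, truth, joint):
    """Euclidean distance between a predicted joint and its ground-truth position."""
    return math.dist(predicted[joint], truth[joint])


def limb_vector(joints, limb):
    """Vector from the first joint of the limb to the second."""
    a, b = limb
    return tuple(joints[b][i] - joints[a][i] for i in range(3))


def limb_error(predicted, truth, limb):
    """Return (length difference, angle in degrees) between predicted and true limb.

    Assumes both limbs have non-zero length.
    """
    u = limb_vector(predicted, limb)
    v = limb_vector(truth, limb)
    len_u = math.sqrt(sum(c * c for c in u))
    len_v = math.sqrt(sum(c * c for c in v))
    cosine = sum(a * b for a, b in zip(u, v)) / (len_u * len_v)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding just outside [-1, 1]
    return abs(len_u - len_v), math.degrees(math.acos(cosine))
```

For example, a forearm that is shifted sideways by 20 cm as a whole has a joint error of 0.2 m at both the elbow and the wrist, yet its limb length and orientation errors are zero, which matches the intuition that the skeleton still looks right.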

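In Table 1 we report the number of joint locations that differ from ground truth, along with their average and maximum distances along the different directions. Lower values are better. We also count the number of limbs that differ from ground truth. Because we have two alternative algorithms for floor calculation, we report statistics for both RANSAC and k-means to assess their effect on the rest of the pipeline. The tests show that both algorithms are effective, with little difference between them.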

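Table 1 shows that the share of joints that differ substantially from ground truth changes by 2% to 3% when hardware BGR is used instead of software BGR. Overall, the two-stage classifier is slightly better than the one-stage classifier, and it gets there with a much smaller training set. The two-stage classifier is also closer to the baseline in average difference. The joint difference from baseline ranges from 12% for the one-stage classifier to 8% for the two-stage classifier. With the two-stage classifier, the average limb deviation is still very small.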

B. Floor Computation Results

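Floor computation accuracy and latency may be relevant in other scenarios, such as mobile devices. We computed both the RANSAC and the k-means solutions on a single frame. We established ground truth by running RANSAC with no limit on the number of iterations, stopping only when the result was stable. We discarded the 12% of images that failed to produce a floor. K-means, by contrast, can find a valid floor.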

C. Device Utilization

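For the two-stage classifier, the Forest Fire module is instantiated twice, and the foreground image is written back after the first stage. As mentioned, the first-stage image segmentation forest could be stored on chip, but for simplicity we store both forests in DDR3 memory. We did not investigate pipelining the two classifiers.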

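Table 3 gives the utilization on the Virtex-6. The reported utilization is 7.5% of the LUTs and 30% of the BRAM. Our implementation returns the centroids of the body parts rather than pointers to the forest leaves; in other words, each Forest Fire core contains an additional centroid computation unit. This has tripled the LUT count while cutting the block RAM to 1/5 of the original. The implementations of the DDR3 controller and the PC interface are largely unchanged. The one-stage system operates as follows: the depth image is downloaded to the FPGA over Ethernet, and the centroid of each body part type is computed and written to the output buffer.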

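Conclusion

The main contribution of the paper is a complete, fully embedded realization of the Microsoft pipeline that uses no powerful CPU or GPU, draws very little power, and runs on a commercially available FPGA. To implement the system in hardware, we use a new classification forest for image segmentation. Instead of segmenting connected objects and tagging pixels, we classify pixels directly into human body parts and floor. A straightforward realization with a single classifier did not perform well against the commercial-grade Xbox baseline. That system is compact, but excessive noise and limited forest training both work against it, and performance is also negatively affected by the lack of filtering. The two-stage approach, by contrast, performs acceptably: the first classifier detects foreground and floor data, and the second performs the actual body part extraction. The whole system is no worse than 7% of the baseline, and only 1% worse on average. In addition, we can reuse the shipping main forest and augment it with a second, smaller forest. The hardware implementation reduces overall system latency and achieves a frame rate above 200 fps. We considered a number of alternative segmentation algorithms, but none met all the requirements, in particular separating different players. We reduce the separation problem to assigning the identified body parts to the correct player, which is computationally easier. Training the forests required some creativity. For evaluation, the best metric is the limb orientation of the final skeleton. Besides the players, the system must correctly identify the floor location in world-space coordinates. We described and evaluated a RANSAC-based approach and a novel k-means-based approach.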

http://research.microsoft.com/apps/pubs/default.aspx?id=184982

http://research.microsoft.com/apps/pubs/default.aspx?id=170804

To repost this article, please contact the original author for permission and credit 刘小邦's ScienceNet blog.
Link: https://blog.sciencenet.cn/blog-942948-706406.html
