The following article is reposted from 言语语言病理学.
Today's arXiv cs.CV listing contains 52 papers in total.
Detection (3 papers)
[1]:Probabilistic two-stage detection
Title: Probabilistic two-stage detection
Authors: Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
Note: Code is available at this https URL
Link: https://arxiv.org/abs/2103.07461
摘要:We develop a probabilistic interpretation of two-stage object detection. We show that this probabilistic interpretation motivates a number of common empirical training practices. It also suggests changes to two-stage detection pipelines. Specifically, the first stage should infer proper object-vs-background likelihoods, which should then inform the overall score of the detector. A standard region proposal network (RPN) cannot infer this likelihood sufficiently well, but many one-stage detectors can. We show how to build a probabilistic two-stage detector from any state-of-the-art one-stage detector. The resulting detectors are faster and more accurate than both their one- and two-stage precursors. Our detector achieves 56.4 mAP on COCO test-dev with single-scale testing, outperforming all published results. Using a lightweight backbone, our detector achieves 49.2 mAP on COCO at 33 fps on a Titan Xp, outperforming the popular YOLOv4 model.[2]:CRFace: Confidence Ranker for Model-Agnostic Face Detection Refinement
Title: CRFace: Confidence Ranker for Model-Agnostic Face Detection Refinement
Authors: Noranart Vesdapunt, Baoyuan Wang
Note: CVPR 2021
Link: https://arxiv.org/abs/2103.07017
摘要:Face detection is a fundamental problem for many downstream face applications, and there is a rising demand for faster, more accurate yet support for higher resolution face detectors. Recent smartphones can record a video in 8K resolution, but many of the existing face detectors still fail due to the anchor size and training data. We analyze the failure cases and observe a large number of correct predicted boxes with incorrect confidences. To calibrate these confidences, we propose a confidence ranking network with a pairwise ranking loss to re-rank the predicted confidences locally within the same image. Our confidence ranker is model-agnostic, so we can augment the data by choosing the pairs from multiple face detectors during the training, and generalize to a wide range of face detectors during the testing. On WiderFace, we achieve the highest AP on the single-scale, and our AP is competitive with the previous multi-scale methods while being significantly faster. On 8K resolution, our method solves the GPU memory issue and allows us to indirectly train on 8K. We collect 8K resolution test set to show the improvement, and we will release our test set as a new benchmark for future research.[3]:Hyperspectral Image Denoising and Anomaly Detection Based on Low-rank and Sparse Representations
Title: Hyperspectral Image Denoising and Anomaly Detection Based on Low-rank and Sparse Representations
Authors: Lina Zhuang, Lianru Gao, Bing Zhang, Xiyou Fu, Jose M. Bioucas-Dias
Link: https://arxiv.org/abs/2103.07437
摘要:Hyperspectral imaging measures the amount of electromagnetic energy across the instantaneous field of view at a very high resolution in hundreds or thousands of spectral channels. This enables objects to be detected and the identification of materials that have subtle differences between them. However, the increase in spectral resolution often means that there is a decrease in the number of photons received in each channel, which means that the noise linked to the image formation process is greater. This degradation limits the quality of the extracted information and its potential applications. Thus, denoising is a fundamental problem in hyperspectral image (HSI) processing. As images of natural scenes with highly correlated spectral channels, HSIs are characterized by a high level of self-similarity and can be well approximated by low-rank representations. These characteristics underlie the state-of-the-art methods used in HSI denoising. However, where there are rarely occurring pixel types, the denoising performance of these methods is not optimal, and the subsequent detection of these pixels may be compromised. To address these hurdles, in this article, we introduce RhyDe (Robust hyperspectral Denoising), a powerful HSI denoiser, which implements explicit low-rank representation, promotes self-similarity, and, by using a form of collaborative sparsity, preserves rare pixels. The denoising and detection effectiveness of the proposed robust HSI denoiser is illustrated using semireal and real data.
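For intuition, denoisers in this low-rank-plus-sparse family can be written as a generic objective of the following form (our illustrative formulation, not necessarily RhyDe's exact objective), where Y stacks the noisy spectral vectors as columns:

\[
\min_{L,\,S}\ \tfrac{1}{2}\,\lVert Y - L - S\rVert_F^2 \;+\; \lambda\,\lVert L\rVert_* \;+\; \beta\,\lVert S\rVert_{2,1},
\qquad
\lVert S\rVert_{2,1} \;=\; \sum_{j}\lVert S_{:,j}\rVert_2 .
\]

The nuclear norm favors a low-rank, self-similar clean component L, while the column-wise l2,1 penalty (a form of collaborative sparsity) lets a small number of columns, i.e. rare pixels, be explained by S rather than smoothed away.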
Segmentation (5 papers)
[1]:3D Semantic Scene Completion: a Survey
Title: 3D Semantic Scene Completion: a Survey
Authors: Luis Roldao, Raoul de Charette, Anne Verroust-Blondet
Note: Journal submission
Link: https://arxiv.org/abs/2103.07466
摘要:Semantic Scene Completion (SSC) aims to jointly estimate the complete geometry and semantics of a scene, assuming partial sparse input. In the last years following the multiplication of large-scale 3D datasets, SSC has gained significant momentum in the research community because it holds unresolved challenges. Specifically, SSC lies in the ambiguous completion of large unobserved areas and the weak supervision signal of the ground truth. This led to a substantially increasing number of papers on the matter. This survey aims to identify, compare and analyze the techniques providing a critical analysis of the SSC literature on both methods and datasets. Throughout the paper, we provide an in-depth analysis of the existing works covering all choices made by the authors while highlighting the remaining avenues of research. SSC performance of the SoA on the most popular datasets is also evaluated and analyzed.[2]:Juggling With Representations: On the Information Transfer Between Imagery, Point Clouds, and Meshes for Multi-Modal Semantics
Title: Juggling With Representations: On the Information Transfer Between Imagery, Point Clouds, and Meshes for Multi-Modal Semantics
Authors: Dominik Laupheimer, Norbert Haala
Link: https://arxiv.org/abs/2103.07348
摘要:The automatic semantic segmentation of the huge amount of acquired remote sensing data has become an important task in the last decade. Images and Point Clouds (PCs) are fundamental data representations, particularly in urban mapping applications. Textured 3D meshes integrate both data representations geometrically by wiring the PC and texturing the surface elements with available imagery. We present a mesh-centered holistic geometry-driven methodology that explicitly integrates entities of imagery, PC and mesh. Due to its integrative character, we choose the mesh as the core representation that also helps to solve the visibility problem for points in imagery. Utilizing the proposed multi-modal fusion as the backbone and considering the established entity relationships, we enable the sharing of information across the modalities imagery, PC and mesh in a two-fold manner: (i) feature transfer and (ii) label transfer. By these means, we achieve to enrich feature vectors to multi-modal feature vectors for each representation. Concurrently, we achieve to label all representations consistently while reducing the manual label effort to a single representation. Consequently, we facilitate to train machine learning algorithms and to semantically segment any of these data representations - both in a multi-modal and single-modal sense. The paper presents the association mechanism and the subsequent information transfer, which we believe are cornerstones for multi-modal scene analysis. Furthermore, we discuss the preconditions and limitations of the presented approach in detail. We demonstrate the effectiveness of our methodology on the ISPRS 3D semantic labeling contest (Vaihingen 3D) and a proprietary data set (Hessigheim 3D).[3]:Discriminative Region Suppression for Weakly-Supervised Semantic Segmentation
Title: Discriminative Region Suppression for Weakly-Supervised Semantic Segmentation
Authors: Beomyoung Kim, Sangeun Han, Junmo Kim
Note: AAAI 2021
Link: https://arxiv.org/abs/2103.07246
摘要:Weakly-supervised semantic segmentation (WSSS) using image-level labels has recently attracted much attention for reducing annotation costs. Existing WSSS methods utilize localization maps from the classification network to generate pseudo segmentation labels. However, since localization maps obtained from the classifier focus only on sparse discriminative object regions, it is difficult to generate high-quality segmentation labels. To address this issue, we introduce discriminative region suppression (DRS) module that is a simple yet effective method to expand object activation regions. DRS suppresses the attention on discriminative regions and spreads it to adjacent non-discriminative regions, generating dense localization maps. DRS requires few or no additional parameters and can be plugged into any network. Furthermore, we introduce an additional learning strategy to give a self-enhancement of localization maps, named localization map refinement learning. Benefiting from this refinement learning, localization maps are refined and enhanced by recovering some missing parts or removing noise itself. Due to its simplicity and effectiveness, our approach achieves mIoU 71.4% on the PASCAL VOC 2012 segmentation benchmark using only image-level labels. Extensive experiments demonstrate the effectiveness of our approach. The code is available atthis https URL.[4]:Thousand to One: Semantic Prior Modeling for Conceptual Coding
Title: Thousand to One: Semantic Prior Modeling for Conceptual Coding
Authors: Jianhui Chang, Zhenghui Zhao, Lingbo Yang, Chuanmin Jia, Jian Zhang, Siwei Ma
Note: Accepted to ICME 2021 (oral)
Link: https://arxiv.org/abs/2103.07131
摘要:Conceptual coding has been an emerging research topic recently, which encodes natural images into disentangled conceptual representations for compression. However, the compression performance of the existing methods is still sub-optimal due to the lack of comprehensive consideration of rate constraint and reconstruction quality. To this end, we propose a novel end-to-end semantic prior modeling-based conceptual coding scheme towards extremely low bitrate image compression, which leverages semantic-wise deep representations as a unified prior for entropy estimation and texture synthesis. Specifically, we employ semantic segmentation maps as structural guidance for extracting deep semantic prior, which provides fine-grained texture distribution modeling for better detail construction and higher flexibility in subsequent high-level vision tasks. Moreover, a cross-channel entropy model is proposed to further exploit the inter-channel correlation of the spatially independent semantic prior, leading to more accurate entropy estimation for rate-constrained training. The proposed scheme achieves an ultra-high 1000x compression ratio, while still enjoying high visual reconstruction quality and versatility towards visual processing and analysis tasks.[5]:Semantic Segmentation for Real Point Cloud Scenes via Bilateral Augmentation and Adaptive Fusion
标题:基于双边增强和自适应融合的真实点云场景语义分割
作者:Shi Qiu, Saeed Anwar, Nick Barnes
备注:Accepted in CVPR2021
链接:https://arxiv.org/abs/2103.07074
摘要:Given the prominence of current 3D sensors, a fine-grained analysis on the basic point cloud data is worthy of further investigation. Particularly, real point cloud scenes can intuitively capture complex surroundings in the real world, but due to 3D data's raw nature, it is very challenging for machine perception. In this work, we concentrate on the essential visual task, semantic segmentation, for large-scale point cloud data collected in reality. On the one hand, to reduce the ambiguity in nearby points, we augment their local context by fully utilizing both geometric and semantic features in a bilateral structure. On the other hand, we comprehensively interpret the distinctness of the points from multiple resolutions and represent the feature map following an adaptive fusion method at point-level for accurate semantic segmentation. Further, we provide specific ablation studies and intuitive visualizations to validate our key modules. By comparing with state-of-the-art networks on three different benchmarks, we demonstrate the effectiveness of our network.
Classification & Recognition (2 papers)
[1]:ACTION-Net: Multipath Excitation for Action Recognition
Title: ACTION-Net: Multipath Excitation for Action Recognition
Authors: Zhengwei Wang, Qi She, Aljosa Smolic
Note: To appear in CVPR 2021
Link: https://arxiv.org/abs/2103.07372
摘要:Spatial-temporal, channel-wise, and motion patterns are three complementary and crucial types of information for video action recognition. Conventional 2D CNNs are computationally cheap but cannot catch temporal relationships; 3D CNNs can achieve good performance but are computationally intensive. In this work, we tackle this dilemma by designing a generic and effective module that can be embedded into 2D CNNs. To this end, we propose a spAtio-temporal, Channel and moTion excitatION (ACTION) module consisting of three paths: Spatio-Temporal Excitation (STE) path, Channel Excitation (CE) path, and Motion Excitation (ME) path. The STE path employs one channel 3D convolution to characterize spatio-temporal representation. The CE path adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels in terms of the temporal aspect. The ME path calculates feature-level temporal differences, which is then utilized to excite motion-sensitive channels. We equip 2D CNNs with the proposed ACTION module to form a simple yet effective ACTION-Net with very limited extra computational cost. ACTION-Net is demonstrated by consistently outperforming 2D CNN counterparts on three backbones (i.e., ResNet-50, MobileNet V2 and BNInception) employing three datasets (i.e., Something-Something V2, Jester, and EgoGesture). Codes are available at \url{this https URL}.[2]:Sequential Random Network for Fine-grained Image Classification
Title: Sequential Random Network for Fine-grained Image Classification
Authors: Chaorong Li, Malu Zhang, Wei Huang, Fengqing Qin, Anping Zeng, Yuanyuan Huang
Link: https://arxiv.org/abs/2103.07230
摘要:Deep Convolutional Neural Network (DCNN) and Transformer have achieved remarkable successes in image recognition. However, their performance in fine-grained image recognition is still difficult to meet the requirements of actual needs. This paper proposes a Sequence Random Network (SRN) to enhance the performance of DCNN. The output of DCNN is one-dimensional features. This one-dimensional feature abstractly represents image information, but it does not express well the detailed information of image. To address this issue, we use the proposed SRN which composed of BiLSTM and several Tanh-Dropout blocks (called BiLSTM-TDN), to further process DCNN one-dimensional features for highlighting the detail information of image. After the feature transform by BiLSTM-TDN, the recognition performance has been greatly improved. We conducted the experiments on six fine-grained image datasets. Except for FGVC-Aircraft, the accuracies of the proposed methods on the other datasets exceeded 99%. Experimental results show that BiLSTM-TDN is far superior to the existing state-of-the-art methods. In addition to DCNN, BiLSTM-TDN can also be extended to other models, such as Transformer.
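The BiLSTM-TDN head described above is concrete enough to sketch. Below is a minimal PyTorch sketch that assumes the backbone's 1-D feature vector is reshaped into a short sequence before the BiLSTM; the chunking, layer sizes and number of Tanh-Dropout blocks are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class BiLSTMTDN(nn.Module):
    # Sketch: chunk the DCNN feature vector into a sequence, run a BiLSTM,
    # then refine with Tanh-Dropout (TD) blocks before classification.
    def __init__(self, feat_dim=2048, chunk=128, hidden=256, n_td=3, n_classes=200, p=0.5):
        super().__init__()
        assert feat_dim % chunk == 0
        self.chunk = chunk
        self.bilstm = nn.LSTM(chunk, hidden, batch_first=True, bidirectional=True)
        blocks, in_dim = [], 2 * hidden
        for _ in range(n_td):
            blocks += [nn.Linear(in_dim, hidden), nn.Tanh(), nn.Dropout(p)]
            in_dim = hidden
        self.td = nn.Sequential(*blocks)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, feat):                         # feat: (B, feat_dim) from the DCNN
        x = feat.view(feat.size(0), -1, self.chunk)  # (B, T, chunk) sequence view
        x, _ = self.bilstm(x)                        # (B, T, 2*hidden)
        x = x.mean(dim=1)                            # pool over the sequence
        return self.fc(self.td(x))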
Tracking (2 papers)
[1]:Monocular Quasi-Dense 3D Object Tracking
Title: Monocular Quasi-Dense 3D Object Tracking
Authors: Hou-Ning Hu, Yung-Hsu Yang, Tobias Fischer, Trevor Darrell, Fisher Yu, Min Sun
Note: 24 pages, 14 figures, including appendix (6 pages)
Link: https://arxiv.org/abs/2103.07351
摘要:A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving. We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform. The object association leverages quasi-dense similarity learning to identify objects in various poses and viewpoints with appearance cues only. After initial 2D association, we further utilize 3D bounding boxes depth-ordering heuristics for robust instance association and motion-based 3D trajectory prediction for re-identification of occluded vehicles. In the end, an LSTM-based object velocity learning module aggregates the long-term trajectory information for more accurate motion extrapolation. Experiments on our proposed simulation data and real-world benchmarks, including KITTI, nuScenes, and Waymo datasets, show that our tracking framework offers robust object association and tracking on urban-driving scenarios. On the Waymo Open benchmark, we establish the first camera-only baseline in the 3D tracking and 3D detection challenges. Our quasi-dense 3D tracking pipeline achieves impressive improvements on the nuScenes 3D tracking benchmark with near five times tracking accuracy of the best vision-only submission among all published methods. Our code, data and trained models are available atthis https URL.[2]:Siamese Infrared and Visible Light Fusion Network for RGB-T Tracking
Title: Siamese Infrared and Visible Light Fusion Network for RGB-T Tracking
Authors: Peng Jingchao, Zhao Haitao, Hu Zhengwei, Zhuang Yi, Wang Bofan
Link: https://arxiv.org/abs/2103.07302
摘要:Due to the different photosensitive properties of infrared and visible light, the registered RGB-T image pairs shot in the same scene exhibit quite different characteristics. This paper proposes a siamese infrared and visible light fusion Network (SiamIVFN) for RBG-T image-based tracking. SiamIVFN contains two main subnetworks: a complementary-feature-fusion network (CFFN) and a contribution-aggregation network (CAN). CFFN utilizes a two-stream multilayer convolutional structure whose filters for each layer are partially coupled to fuse the features extracted from infrared images and visible light images. CFFN is a feature-level fusion network, which can cope with the misalignment of the RGB-T image pairs. Through adaptively calculating the contributions of infrared and visible light features obtained from CFFN, CAN makes the tracker robust under various light conditions. Experiments on two RGB-T tracking benchmark datasets demonstrate that the proposed SiamIVFN has achieved state-of-the-art performance. The tracking speed of SiamIVFN is 147.6FPS, the current fastest RGB-T fusion tracker.
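The contribution-aggregation step lends itself to a small sketch. The PyTorch snippet below is a hypothetical reading of how per-modality contributions could be predicted and used to fuse infrared and visible features; the layer shapes and names are our assumptions, not the paper's exact CAN design.

import torch
import torch.nn as nn

class ContributionAggregation(nn.Module):
    # Weight infrared vs. visible feature maps by adaptively predicted contributions.
    def __init__(self, channels=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * channels, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, f_ir, f_rgb):                  # both: (B, C, H, W)
        g = torch.cat([f_ir.mean(dim=(2, 3)), f_rgb.mean(dim=(2, 3))], dim=1)
        w = torch.softmax(self.score(g), dim=1)      # (B, 2) contribution weights
        w_ir = w[:, 0].view(-1, 1, 1, 1)
        w_rgb = w[:, 1].view(-1, 1, 1, 1)
        return w_ir * f_ir + w_rgb * f_rgb           # fused feature map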
Face (1 paper)
[1]:Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association
Title: Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association
Authors: Peisong Wen, Qianqian Xu, Yangbangyan Jiang, Zhiyong Yang, Yuan He, Qingming Huang
Link: https://arxiv.org/abs/2103.07293
摘要:Nowadays, we have witnessed the early progress on learning the association between voice and face automatically, which brings a new wave of studies to the computer vision community. However, most of the prior arts along this line (a) merely adopt local information to perform modality alignment and (b) ignore the diversity of learning difficulty across different subjects. In this paper, we propose a novel framework to jointly address the above-mentioned issues. Targeting at (a), we propose a two-level modality alignment loss where both global and local information are considered. Compared with the existing methods, we introduce a global loss into the modality alignment process. The global component of the loss is driven by the identity classification. Theoretically, we show that minimizing the loss could maximize the distance between embeddings across different identities while minimizing the distance between embeddings belonging to the same identity, in a global sense (instead of a mini-batch). Targeting at (b), we propose a dynamic reweighting scheme to better explore the hard but valuable identities while filtering out the unlearnable identities. Experiments show that the proposed method outperforms the previous methods in multiple settings, including voice-face matching, verification and retrieval.
Human Pose Estimation & 6D/Camera Pose Estimation (3 papers)
[1]:Deep Dual Consecutive Network for Human Pose Estimation
Title: Deep Dual Consecutive Network for Human Pose Estimation
Authors: Zhenguang Liu, Haoming Chen, Runyang Feng, Shuang Wu, Shouling Ji, Bailin Yang, Xun Wang
Note: 8 pages, 4 figures
Link: https://arxiv.org/abs/2103.07254
摘要:Multi-frame human pose estimation in complicated situations is challenging. Although state-of-the-art human joints detectors have demonstrated remarkable results for static images, their performances come short when we apply these models to video sequences. Prevalent shortcomings include the failure to handle motion blur, video defocus, or pose occlusions, arising from the inability in capturing the temporal dependency among video frames. On the other hand, directly employing conventional recurrent neural networks incurs empirical difficulties in modeling spatial contexts, especially for dealing with pose occlusions. In this paper, we propose a novel multi-frame human pose estimation framework, leveraging abundant temporal cues between video frames to facilitate keypoint detection. Three modular components are designed in our framework. A Pose Temporal Merger encodes keypoint spatiotemporal context to generate effective searching scopes while a Pose Residual Fusion module computes weighted pose residuals in dual directions. These are then processed via our Pose Correction Network for efficient refining of pose estimations. Our method ranks No.1 in the Multi-frame Person Pose Estimation Challenge on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018. We have released our code, hoping to inspire future research.[2]:Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation
Title: Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation
Authors: Hugo Germain, Vincent Lepetit, Guillaume Bourmaud
Link: https://arxiv.org/abs/2103.07153
摘要:Absolute camera pose estimation is usually addressed by sequentially solving two distinct subproblems: First a feature matching problem that seeks to establish putative 2D-3D correspondences, and then a Perspective-n-Point problem that minimizes, with respect to the camera pose, the sum of so-called Reprojection Errors (RE). We argue that generating putative 2D-3D correspondences 1) leads to an important loss of information that needs to be compensated as far as possible, within RE, through the choice of a robust loss and the tuning of its hyperparameters and 2) may lead to an RE that conveys erroneous data to the pose estimator. In this paper, we introduce the Neural Reprojection Error (NRE) as a substitute for RE. NRE allows to rethink the camera pose estimation problem by merging it with the feature learning problem, hence leveraging richer information than 2D-3D correspondences and eliminating the need for choosing a robust loss and its hyperparameters. Thus NRE can be used as training loss to learn image descriptors tailored for pose estimation. We also propose a coarse-to-fine optimization method able to very efficiently minimize a sum of NRE terms with respect to the camera pose. We experimentally demonstrate that NRE is a good substitute for RE as it significantly improves both the robustness and the accuracy of the camera pose estimate while being computationally and memory highly efficient. From a broader point of view, we believe this new way of merging deep learning and 3D geometry may be useful in other computer vision applications.[3]:FS-Net: Fast Shape-based Network for Category-Level 6D Object Pose Estimation with Decoupled Rotation Mechanism
Title: FS-Net: Fast Shape-based Network for Category-Level 6D Object Pose Estimation with Decoupled Rotation Mechanism
Authors: Wei Chen, Xi Jia, Hyung Jin Chang, Jinming Duan, Linlin Shen, Ales Leonardis
Note: Accepted to CVPR 2021 (oral)
Link: https://arxiv.org/abs/2103.07054
摘要:In this paper, we focus on category-level 6D pose and size estimation from monocular RGB-D image. Previous methods suffer from inefficient category-level pose feature extraction which leads to low accuracy and inference speed. To tackle this problem, we propose a fast shape-based network (FS-Net) with efficient category-level feature extraction for 6D pose estimation. First, we design an orientation aware autoencoder with 3D graph convolution for latent feature extraction. The learned latent feature is insensitive to point shift and object size thanks to the shift and scale-invariance properties of the 3D graph convolution. Then, to efficiently decode category-level rotation information from the latent feature, we propose a novel decoupled rotation mechanism that employs two decoders to complementarily access the rotation information. Meanwhile, we estimate translation and size by two residuals, which are the difference between the mean of object points and ground truth translation, and the difference between the mean size of the category and ground truth size, respectively. Finally, to increase the generalization ability of FS-Net, we propose an online box-cage based 3D deformation mechanism to augment the training data. Extensive experiments on two benchmark datasets show that the proposed method achieves state-of-the-art performance in both category- and instance-level 6D object pose estimation. Especially in category-level pose estimation, without extra synthetic data, our method outperforms existing methods by 6.3% on the NOCS-REAL dataset.
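The residual decoding of translation and size described in the abstract is simple enough to write out. The snippet below is a sketch with our own function and argument names (not taken from the paper's code):

import numpy as np

def recover_translation_and_size(observed_points, t_residual, category_mean_size, s_residual):
    # Translation = centroid of the observed object points + predicted residual;
    # size = per-category mean size + predicted residual.
    t = observed_points.mean(axis=0) + t_residual        # (3,) predicted translation
    s = np.asarray(category_mean_size) + s_residual      # (3,) predicted size
    return t, s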
Weakly/Semi-/Un-Supervised Learning (2 papers)
[1]:VDSM: Unsupervised Video Disentanglement with State-Space Modeling and Deep Mixtures of Experts
Title: VDSM: Unsupervised Video Disentanglement with State-Space Modeling and Deep Mixtures of Experts
Authors: Matthew J. Vowels, Necati Cihan Camgoz, Richard Bowden
Link: https://arxiv.org/abs/2103.07292
摘要:Disentangled representations support a range of downstream tasks including causal reasoning, generative modeling, and fair machine learning. Unfortunately, disentanglement has been shown to be impossible without the incorporation of supervision or inductive bias. Given that supervision is often expensive or infeasible to acquire, we choose to incorporate structural inductive bias and present an unsupervised, deep State-Space-Model for Video Disentanglement (VDSM). The model disentangles latent time-varying and dynamic factors via the incorporation of hierarchical structure with a dynamic prior and a Mixture of Experts decoder. VDSM learns separate disentangled representations for the identity of the object or person in the video, and for the action being performed. We evaluate VDSM across a range of qualitative and quantitative tasks including identity and dynamics transfer, sequence generation, Fréchet Inception Distance, and factor classification. VDSM provides state-of-the-art performance and exceeds adversarial methods, even when the methods use additional supervision.[2]:The Semi-Supervised iNaturalist-Aves Challenge at FGVC7 Workshop
Title: The Semi-Supervised iNaturalist-Aves Challenge at FGVC7 Workshop
Authors: Jong-Chyi Su, Subhransu Maji
Note: Tech report for the Semi-iNat 2020 challenge; please see this http URL
Link: https://arxiv.org/abs/2103.06937
摘要:This document describes the details and the motivation behind a new dataset we collected for the semi-supervised recognition challenge~\cite{semi-aves} at the FGVC7 workshop at CVPR 2020. The dataset contains 1000 species of birds sampled from the iNat-2018 dataset for a total of nearly 150k images. From this collection, we sample a subset of classes and their labels, while adding the images from the remaining classes to the unlabeled set of images. The presence of out-of-domain data (novel classes), high class-imbalance, and fine-grained similarity between classes poses significant challenges for existing semi-supervised recognition techniques in the literature. The dataset is available here: \url{this https URL}
Reinforcement Learning (1 paper)
[1]:Large Batch Simulation for Deep Reinforcement Learning
Title: Large Batch Simulation for Deep Reinforcement Learning
Authors: Brennan Shacklett, Erik Wijmans, Aleksei Petrenko, Manolis Savva, Dhruv Batra, Vladlen Koltun, Kayvon Fatahalian
Note: Published as a conference paper at ICLR 2021
Link: https://arxiv.org/abs/2103.07013
摘要:We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine. The key idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of "batch simulation": accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work at once, batch simulation allows implementations to amortize in-memory storage of scene assets, rendering work, data loading, and synchronization costs across many simulation requests, dramatically improving the number of simulated agents per GPU and overall simulation throughput. To balance DNN inference and training costs with faster simulation, we also build a computationally efficient policy DNN that maintains high task performance, and modify training algorithms to maintain sample efficiency when training with large mini-batches. By combining batch simulation and DNN performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. We provide open-source reference implementations of our batch 3D renderer and simulator to facilitate incorporation of these ideas into RL systems.
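The "batch simulation" principle itself is easy to illustrate. The toy snippet below only shows the interface idea, namely that one step call advances many environments at once so that rendering and asset costs can be amortized; it is a stand-in, not the paper's renderer or simulator API.

import numpy as np

class ToyBatchSimulator:
    def __init__(self, num_envs, obs_shape=(64, 64, 3)):
        self.num_envs = num_envs
        self.obs_shape = obs_shape
        self.states = np.zeros(num_envs)

    def step(self, actions):                         # actions: (num_envs,)
        self.states += actions                       # advance all simulations together
        obs = np.random.rand(self.num_envs, *self.obs_shape)  # batched "rendering" stub
        rewards = -np.abs(self.states)
        dones = np.abs(self.states) > 10.0
        return obs, rewards, dones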
Zero/One-Shot, Transfer Learning & Domain Adaptation (2 papers)
[1]:Searching by Generating: Flexible and Efficient One-Shot NAS with Architecture Generator
Title: Searching by Generating: Flexible and Efficient One-Shot NAS with Architecture Generator
Authors: Sian-Yao Huang, Wei-Ta Chu
Note: Accepted to CVPR 2021
Link: https://arxiv.org/abs/2103.07289
摘要:In one-shot NAS, sub-networks need to be searched from the supernet to meet different hardware constraints. However, the search cost is high and $N$ times of searches are needed for $N$ different constraints. In this work, we propose a novel search strategy called architecture generator to search sub-networks by generating them, so that the search process can be much more efficient and flexible. With the trained architecture generator, given target hardware constraints as the input, $N$ good architectures can be generated for $N$ constraints by just one forward pass without re-searching and supernet retraining. Moreover, we propose a novel single-path supernet, called unified supernet, to further improve search efficiency and reduce GPU memory consumption of the architecture generator. With the architecture generator and the unified supernet, we propose a flexible and efficient one-shot NAS framework, called Searching by Generating NAS (SGNAS). With the pre-trained supernt, the search time of SGNAS for $N$ different hardware constraints is only 5 GPU hours, which is $4N$ times faster than previous SOTA single-path methods. After training from scratch, the top1-accuracy of SGNAS on ImageNet is 77.1%, which is comparable with the SOTAs. The code is available at:this https URL.[2]:In the light of feature distributions: moment matching for Neural Style Transfer
Title: In the light of feature distributions: moment matching for Neural Style Transfer
Authors: Nikolai Kalischek, Jan Dirk Wegner, Konrad Schindler
Link: https://arxiv.org/abs/2103.07208
摘要:Style transfer aims to render the content of a given image in the graphical/artistic style of another image. The fundamental concept underlying NeuralStyle Transfer (NST) is to interpret style as a distribution in the feature space of a Convolutional Neural Network, such that a desired style can be achieved by matching its feature distribution. We show that most current implementations of that concept have important theoretical and practical limitations, as they only partially align the feature distributions. We propose a novel approach that matches the distributions more precisely, thus reproducing the desired style more faithfully, while still being computationally efficient. Specifically, we adapt the dual form of Central Moment Discrepancy (CMD), as recently proposed for domain adaptation, to minimize the difference between the target style and the feature distribution of the output image. The dual interpretation of this metric explicitly matches all higher-order centralized moments and is therefore a natural extension of existing NST methods that only take into account the first and second moments. Our experiments confirm that the strong theoretical properties also translate to visually better style transfer, and better disentangle style from semantic image content.
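Since the method builds on the Central Moment Discrepancy, a short sketch of the plain empirical CMD between two feature sets may help fix ideas; the paper adapts the dual form of this quantity, so the snippet below is for intuition only (K and the norm are generic choices).

import torch

def cmd(x, y, K=5):
    # x: (N, D), y: (M, D) feature samples. Match the means and the first K
    # central moments per channel (empirical Central Moment Discrepancy).
    mx, my = x.mean(dim=0), y.mean(dim=0)
    d = torch.norm(mx - my)
    cx, cy = x - mx, y - my
    for k in range(2, K + 1):
        d = d + torch.norm((cx ** k).mean(dim=0) - (cy ** k).mean(dim=0))
    return d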
Temporal Action Detection & Video (2 papers)
[1]:PatchNet -- Short-range Template Matching for Efficient Video Processing
Title: PatchNet -- Short-range Template Matching for Efficient Video Processing
Authors: Huizi Mao, Sibo Zhu, Song Han, William J. Dally
Link: https://arxiv.org/abs/2103.07371
摘要:Object recognition is a fundamental problem in many video processing tasks, accurately locating seen objects at low computation cost paves the way for on-device video recognition. We propose PatchNet, an efficient convolutional neural network to match objects in adjacent video frames. It learns the patchwise correlation features instead of pixel features. PatchNet is very compact, running at just 58MFLOPs, $5\times$ simpler than MobileNetV2. We demonstrate its application on two tasks, video object detection and visual object tracking. On ImageNet VID, PatchNet reduces the flops of R-FCN ResNet-101 by 5x and EfficientDet-D0 by 3.4x with less than 1% mAP loss. On OTB2015, PatchNet reduces SiamFC and SiamRPN by 2.5x with no accuracy loss. Experiments on Jetson Nano further demonstrate 2.8x to 4.3x speed-ups associated with flops reduction. Code is open sourced atthis https URL.[2]:Learning Long-Term Style-Preserving Blind Video Temporal Consistency
Title: Learning Long-Term Style-Preserving Blind Video Temporal Consistency
Authors: Hugo Thimonier, Julien Despois, Robin Kips, Matthieu Perrot
Link: https://arxiv.org/abs/2103.07278
摘要:When trying to independently apply image-trained algorithms to successive frames in videos, noxious flickering tends to appear. State-of-the-art post-processing techniques that aim at fostering temporal consistency, generate other temporal artifacts and visually alter the style of videos. We propose a postprocessing model, agnostic to the transformation applied to videos (e.g. style transfer, image manipulation using GANs, etc.), in the form of a recurrent neural network. Our model is trained using a Ping Pong procedure and its corresponding loss, recently introduced for GAN video generation, as well as a novel style preserving perceptual loss. The former improves long-term temporal consistency learning, while the latter fosters style preservation. We evaluate our model on the DAVIS andthis http URLdatasets and show that our approach offers state-of-the-art results concerning flicker removal, and better keeps the overall style of the videos than previous approaches.
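The Ping Pong procedure mentioned above can be summarized as a simple consistency penalty. The sketch below assumes a recurrent model that maps a (B, T, C, H, W) clip to per-frame outputs of the same layout; the exact weighting used in the paper may differ.

import torch

def ping_pong_loss(model, frames):
    # Run the model over the clip in forward order and in reverse order; outputs
    # for the same frame should agree, which penalises drift accumulated over time.
    out_fwd = model(frames)                                  # (B, T, C, H, W)
    out_bwd = model(frames.flip(dims=[1])).flip(dims=[1])    # reverse, then re-align
    return torch.mean((out_fwd - out_bwd) ** 2)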
Networks (5 papers)
[1]:PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss
Title: PLADE-Net: Towards Pixel-Level Accuracy for Self-Supervised Single-View Depth Estimation with Neural Positional Encoding and Distilled Matting Loss
Authors: Juan Luis Gonzalez Bello, Munchurl Kim
Note: Accepted (poster) at CVPR 2021
Link: https://arxiv.org/abs/2103.07362
摘要:In this paper, we propose a self-supervised single-view pixel-level accurate depth estimation network, called PLADE-Net. The PLADE-Net is the first work that shows unprecedented accuracy levels, exceeding 95\% in terms of the $\delta^1$ metric on the challenging KITTI dataset. Our PLADE-Net is based on a new network architecture with neural positional encoding and a novel loss function that borrows from the closed-form solution of the matting Laplacian to learn pixel-level accurate depth estimation from stereo images. Neural positional encoding allows our PLADE-Net to obtain more consistent depth estimates by letting the network reason about location-specific image properties such as lens and projection distortions. Our novel distilled matting Laplacian loss allows our network to predict sharp depths at object boundaries and more consistent depths in highly homogeneous regions. Our proposed method outperforms all previous self-supervised single-view depth estimation methods by a large margin on the challenging KITTI dataset, with unprecedented levels of accuracy. Furthermore, our PLADE-Net, naively extended for stereo inputs, outperforms the most recent self-supervised stereo methods, even without any advanced blocks like 1D correlations, 3D convolutions, or spatial pyramid pooling. We present extensive ablation studies and experiments that support our method's effectiveness on the KITTI, CityScapes, and Make3D datasets.[2]:Learnable Companding Quantization for Accurate Low-bit Neural Networks
Title: Learnable Companding Quantization for Accurate Low-bit Neural Networks
Authors: Kohei Yamamoto
Note: Accepted at CVPR 2021
Link: https://arxiv.org/abs/2103.07156
摘要:Quantizing deep neural networks is an effective method for reducing memory consumption and improving inference speed, and is thus useful for implementation in resource-constrained devices. However, it is still hard for extremely low-bit models to achieve accuracy comparable with that of full-precision models. To address this issue, we propose learnable companding quantization (LCQ) as a novel non-uniform quantization method for 2-, 3-, and 4-bit models. LCQ jointly optimizes model weights and learnable companding functions that can flexibly and non-uniformly control the quantization levels of weights and activations. We also present a new weight normalization technique that allows more stable training for quantization. Experimental results show that LCQ outperforms conventional state-of-the-art methods and narrows the gap between quantized and full-precision models for image classification and object detection tasks. Notably, the 2-bit ResNet-50 model on ImageNet achieves top-1 accuracy of 75.1% and reduces the gap to 1.7%, allowing LCQ to further exploit the potential of non-uniform quantization.[3]:Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
Title: Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
Authors: Ben Saunders, Necati Cihan Camgoz, Richard Bowden
Link: https://arxiv.org/abs/2103.06982
摘要:Sign languages are multi-channel visual languages, where signers use a continuous 3D space to communicate.Sign Language Production (SLP), the automatic translation from spoken to sign languages, must embody both the continuous articulation and full morphology of sign to be truly understandable by the Deaf community. Previous deep learning-based SLP works have produced only a concatenation of isolated signs focusing primarily on the manual features, leading to a robotic and non-expressive production.
In this work, we propose a novel Progressive Transformer architecture, the first SLP model to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner. Our transformer network architecture introduces a counter decoding that enables variable length continuous sequence generation by tracking the production progress over time and predicting the end of sequence. We present extensive data augmentation techniques to reduce prediction drift, alongside an adversarial training regime and a Mixture Density Network (MDN) formulation to produce realistic and expressive sign pose sequences.
We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging PHOENIX14T dataset and setting baselines for future research. We further provide a user evaluation of our SLP model, to understand the Deaf reception of our sign pose productions.[4]:CORSAIR: Convolutional Object Retrieval and Symmetry-AIded Registration
Title: CORSAIR: Convolutional Object Retrieval and Symmetry-AIded Registration
Authors: Tianyu Zhao, Qiaojun Feng, Sai Jadhav, Nikolay Atanasov
Note: 8 pages, 8 figures
Link: https://arxiv.org/abs/2103.06911
摘要:This paper considers online object-level mapping using partial point-cloud observations obtained online in an unknown environment. We develop and approach for fully Convolutional Object Retrieval and Symmetry-AIded Registration (CORSAIR). Our model extends the Fully Convolutional Geometric Features model to learn a global object-shape embedding in addition to local point-wise features from the point-cloud observations. The global feature is used to retrieve a similar object from a category database, and the local features are used for robust pose registration between the observed and the retrieved object. Our formulation also leverages symmetries, present in the object shapes, to obtain promising local-feature pairs from different symmetry classes for matching. We present results from synthetic and real-world datasets with different object categories to verify the robustness of our method.[5]:Interleaving Learning, with Application to Neural Architecture Search
Title: Interleaving Learning, with Application to Neural Architecture Search
Authors: Hao Ban, Pengtao Xie
Note: arXiv admin note: substantial text overlap with arXiv:2012.04863
Link: https://arxiv.org/abs/2103.07018
摘要:Interleaving learning is a human learning technique where a learner interleaves the studies of multiple topics, which increases long-term retention and improves ability to transfer learned knowledge. Inspired by the interleaving learning technique of humans, in this paper we explore whether this learning methodology is beneficial for improving the performance of machine learning models as well. We propose a novel machine learning framework referred to as interleaving learning (IL). In our framework, a set of models collaboratively learn a data encoder in an interleaving fashion: the encoder is trained by model 1 for a while, then passed to model 2 for further training, then model 3, and so on; after trained by all models, the encoder returns back to model 1 and is trained again, then moving to model 2, 3, etc. This process repeats for multiple rounds. Our framework is based on multi-level optimization consisting of multiple inter-connected learning stages. An efficient gradient-based algorithm is developed to solve the multi-level optimization problem. We apply interleaving learning to search neural architectures for image classification on CIFAR-10, CIFAR-100, and ImageNet. The effectiveness of our method is strongly demonstrated by the experimental results.
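The interleaving schedule is straightforward to sketch. The snippet below assumes generic classification heads, data loaders and a plain cross-entropy loss; it only illustrates the encoder hand-off described in the abstract, not the paper's multi-level optimization algorithm.

import torch
import torch.nn.functional as F

def interleaving_learning(encoder, heads, loaders, rounds=3, steps=100, lr=1e-3):
    # One shared encoder is trained with head 1 for a while, handed to head 2,
    # and so on; the whole cycle repeats for several rounds.
    for _ in range(rounds):
        for head, loader in zip(heads, loaders):
            opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=lr)
            for step, (x, y) in enumerate(loader):
                loss = F.cross_entropy(head(encoder(x)), y)
                opt.zero_grad()
                loss.backward()
                opt.step()
                if step + 1 >= steps:
                    break
    return encoder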
GANs & Image/Text Generation (3 papers)
[1]:Advanced Multiple Linear Regression Based Dark Channel Prior Applied on Dehazing Image and Generating Synthetic Haze
Title: Advanced Multiple Linear Regression Based Dark Channel Prior Applied on Dehazing Image and Generating Synthetic Haze
Authors: Binghan Li, Yindong Hua, Mi Lu
Link: https://arxiv.org/abs/2103.07065
摘要:Haze removal is an extremely challenging task, and object detection in the hazy environment has recently gained much attention due to the popularity of autonomous driving and traffic surveillance. In this work, the authors propose a multiple linear regression haze removal model based on a widely adopted dehazing algorithm named Dark Channel Prior. Training this model with a synthetic hazy dataset, the proposed model can reduce the unanticipated deviations generated from the rough estimations of transmission map and atmospheric light in Dark Channel Prior. To increase object detection accuracy in the hazy environment, the authors further present an algorithm to build a synthetic hazy COCO training dataset by generating the artificial haze to the MS COCO training dataset. The experimental results demonstrate that the proposed model obtains higher image quality and shares more similarity with ground truth images than most conventional pixel-based dehazing algorithms and neural network based haze-removal models. The authors also evaluate the mean average precision of Mask R-CNN when training the network with synthetic hazy COCO training dataset and preprocessing test hazy dataset by removing the haze with the proposed dehazing model. It turns out that both approaches can increase the object detection accuracy significantly and outperform most existing object detection models over hazy images.[2]:HumanGAN: A Generative Model of Humans Images
Title: HumanGAN: A Generative Model of Human Images
Authors: Kripasindhu Sarkar, Lingjie Liu, Vladislav Golyanik, Christian Theobalt
Note: Accepted at CVPR 2021
Link: https://arxiv.org/abs/2103.06902
摘要:Generative adversarial networks achieve great performance in photorealistic image synthesis in various domains, including human images. However, they usually employ latent vectors that encode the sampled outputs globally. This does not allow convenient control of semantically-relevant individual parts of the image, and is not able to draw samples that only differ in partial aspects, such as clothing style. We address these limitations and present a generative model for images of dressed humans offering control over pose, local body part appearance and garment style. This is the first method to solve various aspects of human image generation such as global appearance sampling, pose transfer, parts and garment transfer, and parts sampling jointly in a unified framework. As our model encodes part-based latent appearance vectors in a normalized pose-independent space and warps them to different poses, it preserves body and clothing appearance under varying posture. Experiments show that our flexible and general generative method outperforms task-specific baselines for pose-conditioned image generation, pose transfer and part sampling in terms of realism and output resolution.[3]:Game-theoretic Understanding of Adversarially Learned Features
Title: Game-theoretic Understanding of Adversarially Learned Features
Authors: Jie Ren, Die Zhang, Yisen Wang, Lu Chen, Zhanpeng Zhou, Xu Cheng, Xin Wang, Yiting Chen, Jie Shi, Quanshi Zhang
Link: https://arxiv.org/abs/2103.07364
摘要:This paper aims to understand adversarial attacks and defense from a new perspecitve, i.e., the signal-processing behavior of DNNs. We novelly define the multi-order interaction in game theory, which satisfies six properties. With the multi-order interaction, we discover that adversarial attacks mainly affect high-order interactions to fool the DNN. Furthermore, we find that the robustness of adversarially trained DNNs comes from category-specific low-order interactions. Our findings provide more insights into and make a revision of previous understanding for the shape bias of adversarially learned features. Besides, the multi-order interaction can also explain the recoverability of adversarial examples.
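For reference, the multi-order interaction used in this line of work is usually defined as follows (our rendering; the paper's notation may differ slightly):

\[
I^{(m)}(i,j) \;=\; \mathbb{E}_{S \subseteq N\setminus\{i,j\},\ |S| = m}\big[\Delta f(i,j,S)\big],
\qquad
\Delta f(i,j,S) \;=\; f(S\cup\{i,j\}) - f(S\cup\{i\}) - f(S\cup\{j\}) + f(S),
\]

where f(S) is the network output when only the input variables in S are kept and the rest are masked, and the order m is the size of the context S; low orders measure interactions under small contexts, high orders under large contexts.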
Point Clouds & 3D Reconstruction (2 papers)
[1]:Urban Surface Reconstruction in SAR Tomography by Graph-Cuts
Title: Urban Surface Reconstruction in SAR Tomography by Graph-Cuts
Authors: Clément Rambour, Loïc Denis, Florence Tupin, Hélène Oriot, Yue Huang, Laurent Ferro-Famil
Link: https://arxiv.org/abs/2103.07202
摘要:SAR (Synthetic Aperture Radar) tomography reconstructs 3-D volumes from stacks of SAR images. High-resolution satellites such as TerraSAR-X provide images that can be combined to produce 3-D models. In urban areas, sparsity priors are generally enforced during the tomographic inversion process in order to retrieve the location of scatterers seen within a given radar resolution cell. However, such priors often miss parts of the urban surfaces. Those missing parts are typically regions of flat areas such as ground or rooftops. This paper introduces a surface segmentation algorithm based on the computation of the optimal cut in a flow network. This segmentation process can be included within the 3-D reconstruction framework in order to improve the recovery of urban surfaces. Illustrations on a TerraSAR-X tomographic dataset demonstrate the potential of the approach to produce a 3-D model of urban surfaces such as ground, façades and rooftops.[2]:An Efficient Hypergraph Approach to Robust Point Cloud Resampling
Title: An Efficient Hypergraph Approach to Robust Point Cloud Resampling
Authors: Qinwen Deng, Songyang Zhang, Zhi Ding
Link: https://arxiv.org/abs/2103.06999
摘要:Efficient processing and feature extraction of largescale point clouds are important in related computer vision and cyber-physical systems. This work investigates point cloud resampling based on hypergraph signal processing (HGSP) to better explore the underlying relationship among different cloud points and to extract contour-enhanced features. Specifically, we design hypergraph spectral filters to capture multi-lateral interactions among the signal nodes of point clouds and to better preserve their surface outlines. Without the need and the computation to first construct the underlying hypergraph, our low complexity approach directly estimates hypergraph spectrum of point clouds by leveraging hypergraph stationary processes from the observed 3D coordinates. Evaluating the proposed resampling methods with several metrics, our test results validate the high efficacy of hypergraph characterization of point clouds and demonstrate the robustness of hypergraph-based resampling under noisy observations.
Autonomous Driving, SLAM & Stereo Vision (1 paper)
[1]:PVStereo: Pyramid Voting Module for End-to-End Self-Supervised Stereo Matching
Title: PVStereo: Pyramid Voting Module for End-to-End Self-Supervised Stereo Matching
Authors: Hengli Wang, Rui Fan, Peide Cai, Ming Liu
Note: 8 pages, 8 figures and 2 tables. This paper is accepted by IEEE RA-L with ICRA 2021
Link: https://arxiv.org/abs/2103.07094
摘要:Supervised learning with deep convolutional neural networks (DCNNs) has seen huge adoption in stereo matching. However, the acquisition of large-scale datasets with well-labeled ground truth is cumbersome and labor-intensive, making supervised learning-based approaches often hard to implement in practice. To overcome this drawback, we propose a robust and effective self-supervised stereo matching approach, consisting of a pyramid voting module (PVM) and a novel DCNN architecture, referred to as OptStereo. Specifically, our OptStereo first builds multi-scale cost volumes, and then adopts a recurrent unit to iteratively update disparity estimations at high resolution; while our PVM can generate reliable semi-dense disparity images, which can be employed to supervise OptStereo training. Furthermore, we publish the HKUST-Drive dataset, a large-scale synthetic stereo dataset, collected under different illumination and weather conditions for research purposes. Extensive experimental results demonstrate the effectiveness and efficiency of our self-supervised stereo matching approach on the KITTI Stereo benchmarks and our HKUST-Drive dataset. PVStereo, our best-performing implementation, greatly outperforms all other state-of-the-art self-supervised stereo matching approaches. Our project page is available atthis http URL.
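The abstract does not spell out how the pyramid voting module forms its semi-dense pseudo-labels, so the snippet below is only one plausible reading (a cross-scale consistency vote); every detail here is our assumption rather than the paper's PVM.

import torch
import torch.nn.functional as F

def pyramid_vote(disp_pyramid, tol=1.0):
    # disp_pyramid: list of (B, 1, h, w) disparity maps, finest scale first.
    # Upsample all scales to full resolution (rescaling disparity values),
    # keep pixels where the scales agree within `tol`, and average them.
    full_hw = disp_pyramid[0].shape[-2:]
    ups = [F.interpolate(d, size=full_hw, mode='bilinear', align_corners=False)
           * (full_hw[-1] / d.shape[-1]) for d in disp_pyramid]
    stack = torch.cat(ups, dim=1)                          # (B, S, H, W)
    mask = (stack.max(dim=1).values - stack.min(dim=1).values) < tol
    pseudo = stack.mean(dim=1)                             # semi-dense pseudo-label
    return pseudo, mask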
Medical & Biological Applications (2 papers)
[1]:Radiomic Deformation and Textural Heterogeneity (R-DepTH) Descriptor to characterize Tumor Field Effect: Application to Survival Prediction in Glioblastoma
Title: Radiomic Deformation and Textural Heterogeneity (R-DepTH) Descriptor to characterize Tumor Field Effect: Application to Survival Prediction in Glioblastoma
Authors: Marwa Ismail, Prateek Prasanna, Kaustav Bera, Volodymyr Statsevych, Virginia Hill, Gagandeep Singh, Sasan Partovi, Niha Beig, Sean McGarry, Peter Laviolette, Manmeet Ahluwalia, Anant Madabhushi, Pallavi Tiwari
Link: https://arxiv.org/abs/2103.07423
摘要:The concept of tumor field effect implies that cancer is a systemic disease with its impact way beyond the visible tumor confines. For instance, in Glioblastoma (GBM), an aggressive brain tumor, the increase in intracranial pressure due to tumor burden often leads to brain herniation and poor outcomes. Our work is based on the rationale that highly aggressive tumors tend to grow uncontrollably, leading to pronounced biomechanical tissue deformations in the normal parenchyma, which when combined with local morphological differences in the tumor confines on MRI scans, will comprehensively capture tumor field effect. Specifically, we present an integrated MRI-based descriptor, radiomic-Deformation and Textural Heterogeneity (r-DepTH). This descriptor comprises measurements of the subtle perturbations in tissue deformations throughout the surrounding normal parenchyma due to mass effect. This involves non-rigidly aligning the patients MRI scans to a healthy atlas via diffeomorphic registration. The resulting inverse mapping is used to obtain the deformation field magnitudes in the normal parenchyma. These measurements are then combined with a 3D texture descriptor, Co-occurrence of Local Anisotropic Gradient Orientations (COLLAGE), which captures the morphological heterogeneity within the tumor confines, on MRI scans. R-DepTH, on N = 207 GBM cases (training set (St) = 128, testing set (Sv) = 79), demonstrated improved prognosis of overall survival by categorizing patients into low- (prolonged survival) and high-risk (poor survival) groups (on St, p-value = 0.0000035, and on Sv, p-value = 0.0024). R-DepTH descriptor may serve as a comprehensive MRI-based prognostic marker of disease aggressiveness and survival in solid tumors.[2]:Longitudinal Quantitative Assessment of COVID-19 Infection Progression from Chest CTs
Title: Longitudinal Quantitative Assessment of COVID-19 Infection Progression from Chest CTs
Authors: Seong Tae Kim, Leili Goli, Magdalini Paschali, Ashkan Khakzar, Matthias Keicher, Tobias Czempiel, Egon Burian, Rickmer Braren, Nassir Navab, Thomas Wendler
Note: 10 pages, 2 figures
Link: https://arxiv.org/abs/2103.07240
摘要:Chest computed tomography (CT) has played an essential diagnostic role in assessing patients with COVID-19 by showing disease-specific image features such as ground-glass opacity and consolidation. Image segmentation methods have proven to help quantify the disease burden and even help predict the outcome. The availability of longitudinal CT series may also result in an efficient and effective method to reliably assess the progression of COVID-19, monitor the healing process and the response to different therapeutic strategies. In this paper, we propose a new framework to identify infection at a voxel level (identification of healthy lung, consolidation, and ground-glass opacity) and visualize the progression of COVID-19 using sequential low-dose non-contrast CT scans. In particular, we devise a longitudinal segmentation network that utilizes the reference scan information to improve the performance of disease identification. Experimental results on a clinical longitudinal dataset collected in our institution show the effectiveness of the proposed method compared to the static deep neural networks for disease quantification.
Others (16 papers)
[1]:Real-time Nonrigid Mosaicking of Laparoscopy Images
Title: Real-time Nonrigid Mosaicking of Laparoscopy Images
Authors: Haoyin Zhou, Jagadeesan Jayender
Note: Published in IEEE Transactions on Medical Imaging
Link: https://arxiv.org/abs/2103.07414
摘要:The ability to extend the field of view of laparoscopy images can help the surgeons to obtain a better understanding of the anatomical context. However, due to tissue deformation, complex camera motion and significant three-dimensional (3D) anatomical surface, image pixels may have non-rigid deformation and traditional mosaicking methods cannot work robustly for laparoscopy images in real-time. To solve this problem, a novel two-dimensional (2D) non-rigid simultaneous localization and mapping (SLAM) system is proposed in this paper, which is able to compensate for the deformation of pixels and perform image mosaicking in real-time. The key algorithm of this 2D non-rigid SLAM system is the expectation maximization and dual quaternion (EMDQ) algorithm, which can generate smooth and dense deformation field from sparse and noisy image feature matches in real-time. An uncertainty-based loop closing method has been proposed to reduce the accumulative errors. To achieve real-time performance, both CPU and GPU parallel computation technologies are used for dense mosaicking of all pixels. Experimental results on \textit{in vivo} and synthetic data demonstrate the feasibility and accuracy of our non-rigid mosaicking method.[2]:Information Maximization Clustering via Multi-View Self-Labelling
Title: Information Maximization Clustering via Multi-View Self-Labelling
Authors: Foivos Ntelemis, Yaochu Jin, Spencer A. Thomas
Link: https://arxiv.org/abs/2103.07368
摘要:Image clustering is a particularly challenging computer vision task, which aims to generate annotations without human supervision. Recent advances focus on the use of self-supervised learning strategies in image clustering, by first learning valuable semantics and then clustering the image representations. These multiple-phase algorithms, however, increase the computational time and their final performance is reliant on the first stage. By extending the self-supervised approach, we propose a novel single-phase clustering method that simultaneously learns meaningful representations and assigns the corresponding annotations. This is achieved by integrating a discrete representation into the self-supervised paradigm through a classifier net. Specifically, the proposed clustering objective employs mutual information, and maximizes the dependency between the integrated discrete representation and a discrete probability distribution. The discrete probability distribution is derived though the self-supervised process by comparing the learnt latent representation with a set of trainable prototypes. To enhance the learning performance of the classifier, we jointly apply the mutual information across multi-crop views. Our empirical results show that the proposed framework outperforms state-of-the-art techniques with the average accuracy of 89.1% and 49.0%, respectively, on CIFAR-10 and CIFAR-100/20 datasets. Finally, the proposed method also demonstrates attractive robustness to parameter settings, making it ready to be applicable to other datasets.[3]:UIEC^2-Net: CNN-based Underwater Image Enhancement Using Two Color Space
Title: UIEC^2-Net: CNN-based Underwater Image Enhancement Using Two Color Space
Authors: Yudong Wang, Jichang Guo, Huan Gao, Huihui Yue
Note: 11 pages, 11 figures
Link: https://arxiv.org/abs/2103.07138
摘要:Underwater image enhancement has attracted much attention due to the rise of marine resource development in recent years. Benefit from the powerful representation capabilities of Convolution Neural Networks(CNNs), multiple underwater image enhancement algorithms based on CNNs have been proposed in the last few years. However, almost all of these algorithms employ RGB color space setting, which is insensitive to image properties such as luminance and saturation. To address this problem, we proposed Underwater Image Enhancement Convolution Neural Network using 2 Color Space (UICE^2-Net) that efficiently and effectively integrate both RGB Color Space and HSV Color Space in one single CNN. To our best knowledge, this method is the first to use HSV color space for underwater image enhancement based on deep learning. UIEC^2-Net is an end-to-end trainable network, consisting of three blocks as follow: a RGB pixel-level block implements fundamental operations such as denoising and removing color cast, a HSV global-adjust block for globally adjusting underwater image luminance, color and saturation by adopting a novel neural curve layer, and an attention map block for combining the advantages of RGB and HSV block output images by distributing weight to each pixel. Experimental results on synthetic and real-world underwater images show the good performance of our proposed method in both subjective comparisons and objective metrics.[4]:iToF2dToF: A Robust and Flexible Representation for Data-Driven Time-of-Flight Imaging
[4]:iToF2dToF: A Robust and Flexible Representation for Data-Driven Time-of-Flight Imaging
Authors: Felipe Gutierrez-Barragan, Huaijin Chen, Mohit Gupta, Andreas Velten, Jinwei Gu
Comments: 32 pages
Link: https://arxiv.org/abs/2103.07087
Abstract: Indirect Time-of-Flight (iToF) cameras are a promising depth sensing technology. However, they are prone to errors caused by multi-path interference (MPI) and low signal-to-noise ratio (SNR). Traditional methods, after denoising, mitigate MPI by estimating a transient image that encodes depths. Recently, data-driven methods that jointly denoise and mitigate MPI have become state-of-the-art without using the intermediate transient representation. In this paper, we propose to revisit the transient representation. Using data-driven priors, we interpolate/extrapolate iToF frequencies and use them to estimate the transient image. Since direct ToF (dToF) sensors capture transient images, we name our method iToF2dToF. The transient representation is flexible: it can be integrated with different rule-based depth sensing algorithms that are robust to low SNR and can deal with ambiguous scenarios that arise in practice (e.g., specular MPI, optical cross-talk). We demonstrate the benefits of iToF2dToF over previous methods in real depth sensing scenarios.
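The transient representation is the bridge between the two modalities: each iToF correlation measurement is, to first order, a cosine-weighted sum of the transient, whereas a dToF-style depth can be read from the transient's peak. The sketch below illustrates that relationship on synthetic data; the bin width, modulation frequency, and simple peak read-out are assumptions, not the paper's pipeline:

    import numpy as np

    C = 3e8                       # speed of light (m/s)
    BIN = 100e-12                 # transient time-bin width: 100 ps (assumed)
    t = np.arange(1024) * BIN     # time axis of the transient

    # Synthetic transient: a direct return at 5 m plus a weaker multi-path return at 6.5 m.
    transient = np.zeros_like(t)
    transient[int(2 * 5.0 / C / BIN)] = 1.0
    transient[int(2 * 6.5 / C / BIN)] = 0.4

    def itof_measurement(transient, t, freq, phase):
        """Correlation measurement of an iToF camera at a given modulation
        frequency and demodulation phase (idealized, noise-free model)."""
        return np.sum(transient * np.cos(2 * np.pi * freq * t - phase))

    # Four-phase measurements at 20 MHz, as a typical iToF camera would record.
    freq = 20e6
    m = [itof_measurement(transient, t, freq, p) for p in (0, np.pi/2, np.pi, 3*np.pi/2)]

    # Phase-based iToF depth (biased by the multi-path return) ...
    phi = np.arctan2(m[1] - m[3], m[0] - m[2])
    itof_depth = C * (phi % (2 * np.pi)) / (4 * np.pi * freq)
    # ... versus a dToF-style depth read directly from the transient peak.
    dtof_depth = C * t[np.argmax(transient)] / 2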
[5]:Dual Attention-in-Attention Model for Joint Rain Streak and Raindrop Removal
Authors: Kaihao Zhang, Dongxu Li, Wenhan Luo, Wenqi Ren, Lin Ma, Hongdong Li
Link: https://arxiv.org/abs/2103.07051
Abstract: Rain streaks and raindrops are two natural phenomena that degrade image capture in different ways. Currently, most existing deep deraining networks treat them as two distinct problems and address only one of them, and thus cannot deal adequately with both simultaneously. To address this, we propose a Dual Attention-in-Attention Model (DAiAM) which includes two DAMs for removing both rain streaks and raindrops. Inside the DAM, there are two attention maps, attending to heavy and light rainy regions respectively, to guide the deraining process differently in the applicable regions. In addition, to further refine the result, a Differential-driven Dual Attention-in-Attention Model (D-DAiAM) is proposed with a "heavy-to-light" scheme that removes rain by revisiting poorly derained regions. Extensive experiments on one public raindrop dataset, one public rain streak dataset, and our synthesized joint rain streak and raindrop (JRSRD) dataset demonstrate that the proposed method is not only capable of removing rain streaks and raindrops simultaneously, but also achieves state-of-the-art performance on both tasks.
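One plausible reading of the two attention maps is a module that predicts a heavy-rain and a light-rain map and uses them to modulate separate processing branches; the hedged sketch below follows that reading (layer sizes and wiring are assumptions, not the paper's DAM):

    import torch
    import torch.nn as nn

    class DualAttention(nn.Module):
        """Predict heavy-rain and light-rain attention maps and use them to
        modulate features (illustrative reading, not the paper's exact design)."""
        def __init__(self, channels=64):
            super().__init__()
            self.to_maps = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, 2, 3, padding=1),     # two maps: heavy, light
                nn.Sigmoid(),
            )
            self.heavy_branch = nn.Conv2d(channels, channels, 3, padding=1)
            self.light_branch = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, feats):
            maps = self.to_maps(feats)                    # (B, 2, H, W)
            heavy, light = maps[:, 0:1], maps[:, 1:2]
            return heavy * self.heavy_branch(feats) + light * self.light_branch(feats)

    # Toy usage on a 64-channel feature map.
    dam = DualAttention()
    out = dam(torch.rand(2, 64, 48, 48))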
[6]:The Location of Optimal Object Colors with More Than Two Transitions
Authors: Scott A. Burns
Link: https://arxiv.org/abs/2103.06997
Abstract: The chromaticity diagram associated with the CIE 1931 color matching functions is shown to be slightly non-convex. While having no impact on practical colorimetric computations, the non-convexity does have a significant impact on the shape of some optimal object color reflectance distributions associated with the outer surface of the object color solid. Instead of the usual two-transition Schrödinger form, many optimal colors exhibit higher transition counts. A linear programming formulation is developed and used to locate where these higher-transition optimal object colors reside on the object color solid surface.
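A natural linear program for reaching the boundary of the object color solid maximizes the projection of the tristimulus response onto a chosen direction subject to the reflectance lying in [0, 1]; the optimum is a 0/1 reflectance whose transitions occur where the objective coefficients change sign. A hedged sketch with stand-in color matching functions (Gaussian bumps, not the CIE 1931 data):

    import numpy as np
    from scipy.optimize import linprog

    wl = np.arange(380, 781, 5, dtype=float)              # wavelength samples (nm)

    # Stand-in "color matching functions": three Gaussian bumps (NOT the CIE data).
    def bump(center, width):
        return np.exp(-0.5 * ((wl - center) / width) ** 2)
    T = np.vstack([bump(600, 40), bump(550, 40), bump(450, 30)])   # (3, n)

    direction = np.array([0.2, 0.9, -0.4])                # chosen direction on the solid surface
    c = T.T @ direction                                   # objective coefficients, (n,)

    # Maximize c @ r subject to 0 <= r <= 1 (linprog minimizes, hence the sign flip).
    res = linprog(-c, bounds=[(0.0, 1.0)] * len(wl), method="highs")
    r_opt = res.x

    # The optimal reflectance is (numerically) 0/1; count its transitions.
    transitions = int(np.sum(np.abs(np.diff(np.round(r_opt))) > 0))

With the actual CIE 1931 matching functions, the coefficient vector changes sign more than twice for some directions, which is where the higher-transition optimal colors discussed in the abstract appear.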
[7]:Efficient Pairwise Neuroimage Analysis using the Soft Jaccard Index and 3D Keypoint Sets
Authors: Laurent Chauvin, Kuldeep Kumar, Christian Desrosiers, William Wells III, Matthew Toews
Comments: 10 pages, 7 figures
Link: https://arxiv.org/abs/2103.06966
Abstract: We propose a novel pairwise distance measure between variable-sized sets of image keypoints for the purpose of large-scale medical image indexing. Our measure generalizes the Jaccard distance to account for soft set equivalence (SSE) between set elements, via an adaptive kernel framework accounting for uncertainty in keypoint appearance and geometry. Novel kernels are proposed to quantify the variability of keypoint geometry in location and scale. Our distance measure may be estimated between N^2 image pairs in O(N log N) operations via keypoint indexing. Experiments validate our method in predicting 509,545 pairwise relationships from T1-weighted MRI brain volumes of monozygotic and dizygotic twins, siblings, and half-siblings sharing 100%-25% of their polymorphic genes. Soft set equivalence and keypoint geometry kernels outperform standard hard set equivalence (HSE) in predicting family relationships. High accuracy is achieved, with monozygotic twin identification near 100%, and several cases with unknown family labels, due to errors in the genotyping process, are correctly paired with their family members. Software is provided for efficient fine-grained curation of large, generic image datasets.
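A minimal sketch of a soft Jaccard distance between two keypoint sets, using Gaussian kernels on keypoint location and log-scale as stand-ins for the paper's appearance and geometry kernels (kernel choices and bandwidths are assumptions):

    import numpy as np

    def soft_jaccard_distance(A, B, sigma_xy=5.0, sigma_scale=0.3):
        """A, B: (n, 3) / (m, 3) arrays of keypoints as (x, y, scale).
        Soft set equivalence: each keypoint in A is matched softly to its best
        counterpart in B (and vice versa) through a Gaussian kernel."""
        dx = A[:, None, :2] - B[None, :, :2]
        d_xy = (dx ** 2).sum(-1) / (2 * sigma_xy ** 2)
        d_sc = (np.log(A[:, None, 2] / B[None, :, 2]) ** 2) / (2 * sigma_scale ** 2)
        K = np.exp(-(d_xy + d_sc))                        # (n, m) soft equivalences in [0, 1]

        inter = 0.5 * (K.max(axis=1).sum() + K.max(axis=0).sum())   # soft intersection size
        union = len(A) + len(B) - inter                              # soft union size
        return 1.0 - inter / union

    # Toy usage: two slightly perturbed keypoint sets yield a small distance.
    rng = np.random.default_rng(0)
    A = np.column_stack([rng.uniform(0, 100, 50), rng.uniform(0, 100, 50), rng.uniform(1, 4, 50)])
    B = A + rng.normal(0, 1.0, A.shape) * np.array([1, 1, 0.05])
    print(soft_jaccard_distance(A, B))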
[8]:DefakeHop: A Light-Weight High-Performance Deepfake Detector
Authors: Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, C.-C. Jay Kuo
Comments: Accepted at ICME 2021
Link: https://arxiv.org/abs/2103.06929
Abstract: A lightweight, high-performance Deepfake detection method, called DefakeHop, is proposed in this work. State-of-the-art Deepfake detection methods are built upon deep neural networks. DefakeHop extracts features automatically from various parts of face images using the successive subspace learning (SSL) principle. The features are extracted by the channel-wise (c/w) Saab transform and further processed by our feature distillation module, which uses spatial dimension reduction and soft classification for each channel to obtain a more concise description of the face. Extensive experiments are conducted to demonstrate the effectiveness of the proposed DefakeHop method. With a small model size of 42,845 parameters, DefakeHop achieves state-of-the-art performance with an area under the ROC curve (AUC) of 100%, 94.95%, and 90.56% on the UADFV, Celeb-DF v1, and Celeb-DF v2 datasets, respectively.
[9]:Multiview Sensing With Unknown Permutations: An Optimal Transport Approach
Authors: Yanting Ma, Petros T. Boufounos, Hassan Mansour, Shuchin Aeron
Comments: 5 pages, 3 figures, ICASSP 2021
Link: https://arxiv.org/abs/2103.07458
Abstract: In several applications, including imaging of deformable objects in motion, simultaneous localization and mapping, and unlabeled sensing, we encounter the problem of recovering a signal that is measured subject to unknown permutations. In this paper, we take a fresh look at this problem through the lens of optimal transport (OT). In particular, we recognize that in most practical applications the unknown permutations are not arbitrary: some are more likely to occur than others. We exploit this by introducing a regularization function that promotes the more likely permutations in the solution. We show that, even though the general problem is not convex, an appropriate relaxation of the resulting regularized problem allows us to exploit the well-developed machinery of OT and to develop a tractable algorithm.
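A hedged sketch of the kind of relaxation described: instead of searching over permutations, solve an entropic optimal transport problem whose cost combines data mismatch with a penalty on unlikely (here, non-local) re-orderings. The quadratic locality penalty is an assumed stand-in for the paper's regularizer, and plain Sinkhorn iterations are used:

    import numpy as np

    def sinkhorn(C, eps=0.1, n_iter=300):
        """Entropic OT between two uniform n-point distributions with cost C."""
        n = C.shape[0]
        a = b = np.full(n, 1.0 / n)                       # uniform marginals
        K = np.exp(-C / eps)
        v = np.ones(n)
        for _ in range(n_iter):
            u = a / (K @ v)
            v = b / (K.T @ u)
        return u[:, None] * K * v[None, :]                # transport plan, (n, n)

    n = 8
    rng = np.random.default_rng(1)
    x = np.sort(rng.normal(size=n))                       # true (ordered) signal
    perm = np.arange(n)                                   # a "likely" permutation: local swaps
    perm[[2, 3]] = perm[[3, 2]]
    perm[[5, 6]] = perm[[6, 5]]
    y = x[perm]                                           # observed under the unknown permutation

    # Cost: data mismatch plus a penalty that discourages far-away re-orderings
    # (the "more likely permutations are local" prior; the weight 0.1 is assumed).
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    C = (x[:, None] - y[None, :]) ** 2 + 0.1 * ((i - j) / n) ** 2
    P = sinkhorn(C)
    recovered_matching = P.argmax(axis=1)                 # approximate permutation estimate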
[10]:Self-Feature Regularization: Self-Feature Distillation Without Teacher Models
Authors: Wenxuan Fan, Zhenyan Hou
Link: https://arxiv.org/abs/2103.07350
Abstract: Knowledge distillation is the process of transferring knowledge from a large model to a small model. In this process, the small model learns the generalization ability of the large model and retains performance close to that of the large model. Knowledge distillation thus provides a way to transfer knowledge between models, facilitating model deployment and speeding up inference. However, previous distillation methods require pre-trained teacher models, which still bring computational and storage overheads. In this paper, a novel general training framework called Self-Feature Regularization (SFR) is proposed, which uses features in the deep layers to supervise feature learning in the shallow layers so that more semantic information is retained. Specifically, we first use an EMD-l2 loss to match local features and a many-to-one approach to distill features more intensively along the channel dimension. Dynamic label smoothing is then used in the output layer to achieve better performance. Experiments further show the effectiveness of our proposed framework.
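A rough sketch of the teacher-free idea: project a shallow feature map to the deep feature's shape, penalize their difference with the deep feature detached as the target, and train the output with label smoothing. The 1x1 projection, the plain l2 loss (in place of the paper's EMD-l2), and the smoothing value are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfFeatureRegularizer(nn.Module):
        """Shallow features are supervised by (detached) deep features of the same
        network; illustrative stand-in for Self-Feature Regularization."""
        def __init__(self, shallow_ch=64, deep_ch=256):
            super().__init__()
            self.project = nn.Conv2d(shallow_ch, deep_ch, kernel_size=1)

        def forward(self, shallow_feat, deep_feat):
            target = deep_feat.detach()                   # deep layer acts as the "teacher"
            pred = self.project(shallow_feat)
            pred = F.adaptive_avg_pool2d(pred, target.shape[-2:])
            return F.mse_loss(pred, target)

    # Toy usage: shallow (64ch, 32x32) supervised by deep (256ch, 8x8) features,
    # combined with a label-smoothed classification loss on the logits.
    sfr = SelfFeatureRegularizer()
    shallow = torch.rand(4, 64, 32, 32)
    deep = torch.rand(4, 256, 8, 8)
    logits = torch.randn(4, 10, requires_grad=True)
    labels = torch.randint(0, 10, (4,))
    loss = F.cross_entropy(logits, labels, label_smoothing=0.1) + 0.5 * sfr(shallow, deep)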
[11]:Patient-specific virtual spine straightening and vertebra inpainting: An automatic framework for osteoplasty planning
Authors: Christina Bukas, Bailiang Jian, Luis F. Rodriguez Venegas, Francesca De Benetti, Sebastian Ruehling, Anjany Sekubojina, Jens Gempt, Jan S. Kirschke, Marie Piraud, Johannes Oberreuter, Nassir Navab, Thomas Wendler
Comments: 7 pages, 5 figures, 5 tables
Link: https://arxiv.org/abs/2103.07279
Abstract: Symptomatic spinal vertebral compression fractures (VCFs) often require osteoplasty treatment. A cement-like material is injected into the bone to stabilize the fracture, restore the vertebral body height, and alleviate pain. Leakage is a common complication and may occur when too much cement is injected. In this work, we propose an automated patient-specific framework that allows physicians to calculate an upper bound on the amount of cement for injection and to estimate the optimal outcome of osteoplasty. The framework uses the patient's CT scan and the fractured vertebra label to build a virtual healthy spine using a high-level approach. First, the fractured spine is segmented with a three-step convolutional neural network (CNN) architecture. Next, a per-vertebra rigid registration to a healthy spine atlas restores its curvature. Finally, a GAN-based inpainting approach replaces the fractured vertebra with an estimate of its original shape. Based on this outcome, we then estimate the maximum amount of bone cement for injection. We evaluate our framework by comparing the virtual vertebra volumes of ten patients to their healthy equivalents and report an average error of 3.88±7.63%. The presented pipeline offers a first approach to a personalized automatic high-level framework for planning osteoplasty procedures.
[12]:Robust and generalizable embryo selection based on artificial intelligence and time-lapse image sequences
Authors: Jørgen Berntsen, Jens Rimestad, Jacob Theilgaard Lassen, Dang Tran, Mikkel Fly Kragh
Link: https://arxiv.org/abs/2103.07262
Abstract: Assessing and selecting the most viable embryos for transfer is an essential part of in vitro fertilization (IVF). In recent years, several attempts have been made to improve and automate the procedure using artificial intelligence (AI) and deep learning. Based on images of embryos with known implantation data (KID), AI models have been trained to automatically score embryos according to their chance of achieving a successful implantation. However, as of now, only limited research has been conducted to evaluate how embryo selection models generalize to new clinics and how they perform in subgroup analyses across various conditions. In this paper, we investigate how a deep learning-based embryo selection model using only time-lapse image sequences performs across different patient ages and clinical conditions, and how it correlates with traditional morphokinetic parameters. The model was trained and evaluated on a large dataset from 18 IVF centers consisting of 115,832 embryos, of which 14,644 were transferred KID embryos. In an independent test set, the AI model sorted KID embryos with an area under the receiver operating characteristic curve (AUC) of 0.67 and all embryos with an AUC of 0.95. A clinic hold-out test showed that the model generalized to new clinics with an AUC range of 0.60-0.75 for KID embryos. Across different subgroups of age, insemination method, incubation time, and transfer protocol, the AUC ranged between 0.63 and 0.69. Furthermore, model predictions correlated positively with blastocyst grading and negatively with direct cleavages. The fully automated iDAScore v1.0 model was shown to perform at least as well as a state-of-the-art manual embryo selection model. Moreover, full automation of embryo scoring implies fewer manual evaluations and eliminates biases due to inter- and intra-observer variation.
[13]:Deep Gaussian Scale Mixture Prior for Spectral Compressive Imaging
Authors: Tao Huang, Weisheng Dong, Xin Yuan, Jinjian Wu, Guangming Shi
Comments: 10 pages, 8 figures, CVPR 2021
Link: https://arxiv.org/abs/2103.07152
Abstract: In a coded aperture snapshot spectral imaging (CASSI) system, a real-world hyperspectral image (HSI) can be reconstructed from the compressive image captured in a single snapshot. Model-based HSI reconstruction methods employ hand-crafted priors to solve the reconstruction problem, but most of them achieve only limited success due to the poor representation capability of these hand-crafted priors. Deep learning based methods that directly learn the mapping from compressive images to HSIs achieve much better results, yet it is nontrivial to design a powerful deep network heuristically that achieves satisfactory results. In this paper, we propose a novel HSI reconstruction method based on the maximum a posteriori (MAP) estimation framework with a learned Gaussian Scale Mixture (GSM) prior. Different from existing GSM models that use hand-crafted scale priors (e.g., Jeffrey's prior), we propose to learn the scale prior through a deep convolutional neural network (DCNN). Furthermore, we also propose to estimate the local means of the GSM models with the DCNN. All the parameters of the MAP estimation algorithm and the DCNN are jointly optimized through end-to-end training. Extensive experimental results on both synthetic and real datasets demonstrate that the proposed method outperforms existing state-of-the-art methods. The code is available at this https URL.
[14]:DP-Image: Differential Privacy for Image Data in Feature Space
Authors: Bo Liu, Ming Ding, Hanyu Xue, Tianqing Zhu, Dayong Ye, Li Song, Wanlei Zhou
Link: https://arxiv.org/abs/2103.07073
Abstract: The excessive use of images in social networks, government databases, and industrial applications has posed great privacy risks and raised serious public concerns. Even though differential privacy (DP) is a widely accepted criterion that can provide a provable privacy guarantee, applying DP to unstructured data such as images is not trivial due to the lack of a clear qualification of the meaningful difference between any two images. In this paper, for the first time, we introduce a novel notion of image-aware differential privacy, referred to as DP-Image, that can protect users' personal information in images from both human and AI adversaries. The DP-Image definition is formulated as an extended version of traditional differential privacy, considering distance measurements between the feature-space vectors of images. We then propose a mechanism to achieve DP-Image by adding noise to an image feature vector. Finally, we conduct experiments with a case study on face image privacy. Our results show that the proposed DP-Image method provides excellent DP protection for images, with a controllable distortion to faces.
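The core step, adding calibrated noise to an image's feature vector, can be illustrated with the standard Gaussian mechanism; the clipping step, sensitivity, and (epsilon, delta) values below are assumptions rather than the paper's calibration, which ties the noise to a feature-space distance between images:

    import numpy as np

    def gaussian_mechanism(feature, epsilon=0.8, delta=1e-5, sensitivity=1.0, rng=None):
        """Perturb a feature vector with Gaussian noise calibrated to (epsilon, delta)-DP
        under the given L2 sensitivity (classical Gaussian-mechanism calibration)."""
        rng = rng or np.random.default_rng()
        # Clip to bound the L2 sensitivity of the (assumed) feature extractor.
        norm = np.linalg.norm(feature)
        clipped = feature * min(1.0, sensitivity / (norm + 1e-12))
        sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
        return clipped + rng.normal(0.0, sigma, size=feature.shape)

    # Toy usage: a 512-d face embedding perturbed before release or reconstruction.
    feat = np.random.randn(512)
    private_feat = gaussian_mechanism(feat, epsilon=0.8)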
[15]:Severity Quantification and Lesion Localization of COVID-19 on CXR using Vision Transformer
Authors: Gwanghyun Kim, Sangjoon Park, Yujin Oh, Joon Beom Seo, Sang Min Lee, Jin Hwan Kim, Sungjun Moon, Jae-Kwang Lim, Jong Chul Ye
Comments: 8 pages
Link: https://arxiv.org/abs/2103.07062
Abstract: Under the global COVID-19 pandemic, building an automated framework that quantifies the severity of COVID-19 and localizes the relevant lesions on chest X-ray images has become increasingly important. Although pixel-level lesion severity labels, e.g. lesion segmentation, would be the ideal target for building a robust model, collecting enough data with such labels is difficult due to the time- and labor-intensive annotation tasks. Instead, array-based severity labeling, which assigns integer scores to six subdivisions of the lungs, is an alternative that enables quick labeling. Several groups have proposed deep learning algorithms that quantify the severity of COVID-19 using such array-based labels and localize the lesions with explainability maps. To further improve the accuracy and interpretability, we propose a novel Vision Transformer tailored for both quantification of severity and clinically applicable localization of COVID-19 related lesions. Our model is trained in a weakly supervised manner to generate full probability maps from weak array-based labels. Furthermore, a novel progressive self-training method enables us to build a model with a small labeled dataset. Quantitative and qualitative analyses on an external test set demonstrate that our method performs comparably to radiologists on both tasks and remains stable in a real-world application.
[16]:Vision Transformer for COVID-19 CXR Diagnosis using Chest X-ray Feature Corpus
Authors: Sangjoon Park, Gwanghyun Kim, Yujin Oh, Joon Beom Seo, Sang Min Lee, Jin Hwan Kim, Sungjun Moon, Jae-Kwang Lim, Jong Chul Ye
Comments: 10 pages
Link: https://arxiv.org/abs/2103.07055
Abstract: Under the global COVID-19 crisis, developing a robust diagnosis algorithm for COVID-19 from CXR is hampered by the lack of well-curated COVID-19 datasets, even though CXR data for other diseases are abundant. This situation suits the vision transformer architecture, which can exploit abundant unlabeled data via pre-training. However, directly using an existing vision transformer with a corpus generated by a ResNet is not optimal for correct feature embedding. To mitigate this problem, we propose a novel vision transformer that uses a low-level CXR feature corpus obtained by extracting abnormal CXR features. Specifically, the backbone network is trained on large public datasets to detect abnormal features seen in routine diagnosis, such as consolidation and ground-glass opacity (GGO). The embedded features from the backbone network are then used as a corpus for vision transformer training. We examine our model on various external test datasets acquired from entirely different institutions to assess its generalization ability. Our experiments demonstrate that our method achieves state-of-the-art performance and better generalization capability, which are crucial for widespread deployment.
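A hedged sketch of the overall architecture described: a CNN backbone (trained separately on abnormality labels) yields low-level feature maps, which are flattened into tokens and fed, together with a class token, to a transformer encoder for the diagnosis head; the torchvision backbone and all layer sizes are assumptions:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class FeatureCorpusViT(nn.Module):
        """Transformer classifier over tokens taken from a pretrained CXR feature
        extractor (illustrative; not the authors' exact backbone or head)."""
        def __init__(self, dim=256, n_classes=2, depth=4, heads=8):
            super().__init__()
            backbone = resnet18(weights=None)
            self.backbone = nn.Sequential(*list(backbone.children())[:-2])   # (B, 512, h, w)
            self.to_tokens = nn.Conv2d(512, dim, kernel_size=1)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(dim, n_classes)

        def forward(self, x):
            tokens = self.to_tokens(self.backbone(x))         # (B, dim, h, w)
            tokens = tokens.flatten(2).transpose(1, 2)         # (B, h*w, dim) feature tokens
            cls = self.cls_token.expand(x.size(0), -1, -1)
            # Positional embeddings are omitted here for brevity.
            z = self.encoder(torch.cat([cls, tokens], dim=1))
            return self.head(z[:, 0])                          # classify from the class token

    # Toy usage on a chest X-ray resized to 224x224 (3-channel input assumed).
    model = FeatureCorpusViT()
    logits = model(torch.rand(2, 3, 224, 224))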