Computer Vision (cs.CV) Digest 0427

2962 reads · uploaded 2021-05-17 09:59:15

The following article is from 语言学探索.

Today's cs.CV listing contains 101 papers in total.

 

Detection (16 papers)

[1]: MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Title: MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Authors: Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, Nicolas Carion
Link: https://arxiv.org/abs/2104.12763
摘要:Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available atthis https URL.[2]:2.5D Visual Relationship Detection
Title: 2.5D Visual Relationship Detection
Authors: Yu-Chuan Su, Soravit Changpinyo, Xiangning Chen, Sathish Thoppay, Cho-Jui Hsieh, Lior Shapira, Radu Soricut, Hartwig Adam, Matthew Brown, Ming-Hsuan Yang, Boqing Gong
Link: https://arxiv.org/abs/2104.12727
摘要:Visual 2.5D perception involves understanding the semantics and geometry of a scene through reasoning about object relationships with respect to the viewer in an environment. However, existing works in visual recognition primarily focus on the semantics. To bridge this gap, we study 2.5D visual relationship detection (2.5VRD), in which the goal is to jointly detect objects and predict their relative depth and occlusion relationships. Unlike general VRD, 2.5VRD is egocentric, using the camera's viewpoint as a common reference for all 2.5D relationships. Unlike depth estimation, 2.5VRD is object-centric and not only focuses on depth. To enable progress on this task, we create a new dataset consisting of 220k human-annotated 2.5D relationships among 512K objects from 11K images. We analyze this dataset and conduct extensive experiments including benchmarking multiple state-of-the-art VRD models on this task. Our results show that existing models largely rely on semantic cues and simple heuristics to solve 2.5VRD, motivating further research on models for 2.5D perception. The new dataset is available atthis https URL.[3]:PatchGuard++: Efficient Provable Attack Detection against Adversarial  Patches
Title: PatchGuard++: Efficient Provable Attack Detection against Adversarial Patches
Authors: Chong Xiang, Prateek Mittal
Comments: ICLR 2021 Workshop on Security and Safety in Machine Learning Systems
Link: https://arxiv.org/abs/2104.12609
摘要:An adversarial patch can arbitrarily manipulate image pixels within a restricted region to induce model misclassification. The threat of this localized attack has gained significant attention because the adversary can mount a physically-realizable attack by attaching patches to the victim object. Recent provably robust defenses generally follow the PatchGuard framework by using CNNs with small receptive fields and secure feature aggregation for robust model predictions. In this paper, we extend PatchGuard to PatchGuard++ for provably detecting the adversarial patch attack to boost both provable robust accuracy and clean accuracy. In PatchGuard++, we first use a CNN with small receptive fields for feature extraction so that the number of features corrupted by the adversarial patch is bounded. Next, we apply masks in the feature space and evaluate predictions on all possible masked feature maps. Finally, we extract a pattern from all masked predictions to catch the adversarial patch attack. We evaluate PatchGuard++ on ImageNette (a 10-class subset of ImageNet), ImageNet, and CIFAR-10 and demonstrate that PatchGuard++ significantly improves the provable robustness and clean performance.[4]:Variational Pedestrian Detection
Title: Variational Pedestrian Detection
Authors: Yuang Zhang, Huanyu He, Jianguo Li, Yuxi Li, John See, Weiyao Lin
Link: https://arxiv.org/abs/2104.12389
摘要:Pedestrian detection in a crowd is a challenging task due to a high number of mutually-occluding human instances, which brings ambiguity and optimization difficulties to the current IoU-based ground truth assignment procedure in classical object detection methods. In this paper, we develop a unique perspective of pedestrian detection as a variational inference problem. We formulate a novel and efficient algorithm for pedestrian detection by modeling the dense proposals as a latent variable while proposing a customized Auto Encoding Variational Bayes (AEVB) algorithm. Through the optimization of our proposed algorithm, a classical detector can be fashioned into a variational pedestrian detector. Experiments conducted on CrowdHuman and CityPersons datasets show that the proposed algorithm serves as an efficient solution to handle the dense pedestrian detection problem for the case of single-stage detectors. Our method can also be flexibly applied to two-stage detectors, achieving notable performance enhancement.[5]:ODDObjects: A Framework for Multiclass Unsupervised Anomaly Detection on  Masked Objects
Title: ODDObjects: A Framework for Multiclass Unsupervised Anomaly Detection on Masked Objects
Authors: Ricky Ma
Comments: 11 pages, 15 Postscript figures
Link: https://arxiv.org/abs/2104.12300
Abstract: This paper presents a novel framework for unsupervised anomaly detection on masked objects called ODDObjects, which stands for Out-of-Distribution Detection on Objects. ODDObjects is designed to detect anomalies of various categories using unsupervised autoencoders trained on COCO-style datasets. The method utilizes autoencoder-based image reconstruction, where high reconstruction error indicates the possibility of an anomaly. The framework extends previous work on anomaly detection with autoencoders, comparing state-of-the-art models trained on object recognition datasets. Various model architectures were compared, and experimental results show that memory-augmented deep convolutional autoencoders perform the best at detecting out-of-distribution objects.
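The scoring rule at the core of this framework is easy to state in code. Below is a minimal, illustrative sketch (not the authors' implementation; the tiny autoencoder and the threshold are placeholders): an autoencoder trained only on normal masked objects reconstructs them well, so a high per-image reconstruction error is taken as evidence of an out-of-distribution object.

```python
import torch
import torch.nn as nn

class TinyConvAE(nn.Module):
    """Stand-in for the paper's (memory-augmented) convolutional autoencoders."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def anomaly_scores(model, images):
    """Per-image mean squared reconstruction error; higher = more anomalous."""
    recon = model(images)
    return ((images - recon) ** 2).mean(dim=(1, 2, 3))

model = TinyConvAE().eval()
batch = torch.rand(8, 3, 64, 64)      # masked object crops, normalized to [0, 1]
scores = anomaly_scores(model, batch)
is_anomaly = scores > 0.05            # threshold chosen on a validation set (placeholder)
```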
[6]: Single Stage Class Agnostic Common Object Detection: A Simple Baseline
Title: Single Stage Class Agnostic Common Object Detection: A Simple Baseline
Authors: Chuong H. Nguyen, Thuy C. Nguyen, Anh H. Vo, Yamazaki Masayuki
Comments: This paper is accepted to International Conference on Pattern Recognition Applications and Methods (ICPRAM) 2021
Link: https://arxiv.org/abs/2104.12245
摘要:This paper addresses the problem of common object detection, which aims to detect objects of similar categories from a set of images. Although it shares some similarities with the standard object detection and co-segmentation, common object detection, recently promoted by \cite{Jiang2019a}, has some unique advantages and challenges. First, it is designed to work on both closed-set and open-set conditions, a.k.a. known and unknown objects. Second, it must be able to match objects of the same category but not restricted to the same instance, texture, or posture. Third, it can distinguish multiple objects. In this work, we introduce the Single Stage Common Object Detection (SSCOD) to detect class-agnostic common objects from an image set. The proposed method is built upon the standard single-stage object detector. Furthermore, an embedded branch is introduced to generate the object's representation feature, and their similarity is measured by cosine distance. Experiments are conducted on PASCAL VOC 2007 and COCO 2014 datasets. While being simple and flexible, our proposed SSCOD built upon ATSSNet performs significantly better than the baseline of the standard object detection, while still be able to match objects of unknown categories. Our source code can be found at \href{this https URL}{(URL)}[7]:Breast Mass Detection with Faster R-CNN: On the Feasibility of Learning  from Noisy Annotations
Title: Breast Mass Detection with Faster R-CNN: On the Feasibility of Learning from Noisy Annotations
Authors: Sina Famouri, Lia Morra, Leonardo Mangia, Fabrizio Lamberti
Link: https://arxiv.org/abs/2104.12218
摘要:In this work we study the impact of noise on the training of object detection networks for the medical domain, and how it can be mitigated by improving the training procedure. Annotating large medical datasets for training data-hungry deep learning models is expensive and time consuming. Leveraging information that is already collected in clinical practice, in the form of text reports, bookmarks or lesion measurements would substantially reduce this cost. Obtaining precise lesion bounding boxes through automatic mining procedures, however, is difficult. We provide here a quantitative evaluation of the effect of bounding box coordinate noise on the performance of Faster R-CNN object detection networks for breast mass detection. Varying degrees of noise are simulated by randomly modifying the bounding boxes: in our experiments, bounding boxes could be enlarged up to six times the original size. The noise is injected in the CBIS-DDSM collection, a well curated public mammography dataset for which accurate lesion location is available. We show how, due to an imperfect matching between the ground truth and the network bounding box proposals, the noise is propagated during training and reduces the ability of the network to correctly classify lesions from background. When using the standard Intersection over Union criterion, the area under the FROC curve decreases by up to 9%. A novel matching criterion is proposed to improve tolerance to noise.[8]:Temp-Frustum Net: 3D Object Detection with Temporal Fusion
Title: Temp-Frustum Net: 3D Object Detection with Temporal Fusion
Authors: Emeç Erçelik, Ekim Yurtsever, Alois Knoll
Comments: To be published in 32nd IEEE Intelligent Vehicles Symposium
Link: https://arxiv.org/abs/2104.12106
摘要:3D object detection is a core component of automated driving systems. State-of-the-art methods fuse RGB imagery and LiDAR point cloud data frame-by-frame for 3D bounding box regression. However, frame-by-frame 3D object detection suffers from noise, field-of-view obstruction, and sparsity. We propose a novel Temporal Fusion Module (TFM) to use information from previous time-steps to mitigate these problems. First, a state-of-the-art frustum network extracts point cloud features from raw RGB and LiDAR point cloud data frame-by-frame. Then, our TFM module fuses these features with a recurrent neural network. As a result, 3D object detection becomes robust against single frame failures and transient occlusions. Experiments on the KITTI object tracking dataset show the efficiency of the proposed TFM, where we obtain ~6%, ~4%, and ~6% improvements on Car, Pedestrian, and Cyclist classes, respectively, compared to frame-by-frame baselines. Furthermore, ablation studies reinforce that the subject of improvement is temporal fusion and show the effects of different placements of TFM in the object detection pipeline. Our code is open-source and available atthis https URL.[9]:Unsupervised Learning of Multi-level Structures for Anomaly Detection
Title: Unsupervised Learning of Multi-level Structures for Anomaly Detection
Authors: Songmin Dai, Jide Li, Lu Wang, Congcong Zhu, Yifan Wu, Xiaoqiang Li
Link: https://arxiv.org/abs/2104.12102
摘要:The main difficulty in high-dimensional anomaly detection tasks is the lack of anomalous data for training. And simply collecting anomalous data from the real world, common distributions, or the boundary of normal data manifold may face the problem of missing anomaly modes. This paper first introduces a novel method to generate anomalous data by breaking up global structures while preserving local structures of normal data at multiple levels. It can efficiently expose local abnormal structures of various levels. To fully exploit the exposed multi-level abnormal structures, we propose to train multiple level-specific patch-based detectors with contrastive losses. Each detector learns to detect local abnormal structures of corresponding level at all locations and outputs patchwise anomaly scores. By aggregating the outputs of all level-specific detectors, we obtain a model that can detect all potential anomalies. The effectiveness is evaluated on MNIST, CIFAR10, and ImageNet10 dataset, where the results surpass the accuracy of state-of-the-art methods. Qualitative experiments demonstrate our model is robust that it unbiasedly detects all anomaly modes.[10]:RelTransformer: Balancing the Visual Relationship Detection from Local  Context, Scene and Memory
Title: RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory
Authors: Jun Chen, Aniket Agarwal, Sherif Abdelkarim, Deyao Zhu, Mohamed Elhoseiny
Link: https://arxiv.org/abs/2104.11934
摘要:Visual relationship recognition (VRR) is a fundamental scene understanding task. The structure that VRR provides is essential to improve the AI interpretability in downstream tasks such as image captioning and visual question answering. Several recent studies showed that the long-tail problem in VRR is even more critical than that in object recognition due to the compositional complexity and structure. To overcome this limitation, we propose a novel transformer-based framework, dubbed as RelTransformer, which performs relationship prediction using rich semantic features from multiple image levels. We assume that more abundantcon textual features can generate more accurate and discriminative relationships, which can be useful when sufficient training data are lacking. The key feature of our model is its ability to aggregate three different-level features (local context, scene, and dataset-level) to compositionally predict the visual relationship. We evaluate our model on the visual genome and two "long-tail" VRR datasets, GQA-LT and VG8k-LT. Extensive experiments demonstrate that our RelTransformer could improve over the state-of-the-art baselines on all the datasets. In addition, our model significantly improves the accuracy of GQA-LT by 27.4% upon the best baselines on tail-relationship prediction. Our code is available inthis https URL.[11]:Anomaly Detection for Solder Joints Using $β$-VAE
Title: Anomaly Detection for Solder Joints Using β-VAE
Authors: Furkan Ulger, Seniha Esen Yuksel, Atila Yilmaz
Link: https://arxiv.org/abs/2104.11927
Abstract: In the assembly process of printed circuit boards (PCB), most of the errors are caused by solder joints in Surface Mount Devices (SMD). In the literature, traditional feature extraction based methods require designing hand-crafted features and rely on the tiered RGB illumination to detect solder joint errors, whereas the supervised Convolutional Neural Network (CNN) based approaches require a lot of labelled abnormal samples (defective solder joints) to achieve high accuracy. To solve the optical inspection problem in unrestricted environments with no special lighting and without the existence of error-free reference boards, we propose a new beta-Variational Autoencoders (beta-VAE) architecture for anomaly detection that can work on both IC and non-IC components. We show that the proposed model learns disentangled representation of data, leading to more independent features and improved latent space representations. We compare the activation and gradient-based representations that are used to characterize anomalies, and observe the effect of different beta parameters on accuracy and on untwining the feature representations in beta-VAE. Finally, we show that anomalies on solder joints can be detected with high accuracy via a model trained directly on normal samples, without designated hardware or feature engineering.
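For reference, the beta-VAE objective the abstract builds on weights the KL term of a standard VAE by a factor beta; larger beta values encourage more disentangled latent factors at some cost in reconstruction quality. A minimal sketch of the generic loss (not the authors' exact architecture or training code; the tensors below are toy placeholders):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """ELBO-style loss: reconstruction term + beta-weighted KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl

# toy usage with hypothetical encoder/decoder outputs
x = torch.rand(16, 3, 64, 64)
x_recon = torch.rand(16, 3, 64, 64)
mu, logvar = torch.zeros(16, 32), torch.zeros(16, 32)
loss = beta_vae_loss(x, x_recon, mu, logvar, beta=4.0)
```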
[12]: M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers
Title: M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers
Authors: Tianrui Guan, Jun Wang, Shiyi Lan, Rohan Chandra, Zuxuan Wu, Larry Davis, Dinesh Manocha
Link: https://arxiv.org/abs/2104.11896
摘要:We present a novel architecture for 3D object detection, M3DeTR, which combines different point cloud representations (raw, voxels, bird-eye view) with different feature scales based on multi-scale feature pyramids. M3DeTR is the first approach that unifies multiple point cloud representations, feature scales, as well as models mutual relationships between point clouds simultaneously using transformers. We perform extensive ablation experiments that highlight the benefits of fusing representation and scale, and modeling the relationships. Our method achieves state-of-the-art performance on the KITTI 3D object detection dataset and Waymo Open Dataset. Results show that M3DeTR improves the baseline significantly by 1.48% mAP for all classes on Waymo Open Dataset. In particular, our approach ranks 1st on the well-known KITTI 3D Detection Benchmark for both car and cyclist classes, and ranks 1st on Waymo Open Dataset with single frame point cloud input.[13]:A Survey of Modern Deep Learning based Object Detection Models
Title: A Survey of Modern Deep Learning based Object Detection Models
Authors: Syed Sahil Abbas Zaidi, Mohammad Samar Ansari, Asra Aslam, Nadia Kanwal, Mamoona Asghar, Brian Lee
Comments: Preprint submitted to IET Computer Vision
Link: https://arxiv.org/abs/2104.11892
摘要:Object Detection is the task of classification and localization of objects in an image or video. It has gained prominence in recent years due to its widespread applications. This article surveys recent developments in deep learning based object detectors. Concise overview of benchmark datasets and evaluation metrics used in detection is also provided along with some of the prominent backbone architectures used in recognition tasks. It also covers contemporary lightweight classification models used on edge devices. Lastly, we compare the performances of these architectures on multiple metrics.[14]:Oriented Bounding Boxes for Small and Freely Rotated Objects
Title: Oriented Bounding Boxes for Small and Freely Rotated Objects
Authors: Mohsen Zand, Ali Etemad, Michael Greenspan
Comments: IEEE Transactions on Geoscience and Remote Sensing, 2021
Link: https://arxiv.org/abs/2104.11854
摘要:A novel object detection method is presented that handles freely rotated objects of arbitrary sizes, including tiny objects as small as $2\times 2$ pixels. Such tiny objects appear frequently in remotely sensed images, and present a challenge to recent object detection algorithms. More importantly, current object detection methods have been designed originally to accommodate axis-aligned bounding box detection, and therefore fail to accurately localize oriented boxes that best describe freely rotated objects. In contrast, the proposed CNN-based approach uses potential pixel information at multiple scale levels without the need for any external resources, such as anchor boxes.The method encodes the precise location and orientation of features of the target objects at grid cell locations. Unlike existing methods which regress the bounding box location and dimension,the proposed method learns all the required information by classification, which has the added benefit of enabling oriented bounding box detection without any extra computation. It thus infers the bounding boxes only at inference time by finding the minimum surrounding box for every set of the same predicted class labels. Moreover, a rotation-invariant feature representation is applied to each scale, which imposes a regularization constraint to enforce covering the 360 degree range of in-plane rotation of the training samples to share similar features. Evaluations on the xView and DOTA datasets show that the proposed method uniformly improves performance over existing state-of-the-art methods.[15]:FedDPGAN: Federated Differentially Private Generative Adversarial  Networks Framework for the Detection of COVID-19 Pneumonia
Title: FedDPGAN: Federated Differentially Private Generative Adversarial Networks Framework for the Detection of COVID-19 Pneumonia
Authors: Longling Zhang, Bochen Shen, Ahmed Barnawi, Shan Xi, Neeraj Kumar, Yi Wu
Link: https://arxiv.org/abs/2104.12581
摘要:Existing deep learning technologies generally learn the features of chest X-ray data generated by Generative Adversarial Networks (GAN) to diagnose COVID-19 pneumonia. However, the above methods have a critical challenge: data privacy. GAN will leak the semantic information of the training data which can be used to reconstruct the training samples by attackers, thereby this method will leak the privacy of the patient. Furthermore, for this reason that is the limitation of the training data sample, different hospitals jointly train the model through data sharing, which will also cause the privacy leakage. To solve this problem, we adopt the Federated Learning (FL) frame-work which is a new technique being used to protect the data privacy. Under the FL framework and Differentially Private thinking, we propose a FederatedDifferentially Private Generative Adversarial Network (FedDPGAN) to detectCOVID-19 pneumonia for sustainable smart cities. Specifically, we use DP-GAN to privately generate diverse patient data in which differential privacy technology is introduced to make sure the privacy protection of the semantic information of training dataset. Furthermore, we leverage FL to allow hospitals to collaboratively train COVID-19 models without sharing the original data. Under Independent and Identically Distributed (IID) and non-IID settings, The evaluation of the proposed model is on three types of chest X-ray (CXR) images dataset (COVID-19, normal, and normal pneumonia). A large number of the truthful reports make the verification of our model can effectively diagnose COVID-19 without compromising privacy.[16]:On the Role of Sensor Fusion for Object Detection in Future Vehicular  Networks
Title: On the Role of Sensor Fusion for Object Detection in Future Vehicular Networks
Authors: Valentina Rossi, Paolo Testolina, Marco Giordani, Michele Zorzi
Comments: This paper has been accepted for presentation at the Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit). 6 pages, 6 figures, 1 table
Link: https://arxiv.org/abs/2104.11785
摘要:Fully autonomous driving systems require fast detection and recognition of sensitive objects in the environment. In this context, intelligent vehicles should share their sensor data with computing platforms and/or other vehicles, to detect objects beyond their own sensors' fields of view. However, the resulting huge volumes of data to be exchanged can be challenging to handle for standard communication technologies. In this paper, we evaluate how using a combination of different sensors affects the detection of the environment in which the vehicles move and operate. The final objective is to identify the optimal setup that would minimize the amount of data to be distributed over the channel, with negligible degradation in terms of object detection accuracy. To this aim, we extend an already available object detection algorithm so that it can consider, as an input, camera images, LiDAR point clouds, or a combination of the two, and compare the accuracy performance of the different approaches using two realistic datasets. Our results show that, although sensor fusion always achieves more accurate detections, LiDAR only inputs can obtain similar results for large objects while mitigating the burden on the channel.
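The abstract compares camera-only, LiDAR-only and fused inputs to the same detector. One common way to build such a fused input, shown below as an assumption rather than the paper's exact scheme, is to render the LiDAR points as a sparse depth image in the camera frame and stack it with the RGB channels (the intrinsics here are example values):

```python
import numpy as np

def depth_image_from_lidar(points_cam, K, height, width):
    """points_cam: (N, 3) LiDAR points already in the camera frame; K: 3x3 intrinsics."""
    depth = np.zeros((height, width), dtype=np.float32)
    z = points_cam[:, 2]
    valid = z > 0.1
    uvw = (K @ points_cam[valid].T).T                  # project to pixel coordinates
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    keep = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth[v[keep], u[keep]] = z[valid][keep]
    return depth

rgb = np.random.rand(375, 1242, 3).astype(np.float32)           # camera image
pts = np.random.rand(1000, 3).astype(np.float32) * 40 + 1.0     # fake points in camera frame
K = np.array([[721.5, 0, 609.6], [0, 721.5, 172.8], [0, 0, 1.0]])   # example intrinsics
depth = depth_image_from_lidar(pts, K, *rgb.shape[:2])
fused_input = np.concatenate([rgb, depth[..., None]], axis=-1)  # H x W x 4 detector input
```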

Segmentation (8 papers)

[1]: Rich Semantics Improve Few-shot Learning
Title: Rich Semantics Improve Few-shot Learning
Authors: Mohamed Afham, Salman Khan, Muhammad Haris Khan, Muzammal Naseer, Fahad Shahbaz Khan
Link: https://arxiv.org/abs/2104.12709
摘要:Human learning benefits from multi-modal inputs that often appear as rich semantics (e.g., description of an object's attributes while learning about it). This enables us to learn generalizable concepts from very limited visual examples. However, current few-shot learning (FSL) methods use numerical class labels to denote object classes which do not provide rich semantic meanings about the learned concepts. In this work, we show that by using 'class-level' language descriptions, that can be acquired with minimal annotation cost, we can improve the FSL performance. Given a support set and queries, our main idea is to create a bottleneck visual feature (hybrid prototype) which is then used to generate language descriptions of the classes as an auxiliary task during training. We develop a Transformer based forward and backward encoding mechanism to relate visual and semantic tokens that can encode intricate relationships between the two modalities. Forcing the prototypes to retain semantic information about class description acts as a regularizer on the visual features, improving their generalization to novel classes at inference. Furthermore, this strategy imposes a human prior on the learned representations, ensuring that the model is faithfully relating visual and semantic concepts, thereby improving model interpretability. Our experiments on four datasets and ablation studies show the benefit of effectively modeling rich semantics for FSL.[2]:Spherical formulation of geometric motion segmentation constraints in  fisheye cameras
Title: Spherical formulation of geometric motion segmentation constraints in fisheye cameras
Authors: Letizia Mariotti, Ciaran Eising
Comments: arXiv admin note: text overlap with arXiv:2003.03262
Link: https://arxiv.org/abs/2104.12404
摘要:We introduce a visual motion segmentation method employing spherical geometry for fisheye cameras and automoated driving. Three commonly used geometric constraints in pin-hole imagery (the positive height, positive depth and epipolar constraints) are reformulated to spherical coordinates, making them invariant to specific camera configurations as long as the camera calibration is known. A fourth constraint, known as the anti-parallel constraint, is added to resolve motion-parallax ambiguity, to support the detection of moving objects undergoing parallel or near-parallel motion with respect to the host vehicle. A final constraint constraint is described, known as the spherical three-view constraint, is described though not employed in our proposed algorithm. Results are presented and analyzed that demonstrate that the proposal is an effective motion segmentation approach for direct employment on fisheye imagery.[3]:Learning to Better Segment Objects from Unseen Classes with Unlabeled  Videos
Title: Learning to Better Segment Objects from Unseen Classes with Unlabeled Videos
Authors: Yuming Du, Yang Xiao, Vincent Lepetit
Comments: 18 pages, 14 figures. See project page this https URL
Link: https://arxiv.org/abs/2104.12276
摘要:The ability to localize and segment objects from unseen classes would open the door to new applications, such as autonomous object learning in active vision. Nonetheless, improving the performance on unseen classes requires additional training data, while manually annotating the objects of the unseen classes can be labor-extensive and expensive. In this paper, we explore the use of unlabeled video sequences to automatically generate training data for objects of unseen classes. It is in principle possible to apply existing video segmentation methods to unlabeled videos and automatically obtain object masks, which can then be used as a training set even for classes with no manual labels available. However, our experiments show that these methods do not perform well enough for this purpose. We therefore introduce a Bayesian method that is specifically designed to automatically create such a training set: Our method starts from a set of object proposals and relies on (non-realistic) analysis-by-synthesis to select the correct ones by performing an efficient optimization over all the frames simultaneously. Through extensive experiments, we show that our method can generate a high-quality training set which significantly boosts the performance of segmenting objects of unseen classes. We thus believe that our method could open the door for open-world instance segmentation using abundant Internet videos.[4]:A novel segmentation dataset for signatures on bank checks
Title: A novel segmentation dataset for signatures on bank checks
Authors: Muhammad Saif Ullah Khan
Link: https://arxiv.org/abs/2104.12203
Abstract: The dataset presented provides high-resolution images of real, filled out bank checks containing various complex backgrounds, and handwritten text and signatures in the respective fields, along with both pixel-level and patch-level segmentation masks for the signatures on the checks. The images of bank checks were obtained from different sources, including other publicly available check datasets, publicly available images on the internet, as well as scans and images of real checks. Using the GIMP graphics software, pixel-level segmentation masks for signatures on these checks were manually generated as binary images. An automated script was then used to generate patch-level masks. The dataset was created to train and test networks for extracting signatures from bank checks and other similar documents with very complex backgrounds.
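The patch-level masks mentioned above can be derived from the pixel-level masks with a few lines of code. The sketch below shows one plausible version of such a script (the patch size and threshold are assumptions, not the dataset's actual parameters): a patch is labelled positive when it contains enough signature pixels.

```python
import numpy as np

def patch_mask(pixel_mask, patch=32, min_frac=0.05):
    """pixel_mask: (H, W) binary array; returns (H//patch, W//patch) binary patch labels."""
    h, w = pixel_mask.shape
    h, w = h - h % patch, w - w % patch          # crop to a multiple of the patch size
    blocks = pixel_mask[:h, :w].reshape(h // patch, patch, w // patch, patch)
    frac = blocks.mean(axis=(1, 3))              # fraction of signature pixels per patch
    return (frac >= min_frac).astype(np.uint8)

mask = np.zeros((300, 700), dtype=np.uint8)
mask[220:270, 450:650] = 1                       # fake signature region on a check image
patches = patch_mask(mask, patch=50)
```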
[5]: MIDeepSeg: Minimally Interactive Segmentation of Unseen Objects from Medical Images Using Deep Learning
Title: MIDeepSeg: Minimally Interactive Segmentation of Unseen Objects from Medical Images Using Deep Learning
Authors: Xiangde Luo, Guotai Wang, Tao Song, Jingyang Zhang, Michael Aertsen, Jan Deprest, Sebastien Ourselin, Tom Vercauteren, Shaoting Zhang
Comments: 17 pages, 16 figures
Link: https://arxiv.org/abs/2104.12166
摘要:Segmentation of organs or lesions from medical images plays an essential role in many clinical applications such as diagnosis and treatment planning. Though Convolutional Neural Networks (CNN) have achieved the state-of-the-art performance for automatic segmentation, they are often limited by the lack of clinically acceptable accuracy and robustness in complex cases. Therefore, interactive segmentation is a practical alternative to these methods. However, traditional interactive segmentation methods require a large amount of user interactions, and recently proposed CNN-based interactive segmentation methods are limited by poor performance on previously unseen objects. To solve these problems, we propose a novel deep learning-based interactive segmentation method that not only has high efficiency due to only requiring clicks as user inputs but also generalizes well to a range of previously unseen objects. Specifically, we first encode user-provided interior margin points via our proposed exponentialized geodesic distance that enables a CNN to achieve a good initial segmentation result of both previously seen and unseen objects, then we use a novel information fusion method that combines the initial segmentation with only few additional user clicks to efficiently obtain a refined segmentation. We validated our proposed framework through extensive experiments on 2D and 3D medical image segmentation tasks with a wide range of previous unseen objects that were not present in the training set. Experimental results showed that our proposed framework 1) achieves accurate results with fewer user interactions and less time compared with state-of-the-art interactive frameworks and 2) generalizes well to previously unseen objects.[6]:Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for  Fine-Resolution Remote Sensing Images
Title: Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images
Authors: Libo Wang, Rui Li, Chenxi Duan, Shenghui Fang
Link: https://arxiv.org/abs/2104.12137
摘要:The fully-convolutional network (FCN) with an encoder-decoder architecture has become the standard paradigm for semantic segmentation. The encoder-decoder architecture utilizes an encoder to capture multi-level feature maps, which are then incorporated into the final prediction by a decoder. As the context is critical for precise segmentation, tremendous effort has been made to extract such information in an intelligent manner, including employing dilated/atrous convolutions or inserting attention modules. However, the aforementioned endeavors are all based on the FCN architecture with ResNet backbone which cannot tackle the context issue from the root. By contrast, we introduce the Swin Transformer as the backbone to fully extract the context information and design a novel decoder named densely connected feature aggregation module (DCFAM) to restore the resolution and generate the segmentation map. The extensive experiments on two datasets demonstrate the effectiveness of the proposed scheme.[7]:Calibrating LiDAR and Camera using Semantic Mutual information
Title: Calibrating LiDAR and Camera using Semantic Mutual Information
Authors: Peng Jiang, Philip Osteen, Srikanth Saripalli
Comments: 5 pages, 6 figures, submitted to ICRA 2021 workshop
Link: https://arxiv.org/abs/2104.12023
摘要:We propose an algorithm for automatic, targetless, extrinsic calibration of a LiDAR and camera system using semantic information. We achieve this goal by maximizing mutual information (MI) of semantic information between sensors, leveraging a neural network to estimate semantic mutual information, and matrix exponential for calibration computation. Using kernel-based sampling to sample data from camera measurement based on LiDAR projected points, we formulate the problem as a novel differentiable objective function which supports the use of gradient-based optimization methods. We also introduce an initial calibration method using 2D MI-based image registration. Finally, we demonstrate the robustness of our method and quantitatively analyze the accuracy on a synthetic dataset and also evaluate our algorithm qualitatively on KITTI360 and RELLIS-3D benchmark datasets, showing improvement over recent comparable approaches.[8]:Learning to Address Intra-segment Misclassification in Retinal Imaging
Title: Learning to Address Intra-segment Misclassification in Retinal Imaging
Authors: Yukun Zhou, Moucheng Xu, Yipeng Hu, Hongxiang Lin, Joseph Jacob, Pearse Keane, Daniel Alexander
Comments: 10 pages, 5 figures, and 2 tables
Link: https://arxiv.org/abs/2104.12138
摘要:Accurate multi-class segmentation is a long-standing challenge in medical imaging, especially in scenarios where classes share strong similarity. Segmenting retinal blood vessels in retinal photographs is one such scenario, in which arteries and veins need to be identified and differentiated from each other and from the background. Intra-segment misclassification, i.e. veins classified as arteries or vice versa, frequently occurs when arteries and veins intersect, whereas in binary retinal vessel segmentation, error rates are much lower. We thus propose a new approach that decomposes multi-class segmentation into multiple binary, followed by a binary-to-multi-class fusion network. The network merges representations of artery, vein, and multi-class feature maps, each of which are supervised by expert vessel annotation in adversarial training. A skip-connection based merging process explicitly maintains class-specific gradients to avoid gradient vanishing in deep layers, to favor the discriminative features. The results show that, our model respectively improves F1-score by 4.4\%, 5.1\%, and 4.2\% compared with three state-of-the-art deep learning based methods on DRIVE-AV, LES-AV, and HRF-AV data sets.
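The paper learns the binary-to-multi-class fusion with an adversarially trained network; the snippet below only illustrates the underlying task in its simplest form (a hand-written rule, not the proposed model): combining per-pixel artery, vein and vessel probabilities into a single multi-class label map.

```python
import numpy as np

def fuse_binary_maps(p_artery, p_vein, p_vessel):
    """All inputs are (H, W) probabilities in [0, 1]; returns 0=background, 1=artery, 2=vein."""
    labels = np.zeros(p_artery.shape, dtype=np.uint8)
    vessel = p_vessel > 0.5                          # binary vessel detection is the easier problem
    labels[vessel & (p_artery >= p_vein)] = 1
    labels[vessel & (p_vein > p_artery)] = 2
    return labels

h, w = 64, 64
p_a, p_v, p_s = np.random.rand(h, w), np.random.rand(h, w), np.random.rand(h, w)
label_map = fuse_binary_maps(p_a, p_v, p_s)
```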

Classification & Recognition (7 papers)

[1]: Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets
Title: Towards Good Practices for Efficiently Annotating Large-Scale Image Classification Datasets
Authors: Yuan-Hong Liao, Amlan Kar, Sanja Fidler
Comments: CVPR 2021 Oral
Link: https://arxiv.org/abs/2104.12690
摘要:Data is the engine of modern computer vision, which necessitates collecting large-scale datasets. This is expensive, and guaranteeing the quality of the labels is a major challenge. In this paper, we investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images. While methods that exploit learnt models for labeling exist, a surprisingly prevalent approach is to query humans for a fixed number of labels per datum and aggregate them, which is expensive. Building on prior work on online joint probabilistic modeling of human annotations and machine-generated beliefs, we propose modifications and best practices aimed at minimizing human labeling effort. Specifically, we make use of advances in self-supervised learning, view annotation as a semi-supervised learning problem, identify and mitigate pitfalls and ablate several key design choices to propose effective guidelines for labeling. Our analysis is done in a more realistic simulation that involves querying human labelers, which uncovers issues with evaluation using existing worker simulation methods. Simulated experiments on a 125k image subset of the ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average, a 2.7x and 6.7x improvement over prior work and manual annotation, respectively. Project page:this https URL[2]:Model Guided Road Intersection Classification
Title: Model Guided Road Intersection Classification
Authors: Augusto Luis Ballardini, Álvaro Hernández, Miguel Ángel Sotelo
Comments: To be presented at the 2021 32nd IEEE Intelligent Vehicles Symposium (IV 2021)
Link: https://arxiv.org/abs/2104.12417
摘要:Understanding complex scenarios from in-vehicle cameras is essential for safely operating autonomous driving systems in densely populated areas. Among these, intersection areas are one of the most critical as they concentrate a considerable number of traffic accidents and fatalities. Detecting and understanding the scene configuration of these usually crowded areas is then of extreme importance for both autonomous vehicles and modern ADAS aimed at preventing road crashes and increasing the safety of vulnerable road users. This work investigates inter-section classification from RGB images using well-consolidate neural network approaches along with a method to enhance the results based on the teacher/student training paradigm. An extensive experimental activity aimed at identifying the best input configuration and evaluating different network parameters on both the well-known KITTI dataset and the new KITTI-360 sequences shows that our method outperforms current state-of-the-art approaches on a per-frame basis and prove the effectiveness of the proposed learning scheme.[3]:Wise-SrNet: A Novel Architecture for Enhancing Image Classification by  Learning Spatial Resolution of Feature Maps
Title: Wise-SrNet: A Novel Architecture for Enhancing Image Classification by Learning Spatial Resolution of Feature Maps
Authors: Mohammad Rahimzadeh, Soroush Parvin, Elnaz Safi, Mohammad Reza Mohammadi
Comments: The code is shared at this https URL
Link: https://arxiv.org/abs/2104.12294
摘要:One of the main challenges since the advancement of convolutional neural networks is how to connect the extracted feature map to the final classification layer. VGG models used two sets of fully connected layers for the classification part of their architectures, which significantly increases the number of models' weights. ResNet and next deep convolutional models used the Global Average Pooling (GAP) layer to compress the feature map and feed it to the classification layer. Although using the GAP layer reduces the computational cost, but also causes losing spatial resolution of the feature map, which results in decreasing learning efficiency. In this paper, we aim to tackle this problem by replacing the GAP layer with a new architecture called Wise-SrNet. It is inspired by the depthwise convolutional idea and is designed for processing spatial resolution and also not increasing computational cost. We have evaluated our method using three different datasets: Intel Image Classification Challenge, MIT Indoors Scenes, and a part of the ImageNet dataset. We investigated the implementation of our architecture on several models of Inception, ResNet and DensNet families. Applying our architecture has revealed a significant effect on increasing convergence speed and accuracy. Our Experiments on images with 224x224 resolution increased the Top-1 accuracy between 2% to 8% on different datasets and models. Running our models on 512x512 resolution images of the MIT Indoors Scenes dataset showed a notable result of improving the Top-1 accuracy within 3% to 26%. We will also demonstrate the GAP layer's disadvantage when the input images are large and the number of classes is not few. In this circumstance, our proposed architecture can do a great help in enhancing classification results. The code is shared atthis https URL.[4]:3D/2D regularized CNN feature hierarchy for Hyperspectral image  classification
Title: 3D/2D regularized CNN feature hierarchy for Hyperspectral image classification
Authors: Muhammad Ahmad, Manuel Mazzara, Salvatore Distefano
Link: https://arxiv.org/abs/2104.12136
Abstract: Convolutional Neural Networks (CNN) have been rigorously studied for Hyperspectral Image Classification (HSIC) and are known to be effective in exploiting joint spatial-spectral information, at the expense of lower generalization performance and learning speed due to the hard labels and non-uniform distribution over labels. Several regularization techniques have been used to overcome the aforesaid issues. However, sometimes models learn to predict the samples extremely confidently, which is not good from a generalization point of view. Therefore, this paper proposes an idea to enhance the generalization performance of a hybrid CNN for HSIC using soft labels that are a weighted average of the hard labels and a uniform distribution over ground labels. The proposed method helps to prevent the CNN from becoming over-confident. We empirically show that in improving generalization performance, label smoothing also improves model calibration which significantly improves beam-search. Several publicly available Hyperspectral datasets are used to validate the experimental evaluation, which reveals improved generalization performance, statistical significance, and computational complexity as compared to the state-of-the-art models. The code will be made available at this https URL.
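The soft labels described above correspond to standard label smoothing. A minimal sketch of that construction and the matching soft-target cross-entropy (illustrative only; the smoothing factor and class count are placeholders):

```python
import torch
import torch.nn.functional as F

def smoothed_targets(hard_labels, num_classes, epsilon=0.1):
    """Weighted average of the one-hot labels and a uniform distribution over classes."""
    one_hot = F.one_hot(hard_labels, num_classes).float()
    return (1.0 - epsilon) * one_hot + epsilon / num_classes

def soft_cross_entropy(logits, soft_targets):
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

logits = torch.randn(8, 16)                    # e.g. 16 land-cover classes of an HSI cube
labels = torch.randint(0, 16, (8,))
loss = soft_cross_entropy(logits, smoothed_targets(labels, 16, epsilon=0.1))
```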
[5]: ASPCNet: A Deep Adaptive Spatial Pattern Capsule Network for Hyperspectral Image Classification
Title: ASPCNet: A Deep Adaptive Spatial Pattern Capsule Network for Hyperspectral Image Classification
Authors: Jinping Wang, Xiaojun Tan, Jianhuang Lai, Jun Li, Canqun Xiang
Link: https://arxiv.org/abs/2104.12085
摘要:Previous studies have shown the great potential of capsule networks for the spatial contextual feature extraction from {hyperspectral images (HSIs)}. However, the sampling locations of the convolutional kernels of capsules are fixed and cannot be adaptively changed according to the inconsistent semantic information of HSIs. Based on this observation, this paper proposes an adaptive spatial pattern capsule network (ASPCNet) architecture by developing an adaptive spatial pattern (ASP) unit, that can rotate the sampling location of convolutional kernels on the basis of an enlarged receptive field. Note that this unit can learn more discriminative representations of HSIs with fewer parameters. Specifically, two cascaded ASP-based convolution operations (ASPConvs) are applied to input images to learn relatively high-level semantic features, transmitting hierarchical structures among capsules more accurately than the use of the most fundamental features. Furthermore, the semantic features are fed into ASP-based conv-capsule operations (ASPCaps) to explore the shapes of objects among the capsules in an adaptive manner, further exploring the potential of capsule networks. Finally, the class labels of image patches centered on test samples can be determined according to the fully connected capsule layer. Experiments on three public datasets demonstrate that ASPCNet can yield competitive performance with higher accuracies than state-of-the-art methods.[6]:Parallel Scale-wise Attention Network for Effective Scene Text  Recognition
Title: Parallel Scale-wise Attention Network for Effective Scene Text Recognition
Authors: Usman Sajid, Michael Chow, Jin Zhang, Taejoon Kim, Guanghui Wang
Link: https://arxiv.org/abs/2104.12076
摘要:The paper proposes a new text recognition network for scene-text images. Many state-of-the-art methods employ the attention mechanism either in the text encoder or decoder for the text alignment. Although the encoder-based attention yields promising results, these schemes inherit noticeable limitations. They perform the feature extraction (FE) and visual attention (VA) sequentially, which bounds the attention mechanism to rely only on the FE final single-scale output. Moreover, the utilization of the attention process is limited by only applying it directly to the single scale feature-maps. To address these issues, we propose a new multi-scale and encoder-based attention network for text recognition that performs the multi-scale FE and VA in parallel. The multi-scale channels also undergo regular fusion with each other to develop the coordinated knowledge together. Quantitative evaluation and robustness analysis on the standard benchmarks demonstrate that the proposed network outperforms the state-of-the-art in most cases.[7]:A deep learning model for gastric diffuse-type adenocarcinoma  classification in whole slide images
Title: A deep learning model for gastric diffuse-type adenocarcinoma classification in whole slide images
Authors: Fahdi Kanavati, Masayuki Tsuneki
Link: https://arxiv.org/abs/2104.12478
摘要:Gastric diffuse-type adenocarcinoma represents a disproportionately high percentage of cases of gastric cancers occurring in the young, and its relative incidence seems to be on the rise. Usually it affects the body of the stomach, and presents shorter duration and worse prognosis compared with the differentiated (intestinal) type adenocarcinoma. The main difficulty encountered in the differential diagnosis of gastric adenocarcinomas occurs with the diffuse-type. As the cancer cells of diffuse-type adenocarcinoma are often single and inconspicuous in a background desmoplaia and inflammation, it can often be mistaken for a wide variety of non-neoplastic lesions including gastritis or reactive endothelial cells seen in granulation tissue. In this study we trained deep learning models to classify gastric diffuse-type adenocarcinoma from WSIs. We evaluated the models on five test sets obtained from distinct sources, achieving receiver operator curve (ROC) area under the curves (AUCs) in the range of 0.95-0.99. The highly promising results demonstrate the potential of AI-based computational pathology for aiding pathologists in their diagnostic workflow system.
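For readers unfamiliar with the reported metric, the ROC AUC evaluation used above amounts to the following computation (the data here is synthetic and only demonstrates the call):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                                 # 1 = diffuse-type adenocarcinoma WSI
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, 200), 0, 1)     # per-slide predicted probability
auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"ROC AUC = {auc:.3f}")
```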

Tracking (2 papers)

[1]: A Novel Unified Stereo Stimuli based Binocular Eye-Tracking System for Accurate 3D Gaze Estimation
Title: A Novel Unified Stereo Stimuli based Binocular Eye-Tracking System for Accurate 3D Gaze Estimation
Authors: Sunjing Lin, Yu Liu, Shaochu Wang, Chang Li, Han Wang
Link: https://arxiv.org/abs/2104.12167
摘要:In addition to the high cost and complex setup, the main reason for the limitation of the three-dimensional (3D) display is the problem of accurately estimating the user's current point-of-gaze (PoG) in a 3D space. In this paper, we present a novel noncontact technique for the PoG estimation in a stereoscopic environment, which integrates a 3D stereoscopic display system and an eye-tracking system. The 3D stereoscopic display system can provide users with a friendly and immersive high-definition viewing experience without wearing any equipment. To accurately locate the user's 3D PoG in the field of view, we build a regression-based 3D eye-tracking model with the eye movement data and stereo stimulus videos as input. Besides, to train an optimal regression model, we also design and annotate a dataset that contains 30 users' eye-tracking data corresponding to two designed stereo test scenes. Innovatively, this dataset introduces feature vectors between eye region landmarks for the gaze vector estimation and a combined feature set for the gaze depth estimation. Moreover, five traditional regression models are trained and evaluated based on this dataset. Experimental results show that the average errors of the 3D PoG are about 0.90~cm on the X-axis, 0.83~cm on the Y-axis, and 1.48~cm$/$0.12~m along the Z-axis with the scene-depth range in 75~cm$/$8~m, respectively.[2]:Distractor-Aware Fast Tracking via Dynamic Convolutions and MOT  Philosophy
Title: Distractor-Aware Fast Tracking via Dynamic Convolutions and MOT Philosophy
Authors: Zikai Zhang, Bineng Zhong, Shengping Zhang, Zhenjun Tang, Xin Liu, Zhaoxiang Zhang
Link: https://arxiv.org/abs/2104.12041
摘要:A practical long-term tracker typically contains three key properties, i.e. an efficient model design, an effective global re-detection strategy and a robust distractor awareness mechanism. However, most state-of-the-art long-term trackers (e.g., Pseudo and re-detecting based ones) do not take all three key properties into account and therefore may either be time-consuming or drift to distractors. To address the issues, we propose a two-task tracking frame work (named DMTrack), which utilizes two core components (i.e., one-shot detection and re-identification (re-id) association) to achieve distractor-aware fast tracking via Dynamic convolutions (d-convs) and Multiple object tracking (MOT) philosophy. To achieve precise and fast global detection, we construct a lightweight one-shot detector using a novel dynamic convolutions generation method, which provides a unified and more flexible way for fusing target information into the search field. To distinguish the target from distractors, we resort to the philosophy of MOT to reason distractors explicitly by maintaining all potential similarities' tracklets. Benefited from the strength of high recall detection and explicit object association, our tracker achieves state-of-the-art performance on the LaSOT, OxUvA, TLP, VOT2018LT and VOT2019LT benchmarks and runs in real-time (3x faster than comparisons).
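The re-identification association that lets the tracker reason about distractors boils down to comparing embeddings. A heavily simplified sketch of that step (not the paper's full tracklet management): pick the detection whose re-id embedding is most similar to the target template, and treat everything below a similarity threshold as a distractor.

```python
import numpy as np

def cosine_similarity(a, b):
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

target_embed = np.random.rand(1, 128)        # re-id embedding of the tracked target (placeholder)
det_embeds = np.random.rand(5, 128)          # embeddings of detections in the current frame
sims = cosine_similarity(target_embed, det_embeds)[0]
best = int(np.argmax(sims))
is_target = sims[best] > 0.7                 # below threshold: all detections are kept as distractor tracklets
```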

Human Pose Estimation / Pose Estimation (1 paper)

[1]: Parallel mesh reconstruction streams for pose estimation of interacting hands
Title: Parallel mesh reconstruction streams for pose estimation of interacting hands
Authors: Uri Wollner, Guy Ben-Yosef
Link: https://arxiv.org/abs/2104.12123
摘要:We present a new multi-stream 3D mesh reconstruction network (MSMR-Net) for hand pose estimation from a single RGB image. Our model consists of an image encoder followed by a mesh-convolution decoder composed of connected graph convolution layers. In contrast to previous models that form a single mesh decoding path, our decoder network incorporates multiple cross-resolution trajectories that are executed in parallel. Thus, global and local information are shared to form rich decoding representations at minor additional parameter cost compared to the single trajectory network. We demonstrate the effectiveness of our method in hand-hand and hand-object interaction scenarios at various levels of interaction. To evaluate the former scenario, we propose a method to generate RGB images of closely interacting hands. Moreoever, we suggest a metric to quantify the degree of interaction and show that close hand interactions are particularly challenging. Experimental results show that the MSMR-Net outperforms existing algorithms on the hand-object FreiHAND dataset as well as on our own hand-hand dataset.

Zero/One-Shot, Transfer Learning & Domain Adaptation (3 papers)

[1]: Dynamic VAEs with Generative Replay for Continual Zero-shot Learning
Title: Dynamic VAEs with Generative Replay for Continual Zero-shot Learning
Authors: Subhankar Ghosh
Comments: 10 pages, 10 figures. arXiv admin note: text overlap with arXiv:2102.03778
Link: https://arxiv.org/abs/2104.12468
摘要:Continual zero-shot learning(CZSL) is a new domain to classify objects sequentially the model has not seen during training. It is more suitable than zero-shot and continual learning approaches in real-case scenarios when data may come continually with only attributes for a few classes and attributes and features for other classes. Continual learning(CL) suffers from catastrophic forgetting, and zero-shot learning(ZSL) models cannot classify objects like state-of-the-art supervised classifiers due to lack of actual data(or features) during training. This paper proposes a novel continual zero-shot learning (DVGR-CZSL) model that grows in size with each task and uses generative replay to update itself with previously learned classes to avoid forgetting. We demonstrate our hybrid model(DVGR-CZSL) outperforms the baselines and is effective on several datasets, i.e., CUB, AWA1, AWA2, and aPY. We show our method is superior in task sequentially learning with ZSL(Zero-Shot Learning). We also discuss our results on the SUN dataset.[2]:Adaptive Appearance Rendering
Title: Adaptive Appearance Rendering
Authors: Mengyao Zhai, Ruizhi Deng, Jiacheng Chen, Lei Chen, Zhiwei Deng, Greg Mori
Comments: Accepted to BMVC 2018. arXiv admin note: substantial text overlap with arXiv:1712.01955
Link: https://arxiv.org/abs/2104.11931
摘要:We propose an approach to generate images of people given a desired appearance and pose. Disentangled representations of pose and appearance are necessary to handle the compound variability in the resulting generated images. Hence, we develop an approach based on intermediate representations of poses and appearance: our pose-guided appearance rendering network firstly encodes the targets' poses using an encoder-decoder neural network. Then the targets' appearances are encoded by learning adaptive appearance filters using a fully convolutional network. Finally, these filters are placed in the encoder-decoder neural networks to complete the rendering. We demonstrate that our model can generate images and videos that are superior to state-of-the-art methods, and can handle pose guided appearance rendering in both image and video generation.[3]:Automatic Diagnosis of COVID-19 from CT Images using CycleGAN and  Transfer Learning
Title: Automatic Diagnosis of COVID-19 from CT Images using CycleGAN and Transfer Learning
Authors: Navid Ghassemi, Afshin Shoeibi, Marjane Khodatars, Jonathan Heras, Alireza Rahimi, Assef Zare, Ram Bilas Pachori, J. Manuel Gorriz
Link: https://arxiv.org/abs/2104.11949
摘要:The outbreak of the corona virus disease (COVID-19) has changed the lives of most people on Earth. Given the high prevalence of this disease, its correct diagnosis in order to quarantine patients is of the utmost importance in steps of fighting this pandemic. Among the various modalities used for diagnosis, medical imaging, especially computed tomography (CT) imaging, has been the focus of many previous studies due to its accuracy and availability. In addition, automation of diagnostic methods can be of great help to physicians. In this paper, a method based on pre-trained deep neural networks is presented, which, by taking advantage of a cyclic generative adversarial net (CycleGAN) model for data augmentation, has reached state-of-the-art performance for the task at hand, i.e., 99.60% accuracy. Also, in order to evaluate the method, a dataset containing 3163 images from 189 patients has been collected and labeled by physicians. Unlike prior datasets, normal data have been collected from people suspected of having COVID-19 disease and not from data from other diseases, and this database is made available publicly.
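The transfer-learning part of this pipeline follows the usual pattern of fine-tuning an ImageNet-pretrained backbone on the target CT data. A minimal sketch (the ResNet-18 backbone, learning rate and two-class head are assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)          # ImageNet-pretrained feature extractor
model.fc = nn.Linear(model.fc.in_features, 2)     # new head: COVID-19 vs. normal

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

ct_batch = torch.rand(4, 3, 224, 224)             # CT slices replicated to 3 channels (placeholder data)
labels = torch.tensor([0, 1, 1, 0])
optimizer.zero_grad()
loss = criterion(model(ct_batch), labels)
loss.backward()
optimizer.step()
```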

Temporal Action Detection & Video (4 papers)

[1]: Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Title: Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang
Link: https://arxiv.org/abs/2104.12671
摘要:Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval, and temporal action localization, showing state-of-the-art results on four different datasets.[2]:GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video  Summarization
Title: GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization
Authors: Jia-Hong Huang, Luka Murn, Marta Mrak, Marcel Worring
Comments: This paper is accepted by the ACM International Conference on Multimedia Retrieval (ICMR), 2021
Link: https://arxiv.org/abs/2104.12465
摘要:Traditional video summarization methods generate fixed video representations regardless of user interest. Therefore such methods limit users' expectations in content search and exploration scenarios. Multi-modal video summarization is one of the methods utilized to address this problem. When multi-modal video summarization is used to help video exploration, a text-based query is considered as one of the main drivers of video summary generation, as it is user-defined. Thus, encoding the text-based query and the video effectively are both important for the task of multi-modal video summarization. In this work, a new method is proposed that uses a specialized attention network and contextualized word representations to tackle this task. The proposed model consists of a contextualized video summary controller, multi-modal attention mechanisms, an interactive attention network, and a video summary generator. Based on the evaluation of the existing multi-modal video summarization benchmark, experimental results show that the proposed model is effective with the increase of +5.88% in accuracy and +4.06% increase of F1-score, compared with the state-of-the-art method.[3]:VCGAN: Video Colorization with Hybrid Generative Adversarial Network
标题:基于混合生成对抗网络的视频彩色化
作者:Yuzhi Zhao, Lai-Man Po, Wing-Yin Yu, Yasar Abbas Ur Rehman, Mengyang Liu, Yujia Zhang, Weifeng Ou
备注:Submitted Major Revision Manuscript of IEEE Transactions on Multimedia (TMM)
链接:https://arxiv.org/abs/2104.12357
摘要:We propose a hybrid recurrent Video Colorization with Hybrid Generative Adversarial Network (VCGAN), an improved approach to video colorization using end-to-end learning. The VCGAN addresses two prevalent issues in the video colorization domain: Temporal consistency and unification of colorization network and refinement network into a single architecture. To enhance colorization quality and spatiotemporal consistency, the mainstream of generator in VCGAN is assisted by two additional networks, i.e., global feature extractor and placeholder feature extractor, respectively. The global feature extractor encodes the global semantics of grayscale input to enhance colorization quality, whereas the placeholder feature extractor acts as a feedback connection to encode the semantics of the previous colorized frame in order to maintain spatiotemporal consistency. If changing the input for placeholder feature extractor as grayscale input, the hybrid VCGAN also has the potential to perform image colorization. To improve the consistency of far frames, we propose a dense long-term loss that smooths the temporal disparity of every two remote frames. Trained with colorization and temporal losses jointly, VCGAN strikes a good balance between color vividness and video continuity. Experimental results demonstrate that VCGAN produces higher-quality and temporally more consistent colorful videos than existing approaches.

[4]:Swimmer Stroke Rate Estimation From Overhead Race Video
标题:基于头顶比赛视频的游泳运动员划水率估计
作者:Timothy Woinoski, Ivan V. Bajić
备注:6 pages, 4 figures, to be presented at the IEEE ICME Workshop on Artificial Intelligence in Sports (AI-Sports), July 2021
链接:https://arxiv.org/abs/2104.12056
摘要:In this work, we propose a swimming analytics system for automatically determining swimmer stroke rates from overhead race video (ORV). General ORV is defined as any footage of swimmers in competition, taken for the purposes of viewing or analysis. Examples of this are footage from live streams, broadcasts, or specialized camera equipment, with or without camera motion. These are the most typical forms of swimming competition footage. We detail how to create a system that will automatically collect swimmer stroke rates in any competition, given the video of the competition of interest. With this information, better systems can be created and additions to our analytics system can be proposed to automatically extract other swimming metrics of interest.
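
摘要未给出划水率的具体算法;作为一般性示意(与论文实现无关的假设性草图),若已从跟踪结果提取出随划水周期变化的一维信号(例如手臂关键点的纵向位移),可用 FFT 求主频并换算为每分钟划水次数:

```python
import numpy as np

def stroke_rate_per_minute(signal: np.ndarray, fps: float) -> float:
    """由等间隔采样的周期性信号估计划水率(次/分钟)。"""
    signal = signal - signal.mean()                   # 去掉直流分量
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    dominant = freqs[1:][np.argmax(spectrum[1:])]     # 跳过零频后取主频
    return float(dominant * 60.0)
```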

Networks(14篇)

[1]:HAO: Hardware-aware neural Architecture Optimization for Efficient  Inference
标题:HAO:用于高效推理的硬件感知神经结构优化
作者:Zhen Dong, Yizhao Gao, Qijing Huang, John Wawrzynek, Hayden K.H. So, Kurt Keutzer
链接:https://arxiv.org/abs/2104.12766
摘要:Automatic algorithm-hardware co-design for DNN has shown great success in improving the performance of DNNs on FPGAs. However, this process remains challenging due to the intractable search space of neural network architectures and hardware accelerator implementation. Differing from existing hardware-aware neural architecture search (NAS) algorithms that rely solely on the expensive learning-based approaches, our work incorporates integer programming into the search algorithm to prune the design space. Given a set of hardware resource constraints, our integer programming formulation directly outputs the optimal accelerator configuration for mapping a DNN subgraph that minimizes latency. We use an accuracy predictor for different DNN subgraphs with different quantization schemes and generate accuracy-latency pareto frontiers. With low computational cost, our algorithm can generate quantized networks that achieve state-of-the-art accuracy and hardware performance on Xilinx Zynq (ZU3EG) FPGA for image classification on ImageNet dataset. The solution searched by our algorithm achieves 72.5% top-1 accuracy on ImageNet at framerate 50, which is 60% faster than MnasNet and 135% faster than FBNet with comparable accuracy.

[2]:Image Modeling with Deep Convolutional Gaussian Mixture Models
标题:基于深卷积高斯混合模型的图像建模
作者:Alexander Gepperth, Benedikt Pfülb
备注:accepted at IJCNN2021, 9 pages, 7 figures
链接:https://arxiv.org/abs/2104.12686
摘要:In this conceptual work, we present Deep Convolutional Gaussian Mixture Models (DCGMMs): a new formulation of deep hierarchical Gaussian Mixture Models (GMMs) that is particularly suitable for describing and generating images. Vanilla (i.e., flat) GMMs require a very large number of components to describe images well, leading to long training times and memory issues. DCGMMs avoid this by a stacked architecture of multiple GMM layers, linked by convolution and pooling operations. This allows to exploit the compositionality of images in a similar way as deep CNNs do. DCGMMs can be trained end-to-end by Stochastic Gradient Descent. This sets them apart from vanilla GMMs which are trained by Expectation-Maximization, requiring a prior k-means initialization which is infeasible in a layered structure. For generating sharp images with DCGMMs, we introduce a new gradient-based technique for sampling through non-invertible operations like convolution and pooling. Based on the MNIST and FashionMNIST datasets, we validate the DCGMMs model by demonstrating its superiority over flat GMMs for clustering, sampling and outlier detection.

[3]:CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment
标题:CompOFA:用于更快多平台部署的复合Once-For-All网络
作者:Manas Sahni, Shreya Varshini, Alind Khare, Alexey Tumanov
备注:Published as a conference paper at ICLR 2021
链接:https://arxiv.org/abs/2104.12642
摘要:The emergence of CNNs in mainstream deployment has necessitated methods to design and train efficient architectures tailored to maximize the accuracy under diverse hardware & latency constraints. To scale these resource-intensive tasks with an increasing number of deployment targets, Once-For-All (OFA) proposed an approach to jointly train several models at once with a constant training cost. However, this cost remains as high as 40-50 GPU days and also suffers from a combinatorial explosion of sub-optimal model configurations. We seek to reduce this search space -- and hence the training budget -- by constraining search to models close to the accuracy-latency Pareto frontier. We incorporate insights of compound relationships between model dimensions to build CompOFA, a design space smaller by several orders of magnitude. Through experiments on ImageNet, we demonstrate that even with simple heuristics we can achieve a 2x reduction in training time and 216x speedup in model search/extraction time compared to the state of the art, without loss of Pareto optimality! We also show that this smaller design space is dense enough to support equally accurate models for a similar diversity of hardware and latency targets, while also reducing the complexity of the training and subsequent extraction algorithms.

[4]:Learning from Event Cameras with Sparse Spiking Convolutional Neural Networks
标题:基于稀疏脉冲卷积神经网络的事件摄像机学习
作者:Loïc Cordone, Benoît Miramond, Sonia Ferrante
备注:Accepted to the International Joint Conference on Neural Networks (IJCNN) 2021
链接:https://arxiv.org/abs/2104.12579
摘要:Convolutional neural networks (CNNs) are now the de facto solution for computer vision problems thanks to their impressive results and ease of learning. These networks are composed of layers of connected units called artificial neurons, loosely modeling the neurons in a biological brain. However, their implementation on conventional hardware (CPU/GPU) results in high power consumption, making their integration on embedded systems difficult. In a car for example, embedded algorithms have very high constraints in term of energy, latency and accuracy. To design more efficient computer vision algorithms, we propose to follow an end-to-end biologically inspired approach using event cameras and spiking neural networks (SNNs). Event cameras output asynchronous and sparse events, providing an incredibly efficient data source, but processing these events with synchronous and dense algorithms such as CNNs does not yield any significant benefits. To address this limitation, we use spiking neural networks (SNNs), which are more biologically realistic neural networks where units communicate using discrete spikes. Due to the nature of their operations, they are hardware friendly and energy-efficient, but training them still remains a challenge. Our method enables the training of sparse spiking convolutional neural networks directly on event data, using the popular deep learning framework PyTorch. The performances in terms of accuracy, sparsity and training time on the popular DVS128 Gesture Dataset make it possible to use this bio-inspired approach for the future embedding of real-time applications on low-power neuromorphic hardware.

[5]:Inner-ear Augmented Metal Artifact Reduction with Simulation-based 3D Generative Adversarial Networks
标题:基于仿真的三维生成对抗网络内耳增强金属伪影消除
作者:Wang Zihao, Vandersteen Clair, Demarcy Thomas, Gnansia Dan, Raffaelli Charles, Guevara Nicolas, Delingette Herve
链接:https://arxiv.org/abs/2104.12510
摘要:Metal artifacts often create difficulties for a high quality visual assessment of post-operative imaging in computed tomography (CT). A vast body of methods have been proposed to tackle this issue, but these methods were designed for regular CT scans and their performance is usually insufficient when imaging tiny implants. In the context of post-operative high-resolution CT imaging, we propose a 3D metal artifact reduction algorithm based on a generative adversarial neural network. It is based on the simulation of physically realistic CT metal artifacts created by cochlea implant electrodes on preoperative images. The generated images serve to train a 3D generative adversarial network for artifact reduction. The proposed approach was assessed qualitatively and quantitatively on clinical conventional and cone-beam CT of cochlear implant postoperative images. These experiments show that the proposed method outperforms other general metal artifact reduction approaches.

[6]:3D Scene Compression through Entropy Penalized Neural Representation Functions
标题:基于熵惩罚神经表示函数的三维场景压缩
作者:Thomas Bird, Johannes Ballé, Saurabh Singh, Philip A. Chou
备注:accepted (in an abridged format) as a contribution to the Learning-based Image Coding special session of the Picture Coding Symposium 2021
链接:https://arxiv.org/abs/2104.12456
摘要:Some forms of novel visual media enable the viewer to explore a 3D scene from arbitrary viewpoints, by interpolating between a discrete set of original views. Compared to 2D imagery, these types of applications require much larger amounts of storage space, which we seek to reduce. Existing approaches for compressing 3D scenes are based on a separation of compression and rendering: each of the original views is compressed using traditional 2D image formats; the receiver decompresses the views and then performs the rendering. We unify these steps by directly compressing an implicit representation of the scene, a function that maps spatial coordinates to a radiance vector field, which can then be queried to render arbitrary viewpoints. The function is implemented as a neural network and jointly trained for reconstruction as well as compressibility, in an end-to-end manner, with the use of an entropy penalty on the parameters. Our method significantly outperforms a state-of-the-art conventional approach for scene compression, achieving simultaneously higher quality reconstructions and lower bitrates. Furthermore, we show that the performance at lower bitrates can be improved by jointly representing multiple scenes using a soft form of parameter sharing.

[7]:Vector Neurons: A General Framework for SO(3)-Equivariant Networks
标题:向量神经元:SO(3)-等变网络的一般框架
作者:Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, Leonidas Guibas
链接:https://arxiv.org/abs/2104.12229
摘要:Invariance and equivariance to the rotation group have been widely discussed in the 3D deep learning community for pointclouds. Yet most proposed methods either use complex mathematical tools that may limit their accessibility, or are tied to specific input data types and network architectures. In this paper, we introduce a general framework built on top of what we call Vector Neuron representations for creating SO(3)-equivariant neural networks for pointcloud processing. Extending neurons from 1D scalars to 3D vectors, our vector neurons enable a simple mapping of SO(3) actions to latent spaces thereby providing a framework for building equivariance in common neural operations -- including linear layers, non-linearities, pooling, and normalizations. Due to their simplicity, vector neurons are versatile and, as we demonstrate, can be incorporated into diverse network architecture backbones, allowing them to process geometry inputs in arbitrary poses. Despite its simplicity, our method performs comparably well in accuracy and generalization with other more complex and specialized state-of-the-art methods on classification and segmentation tasks. We also show for the first time a rotation equivariant reconstruction network.
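
依据摘要的描述,可以用如下极简草图说明"向量神经元"线性层的等变原理(示意代码,非官方实现):特征由标量提升为三维向量后,线性层只在通道维上做线性组合、不触碰三维坐标维,因此输入被旋转时输出随之旋转,即 SO(3) 等变。

```python
import torch
import torch.nn as nn

class VNLinear(nn.Module):
    """向量神经元线性层示意:输入形状 (B, C_in, 3, N),输出 (B, C_out, 3, N)。"""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_channels, in_channels) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 仅在通道维做线性组合,3 维坐标轴保持不变 => 旋转等变
        return torch.einsum("oc,bcdn->bodn", self.weight, x)

# 简单验证等变性:先旋转再过层 ≈ 先过层再旋转
if __name__ == "__main__":
    layer = VNLinear(8, 16)
    x = torch.randn(2, 8, 3, 128)
    Q, _ = torch.linalg.qr(torch.randn(3, 3))          # 随机正交矩阵
    out_of_rotated = layer(torch.einsum("ij,bcjn->bcin", Q, x))
    rotated_out = torch.einsum("ij,bcjn->bcin", Q, layer(x))
    print(torch.allclose(out_of_rotated, rotated_out, atol=1e-5))
```
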
[8]:Quantization of Deep Neural Networks for Accurate Edge Computing
标题:精确边缘计算的深层神经网络量化
作者:Wentao Chen, Hailong Qiu, Jian Zhuang, Chutong Zhang, Yu Hu, Qing Lu, Tianchen Wang, Yiyu Shi†, Meiping Huang, Xiaowe Xu
备注:11 pages, 3 figures, 10 tables, accepted by the ACM Journal on Emerging Technologies in Computing Systems (JETC)
链接:https://arxiv.org/abs/2104.12046
摘要:Deep neural networks (DNNs) have demonstrated their great potential in recent years, exceeding the performance of human experts in a wide range of applications. Due to their large sizes, however, compression techniques such as weight quantization and pruning are usually applied before they can be accommodated on the edge. It is generally believed that quantization leads to performance degradation, and plenty of existing works have explored quantization strategies aiming at minimum accuracy loss. In this paper, we argue that quantization, which essentially imposes regularization on weight representations, can sometimes help to improve accuracy. We conduct comprehensive experiments on three widely used applications: fully connected network (FCN) for biomedical image segmentation, convolutional neural network (CNN) for image classification on ImageNet, and recurrent neural network (RNN) for automatic speech recognition, and experimental results show that quantization can improve the accuracy by 1%, 1.95%, 4.23% on the three applications respectively with 3.5x-6.4x memory reduction.
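
作为背景说明,下面给出一个最朴素的逐张量对称均匀量化草图,用来演示"权重量化"这一操作本身(仅为示意,并非论文所采用的量化方案):

```python
import torch

def quantize_dequantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """逐张量对称均匀量化再反量化,模拟低比特权重的舍入效应。"""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax       # 缩放因子
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

@torch.no_grad()
def quantize_model_weights(model: torch.nn.Module, num_bits: int = 8) -> None:
    """对已训练模型的全部权重做最简单的后训练量化(示意)。"""
    for p in model.parameters():
        p.copy_(quantize_dequantize(p, num_bits))
```
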
[9]:Do All MobileNets Quantize Poorly? Gaining Insights into the Effect of Quantization on Depthwise Separable Convolutional Networks Through the Eyes of Multi-scale Distributional Dynamics
标题:所有的MobileNet量化效果都很差吗?从多尺度分布动力学的角度研究量化对深度可分离卷积网络的影响
作者:Stone Yun, Alexander Wong
备注:Accepted for publication in Mobile AI (MAI) Workshop 2021 at CVPR
链接:https://arxiv.org/abs/2104.11849
摘要:As the "Mobile AI" revolution continues to grow, so does the need to understand the behaviour of edge-deployed deep neural networks. In particular, MobileNets are the go-to family of deep convolutional neural networks (CNN) for mobile. However, they often have significant accuracy degradation under post-training quantization. While studies have introduced quantization-aware training and other methods to tackle this challenge, there is limited understanding into why MobileNets (and potentially depthwise-separable CNNs (DWSCNN) in general) quantize so poorly compared to other CNN architectures. Motivated to gain deeper insights into this phenomenon, we take a different strategy and study the multi-scale distributional dynamics of MobileNet-V1, a set of smaller DWSCNNs, and regular CNNs. Specifically, we investigate the impact of quantization on the weight and activation distributional dynamics as information propagates from layer to layer, as well as overall changes in distributional dynamics at the network level. This fine-grained analysis revealed significant dynamic range fluctuations and a "distributional mismatch" between channelwise and layerwise distributions in DWSCNNs that lead to increasing quantized degradation and distributional shift during information propagation. Furthermore, analysis of the activation quantization errors show that there is greater quantization error accumulation in DWSCNN compared to regular CNNs. The hope is that such insights can lead to innovative strategies for reducing such distributional dynamics changes and improve post-training quantization for mobile.

[10]:Good Artists Copy, Great Artists Steal: Model Extraction Attacks Against Image Translation Generative Adversarial Networks
标题:好的艺术家复制,伟大的艺术家窃取:针对图像翻译生成对抗网络的模型提取攻击
作者:Sebastian Szyller, Vasisht Duddu, Tommi Gröndahl, N. Asokan
备注:9 pages, 7 figures
链接:https://arxiv.org/abs/2104.12623
摘要:Machine learning models are typically made available to potential client users via inference APIs. Model extraction attacks occur when a malicious client uses information gleaned from queries to the inference API of a victim model $F_V$ to build a surrogate model $F_A$ that has comparable functionality. Recent research has shown successful model extraction attacks against image classification, and NLP models. In this paper, we show the first model extraction attack against real-world generative adversarial network (GAN) image translation models. We present a framework for conducting model extraction attacks against image translation models, and show that the adversary can successfully extract functional surrogate models. The adversary is not required to know $F_V$'s architecture or any other information about it beyond its intended image translation task, and queries $F_V$'s inference interface using data drawn from the same domain as the training data for $F_V$. We evaluate the effectiveness of our attacks using three different instances of two popular categories of image translation: (1) Selfie-to-Anime and (2) Monet-to-Photo (image style transfer), and (3) Super-Resolution (super resolution). Using standard performance metrics for GANs, we show that our attacks are effective in each of the three cases -- the differences between $F_V$ and $F_A$, compared to the target are in the following ranges: Selfie-to-Anime: FID $13.36-68.66$, Monet-to-Photo: FID $3.57-4.40$, and Super-Resolution: SSIM: $0.06-0.08$ and PSNR: $1.43-4.46$. Furthermore, we conducted a large scale (125 participants) user study on Selfie-to-Anime and Monet-to-Photo to show that human perception of the images produced by the victim and surrogate models can be considered equivalent, within an equivalence bound of Cohen's $d=0.3$.

[11]:Multi-Cycle-Consistent Adversarial Networks for Edge Denoising of Computed Tomography Images
标题:用于CT图像边缘去噪的多循环一致对抗网络
作者:Xiaowe Xu, Jiawei Zhang, Jinglan Liu, Yukun Ding, Tianchen Wang, Hailong Qiu, Haiyun Yuan, Jian Zhuang, Wen Xie, Yuhao Dong, Qianjun Jia, Meiping Huang, Yiyu Shi
备注:16 pages, 7 figures, 4 tables, accepted by the ACM Journal on Emerging Technologies in Computing Systems (JETC). arXiv admin note: substantial text overlap with arXiv:2002.12130
链接:https://arxiv.org/abs/2104.12044
摘要:As one of the most commonly ordered imaging tests, computed tomography (CT) scan comes with inevitable radiation exposure that increases the cancer risk to patients. However, CT image quality is directly related to radiation dose, thus it is desirable to obtain high-quality CT images with as little dose as possible. CT image denoising tries to obtain high dose like high-quality CT images (domain X) from low dose low-quality CT images (domain Y), which can be treated as an image-to-image translation task where the goal is to learn the transform between a source domain X (noisy images) and a target domain Y (clean images). In this paper, we propose a multi-cycle-consistent adversarial network (MCCAN) that builds intermediate domains and enforces both local and global cycle-consistency for edge denoising of CT images. The global cycle-consistency couples all generators together to model the whole denoising process, while the local cycle-consistency imposes effective supervision on the process between adjacent domains. Experiments show that both local and global cycle-consistency are important for the success of MCCAN, which outperforms CCADN in terms of denoising quality with slightly less computation resource consumption.

[12]:Deep Convolutional Neural Network for Non-rigid Image Registration
标题:基于深度卷积神经网络的非刚性图像配准
作者:Eduard F. Durech
备注:10 pages, 11 figures
链接:https://arxiv.org/abs/2104.12034
摘要:Images taken at different times or positions undergo transformations such as rotation, scaling, skewing, and more. The process of aligning different images which have undergone transformations can be done via registration. Registration is desirable when analyzing time-series data for tracking, averaging, or differential diagnoses of diseases. Efficient registration methods exist for rigid (including linear or affine) transformations; however, for non-rigid (also known as non-affine) transformations, current methods are computationally expensive and time-consuming. In this report, I will explore the ability of a deep neural network (DNN) and, more specifically, a deep convolutional neural network (CNN) to efficiently perform non-rigid image registration. The experimental results show that a CNN can be used for efficient non-rigid image registration and in significantly less computational time than a conventional Diffeomorphic Demons or Pyramiding approach.

[13]:On the stability of deep convolutional neural networks under irregular or random deformations
标题:不规则或随机变形下深卷积神经网络的稳定性
作者:Fabio Nicola, S. Ivan Trapasso
备注:36 pages, 6 figures, 2 tables
链接:https://arxiv.org/abs/2104.11977
摘要:The problem of robustness under location deformations for deep convolutional neural networks (DCNNs) is of great theoretical and practical interest. This issue has been studied in pioneering works, especially for scattering-type architectures, for deformation vector fields $\tau(x)$ with some regularity - at least $C^1$. Here we address this issue for any field $\tau\in L^\infty(\mathbb{R}^d;\mathbb{R}^d)$, without any additional regularity assumption, hence including the case of wild irregular deformations such as a noise on the pixel location of an image. We prove that for signals in multiresolution approximation spaces $U_s$ at scale $s$, whenever the network is Lipschitz continuous (regardless of its architecture), stability in $L^2$ holds in the regime $\|\tau\|_{L^\infty}/s\ll 1$, essentially as a consequence of the uncertainty principle. When $\|\tau\|_{L^\infty}/s\gg 1$ instability can occur even for well-structured DCNNs such as the wavelet scattering networks, and we provide a sharp upper bound for the asymptotic growth rate. The stability results are then extended to signals in the Besov space $B^{d/2}_{2,1}$ tailored to the given multiresolution approximation. We also consider the case of more general time-frequency deformations. Finally, we provide stochastic versions of the aforementioned results, namely we study the issue of stability in mean when $\tau(x)$ is modeled as a random field (not bounded, in general) with identically distributed variables $|\tau(x)|$, $x\in\mathbb{R}^d$.

[14]:EXplainable Neural-Symbolic Learning (X-NeSyL) methodology to fuse deep learning representations with expert knowledge graphs: the MonuMAI cultural heritage use case
标题:可解释神经符号学习(X-NeSyL)方法将深度学习表示与专家知识图融合:MonuMAI文化遗产用例
作者:Natalia Díaz-Rodríguez, Alberto Lamas, Jules Sanchez, Gianni Franchi, Ivan Donadello, Siham Tabik, David Filliat, Policarpo Cruz, Rosana Montes, Francisco Herrera
备注:Under review
链接:https://arxiv.org/abs/2104.11914
摘要:The latest Deep Learning (DL) models for detection and classification have achieved an unprecedented performance over classical machine learning algorithms. However, DL models are black-box methods hard to debug, interpret, and certify. DL alone cannot provide explanations that can be validated by a non technical audience. In contrast, symbolic AI systems that convert concepts into rules or symbols -- such as knowledge graphs -- are easier to explain. However, they present lower generalisation and scaling capabilities. A very important challenge is to fuse DL representations with expert knowledge. One way to address this challenge, as well as the performance-explainability trade-off is by leveraging the best of both streams without obviating domain expert knowledge. We tackle such problem by considering the symbolic knowledge is expressed in form of a domain expert knowledge graph. We present the eXplainable Neural-symbolic learning (X-NeSyL) methodology, designed to learn both symbolic and deep representations, together with an explainability metric to assess the level of alignment of machine and human expert explanations. The ultimate objective is to fuse DL representations with expert domain knowledge during the learning process to serve as a sound basis for explainability. X-NeSyL methodology involves the concrete use of two notions of explanation at inference and training time respectively: 1) EXPLANet: Expert-aligned eXplainable Part-based cLAssifier NETwork Architecture, a compositional CNN that makes use of symbolic representations, and 2) SHAP-Backprop, an explainable AI-informed training procedure that guides the DL process to align with such symbolic representations in form of knowledge graphs. We showcase X-NeSyL methodology using MonuMAI dataset for monument facade image classification, and demonstrate that our approach improves explainability and performance.

GAN、图像文本生成(6篇)

[1]:CAGAN: Text-To-Image Generation with Combined Attention GANs
标题:CAGAN:基于联合注意机制的文本图像生成
作者:Henning Schulze, Dogucan Yaman, Alexander Waibel
链接:https://arxiv.org/abs/2104.12663
摘要:Generating images according to natural language descriptions is a challenging task. In this work, we propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions. The proposed CAGAN utilises two attention models: word attention to draw different sub-regions conditioned on related words; and squeeze-and-excitation attention to capture non-linear interaction among channels. With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art on the IS and FID on the CUB dataset and the FID on the more challenging COCO dataset. Furthermore, we demonstrate that judging a model by a single evaluation metric can be misleading by developing an additional model adding local self-attention which scores a higher IS, outperforming the state of the art on the CUB dataset, but generates unrealistic images through feature repetition.

[2]:Synthetic 3D Data Generation Pipeline for Geometric Deep Learning in Architecture
标题:面向建筑领域几何深度学习的合成三维数据生成管道
作者:Stanislava Fedorova, Alberto Tono, Meher Shashwat Nigam, Jiayao Zhang, Amirhossein Ahmadnia, Cecilia Bolognesi, Dominik L. Michels
备注:Project Page: this https URL
链接:https://arxiv.org/abs/2104.12564
摘要:With the growing interest in deep learning algorithms and computational design in the architectural field, the need for large, accessible and diverse architectural datasets increases. We decided to tackle this problem by constructing a field-specific synthetic data generation pipeline that generates an arbitrary amount of 3D data along with the associated 2D and 3D annotations. The variety of annotations, the flexibility to customize the generated building and dataset parameters make this framework suitable for multiple deep learning tasks, including geometric deep learning that requires direct 3D supervision. Creating our building data generation pipeline we leveraged architectural knowledge from experts in order to construct a framework that would be modular, extendable and would provide a sufficient amount of class-balanced data samples. Moreover, we purposefully involve the researcher in the dataset customization allowing the introduction of additional building components, material textures, building classes, number and type of annotations as well as the number of views per 3D model sample. In this way, the framework would satisfy different research requirements and would be adaptable to a large variety of tasks. All code and data are made publicly available.

[3]:Generative modeling of spatio-temporal weather patterns with extreme event conditioning
标题:极端事件条件下的时空天气模式生成模拟
作者:Konstantin Klemmer, Sudipan Saha, Matthias Kahl, Tianlin Xu, Xiao Xiang Zhu
备注:ICLR'21 Workshop AI: Modeling Oceans and Climate Change (AIMOCC)
链接:https://arxiv.org/abs/2104.12469
摘要:Deep generative models are increasingly used to gain insights in the geospatial data domain, e.g., for climate data. However, most existing approaches work with temporal snapshots or assume 1D time-series; few are able to capture spatio-temporal processes simultaneously. Beyond this, Earth-systems data often exhibit highly irregular and complex patterns, for example caused by extreme weather events. Because of climate change, these phenomena are only increasing in frequency. Here, we proposed a novel GAN-based approach for generating spatio-temporal weather patterns conditioned on detected extreme events. Our approach augments GAN generator and discriminator with an encoded extreme weather event segmentation mask. These segmentation masks can be created from raw input using existing event detection frameworks. As such, our approach is highly modular and can be combined with custom GAN architectures. We highlight the applicability of our proposed approach in experiments with real-world surface radiation and zonal wind data.

[4]:3D Adversarial Attacks Beyond Point Cloud
标题:点云以外的三维对抗攻击
作者:Jinlai Zhang, Lyujie Chen, Binbin Liu, Bo Ouyang, Qizhi Xie, Jihong Zhu, Yanmei Meng
备注:8 pages, 3 figs
链接:https://arxiv.org/abs/2104.12146
摘要:Previous adversarial attacks on 3D point clouds mainly focus on adding perturbations to the original point cloud, but the generated adversarial point cloud example does not strictly represent a 3D object in the physical world and has lower transferability, or is easily defended by simple SRS/SOR. In this paper, we present a novel adversarial attack, named Mesh Attack, to address this problem. Specifically, we perform perturbation on the mesh instead of point clouds and obtain the adversarial mesh examples and point cloud examples simultaneously. To generate adversarial examples, we use a differential sample module that back-propagates the loss of the point cloud classifier to the mesh vertices and a mesh loss that regularizes the mesh to be smooth. Extensive experiments demonstrated that the proposed scheme outperforms the SOTA attack methods. Our code is available at: this https URL.

[5]:Piggyback GAN: Efficient Lifelong Learning for Image Conditioned Generation
标题:背驮式GAN:图像条件生成的有效终身学习
作者:Mengyao Zhai, Lei Chen, Jiawei He, Megha Nawhal, Frederick Tung, Greg Mori
备注:Accepted to ECCV 2020
链接:https://arxiv.org/abs/2104.11939
摘要:Humans accumulate knowledge in a lifelong fashion. Modern deep neural networks, on the other hand, are susceptible to catastrophic forgetting: when adapted to perform new tasks, they often fail to preserve their performance on previously learned tasks. Given a sequence of tasks, a naive approach addressing catastrophic forgetting is to train a separate standalone model for each task, which scales the total number of parameters drastically without efficiently utilizing previous models. In contrast, we propose a parameter efficient framework, Piggyback GAN, which learns the current task by building a set of convolutional and deconvolutional filters that are factorized into filters of the models trained on previous tasks. For the current task, our model achieves high generation quality on par with a standalone model at a lower number of parameters. For previous tasks, our model can also preserve generation quality since the filters for previous tasks are not altered. We validate Piggyback GAN on various image-conditioned generation tasks across different domains, and provide qualitative and quantitative results to show that the proposed approach can address catastrophic forgetting effectively and efficiently.

[6]:Ensembles of GANs for synthetic training data generation
标题:用于合成训练数据生成的GAN集成
作者:Gabriel Eilertsen, Apostolia Tsirikoglou, Claes Lundström, Jonas Unger
备注:ICLR 2021 workshop on Synthetic Data Generation: Quality, Privacy, Bias
链接:https://arxiv.org/abs/2104.11797
摘要:Insufficient training data is a major bottleneck for most deep learning practices, not least in medical imaging where data is difficult to collect and publicly available datasets are scarce due to ethics and privacy. This work investigates the use of synthetic images, created by generative adversarial networks (GANs), as the only source of training data. We demonstrate that for this application, it is of great importance to make use of multiple GANs to improve the diversity of the generated data, i.e. to sufficiently cover the data distribution. While a single GAN can generate seemingly diverse image content, training on this data in most cases lead to severe over-fitting. We test the impact of ensembled GANs on synthetic 2D data as well as common image datasets (SVHN and CIFAR-10), and using both DCGANs and progressively growing GANs. As a specific use case, we focus on synthesizing digital pathology patches to provide anonymized training data.
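
其核心思路可用如下草图说明(假设每个生成器都是"噪声→图像"的 PyTorch 模块,接口与维度均为示意):从多个独立训练的 GAN 中均匀采样,汇总成一份覆盖面更广的合成训练集。

```python
import torch

@torch.no_grad()
def sample_from_ensemble(generators, n_samples: int, z_dim: int = 128, device="cuda"):
    """从多个独立训练的生成器中均匀抽样,汇总为一份合成训练集(接口为假设)。"""
    per_gan = n_samples // len(generators)
    batches = []
    for g in generators:
        g = g.to(device).eval()
        z = torch.randn(per_gan, z_dim, device=device)
        batches.append(g(z).cpu())                     # 假设生成器调用形式为 g(z) -> 图像张量
    return torch.cat(batches, dim=0)
```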

点云、三维重建(1篇)

[1]:Points2Sound: From mono to binaural audio using 3D point cloud scenes
标题:Points2Sound:使用3D点云场景从单声道到双耳音频的转换
作者:Francesc Lluís, Vasileios Chatziioannou, Alex Hofmann
备注:Demo: this https URL
链接:https://arxiv.org/abs/2104.12462
摘要:Binaural sound that matches the visual counterpart is crucial to bring meaningful and immersive experiences to people in augmented reality (AR) and virtual reality (VR) applications. Recent works have shown the possibility to generate binaural audio from mono using 2D visual information as guidance. Using 3D visual information may allow for a more accurate representation of a virtual audio scene for VR/AR applications. This paper proposes Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes. Specifically, Points2Sound consist of a vision network which extracts visual features from the point cloud scene to condition an audio network, which operates in the waveform domain, to synthesize the binaural version. Both quantitative and perceptual evaluations indicate that our proposed model is preferred over a reference case, based on a recent 2D mono-to-binaural model.

人群计数(1篇)

[1]:Dense Point Prediction: A Simple Baseline for Crowd Counting and  Localization
标题:稠密点预测:人群计数和定位的简单基线
作者:Yi Wang, Xinyu Hou, Lap-Pui Chau
备注:6 pages. Accepted at ICME Workshop 2021
链接:https://arxiv.org/abs/2104.12505
摘要:In this paper, we propose a simple yet effective crowd counting and localization network named SCALNet. Unlike most existing works that separate the counting and localization tasks, we consider those tasks as a pixel-wise dense prediction problem and integrate them into an end-to-end framework. Specifically, for crowd counting, we adopt a counting head supervised by the Mean Square Error (MSE) loss. For crowd localization, the key insight is to recognize the keypoint of people, i.e., the center point of heads. We propose a localization head to distinguish dense crowds trained by two loss functions, i.e., Negative-Suppressed Focal (NSF) loss and False-Positive (FP) loss, which balances the positive/negative examples and handles the false-positive predictions. Experiments on the recent and large-scale benchmark, NWPU-Crowd, show that our approach outperforms the state-of-the-art methods by more than 5% and 10% improvement in crowd localization and counting tasks, respectively. The code is publicly available at this https URL.
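
以其中的计数分支为例,逐像素密度图回归的监督与推理方式可示意如下(通用草图;定位分支的 NSF/FP 损失细节以原文为准):

```python
import torch
import torch.nn.functional as F

def counting_loss(pred_density: torch.Tensor, gt_density: torch.Tensor) -> torch.Tensor:
    """计数头的 MSE 监督:预测密度图与真值密度图逐像素回归。"""
    return F.mse_loss(pred_density, gt_density)

def crowd_count(pred_density: torch.Tensor) -> torch.Tensor:
    """推理时,人数即密度图在空间维上的求和。"""
    return pred_density.sum(dim=(-2, -1))
```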

数据集(1篇)

[1]:Machine Learning based Lie Detector applied to a Collected and Annotated  Dataset
标题:基于机器学习的测谎器在自行收集和标注数据集上的应用
作者:Nuria Rodriguez-Diaz, Decky Aspandi, Federico Sukno, Xavier Binefa
链接:https://arxiv.org/abs/2104.12345
摘要:Lie detection is considered a concern for everyone in their day to day life given its impact on human interactions. Hence, people are normally not only pay attention to what their interlocutors are saying but also try to inspect their visual appearances, including faces, to find any signs that indicate whether the person is telling the truth or not. Unfortunately to date, the automatic lie detection, which may help us to understand this lying characteristics are still fairly limited. Mainly due to lack of a lie dataset and corresponding evaluations. In this work, we have collected a dataset that contains annotated images and 3D information of different participants faces during a card game that incentivise the lying. Using our collected dataset, we evaluated several types of machine learning based lie detector through generalize, personal and cross lie lie experiments. In these experiments, we showed the superiority of deep learning based model in recognizing the lie with best accuracy of 57\% for generalized task and 63\% when dealing with a single participant. Finally, we also highlight the limitation of the deep learning based lie detector when dealing with different types of lie tasks.

医学生物应用(1篇)

[1]:Recalibration of Aleatoric and Epistemic Regression Uncertainty in  Medical Imaging
标题:医学影像中偶然与认知回归不确定性的重新校准
作者:Max-Heinrich Laves, Sontje Ihler, Jacob F. Fast, Lüder A. Kahrs, Tobias Ortmaier
备注:Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL
链接:https://arxiv.org/abs/2104.12376
摘要:The consideration of predictive uncertainty in medical imaging with deep learning is of utmost importance. We apply estimation of both aleatoric and epistemic uncertainty by variational Bayesian inference with Monte Carlo dropout to regression tasks and show that predictive uncertainty is systematically underestimated. We apply $ \sigma $ scaling with a single scalar value; a simple, yet effective calibration method for both types of uncertainty. The performance of our approach is evaluated on a variety of common medical regression data sets using different state-of-the-art convolutional network architectures. In our experiments, $ \sigma $ scaling is able to reliably recalibrate predictive uncertainty. It is easy to implement and maintains the accuracy. Well-calibrated uncertainty in regression allows robust rejection of unreliable predictions or detection of out-of-distribution samples. Our source code is available at this https URL
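
σ 缩放的做法可以示意如下:在校准集上寻找单个标量 s,使以 s·σ 为标准差的高斯负对数似然最小,从而整体放大被低估的预测不确定性(依据摘要编写的示意代码,非官方实现):

```python
import torch

def fit_sigma_scale(mu: torch.Tensor, sigma: torch.Tensor, y: torch.Tensor,
                    steps: int = 500, lr: float = 1e-2) -> float:
    """在校准集上优化单个标量 s,最小化 N(y; mu, (s*sigma)^2) 的负对数似然。"""
    log_s = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_s], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s = log_s.exp()
        nll = (torch.log(s * sigma) + 0.5 * ((y - mu) / (s * sigma)) ** 2).mean()
        nll.backward()
        opt.step()
    return float(log_s.exp())

# 用法:校准后的不确定性为 s * sigma_raw
```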

其他(36篇)

[1]:InfographicVQA
标题:信息图表VQA
作者:Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, C.V Jawahar
链接:https://arxiv.org/abs/2104.12756
摘要:Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements. In this work, we explore the automatic understanding of infographic images by using Visual Question Answering. To this end, we present InfographicVQA, a new dataset that comprises a diverse collection of infographics along with natural language questions and answers annotations. The collected questions require methods to jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with emphasis on questions that require elementary reasoning and basic arithmetic skills. Finally, we evaluate two strong baselines based on state of the art multi-modal VQA models, and establish baseline performance for the new task. The dataset, code and leaderboard will be made available at this http URL

[2]:Improve Vision Transformers Training by Suppressing Over-smoothing
标题:通过抑制过平滑改进视觉Transformer的训练
作者:Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, Qiang Liu
备注:preprint
链接:https://arxiv.org/abs/2104.12753
摘要:Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks. However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results. As a result, recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks. This work investigates how to stabilize the training of vision transformers without special structure modification. We observe that the instability of transformer training on vision tasks can be attributed to the over-smoothing problem, that the self-attention layers tend to map the different patches from the input image into a similar latent representation, hence yielding the loss of information and degeneration of performance, especially when the number of layers is large. We then propose a number of techniques to alleviate this problem, including introducing additional loss functions to encourage diversity, prevent loss of information, and discriminate different patches by additional patch classification loss for Cutmix. We show that our proposed techniques stabilize the training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on ImageNet validation set without introducing extra teachers or additional convolution layers. Our code will be made publicly available at this https URL.
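
"过平滑"可以用 patch token 之间的平均两两余弦相似度来度量:层越深、相似度越接近 1,说明不同 patch 的表示越趋同。下面的函数是一种示意性度量(非论文官方指标实现),也可作为鼓励多样性的惩罚项使用:

```python
import torch
import torch.nn.functional as F

def patch_similarity(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (B, N, D) 的 patch 表示;返回样本内 patch 两两余弦相似度的均值。"""
    t = F.normalize(tokens, dim=-1)
    sim = t @ t.transpose(1, 2)                         # (B, N, N) 余弦相似度矩阵
    n = sim.shape[-1]
    off_diag = sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(dim=-1)
    return (off_diag / (n * (n - 1))).mean()            # 去掉对角线后取均值
```
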
[3]:Joint Representation Learning and Novel Category Discovery on Single- and Multi-modal Data
标题:单模态和多模态数据的联合表示学习与新类别发现
作者:Xuihui Jia, Kai Han, Yukun Zhu, Bradley Green
链接:https://arxiv.org/abs/2104.12673
摘要:This paper studies the problem of novel category discovery on single- and multi-modal data with labels from different but relevant categories. We present a generic, end-to-end framework to jointly learn a reliable representation and assign clusters to unlabelled data. To avoid over-fitting the learnt embedding to labelled data, we take inspiration from self-supervised representation learning by noise-contrastive estimation and extend it to jointly handle labelled and unlabelled data. In particular, we propose using category discrimination on labelled data and cross-modal discrimination on multi-modal data to augment instance discrimination used in conventional contrastive learning approaches. We further employ Winner-Take-All (WTA) hashing algorithm on the shared representation space to generate pairwise pseudo labels for unlabelled data to better predict cluster assignments. We thoroughly evaluate our framework on large-scale multi-modal video benchmarks Kinetics-400 and VGG-Sound, and image benchmarks CIFAR10, CIFAR100 and ImageNet, obtaining state-of-the-art results.
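
其中 Winner-Take-All (WTA) 哈希的基本做法是:对特征向量做若干次随机置换,每次取前 K 个分量中最大者的下标作为一个哈希码;两个样本的哈希码重合比例足够高时给出正伪标签。以下为示意实现(阈值等参数均为假设):

```python
import numpy as np

def wta_hash(x: np.ndarray, perms: np.ndarray, k: int = 4) -> np.ndarray:
    """x: (D,) 特征;perms: (H, D) 随机置换;返回 H 个 WTA 哈希码。"""
    return np.array([int(np.argmax(x[p[:k]])) for p in perms])

def pairwise_pseudo_label(x1, x2, perms, k: int = 4, thresh: float = 0.5) -> int:
    """两个样本的 WTA 码重合比例超过阈值则给出正伪标签(1),否则为负(0)。"""
    h1, h2 = wta_hash(x1, perms, k), wta_hash(x2, perms, k)
    return int((h1 == h2).mean() >= thresh)

# 生成 H 个随机置换:perms = np.stack([np.random.permutation(D) for _ in range(H)])
```
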
[4]:Exploiting Explanations for Model Inversion Attacks
标题:利用解释实施模型反转攻击
作者:Xuejun Zhao, Wencan Zhang, Xiaokui Xiao, Brian Y. Lim
链接:https://arxiv.org/abs/2104.12669
摘要:The successful deployment of artificial intelligence (AI) in many domains from healthcare to hiring requires their responsible use, particularly in model explanations and privacy. Explainable artificial intelligence (XAI) provides more information to help users to understand model decisions, yet this additional knowledge exposes additional risks for privacy attacks. Hence, providing explanation harms privacy. We study this risk for image-based model inversion attacks and identified several attack architectures with increasing performance to reconstruct private image data from model explanations. We have developed several multi-modal transposed CNN architectures that achieve significantly higher inversion performance than using the target model prediction only. These XAI-aware inversion models were designed to exploit the spatial knowledge in image explanations. To understand which explanations have higher privacy risk, we analyzed how various explanation types and factors influence inversion performance. In spite of some models not providing explanations, we further demonstrate increased inversion performance even for non-explainable target models by exploiting explanations of surrogate models through attention transfer. This method first inverts an explanation from the target prediction, then reconstructs the target image. These threats highlight the urgent and significant privacy risks of explanations and calls attention for new privacy preservation techniques that balance the dual-requirement for AI explainability and privacy.

[5]:Appearance-based Gaze Estimation With Deep Learning: A Review and Benchmark
标题:基于表象的深度学习注视估计:回顾与基准
作者:Yihua Cheng, Haofei Wang, Yiwei Bao, Feng Lu
链接:https://arxiv.org/abs/2104.12668
摘要:Gaze estimation reveals where a person is looking. It is an important clue for understanding human intention. The recent development of deep learning has revolutionized many computer vision tasks, the appearance-based gaze estimation is no exception. However, it lacks a guideline for designing deep learning algorithms for gaze estimation tasks. In this paper, we present a comprehensive review of the appearance-based gaze estimation methods with deep learning. We summarize the processing pipeline and discuss these methods from four perspectives: deep feature extraction, deep neural network architecture design, personal calibration as well as device and platform. Since the data pre-processing and post-processing methods are crucial for gaze estimation, we also survey face/eye detection method, data rectification method, 2D/3D gaze conversion method, and gaze origin conversion method. To fairly compare the performance of various gaze estimation approaches, we characterize all the publicly available gaze estimation datasets and collect the code of typical gaze estimation algorithms. We implement these codes and set up a benchmark of converting the results of different methods into the same evaluation metrics. This paper not only serves as a reference to develop deep learning-based gaze estimation methods but also a guideline for future gaze estimation research. Implemented methods and data processing codes are available at this http URL.

[6]:Green View Index Analysis and Optimal Green View Index Path Based on Street View and Deep Learning
标题:基于街景和深度学习的绿景指数分析及最优绿景指数路径
作者:Anqi Hu, Jiahao Zhang, Hiroyuki Kaga
备注:8 pages, 9 figures
链接:https://arxiv.org/abs/2104.12627
摘要:Streetscapes are an important part of the urban landscape, analysing and studying them can increase the understanding of the cities' infrastructure, which can lead to better planning and design of the urban living environment. In this paper, we used Google API to obtain street view images of Osaka City. The semantic segmentation model PSPNet is used to segment the Osaka City street view images and analyse the Green View Index (GVI) data of Osaka area. Based on the GVI data, three methods, namely corridor analysis, geometric network and a combination of them, were then used to calculate the optimal GVI paths in Osaka City. The corridor analysis and geometric network methods allow for a more detailed delineation of the optimal GVI path from general areas to specific routes. Our analysis not only allows for the calculation of specific routes for the optimal GVI paths, but also allows for the visualisation and integration of neighbourhood landscape data. By summarising all the data, a more specific and objective analysis of the landscape in the study area can be carried out and based on this, the available natural resources can be maximised for a better life.
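
绿视率(GVI)本身的计算很直接:对街景图像做语义分割后,统计植被类像素占全图像素的比例。示意代码如下(植被类别编号需按所用分割模型的标签表确定,此处仅为假设):

```python
import numpy as np

# 假设分割结果为 (H, W) 的类别标签图;植被相关类别编号为示例值
VEGETATION_CLASSES = {8, 9}   # 例如:树木、草地

def green_view_index(seg_map: np.ndarray) -> float:
    """GVI = 植被像素数 / 总像素数。"""
    green = np.isin(seg_map, list(VEGETATION_CLASSES)).sum()
    return float(green) / seg_map.size
```
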
[7]:Detecting and Matching Related Objects with One Proposal Multiple Predictions
标题:基于单提议多预测的相关目标检测与匹配
作者:Yang Liu, Luiz G. Hafemann, Michael Jamieson, Mehrsan Javan
备注:CVPR workshop 2021
链接:https://arxiv.org/abs/2104.12574
摘要:Tracking players in sports videos is commonly done in a tracking-by-detection framework, first detecting players in each frame, and then performing association over time. While for some sports tracking players is sufficient for game analysis, sports like hockey, tennis and polo may require additional detections, that include the object the player is holding (e.g. racket, stick). The baseline solution for this problem involves detecting these objects as separate classes, and matching them to player detections based on the intersection over union (IoU). This approach, however, leads to poor matching performance in crowded situations, as it does not model the relationship between players and objects. In this paper, we propose a simple yet efficient way to detect and match players and related objects at once without extra cost, by considering an implicit association for prediction of multiple objects through the same proposal box. We evaluate the method on a dataset of broadcast ice hockey videos, and also a new public dataset we introduce called COCO +Torso. On the ice hockey dataset, the proposed method boosts matching performance from 57.1% to 81.4%, while also improving the mean AP of player+stick detections from 68.4% to 88.3%. On the COCO +Torso dataset, we see matching improving from 47.9% to 65.2%. The COCO +Torso dataset, code and pre-trained models will be released at this https URL.
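
摘要中作为对比的基线是"按 IoU 把球杆等物体框匹配到球员框",其基本形式可示意如下(贪心匹配的通用草图,非论文实现):

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """a, b: [x1, y1, x2, y2] 形式的检测框。"""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_objects_to_players(players, objects, thresh: float = 0.1):
    """按 IoU 从大到小贪心匹配,返回 (player_idx, object_idx) 配对列表。"""
    pairs = sorted(((iou(p, o), i, j) for i, p in enumerate(players)
                    for j, o in enumerate(objects)), reverse=True)
    used_p, used_o, matches = set(), set(), []
    for s, i, j in pairs:
        if s < thresh or i in used_p or j in used_o:
            continue
        used_p.add(i); used_o.add(j); matches.append((i, j))
    return matches
```
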
[8]:Mutual Contrastive Learning for Visual Representation Learning
标题:视觉表征学习中的相互对比学习
作者:Chuanguang Yang, Zhulin An, Linhang Cai, Yongjun Xu
备注:10 pages
链接:https://arxiv.org/abs/2104.12565
摘要:We present a collaborative learning method called Mutual Contrastive Learning (MCL) for general visual representation learning. The core idea of MCL is to perform mutual interaction and transfer of contrastive distributions among a cohort of models. Benefiting from MCL, each model can learn extra contrastive knowledge from others, leading to more meaningful feature representations for visual recognition tasks. We emphasize that MCL is conceptually simple yet empirically powerful. It is a generic framework that can be applied to both supervised and self-supervised representation learning. Experimental results on supervised and self-supervised image classification, transfer learning and few-shot learning show that MCL can lead to consistent performance gains, demonstrating that MCL can guide the network to generate better feature representations.

[9]:Computer vision in automated parking systems: Design, implementation and challenges
标题:自动停车系统中的计算机视觉:设计、实现和挑战
作者:Markus Heimberger, Jonathan Horgan, Ciaran Hughes, John McDonald, Senthil Yogamani
链接:https://arxiv.org/abs/2104.12537
摘要:Automated driving is an active area of research in both industry and academia. Automated Parking, which is automated driving in a restricted scenario of parking with low speed manoeuvring, is a key enabling product for fully autonomous driving systems. It is also an important milestone from the perspective of a higher end system built from the previous generation driver assistance systems comprising of collision warning, pedestrian detection, etc. In this paper, we discuss the design and implementation of an automated parking system from the perspective of computer vision algorithms. Designing a low-cost system with functional safety is challenging and leads to a large gap between the prototype and the end product, in order to handle all the corner cases. We demonstrate how camera systems are crucial for addressing a range of automated parking use cases and also, to add robustness to systems based on active distance measuring sensors, such as ultrasonics and radar. The key vision modules which realize the parking use cases are 3D reconstruction, parking slot marking recognition, freespace and vehicle/pedestrian detection. We detail the important parking use cases and demonstrate how to combine the vision modules to form a robust parking system. To the best of the authors' knowledge, this is the first detailed discussion of a systemic view of a commercial automated parking system.

[10]:Visformer: The Vision-friendly Transformer
标题:Visformer:视觉友好型Transformer
作者:Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, Qi Tian
链接:https://arxiv.org/abs/2104.12533
摘要:The past year has witnessed the rapid development of applying the Transformer module to vision problems. While some researchers have demonstrated that Transformer-based models enjoy a favorable ability of fitting data, there are still growing number of evidences showing that these models suffer over-fitting especially when the training data is limited. This paper offers an empirical study by performing step-by-step operations to gradually transit a Transformer-based model to a convolution-based model. The results we obtain during the transition process deliver useful messages for improving visual recognition. Based on these observations, we propose a new architecture named Visformer, which is abbreviated from the `Vision-friendly Transformer'. With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy, and the advantage becomes more significant when the model complexity is lower or the training set is smaller. The code is available at this https URL.

[11]:EigenGAN: Layer-Wise Eigen-Learning for GANs
标题:EigenGAN:面向GAN的逐层本征学习
作者:Zhenliang He, Meina Kan, Shiguang Shan
备注:Code: this https URL
链接:https://arxiv.org/abs/2104.12476
摘要:Recent studies on Generative Adversarial Network (GAN) reveal that different layers of a generative CNN hold different semantics of the synthesized images. However, few GAN models have explicit dimensions to control the semantic attributes represented in a specific layer. This paper proposes EigenGAN which is able to unsupervisedly mine interpretable and controllable dimensions from different generator layers. Specifically, EigenGAN embeds one linear subspace with orthogonal basis into each generator layer. Via the adversarial training to learn a target distribution, these layer-wise subspaces automatically discover a set of "eigen-dimensions" at each layer corresponding to a set of semantic attributes or interpretable variations. By traversing the coefficient of a specific eigen-dimension, the generator can produce samples with continuous changes corresponding to a specific semantic attribute. Taking the human face for example, EigenGAN can discover controllable dimensions for high-level concepts such as pose and gender in the subspace of deep layers, as well as low-level concepts such as hue and color in the subspace of shallow layers. Moreover, under the linear circumstance, we theoretically prove that our algorithm derives the principal components as PCA does. Codes can be found in this https URL.

[12]:Contextualized Keyword Representations for Multi-modal Retinal Image Captioning
标题:多模式视网膜图像字幕的上下文关键字表示
作者:Jia-Hong Huang, Ting-Wei Wu, Marcel Worring
备注:This paper is accepted by ACM International Conference on Multimedia Retrieval (ICMR), 2021
链接:https://arxiv.org/abs/2104.12471
摘要:Medical image captioning automatically generates a medical description to describe the content of a given medical image. A traditional medical image captioning model creates a medical description only based on a single medical image input. Hence, an abstract medical description or concept is hard to be generated based on the traditional approach. Such a method limits the effectiveness of medical image captioning. Multi-modal medical image captioning is one of the approaches utilized to address this problem. In multi-modal medical image captioning, textual input, e.g., expert-defined keywords, is considered as one of the main drivers of medical description generation. Thus, encoding the textual input and the medical image effectively are both important for the task of multi-modal medical image captioning. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed. Contextualized keyword representations, textual feature reinforcement, and masked self-attention are used to develop the proposed approach. Based on the evaluation of the existing multi-modal medical image captioning dataset, experimental results show that the proposed model is effective with the increase of +53.2% in BLEU-avg and +18.6% in CIDEr, compared with the state-of-the-art method.

[13]:Practical Wide-Angle Portraits Correction with Deep Structured Models
标题:基于深结构模型的实用广角人像校正
作者:Jing Tan, Shan Zhao, Pengfei Xiong, Jiangyu Liu, Haoqiang Fan, Shuaicheng Liu
备注:This work has been accepted to CVPR2021
链接:https://arxiv.org/abs/2104.12464
摘要:Wide-angle portraits often enjoy expanded views. However, they contain perspective distortions, especially noticeable when capturing group portrait photos, where the background is skewed and faces are stretched. This paper introduces the first deep learning based approach to remove such artifacts from freely-shot photos. Specifically, given a wide-angle portrait as input, we build a cascaded network consisting of a LineNet, a ShapeNet, and a transition module (TM), which corrects perspective distortions on the background, adapts to the stereographic projection on facial regions, and achieves smooth transitions between these two projections, accordingly. To train our network, we build the first perspective portrait dataset with a large diversity in identities, scenes and camera modules. For the quantitative evaluation, we introduce two novel metrics, line consistency and face congruence. Compared to the previous state-of-the-art approach, our method does not require camera distortion parameters. We demonstrate that our approach significantly outperforms the previous state-of-the-art approach both qualitatively and quantitatively.

[14]:Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty
标题:考虑类不确定性的异构Agent轨迹预测
作者:Boris Ivanovic, Kuan-Hui Lee, Pavel Tokmakov, Blake Wulfe, Rowan McAllister, Adrien Gaidon, Marco Pavone
备注:17 pages, 12 figures, 5 tables
链接:https://arxiv.org/abs/2104.12446
摘要:Reasoning about the future behavior of other agents is critical to safe robot navigation. The multiplicity of plausible futures is further amplified by the uncertainty inherent to agent state estimation from data, including positions, velocities, and semantic class. Forecasting methods, however, typically neglect class uncertainty, conditioning instead only on the agent's most likely class, even though perception models often return full class distributions. To exploit this information, we present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities. We additionally present PUP, a new challenging real-world autonomous driving dataset, to investigate the impact of Perceptual Uncertainty in Prediction. It contains challenging crowded scenes with unfiltered agent class probabilities that reflect the long-tail of current state-of-the-art perception systems. We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty, and enables new forecasting capabilities such as counterfactual predictions.
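As a hedged illustration of the idea in the abstract above (conditioning on a full class distribution instead of the most likely class), the sketch below mixes class-conditioned trajectory heads by their class probabilities. This is one simple way to use class uncertainty; it is not claimed to be the HAICU architecture, and every name and shape is an assumption.

# Illustrative sketch: a forecast formed as the probability-weighted mixture of
# per-class trajectory heads, so the full class distribution is used rather
# than the argmax class.
import torch
import torch.nn as nn

class ClassAwareForecaster(nn.Module):
    def __init__(self, state_dim: int, horizon: int, num_classes: int, hidden: int = 64):
        super().__init__()
        # One small trajectory head per semantic class (e.g., car, pedestrian, cyclist).
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, horizon * 2))
            for _ in range(num_classes)
        ])
        self.horizon = horizon

    def forward(self, state: torch.Tensor, class_probs: torch.Tensor) -> torch.Tensor:
        # state: (B, state_dim) agent history features; class_probs: (B, num_classes).
        per_class = torch.stack([h(state) for h in self.heads], dim=1)   # (B, C, horizon*2)
        mixed = (class_probs.unsqueeze(-1) * per_class).sum(dim=1)       # expectation over classes
        return mixed.view(-1, self.horizon, 2)                           # (B, horizon, xy)

[15]:ECLIPSE: Envisioning Cloud Induced Perturbations in Solar Energy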
标题:ECLIPSE：预见云层引起的太阳能扰动
作者:Quentin Paletta, Anthony Hu, Guillaume Arbod, Joan Lasenby
链接:https://arxiv.org/abs/2104.12419
摘要:Efficient integration of solar energy into the electricity mix depends on a reliable anticipation of its intermittency. A promising approach to forecast the temporal variability of solar irradiance resulting from the cloud cover dynamics, is based on the analysis of sequences of ground-taken sky images. Despite encouraging results, a recurrent limitation of current Deep Learning approaches lies in the ubiquitous tendency of reacting to past observations rather than actively anticipating future events. This leads to a systematic temporal lag and little ability to predict sudden events. To address this challenge, we introduce ECLIPSE, a spatio-temporal neural network architecture that models cloud motion from sky images to predict both future segmented images and corresponding irradiance levels. We show that ECLIPSE anticipates critical events and considerably reduces temporal delay while generating visually realistic futures.[16]:Delving into Data: Effectively Substitute Training for Black-box Attack
标题:深入数据：面向黑盒攻击的有效替代训练
作者:Wenxuan Wang, Bangjie Yin, Taiping Yao, Li Zhang, Yanwei Fu, Shouhong Ding, Jilin Li, Feiyue Huang, Xiangyang Xue
备注:10 pages, 6 figures, 6 tables, 1 algorithm, To appear in CVPR 2021 as a poster paper
链接:https://arxiv.org/abs/2104.12378
摘要:Deep models have shown their vulnerability when processing adversarial samples. As for the black-box attack, without access to the architecture and weights of the attacked model, training a substitute model for adversarial attacks has attracted wide attention. Previous substitute training approaches focus on stealing the knowledge of the target model based on real training data or synthetic data, without exploring what kind of data can further improve the transferability between the substitute and target models. In this paper, we propose a novel perspective on substitute training that focuses on designing the distribution of data used in the knowledge stealing process. More specifically, a diverse data generation module is proposed to synthesize large-scale data with wide distribution. An adversarial substitute training strategy is introduced to focus on the data distributed near the decision boundary. The combination of these two modules can further boost the consistency of the substitute model and target model, which greatly improves the effectiveness of adversarial attacks. Extensive experiments demonstrate the efficacy of our method against state-of-the-art competitors under non-target and target attack settings. Detailed visualization and analysis are also provided to help understand the advantage of our method.[17]:Dynamic Degradation for Image Restoration and Fusion
标题:基于动态退化的图像恢复与融合
作者:Aiqing Fang, Xinbo Zhao, Jiaqi Yang, Yanning Zhang
链接:https://arxiv.org/abs/2104.12347
摘要:Deep-learning-based image restoration and fusion methods have achieved remarkable results. However, existing restoration and fusion methods have paid little attention to the robustness problem caused by dynamic degradation. In this paper, we propose a novel dynamic image restoration and fusion neural network, termed DDRF-Net, which is capable of solving two problems, i.e., static restoration and fusion, and dynamic degradation. In order to solve the static fusion problem of existing methods, dynamic convolution is introduced to learn dynamic restoration and fusion weights. In addition, a dynamic degradation kernel is proposed to improve the robustness of image restoration and fusion. Our network framework can effectively combine image degradation with image fusion tasks, provide more detailed information for image fusion tasks through the image restoration loss, and optimize image restoration tasks through the image fusion loss. Therefore, the stumbling blocks of deep learning in image fusion, e.g., static fusion weights and specifically designed network architectures, are greatly mitigated. Extensive experiments show that our method is superior to the state-of-the-art methods.[18]:Diverse Image Inpainting with Bidirectional and Autoregressive Transformers
标题:基于双向自回归变换的多样图像修复
作者:Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jianxiong Pan, Kaiwen Cui, Shijian Lu, Feiying Ma, Xuansong Xie, Chunyan Miao
链接:https://arxiv.org/abs/2104.12335
摘要:Image inpainting is an underdetermined inverse problem; it naturally allows diverse contents that fill up the missing or corrupted regions reasonably and realistically. Prevalent approaches using convolutional neural networks (CNNs) can synthesize visually pleasant contents, but CNNs suffer from limited perception fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse contents with autoregressive modeling of pixel-sequence distributions. However, the unidirectional attention in transformers is suboptimal as corrupted regions can have arbitrary shapes with contexts from arbitrary directions. We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT) that models deep bidirectional contexts for autoregressive generation of diverse inpainting contents. BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which makes it possible to generate high-resolution contents without being constrained by the quadratic complexity of attention in transformers. Specifically, it first generates pluralistic image structures of low resolution by adapting transformers and then synthesizes realistic texture details of high resolutions with a CNN-based up-sampling network. Extensive experiments over multiple datasets show that BAT-Fill achieves superior diversity and fidelity in image inpainting qualitatively and quantitatively.[19]:StegaPos: Preventing Crops and Splices with Imperceptible Positional Encodings
标题:StegaPos:用不易察觉的位置编码防止裁剪和拼接
作者:Gokhan Egri, Todd Zickler
备注:For NeurIPS 2021 submission, 8 pages (main), 18 pages (total)
链接:https://arxiv.org/abs/2104.12290
摘要:We present a model for differentiating between images that are authentic copies of ones published by photographers, and images that have been manipulated by cropping, splicing or downsampling after publication. The model comprises an encoder that resides with the photographer and a matching decoder that is available to observers. The encoder learns to embed imperceptible positional signatures into image values prior to publication. The decoder learns to use these steganographic positional ("stegapos") signatures to determine, for each small image patch, the 2D positional coordinates that were held by the patch in its originally-published image. Crop, splice and downsample edits become detectable by the inconsistencies they cause in the hidden positional signatures. We find that training the encoder and decoder together produces a model that imperceptibly encodes position, and that enables superior performance on established benchmarks for splice detection and high accuracy on a new benchmark for crop detection.
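The detection step described in the StegaPos abstract above can be illustrated with a small consistency check: if a decoder predicts, for each patch, the position that patch held in the originally published image, a pure crop corresponds to one constant offset, while spliced patches disagree with it. The sketch below is an assumption-laden illustration of that logic (threshold, shapes, function name), not the authors' code.

# Illustrative consistency check over decoder-predicted patch positions.
import numpy as np

def analyze_positions(pred_pos: np.ndarray, grid_pos: np.ndarray, tol: float = 4.0):
    """pred_pos, grid_pos: (N, 2) arrays of predicted original positions and
    patch positions in the observed image (in pixels). Returns the estimated
    crop offset, whether a crop is detected, and a mask of splice suspects."""
    offsets = pred_pos - grid_pos                 # per-patch displacement
    crop_offset = np.median(offsets, axis=0)      # robust estimate of a single global crop shift
    residual = np.linalg.norm(offsets - crop_offset, axis=1)
    suspects = residual > tol                     # patches inconsistent with one crop (possible splices)
    is_cropped = np.linalg.norm(crop_offset) > tol
    return crop_offset, is_cropped, suspects

[20]:Class Equilibrium using Coulomb's Law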
标题:利用库仑定律的类平衡
作者:Saheb Chhabra, Puspita Majumdar, Mayank Vatsa, Richa Singh
备注:Accepted at IJCNN 2021
链接:https://arxiv.org/abs/2104.12287
摘要:Projection algorithms learn a transformation function to project the data from input space to the feature space, with the objective of increasing the inter-class distance. However, increasing the inter-class distance can affect the intra-class distance. Maintaining an optimal inter-class separation among the classes without affecting the intra-class distance of the data distribution is a challenging task. In this paper, inspired by the Coulomb's law of Electrostatics, we propose a new algorithm to compute the equilibrium space of any data distribution where the separation among the classes is optimal. The algorithm further learns the transformation between the input space and equilibrium space to perform classification in the equilibrium space. The performance of the proposed algorithm is evaluated on four publicly available datasets at three different resolutions. It is observed that the proposed algorithm performs well for low-resolution images.[21]:The 5th AI City Challenge
标题:第五届AI城市挑战赛
作者:Milind Naphade, Shuo Wang, David C. Anastasiu, Zheng Tang, Ming-Ching Chang, Xiaodong Yang, Yue Yao, Liang Zheng, Pranamesh Chakraborty, Anuj Sharma, Qi Feng, Vitaly Ablavsky, Stan Sclaroff
备注:Summary of the 5th AI City Challenge Workshop in conjunction with CVPR 2021
链接:https://arxiv.org/abs/2104.12233
摘要:The AI City Challenge was created with two goals in mind: (1) pushing the boundaries of research and development in intelligent video analysis for smarter cities use cases, and (2) assessing tasks where the level of performance is enough to cause real-world adoption. Transportation is a segment ripe for such adoption. The fifth AI City Challenge attracted 305 participating teams across 38 countries, who leveraged city-scale real traffic data and high-quality synthetic data to compete in five challenge tracks. Track 1 addressed video-based automatic vehicle counting, with the evaluation conducted on both algorithmic effectiveness and computational efficiency. Track 2 addressed city-scale vehicle re-identification with augmented synthetic data to substantially increase the training set for the task. Track 3 addressed city-scale multi-target multi-camera vehicle tracking. Track 4 addressed traffic anomaly detection. Track 5 was a new track addressing vehicle retrieval using natural language descriptions. The evaluation system shows a general leader board of all submitted results, and a public leader board of results limited to the contest participation rules, where teams are not allowed to use external data in their work. The public leader board shows results closer to real-world situations where annotated data is limited. Results show the promise of AI in Smarter Transportation. State-of-the-art performance for some tasks shows that these technologies are ready for adoption in real-world systems.[22]:Visual Saliency Transformer
标题:视觉显著变换器
作者:Nian Liu, Ni Zhang, Kaiyuan Wan, Junwei Han, Ling Shao
链接:https://arxiv.org/abs/2104.12099
摘要:Recently, massive saliency detection methods have achieved promising results by relying on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which can not be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Apart from the traditional transformer architecture used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models.[23]:Image Inpainting with Edge-guided Learnable Bidirectional Attention Maps
标题:基于边缘引导可学习双向注意图的图像修复
作者:Dongsheng Wang, Chaohao Xie, Shaohui Liu, Zhenxing Niu, Wangmeng Zuo
备注:16 pages,13 figures
链接:https://arxiv.org/abs/2104.12087
摘要:For image inpainting, the convolutional neural networks (CNN) in previous methods often adopt a standard convolutional operator, which treats valid pixels and holes indistinguishably. As a result, they are limited in handling irregular holes and tend to produce color-discrepant and blurry inpainting results. Partial convolution (PConv) copes with this issue by conducting masked convolution and feature re-normalization conditioned only on valid pixels, but the mask-updating is handcrafted and independent of image structural information. In this paper, we present an edge-guided learnable bidirectional attention map (Edge-LBAM) for improving image inpainting of irregular holes with several distinct merits. Instead of using a hard 0-1 mask, a learnable attention map module is introduced for learning feature re-normalization and mask-updating in an end-to-end manner. Learnable reverse attention maps are further proposed in the decoder for emphasizing filling in unknown pixels instead of reconstructing all pixels. Motivated by the fact that the filling-in order is crucial to inpainting results and largely depends on image structures in exemplar-based methods, we further suggest a multi-scale edge completion network to predict coherent edges. Our Edge-LBAM method contains dual procedures, including structure-aware mask-updating guided by predicted edges and attention maps generated by masks for feature re-normalization. Extensive experiments show that our Edge-LBAM is effective in generating coherent image structures and preventing color discrepancy and blurriness, and performs favorably against the state-of-the-art methods in terms of qualitative metrics and visual quality.[24]:Making GAN-Generated Images Difficult To Spot: A New Attack Against Synthetic Image Detectors
标题:使GAN生成的图像难以识别:针对合成图像探测器的新攻击
作者:Xinwei Zhao, Matthew C. Stamm
链接:https://arxiv.org/abs/2104.12069
摘要:Visually realistic GAN-generated images have recently emerged as an important misinformation threat. Research has shown that these synthetic images contain forensic traces that are readily identifiable by forensic detectors. Unfortunately, these detectors are built upon neural networks, which are vulnerable to recently developed adversarial attacks. In this paper, we propose a new anti-forensic attack capable of fooling GAN-generated image detectors. Our attack uses an adversarially trained generator to synthesize traces that these detectors associate with real images. Furthermore, we propose a technique to train our attack so that it can achieve transferability, i.e. it can fool unknown CNNs that it was not explicitly trained against. We demonstrate the performance of our attack through an extensive set of experiments, where we show that our attack can fool eight state-of-the-art detection CNNs with synthetic images created using seven different GANs.[25]:3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head
标题:3D-TalkEmo:学习合成3D情感对话头
作者:Qianyun Wang, Zhenfeng Fan, Shihong Xia
链接:https://arxiv.org/abs/2104.12051
摘要:Impressive progress has been made in audio-driven 3D facial animation recently, but synthesizing 3D talking-head with rich emotion is still unsolved. This is due to the lack of 3D generative models and available 3D emotional dataset with synchronized audios. To address this, we introduce 3D-TalkEmo, a deep neural network that generates 3D talking head animation with various emotions. We also create a large 3D dataset with synchronized audios and videos, rich corpus, as well as various emotion states of different persons with the sophisticated 3D face reconstruction methods. In the emotion generation network, we propose a novel 3D face representation structure - geometry map by classical multi-dimensional scaling analysis. It maps the coordinates of vertices on a 3D face to a canonical image plane, while preserving the vertex-to-vertex geodesic distance metric in a least-square sense. This maintains the adjacency relationship of each vertex and holds the effective convolutional structure for the 3D facial surface. Taking a neutral 3D mesh and a speech signal as inputs, the 3D-TalkEmo is able to generate vivid facial animations. Moreover, it provides access to change the emotion state of the animated speaker.
We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating the generated talking-heads of significantly higher quality compared to previous state-of-the-art methods.[26]:Spatial-Spectral Clustering with Anchor Graph for Hyperspectral Image
标题:基于锚图的高光谱图像空间谱聚类
作者:Qi Wang, Yanling Miao, Mulin Chen, Xuelong Li
链接:https://arxiv.org/abs/2104.11904
摘要:Hyperspectral image (HSI) clustering, which aims at dividing hyperspectral pixels into clusters, has drawn significant attention in practical applications. Recently, many graph-based clustering methods, which construct an adjacent graph to model the data relationship, have shown dominant performance. However, the high dimensionality of HSI data makes it hard to construct the pairwise adjacent graph. Besides, abundant spatial structures are often overlooked during the clustering procedure. In order to better handle the high dimensionality problem and preserve the spatial structures, this paper proposes a novel unsupervised approach called spatial-spectral clustering with anchor graph (SSCAG) for HSI data clustering. The SSCAG has the following contributions: 1) the anchor graph-based strategy is used to construct a tractable large graph for HSI data, which effectively exploits all data points and reduces the computational complexity; 2) a new similarity metric is presented to embed the spatial-spectral information into the combined adjacent graph, which can mine the intrinsic property structure of HSI data; 3) an effective neighbors assignment strategy is adopted in the optimization, which performs the singular value decomposition (SVD) on the adjacent graph to get solutions efficiently. Extensive experiments on three public HSI datasets show that the proposed SSCAG is competitive against the state-of-the-art approaches.
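For readers unfamiliar with the anchor-graph strategy mentioned in the SSCAG abstract above, the sketch below shows the standard construction such methods build on: cluster centers as anchors, sparse soft assignments Z, and a factored adjacency W = Z diag(Z^T 1)^{-1} Z^T that is never materialized. The paper's spatial-spectral similarity metric and SVD-based optimization are not reproduced here; all names and parameters are illustrative.

# Minimal anchor-graph construction (a sketch; dense distance computation is
# fine here but a real HSI pipeline would chunk it).
import numpy as np
from sklearn.cluster import KMeans

def anchor_graph(X: np.ndarray, n_anchors: int = 100, k: int = 5, sigma: float = 1.0):
    """X: (n, d) pixel spectra. Returns Z (n, m) soft assignments to anchors and
    the diagonal of diag(Z^T 1)^{-1}, so that W = Z @ np.diag(Lambda_inv) @ Z.T."""
    anchors = KMeans(n_clusters=n_anchors, n_init=10).fit(X).cluster_centers_
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)      # (n, m) squared distances
    Z = np.zeros_like(d2)
    knn = np.argsort(d2, axis=1)[:, :k]                            # k nearest anchors per pixel
    rows = np.arange(X.shape[0])[:, None]
    Z[rows, knn] = np.exp(-d2[rows, knn] / (2 * sigma ** 2))       # Gaussian affinities to nearby anchors
    Z /= Z.sum(axis=1, keepdims=True)                              # row-normalize
    Lambda_inv = 1.0 / Z.sum(axis=0)                               # diag(Z^T 1)^{-1}
    return Z, Lambda_inv

[27]:Carrying out CNN Channel Pruning in a White Box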
标题:在白盒中进行CNN通道剪枝
作者:Yuxin Zhang, Mingbao Lin, Chia-Wen Lin, Jie Chen, Feiyue Huang, Yongjian Wu, Yonghong Tian, Rongrong Ji
备注:11 pages, 4 figures
链接:https://arxiv.org/abs/2104.11883
摘要:Channel Pruning has been long adopted for compressing CNNs, which significantly reduces the overall computation. Prior works implement channel pruning in an unexplainable manner, which tends to reduce the final classification errors while failing to consider the internal influence of each channel. In this paper, we conduct channel pruning in a white box. Through deep visualization of feature maps activated by different channels, we observe that different channels have a varying contribution to different categories in image classification. Inspired by this, we choose to preserve channels contributing to most categories. Specifically, to model the contribution of each channel to differentiating categories, we develop a class-wise mask for each channel, implemented in a dynamic training manner w.r.t. the input image's category. On the basis of the learned class-wise mask, we perform a global voting mechanism to remove channels with less category discrimination. Lastly, a fine-tuning process is conducted to recover the performance of the pruned model. To our best knowledge, it is the first time that CNN interpretability theory is considered to guide channel pruning. Extensive experiments demonstrate the superiority of our White-Box over many state-of-the-arts. For instance, on CIFAR-10, it reduces 65.23% FLOPs with even 0.62% accuracy improvement for ResNet-110. On ILSVRC-2012, White-Box achieves a 45.6% FLOPs reduction with only a small loss of 0.83% in the top-1 accuracy for ResNet-50. Code, training logs and pruned models are available anonymously at this https URL.
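The class-wise voting idea in the White-Box abstract above can be sketched as follows: estimate each channel's response per class, count for how many classes the channel is informative, and prune the channels with the fewest votes before fine-tuning. The thresholding rule below is an assumption made for illustration; it is not the paper's learned class-wise masks.

# Illustrative per-channel "voting" from activation statistics (not the authors' code).
import torch

@torch.no_grad()
def channel_votes(activations: torch.Tensor, labels: torch.Tensor, num_classes: int,
                  rel_thresh: float = 0.5) -> torch.Tensor:
    """activations: (N, C, H, W) feature maps of one layer over a labelled set.
    Returns, per channel, the number of classes for which the channel's mean
    response exceeds rel_thresh times its overall mean response ('votes')."""
    resp = activations.mean(dim=(2, 3))                            # (N, C) per-sample channel response
    overall = resp.mean(dim=0)                                     # (C,) overall mean response
    votes = torch.zeros(resp.shape[1], device=resp.device)
    for c in range(num_classes):
        class_mean = resp[labels == c].mean(dim=0)                 # (C,) mean response for class c
        votes += (class_mean > rel_thresh * overall).float()
    return votes                                                   # prune the channels with the fewest votes

[28]:Playing Lottery Tickets with Vision and Language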
标题:用视觉和语言玩彩票
作者:Zhe Gan, Yen-Chun Chen, Linjie Li, Tianlong Chen, Yu Cheng, Shuohang Wang, Jingjing Liu
链接:https://arxiv.org/abs/2104.11832
摘要:Large-scale transformer-based pre-training has recently revolutionized vision-and-language (V+L) research. Models such as LXMERT, ViLBERT and UNITER have significantly lifted the state of the art over a wide range of V+L tasks. However, the large number of parameters in such models hinders their application in practice. In parallel, work on the lottery ticket hypothesis has shown that deep neural networks contain small matching subnetworks that can achieve on par or even better performance than the dense networks when trained in isolation. In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained V+L models. We use UNITER, one of the best-performing V+L models, as the testbed, and consolidate 7 representative V+L tasks for experiments, including visual question answering, visual commonsense reasoning, visual entailment, referring expression comprehension, image-text retrieval, GQA, and NLVR$^2$. Through comprehensive analysis, we summarize our main findings as follows. ($i$) It is difficult to find subnetworks (i.e., the tickets) that strictly match the performance of the full UNITER model. However, it is encouraging to confirm that we can find "relaxed" winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. ($ii$) Subnetworks found by task-specific pruning transfer reasonably well to the other tasks, while those found on the pre-training tasks at 60%/70% sparsity transfer universally, matching 98%/96% of the full accuracy on average over all the tasks. ($iii$) Adversarial training can be further used to enhance the performance of the found lottery tickets.
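The study above builds on the standard lottery-ticket recipe of iterative magnitude pruning with weight rewinding. The sketch below shows that generic procedure under simplifying assumptions (a per-tensor quantile threshold, a user-supplied train_fn, no mask enforcement inside training); it is not the UNITER-specific setup used in the paper.

# Generic iterative magnitude pruning with rewinding (a sketch, not the paper's code).
import copy
import torch

def find_ticket(model, train_fn, prune_fraction=0.2, rounds=5):
    """train_fn(model) trains the model in place. After each round, the
    smallest-magnitude surviving weights are masked out and the remaining
    weights are rewound to their initial values. In a real setup the mask
    should also be enforced during training."""
    init_state = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train_fn(model)
        for n, p in model.named_parameters():
            if n not in masks:
                continue
            alive = p.detach().abs()[masks[n].bool()]              # magnitudes of surviving weights
            thresh = alive.quantile(prune_fraction)                # prune the lowest fraction of survivors
            masks[n] *= (p.detach().abs() > thresh).float()
        model.load_state_dict(init_state)                          # rewind to initialization
        for n, p in model.named_parameters():                      # re-apply the sparsity mask
            if n in masks:
                p.data *= masks[n]
    return masks                                                   # the "winning ticket" sparsity pattern

[29]:UnrealROX+: An Improved Tool for Acquiring Synthetic Data from Virtual 3D Environments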
标题:UnrealROX+：一种改进的虚拟3D环境合成数据获取工具
作者:Pablo Martinez-Gonzalez, Sergiu Oprea, John Alejandro Castro-Vargas, Alberto Garcia-Garcia, Sergio Orts-Escolano, Jose Garcia-Rodriguez, Markus Vincze
备注:Accepted at International Joint Conference on Neural Networks (IJCNN) 2021
链接:https://arxiv.org/abs/2104.11776
摘要:Synthetic data generation has become essential in recent years for feeding data-driven algorithms, which have surpassed the performance of traditional techniques in almost every computer vision problem. Gathering and labelling the amount of data needed for these data-hungry models in the real world may become unfeasible and error-prone, while synthetic data give us the possibility of generating huge amounts of data with pixel-perfect annotations. However, most synthetic datasets lack enough realism in their rendered images. In that context, the UnrealROX generation tool was presented in 2019, making it possible to generate highly realistic data, at high resolutions and framerates, with an efficient pipeline based on Unreal Engine, a cutting-edge videogame engine. UnrealROX enabled robotic vision researchers to generate realistic and visually plausible data with full ground truth for a wide variety of problems such as class and instance semantic segmentation, object detection, depth estimation, visual grasping, and navigation. Nevertheless, its workflow was very tied to generating image sequences from a robotic on-board camera, making it hard to generate data for other purposes. In this work, we present UnrealROX+, an improved version of UnrealROX whose decoupled and easy-to-use data acquisition system allows users to quickly design and generate data in a much more flexible and customizable way. Moreover, it is packaged as an Unreal plug-in, which makes it more convenient to use with already existing Unreal projects, and it also includes new features such as generating albedo or a Python API for interacting with the virtual environment from Deep Learning frameworks.[30]:A Novel Interaction-based Methodology Towards Explainable AI with Better Understanding of Pneumonia Chest X-ray Images
标题:一种新的基于交互的人工智能解释方法能更好地理解肺炎胸部x线图像
作者:Shaw-Hwa Lo, Yiqiao Yin
链接:https://arxiv.org/abs/2104.12672
摘要:In the field of eXplainable AI (XAI), robust ``blackbox'' algorithms such as Convolutional Neural Networks (CNNs) are known for making high prediction performance. However, the ability to explain and interpret these algorithms still require innovation in the understanding of influential and, more importantly, explainable features that directly or indirectly impact the performance of predictivity. A number of methods existing in literature focus on visualization techniques but the concepts of explainability and interpretability still require rigorous definition. In view of the above needs, this paper proposes an interaction-based methodology -- Influence Score (I-score) -- to screen out the noisy and non-informative variables in the images hence it nourishes an environment with explainable and interpretable features that are directly associated to feature predictivity. We apply the proposed method on a real world application in Pneumonia Chest X-ray Image data set and produced state-of-the-art results. We demonstrate how to apply the proposed approach for more general big data problems by improving the explainability and interpretability without sacrificing the prediction performance. The contribution of this paper opens a novel angle that moves the community closer to the future pipelines of XAI problems.[31]:Clean Images are Hard to Reblur: A New Clue for Deblurring
标题:清晰的图像很难重新模糊:消除模糊的新线索
作者:Seungjun Nah, Sanghyun Son, Jaerin Lee, Kyoung Mu Lee
链接:https://arxiv.org/abs/2104.12665
摘要:The goal of dynamic scene deblurring is to remove the motion blur present in a given image. Most learning-based approaches implement their solutions by minimizing the L1 or L2 distance between the output and reference sharp image. Recent attempts improve the perceptual quality of the deblurred image by using features learned from visual recognition tasks. However, those features are originally designed to capture the high-level contexts rather than the low-level structures of the given image, such as blurriness. We propose a novel low-level perceptual loss to make images sharper. To better focus on image blurriness, we train a reblurring module amplifying the unremoved motion blur. Motivated by the observation that a well-deblurred clean image should contain zero-magnitude motion blur that is hard to amplify, we design two types of reblurring loss functions. The supervised reblurring loss at training stage compares the amplified blur between the deblurred image and the reference sharp image. The self-supervised reblurring loss at inference stage inspects if the deblurred image still contains noticeable blur to be amplified. Our experimental results demonstrate the proposed reblurring losses improve the perceptual quality of the deblurred images in terms of NIQE and LPIPS scores as well as visual sharpness.
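The two reblurring losses described in the abstract above reduce to simple comparisons once a reblurring module R is given: during training, the amplified blur of the deblurred output should match that of the sharp reference; at inference, a well-deblurred image should barely change under R. The sketch below is an illustrative formulation with assumed L1 distances, not necessarily the authors' exact losses.

# Illustrative formulation of the supervised and self-supervised reblurring losses.
import torch
import torch.nn.functional as F

def supervised_reblur_loss(reblur, deblurred, sharp):
    # Training: amplified blur of the deblurred output should match the amplified
    # blur of the reference sharp image (ideally none).
    return F.l1_loss(reblur(deblurred), reblur(sharp))

def self_supervised_reblur_loss(reblur, deblurred):
    # Inference/self-supervision: if reblurring changes the output noticeably,
    # residual blur is still present and can be penalized without a reference.
    return F.l1_loss(reblur(deblurred), deblurred)

[32]:Vision-based Driver Assistance Systems: Survey, Taxonomy and Advances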
标题:基于视觉的驾驶员辅助系统:综述、分类和进展
作者:Jonathan Horgan, Ciarán Hughes, John McDonald, Senthil Yogamani
链接:https://arxiv.org/abs/2104.12583
摘要:Vision-based driver assistance systems are one of the rapidly growing research areas of ITS, due to various factors such as the increased level of safety requirements in automotive, computational power in embedded systems, and the desire to get closer to autonomous driving. It is a cross-disciplinary area encompassing specialised fields like computer vision, machine learning, robotic navigation, embedded systems, automotive electronics and safety-critical software. In this paper, we survey vision-based advanced driver assistance systems with a consistent terminology and propose a taxonomy. We also propose an abstract model in an attempt to formalize a top-down view of application development to scale towards autonomous driving systems.[33]:dualFace: Two-Stage Drawing Guidance for Freehand Portrait Sketching
标题:dualFace：面向写意人像素描的两阶段绘画引导
作者:Zhengyu Huang, Yichen Peng, Tomohiro Hibino, Chunqi Zhao, Haoran Xie, Tsukasa Fukusato, Kazunori Miyata
备注:Accepted in the Journal of Computational Visual Media Conference 2021. 13 pages, 12 figures
链接:https://arxiv.org/abs/2104.12297
摘要:In this paper, we propose dualFace, a portrait drawing interface to assist users with different levels of drawing skills to complete recognizable and authentic face sketches. dualFace consists of two-stage drawing assistance to provide global and local visual guidance: global guidance, which helps users draw contour lines of portraits (i.e., geometric structure), and local guidance, which helps users draw details of facial parts (which conform to user-drawn contour lines), inspired by traditional artist workflows in portrait drawing. In the stage of global guidance, the user draws several contour lines, and dualFace then searches several relevant images from an internal database and displays the suggested face contour lines over the background of the canvas. In the stage of local guidance, we synthesize detailed portrait images with a deep generative model from user-drawn contour lines, but use the synthesized results as detailed drawing guidance. We conducted a user study to verify the effectiveness of dualFace, and we confirmed that dualFace significantly helps achieve a detailed portrait sketch. See this http URL[34]:Multi-Scale Hourglass Hierarchical Fusion Network for Single Image Deraining
标题:用于单幅图像去雨的多尺度沙漏层次融合网络
作者:Xiang Chen, Yufeng Huang, Lei Xu
备注:Accepted in CVPRW 2021
链接:https://arxiv.org/abs/2104.12100
摘要:Rain streaks bring serious blurring and visual quality degradation, and often vary in size, direction and density. Current CNN-based methods achieve encouraging performance, but are limited in depicting rain characteristics and recovering image details in poor visibility environments. To address these issues, we present a Multi-scale Hourglass Hierarchical Fusion Network (MH2F-Net) in an end-to-end manner, to exactly capture rain streak features with multi-scale extraction, hierarchical distillation and information aggregation. For better feature extraction, a novel Multi-scale Hourglass Extraction Block (MHEB) is proposed to get local and global features across different scales through down- and up-sampling processes. Besides, a Hierarchical Attentive Distillation Block (HADB) then employs the dual attention feature responses to adaptively recalibrate the hierarchical features and eliminate the redundant ones. Further, we introduce a Residual Projected Feature Fusion (RPFF) strategy to progressively discriminate feature learning and aggregate different features instead of directly concatenating or adding. Extensive experiments on both synthetic and real rainy datasets demonstrate the effectiveness of the designed MH2F-Net by comparing with recent state-of-the-art deraining algorithms. Our source code will be available on GitHub: this https URL.[35]:How Well Self-Supervised Pre-Training Performs with Streaming Data?
标题:自监督预训练在流式数据上的表现如何？
作者:Dapeng Hu, Qizhengqiu Lu, Lanqing Hong, Hailin Hu, Yifan Zhang, Zhenguo Li, Alfred Shen, Jiashi Feng
备注:16 pages, 16 figures
链接:https://arxiv.org/abs/2104.12081
摘要:The common self-supervised pre-training practice requires collecting massive unlabeled data together and then training a representation model, dubbed \textbf{joint training}. However, in real-world scenarios where data are collected in a streaming fashion, the joint training scheme is usually storage-heavy and time-consuming. A more efficient alternative is to train a model continually with streaming data, dubbed \textbf{sequential training}. Nevertheless, it is unclear how well sequential self-supervised pre-training performs with streaming data. In this paper, we conduct thorough experiments to investigate self-supervised pre-training with streaming data. Specifically, we evaluate the transfer performance of sequential self-supervised pre-training with four different data sequences on three different downstream tasks and make comparisons with joint self-supervised pre-training. Surprisingly, we find sequential self-supervised learning exhibits almost the same performance as the joint training when the distribution shifts within streaming data are mild. Even for data sequences with large distribution shifts, sequential self-supervised training with simple techniques, e.g., parameter regularization or data replay, still performs comparably to joint training. Based on our findings, we recommend using sequential self-supervised training as a \textbf{more efficient yet performance-competitive} representation learning practice for real-world applications.
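The "data replay" technique mentioned in the abstract above can be sketched as a small reservoir of past unlabeled samples mixed into each streaming batch before the self-supervised update. Everything below (buffer size, the ssl_loss_fn hook, the mixing ratio) is an assumption for illustration, not the paper's exact protocol.

# Sequential self-supervised training step with a simple replay buffer (a sketch).
import random
import torch

class ReplayBuffer:
    def __init__(self, capacity: int = 10_000):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, batch: torch.Tensor):
        for x in batch:                                   # reservoir sampling keeps a uniform sample of the stream
            self.seen += 1
            if len(self.data) < self.capacity:
                self.data.append(x)
            else:
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.data[j] = x

    def sample(self, n: int) -> torch.Tensor:
        return torch.stack(random.sample(self.data, min(n, len(self.data))))

def sequential_ssl_step(model, ssl_loss_fn, optimizer, stream_batch, buffer, replay_n=32):
    # Mix replayed past samples into the current streaming batch, then take one
    # self-supervised update (ssl_loss_fn could be any contrastive/BYOL-style loss).
    replay = buffer.sample(replay_n) if buffer.data else stream_batch[:0]
    batch = torch.cat([stream_batch, replay], dim=0)
    loss = ssl_loss_fn(model, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.add(stream_batch)
    return loss.item()

[36]:Class-Incremental Experience Replay for Continual Learning under Concept Drift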
标题:概念漂移下持续学习的类增量经验回放
作者:Łukasz Korycki, Bartosz Krawczyk
备注:IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021
链接:https://arxiv.org/abs/2104.11861
摘要:Modern machine learning systems need to be able to cope with constantly arriving and changing data. Two main areas of research dealing with such scenarios are continual learning and data stream mining. Continual learning focuses on accumulating knowledge and avoiding forgetting, assuming information once learned should be stored. Data stream mining focuses on adaptation to concept drift and discarding outdated information, assuming that only the most recent data is relevant. While these two areas are mainly being developed in separation, they offer complementary views on the problem of learning from dynamic data. There is a need for unifying them, by offering architectures capable of both learning and storing new information, as well as revisiting and adapting to changes in previously seen concepts. We propose a novel continual learning approach that can handle both tasks. Our experience replay method is fueled by a centroid-driven memory storing diverse instances of incrementally arriving classes. This is enhanced with a reactive subspace buffer that tracks concept drift occurrences in previously seen classes and adapts clusters accordingly. The proposed architecture is thus capable of both remembering valid and forgetting outdated information, offering a holistic framework for continual learning under concept drift.
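As a rough, assumption-heavy illustration of the centroid-driven replay memory in the abstract above: per class, an online centroid is tracked and the buffer keeps instances consistent with it, so replayed samples can follow a class as it drifts. The eviction rule and all names below are simplifications for illustration; they are not the paper's reactive subspace buffer.

# Illustrative centroid-driven replay memory for class-incremental replay under drift.
import torch

class CentroidMemory:
    def __init__(self, per_class: int = 50, momentum: float = 0.9):
        self.per_class, self.momentum = per_class, momentum
        self.centroids, self.buffers = {}, {}

    def add(self, x: torch.Tensor, y: int):
        # Update the class centroid with an exponential moving average, then store
        # the instance; evict the stored instance farthest from the current centroid
        # when the per-class budget is exceeded (a simplification of the paper's policy).
        c = self.centroids.get(y)
        self.centroids[y] = x.clone() if c is None else self.momentum * c + (1 - self.momentum) * x
        buf = self.buffers.setdefault(y, [])
        buf.append(x.clone())
        if len(buf) > self.per_class:
            dists = torch.stack([(b - self.centroids[y]).norm() for b in buf])
            buf.pop(int(dists.argmax()))

    def replay_batch(self, n_per_class: int = 4):
        # Return a small batch of stored instances from every class seen so far.
        xs, ys = [], []
        for y, buf in self.buffers.items():
            for b in buf[:n_per_class]:
                xs.append(b)
                ys.append(y)
        return (torch.stack(xs), torch.tensor(ys)) if xs else (None, None)

中文来自机器翻译，仅供参考。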

 
