cs.CV: 105 papers in total today
Transformer (5 papers)
【1】 Rethinking the Design Principles of Robust Vision Transformer
Authors: Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Shaokai Ye, Yuan He, Hui Xue
Affiliations: Alibaba Group; EPFL
Link: https://arxiv.org/abs/2105.07926
Abstract: Recent advances in Vision Transformers (ViT) have shown that self-attention-based networks, which take advantage of long-range dependency modeling, surpass traditional convolutional neural networks (CNNs) in most vision tasks. To further expand their applicability to computer vision, many improved variants re-design the Transformer architecture around the strengths of CNNs, i.e., locality and translation invariance, for better performance. However, these methods only consider the standard accuracy or computation cost of the model. In this paper, we rethink the design principles of ViTs from the perspective of robustness. We find that some design components greatly harm the robustness and generalization ability of ViTs while others are beneficial. By combining the robust design components, we propose the Robust Vision Transformer (RVT), a new vision transformer with superior performance and strong robustness. We further propose two new plug-and-play techniques, position-aware attention rescaling and patch-wise augmentation, to train our RVT. Experimental results on ImageNet and six robustness benchmarks show the advanced robustness and generalization ability of RVT compared with previous Transformers and state-of-the-art CNNs. Our RVT-S* also achieves Top-1 rank on multiple robustness leaderboards including ImageNet-C and ImageNet-Sketch. The code will be available at https://github.com/vtddggg/Robust-Vision-Transformer.
【2】 Vision Transformers are Robust Learners
Authors: Sayak Paul, Pin-Yu Chen
Affiliations: PyImageSearch; IBM Research
Link: https://arxiv.org/abs/2105.07581
Abstract: Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including recent breakthroughs in computer vision that achieve state-of-the-art (SOTA) standard accuracy with better parameter efficiency. Since self-attention helps a model systematically align the different components present inside the input data, it is natural to investigate its performance under model robustness benchmarks. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and SOTA convolutional neural networks (CNNs), Big Transfer (BiT). Through a series of six systematically designed experiments, we present analyses that provide both quantitative and qualitative indications of why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A, which is 4.3x higher than a comparable variant of BiT. Our analyses on image masking, Fourier spectrum sensitivity, and spread on the discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to its improved robustness. Code for reproducing our experiments is available at https://git.io/J3VO0.
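The Fourier spectrum sensitivity analysis mentioned in this abstract can be reproduced in outline. The following is a minimal sketch, not the authors' code, following the common recipe of perturbing images with single unit-norm 2D Fourier basis images and recording the error rate per frequency; model, eps, images and labels are placeholder assumptions.

import numpy as np
import torch

def fourier_basis(h, w, i, j):
    # Unit-norm image containing a single 2D Fourier frequency (i, j).
    freq = np.zeros((h, w), dtype=complex)
    freq[i, j] = 1.0
    basis = np.real(np.fft.ifft2(freq))
    return basis / np.linalg.norm(basis)

def fourier_error_map(model, images, labels, eps=4.0):
    # images: (N, 3, H, W) tensor in [0, 1]; returns an (H, W) error-rate map.
    _, _, h, w = images.shape
    err = np.zeros((h, w))
    model.eval()
    for i in range(h):
        for j in range(w):
            noise = torch.from_numpy(fourier_basis(h, w, i, j)).float()
            perturbed = (images + eps / 255.0 * noise).clamp(0, 1)
            with torch.no_grad():
                pred = model(perturbed).argmax(dim=1)
            err[i, j] = (pred != labels).float().mean().item()
    return err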
【3】 Is Image Size Important? A Robustness Comparison of Deep Learning Methods for Multi-scale Cell Image Classification Tasks: from Convolutional Neural Networks to Visual Transformers
Authors: Wanli Liu, Chen Li, Hongzan Sun, Weiming Hu, Haoyuan Chen, Changhao Sun, Marcin Grzegorzek
Affiliations: Microscopic Image and Medical Image Analysis Group, MBIE College, Northeastern University, Shenyang, China; China Medical University, Shenyang, China; University of Lübeck, Germany
Link: https://arxiv.org/abs/2105.07402
Abstract: Cervical cancer is a common and fatal cancer in women, but it can be prevented through early examination and treatment. Cytopathology images are often used to screen for this cancer, and because the sheer volume of images makes human error likely, deep learning-based computer-aided diagnosis systems have been developed. The image inputs required by deep learning methods usually have a fixed size, whereas clinical medical images vary in size. Directly resizing an image loses internal information, which seems unreasonable; yet much research simply resizes images and still obtains robust results. To find a reasonable explanation, this work uses 22 deep learning models to process images at different scales, with experiments conducted on the SIPaKMeD dataset. The conclusion is that deep learning methods are very robust to changes in image size. This conclusion is also validated on the Herlev dataset.
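The experimental protocol described (one architecture, many input scales) reduces to a simple evaluation loop. A hedged sketch, assuming a torchvision-style classifier that accepts variable input sizes (e.g., via adaptive pooling) and a dataset yielding (PIL image, label) pairs:

import torch
from torchvision import transforms

def accuracy_at_scale(model, dataset, size):
    # Resize every image to (size, size) before classification.
    tf = transforms.Compose([transforms.Resize((size, size)), transforms.ToTensor()])
    correct = total = 0
    model.eval()
    with torch.no_grad():
        for img, label in dataset:   # dataset yields (PIL image, int label)
            x = tf(img).unsqueeze(0)
            correct += int(model(x).argmax(dim=1).item() == label)
            total += 1
    return correct / total

# Compare robustness to input scale, e.g.:
# for s in (64, 128, 224, 256):
#     print(s, accuracy_at_scale(model, dataset, s))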
【4】 Rethinking Skip Connection with Layer Normalization in Transformers and ResNets
Authors: Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou
Affiliations: ADSPLAB, School of ECE, Peking University, China; MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University
Note: Accepted by COLING 2020 (The 28th International Conference on Computational Linguistics)
Link: https://arxiv.org/abs/2105.07205
Abstract: Skip connection is a widely-used technique to improve the performance and convergence of deep neural networks, believed to relieve the difficulty in optimization caused by non-linearity by propagating a linear component through the neural network layers. However, from another point of view, it can also be seen as a modulating mechanism between the input and the output, with the input scaled by a pre-defined value of one. In this work, we investigate how the scale factor affects the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient exploding or vanishing, in line with the depth of the models, which could be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection. Inspired by these findings, we further propose to adaptively adjust the scale of the input by recursively applying skip connection with layer normalization, which promotes the performance substantially and generalizes well across diverse tasks including both machine translation and image classification datasets.
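The proposed recipe, recursively applying a skip connection followed by layer normalization so that the input's scale is re-adjusted at each step, can be sketched in PyTorch as below. The recursion depth and the sublayer are illustrative assumptions; the paper's exact formulation may differ.

import torch
import torch.nn as nn

class RecursiveSkipLN(nn.Module):
    # y = LN(x + f(x)), applied `depth` times, so the residual branch is
    # re-normalized (re-scaled) at every recursion instead of keeping a fixed scale of 1.
    def __init__(self, d_model, sublayer, depth=2):
        super().__init__()
        self.sublayer = sublayer
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(depth))

    def forward(self, x):
        for norm in self.norms:
            x = norm(x + self.sublayer(x))
        return x

# block = RecursiveSkipLN(512, nn.Sequential(nn.Linear(512, 512), nn.ReLU()))
# y = block(torch.randn(8, 10, 512))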
【5】 Are Convolutional Neural Networks or Transformers more like human vision?
Authors: Shikhar Tuli, Ishita Dasgupta, Erin Grant, Thomas L. Griffiths
Affiliations: Department of Electrical and Computer Engineering, Princeton University; DeepMind, New York; Department of Electrical Engineering and Computer Sciences, UC Berkeley; Departments of Psychology and Computer Science, Princeton University
Note: Accepted at CogSci 2021
Link: https://arxiv.org/abs/2105.07197
Abstract: Modern machine learning models for computer vision exceed humans in accuracy on specific visual recognition tasks, notably on datasets like ImageNet. However, high accuracy can be achieved in many ways. The particular decision function found by a machine learning system is determined not only by the data to which the system is exposed, but also by the inductive biases of the model, which are typically harder to characterize. In this work, we follow a recent trend of in-depth behavioral analyses of neural network models that go beyond accuracy as an evaluation metric by looking at patterns of errors. Our focus is on comparing a suite of standard Convolutional Neural Networks (CNNs) and a recently-proposed attention-based network, the Vision Transformer (ViT), which relaxes the translation-invariance constraint of CNNs and therefore represents a model with a weaker set of inductive biases. Attention-based networks have previously been shown to achieve higher accuracy than CNNs on vision tasks, and we demonstrate, using new metrics for examining error consistency with more granularity, that their errors are also more consistent with those of humans. These results have implications both for building more human-like vision models and for understanding visual object recognition in humans.
Detection (7 papers)
【1】 A Cloud-based Deep Learning Framework for Remote Detection of Diabetic Foot Ulcers
Authors: Bill Cassidy, Neil D. Reeves, Joseph M. Pappachan, Naseer Ahmad, Samantha Haycocks, David Gillespie, Moi Hoon Yap
Affiliations: Manchester Metropolitan University, UK; Lancashire Teaching Hospitals NHS Foundation Trust, UK; The University of Manchester, UK
Note: 10 pages, 2 figures, 1 table
Link: https://arxiv.org/abs/2105.07763
Abstract: This research proposes a mobile and cloud-based framework for the automatic detection of diabetic foot ulcers and conducts an investigation of its performance. The system uses a cross-platform mobile framework which enables the deployment of mobile apps to multiple platforms using a single TypeScript code base. A deep convolutional neural network was deployed to a cloud-based platform where the mobile app could send photographs of patients' feet for inference to detect the presence of diabetic foot ulcers. The functionality and usability of the system were tested in two clinical settings: Salford Royal NHS Foundation Trust and Lancashire Teaching Hospitals NHS Foundation Trust. The benefits of the system, such as the potential use of the app by patients to identify and monitor their condition, are discussed.
【2】 FGR: Frustum-Aware Geometric Reasoning for Weakly Supervised 3D Vehicle Detection
Authors: Yi Wei, Shang Su, Jiwen Lu, Jie Zhou
Affiliations: Tsinghua University; Tsinghua Shenzhen International Graduate School
Note: Accepted to ICRA 2021
Link: https://arxiv.org/abs/2105.07647
Abstract: In this paper, we investigate the problem of weakly supervised 3D vehicle detection. Conventional methods for 3D object detection need vast amounts of manually labelled 3D data as supervision signals. However, annotating large datasets requires huge human effort, especially in the 3D domain. To tackle this problem, we propose frustum-aware geometric reasoning (FGR) to detect vehicles in point clouds without any 3D annotations. Our method consists of two stages: coarse 3D segmentation and 3D bounding box estimation. For the first stage, a context-aware adaptive region growing algorithm is designed to segment objects based on 2D bounding boxes. Leveraging predicted segmentation masks, we develop an anti-noise approach to estimate 3D bounding boxes in the second stage. Finally, 3D pseudo labels generated by our method are utilized to train a 3D detector. Independent of any 3D ground truth, FGR reaches comparable performance with fully supervised methods on the KITTI dataset. The findings indicate that it is able to accurately detect objects in 3D space with only 2D bounding boxes and sparse point clouds.
【3】 Class-Incremental Few-Shot Object Detection
Authors: Pengyang Li, Yanan Li, Donghui Wang
Affiliations: Zhejiang University; Zhejiang Lab
Link: https://arxiv.org/abs/2105.07637
Abstract: Conventional detection networks usually need abundant labeled training samples, while humans can learn new concepts incrementally with just a few examples. This paper focuses on a more challenging but realistic class-incremental few-shot object detection problem (iFSD). It aims to incrementally transfer the model to novel objects from only a few annotated samples without catastrophically forgetting the previously learned ones. To tackle this problem, we propose a novel method, LEAST, which can transfer with Less forgetting, fEwer training resources, And Stronger Transfer capability. Specifically, we first present the transfer strategy to reduce unnecessary weight adaptation and improve the transfer capability for iFSD. On this basis, we then integrate the knowledge distillation technique using a less resource-consuming approach to alleviate forgetting and propose a novel clustering-based exemplar selection process to preserve more discriminative features previously learned. Being a generic and effective method, LEAST can largely improve the iFSD performance on various benchmarks.
【4】 Open-set Recognition based on the Combination of Deep Learning and Ensemble Method for Detecting Unknown Traffic Scenarios
Authors: Lakshman Balasubramanian, Friedrich Kruber, Michael Botsch, Ke Deng
Note: Accepted for IEEE Intelligent Vehicles 2021
Link: https://arxiv.org/abs/2105.07635
Abstract: An understanding and classification of driving scenarios are important for testing and development of autonomous driving functionalities. Machine learning models are useful for scenario classification, but most of them assume that data received during testing come from one of the classes used in training. This assumption is not always true because of the open environment in which vehicles operate. This is addressed by a new machine learning paradigm called open-set recognition: the problem of assigning test samples to one of the classes used in training or to an unknown class. This work proposes a combination of Convolutional Neural Networks (CNN) and Random Forest (RF) for open-set recognition of traffic scenarios. CNNs are used for feature generation, and the RF algorithm along with extreme value theory is used for the detection of known and unknown classes. A distinguishing feature of the proposed solution is that it explores the vote patterns of trees in the RF instead of just the majority vote. By inheriting the ensemble nature of RF, the vote pattern of all trees combined with extreme value theory is shown to be well suited for detecting unknown classes. The proposed method has been tested on the highD and OpenTraffic datasets and has demonstrated superior performance in various aspects compared to existing solutions.
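The vote-pattern idea can be sketched with scikit-learn and SciPy: read each tree's vote via rf.estimators_, summarize a sample's vote pattern (here, the fraction of trees agreeing with the majority class), fit an extreme value distribution (here a Weibull) to the patterns of known-class training data, and reject test samples falling in the low tail. The summary statistic and the Weibull choice are assumptions, not the paper's exact formulation.

import numpy as np
from scipy.stats import weibull_min
from sklearn.ensemble import RandomForestClassifier

def vote_agreement(rf, X):
    # Fraction of trees whose vote matches the forest's majority prediction.
    tree_votes = np.stack([tree.predict(X) for tree in rf.estimators_])  # (n_trees, n)
    majority = rf.predict(X)
    return (tree_votes == majority).mean(axis=0)

# rf = RandomForestClassifier(n_estimators=200).fit(train_feats, train_labels)
# agree_train = vote_agreement(rf, train_feats)
# c, loc, scale = weibull_min.fit(agree_train)              # EVT fit on known classes
# threshold = weibull_min.ppf(0.05, c, loc=loc, scale=scale)
# is_unknown = vote_agreement(rf, test_feats) < threshold   # low agreement -> unknown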
【5】 Real-time Detection of Practical Universal Adversarial Perturbations
Authors: Kenneth T. Co, Luis Muñoz-González, Leslie Kanthan, Emil C. Lupu
Affiliations: Imperial College London, United Kingdom; DataSpartan, London, United Kingdom
Link: https://arxiv.org/abs/2105.07334
Abstract: Universal Adversarial Perturbations (UAPs) are a prominent class of adversarial examples that exploit systemic vulnerabilities and enable physically realizable and robust attacks against Deep Neural Networks (DNNs). UAPs generalize across many different inputs; this leads to realistic and effective attacks that can be applied at scale. In this paper, we propose HyperNeuron, an efficient and scalable algorithm that allows for the real-time detection of UAPs by identifying suspicious neuron hyper-activations. Our results show the effectiveness of HyperNeuron on multiple tasks (image classification, object detection), against a wide variety of universal attacks, and in realistic scenarios, like perceptual ad-blocking and adversarial patches. HyperNeuron is able to simultaneously detect both adversarial mask and patch UAPs with comparable or better performance than existing UAP defenses whilst introducing a significantly reduced latency of only 0.86 milliseconds per image. This suggests that many realistic and practical universal attacks can be reliably mitigated in real-time, which shows promise for the robust deployment of machine learning systems.
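The monitoring scaffolding behind such a detector is straightforward in PyTorch: capture a layer's activations with a forward hook and flag inputs whose activation statistic exceeds a threshold calibrated on clean data. This is only a sketch of the mechanism; the choice of layer, statistic, and threshold are assumptions, not the HyperNeuron algorithm itself.

import torch

class ActivationMonitor:
    # Records a per-sample activation statistic of one layer per forward pass.
    def __init__(self, layer):
        self.value = None
        layer.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # Mean absolute activation over all non-batch dimensions.
        self.value = output.detach().abs().mean(dim=tuple(range(1, output.dim())))

    def flags(self, threshold):
        # True where the statistic exceeds the clean-data calibration threshold.
        return self.value > threshold

# monitor = ActivationMonitor(model.layer3)   # which layer to watch is an assumption
# _ = model(batch)                            # populates monitor.value
# suspicious = monitor.flags(threshold=calibrated_tau)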
【6】 Fast and Accurate Camera Scene Detection on Smartphones
Authors: Angeline Pouget, Sidharth Ramesh, Maximilian Giang, Ramithan Chandrapalan, Toni Tanner, Moritz Prussing, Radu Timofte, Andrey Ignatov
Affiliations: ETH Zurich, Switzerland
Link: https://arxiv.org/abs/2105.07869
Abstract: AI-powered automatic camera scene detection is nowadays available in nearly any modern smartphone, though the problem of accurate scene prediction has not yet been addressed by the research community. This paper for the first time carefully defines this problem and proposes a novel Camera Scene Detection Dataset (CamSDD) containing more than 11K manually crawled images belonging to 30 different scene categories. We propose an efficient and NPU-friendly CNN model for this task that demonstrates a top-3 accuracy of 99.5% on this dataset and achieves more than 200 FPS on recent mobile SoCs. An additional in-the-wild evaluation of the obtained solution is performed to analyze its performance and limitations in real-world scenarios. The dataset and pre-trained models used in this paper are available on the project website.
【7】 Deep learning for detecting pulmonary tuberculosis via chest radiography: an international study across 10 countries
Authors: Sahar Kazemzadeh, Jin Yu, Shahar Jamshy, Rory Pilgrim, Zaid Nabulsi, Christina Chen, Neeral Beladia, Charles Lau, Scott Mayer McKinney, Thad Hughes, Atilla Kiraly, Sreenivasa Raju Kalidindi, Monde Muyoyeta, Jameson Malemela, Ting Shih, Greg S. Corrado, Lily Peng, Katherine Chou, Po-Hsuan Cameron Chen, Yun Liu, Krish Eswaran, Daniel Tse, Shravya Shetty, Shruthi Prabhakara
Affiliations: Google Health, Palo Alto, CA, USA; Advanced Clinical, Deerfield, IL, USA; Apollo Radiology International, Hyderabad, India; TB Department, Center of Infectious Disease Research in Zambia, Lusaka, Zambia
Link: https://arxiv.org/abs/2105.07540
Abstract: Tuberculosis (TB) is a top-10 cause of death worldwide. Though the WHO recommends chest radiographs (CXRs) for TB screening, the limited availability of CXR interpretation is a barrier. We trained a deep learning system (DLS) to detect active pulmonary TB using CXRs from 9 countries across Africa, Asia, and Europe, and utilized large-scale CXR pretraining, attention pooling, and noisy student semi-supervised learning. Evaluation was on (1) a combined test set spanning China, India, the US, and Zambia, and (2) an independent mining population in South Africa. Given WHO targets of 90% sensitivity and 70% specificity, the DLS's operating point was prespecified to favor sensitivity over specificity. On the combined test set, the DLS's ROC curve was above those of all 9 India-based radiologists, with an AUC of 0.90 (95% CI 0.87-0.92). The DLS's sensitivity (88%) was higher than that of the India-based radiologists (75% mean sensitivity), p<0.001 for superiority; and its specificity (79%) was non-inferior to the radiologists' (84% mean specificity), p=0.004. Similar trends were observed within HIV-positive and sputum-smear-positive sub-groups, and in the South Africa test set. We found that 5 US-based radiologists (where TB isn't endemic) were more sensitive and less specific than the India-based radiologists (where TB is endemic). The DLS also remained non-inferior to the US-based radiologists. In simulations, using the DLS as a prioritization tool for confirmatory testing reduced the cost per positive case detected by 40-80% compared to using confirmatory testing alone. To conclude, our DLS generalized to 5 countries, and merits prospective evaluation to assist cost-effective screening efforts in radiologist-limited settings. Operating point flexibility may permit customization of the DLS to account for site-specific factors such as TB prevalence, demographics, clinical resources, and customary practice patterns.
Classification & Recognition (15 papers)
【1】 Unknown-box Approximation to Improve Optical Character Recognition Performance
Authors: Ayantha Randika, Nilanjan Ray, Xiao Xiao, Allegra Latimer
Affiliations: University of Alberta, Edmonton, AB, Canada; Intuit Inc., Mountain View, CA, USA
Link: https://arxiv.org/abs/2105.07983
Abstract: Optical character recognition (OCR) is a widely used pattern recognition application in numerous domains. There are several feature-rich, general-purpose OCR solutions available for consumers, which can provide moderate to excellent accuracy levels. However, accuracy can diminish with difficult and uncommon document domains. Preprocessing of document images can be used to minimize the effect of domain shift. In this paper, a novel approach is presented for creating a customized preprocessor for a given OCR engine. Unlike previous OCR-agnostic preprocessing techniques, the proposed approach approximates the gradient of a particular OCR engine to train a preprocessor module. Experiments with two datasets and two OCR engines show that the presented preprocessor is able to improve the accuracy of the OCR by up to 46% over the baseline by applying pixel-level manipulations to the document image. The implementation of the proposed method and the enhanced public datasets are available for download.
【2】 BigEarthNet-MM: A Large Scale Multi-Modal Multi-Label Benchmark Archive for Remote Sensing Image Classification and Retrieval
Authors: Gencer Sumbul, Arne de Wall, Tristan Kreuziger, Filipe Marcelino, Hugo Costa, Pedro Benevides, Mário Caetano, Begüm Demir, Volker Markl
Affiliations: Universidade Nova de Lisboa
Note: The paper is under review. Code is available online at this https URL. arXiv admin note: substantial text overlap with arXiv:2001.06372
Link: https://arxiv.org/abs/2105.07921
Abstract: This paper presents the multi-modal BigEarthNet (BigEarthNet-MM) benchmark archive made up of 590,326 pairs of Sentinel-1 and Sentinel-2 image patches to support deep learning (DL) studies in multi-modal multi-label remote sensing (RS) image retrieval and classification. Each pair of patches in BigEarthNet-MM is annotated with multi-labels provided by the CORINE Land Cover (CLC) map of 2018 based on its thematically most detailed Level-3 class nomenclature. Our initial research demonstrates that some CLC classes are challenging to describe accurately by considering only (single-date) BigEarthNet-MM images. In this paper, we also introduce an alternative class nomenclature as an evolution of the original CLC labels to address this problem. This is achieved by interpreting and arranging the CLC Level-3 nomenclature, based on the properties of BigEarthNet-MM images, into a new nomenclature of 19 classes. In our experiments, we show the potential of BigEarthNet-MM for multi-modal multi-label image retrieval and classification problems by considering several state-of-the-art DL models. We also demonstrate that DL models trained from scratch on BigEarthNet-MM outperform those pre-trained on ImageNet, especially in relation to some complex classes, including agriculture and other vegetated and natural environments. We make all the data and the DL models publicly available at https://bigearth.net, offering an important resource to support studies on multi-modal image scene classification and retrieval problems in RS.
【3】 CNN-based Approaches For Cross-Subject Classification in Motor Imagery: From The State-of-The-Art to DynamicNet
Authors: Alberto Zancanaro, Giulia Cisotto, João Ruivo Paulo, Gabriel Pires, Urbano J. Nunes
Affiliations: Dept. of Information Engineering, University of Padova, Italy; National Centre for Neurology and Psychiatry, Tokyo, Japan; National Inter-University Consortium for Telecommunications (CNIT), Padova, Italy
Link: https://arxiv.org/abs/2105.07917
Abstract: Motor imagery (MI)-based brain-computer interface (BCI) systems are being increasingly employed to provide alternative means of communication and control for people suffering from neuro-motor impairments, with a special effort to bring these systems out of the controlled lab environments. Hence, accurately classifying MI from brain signals, e.g., from electroencephalography (EEG), is essential to obtain reliable BCI systems. However, MI classification is still a challenging task, because the signals are characterized by poor SNR and high intra-subject and cross-subject variability. Deep learning approaches have started to emerge as valid alternatives to standard machine learning techniques, e.g., filter bank common spatial pattern (FBCSP), to extract subject-independent features and to increase the cross-subject classification performance of MI BCI systems. In this paper, we first present a review of the most recent studies using deep learning for MI classification, with particular attention to their cross-subject performance. Second, we propose DynamicNet, a Python-based tool for quick and flexible implementations of deep learning models based on convolutional neural networks. We showcase the potential of DynamicNet by implementing EEGNet, a well-established architecture for effective EEG classification. Finally, we compare its performance with FBCSP on a 4-class MI classification task over public datasets. To explore its cross-subject classification ability, we applied three different cross-validation schemes. From our results, we demonstrate that DynamicNet-implemented EEGNet outperforms FBCSP by about 25%, with a statistically significant difference when cross-subject validation schemes are applied.
【4】 Large-Scale Unsupervised Person Re-Identification with Contrastive Learning
Authors: Weiquan Huang, Yan Bai, Qiuyu Ren, Xinbo Zhao, Ming Feng, Yin Wang
Affiliations: Tongji University, Shanghai, China
Link: https://arxiv.org/abs/2105.07914
Abstract: Existing public person Re-Identification (ReID) datasets are small in modern terms because of labeling difficulty. Although unlabeled surveillance video is abundant and relatively easy to obtain, it is unclear how to leverage this footage to learn meaningful ReID representations. In particular, most existing unsupervised and domain adaptation ReID methods utilize only the public datasets in their experiments, with labels removed. In addition, due to small data sizes, these methods usually rely on fine-tuning with the unlabeled training data in the testing domain to achieve good performance. Inspired by the recent progress of large-scale self-supervised image classification using contrastive learning, we propose to learn ReID representation from large-scale unlabeled surveillance video alone. Assisted by off-the-shelf pedestrian detection tools, we apply the contrastive loss at both the image and the tracklet levels. Together with a principal component analysis step using camera labels freely available, our evaluation using a large-scale unlabeled dataset shows far superior performance among unsupervised methods that do not use any training data in the testing domain. Furthermore, the accuracy improves with the data size, and therefore our method has great potential with even larger and more diversified datasets.
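A minimal sketch of the contrastive objective applied at the image and tracklet levels, assuming a standard InfoNCE/NT-Xent form (the paper's exact loss may differ): embed two augmented views of each instance, compute cosine similarities, and apply cross-entropy against the positive pair.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    # z1, z2: (N, d) embeddings of two augmented views of the same N instances.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are the positives; every other instance is a negative.
    return F.cross_entropy(logits, targets)

# Image level: z1, z2 come from two augmentations of each pedestrian crop.
# Tracklet level: one could instead pair two frames sampled from the same
# tracklet (an assumption about how the tracklet-level loss is formed).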
【5】 Multi-modal Visual Place Recognition in Dynamics-Invariant Perception Space
Authors: Lin Wu, Teng Wang, Changyin Sun
Affiliations: Southeast University
Link: https://arxiv.org/abs/2105.07800
Abstract: Visual place recognition is one of the essential and challenging problems in the field of robotics. In this letter, we for the first time explore the use of multi-modal fusion of semantic and visual modalities in a dynamics-invariant space to improve place recognition in dynamic environments. We achieve this by first designing a novel deep learning architecture to generate the static semantic segmentation and recover the static image directly from the corresponding dynamic image. We then innovatively leverage the spatial-pyramid-matching model to encode the static semantic segmentation into feature vectors. In parallel, the static image is encoded using the popular Bag-of-Words model. On the basis of the above multi-modal features, we finally measure the similarity between the query image and the target landmark by the joint similarity of their semantic and visual codes. Extensive experiments demonstrate the effectiveness and robustness of the proposed approach for place recognition in dynamic environments.
【6】 STRIDE : Scene Text Recognition In-Device
Authors: Rachit S Munjal, Arun D Prabhu, Nikhil Arora, Sukumar Moharana, Gopi Ramena
Affiliations: On-Device AI, Samsung R&D Institute, Bangalore, India
Note: Accepted in IJCNN 2021
Link: https://arxiv.org/abs/2105.07795
Abstract: Optical Character Recognition (OCR) systems have been widely used in various applications for extracting semantic information from images. To give users more control over their privacy, an on-device solution is needed. The current state-of-the-art models are too heavy and complex to be deployed on-device. We develop an efficient lightweight scene text recognition (STR) system, which has only 0.88M parameters and performs real-time text recognition. Attention modules tend to boost the accuracy of STR networks but are generally slow and not optimized for device inference. We therefore propose the use of convolutional attention modules in the text recognition network, which aims to provide channel and spatial attention information to the LSTM module while adding minimal computational cost. It boosts our word accuracy on the ICDAR 13 dataset by almost 2%. We also introduce a novel orientation classifier module to support the simultaneous recognition of both horizontal and vertical text. The proposed model surpasses leading commercial and other open-source OCR engines on the on-device metrics of inference time and memory footprint while achieving comparable accuracy. We deploy the system on-device with an inference speed of 2.44 ms per word on an Exynos 990 chipset device and achieve an accuracy of 88.4% on the ICDAR-13 dataset.
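The convolutional attention module described, which adds channel and spatial attention at minimal cost before the LSTM, is in the spirit of CBAM-style blocks. A hedged PyTorch sketch, with the reduction ratio and spatial kernel size as assumptions:

import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    # Channel attention (squeeze-excite style) followed by spatial attention,
    # both adding minimal compute on top of a feature map (N, C, H, W).
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)                     # re-weight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial(pooled)             # re-weight spatial positions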
【7】 DOC3-Deep One Class Classification using Contradictions
Authors: Sauptik Dhar, Bernardo Gonzalez Torres
Affiliations: University of California, USA
Note: Deep Learning, Anomaly Detection, Visual Inspection, Learning from Contradictions, Outlier Exposure; 18 pages, 14 tables, 6 figures
Link: https://arxiv.org/abs/2105.07636
Abstract: This paper introduces the notion of learning from contradictions (a.k.a. Universum learning) for deep one-class classification problems. We formalize this notion for the widely adopted one-class large-margin loss and propose the Deep One Class Classification using Contradictions (DOC3) algorithm. We show that learning from contradictions incurs lower generalization error by comparing the Empirical Rademacher Complexity (ERC) of DOC3 against its traditional inductive learning counterpart. Our empirical results demonstrate the efficacy of the DOC3 algorithm, achieving >30% for CIFAR-10 and >50% for the MV-Tec AD data set in test AUCs compared to its inductive learning counterpart, and in many cases improving the state-of-the-art in anomaly detection.
【8】 A Fine-Grained Visual Attention Approach for Fingerspelling Recognition in the Wild
Authors: Kamala Gajurel, Cuncong Zhong, Guanghui Wang
Affiliations: Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS, USA; Department of Computer Science, Ryerson University, Toronto, ON, Canada
Note: 7 pages, 3 figures
Link: https://arxiv.org/abs/2105.07625
Abstract: Fingerspelling in sign language has been the means of communicating technical terms and proper nouns when they do not have dedicated sign language gestures. Automatic recognition of fingerspelling can help resolve communication barriers when interacting with deaf people. The main challenges prevalent in fingerspelling recognition are the ambiguity of the gestures and the strong articulation of the hands. The automatic recognition model should address high inter-class visual similarity and high intra-class variation in the gestures. Most of the existing research in fingerspelling recognition has focused on datasets collected in controlled environments. The recent collection of a large-scale annotated fingerspelling dataset in the wild, from social media and online platforms, captures the challenges of a real-world scenario. In this work, we propose a fine-grained visual attention mechanism using the Transformer model for the sequence-to-sequence prediction task on the wild dataset. The fine-grained attention is achieved by utilizing the change in motion of the video frames (optical flow) in sequential context-based attention along with a Transformer encoder model. The unsegmented continuous video dataset is jointly trained by balancing the Connectionist Temporal Classification (CTC) loss and the maximum-entropy loss. The proposed approach can capture better fine-grained attention in a single iteration. Experimental evaluations show that it outperforms the state-of-the-art approaches.
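The joint objective, balancing CTC loss against a maximum-entropy term, can be sketched with PyTorch's built-in CTC loss plus an entropy bonus on the per-frame output distributions; the weight lam and the sign convention are assumptions.

import torch
import torch.nn.functional as F

ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_max_entropy_loss(logits, targets, in_lens, tgt_lens, lam=0.1):
    # logits: (T, N, C) raw per-frame scores; targets: concatenated label ids.
    log_probs = F.log_softmax(logits, dim=2)
    ctc_term = ctc(log_probs, targets, in_lens, tgt_lens)
    # Maximum-entropy regularizer: reward high per-frame entropy by subtracting it.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=2).mean()
    return ctc_term - lam * entropy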
【9】 Towards Unsupervised Domain Adaptation for Deep Face Recognition under Privacy Constraints via Federated Learning
Authors: Weiming Zhuang, Xin Gan, Yonggang Wen, Xuesen Zhang, Shuai Zhang, Shuai Yi
Affiliations: Nanyang Technological University, Singapore; SenseTime Research, China
Link: https://arxiv.org/abs/2105.07606
Abstract: Unsupervised domain adaptation has been widely adopted to generalize models to unlabeled data in a target domain, given labeled data in a source domain whose data distribution differs from the target domain. However, existing works are inapplicable to face recognition under privacy constraints because they require sharing sensitive face images between the two domains. To address this problem, we propose a novel unsupervised federated face recognition approach (FedFR). FedFR improves the performance in the target domain by iteratively aggregating knowledge from the source domain through federated learning. It protects data privacy by transferring models instead of raw data between domains. Besides, we propose a new domain constraint loss (DCL) to regularize source-domain training; DCL suppresses the data-volume dominance of the source domain. We also enhance a hierarchical clustering algorithm to predict pseudo labels for the unlabeled target domain accurately. To this end, FedFR forms an end-to-end training pipeline: (1) pre-train in the source domain; (2) predict pseudo labels by clustering in the target domain; (3) conduct domain-constrained federated learning across the two domains. Extensive experiments and analysis on two newly constructed benchmarks demonstrate the effectiveness of FedFR. It outperforms the baseline and classic methods in the target domain by over 4% on the more realistic benchmark. We believe that FedFR will shed light on applying federated learning to more computer vision tasks under privacy constraints.
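Step (3) hinges on server-side aggregation of client models. A minimal FedAvg-style sketch of the aggregation (weighted parameter averaging); FedFR's DCL regularizer and clustering step are not shown:

import copy
import torch

def federated_average(client_states, client_sizes):
    # client_states: list of model state_dicts; client_sizes: samples per client.
    total = float(sum(client_sizes))
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(s[key].float() * (n / total)
                       for s, n in zip(client_states, client_sizes))
    return avg

# Each round: clients train locally and send state_dicts; the server then runs
# global_model.load_state_dict(federated_average(states, sizes))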
【10】 Leveraging Semantic Scene Characteristics and Multi-Stream Convolutional Architectures in a Contextual Approach for Video-Based Visual Emotion Recognition in the Wild
Authors: Ioannis Pikoulis, Panagiotis P. Filntisis, Petros Maragos
Affiliations: School of ECE, National Technical University of Athens, Athens, Greece
Note: 9 pages, 4 figures, 5 tables, submitted to the 16th IEEE International Conference on Automatic Face and Gesture Recognition
Link: https://arxiv.org/abs/2105.07484
Abstract: In this work we tackle the task of video-based visual emotion recognition in the wild. Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction in cases where the aforementioned sources of affective information are inaccessible due to head/body orientation, low resolution and poor illumination. We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes, as part of a broader emotion recognition framework. Temporal Segment Networks (TSN) constitute the backbone of our proposed model. Apart from the RGB input modality, we make use of dense Optical Flow, following an intuitive multi-stream approach for a more effective encoding of motion. Furthermore, we shift our attention towards skeleton-based learning and leverage action-centric data as a means of pre-training a Spatial-Temporal Graph Convolutional Network (ST-GCN) for the task of emotion recognition. Our extensive experiments on the challenging Body Language Dataset (BoLD) verify the superiority of our methods over existing approaches, while by properly incorporating all of the aforementioned modules in a network ensemble we manage to surpass the previous best published recognition scores by a large margin.
【11】 Neighbourhood-guided Feature Reconstruction for Occluded Person Re-Identification
Authors: Shijie Yu, Dapeng Chen, Rui Zhao, Haobin Chen, Yu Qiao
Affiliations: ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China; SenseTime Group Limited
Link: https://arxiv.org/abs/2105.07345
Abstract: Person images captured by surveillance cameras are often occluded by various obstacles, which leads to defective feature representation and harms person re-identification (Re-ID) performance. To tackle this challenge, we propose to reconstruct the feature representation of occluded parts by fully exploiting the information of their neighborhood in a gallery image set. Specifically, we first introduce a visible part-based feature obtained via a body mask for each person image. Then we identify its neighboring samples using the visible features and reconstruct the representation of the full body with an outlier-removable graph neural network taking all the neighboring samples as input. Extensive experiments show that the proposed approach obtains significant improvements. On the large-scale Occluded-DukeMTMC benchmark, our approach achieves 64.2% mAP and 67.6% rank-1 accuracy, outperforming the state-of-the-art approaches by large margins, i.e., 20.4% and 12.5%, respectively, indicating the effectiveness of our method on the occluded Re-ID problem.
【12】 Brain Inspired Object Recognition System
Authors: Pinaki Roy Chowdhury, Angad Wadhwa, Antariksha Kar, Nikhil Tyagi
Note: 24 pages, 26 tables, 12 figures
Link: https://arxiv.org/abs/2105.07237
Abstract: This paper proposes an efficient computational model of face and object recognition which uses cues from the distributed face and object recognition mechanism of the brain, gathering the engineering equivalents of these cues from existing literature. Three distinct and widely used features, Histogram of Oriented Gradients, Local Binary Patterns, and principal components extracted from target images, are used in a manner which is simple yet effective. Our model uses multi-layer perceptrons (MLP) to classify these three features and fuses them at the decision level using the sum rule. A computational theory is first developed by using concepts from the information processing mechanism of the brain. Extensive experiments are carried out using fifteen publicly available datasets to validate the performance of our proposed model in recognizing faces and objects with extreme variation of illumination, pose angle, expression, and background. The results obtained are extremely promising when compared with other face and object recognition algorithms, including CNN and deep learning-based methods. This highlights that simple computational processes, if combined properly, can produce performance that competes with the best algorithms.
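The described pipeline, three hand-crafted features classified by separate MLPs and fused by the sum rule, maps directly onto scikit-image and scikit-learn. A compact sketch with illustrative hyperparameters (evaluation is shown on the training features for brevity):

import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def lbp_hist(img, P=8, R=1):
    # Uniform LBP codes fall in [0, P+1], so use P+2 histogram bins.
    lbp = local_binary_pattern(img, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

# images: list of equally sized 2D grayscale arrays; y: class labels
# f_hog = np.array([hog(im) for im in images])
# f_lbp = np.array([lbp_hist(im) for im in images])
# f_pca = PCA(n_components=50).fit_transform(np.array([im.ravel() for im in images]))
# One MLP per feature type, then sum-rule fusion of the posteriors:
# clfs = [MLPClassifier(max_iter=500).fit(f, y) for f in (f_hog, f_lbp, f_pca)]
# probs = sum(c.predict_proba(f) for c, f in zip(clfs, (f_hog, f_lbp, f_pca)))
# y_pred = clfs[0].classes_[probs.argmax(axis=1)]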
【13】 One for All: An End-to-End Compact Solution for Hand Gesture Recognition
Authors: Monu Verma, Ayushi Gupta, Santosh Kumar Vipparthi
Affiliations: Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur, India; OLX People, Bangalore, Karnataka, India
Link: https://arxiv.org/abs/2105.07143
Abstract: HGR is a quite challenging task, as its performance is influenced by various aspects such as illumination variations, cluttered backgrounds, spontaneous capture, etc. Conventional CNN networks for HGR follow a two-stage pipeline to deal with the various challenges: complex signs, illumination variations, and complex and cluttered backgrounds. The existing approaches need expert knowledge as well as auxiliary computation at stage 1 to remove the complexities from the input images. Therefore, in this paper, we propose a novel end-to-end compact CNN framework: a fine-grained feature attentive network for hand gesture recognition (Fit-Hand) to solve the challenges discussed above. The pipeline of the proposed architecture consists of two main units: a FineFeat module and a dilated convolutional (Conv) layer. The FineFeat module extracts fine-grained feature maps by employing an attention mechanism over multi-scale receptive fields. The attention mechanism is introduced to capture effective features by enlarging the average behaviour of multi-scale responses. Moreover, dilated convolution provides global features of hand gestures through a larger receptive field. In addition, an integration layer is utilized to combine the features of the FineFeat module and the dilated layer, which enhances the discriminability of the network by capturing complementary context information of hand postures. The effectiveness of Fit-Hand is evaluated using subject dependent (SD) and subject independent (SI) validation setups over seven benchmark datasets: MUGD-I, MUGD-II, MUGD-III, MUGD-IV, MUGD-V, Finger Spelling, and OUHANDS. Furthermore, to investigate deeper insights into the proposed Fit-Hand framework, we performed ten ablation studies.
【14】 Face Attributes as Cues for Deep Face Recognition Understanding
Authors: Matheus Alves Diniz, William Robson Schwartz
Affiliations: Smart Sense Laboratory, Department of Computer Science, Federal University of Minas Gerais, Brazil
Note: 7 pages, 5 figures, published at Automatic Face and Gesture Recognition 2020
Link: https://arxiv.org/abs/2105.07054
Abstract: Deeply learned representations are the state-of-the-art descriptors for face recognition methods. These representations encode latent features that are difficult to explain, compromising the confidence and interpretability of their predictions. Most attempts to explain deep features are visualization techniques that are often open to interpretation. Instead of relying only on visualizations, we use the outputs of hidden layers to predict face attributes. The obtained performance is an indicator of how well the attribute is implicitly learned in that layer of the network. Using a variable selection technique, we also analyze how these semantic concepts are distributed inside each layer, establishing the precise location of relevant neurons for each attribute. According to our experiments, gender, eyeglasses and hat usage can be predicted with over 96% accuracy even when only a single neural output is used to predict each attribute. These performances are less than 3 percentage points lower than the ones achieved by deep supervised face attribute networks. In summary, our experiments show that, inside DCNNs optimized for face identification, there exist latent neurons encoding face attributes almost as accurately as DCNNs optimized for these attributes.
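The probing protocol, predicting an attribute from a hidden layer's outputs, is easy to reproduce in outline: capture activations with a forward hook and fit a simple probe; probe accuracy then indicates how well the layer implicitly encodes the attribute. A sketch assuming a logistic-regression probe (the paper's per-neuron variable selection is omitted; the model and attribute names are placeholders):

import torch
from sklearn.linear_model import LogisticRegression

def layer_features(model, layer, images):
    # Capture one layer's activations for a batch with a forward hook.
    feats = []
    handle = layer.register_forward_hook(
        lambda module, inputs, output: feats.append(output.detach().flatten(1)))
    with torch.no_grad():
        model(images)
    handle.remove()
    return feats[0].cpu().numpy()

# X = layer_features(face_net, face_net.layer4, probe_images)
# probe = LogisticRegression(max_iter=1000).fit(X, has_eyeglasses)
# Probe accuracy on held-out images indicates how well the layer encodes the attribute.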
【15】 Dermoscopic Image Classification with Neural Style Transfer
Authors: Yutong Li, Ruoqing Zhu, Annie Qu, Mike Yeh
Affiliations: Department of Statistics, University of Illinois at Urbana-Champaign
Note: 32 pages, 11 figures
Link: https://arxiv.org/abs/2105.07592
Abstract: Skin cancer, the most commonly found human malignancy, is primarily diagnosed visually via dermoscopic analysis, biopsy, and histopathological examination. However, unlike other types of cancer, automated image classification of skin lesions is deemed more challenging due to the irregularity and variability in the lesions' appearances. In this work, we propose an adaptation of Neural Style Transfer (NST) as a novel image pre-processing step for skin lesion classification problems. We represent each dermoscopic image as the style image and transfer the style of the lesion onto a homogeneous content image. This transfers the main variability of each lesion onto the same localized region, which allows us to integrate the generated images together and extract latent, low-rank style features via tensor decomposition. We train and cross-validate our model on a dermoscopic data set collected and preprocessed from the International Skin Imaging Collaboration (ISIC) database. We show that the classification performance based on the extracted tensor features from the style-transferred images significantly outperforms that of the raw images by more than 10%, and is also competitive with well-studied, pre-trained CNN models through transfer learning. Additionally, the tensor decomposition further identifies latent style clusters, which may provide clinical interpretation and insights.
Segmentation & Semantics (11 papers)
【1】 Pseudo-Label Ensemble-based Semi-supervised Learning for Handling Noisy Soiling Segmentation Annotations
Authors: Michal Uricar, Ganesh Sistu, Lucie Yahiaoui, Senthil Yogamani
Affiliations: Independent Researcher, Czech Republic; Valeo Vision Systems, Ireland
Link: https://arxiv.org/abs/2105.07930
Abstract: Manual annotation of soiling on surround-view cameras is a very challenging and expensive task. The unclear boundaries of various soiling categories like water drops or mud particles usually result in a large variance in annotation quality. As a result, models trained on such poorly annotated data are far from optimal. In this paper, we focus on handling such noisy annotations via a pseudo-label driven ensemble model, which allows us to quickly spot problematic annotations and, in most cases, sufficiently fix them. We train a soiling segmentation model on both noisy and refined labels and demonstrate significant improvements using the refined annotations. It also illustrates that it is possible to effectively refine lower-cost coarse annotations.
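The refinement loop can be sketched simply: average the soft predictions of an ensemble to obtain per-pixel pseudo-labels, and flag pixels where a confident ensemble disagrees with the human annotation. The confidence threshold and the keep/replace rule are assumptions, not the paper's exact procedure.

import torch

def refine_annotations(models, image, noisy_mask, conf_thresh=0.9):
    # image: (N, 3, H, W); noisy_mask: (N, H, W) integer label map.
    with torch.no_grad():
        probs = torch.stack([m(image).softmax(dim=1) for m in models]).mean(dim=0)
    conf, pseudo = probs.max(dim=1)
    # Where the ensemble is confident and disagrees with the annotation,
    # trust the pseudo-label; otherwise keep the original label.
    fix = (conf > conf_thresh) & (pseudo != noisy_mask)
    refined = torch.where(fix, pseudo, noisy_mask)
    return refined, fix   # `fix` marks the spotted problematic pixels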
【2】 Cross-Modality Brain Tumor Segmentation via Bidirectional Global-to-Local Unsupervised Domain Adaptation
Authors: Kelei He, Wen Ji, Tao Zhou, Zhuoyuan Li, Jing Huo, Xin Zhang, Yang Gao, Dinggang Shen, Bing Zhang, Junfeng Zhang
Affiliations: Nanjing University of Science and Technology; Department of Radiology, Nanjing University Medical School
Link: https://arxiv.org/abs/2105.07715
Abstract: Accurate segmentation of brain tumors from multi-modal Magnetic Resonance (MR) images is essential in brain tumor diagnosis and treatment. However, due to the existence of domain shifts among different modalities, the performance of networks decreases dramatically when training on one modality and performing on another, e.g., training on T1 images while testing on T2 images, which is often required in clinical applications. This also prohibits a network from being trained on labeled data and then transferred to unlabeled data from a different domain. To overcome this, unsupervised domain adaptation (UDA) methods provide effective solutions to alleviate the domain shift between labeled source data and unlabeled target data. In this paper, we propose a novel Bidirectional Global-to-Local (BiGL) adaptation framework under a UDA scheme. Specifically, a bidirectional image synthesis and segmentation module is proposed to segment the brain tumor using the intermediate data distributions generated for the two domains, which includes an image-to-image translator and a shared-weight segmentation network. Further, a global-to-local consistency learning module is proposed to build robust representation alignments in an integrated way. Extensive experiments on a multi-modal brain MR benchmark dataset demonstrate that the proposed method outperforms several state-of-the-art unsupervised domain adaptation methods by a large margin, while a comprehensive ablation study validates the effectiveness of each key component. The implementation code of our method will be released at https://github.com/KeleiHe/BiGL.
【3】 Voxel-level Siamese Representation Learning for Abdominal Multi-Organ Segmentation
标题:用于腹部多器官分割的体素级暹罗表示学习
作者:Chae Eun Lee,Minyoung Chung,Yeong-Gil Shin
机构: Department of Computer Science and Engineering, Seoul National University; School of Software, Soongsil University
链接:https://arxiv.org/abs/2105.07672
摘要:由于图像标注的局限性，近年来医学图像分割领域的研究者们积极探索各种深度学习结构或目标函数来对体数据中的高级特征进行编码。然而，大多数现有的方法倾向于忽略跨体数据的全局上下文，并在决策空间中定义上下文关系。在这项工作中，我们提出了一种新的体素级连体(Siamese)表示学习方法，用于腹部多器官分割，以改善表示空间。该方法在表示空间中加强了体素特征关系，从而更全面地利用有限的数据集来获得更好的性能。受对比学习最新进展的启发，我们在不使用负样本的情况下，约束同一类中的体素关系，使其投影到同一点。此外，我们还提出了一种多分辨率的上下文聚合方法，该方法将多个隐藏层的特征进行聚合，同时对全局和局部上下文进行编码以进行分割。我们在多器官数据集上的实验在Dice系数上比现有方法提高了2%。表示空间的定性可视化表明，改进主要来自一个解耦的特征空间。
摘要:Recent works in medical image segmentation have actively explored various deep learning architectures or objective functions to encode high-level features from volumetric data owing to limited image annotations. However, most existing approaches tend to ignore cross-volume global context and define context relations in the decision space. In this work, we propose a novel voxel-level Siamese representation learning method for abdominal multi-organ segmentation to improve representation space. The proposed method enforces voxel-wise feature relations in the representation space for leveraging limited datasets more comprehensively to achieve better performance. Inspired by recent progress in contrastive learning, we suppressed voxel-wise relations from the same class to be projected to the same point without using negative samples. Moreover, we introduce a multi-resolution context aggregation method that aggregates features from multiple hidden layers, which encodes both the global and local contexts for segmentation. Our experiments on the multi-organ dataset outperformed the existing approaches by 2% in Dice score coefficient. The qualitative visualizations of the representation spaces demonstrate that the improvements were gained primarily by a disentangled feature space.
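The negative-free, voxel-wise alignment idea could be instantiated as below — a PyTorch sketch based on our reading of the abstract. Class-mean prototypes and cosine distance are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def voxelwise_positive_alignment(feats, labels):
    """Pull voxels of the same organ class toward their class mean in the
    embedding space, without any negative pairs.

    feats:  (N, C) voxel embeddings sampled from a volume
    labels: (N,)   integer organ labels for the sampled voxels
    """
    feats = F.normalize(feats, dim=1)
    loss, classes = 0.0, labels.unique()
    for c in classes:
        members = feats[labels == c]
        center = F.normalize(members.mean(dim=0), dim=0)
        # 1 - cosine similarity to the (detached) class prototype.
        loss = loss + (1 - members @ center.detach()).mean()
    return loss / len(classes)

feats = torch.randn(1024, 64, requires_grad=True)
labels = torch.randint(0, 5, (1024,))
print(voxelwise_positive_alignment(feats, labels))
```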
【4】 Uncertainty in Minimum Cost Multicuts for Image and Motion Segmentation
标题:图像和运动分割中最小代价多路径的不确定性
作者:Amirhossein Kardoost,Margret Keuper
机构:Data and Web Science Group, University of Mannheim, Germany
备注:Accepted in the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021)
链接:https://arxiv.org/abs/2105.07469
摘要:最小代价提升多切分方法在图像分解、网格分割、多目标跟踪和运动分割等领域有着广泛的应用。它在一个基于图的模型中解决了这样的问题,在这个模型中,实体之间的边被分配了实值代价,使得最小割将图分解成一个最优的段数。在最小成本多切口概率公式的驱动下,我们提供了优化过程中决策不确定性的度量。我们认为,获取这些不确定性对于许多实际应用至关重要,并在图像分解(BSDS-500)和运动分割(DAVIS2016和FBMS59)的背景下,根据信息变化(VI)和兰德指数(RI)对三种不同的、广泛使用的数据集进行稀疏化评估。
摘要:The minimum cost lifted multicut approach has proven practically good performance in a wide range of applications such as image decomposition, mesh segmentation, multiple object tracking, and motion segmentation. It addresses such problems in a graph-based model, where real-valued costs are assigned to the edges between entities such that the minimum cut decomposes the graph into an optimal number of segments. Driven by a probabilistic formulation of minimum cost multicuts, we provide a measure for the uncertainties of the decisions made during the optimization. We argue that access to such uncertainties is crucial for many practical applications and conduct an evaluation by means of sparsifications on three different, widely used datasets in the context of image decomposition (BSDS-500) and motion segmentation (DAVIS2016 and FBMS59) in terms of variation of information (VI) and Rand index (RI).
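Under the probabilistic formulation mentioned above, edge costs are commonly derived as log-odds of join probabilities; one plausible per-edge uncertainty measure (an illustration, not necessarily the paper's exact definition) is the Bernoulli entropy of the implied probability.

```python
import numpy as np

def edge_cut_uncertainty(costs):
    """Interpret a real-valued edge cost c_e as the log-odds of the edge
    being joined, c_e = log(p_e / (1 - p_e)), and return the implied
    probability and its Bernoulli entropy as an uncertainty measure.
    """
    p_join = 1.0 / (1.0 + np.exp(-np.asarray(costs, dtype=float)))
    p = np.clip(p_join, 1e-12, 1 - 1e-12)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return p_join, entropy

p, h = edge_cut_uncertainty([-3.0, -0.1, 0.0, 0.1, 3.0])
print(np.round(h, 3))  # highest uncertainty near cost 0
```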
【5】 Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval
标题:面向Zero-Shot图像检索的视觉语义嵌入方法综述
作者:Kazuya Ueki
机构:Department of Information Science, Meisei University, Tokyo, Japan
链接:https://arxiv.org/abs/2105.07391
摘要:视觉语义嵌入是一个有趣的研究课题,因为它适用于各种任务,如视觉问答(VQA)、图像文本检索、图像字幕和场景图生成。在本文中,我们重点研究了以句子为查询的Zero-Shot图像检索,并对该领域的技术发展趋势进行了综述。首先,我们对这项技术的历史进行了全面的概述,首先讨论了图像到文本匹配的早期研究以及这项技术是如何随着时间的推移而发展的。此外,对实验中常用的数据集进行了描述,并对各种方法的评价结果进行了比较。我们还介绍了github上的实现,用于确认实验的准确性和进一步改进。希望本文能鼓励研究者进一步发展对图像和语言的桥梁研究。
摘要:Visual-semantic embedding is an interesting research topic because it is useful for various tasks, such as visual question answering (VQA), image-text retrieval, image captioning, and scene graph generation. In this paper, we focus on zero-shot image retrieval using sentences as queries and present a survey of the technological trends in this area. First, we provide a comprehensive overview of the history of the technology, starting with a discussion of the early studies of image-to-text matching and how the technology has evolved over time. In addition, a description of the datasets commonly used in experiments and a comparison of the evaluation results of each method are presented. We also introduce the implementation available on github for use in confirming the accuracy of experiments and for further improvement. We hope that this survey paper will encourage researchers to further develop their research on bridging images and languages.
【6】 Mask-Guided Discovery of Semantic Manifolds in Generative Models
标题:生成模型中掩码引导的语义流形发现
作者:Mengyu Yang,David Rokeby,Xavier Snelgrove
机构:BMO Lab for Creative Research, University of Toronto
备注:In the 4th Workshop on Machine Learning for Creativity and Design at NeurIPS 2020, Vancouver, Canada
链接:https://arxiv.org/abs/2105.07273
摘要:生成性对抗网络(GANs)领域的进步导致了能够产生惊人逼真图像的架构,例如StyleGAN2,当在FFHQ数据集上训练时,它从低维潜在空间中的随机向量生成人脸图像。不幸的是,这个空间是纠缠的-沿着它的轴平移一个潜在的向量并不对应于输出空间中的一个有意义的变换(例如,微笑的嘴,眯着的眼睛)。该模型的行为就像一个黑匣子,既不提供对其输出的控制,也不提供对它从数据中学到的结构的洞察。我们提出了一种方法来探索流形变化的空间局部区域的脸。我们的方法沿着这些流形发现了适合于创建动画的平滑变化的潜在向量序列。与现有的需要标记数据或显式改变内部模型参数的解纠缠方法不同,我们的方法是一种基于优化的方法,由自定义损失函数和手动定义的变化区域引导。我们的代码是开源的,可以在我们的项目页面上找到这些代码以及补充结果:https://github.com/bmolab/masked-gan-manifold
摘要:Advances in the realm of Generative Adversarial Networks (GANs) have led to architectures capable of producing amazingly realistic images such as StyleGAN2, which, when trained on the FFHQ dataset, generates images of human faces from random vectors in a lower-dimensional latent space. Unfortunately, this space is entangled - translating a latent vector along its axes does not correspond to a meaningful transformation in the output space (e.g., smiling mouth, squinting eyes). The model behaves as a black box, providing neither control over its output nor insight into the structures it has learned from the data. We present a method to explore the manifolds of changes of spatially localized regions of the face. Our method discovers smoothly varying sequences of latent vectors along these manifolds suitable for creating animations. Unlike existing disentanglement methods that either require labelled data or explicitly alter internal model parameters, our method is an optimization-based approach guided by a custom loss function and manually defined region of change. Our code is open-sourced, which can be found, along with supplementary results, on our project page: https://github.com/bmolab/masked-gan-manifold
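A toy version of this optimization-based scheme is sketched below, with a dummy generator standing in for StyleGAN2 so the snippet runs, and an illustrative loss (the paper's actual custom loss differs).

```python
import torch
import torch.nn as nn

# Dummy generator standing in for StyleGAN2 (assumption for the sketch).
G = nn.Sequential(nn.Linear(64, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))

z = torch.randn(1, 64)
delta = torch.zeros_like(z, requires_grad=True)
mask = torch.zeros(1, 3, 32, 32)
mask[..., 8:24, 8:24] = 1.0            # manually defined region of change
opt = torch.optim.Adam([delta], lr=0.05)

base = G(z).detach()
for _ in range(100):
    img = G(z + delta)
    change_in = ((img - base) * mask).pow(2).mean()
    change_out = ((img - base) * (1 - mask)).pow(2).mean()
    # A plausible custom loss: encourage change inside the mask while
    # keeping the rest of the image fixed (the paper's exact loss differs).
    loss = -change_in + 10.0 * change_out + 0.01 * delta.pow(2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```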
【7】 Aerial-PASS: Panoramic Annular Scene Segmentation in Drone Videos
标题:空中通道:无人机视频中的全景环形场景分割
作者:Lei Sun,Jia Wang,Kailun Yang,Kaikai Wu,Xiangdong Zhou,Kaiwei Wang,Jian Bai
机构: State Key Laboratory of Optical Instrumentation, Zhejiang University; Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology
备注:Our dataset will be made publicly available at: this http URL
链接:https://arxiv.org/abs/2105.07209
摘要:无人机(UAVs)的一项重要任务是对周围环境进行空中像素级的场景感知。以往的研究工作主要采用传统的针孔相机或鱼眼相机作为成像器件。然而，这些成像系统不能同时实现大视场、小尺寸和轻量化。为此，我们设计了一种具有体积小、重量轻、360度环形视场的全景环形透镜(PAL)无人机系统。设计了一种轻量级全景环形语义分割神经网络模型，实现了高精度、实时的场景解析。此外，我们提出了第一个无人机视角的全景场景分割数据集Aerial-PASS，带有跑道、场地及其他类别的标注标签。综合实验表明，所设计的系统在空中全景场景解析中取得了令人满意的效果。特别是，我们提出的模型在分割性能和推理速度之间取得了很好的折衷，并在公开街道场景和我们建立的空中场景数据集上进行了验证。
摘要:Aerial pixel-wise scene perception of the surrounding environment is an important task for UAVs (Unmanned Aerial Vehicles). Previous research works mainly adopt conventional pinhole cameras or fisheye cameras as the imaging device. However, these imaging systems cannot achieve large Field of View (FoV), small size, and lightweight at the same time. To this end, we design a UAV system with a Panoramic Annular Lens (PAL), which has the characteristics of small size, low weight, and a 360-degree annular FoV. A lightweight panoramic annular semantic segmentation neural network model is designed to achieve high-accuracy and real-time scene parsing. In addition, we present the first drone-perspective panoramic scene segmentation dataset Aerial-PASS, with annotated labels of track, field, and others. A comprehensive variety of experiments shows that the designed system performs satisfactorily in aerial panoramic scene parsing. In particular, our proposed model strikes an excellent trade-off between segmentation performance and inference speed, as validated on both public street-scene and our established aerial-scene datasets.
【8】 Cross-Modal Progressive Comprehension for Referring Segmentation
标题:指代分词的跨模态递进理解
作者:Si Liu,Tianrui Hui,Shaofei Huang,Yunchao Wei,Bo Li,Guanbin Li
机构: Institute of Information Engineering, and School of Cyber Security, University of Chinese Academy of Sciences
备注:Accepted by TPAMI 2021
链接:https://arxiv.org/abs/2105.07175
摘要:给定一个自然语言表达式和一个图像/视频,引用分割的目标是产生由表达式主体描述的实体的像素级掩码。以往的方法都是通过隐式的特征交互和视觉与语言形式的融合来解决这个问题。然而,人类倾向于根据表达式中的信息词逐步解决指称问题,即首先粗略定位候选实体,然后区分目标实体。本文提出了一种有效模拟人类行为的跨模式渐进理解(CMPC)方案,并将其实现为CMPC-I(图像)模块和CMPC-V(视频)模块,以改进参考图像和视频分割模型。对于图像数据,我们的CMPC-I模块首先使用实体词和属性词来感知表达式可能考虑的所有相关实体。然后,通过空间图推理,利用关系词突出目标实体,同时抑制其他无关实体。对于视频数据,我们的CMPC-V模块进一步利用基于CMPC-I的动作词,通过时态图推理突出与动作线索匹配的正确实体。除了CMPC之外,我们还引入了一个简单而有效的文本引导特征交换(TGFE)模块,在文本信息的引导下,将视觉主干中对应于不同层次的合理多模态特征进行集成。这样,多层次特征就可以在文本语境的基础上相互交流、相互提炼。将CMPC-I或CMPC-V与TGFE相结合,可以形成我们的图像或视频版本参考分割框架,并且我们的框架分别在四个参考图像分割基准和三个参考视频分割基准上取得了最新的性能。
摘要:Given a natural language expression and an image/video, the goal of referring segmentation is to produce the pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem by implicit feature interaction and fusion between visual and linguistic modalities in a one-stage manner. However, human tends to solve the referring problem in a progressive manner based on informative words in the expression, i.e., first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, the relational words are adopted to highlight the target entity as well as suppress other irrelevant ones by spatial graph reasoning. For video data, our CMPC-V module further exploits action words based on CMPC-I to highlight the correct entity matched with the action cues by temporal graph reasoning. In addition to the CMPC, we also introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module to integrate the reasoned multimodal features corresponding to different levels in the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE can form our image or video version referring segmentation frameworks and our frameworks achieve new state-of-the-art performances on four referring image segmentation benchmarks and three referring video segmentation benchmarks respectively.
【9】 Momentum Contrastive Voxel-wise Representation Learning for Semi-supervised Volumetric Medical Image Segmentation
标题:动量对比体素表示学习在半监督体医学图像分割中的应用
作者:Chenyu You,Ruihan Zhao,Lawrence Staib,James S. Duncan
机构:Yale University, University of California, Berkeley
链接:https://arxiv.org/abs/2105.07059
摘要:医学图像分析中的自动分割是一项具有挑战性的任务,需要大量的人工标注数据。然而,人工标注医学数据往往是一项艰巨的任务,而现有的基于学习的方法大多无法在没有有效几何约束的情况下准确地描绘物体的边界。对比学习是自监督学习的一个分支,近年来在多个领域得到了广泛的应用。在这项工作中,我们提出了一种新的基于几何约束的对比体素表示学习(CVRL)方法来学习有限标注的体医学图像分割的全局局部视觉表示。该框架通过获取三维空间背景和丰富的解剖信息,可以有效地学习全局和局部特征。具体地说,我们引入了一种体素到体素的对比算法来从三维图像中学习全局信息,并提出了局部体素到体素的对比来明确地利用嵌入空间中的局部线索。此外,我们整合了一个基于弹性交互的活动轮廓模型作为几何正则化项,以端到端的学习方式实现快速可靠的目标描绘。在心房分割挑战数据集上的实验结果表明了该方法的优越性,特别是在注释数据量非常有限的情况下。
摘要:Automated segmentation in medical image analysis is a challenging task that requires a large amount of manually labeled data. However, manually annotating medical data is often laborious, and most existing learning-based approaches fail to accurately delineate object boundaries without effective geometric constraints. Contrastive learning, a sub-area of self-supervised learning, has recently been noted as a promising direction in multiple application fields. In this work, we present a novel Contrastive Voxel-wise Representation Learning (CVRL) method with geometric constraints to learn global-local visual representations for volumetric medical image segmentation with limited annotations. Our framework can effectively learn global and local features by capturing 3D spatial context and rich anatomical information. Specifically, we introduce a voxel-to-volume contrastive algorithm to learn global information from 3D images, and propose to perform local voxel-to-voxel contrast to explicitly make use of local cues in the embedding space. Moreover, we integrate an elastic interaction-based active contour model as a geometric regularization term to enable fast and reliable object delineations in an end-to-end learning manner. Results on the Atrial Segmentation Challenge dataset demonstrate superiority of our proposed scheme, especially in a setting with a very limited number of annotated data.
【10】 DFENet: A Novel Dimension Fusion Edge Guided Network for Brain MRI Segmentation
标题:DFENet:一种新的用于脑MRI分割的维度融合边缘引导网络
作者:Hritam Basak,Rukhshanda Hussain,Ajay Rana
备注:Submitted at SN Computer Science
链接:https://arxiv.org/abs/2105.07962
摘要:近几年来,脑卒中发病率的迅速增加,已成为快速、准确地从脑MRI图像中分割脑卒中病灶的驱动力。随着近年来深度学习的发展,计算机辅助和分割缺血性脑卒中病变的方法已成为临床医生早期诊断和治疗规划的有效手段。然而,这些方法大多存在分割结果不准确和不可靠的问题,因为它们无法从MRI图像中获取足够的上下文特征。为了满足这些要求,人们提出了三维卷积神经网络,但其计算量很大。为了缓解这些问题,我们提出了一种融合二维和三维cnn特征的新的维度融合边缘引导网络(DFENet)。与其他方法不同的是,我们提出的网络使用了一个并行部分解码器(PPD)模块来聚集和上采样所选的特征,其中包含丰富的重要上下文信息。此外,我们使用边缘引导和增强的混合损失来不断监督和改进网络的学习过程。在公开的脑卒中后病变解剖追踪(ATLAS)数据集上对该方法进行了评价,得出平均DSC、IoU、精确度和召回值分别为0.5457、0.4015、0.6371和0.4969。当与其他最先进的方法相比较时,这个结果比它们好很多。因此,该模型具有鲁棒性强、精度高、优于现有方法等优点,可用于生物医学领域。
摘要:The rapid increment of morbidity of brain stroke in the last few years have been a driving force towards fast and accurate segmentation of stroke lesions from brain MRI images. With the recent development of deep-learning, computer-aided and segmentation methods of ischemic stroke lesions have been useful for clinicians in early diagnosis and treatment planning. However, most of these methods suffer from inaccurate and unreliable segmentation results because of their inability to capture sufficient contextual features from the MRI volumes. To meet these requirements, 3D convolutional neural networks have been proposed, which, however, suffer from huge computational requirements. To mitigate these problems, we propose a novel Dimension Fusion Edge-guided network (DFENet) that can meet both of these requirements by fusing the features of 2D and 3D CNNs. Unlike other methods, our proposed network uses a parallel partial decoder (PPD) module for aggregating and upsampling selected features, rich in important contextual information. Additionally, we use an edge-guidance and enhanced mixing loss for constantly supervising and improvising the learning process of the network. The proposed method is evaluated on publicly available Anatomical Tracings of Lesions After Stroke (ATLAS) dataset, resulting in mean DSC, IoU, Precision and Recall values of 0.5457, 0.4015, 0.6371, and 0.4969 respectively. The results, when compared to other state-of-the-art methods, outperforms them by a significant margin. Therefore, the proposed model is robust, accurate, superior to the existing methods, and can be relied upon for biomedical applications.
【11】 MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation
标题:MSRF-Net:一种用于生物医学图像分割的多尺度残差融合网络
作者:Abhishek Srivastava,Debesh Jha,Sukalpa Chanda,Umapada Pal,Håvard D. Johansen,Dag Johansen,Michael A. Riegler,Sharib Ali,Pål Halvorsen
机构: UiT The Arctic University of Norway
链接:https://arxiv.org/abs/2105.07451
摘要:基于卷积神经网络的方法提高了生物医学图像分割的性能。然而,这些方法大多不能有效地分割大小可变的对象,也不能在生物医学用例中常见的小而有偏差的数据集上进行训练。虽然已有的方法结合了多尺度融合方法来解决可变尺寸带来的挑战,但它们通常使用更适合于一般语义分割计算机视觉问题的复杂模型。本文提出了一种专门用于医学图像分割的MSRF-Net结构。提出的MSRF网络能够利用双尺度密集融合块(DSDF)交换不同感受野的多尺度特征。我们的DSDF块可以在两个不同分辨率的尺度上严格交换信息,MSRF子网使用多个DSDF块依次进行多尺度融合。这允许保留分辨率、改进信息流以及传播高级和低级特征以获得准确的分割图。提出的MSRF网络可以捕获对象的变化,并在不同的生物医学数据集上提供改进的结果。在MSRF网络上进行的大量实验表明,该方法的性能优于目前最先进的医学图像分割方法。MSRF-Net提高了在四个公开数据集上的性能,并且与最新的方法相比,MSRF-Net更具通用性。
摘要:Methods based on convolutional neural networks have improved the performance of biomedical image segmentation. However, most of these methods cannot efficiently segment objects of variable sizes and train on small and biased datasets, which are common in biomedical use cases. While methods exist that incorporate multi-scale fusion approaches to address the challenges arising with variable sizes, they usually use complex models that are more suitable for general semantic segmentation computer vision problems. In this paper, we propose a novel architecture called MSRF-Net, which is specially designed for medical image segmentation tasks. The proposed MSRF-Net is able to exchange multi-scale features of varying receptive fields using a dual-scale dense fusion block (DSDF). Our DSDF block can exchange information rigorously across two different resolution scales, and our MSRF sub-network uses multiple DSDF blocks in sequence to perform multi-scale fusion. This allows the preservation of resolution, improved information flow, and propagation of both high- and low-level features to obtain accurate segmentation maps. The proposed MSRF-Net allows to capture object variabilities and provides improved results on different biomedical datasets. Extensive experiments on MSRF-Net demonstrate that the proposed method outperforms most of the cutting-edge medical image segmentation state-of-the-art methods. MSRF-Net advances the performance on four publicly available datasets, and also, MSRF-Net is more generalizable as compared to state-of-the-art methods.
Zero/Few Shot|迁移|域适配|自适应(4篇)
【1】 Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation
标题:学习深度和语义之间的关联以实现无监督领域自适应
作者:Suman Saha,Anton Obukhov,Danda Pani Paudel,Menelaos Kanakis,Yuhua Chen,Stamatios Georgoulis,Luc Van Gool
机构:ETH Zurich, KU Leuven
备注:Accepted at CVPR 2021
链接:https://arxiv.org/abs/2105.07830
摘要:我们提出了一种编码视觉任务关系的方法，以提高模型在无监督领域适应(UDA)设置下的性能。语义分割和单目深度估计是相辅相成的任务;在多任务学习环境中，对它们之间的关系进行适当的编码可以进一步提高两个任务的性能。基于这种观察，我们提出了一种新的跨任务关系层(CTRL)，它编码了语义预测和深度预测之间的任务依赖关系。为了捕捉跨任务之间的关系，我们提出了一种包含特定任务和跨任务细化头的神经网络结构。此外，我们提出了一种迭代自学习(ISL)训练方案，该方案利用语义伪标签对目标域进行额外的监督。我们在实验中观察到两个任务的表现都有所改善，因为这些任务中存在的互补信息被更好地捕获。具体地说，我们的研究表明:(1)当所有任务互补且相互依赖时，我们的方法可提高它们的性能;(2)CTRL有助于在具有挑战性的UDA环境中同时提高语义分割和深度估计任务的性能;(3)提出的ISL训练方案进一步提高了语义分割的性能。实现代码可在 https://github.com/susaha/ctrl-uda 获取。
摘要:We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting. Semantic segmentation and monocular depth estimation are shown to be complementary tasks; in a multi-task learning setting, a proper encoding of their relationships can further improve performance on both tasks. Motivated by this observation, we propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions. To capture the cross-task relationships, we propose a neural network architecture that contains task-specific and cross-task refinement heads. Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain. We experimentally observe improvements in both tasks' performance because the complementary information present in these tasks is better captured. Specifically, we show that: (1) our approach improves performance on all tasks when they are complementary and mutually dependent; (2) the CTRL helps to improve both semantic segmentation and depth estimation tasks performance in the challenging UDA setting; (3) the proposed ISL training scheme further improves the semantic segmentation performance. The implementation is available at https://github.com/susaha/ctrl-uda.
【2】 Is In-Domain Data Really Needed? A Pilot Study on Cross-Domain Calibration for Network Quantization
标题:真的需要域内数据吗?网络量化跨域校准的初步研究
作者:Haichao Yu,Linjie Yang,Humphrey Shi
机构:University of Illinois at Urbana-Champaign, ByteDance Inc.
链接:https://arxiv.org/abs/2105.07331
摘要:训练后量化方法使用一组校准数据来计算网络参数和激活的量化范围。校准数据通常来自训练数据集,由于数据的敏感性,这些数据集可能无法访问。在这项工作中,我们要研究这样一个问题:在不知道原始数据集的情况下,是否可以使用域外数据来校准训练好的网络?具体来说,我们超越了自然图像的领域,包括了截然不同的领域,如X射线图像、卫星图像和超声波图像。我们发现跨域校正使得量化模型在13个不同的校正数据集的10个不同图像域的任务上具有惊人的稳定性能。我们还发现,量化模型的性能与源域和校准域之间的Gram矩阵的相似性有关,这可以作为选择更好性能的校准集的标准。我们相信我们的研究打开了一扇大门,借用跨领域的知识进行网络量化和压缩。
摘要:Post-training quantization methods use a set of calibration data to compute quantization ranges for network parameters and activations. The calibration data usually comes from the training dataset which could be inaccessible due to sensitivity of the data. In this work, we want to study such a problem: can we use out-of-domain data to calibrate the trained networks without knowledge of the original dataset? Specifically, we go beyond the domain of natural images to include drastically different domains such as X-ray images, satellite images and ultrasound images. We find cross-domain calibration leads to surprisingly stable performance of quantized models on 10 tasks in different image domains with 13 different calibration datasets. We also find that the performance of quantized models is correlated with the similarity of the Gram matrices between the source and calibration domains, which can be used as a criterion to choose calibration set for better performance. We believe our research opens the door to borrow cross-domain knowledge for network quantization and compression.
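The Gram-matrix criterion can be sketched as follows: compute channel Gram matrices of layer activations for the source and a candidate calibration domain, then compare them by cosine similarity. The layer choice and normalization are assumptions.

```python
import numpy as np

def gram_similarity(feats_a, feats_b):
    """Cosine similarity between channel-correlation (Gram) matrices of
    two feature batches; a higher value suggests a better calibration set.

    feats_*: (N, C, H, W) activations from some network layer
    """
    def gram(f):
        n, c = f.shape[0], f.shape[1]
        f = f.reshape(n, c, -1)
        # Average over samples and spatial positions -> (C, C) Gram matrix.
        g = np.einsum('ncl,nkl->ck', f, f) / (n * f.shape[2])
        return g / (np.linalg.norm(g) + 1e-12)

    ga, gb = gram(feats_a), gram(feats_b)
    return float((ga * gb).sum())

sim = gram_similarity(np.random.rand(8, 16, 14, 14),
                      np.random.rand(8, 16, 14, 14))
print(sim)
```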
【3】 MutualNet: Adaptive ConvNet via Mutual Learning from Different Model Configurations
标题:MutualNet:基于不同模型配置相互学习的自适应ConvNet
作者:Taojiannan Yang,Sijie Zhu,Matias Mendieta,Pu Wang,Ravikumar Balakrishnan,Minwoo Lee,Tao Han,Mubarak Shah,Chen Chen
机构: Department of Computer Science, University of North Carolina at Charlotte
备注:Extended version of arXiv:1909.12978. More experiments on 3D networks (SlowFast, X3D) and analyses on training cost
链接:https://arxiv.org/abs/2105.07085
摘要:现有的深度神经网络大多是静态的，这意味着它们只能以固定的复杂度进行推理。但不同设备的资源预算可能会有很大差异。即使是在一台设备上，可负担的预算也会随着不同的场景而变化，而且为每个所需的预算反复训练网络将非常昂贵。因此，在这项工作中，我们提出了一种通用的方法称为MutualNet来训练一个可以在不同资源约束下运行的单一网络。我们的方法训练了一组具有不同网络宽度和输入分辨率的模型配置。这种互学习方案不仅可以使模型在不同的宽度分辨率配置下运行，而且可以在这些配置之间传递独特的知识，帮助模型从整体上学习更强的表示。MutualNet是一种通用的训练方法，可应用于各种网络结构(例如，2D网络:MobileNets、ResNet、3D网络:SlowFast、X3D)和各种任务(例如，图像分类、目标检测、分割和动作识别)，并被证明可在各种数据集上实现一致的改进。由于我们只训练一次模型，与单独训练多个模型相比，大大降低了训练成本。令人惊讶的是，如果不考虑动态资源约束，MutualNet还可以用来显著提高单个网络的性能。总之，MutualNet是静态和自适应、二维和三维网络的统一方法。代码和预训练模型可从 \url{https://github.com/taoyang1122/MutualNet} 获得。
摘要:Most existing deep neural networks are static, which means they can only do inference at a fixed complexity. But the resource budget can vary substantially across different devices. Even on a single device, the affordable budget can change with different scenarios, and repeatedly training networks for each required budget would be incredibly expensive. Therefore, in this work, we propose a general method called MutualNet to train a single network that can run at a diverse set of resource constraints. Our method trains a cohort of model configurations with various network widths and input resolutions. This mutual learning scheme not only allows the model to run at different width-resolution configurations but also transfers the unique knowledge among these configurations, helping the model to learn stronger representations overall. MutualNet is a general training methodology that can be applied to various network structures (e.g., 2D networks: MobileNets, ResNet, 3D networks: SlowFast, X3D) and various tasks (e.g., image classification, object detection, segmentation, and action recognition), and is demonstrated to achieve consistent improvements on a variety of datasets. Since we only train the model once, it also greatly reduces the training cost compared to independently training several models. Surprisingly, MutualNet can also be used to significantly boost the performance of a single network, if dynamic resource constraint is not a concern. In summary, MutualNet is a unified method for both static and adaptive, 2D and 3D networks. Codes and pre-trained models are available at \url{https://github.com/taoyang1122/MutualNet}.
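A much-simplified sketch of the mutual learning scheme, covering only the input-resolution axis (the paper additionally varies network width, which this sketch omits): the full-resolution prediction supervises lower-resolution passes of the same network via a KL term.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(model, images, targets, resolutions=(224, 192, 160)):
    """One simplified MutualNet-style training step: ground-truth loss at
    full resolution, soft-target KL losses at lower resolutions.
    """
    full = model(F.interpolate(images, size=resolutions[0],
                               mode='bilinear', align_corners=False))
    loss = F.cross_entropy(full, targets)
    soft = full.detach().softmax(dim=1)
    for r in resolutions[1:]:
        sub = model(F.interpolate(images, size=r, mode='bilinear',
                                  align_corners=False))
        loss = loss + F.kl_div(sub.log_softmax(dim=1), soft,
                               reduction='batchmean')
    return loss

# Toy usage with a resolution-agnostic classifier.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3),
                            torch.nn.AdaptiveAvgPool2d(1),
                            torch.nn.Flatten(), torch.nn.Linear(8, 10))
loss = mutual_learning_step(model, torch.randn(4, 3, 224, 224),
                            torch.randint(0, 10, (4,)))
loss.backward()
```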
【4】 Learning a Universal Template for Few-shot Dataset Generalization
标题:学习用于少发数据集综合的通用模板
作者:Eleni Triantafillou,Hugo Larochelle,Richard Zemel,Vincent Dumoulin
机构: University of Toronto, Vector Institute; Google Research
链接:https://arxiv.org/abs/2105.07029
摘要:Few-Shot数据集的泛化是一个具有挑战性的变种,研究了Few-Shot分类问题,其中给出了多个数据集的不同训练集,目的是训练一个可适应的模型,然后可以从新的数据集学习类只使用几个例子。为此,我们建议利用不同的训练集来构造一个通用模板:一个局部模型,通过插入适当的组件,可以定义一系列数据集专用模型。因此,对于每个新的少数镜头分类问题,我们的方法只需要推断少量的参数就可以插入到通用模板中。我们设计了一个单独的网络,为每个给定的任务生成这些参数的初始化,然后通过几个步骤的梯度下降来微调其建议的初始化。与以前的方法相比,我们的方法具有更高的参数效率、可扩展性和适应性,并且在具有挑战性的元数据集基准上达到了最先进的水平。
摘要:Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem where a diverse training set of several datasets is given, for the purpose of training an adaptable model that can then learn classes from new datasets using only a few examples. To this end, we propose to utilize the diverse training set to construct a universal template: a partial model that can define a wide array of dataset-specialized models, by plugging in appropriate components. For each new few-shot classification problem, our approach therefore only requires inferring a small number of parameters to insert into the universal template. We design a separate network that produces an initialization of those parameters for each given task, and we then fine-tune its proposed initialization via a few steps of gradient descent. Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves the state-of-the-art on the challenging Meta-Dataset benchmark.
半弱无监督|主动学习|不确定性(10篇)
【1】 Divide and Contrast: Self-supervised Learning from Uncurated Data
标题:区别与对比:从未经整理的数据中进行自我监督学习
作者:Yonglong Tian,Olivier J. Henaff,Aaron van den Oord
机构:MIT; DeepMind
链接:https://arxiv.org/abs/2105.08054
摘要:自监督学习在利用大量未标记数据方面有希望,但迄今为止,它的大部分进展仅限于高度管理的预训练数据,如ImageNet。我们研究了对比学习在较大的、不太精确的图像数据集(如YFCC)中的效果,发现在结果的表示质量上确实存在很大的差异。我们假设,这一策展差距是由于图像类别分布的变化造成的——图像类别更为多样化和重尾——导致可供学习的相关负样本较少。我们用一种新的方法来检验这一假设,即区分与对比(Divide-and-Contrast,DnC),这种方法在对比学习和基于聚类的硬负挖掘之间交替进行。当对不太精确的数据集进行预训练时,DnC极大地提高了下游任务的自监督学习性能,同时在精确的数据集上与当前最先进的技术保持竞争。
摘要:Self-supervised learning holds promise in leveraging large amounts of unlabeled data, however much of its progress has thus far been limited to highly curated pre-training data such as ImageNet. We explore the effects of contrastive learning from larger, less-curated image datasets such as YFCC, and find there is indeed a large difference in the resulting representation quality. We hypothesize that this curation gap is due to a shift in the distribution of image classes -- which is more diverse and heavy-tailed -- resulting in less relevant negative samples to learn from. We test this hypothesis with a new approach, Divide and Contrast (DnC), which alternates between contrastive learning and clustering-based hard negative mining. When pretrained on less curated datasets, DnC greatly improves the performance of self-supervised learning on downstream tasks, while remaining competitive with the current state-of-the-art on curated datasets.
【2】 Traffic Scenario Clustering by Iterative Optimisation of Self-Supervised Networks Using a Random Forest Activation Pattern Similarity
标题:基于随机森林激活模式相似度的自监督网络迭代优化交通场景聚类
作者:Lakshman Balasubramanian,Jonas Wurst,Michael Botsch,Ke Deng
备注:Accepted for IEEE Intelligent Vehicles 2021
链接:https://arxiv.org/abs/2105.07639
摘要:交通场景分类是自动驾驶的重要组成部分，例如用于运动规划算法及其验证。无需手工步骤即可发现新的相关场景，可大大减少开发自动驾驶所需的资源。在这项工作中，提出了一种方法来解决这一挑战，通过引入一种基于新的数据自适应相似性度量的聚类技术，称为随机森林激活模式(RFAP)相似性。RFAP相似度是在随机森林算法中使用树编码方案生成的。本文提出的聚类方法考虑了有标记场景的存在，标记场景中的信息有助于指导未标记场景的聚类。它包括三个步骤。首先，使用一个定义的自监督目标，在所有可用的交通场景上训练一个自监督卷积神经网络(CNN)。第二，对CNN进行微调以分类有标记场景。第三，使用标记和未标记的场景，对聚类进行迭代优化。在迭代优化的第三步的每个轮次中，CNN被用作无监督随机森林的特征生成器。训练后的森林反过来提供RFAP相似性，以迭代地调整由CNN实现的特征生成过程。在highD数据集上进行了大量的实验和消融研究。与基线聚类技术相比，该方法具有更好的性能。
摘要:Traffic scenario categorisation is an essential component of automated driving, e.g., in motion planning algorithms and their validation. Finding new relevant scenarios without handcrafted steps reduces the required resources for the development of autonomous driving dramatically. In this work, a method is proposed to address this challenge by introducing a clustering technique based on a novel data-adaptive similarity measure, called Random Forest Activation Pattern (RFAP) similarity. The RFAP similarity is generated using a tree encoding scheme in a Random Forest algorithm. The clustering method proposed in this work takes into account that there are labelled scenarios available and the information from the labelled scenarios can help to guide the clustering of unlabelled scenarios. It consists of three steps. First, a self-supervised Convolutional Neural Network (CNN) is trained on all available traffic scenarios using a defined self-supervised objective. Second, the CNN is fine-tuned for classification of the labelled scenarios. Third, using the labelled and unlabelled scenarios an iterative optimisation procedure is performed for clustering. In the third step at each epoch of the iterative optimisation, the CNN is used as a feature generator for an unsupervised Random Forest. The trained forest, in turn, provides the RFAP similarity to adapt iteratively the feature generation process implemented by the CNN. Extensive experiments and ablation studies have been done on the highD dataset. The proposed method shows superior performance compared to baseline clustering techniques.
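The RFAP similarity could be sketched with scikit-learn as follows: encode each sample by the leaf it activates in every tree, and measure similarity between two samples as the fraction of trees in which they share a leaf. This is our reading of the abstract, not the authors' exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rfap_similarity(forest, X):
    """Pairwise leaf-sharing similarity under a trained Random Forest."""
    leaves = forest.apply(X)                   # (n_samples, n_trees)
    same = leaves[:, None, :] == leaves[None, :, :]
    return same.mean(axis=2)                   # (n_samples, n_samples)

X = np.random.rand(50, 10)
y = np.random.randint(0, 3, 50)
forest = RandomForestClassifier(n_estimators=100).fit(X, y)
S = rfap_similarity(forest, X)
print(S.shape, S[0, 0])  # (50, 50), 1.0 on the diagonal
```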
【3】 Prototype-supervised Adversarial Network for Targeted Attack of Deep Hashing
标题:用于深度散列定向攻击的原型监督对抗网络
作者:Xunguang Wang,Zheng Zhang,Baoyuan Wu,Fumin Shen,Guangming Lu
机构:Harbin Institute of Technology, Shenzhen,Peng Cheng Laboratory, School of Data Science, The Chinese University of Hong Kong, Shenzhen, Secure Computing Lab of Big Data, Shenzhen Research Institute of Big Data
备注:This paper has been accepted by CVPR 2021, and the related codes could be available at this https URL
链接:https://arxiv.org/abs/2105.07553
摘要:由于其强大的表示学习能力和高效的计算能力,深度哈希算法在大规模图像检索中取得了重大进展。然而,深度散列网络容易受到攻击,这是一个实际的安全问题,但在基于散列的检索领域研究较少。在本文中,我们提出了一个新的原型监督对抗网络(ProS-GAN),它为高效和有效的目标哈希攻击提供了一个灵活的生成体系结构。据我们所知,这是第一代攻击深度哈希网络的方法。通常,我们提出的框架由三部分组成,即原型网、生成器和鉴别器。具体来说,所设计的原型网将目标标签嵌入到语义表示中,并学习原型代码作为目标标签的类别级表示。同时,将语义表示和原始图像联合输入到生成器中,实现灵活的目标攻击。特别地,采用原型码来监督生成器通过最小化敌方实例的hash码与原型码之间的Hamming距离来构造目标敌方实例。此外,生成器反对鉴别器,以同时鼓励对抗性示例的视觉真实性和语义表示的信息性。大量实验证明,该框架能有效地生成对抗性的攻击实例,比现有的深度散列攻击方法具有更好的攻击性能和可转移性。相关代码可在https://github.com/xunguangwang/ProS-GAN .
摘要:Due to its powerful capability of representation learning and high-efficiency computation, deep hashing has made significant progress in large-scale image retrieval. However, deep hashing networks are vulnerable to adversarial examples, which is a practical secure problem but seldom studied in hashing-based retrieval field. In this paper, we propose a novel prototype-supervised adversarial network (ProS-GAN), which formulates a flexible generative architecture for efficient and effective targeted hashing attack. To the best of our knowledge, this is the first generation-based method to attack deep hashing networks. Generally, our proposed framework consists of three parts, i.e., a PrototypeNet, a generator, and a discriminator. Specifically, the designed PrototypeNet embeds the target label into the semantic representation and learns the prototype code as the category-level representative of the target label. Moreover, the semantic representation and the original image are jointly fed into the generator for a flexible targeted attack. Particularly, the prototype code is adopted to supervise the generator to construct the targeted adversarial example by minimizing the Hamming distance between the hash code of the adversarial example and the prototype code. Furthermore, the generator is against the discriminator to simultaneously encourage the adversarial examples visually realistic and the semantic representation informative. Extensive experiments verify that the proposed framework can efficiently produce adversarial examples with better targeted attack performance and transferability over state-of-the-art targeted attack methods of deep hashing. The related codes could be available at https://github.com/xunguangwang/ProS-GAN .
【4】 Semi-supervised Contrastive Learning with Similarity Co-calibration
标题:相似共校准的半监督对比学习
作者:Yuhang Zhang,Xiaopeng Zhang,Robert. C. Qiu,Jie Li,Haohang Xu,Qi Tian
机构:Shanghai Jiao Tong University, Huawei Inc.
链接:https://arxiv.org/abs/2105.07387
摘要:半监督学习是利用大量未标记数据的有效方法。本文提出了一种新的训练策略,称为半监督对比学习(SsCL),它结合了自监督学习中著名的对比损失和半监督学习中的交叉熵损失,并以端到端的方式对两个目标进行联合优化。重点在于,与基于自训练的半监督学习不同的是,SsCL在相同的模型权重上进行预测和再训练,它在两个分支之间交换对未标记数据的预测,从而制定一个联合校正程序,这有利于更好的预测,避免陷入局部极小值。为了实现这一目标,对比损失分支利用交叉熵分支生成的最近邻域,对样本间的相似性进行成对建模,进而用对比相似性校正交叉熵分支的预测分布。我们发现,SsCL产生更多的区别性表征,有利于Few-Shot学习。值得注意的是,在以ResNet50为主干的ImageNet上,SsCL分别以1%和10%的标记样本达到60.2%和72.1%的top-1准确率,显著优于基线,优于以往的半监督和自监督方法。
摘要:Semi-supervised learning acts as an effective way to leverage massive unlabeled data. In this paper, we propose a novel training strategy, termed as Semi-supervised Contrastive Learning (SsCL), which combines the well-known contrastive loss in self-supervised learning with the cross entropy loss in semi-supervised learning, and jointly optimizes the two objectives in an end-to-end way. The highlight is that different from self-training based semi-supervised learning that conducts prediction and retraining over the same model weights, SsCL interchanges the predictions over the unlabeled data between the two branches, and thus formulates a co-calibration procedure, which we find is beneficial for better prediction and avoid being trapped in local minimum. Towards this goal, the contrastive loss branch models pairwise similarities among samples, using the nearest neighborhood generated from the cross entropy branch, and in turn calibrates the prediction distribution of the cross entropy branch with the contrastive similarity. We show that SsCL produces more discriminative representation and is beneficial to few shot learning. Notably, on ImageNet with ResNet50 as the backbone, SsCL achieves 60.2% and 72.1% top-1 accuracy with 1% and 10% labeled samples, respectively, which significantly outperforms the baseline, and is better than previous semi-supervised and self-supervised methods.
【5】 Unsupervised Super-Resolution of Satellite Imagery for High Fidelity Material Label Transfer
标题:用于高保真材料标签传递的卫星影像无监督超分辨率
作者:Arthita Ghosh,Max Ehrlich,Larry Davis,Rama Chellappa
机构:University of Maryland, College Park, MD, USA
备注:Published in the proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium
链接:https://arxiv.org/abs/2105.07322
摘要:遥感图像中的城市物质识别是一个高度相关但极具挑战性的问题,尤其是在低分辨率卫星图像上,由于人类注释的获取非常困难。为此,我们提出了一种基于对抗学习的无监督领域自适应方法。我们的目标是从少量的高分辨率数据(源域)中获取信息,并利用这些数据对低分辨率图像(目标域)进行超分辨率处理。这可能有助于语义以及材料标签从带丰富注释的源到目标域的转移。
摘要:Urban material recognition in remote sensing imagery is a highly relevant, yet extremely challenging problem due to the difficulty of obtaining human annotations, especially on low resolution satellite images. To this end, we propose an unsupervised domain adaptation based approach using adversarial learning. We aim to harvest information from smaller quantities of high resolution data (source domain) and utilize the same to super-resolve low resolution imagery (target domain). This can potentially aid in semantic as well as material label transfer from a richly annotated source to a target domain.
【6】 Mean Shift for Self-Supervised Learning
标题:Mean Shift在自监督学习中的应用
作者:Soroush Abbasi Koohpayegani,Ajinkya Tejankar,Hamed Pirsiavash
机构:University of Maryland, Baltimore County
链接:https://arxiv.org/abs/2105.07269
摘要:最新的自监督学习(SSL)算法通过在图像实例之间进行对比或通过对图像进行聚类然后在图像聚类之间进行对比来学习特征。我们介绍了一种简单的mean-shift算法,该算法通过将图像分组来学习表示,而不需要在图像之间进行对比,也不需要在聚类结构上采用太多的先验知识。我们只需将每个图像的嵌入“移动”,使其接近其相邻图像的“平均值”。因为在我们的设置中,最近邻总是相同图像的另一个增强,所以当我们只使用一个最近邻而不是实验中使用的5个最近邻时,我们的模型将与BYOL相同。我们的模型在ImageNet线性评估中达到72.4%,ResNet50在200个时期的表现优于BYOL。我们的代码可在以下位置获得:https://github.com/UMBCvision/MSF
摘要:Most recent self-supervised learning (SSL) algorithms learn features by contrasting between instances of images or by clustering the images and then contrasting between the image clusters. We introduce a simple mean-shift algorithm that learns representations by grouping images together without contrasting between them or adopting much of prior on the structure of the clusters. We simply "shift" the embedding of each image to be close to the "mean" of its neighbors. Since in our setting, the closest neighbor is always another augmentation of the same image, our model will be identical to BYOL when using only one nearest neighbor instead of 5 as used in our experiments. Our model achieves 72.4% on ImageNet linear evaluation with ResNet50 at 200 epochs outperforming BYOL. Our code is available here: https://github.com/UMBCvision/MSF
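A minimal sketch of the mean-shift objective described above, assuming a momentum target encoder and a memory bank of past target embeddings; the 2 - 2·cosine distance is a standard choice for unit vectors, not a verified detail of the release.

```python
import torch
import torch.nn.functional as F

def msf_loss(query, target, memory_bank, k=5):
    """Pull the query embedding toward the k nearest neighbors of its
    target (momentum) embedding in a memory bank. With k=1 the neighbor
    is the target's closest match and the loss reduces to a BYOL-style
    objective, as the abstract notes.
    """
    query = F.normalize(query, dim=1)
    target = F.normalize(target, dim=1).detach()
    bank = F.normalize(memory_bank, dim=1).detach()

    sims = target @ bank.t()                   # (B, bank_size)
    nn_idx = sims.topk(k, dim=1).indices       # k nearest neighbors
    neighbors = bank[nn_idx]                   # (B, k, D)
    # Squared L2 distance between unit vectors: 2 - 2 * cosine.
    return (2 - 2 * (query.unsqueeze(1) * neighbors).sum(-1)).mean()

q = torch.randn(8, 128, requires_grad=True)
t = torch.randn(8, 128)
bank = torch.randn(4096, 128)
msf_loss(q, t, bank).backward()
```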
【7】 SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping
标题:蓝精灵:自学全像翘曲多帧无人监督筏子
作者:Austin Stone,Daniel Maurer,Alper Ayvaci,Anelia Angelova,Rico Jonschkowski
机构:Robotics at Google, Waymo
备注:Accepted at CVPR 2021, all code available at this https URL
链接:https://arxiv.org/abs/2105.07014
摘要:我们介绍了SMURF,一种用于光流无监督学习的方法,它在所有基准上提高了$36\%$到$40\%$(比以前最好的方法UFlow)的最新水平,甚至比PWC Net和FlowNet2等几种有监督方法都要好。我们的方法结合了有监督光流的结构改进,即RAFT模型,以及无监督学习的新思想,包括序列感知的自我监督丢失,处理帧外运动的技术,以及一种有效地从多帧视频数据中学习同时仍然只需要两帧进行推理的方法。
摘要:We present SMURF, a method for unsupervised learning of optical flow that improves state of the art on all benchmarks by $36\%$ to $40\%$ (over the prior best method UFlow) and even outperforms several supervised approaches such as PWC-Net and FlowNet2. Our method integrates architecture improvements from supervised optical flow, i.e. the RAFT model, with new ideas for unsupervised learning that include a sequence-aware self-supervision loss, a technique for handling out-of-frame motion, and an approach for learning effectively from multi-frame video data while still only requiring two frames for inference.
【8】 Unsupervised Deep Learning Methods for Biological Image Reconstruction
标题:生物图像重建的无监督深度学习方法
作者:Mehmet Akçakaya,Burhaneddin Yaman,Hyungjin Chung,Jong Chul Ye
机构: Department of Electrical and Computer Engineering, University of Minnesota; Department of Bio and Brain Engineering
链接:https://arxiv.org/abs/2105.08040
摘要:近年来,深度学习方法以其高性能和超快的重建速度成为生物图像重建问题的主要研究前沿。然而,由于在有监督学习中很难获得匹配的参考数据,人们对不需要配对参考数据的无监督学习方法越来越感兴趣。特别是,自监督学习和生成模型已成功地应用于各种生物成像应用。本文以经典逆问题为背景,从相干的角度综述了这些方法,并讨论了它们在生物成像中的应用。
摘要:Recently, deep learning approaches have become the main research frontier for biological image reconstruction problems thanks to their high performance, along with their ultra-fast reconstruction times. However, due to the difficulty of obtaining matched reference data for supervised learning, there has been increasing interest in unsupervised learning approaches that do not need paired reference data. In particular, self-supervised learning and generative models have been successfully used for various biological imaging applications. In this paper, we overview these approaches from a coherent perspective in the context of classical inverse problems, and discuss their applications to biological imaging.
【9】 Deep regression for uncertainty-aware and interpretable analysis of large-scale body MRI
标题:深度回归在大规模人体MRI不确定性感知和可解释性分析中的应用
作者:Taro Langner,Robin Strand,Håkan Ahlström,Joel Kullberg
机构:Department of Surgical Sciences, Uppsala University, Uppsala, Sweden; Department of Information Technology, Uppsala University, Uppsala, Sweden; Antaros Medical AB, BioVenture Hub, Mölndal, Sweden
备注:Presented at the Swedish Symposium on Deep Learning 2021
链接:https://arxiv.org/abs/2105.07797
摘要:像英国生物银行这样的大规模医学研究用医学成像技术检查了成千上万的志愿者。结合收集的大量元数据,这些图像中的解剖信息有可能以前所未有的规模进行医学分析。然而,他们的评估往往需要人工输入和长时间的处理,限制了生物标志物和其他可用于研究的测量的参考值的数量。最近的卷积神经网络回归方法可以自动执行这些评估。在超过40000名英国生物库受试者的磁共振成像(MRI)数据上,这些系统可以估计人的年龄、身体成分等。这种分析方式几乎完全是数据驱动的,不需要人工干预或人工分割地面真实图像的指导。这些网络通常密切仿效提供其训练数据的参考方法,并且可以达到与已建立的医疗金标准技术之间的预期可变性相当的一致性水平。无声失效的风险可以通过均值-方差准则和集合得到的预测不确定性来单独量化。显著性分析进一步解释了潜在的相关图像特征,并表明网络学习正确地瞄准特定的器官、肢体和感兴趣的区域。
摘要:Large-scale medical studies such as the UK Biobank examine thousands of volunteer participants with medical imaging techniques. Combined with the vast amount of collected metadata, anatomical information from these images has the potential for medical analyses at unprecedented scale. However, their evaluation often requires manual input and long processing times, limiting the amount of reference values for biomarkers and other measurements available for research. Recent approaches with convolutional neural networks for regression can perform these evaluations automatically. On magnetic resonance imaging (MRI) data of more than 40,000 UK Biobank subjects, these systems can estimate human age, body composition and more. This style of analysis is almost entirely data-driven and no manual intervention or guidance with manually segmented ground truth images is required. The networks often closely emulate the reference method that provided their training data and can reach levels of agreement comparable to the expected variability between established medical gold standard techniques. The risk of silent failure can be individually quantified by predictive uncertainty obtained from a mean-variance criterion and ensembling. Saliency analysis furthermore enables an interpretation of the underlying relevant image features and showed that the networks learned to correctly target specific organs, limbs, and regions of interest.
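The mean-variance criterion mentioned above is typically the Gaussian negative log-likelihood with a predicted per-sample variance; a minimal sketch, assuming a two-headed regressor (mean and log-variance outputs).

```python
import torch

def mean_variance_nll(pred_mean, pred_logvar, target):
    """Gaussian NLL with a learned per-sample variance: the predicted
    variance acts as a per-prediction uncertainty estimate."""
    var = pred_logvar.exp()
    return 0.5 * (pred_logvar + (target - pred_mean) ** 2 / var).mean()

# Toy usage: one head predicts the mean, one the log-variance.
mean = torch.randn(16, 1, requires_grad=True)
logvar = torch.zeros(16, 1, requires_grad=True)
y = torch.randn(16, 1)
loss = mean_variance_nll(mean, logvar, y)
loss.backward()
# Ensembling several such networks and combining their means and
# variances yields the total predictive uncertainty used for
# flagging potential silent failures.
```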
【10】 Unsupervised MMRegNet based on Spatially Encoded Gradient Information
标题:基于空间编码梯度信息的无监督MMRegNet
作者:Wangbin Ding,Lei Li,Xiahai Zhuang,Liqin Huang
机构: College of Physics and Information Engineering, Fuzhou University, Fuzhou, China, School of Data Science, Fudan University, Shanghai, China, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China
链接:https://arxiv.org/abs/2105.07392
摘要:多模态医学图像可以为目标(器官、肿瘤或组织)提供相关和互补的解剖学信息。将多模态图像配准到一个公共空间,可以融合这些综合信息,为临床应用带来方便。近年来,神经网络在增强配准方法方面得到了广泛的研究。然而,由于缺乏可靠的网络训练标准,开发一个多模态注册网络仍然是一个挑战。此外,现有的图像配准网络大多集中于两两配准,难以适用于多种图像场景。在这项工作中,我们提出了一个多模态注册网络(MMRegNet),它可以联合注册多个不同模态的图像到一个目标图像。同时,我们提出空间编码的梯度资讯,以无监督的方式训练MMRegNet。在两个数据集(即MM-WHS 2017和CHAOS 2019)上对拟建网络进行了评估。结果表明,该网络在左心室和肝脏的配准中具有良好的性能。源代码在github上公开发布。
摘要:Multi-modality medical images can provide relevant and complementary anatomical information for a target (organ, tumor or tissue). Registering the multi-modality images to a common space can fuse these comprehensive information, and bring convenience for clinical application. Recently, neural networks have been widely investigated to boost registration methods. However, it is still challenging to develop a multi-modality registration network due to the lack of robust criteria for network training. Besides, most existing registration networks mainly focus on pairwise registration, and can hardly be applicable for multiple image scenarios. In this work, we propose a multi-modality registration network (MMRegNet), which can jointly register multiple images with different modalities to a target image. Meanwhile, we present spatially encoded gradient information to train the MMRegNet in an unsupervised manner. The proposed network was evaluated on two datasets, i.e, MM-WHS 2017 and CHAOS 2019. The results show that the proposed network can achieve promising performance for cardiac left ventricle and liver registration tasks. Source code is released publicly on github.
时序|行为识别|姿态|视频|运动估计(4篇)
【1】 EA-Net: Edge-Aware Network for Flow-based Video Frame Interpolation
标题:EA-Net:基于流的视频帧插值边缘感知网络
作者:Bin Zhao,Xuelong Li
机构:School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University
链接:https://arxiv.org/abs/2105.07673
摘要:视频帧插值可以提高帧速率,提高视频质量。近年来,虽然插值算法取得了很大的成功,但由于运动较大,图像模糊通常发生在物体边界处。这是一个长期存在的问题,尚未得到解决。在本文中,我们提出通过在插值帧中保留边缘来减少图像模糊并获得清晰的物体形状。为此,提出的边缘感知网络(EA-Net)将边缘信息集成到帧插值任务中。它遵循端到端的体系结构,可分为两个阶段,即边缘引导流估计和边缘保护帧合成。具体地说,在流估计阶段,提出了三种边缘感知机制来强调流图估计中的帧边缘,从而将边缘图作为辅助信息来提供更多的指导,提高流图估计的精度。在帧合成阶段,设计了流细化模块对流图进行细化,并在合成中间帧时采用注意模块对双向流图进行自适应聚焦。此外,采用帧和边缘鉴别器进行对抗性训练策略,提高了合成帧的真实性和清晰度。在Vimeo90k、UCF101单帧内插和adobe240fps多帧内插三个测试平台上的实验结果表明,本文提出的EA网络用于视频帧内插的优越性。
摘要:Video frame interpolation can up-convert the frame rate and enhance the video quality. In recent years, although the interpolation performance has achieved great success, image blur usually occurs at the object boundaries owing to the large motion. It has been a long-standing problem, and has not been addressed yet. In this paper, we propose to reduce the image blur and get the clear shape of objects by preserving the edges in the interpolated frames. To this end, the proposed Edge-Aware Network (EA-Net) integrates the edge information into the frame interpolation task. It follows an end-to-end architecture and can be separated into two stages, i.e., edge-guided flow estimation and edge-protected frame synthesis. Specifically, in the flow estimation stage, three edge-aware mechanisms are developed to emphasize the frame edges in estimating flow maps, so that the edge-maps are taken as the auxiliary information to provide more guidance to boost the flow accuracy. In the frame synthesis stage, the flow refinement module is designed to refine the flow map, and the attention module is carried out to adaptively focus on the bidirectional flow maps when synthesizing the intermediate frames. Furthermore, the frame and edge discriminators are adopted to conduct the adversarial training strategy, so as to enhance the reality and clarity of synthesized frames. Experiments on three benchmarks, including Vimeo90k, UCF101 for single-frame interpolation and Adobe240-fps for multi-frame interpolation, have demonstrated the superiority of the proposed EA-Net for the video frame interpolation task.
【2】 AudioVisual Video Summarization
标题:视听视频摘要
作者:Bin Zhao,Maoguo Gong,Xuelong Li
机构: Xidian University
链接:https://arxiv.org/abs/2105.07667
摘要:音频和视觉是视频数据的两种主要形式。多模态学习，尤其是视听学习，近年来受到了广泛的关注，它可以提高各种计算机视觉任务的性能。然而，在视频摘要中，现有的方法只利用视觉信息而忽略了音频信息。在本文中，我们认为音频模态可以帮助视觉模态更好地理解视频的内容和结构，并进一步有利于摘要过程。基于此，我们提出联合利用视听信息进行视频摘要任务，并开发一个视听循环网络(AVRN)来实现这一目的。具体地说，本文提出的AVRN可以分为三个部分:1)利用双流LSTM通过捕获音频和视频特征的时间相关性，对音频和视频特征进行顺序编码。2)采用视听融合LSTM，通过探索两种模态之间潜在的一致性，实现了两种模态的融合。3)采用自注意力视频编码器捕获视频中的全局相关性。最后，利用融合后的视听信息，结合时间相关性和全局相关性对视频摘要进行预测。实际上，在SumMe和TVsum两个基准上的实验结果证明了每一部分的有效性，以及AVRN相对于单纯利用视觉信息进行视频摘要的方法的优越性。
摘要:Audio and vision are two main modalities in video data. Multimodal learning, especially for audiovisual learning, has drawn considerable attention recently, which can boost the performance of various computer vision tasks. However, in video summarization, existing approaches just exploit the visual information while neglect the audio information. In this paper, we argue that the audio modality can assist vision modality to better understand the video content and structure, and further benefit the summarization process. Motivated by this, we propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this. Specifically, the proposed AVRN can be separated into three parts: 1) the two-stream LSTM is utilized to encode the audio and visual feature sequentially by capturing their temporal dependency. 2) the audiovisual fusion LSTM is employed to fuse the two modalities by exploring the latent consistency between them. 3) the self-attention video encoder is adopted to capture the global dependency in the video. Finally, the fused audiovisual information, and the integrated temporal and global dependencies are jointly used to predict the video summary. Practically, the experimental results on the two benchmarks, i.e., SumMe and TVsum, have demonstrated the effectiveness of each part, and the superiority of AVRN compared to those approaches just exploiting visual information for video summarization.
【3】 MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions
标题:多运动:时空局部化运动动作的多人视频数据集
作者:Yixuan Li,Lei Chen,Runyu He,Zhenzhi Wang,Gangshan Wu,Limin Wang
机构:State Key Laboratory for Novel Software Technology, Nanjing University, China
备注:One track of DeeperAction Workshop@ICCV2021. HomePage: this https URL
链接:https://arxiv.org/abs/2105.07404
摘要:时空动作检测是视频理解中一个重要而富有挑战性的问题。现有的动作检测基准局限于视频剪辑中的少量实例或相对低级的原子动作。本文提出了一种新的时空定位体育动作多人数据集，即MultiSports。我们首先分析了构建一个真实且具有挑战性的时空动作检测数据集的重要因素，提出了三个标准:(1)依赖运动的识别，(2)具有良好定义的边界，(3)相对高级的类别。在此基础上，选取4个体育类别，采集约3200个视频片段，用907k边界框标注约37790个动作实例，构建了MultiSports v1.0数据集。我们的数据集具有很强的多样性、详细的注释和高质量等重要特性。MultiSports以其真实的场景设置和密集的注释，揭示了动作定位的内在挑战。为了验证这一点，我们将几种有代表性的方法应用到我们的数据集中，并对数据集中动作定位的困难进行了深入的分析。我们希望MultiSports可以作为未来时空动作检测的标准基准。我们的数据集网站位于https://deeperaction.github.io/multisports/.
摘要:Spatio-temporal action detection is an important and challenging problem in video understanding. The existing action detection benchmarks are limited in aspects of small numbers of instances in a trimmed video or relatively low-level atomic actions. This paper aims to present a new multi-person dataset of spatio-temporal localized sports actions, coined as MultiSports. We first analyze the important ingredients of constructing a realistic and challenging dataset for spatio-temporal action detection by proposing three criteria: (1) motion dependent identification, (2) with well-defined boundaries, (3) relatively high-level classes. Based on these guidelines, we build the dataset of Multi-Sports v1.0 by selecting 4 sports classes, collecting around 3200 video clips, and annotating around 37790 action instances with 907k bounding boxes. Our datasets are characterized with important properties of strong diversity, detailed annotation, and high quality. Our MultiSports, with its realistic setting and dense annotations, exposes the intrinsic challenge of action localization. To benchmark this, we adapt several representative methods to our dataset and give an in-depth analysis on the difficulty of action localization in our dataset. We hope our MultiSports can serve as a standard benchmark for spatio-temporal action detection in the future. Our dataset website is at https://deeperaction.github.io/multisports/.
【4】 Composite Localization for Human Pose Estimation
标题:一种用于人体姿态估计的复合定位方法
作者:ZiFan Chen,Xin Qin,Chao Yang,Li Zhang
链接:https://arxiv.org/abs/2105.07245
摘要:由于学习目标的复杂性，现有的人体姿态估计方法存在着长距离回归不准确或计算量大的问题。本文提出了一种新的人体姿态估计深度学习框架，称为复合定位，将复杂的学习目标分为两个简单的目标:一个稀疏的热图用于寻找关键点的近似位置，两个短距离的偏移图用于获得最终的精确坐标。为了实现该框架，我们构造了两种复合定位网络:CLNet-ResNet和CLNet-Hourglass。我们在三个基准数据集上评估了网络，包括Leeds运动姿态数据集、MPII人体姿态数据集和COCO关键点检测数据集。实验结果表明，我们的CLNet-ResNet50在约1/2 GFLOPs的情况下比SimpleBaseline高出1.14%。我们的CLNet-Hourglass在COCO上比原始的堆叠沙漏网络高出4.45%。
摘要:The existing human pose estimation methods are confronted with inaccurate long-distance regression or high computational cost due to the complex learning objectives. This work proposes a novel deep learning framework for human pose estimation called composite localization to divide the complex learning objective into two simpler ones: a sparse heatmap to find the keypoint's approximate location and two short-distance offsetmaps to obtain its final precise coordinates. To realize the framework, we construct two types of composite localization networks: CLNet-ResNet and CLNet-Hourglass. We evaluate the networks on three benchmark datasets, including the Leeds Sports Pose dataset, the MPII Human Pose dataset, and the COCO keypoints detection dataset. The experimental results show that our CLNet-ResNet50 outperforms SimpleBaseline by 1.14% with about 1/2 GFLOPs. Our CLNet-Hourglass outperforms the original stacked-hourglass by 4.45% on COCO.
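Decoding under this composite scheme could look as follows — a numpy sketch assuming per-keypoint heatmaps plus x/y offset maps and an illustrative output stride.

```python
import numpy as np

def decode_keypoints(heatmaps, offsets_x, offsets_y, stride=4):
    """Take the arg-max of each sparse heatmap for a coarse location,
    then add the two short-distance offsets predicted at that cell to
    obtain the final coordinate (the stride value is an assumption).

    heatmaps, offsets_x, offsets_y: (K, H, W) per-keypoint maps
    """
    coords = []
    for k in range(heatmaps.shape[0]):
        y, x = np.unravel_index(heatmaps[k].argmax(), heatmaps[k].shape)
        coords.append(((x + offsets_x[k, y, x]) * stride,
                       (y + offsets_y[k, y, x]) * stride))
    return np.array(coords)  # (K, 2) image-space (x, y)

K, H, W = 17, 64, 48
pts = decode_keypoints(np.random.rand(K, H, W),
                       np.random.rand(K, H, W), np.random.rand(K, H, W))
print(pts.shape)  # (17, 2)
```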
医学相关(1篇)
【1】 Joint Optimization of Hadamard Sensing and Reconstruction in Compressed Sensing Fluorescence Microscopy
标题:压缩传感荧光显微术中阿达玛传感与重建的联合优化
作者:Alan Q. Wang,Aaron K. LaViolette,Leo Moon,Chris Xu,Mert R. Sabuncu
机构: School of Electrical and Computer Engineering, Cornell University, Meinig School of Biomedical Engineering, Cornell University, School of Applied and Engineering Physics, Cornell University
备注:Accepted at MICCAI 2021
链接:https://arxiv.org/abs/2105.07961
摘要:压缩传感荧光显微镜(CS-FM)提出了一种方案,即在传感过程中收集较少的测量数据,然后进行重建以恢复图像。在分别优化传感和重建部分方面做了大量工作。我们提出了一种在总测量约束下端到端联合优化传感和重建的方法,使得最优传感方案的学习与基于神经网络的重建网络的参数同时进行。我们在由多种生物样本组成的共焦、双光子和广域显微镜图像的丰富数据集上训练我们的模型。我们证明了我们的方法优于几种基线检测方案和正则回归重建算法。
摘要:Compressed sensing fluorescence microscopy (CS-FM) proposes a scheme whereby less measurements are collected during sensing and reconstruction is performed to recover the image. Much work has gone into optimizing the sensing and reconstruction portions separately. We propose a method of jointly optimizing both sensing and reconstruction end-to-end under a total measurement constraint, enabling learning of the optimal sensing scheme concurrently with the parameters of a neural network-based reconstruction network. We train our model on a rich dataset of confocal, two-photon, and wide-field microscopy images comprising of a variety of biological samples. We show that our method outperforms several baseline sensing schemes and a regularized regression reconstruction algorithm.
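A generic end-to-end sketch of the idea, with a sigmoid-relaxed sensing mask and a tiny CNN decoder trained jointly under a soft measurement-budget penalty. This relaxation is our illustration; the paper parameterizes sensing with Hadamard patterns.

```python
import torch
import torch.nn as nn

class JointSensingRecon(nn.Module):
    """Learn a sensing pattern and a reconstruction net together."""
    def __init__(self, h=32, w=32):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(1, 1, h, w))
        self.recon = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, x):
        mask = torch.sigmoid(self.mask_logits)   # soft sensing pattern
        return self.recon(x * mask), mask

model = JointSensingRecon()
x = torch.rand(4, 1, 32, 32)
recon, mask = model(x)
budget = 0.25                                    # keep 25% of measurements
loss = ((recon - x) ** 2).mean() + \
       10.0 * (mask.mean() - budget).clamp(min=0) ** 2
loss.backward()                                  # end-to-end: sensing + recon
```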
GAN|对抗|攻击|生成相关(10篇)
【1】 Temporal Prediction and Evaluation of Brassica Growth in the Field using Conditional Generative Adversarial Networks
标题:基于条件生成对抗网络的油菜田间生长时间预测与评价
作者:Lukas Drees,Laura Verena Junker-Frohn,Jana Kierdorf,Ribana Roscher
机构:IGG, Remote Sensing, University of Bonn, Germany; IBG-, Plant Sciences, Forschungszentrum Jülich GmbH, Germany
备注:38 pages, 10 figures, 2 tables
链接:https://arxiv.org/abs/2105.07789
摘要:农民经常评估植物的生长和表现,作为决定何时在田间采取行动的基础,如施肥、杂草控制或收割。植物生长的预测是一个重大的挑战,因为它受到众多和高度可变的环境因素的影响。本文提出了一种新的监测方法,包括高通量成像传感器测量和自动分析,以预测未来的植物生长。该方法的核心是一种基于条件生成对抗网络的基于机器学习的生长模型,它能够预测单个植物的未来外观。在实验室生长的拟南芥和田间生长的菜花植物的RGB时间序列图像的实验中,我们表明我们的方法产生了真实、可靠和合理的未来生长阶段的图像。通过基于神经网络的实例分割对生成的图像进行自动解释,可以导出描述植物生长的各种表型特征。
摘要:Farmers frequently assess plant growth and performance as basis for making decisions when to take action in the field, such as fertilization, weed control, or harvesting. The prediction of plant growth is a major challenge, as it is affected by numerous and highly variable environmental factors. This paper proposes a novel monitoring approach that comprises high-throughput imaging sensor measurements and their automatic analysis to predict future plant growth. Our approach's core is a novel machine learning-based growth model based on conditional generative adversarial networks, which is able to predict the future appearance of individual plants. In experiments with RGB time-series images of laboratory-grown Arabidopsis thaliana and field-grown cauliflower plants, we show that our approach produces realistic, reliable, and reasonable images of future growth stages. The automatic interpretation of the generated images through neural network-based instance segmentation allows the derivation of various phenotypic traits that describe plant growth.
【2】 Shared and Private VAEs with Generative Replay for Continual Learning
Authors: Subhankar Ghosh
Affiliations: Indian Institute of Science, Bengaluru, India
Note: 10 pages, 10 figures
Link: https://arxiv.org/abs/2105.07627
Abstract: Continual learning tries to learn new tasks without forgetting previously learned ones. In reality, most existing artificial neural network (ANN) models fail at this, whereas humans manage it by remembering previous tasks throughout their lives. Although simply storing all past data can alleviate the problem, it requires large memory and is often infeasible in real-world applications where access to past data is limited. We hypothesize that a model that learns to solve each task continually has some task-specific properties and some task-invariant characteristics. We propose a hybrid continual learning model, better suited to real-world scenarios, that has a task-invariant shared variational autoencoder and T task-specific variational autoencoders. Our model combines generative replay and architectural growth to prevent catastrophic forgetting. We show that our hybrid model effectively avoids forgetting and achieves state-of-the-art results on visual continual learning benchmarks such as MNIST, Permuted MNIST (QMNIST), CIFAR100, and miniImageNet. We also discuss results on a few more datasets, such as SVHN, Fashion-MNIST, EMNIST, and CIFAR10.
【3】 Style-Restricted GAN: Multi-Modal Translation with Style Restriction Using Generative Adversarial Networks
Authors: Sho Inoue, Tad Gonsalves
Affiliations: Department of Information & Communication Sciences, Sophia University
Note: 18 pages, 13 figures, 6 tables; submitted to IEEE Access; our implementation is available at this https URL
Link: https://arxiv.org/abs/2105.07621
Abstract: Unpaired image-to-image translation using Generative Adversarial Networks (GAN) is successful in converting images among multiple domains. Moreover, recent studies have shown a way to diversify the outputs of the generator. However, since there are no restrictions on how the generator diversifies the results, it is likely to translate some unexpected features. In this paper, we propose Style-Restricted GAN (SRGAN), a novel approach that translates input images into different domains with different styles, changing exclusively the class-related features. Additionally, instead of the KL divergence loss, we adopt three new losses to restrict the distribution of the encoded features: a batch KL divergence loss, a correlation loss, and a histogram imitation loss. The study reports quantitative as well as qualitative results with Precision, Recall, Density, and Coverage. The proposed three losses lead to an enhanced level of diversity compared to the conventional KL loss. In particular, SRGAN is found to be successful in translating with higher diversity and without changing the class-unrelated features in the CelebA face dataset. Our implementation is available at https://github.com/shinshoji01/Style-Restricted_GAN.
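Two of the three restriction losses have simple, self-contained formulations on a batch of encoded features. The sketch below shows one plausible reading of the batch KL divergence and correlation losses (the histogram imitation loss is omitted); the exact definitions live in the authors' repository, so treat these as illustrative stand-ins.

```python
import torch

def batch_kl_loss(z):
    """KL between the batch-level Gaussian fit of codes z (B, D) and
    N(0, I): match statistics of the whole batch rather than
    per-sample posteriors (one reading of 'batch KL divergence')."""
    mu, var = z.mean(0), z.var(0) + 1e-8
    return 0.5 * (var + mu**2 - 1.0 - var.log()).sum()

def correlation_loss(z):
    """Penalize off-diagonal correlations between code dimensions,
    encouraging decorrelated (more diverse) style codes."""
    zc = (z - z.mean(0)) / (z.std(0) + 1e-8)
    corr = (zc.T @ zc) / z.shape[0]
    off_diag = corr - torch.diag(torch.diag(corr))
    return (off_diag ** 2).sum()

z = torch.randn(64, 16, requires_grad=True)   # toy encoded features
(batch_kl_loss(z) + correlation_loss(z)).backward()
```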
【4】 Fast-GANFIT: Generative Adversarial Network for High Fidelity 3D Face Reconstruction
Authors: Baris Gecer, Stylianos Ploumpis, Irene Kotsia, Stefanos Zafeiriou
Note: TPAMI camera-ready (submitted 05-May-2020); project page: this https URL; arXiv admin note: substantial text overlap with arXiv:1902.05978
Link: https://arxiv.org/abs/2105.07474
Abstract: A lot of work has been done towards reconstructing the 3D facial structure from single images by capitalizing on the power of Deep Convolutional Neural Networks (DCNNs). In recent works, the texture features either correspond to components of a linear texture space or are learned by auto-encoders directly from in-the-wild images. In all cases, the quality of the facial texture reconstruction is still not capable of modeling facial texture with high-frequency details. In this paper, we take a radically different approach and harness the power of Generative Adversarial Networks (GANs) and DCNNs in order to reconstruct the facial texture and shape from single images. That is, we utilize GANs to train a very powerful facial texture prior from a large-scale 3D texture dataset. Then, we revisit the original 3D Morphable Models (3DMMs) fitting, making use of non-linear optimization to find the optimal latent parameters that best reconstruct the test image, but under a new perspective. In order to be robust towards initialisation and expedite the fitting process, we propose a novel self-supervised regression-based approach. We demonstrate excellent results in photorealistic and identity-preserving 3D face reconstructions and achieve, for the first time to the best of our knowledge, facial texture reconstruction with high-frequency details.
【5】 3D to 4D Facial Expressions Generation Guided by Landmarks
Authors: Naima Otberdout, Claudio Ferrari, Mohamed Daoudi, Stefano Berretti, Alberto Del Bimbo
Affiliations: Univ. Lille, CNRS, Centrale Lille, UMR CRIStAL, Lille, France; Media Integration and Communication Center, University of Florence, Italy; IMT Lille Douai, Institut Mines-Télécom, Univ. Lille, Centre for Digital Systems, Lille, France
Link: https://arxiv.org/abs/2105.07463
Abstract: While deep learning-based 3D face generation has made progress recently, the problem of dynamic 3D (4D) facial expression synthesis is less investigated. In this paper, we propose a novel solution to the following question: given one input 3D neutral face, can we generate dynamic 3D (4D) facial expressions from it? To tackle this problem, we first propose a mesh encoder-decoder architecture (Expr-ED) that exploits a set of 3D landmarks to generate an expressive 3D face from its neutral counterpart. Then, we extend it to 4D by modeling the temporal dynamics of facial expressions using a manifold-valued GAN capable of generating a sequence of 3D landmarks from an expression label (Motion3DGAN). The generated landmarks are fed into the mesh encoder-decoder, ultimately producing a sequence of 3D expressive faces. By decoupling the two steps, we separately address the non-linearity induced by the mesh deformation and the motion dynamics. The experimental results on the CoMA dataset show that our landmark-guided mesh encoder-decoder brings a significant improvement with respect to other landmark-based 3D fitting approaches, and that we can generate high-quality dynamic facial expressions. The framework further enables the 3D expression intensity to be continuously adapted from low to high. Finally, we show that our framework can be applied to other tasks, such as 2D-3D facial expression transfer.
【6】 ExSinGAN: Learning an Explainable Generative Model from a Single Image
Authors: ZiCheng Zhang, CongYing Han, TianDe Guo
Affiliations: University of Chinese Academy of Sciences
Link: https://arxiv.org/abs/2105.07350
Abstract: Generating images from a single sample, as a newly developing branch of image synthesis, has attracted extensive attention. In this paper, we formulate this problem as sampling from the conditional distribution of a single image, and propose a hierarchical framework that simplifies the learning of the intricate conditional distribution through the successive learning of distributions over structure, semantics and texture, making the processes of learning and generation comprehensible. On this basis, we design ExSinGAN, composed of three cascaded GANs, for learning an explainable generative model from a given image, where the cascaded GANs model the distributions of structure, semantics and texture successively. ExSinGAN learns not only from the internal patches of the given image, as previous works did, but also from an external prior obtained by the GAN inversion technique. Benefiting from this combination of internal and external information, ExSinGAN has more powerful generation capability and competitive generalization ability on image manipulation tasks compared with prior works.
【7】 Texture Generation with Neural Cellular Automata
Authors: Alexander Mordvintsev, Eyvind Niklasson, Ettore Randazzo
Affiliations: Google Research
Note: AI for Content Creation Workshop, CVPR 2021
Link: https://arxiv.org/abs/2105.07299
Abstract: Neural Cellular Automata (NCA) have shown a remarkable ability to learn the required rules to "grow" images, classify morphologies, segment images, as well as to do general computation such as path-finding. We believe the inductive prior they introduce lends itself to the generation of textures. Textures in the natural world are often generated by variants of locally interacting reaction-diffusion systems. Human-made textures are likewise often generated in a local manner (textile weaving, for instance) or using rules with local dependencies (regular grids or geometric patterns). We demonstrate learning a texture generator from a single template image, with the generation method being embarrassingly parallel, exhibiting quick convergence and high fidelity of output, and requiring only some minimal assumptions around the underlying state manifold. Furthermore, we investigate properties of the learned models that are both useful and interesting, such as non-stationary dynamics and an inherent robustness to damage. Finally, we make qualitative claims that the behaviour exhibited by the NCA model is a learned, distributed, local algorithm to generate a texture, setting our method apart from existing work on texture generation. We discuss the advantages of such a paradigm.
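A minimal NCA update step of the kind described here is compact enough to sketch: a fixed identity/Sobel perception stage, a small per-cell MLP implemented with 1x1 convolutions, and a stochastic update mask. Channel counts and the fire rate below are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

class NCA(torch.nn.Module):
    """One step of a texture NCA: fixed Sobel/identity perception,
    a per-cell MLP (1x1 convs), and asynchronous stochastic updates."""
    def __init__(self, ch=12, hidden=96):
        super().__init__()
        ident = torch.tensor([[0,0,0],[0,1,0],[0,0,0]], dtype=torch.float32)
        sobx = torch.tensor([[-1,0,1],[-2,0,2],[-1,0,1]], dtype=torch.float32) / 8
        kernels = torch.stack([ident, sobx, sobx.T])               # (3, 3, 3)
        self.register_buffer("filters", kernels.repeat(ch, 1, 1).unsqueeze(1))
        self.fc1 = torch.nn.Conv2d(ch * 3, hidden, 1)
        self.fc2 = torch.nn.Conv2d(hidden, ch, 1)
        torch.nn.init.zeros_(self.fc2.weight)                      # "do nothing"
        torch.nn.init.zeros_(self.fc2.bias)                        # initial policy
        self.ch = ch

    def forward(self, x, fire_rate=0.5):
        y = F.conv2d(x, self.filters, padding=1, groups=self.ch)   # perception
        dx = self.fc2(F.relu(self.fc1(y)))                         # per-cell update
        mask = (torch.rand_like(x[:, :1]) < fire_rate).float()     # async updates
        return x + dx * mask

nca = NCA()
state = torch.rand(1, 12, 64, 64)   # cell grid; first channels hold RGB
for _ in range(32):                 # iterate the local rule
    state = nca(state)
```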
【8】 Multi-scale super-resolution generation of low-resolution scanned pathological images
Authors: Yanhua Gao, Ting Xie, Xun Wang, Qingqing Yang, Le Chen, Kai Sun, Youmin Guo, Gang Yu, Kuansong Wang
Affiliations: Department of Medical Imaging, The First Affiliated Hospital of Xi'an Jiaotong University, Yanta West Road, Xi'an, China; Department of Biomedical Engineering, School of Basic Medical Sciences, Central ...
Note: 27 pages, 12 figures
Link: https://arxiv.org/abs/2105.07200
Abstract: Digital pathology slides are easy to store and manage, and convenient to browse and transmit. However, because of the high-resolution scanning used during digitization (e.g., 40X magnification), the file size of each whole-slide image exceeds 1 gigabyte, which leads to huge storage requirements and very slow network transmission. We design a strategy to scan slides at low resolution (5X), and propose a super-resolution method to restore image details at diagnosis time. The method is based on a multi-scale generative adversarial network, which sequentially generates three high-resolution images at 10X, 20X and 40X. The perceptual loss and generator loss between generated and real images are computed at the three resolutions, and a discriminator is used to evaluate the difference between the highest-resolution generated image and the real image. A dataset of 100,000 pathological images from 10 types of human tissue is used to train and test the network. The generated images achieve high peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM): the PSNR values at 10X, 20X and 40X are 24.16, 22.27 and 20.44, and the SSIM values are 0.845, 0.680 and 0.512, better than other super-resolution networks such as DBPN, ESPCN, RDN, EDSR and MDSR. Visual inspection shows that the high-resolution images generated by our network have enough detail for diagnosis and good color reproduction, and are close to the real images, while the other five networks produce images that are severely blurred, locally deformed, or missing important details. Furthermore, no significant differences are found in pathological diagnoses based on the generated versus the real images. The proposed multi-scale network can generate good high-resolution pathological images, and provides low-cost storage (about 15MB per image at 5X) and a faster image-sharing method for digital pathology.
【9】 NeuroGen: activation optimized image synthesis for discovery neuroscience
Authors: Zijin Gu, Keith W. Jamison, Meenakshi Khosla, Emily J. Allen, Yihan Wu, Thomas Naselaris, Kendrick Kay, Mert R. Sabuncu, Amy Kuceyeski
Affiliations: Department of Electrical and Computer Engineering, Cornell University, Ithaca, New York, USA; Weill Cornell Medicine, New York, New York, USA; Department of Neuroscience, University of Minnesota, Minneapolis, Minnesota, USA
Link: https://arxiv.org/abs/2105.07140
Abstract: Functional MRI (fMRI) is a powerful technique that has allowed us to characterize visual cortex responses to stimuli, yet such experiments are by nature constructed based on a priori hypotheses, limited to the set of images presented to the individual while they are in the scanner, subject to noise in the observed brain responses, and may vary widely across individuals. In this work, we propose a novel computational strategy, which we call NeuroGen, to overcome these limitations and develop a powerful tool for human vision neuroscience discovery. NeuroGen combines an fMRI-trained neural encoding model of human vision with a deep generative network to synthesize images predicted to achieve a target pattern of macro-scale brain activation. We demonstrate that the noise reduction provided by the encoding model, coupled with the generative network's ability to produce images of high fidelity, results in a robust discovery architecture for visual neuroscience. Using only a small number of synthetic images created by NeuroGen, we demonstrate that we can detect and amplify differences in regional and individual human brain response patterns to visual stimuli. We then verify that these discoveries are reflected in the several thousand observed image responses measured with fMRI. We further demonstrate that NeuroGen can create synthetic images predicted to achieve regional response patterns not achievable by the best-matching natural images. The NeuroGen framework extends the utility of brain encoding models and opens up a new avenue for exploring, and possibly precisely controlling, the human visual system.
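At its core, this is activation maximization through a frozen generator and a frozen encoding model. The sketch below shows that optimization loop under this reading, with toy stand-ins for both networks; the real NeuroGen generator, encoding model, and regularization details differ.

```python
import torch

def synthesize_for_region(G, encoder, region_idx, steps=200, lr=0.05, z_dim=128):
    """Optimize a generator latent so the predicted response of one
    target region is maximized. G and encoder stand in for a pretrained
    image generator and an fMRI-trained encoding model (assumed given)."""
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = G(z)                       # synthesized image
        response = encoder(img)          # predicted per-region activations
        loss = -response[0, region_idx]  # maximize the target region
        loss.backward()
        opt.step()
    return G(z).detach()

# Toy stand-ins so the sketch runs end to end
G = torch.nn.Sequential(torch.nn.Linear(128, 3 * 32 * 32), torch.nn.Tanh(),
                        torch.nn.Unflatten(1, (3, 32, 32)))
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
img = synthesize_for_region(G, encoder, region_idx=3)
```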
【10】 SA-GAN: Structure-Aware Generative Adversarial Network for Shape-Preserving Synthetic CT Generation
Authors: Hajar Emami, Ming Dong, Siamak Nejad-Davarani, Carri Glide-Hurst
Affiliations: Wayne State University, MI, USA; University of Michigan, MI, USA; University of Wisconsin-Madison, WI, USA
Note: Accepted to MICCAI 2021
Link: https://arxiv.org/abs/2105.07044
Abstract: In medical image synthesis, model training can be challenging due to inconsistencies between images of different modalities, even for the same patient, typically caused by internal status/tissue changes, as different modalities are usually obtained at different times. This paper proposes a novel deep learning method, Structure-aware Generative Adversarial Network (SA-GAN), that preserves the shapes and locations of inconsistent structures when generating medical images. SA-GAN is employed to generate synthetic computed tomography (synCT) images from magnetic resonance imaging (MRI) with two parallel streams: the global stream translates the input from the MRI to the CT domain, while the local stream automatically segments the inconsistent organs, maintains their locations and shapes as in the MRI, and translates the organ intensities to CT. Through extensive experiments on a pelvic dataset, we demonstrate that SA-GAN provides clinically acceptable accuracy on both synCTs and organ segmentation, and supports MR-only treatment planning in disease sites with internal organ status changes.
Attention (3 papers)
【1】 Pay Attention to MLPs
Authors: Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le
Affiliations: Google Research, Brain Team
Link: https://arxiv.org/abs/2105.08050
Abstract: Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute.
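The gating mechanism at the heart of gMLP can be sketched in a few lines: the Spatial Gating Unit splits the channels and lets a learned projection across the token dimension gate the other half. The block below follows the paper's published description; the sizes and the near-identity initialization are typical choices, not verbatim author code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    """gMLP's SGU: split channels, mix one half across the *sequence*
    dimension with a learned linear map, and use it to gate the other."""
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial.weight)   # near-identity init:
        nn.init.ones_(self.spatial.bias)      # the gate starts close to 1

    def forward(self, x):                     # x: (B, N, d_ffn)
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v).transpose(1, 2)      # (B, d_ffn/2, N)
        v = self.spatial(v).transpose(1, 2)   # mix across tokens
        return u * v

class gMLPBlock(nn.Module):
    def __init__(self, d_model=256, d_ffn=1024, seq_len=196):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):
        y = F.gelu(self.proj_in(self.norm(x)))
        return x + self.proj_out(self.sgu(y))  # residual connection

tokens = torch.randn(2, 196, 256)              # e.g. 14x14 image patches
out = gMLPBlock()(tokens)
```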
【2】 BDANet: Multiscale Convolutional Neural Network with Cross-directional Attention for Building Damage Assessment from Satellite Images
Authors: Yu Shen, Sijie Zhu, Taojiannan Yang, Chen Chen, Delu Pan, Jianyu Chen, Liang Xiao, Qian Du
Note: arXiv admin note: text overlap with arXiv:2010.14014
Link: https://arxiv.org/abs/2105.07364
Abstract: Fast and effective responses are required when a natural disaster (e.g., earthquake, hurricane) strikes. Building damage assessment from satellite imagery is critical before relief efforts are deployed. Given a pair of pre- and post-disaster satellite images, building damage assessment aims at predicting the extent of damage to buildings. With their powerful feature representation ability, deep neural networks have been successfully applied to building damage assessment. Most existing works simply concatenate pre- and post-disaster images as input to a deep neural network without considering their correlations. In this paper, we propose a novel two-stage convolutional neural network for Building Damage Assessment, called BDANet. In the first stage, a U-Net is used to extract the locations of buildings. The network weights from the first stage are then shared in the second stage for building damage assessment. In the second stage, a two-branch multi-scale U-Net is employed as the backbone, where pre- and post-disaster images are fed into the network separately. A cross-directional attention module is proposed to explore the correlations between pre- and post-disaster images. Moreover, CutMix data augmentation is exploited to tackle the challenge of difficult classes. The proposed method achieves state-of-the-art performance on a large-scale dataset, xBD. The code is available at https://github.com/ShaneShen/BDANet-Building-Damage-Assessment.
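CutMix, mentioned above for handling difficult classes, is a standard augmentation that is easy to reproduce. Below is the common classification-style formulation (region pasting plus area-weighted label mixing); how BDANet adapts it to paired pre-/post-disaster inputs is not shown here.

```python
import numpy as np
import torch

def cutmix(images, labels, alpha=1.0):
    """Standard CutMix: paste a random crop from a shuffled batch and
    mix the labels in proportion to the pasted area."""
    B, _, H, W = images.shape
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(B)
    rh, rw = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - rh // 2, 0, H), np.clip(cy + rh // 2, 0, H)
    x1, x2 = np.clip(cx - rw // 2, 0, W), np.clip(cx + rw // 2, 0, W)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    lam = 1 - ((y2 - y1) * (x2 - x1)) / (H * W)   # exact area ratio
    return images, labels, labels[perm], lam

imgs = torch.randn(4, 3, 64, 64)
lbls = torch.randint(0, 5, (4,))
mixed, la, lb, lam = cutmix(imgs, lbls)
# Training loss: lam * ce(pred, la) + (1 - lam) * ce(pred, lb)
```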
【3】 Show Why the Answer is Correct! Towards Explainable AI using Compositional Temporal Attention
Authors: Nihar Bendre, Kevin Desai, Peyman Najafirad
Affiliations: University of Texas at San Antonio
Note: 7 pages, 4 figures, 3 tables
Link: https://arxiv.org/abs/2105.07141
Abstract: Visual Question Answering (VQA) models have achieved significant success in recent times. Despite this success, they are mostly black-box models providing no reasoning about the predicted answer, raising questions about their applicability in safety-critical settings such as autonomous systems and cyber-security. Current state-of-the-art approaches fail to handle complex questions well and are thus unable to exploit compositionality. To minimize the black-box effect of these models and to make them better exploit compositionality, we propose a Dynamic Neural Network (DMN), which can understand a particular question and then dynamically assemble various relatively shallow deep learning modules from a pool of modules to form a network. We incorporate compositional temporal attention into these deep learning-based modules to increase the exploitation of compositionality. This achieves better understanding of complex questions and also provides reasoning as to why a module predicts a particular answer. Experimental analysis on two benchmark datasets, VQA2.0 and CLEVR, shows that our model outperforms previous approaches for the Visual Question Answering task and provides better reasoning, making it reliable for mission-critical applications like safety and security.
Face | Crowd Counting (1 paper)
【1】 Private Facial Diagnosis as an Edge Service for Parkinson's DBS Treatment Valuation
Authors: Richard Jiang, Paul Chazot, Danny Crookes, Ahmed Bouridane, M Emre Celebi
Affiliations: Durham University, United Kingdom; Department of Computer Science, Queen's University Belfast
Note: Under review
Link: https://arxiv.org/abs/2105.07533
Abstract: Facial phenotyping has recently been successfully exploited for medical diagnosis as a novel way to diagnose a range of diseases, as facial biometrics has been revealed to have rich links to underlying genetic or medical causes. In this paper, taking Parkinson's Disease (PD) as a case study, we propose an Artificial-Intelligence-of-Things (AIoT) edge-oriented privacy-preserving facial diagnosis framework to analyze the treatment of Deep Brain Stimulation (DBS) on PD patients. In the proposed framework, a new edge-based, information-theoretically secure scheme is proposed to implement private deep facial diagnosis as a service over a privacy-preserving AIoT-oriented multi-party communication scheme, where partial homomorphic encryption (PHE) is leveraged to enable privacy-preserving deep facial diagnosis directly on encrypted facial patterns. In our experiments on a facial dataset collected from PD patients, we demonstrated for the first time that facial patterns can be used to valuate the improvement of PD patients undergoing DBS treatment. We further implemented a privacy-preserving deep facial diagnosis framework that achieves the same accuracy as the non-encrypted one, showing the potential of our privacy-preserving facial diagnosis as a trustworthy edge service for grading the severity of PD in patients.
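The role of partially homomorphic encryption in such a pipeline can be demonstrated with the python-paillier library: Paillier supports addition of ciphertexts and multiplication by plaintext scalars, which is enough to evaluate a linear score on encrypted features. This is only a toy illustration of the PHE principle, not the paper's multi-party scheme or its deep diagnosis model.

```python
# pip install phe  (python-paillier)
from phe import paillier

# Patient side: encrypt facial-pattern features before sending them out.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
features = [0.42, -1.3, 0.07]                       # toy facial descriptor
enc_features = [public_key.encrypt(x) for x in features]

# Edge service side: evaluate a linear diagnosis score directly on
# ciphertexts; plaintext weights multiply encrypted values and the
# encrypted products are summed, never exposing the raw features.
weights, bias = [0.8, -0.5, 1.2], 0.1
enc_score = sum(w * e for w, e in zip(weights, enc_features)) + bias

# Patient side: only the private-key holder can read the result.
print(private_key.decrypt(enc_score))
```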
Tracking (2 papers)
【1】 Multi-object Tracking with Tracked Object Bounding Box Association
Authors: Nanyang Yang, Yi Wang, Lap-Pui Chau
Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Note: 6 pages, accepted paper at ICME Workshop 2021
Link: https://arxiv.org/abs/2105.07901
Abstract: The CenterTrack tracking algorithm achieves state-of-the-art tracking performance using a simple detection model and single-frame spatial offsets to localize objects and predict their associations in a single network. However, this joint detection and tracking method still suffers from a high number of identity switches due to its inferior association method. To reduce the number of identity switches and improve tracking accuracy, in this paper we propose to incorporate a simple tracked-object bounding box and overlap prediction based on the current frame into the CenterTrack algorithm. Specifically, we propose an Intersection over Union (IOU) distance cost matrix in the association step instead of a simple point displacement distance. We evaluate our proposed tracker on the MOT17 test dataset, showing that it can reduce identity switches significantly, by 22.6%, and obtain a notable improvement of 1.5% in IDF1 compared to the original CenterTrack under the same tracklet lifetime. The source code is released at https://github.com/Nanyangny/CenterTrack-IOU.
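The proposed association step reduces to a classic recipe: an IoU-distance cost matrix solved by the Hungarian algorithm. A minimal sketch follows; the gating threshold and box format are assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Build a 1 - IoU cost matrix between tracked boxes and current
    detections, solve it with the Hungarian algorithm, and discard
    matches whose overlap is too weak."""
    cost = np.array([[1 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1 - cost[r, c] >= iou_thresh]

tracks = [(10, 10, 50, 50), (100, 100, 140, 160)]
dets = [(104, 98, 142, 158), (12, 8, 52, 49)]
print(associate(tracks, dets))   # -> [(0, 1), (1, 0)]
```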
【2】 TSDF++: A Multi-Object Formulation for Dynamic Object Tracking and Reconstruction
Authors: Margarita Grinvald, Federico Tombari, Roland Siegwart, Juan Nieto
Note: 7 pages, 3 figures. To be published in the 2021 IEEE International Conference on Robotics and Automation (ICRA). Code is available at this https URL and the accompanying video material can be found at this https URL
Link: https://arxiv.org/abs/2105.07468
Abstract: The ability to simultaneously track and reconstruct multiple objects moving in a scene is of the utmost importance for robotic tasks such as autonomous navigation and interaction. Virtually all previous attempts to map multiple dynamic objects store individual objects in separate reconstruction volumes and track the relative poses between them. While simple and intuitive, such a formulation does not scale well with the number of objects in the scene and requires an explicit occlusion handling strategy. In contrast, we propose a map representation that allows maintaining a single volume for the entire scene and all the objects therein. To this end, we introduce a novel multi-object TSDF formulation that can encode multiple object surfaces at any given location in the map. In a multiple dynamic object tracking and reconstruction scenario, our representation allows maintaining accurate reconstruction of surfaces even while they are temporarily occluded by other objects moving in their proximity. We evaluate the proposed TSDF++ formulation on a public synthetic dataset and demonstrate its ability to preserve reconstructions of occluded surfaces when compared to the standard TSDF map representation.
Image/Video Retrieval | Re-ID (1 paper)
【1】 FDDH: Fast Discriminative Discrete Hashing for Large-Scale Cross-Modal Retrieval
Authors: Xin Liu, Xingzhi Wang, Yiu-ming Cheung
Affiliations: School of Electronics and Information Technology, Sun Yat-sen University
Note: 16 pages, 7 figures
Link: https://arxiv.org/abs/2105.07128
Abstract: Cross-modal hashing, favored for its effectiveness and efficiency, has received wide attention for facilitating efficient retrieval across different modalities. Nevertheless, most existing methods do not sufficiently exploit the discriminative power of semantic information when learning the hash codes, and often involve time-consuming training procedures for handling large-scale datasets. To tackle these issues, we formulate the learning of similarity-preserving hash codes in terms of orthogonally rotating the semantic data so as to minimize the quantization loss of mapping such data to the Hamming space, and propose an efficient Fast Discriminative Discrete Hashing (FDDH) approach for large-scale cross-modal retrieval. More specifically, FDDH introduces an orthogonal basis to regress the targeted hash codes of training examples to their corresponding semantic labels, and utilizes the ε-dragging technique to provide provably large semantic margins. Accordingly, the discriminative power of semantic information can be explicitly captured and maximized. Moreover, an orthogonal transformation scheme is further proposed to map the nonlinear embedding data into the semantic subspace, which can well guarantee the semantic consistency between the data feature and its semantic representation. Consequently, an efficient closed-form solution is derived for discriminative hash code learning, which is very computationally efficient. In addition, an effective and stable online learning strategy is presented for optimizing modality-specific projection functions, featuring adaptivity to different training sizes and streaming data. The proposed FDDH approach theoretically approximates bi-Lipschitz continuity, runs sufficiently fast, and significantly improves retrieval performance over state-of-the-art methods. The source code is released at: https://github.com/starxliu/FDDH.
Pruning | Quantization | Acceleration | Compression (1 paper)
【1】 Real-Time Quantized Image Super-Resolution on Mobile NPUs, Mobile AI 2021 Challenge: Report
Authors: Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Andrew Lek, Mustafa Ayazoglu, Jie Liu, Zongcai Du, Jiaming Guo, Xueyi Zhou, Hao Jia, Youliang Yan, Zexin Zhang, Yixin Chen, Yunbo Peng, Yue Lin, Xindong Zhang, Hui Zeng, Kun Zeng, Peirong Li, Zhihuang Liu, Shiqi Xue, Shengpeng Wang
Note: Mobile AI 2021 Workshop and Challenges: this https URL
Link: https://arxiv.org/abs/2105.07825
Abstract: Image super-resolution is one of the most popular computer vision problems, with many important applications on mobile devices. While many solutions have been proposed for this task, they are usually not optimized even for common smartphone AI hardware, not to mention the more constrained smart TV platforms that often support INT8 inference only. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop end-to-end deep learning-based image super-resolution solutions that can demonstrate real-time performance on mobile or edge NPUs. For this, the participants were provided with the DIV2K dataset and trained quantized models to perform efficient 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated NPU capable of accelerating quantized neural networks. The proposed solutions are fully compatible with all major mobile AI accelerators and are capable of reconstructing Full HD images in 40-60 ms while achieving high-fidelity results. A detailed description of all models developed in the challenge is provided in this paper.
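For readers who want to reproduce the INT8 constraint, TensorFlow Lite's post-training full-integer quantization is the standard route of the kind this challenge targets. The snippet below quantizes a toy 3X upscaling model with a representative calibration set; it is a generic recipe, not any specific challenge entry.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for a trained super-resolution model (3X upscaling
# via depth-to-space); a real model would be trained on DIV2K.
model = tf.keras.Sequential([
    tf.keras.Input((64, 64, 3)),
    tf.keras.layers.Conv2D(27, 3, padding="same"),
    tf.keras.layers.Lambda(lambda x: tf.nn.depth_to_space(x, 3)),
])

def representative_data():
    for _ in range(100):                    # calibration samples
        yield [np.random.rand(1, 64, 64, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8   # NPU-friendly integer I/O
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()
open("sr_int8.tflite", "wb").write(tflite_model)
```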
Representation Learning (1 paper)
【1】 Disentangled Variational Information Bottleneck for Multiview Representation Learning
Authors: Feng Bao
Affiliations: University of California, San Francisco
Link: https://arxiv.org/abs/2105.07599
Abstract: Multiview data contain information from multiple modalities and have the potential to provide more comprehensive features for diverse machine learning tasks. A fundamental question in multiview analysis is what additional information is brought by additional views, and whether it can be quantitatively identified. In this work, we try to tackle this challenge by decomposing the entangled multiview features into shared latent representations that are common across all views and private representations that are specific to each single view. We formulate this feature disentanglement in the framework of the information bottleneck and propose the disentangled variational information bottleneck (DVIB). DVIB explicitly defines the properties of shared and private representations using constraints from mutual information. By deriving variational upper and lower bounds of the mutual information terms, the representations are efficiently optimized. We demonstrate that the shared and private representations learned by DVIB well preserve, respectively, the common labels shared between two views and the unique labels corresponding to each single view. DVIB also shows comparable performance in classification tasks on corrupted images. The DVIB implementation is available at https://github.com/feng-bao-ucsf/DVIB.
Distillation | Knowledge Extraction (1 paper)
【1】 Algorithmic Principles of Camera-based Respiratory Motion Extraction
Authors: Wenjin Wang, Albertus C. den Brinker
Note: Camera-based contactless health monitoring
Link: https://arxiv.org/abs/2105.07537
Abstract: Measuring the respiratory signal from a video based on body motion has been proposed and recently matured in products for video health monitoring. The core algorithm for this measurement is the estimation of tiny chest/abdominal motions induced by respiration, and the fundamental challenge is motion sensitivity. Though prior art reports validation with real human subjects, there is no thorough/rigorous benchmark to quantify the sensitivities and boundary conditions of motion-based core respiratory algorithms that measure sub-pixel displacement between video frames. In this paper, we designed a setup with a fully controllable physical phantom to investigate the essence of the core algorithms, together with a mathematical model incorporating two motion estimation strategies and three spatial representations, leading to six algorithmic combinations for respiratory signal extraction. Their promises and limitations are discussed and clarified via the phantom benchmark. The insights gained in this paper are intended to improve the understanding and applications of camera-based respiration measurement in health monitoring.
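The measurement core the paper benchmarks, sub-pixel displacement between frames, can be tried directly with off-the-shelf phase correlation. The sketch below recovers a synthetic 0.3-pixel shift; it is one of many possible motion-estimation strategies, not the specific combinations evaluated in the paper.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift
from skimage.registration import phase_cross_correlation

# Two frames of a synthetic texture displaced by 0.3 px vertically,
# standing in for respiration-induced chest motion between frames.
rng = np.random.default_rng(0)
frame0 = rng.random((128, 128))
frame1 = nd_shift(frame0, (0.3, 0.0), order=3)

# Phase correlation with upsampling recovers the sub-pixel shift;
# tracking this value over time yields a respiratory waveform.
shift, error, _ = phase_cross_correlation(frame0, frame1, upsample_factor=100)
print(shift)   # magnitude ~0.3 px on the first axis (sign per skimage's convention)
```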
Visual Explanation | Video Understanding & VQA | Captioning (1 paper)
【1】 Expressive Explanations of DNNs by Combining Concept Analysis with ILP
Authors: Johannes Rabold, Gesina Schwalbe, Ute Schmid
Affiliations: Cognitive Systems, University of Bamberg, Germany; Holistic Engineering and Technologies, Artificial Intelligence, Continental AG, Regensburg, Germany
Note: 14 pages, 4 figures; camera-ready submission to KI2020; the final authenticated publication is available online at this https URL; code available at this https URL
Link: https://arxiv.org/abs/2105.07371
Abstract: Explainable AI has emerged as a key component for black-box machine learning approaches in domains with a high demand for reliability or transparency. Examples are medical assistant systems and applications concerned with the General Data Protection Regulation of the European Union, which features transparency as a cornerstone. Such demands require the ability to audit the rationale behind a classifier's decision. While visualizations are the de facto standard of explanations, they fall short in terms of expressiveness in many ways: they cannot distinguish between different attribute manifestations of visual features (e.g. eye open vs. closed), and they cannot accurately describe the influence of the absence of, and the relations between, features. An alternative would be more expressive symbolic surrogate models. However, these require symbolic inputs, which are not readily available in most computer vision tasks. In this paper we investigate how to overcome this: we use inherent features learned by the network to build a global, expressive, verbal explanation of the rationale of a feed-forward convolutional deep neural network (DNN). The semantics of the features are mined by a concept analysis approach trained on a set of human-understandable visual concepts. The explanation is found by an Inductive Logic Programming (ILP) method and presented as first-order rules. We show that our explanation is faithful to the original black-box model. The code for our experiments is available at https://github.com/mc-lovin-mlem/concept-embeddings-and-ilp/tree/ki2020.
Super-Resolution | Denoising | Deblurring | Dehazing (3 papers)
【1】 Window-Level is a Strong Denoising Surrogate
Authors: Ayaan Haque, Adam Wang, Abdullah-Al-Zubaer Imran
Affiliations: Saratoga High School, Saratoga, CA, USA; Stanford University, Stanford, CA, USA
Note: 11 pages, 4 figures
Link: https://arxiv.org/abs/2105.07153
Abstract: CT image quality is heavily reliant on radiation dose, which causes a trade-off between radiation dose and image quality that affects subsequent image-based diagnostic performance. However, high radiation can be harmful to both patients and operators. Several (deep learning-based) approaches have attempted to denoise low-dose images. However, those approaches require access to large training sets, specifically the full-dose CT images for reference, which can often be difficult to obtain. Self-supervised learning is an emerging alternative that lowers the reference data requirement, facilitating unsupervised learning. Currently available self-supervised CT denoising works either depend on a foreign domain or use pretexts that are not very task-relevant. To tackle the aforementioned challenges, we propose a novel self-supervised learning approach, namely Self-Supervised Window-Leveling for Image DeNoising (SSWL-IDN), leveraging an innovative, task-relevant, simple yet effective surrogate: prediction of the window-leveled equivalent. SSWL-IDN leverages residual learning and a hybrid loss combining perceptual loss and MSE, all incorporated in a VAE framework. Our extensive (in- and cross-domain) experimentation demonstrates the effectiveness of SSWL-IDN in aggressive denoising of CT (abdomen and chest) images acquired at a 5% dose level only.
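The surrogate itself, window leveling, is a standard CT operation: clip a HU range defined by a (center, width) window and rescale. Below is a minimal sketch of how such a target could be constructed; the window values are common clinical presets, not necessarily those used in the paper.

```python
import numpy as np

def window_level(hu_image, center, width):
    """Standard CT window-leveling: clip the HU range defined by
    (center, width) and rescale to [0, 1]. In a SSWL-style setup this
    routine defines the self-supervised target: the network sees the
    noisy image and predicts its window-leveled equivalent."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(hu_image, lo, hi) - lo) / (hi - lo)

ct = np.random.uniform(-1000, 1000, (512, 512))    # toy HU values
abdomen = window_level(ct, center=40, width=400)   # common abdomen window
```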
【2】 RIDnet: Radiologist-Inspired Deep Neural Network for Low-dose CT Denoising
Authors: Kecheng Chen, Jiayu Sun, Jiang Shen, Jixiang Luo, Xinyu Zhang, Xuelin Pan, Dongsheng Wu, Yue Zhao, Miguel Bento, Yazhou Ren, Xiaorong Pu
Affiliations: Sichuan University
Note: Under review
Link: https://arxiv.org/abs/2105.07146
Abstract: With its lower radiation exposure and reduced harm to health, low-dose computed tomography (LDCT) has been widely adopted in the early screening of lung cancer and COVID-19. LDCT images inevitably suffer from degradation caused by complex noise. It has been reported that, compared with commercial iterative reconstruction methods, deep learning (DL)-based LDCT denoising methods using convolutional neural networks (CNNs) achieve competitive performance. Most existing DL-based methods focus on the local information extracted by the CNN, while ignoring both explicit non-local and context information (which radiologists leverage). To address this issue, we propose a novel deep learning model named Radiologist-Inspired Deep Denoising Network (RIDnet) to imitate the workflow of a radiologist reading LDCT images. Concretely, the proposed model explicitly integrates local, non-local and context information rather than local information only. Our radiologist-inspired model is potentially favoured by radiologists as it follows a familiar workflow. A double-blind reader study on a public clinical dataset shows that, compared with state-of-the-art methods, our proposed model achieves the most impressive performance in terms of structural fidelity, noise suppression and overall score. As a physician-inspired model, RIDnet offers a new research roadmap that takes into account the behavior of physicians when designing decision support tools for assisting clinical diagnosis. Models and code are available at https://github.com/tonyckc/RIDnet_demo.
【3】 Image Super-Resolution Quality Assessment: Structural Fidelity Versus Statistical Naturalness
Authors: Wei Zhou, Zhou Wang, Zhibo Chen
Affiliations: Dept. of Electrical & Computer Engineering, University of Waterloo, Waterloo, ON, Canada; CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System, University of Science and Technology of China, Hefei, China
Note: Accepted by QoMEX 2021
Link: https://arxiv.org/abs/2105.07139
Abstract: Single image super-resolution (SISR) algorithms reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. It is desirable to develop image quality assessment (IQA) methods that can not only evaluate and compare SISR algorithms, but also guide their future development. In this paper, we assess the quality of SISR-generated images in a two-dimensional (2D) space of structural fidelity versus statistical naturalness. This allows us to observe the behaviors of different SISR algorithms as trade-offs in the 2D space. Specifically, SISR methods are traditionally designed to achieve high structural fidelity but often sacrifice statistical naturalness, while recent generative adversarial network (GAN)-based algorithms tend to create more natural-looking results but lose significantly on structural fidelity. Furthermore, such a 2D evaluation can easily be fused into a scalar quality prediction. Interestingly, we find that a simple linear combination of a straightforward local structural fidelity measure and a global statistical naturalness measure produces surprisingly accurate predictions of SISR image quality when tested on public subject-rated SISR image datasets. Code for the proposed SFSN model is publicly available at https://github.com/weizhou-geek/SFSN.
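The fusion idea can be sketched directly: one local fidelity term plus one global naturalness term, combined linearly. Below, SSIM stands in for the structural-fidelity measure and a crude statistic stands in for naturalness; the actual SFSN measures and weights are in the authors' repository, so everything here is an illustrative assumption.

```python
import numpy as np
from skimage.metrics import structural_similarity

def fused_quality(sr, hr, naturalness_fn, w=(0.5, 0.5)):
    """Collapse the 2D (fidelity, naturalness) evaluation to a scalar
    via a linear combination. naturalness_fn is a placeholder for any
    no-reference naturalness measure (e.g. a NIQE-style statistic)."""
    fidelity = structural_similarity(sr, hr, data_range=1.0)
    naturalness = naturalness_fn(sr)
    return w[0] * fidelity + w[1] * naturalness

# Crude toy naturalness proxy: penalize atypical global contrast.
toy_natural = lambda img: 1.0 - float(abs(img.std() - 0.25) / 0.25)

hr = np.random.rand(64, 64)
sr = np.clip(hr + 0.05 * np.random.randn(64, 64), 0, 1)
print(fused_quality(sr, hr, toy_natural))
```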
Point Cloud | SLAM | Radar | LiDAR | Depth & RGB-D (2 papers)
【1】 HCRF-Flow: Scene Flow from Point Clouds with Continuous High-order CRFs and Position-aware Flow Embedding
Authors: Ruibo Li, Guosheng Lin, Tong He, Fayao Liu, Chunhua Shen
Affiliations: S-Lab, Nanyang Technological University, Singapore; School of Computer Science and Engineering, Nanyang Technological University, Singapore; University of Adelaide, Australia; Institute for Infocomm Research, A*STAR, Singapore
Note: Accepted to CVPR 2021
Link: https://arxiv.org/abs/2105.07751
Abstract: Scene flow in 3D point clouds plays an important role in understanding dynamic environments. Although significant advances have been made by deep neural networks, performance is far from satisfactory, as only per-point translational motion is considered, neglecting the constraints of rigid motion in local regions. To address the issue, we propose to introduce motion consistency to enforce smoothness among neighboring points. In addition, constraints on the rigidity of the local transformation are added by sharing unique rigid motion parameters for all points within each local region. To this end, a high-order CRFs based relation module (Con-HCRFs) is deployed to explore both point-wise smoothness and region-wise rigidity. To empower the CRFs with a discriminative unary term, we also introduce a position-aware flow estimation module, which is incorporated into the Con-HCRFs. Comprehensive experiments on FlyingThings3D and KITTI show that our proposed framework (HCRF-Flow) achieves state-of-the-art performance and significantly outperforms previous approaches.
【2】 Differentiable SLAM-net: Learning Particle SLAM for Visual Navigation
Authors: Peter Karkus, Shaojun Cai, David Hsu
Affiliations: National University of Singapore
Note: CVPR 2021
Link: https://arxiv.org/abs/2105.07593
Abstract: Simultaneous localization and mapping (SLAM) remains challenging for a number of downstream applications, such as visual robot navigation, because of rapid turns, featureless walls, and poor camera quality. We introduce the Differentiable SLAM Network (SLAM-net) along with a navigation architecture to enable planar robot navigation in previously unseen indoor environments. SLAM-net encodes a particle filter based SLAM algorithm in a differentiable computation graph, and learns task-oriented neural network components by backpropagating through the SLAM algorithm. Because it can optimize all model components jointly for the end objective, SLAM-net learns to be robust in challenging conditions. We run experiments in the Habitat platform with different real-world RGB and RGB-D datasets. SLAM-net significantly outperforms the widely adopted ORB-SLAM in noisy conditions. Our navigation architecture with SLAM-net improves the state of the art for the Habitat Challenge 2020 PointNav task by a large margin (37% to 64% success). Project website: http://sites.google.com/view/slamnet
Multimodal (1 paper)
【1】 A Review on Explainability in Multimodal Deep Neural Nets
Authors: Gargi Joshi, Rahee Walambe, Ketan Kotecha
Note: 24 pages, 6 figures
Link: https://arxiv.org/abs/2105.07878
Abstract: Artificial intelligence techniques powered by deep neural nets have achieved much success in several application domains, most significantly and notably in computer vision applications and natural language processing tasks. Surpassing human-level performance has propelled research into applications where different modalities among language, vision, sensory data and text play an important role in accurate predictions and identification. Several multimodal fusion methods employing deep learning models have been proposed in the literature. Despite their outstanding performance, the complex, opaque and black-box nature of deep neural nets limits their social acceptance and usability. This has given rise to the quest for model interpretability and explainability, more so in the complex tasks involving multimodal AI methods. This paper extensively reviews the present literature to provide a comprehensive survey and commentary on explainability in multimodal deep neural nets, especially for vision and language tasks. Several topics on multimodal AI and its applications in generic domains are covered, including the significance, datasets, fundamental building blocks of the methods and techniques, challenges, applications, and future trends in this domain.
Other: Neural Networks | Deep Learning | Models | Modeling (8 papers)
【1】 Learning to Automatically Catch Potholes in Worldwide Road Scene Images
Authors: J. Javier Yebes, David Montero, Ignacio Arriola
Note: In IEEE Intelligent Transportation Systems Magazine
Link: https://arxiv.org/abs/2105.07986
Abstract: Among the several road hazards present on any paved way in the world, potholes are one of the most annoying and also involve higher maintenance costs. There is an increasing interest in the automated detection of these hazards, enabled by technological and research progress. Our research work tackles the challenge of pothole detection from images of real-world road scenes. The main novelty resides in the application of the latest progress in AI to learn the visual appearance of potholes. We built a large dataset of images with pothole annotations. They contain road scenes from different cities in the world, taken with different cameras, vehicles and viewpoints under varied environmental conditions. We then fine-tuned four different object detection models based on Faster R-CNN and SSD deep neural networks. We achieved high average precision, and the pothole detector was tested on the Nvidia DrivePX2 platform with GPGPU capability, which can be embedded in vehicles. Moreover, it was deployed on a real vehicle to report detected potholes to a given IoT platform as part of the AUTOPILOT H2020 project.
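Fine-tuning an off-the-shelf detector for a single "pothole" class, as done here, follows a well-worn torchvision pattern. The sketch below swaps the box-predictor head of a COCO-pretrained Faster R-CNN; dataset loading and the training loop are left out, and the snippet is a generic recipe rather than the authors' setup.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a COCO-pretrained Faster R-CNN and replace its box head
# for 2 classes (background + pothole).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

model.train()
images = [torch.rand(3, 480, 640)]                      # toy road-scene image
targets = [{"boxes": torch.tensor([[100., 200., 180., 260.]]),
            "labels": torch.tensor([1])}]               # one pothole box
loss_dict = model(images, targets)                      # RPN + ROI losses
loss = sum(loss_dict.values())
loss.backward()                                         # one training step
```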
【2】 Leveraging EfficientNet and Contrastive Learning for Accurate Global-scale Location Estimation
Authors: Giorgos Kordopatis-Zilos, Panagiotis Galopoulos, Symeon Papadopoulos, Ioannis Kompatsiaris
Affiliations: Information Technologies Institute, CERTH, Thessaloniki, Greece
Link: https://arxiv.org/abs/2105.07645
Abstract: In this paper, we address the problem of global-scale image geolocation, proposing a mixed classification-retrieval scheme. Unlike other methods that strictly tackle the problem as a classification or retrieval task, we combine the two practices in a unified solution, leveraging the advantages of each approach with two different modules. The first leverages the EfficientNet architecture to assign images to a specific geographic cell in a robust way. The second introduces a new residual architecture that is trained with contrastive learning to map input images to an embedding space that minimizes the pairwise geodesic distance of same-location images. For the final location estimation, the two modules are combined with a search-within-cell scheme, where the locations of the most similar images from the predicted geographic cell are aggregated based on a spatial clustering scheme. Our approach demonstrates very competitive performance on four public datasets, achieving new state-of-the-art performance at fine granularity, i.e., 15.0% at the 1 km range on Im2GPS3k.
【3】 Layerwise Optimization by Gradient Decomposition for Continual Learning
标题:基于梯度分解的连续学习分层优化
作者:Shixiang Tang,Dapeng Chen,Jinguo Zhu,Shijie Yu,Wanli Ouyang
机构:The University of Sydney, SenseTime Computer Vision Group, Australia, Xi’an Jiaotong University, Sensetime Group Limited, Hong Kong, Shenzhen Institutes of Advanced Technology, CAS
备注:cvpr2021
链接:https://arxiv.org/abs/2105.07561
摘要:深度神经网络在各个领域实现了最先进的、有时甚至是超人的性能。然而,当连续学习任务时,网络很容易忘记先前任务的知识,称为“灾难性遗忘”。为了实现新旧任务之间的一致性,一种有效的解决方法是修改梯度进行更新。以前的方法对不同的任务强制执行独立的梯度约束,而我们考虑到这些梯度包含复杂的信息,并提出通过梯度分解来利用任务间的信息。特别地,将旧任务的梯度分解为所有旧任务共享的部分和特定于该任务的部分。更新的梯度应该接近新任务的梯度,与所有旧任务共享的梯度一致,并且与旧任务特定的梯度所跨越的空间正交。通过这种方式,我们的方法在不损害任务特定知识的情况下鼓励了公共知识的巩固。此外,对每一层的梯度分别进行优化,而不是像以前的工作那样对所有梯度进行串联。这有效地避免了不同层中梯度大小变化的影响。大量实验验证了梯度分解优化和分层更新的有效性。我们提出的方法在各种持续学习的基准上取得了最先进的结果。
摘要:Deep neural networks achieve state-of-the-art and sometimes super-human performance across various domains. However, when learning tasks sequentially, the networks easily forget the knowledge of previous tasks, known as "catastrophic forgetting". To achieve the consistencies between the old tasks and the new task, one effective solution is to modify the gradient for update. Previous methods enforce independent gradient constraints for different tasks, while we consider these gradients contain complex information, and propose to leverage inter-task information by gradient decomposition. In particular, the gradient of an old task is decomposed into a part shared by all old tasks and a part specific to that task. The gradient for update should be close to the gradient of the new task, consistent with the gradients shared by all old tasks, and orthogonal to the space spanned by the gradients specific to the old tasks. In this way, our approach encourages common knowledge consolidation without impairing the task-specific knowledge. Furthermore, the optimization is performed for the gradients of each layer separately rather than the concatenation of all gradients as in previous works. This effectively avoids the influence of the magnitude variation of the gradients in different layers. Extensive experiments validate the effectiveness of both gradient-decomposed optimization and layer-wise updates. Our proposed method achieves state-of-the-art results on various benchmarks of continual learning.
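The core update rule can be illustrated with a short sketch. The following is our own toy rendering of the orthogonality constraint, not the authors' implementation: it projects a new task's gradient onto the complement of the span of the old tasks' task-specific gradients, applied per layer rather than to one concatenated vector.
```python
import torch

def remove_specific_component(g_new, specific_grads):
    """Project g_new onto the orthogonal complement of span(specific_grads)."""
    G = torch.stack(specific_grads, dim=1)   # (d, k): old task-specific grads
    Q, _ = torch.linalg.qr(G)                # orthonormal basis of their span
    return g_new - Q @ (Q.T @ g_new)

# toy layer-wise usage: each layer's flattened gradient is projected
# separately instead of concatenating all layers into one vector
d = 16
g_new = torch.randn(d)
old_specific = [torch.randn(d) for _ in range(3)]
g_update = remove_specific_component(g_new, old_specific)
print(max(abs(float(g_update @ g)) for g in old_specific))  # ~0: orthogonal
```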
【4】 Neural Trees for Learning on Graphs
标题:用于图上学习的神经树
作者:Rajat Talak,Siyi Hu,Lisa Peng,Luca Carlone
机构:Laboratory of Information and Decision Systems (LIDS), Massachusetts Institute of Technology
链接:https://arxiv.org/abs/2105.07264
摘要:图神经网络(GNNs)是一种灵活而强大的图上学习方法。尽管取得了这一成功,现有的GNNs仍受限于其局部消息传递体系结构,其表达能力也被证明是有限的。在这项工作中,我们提出了一种新的GNN架构——神经树。神经树结构不在输入图上执行消息传递,而是在由输入图构造的树结构图(称为H-树)上执行消息传递。H-树中的节点对应于输入图中的子图,并且它们以分层方式重新组织,使得H-树中一个节点的父节点总是对应于输入图中更大的子图。我们证明了这种神经树结构可以逼近无向图上的任意光滑概率分布函数,并能模拟连接树(junction tree)算法。我们还证明了实现分布函数的$\epsilon$-近似所需的参数数目在输入图的树宽上是指数的,但在其大小上是线性的。我们将神经树应用于三维场景图中的半监督节点分类,结果表明,与更传统的GNN结构相比,这些理论特性带来了预测精度的显著提升。
摘要:Graph Neural Networks (GNNs) have emerged as a flexible and powerful approach for learning over graphs. Despite this success, existing GNNs are constrained by their local message-passing architecture and are provably limited in their expressive power. In this work, we propose a new GNN architecture -- the Neural Tree. The neural tree architecture does not perform message passing on the input graph but on a tree-structured graph, called the H-tree, that is constructed from the input graph. Nodes in the H-tree correspond to subgraphs in the input graph, and they are reorganized in a hierarchical manner such that a parent-node of a node in the H-tree always corresponds to a larger subgraph in the input graph. We show that the neural tree architecture can approximate any smooth probability distribution function over an undirected graph, as well as emulate the junction tree algorithm. We also prove that the number of parameters needed to achieve an $\epsilon$-approximation of the distribution function is exponential in the treewidth of the input graph, but linear in its size. We apply the neural tree to semi-supervised node classification in 3D scene graphs, and show that these theoretical properties translate into significant gains in prediction accuracy, over the more traditional GNN architectures.
【5】 Make Bipedal Robots Learn How to Imitate
标题:让两足机器人学会模仿
作者:Vishal Kumar,Sinnu Susan Thomas
机构:School of Computer Science and Engineering, Digital University Kerala (IIITMK), Kerala 695317, India
链接:https://arxiv.org/abs/2105.07193
摘要:两足机器人的表现不如人类,因为它们不像我们那样学习走路。本文提出了一种利用模仿学习(IL)训练两足机器人完成基本动作的方法:由指导员完成动作,机器人尝试模仿指导员的动作。据我们所知,这是首次仅用指导员的单个视频来训练机器人执行动作;由于训练是基于关节角度进行的,机器人将始终把关节角度保持在物理极限内,这反过来有助于加快训练。机器人的关节由OpenPose架构识别,然后利用三点之间的夹角提取关节角度数据,得到带噪声的解。我们使用Savitzky-Golay滤波器平滑数据,并保留模拟数据的结构。通过经验回放训练一个精心设计的深度Q网络(DQN),使机器人学会执行与指导员相似的动作。本文的实现代码已公开。
摘要:Bipedal robots do not perform well as humans since they do not learn to walk like we do. In this paper we propose a method to train a bipedal robot to perform some basic movements with the help of imitation learning (IL) in which an instructor will perform the movement and the robot will try to mimic the instructor movement. To the best of our knowledge, this is the first time we train the robot to perform movements with a single video of the instructor and as the training is done based on joint angles the robot will keep its joint angles always in physical limits which in return help in faster training. The joints of the robot are identified by OpenPose architecture and then joint angle data is extracted with the help of angle between three points resulting in a noisy solution. We smooth the data using Savitzky-Golay filter and preserve the Simulatore data anatomy. An ingeniously written Deep Q Network (DQN) is trained with experience replay to make the robot learn to perform the movements as similar as the instructor. The implementation of the paper is made publicly available.
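The joint-angle extraction and smoothing pipeline is easy to sketch. The snippet below is a hedged illustration with synthetic keypoints standing in for OpenPose output; the window length and polynomial order are arbitrary choices, not the paper's settings.
```python
import numpy as np
from scipy.signal import savgol_filter

def joint_angle(a, b, c):
    """Angle at keypoint b formed by segments b->a and b->c, in radians."""
    v1 = np.asarray(a, float) - np.asarray(b, float)
    v2 = np.asarray(c, float) - np.asarray(b, float)
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# toy stand-ins for hip/knee/ankle keypoints over 100 video frames
rng = np.random.default_rng(0)
hips = np.tile([0.0, 1.0], (100, 1)) + 0.01 * rng.standard_normal((100, 2))
knees = np.tile([0.0, 0.5], (100, 1)) + 0.01 * rng.standard_normal((100, 2))
ankles = np.tile([0.2, 0.0], (100, 1)) + 0.01 * rng.standard_normal((100, 2))

angles = np.array([joint_angle(h, k, a) for h, k, a in zip(hips, knees, ankles)])
smoothed = savgol_filter(angles, window_length=11, polyorder=3)  # denoised
```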
【6】 Stacked Deep Multi-Scale Hierarchical Network for Fast Bokeh Effect Rendering from a Single Image
标题:基于堆叠式深度多尺度分层网络的单幅图像快速Bokeh效果绘制
作者:Saikat Dutta,Sourya Dipta Das,Nisarg A. Shah,Anil Kumar Tiwari
机构:IIT Madras, Chennai, India, Jadavpur University, Kolkata, India, IIT Jodhpur, Jodhpur, India
备注:Accepted to MAI workshop, CVPR 2021. Code and models: this https URL
链接:https://arxiv.org/abs/2105.07174
摘要:散景(Bokeh)效应是摄影中渲染艺术感和美感照片最理想的效果之一。通常,它需要一台具有不同光圈和快门设置的单反相机以及一定的摄影技巧才能产生。在智能手机中,人们使用计算方法和附加传感器来克服物理镜头和传感器的限制,以实现这种效果。现有方法大多利用附加传感器数据或预训练网络对场景进行精细深度估计,有时还采用人像分割预训练网络模块来分割图像中的显著物体。由于这些原因,网络参数众多、运行时开销大,无法在中端设备上运行。本文采用端到端的深度多尺度分层网络(DMSHN)模型,对单目相机拍摄的图像直接渲染散景效果。为了进一步提高这种效果的感知质量,还提出了由两个DMSHN模块组成的堆叠模型。我们的模型不依赖任何用于单目深度估计或显著性检测的预训练网络模块,从而大大减少了模型的大小和运行时间。堆叠DMSHN在大规模EBB!数据集上取得了最先进的结果,与当前最先进模型相比,处理高清图像的运行时间减少了约6倍。
摘要:The Bokeh Effect is one of the most desirable effects in photography for rendering artistic and aesthetic photos. Usually, it requires a DSLR camera with different aperture and shutter settings and certain photography skills to generate this effect. In smartphones, computational methods and additional sensors are used to overcome the physical lens and sensor limitations to achieve such effect. Most of the existing methods utilized additional sensor's data or pretrained network for fine depth estimation of the scene and sometimes use portrait segmentation pretrained network module to segment salient objects in the image. Because of these reasons, networks have many parameters, become runtime intensive and unable to run in mid-range devices. In this paper, we used an end-to-end Deep Multi-Scale Hierarchical Network (DMSHN) model for direct Bokeh effect rendering of images captured from the monocular camera. To further improve the perceptual quality of such effect, a stacked model consisting of two DMSHN modules is also proposed. Our model does not rely on any pretrained network module for Monocular Depth Estimation or Saliency Detection, thus significantly reducing the size of model and run time. Stacked DMSHN achieves state-of-the-art results on a large scale EBB! dataset with around 6x less runtime compared to the current state-of-the-art model in processing HD quality images.
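A coarse-to-fine multi-scale hierarchy of the general kind described here can be sketched as follows; this toy model is our own simplification and not the DMSHN architecture itself.
```python
import torch
import torch.nn.functional as F
from torch import nn

class Level(nn.Module):
    # stand-in encoder-decoder for one level of the hierarchy
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 3, 3, padding=1))
    def forward(self, x):
        return self.body(x)

class CoarseToFine(nn.Module):
    """Process a 3-level image pyramid; each level refines the upsampled
    output of the coarser level below it."""
    def __init__(self):
        super().__init__()
        self.levels = nn.ModuleList(Level() for _ in range(3))
    def forward(self, x):
        pyramid = [x, F.avg_pool2d(x, 2), F.avg_pool2d(x, 4)]  # fine -> coarse
        out = None
        for level, inp in zip(self.levels, reversed(pyramid)):  # coarse -> fine
            if out is not None:
                out = F.interpolate(out, size=inp.shape[-2:], mode="bilinear",
                                    align_corners=False)
                inp = inp + out  # refine the coarser prediction
            out = level(inp)
        return out

img = torch.rand(1, 3, 256, 256)
print(CoarseToFine()(img).shape)  # torch.Size([1, 3, 256, 256])
```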
【7】 High-Robustness, Low-Transferability Fingerprinting of Neural Networks
标题:一种高鲁棒性、低可移植性的神经网络指纹
作者:Siyue Wang,Xiao Wang,Pin-Yu Chen,Pu Zhao,Xue Lin
机构:Northeastern University, Boston University, IBM Research
备注:ICLR 2021 Workshop on Security and Safety in Machine Learning Systems
链接:https://arxiv.org/abs/2105.07078
摘要:本文提出用特征示例(Characteristic Examples)对深度神经网络进行有效的指纹识别,该特征示例在模型剪枝下对基础模型保持很强的鲁棒性,同时对无关联模型的可迁移性很低。这是首个同时考虑鲁棒性和可迁移性来生成真实指纹的工作,而现有方法缺乏实际的假设,并且可能产生较大的误报率。为了在鲁棒性和可迁移性之间取得更好的平衡,我们提出了三种特征示例:vanilla C示例、RC示例和LTRC示例,用于从原始基础模型中提取指纹。为了公平地刻画鲁棒性和可迁移性之间的权衡,我们提出了唯一性得分(Uniqueness Score),这是一个衡量鲁棒性与可迁移性之差的综合指标,同时也是误报问题的一个指示器。
摘要:This paper proposes Characteristic Examples for effectively fingerprinting deep neural networks, featuring high-robustness to the base model against model pruning as well as low-transferability to unassociated models. This is the first work taking both robustness and transferability into consideration for generating realistic fingerprints, whereas current methods lack practical assumptions and may incur large false positive rates. To achieve better trade-off between robustness and transferability, we propose three kinds of characteristic examples: vanilla C-examples, RC-examples, and LTRC-example, to derive fingerprints from the original base model. To fairly characterize the trade-off between robustness and transferability, we propose Uniqueness Score, a comprehensive metric that measures the difference between robustness and transferability, which also serves as an indicator to the false alarm problem.
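One plausible reading of the Uniqueness Score is sketched below; the exact definition in the paper may differ, so treat the matching-rate formulation as an assumption.
```python
import numpy as np

def matching_rate(base_outputs, model_outputs):
    # fraction of fingerprint inputs on which a model reproduces the
    # base model's predictions
    return float(np.mean(np.asarray(base_outputs) == np.asarray(model_outputs)))

def uniqueness_score(base, pruned, unrelated):
    robustness = matching_rate(base, pruned)          # want close to 1
    transferability = matching_rate(base, unrelated)  # want close to 0
    return robustness - transferability

# toy predictions on 4 fingerprint inputs
print(uniqueness_score([1, 0, 2, 1], [1, 0, 2, 0], [2, 0, 1, 0]))  # 0.5
```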
【8】 Learned Smartphone ISP on Mobile NPUs with Deep Learning, Mobile AI 2021 Challenge: Report
标题:通过深度学习在移动NPU上学习智能手机ISP,移动AI 2021挑战:报告
作者:Andrey Ignatov,Cheng-Ming Chiang,Hsien-Kai Kuo,Anastasia Sycheva,Radu Timofte,Min-Hung Chen,Man-Yu Lee,Yu-Syuan Xu,Yu Tseng,Shusong Xu,Jin Guo,Chao-Hung Chen,Ming-Chun Hsyu,Wen-Chia Tsai,Chao-Wei Chen,Grigory Malivenko,Minsu Kwon,Myungje Lee,Jaeyoon Yoo,Changbeom Kang,Shinjo Wang,Zheng Shaolong,Hao Dejun,Xie Fen,Feng Zhuang,Yipeng Ma,Jingyang Peng,Tao Wang,Fenglong Song,Chih-Chung Hsu,Kwan-Lin Chen,Mei-Hsuang Wu,Vishal Chudasama,Kalpesh Prajapati,Heena Patel,Anjali Sarvaiya,Kishor Upla,Kiran Raja,Raghavendra Ramachandra,Christoph Busch,Etienne de Stoutz
备注:Mobile AI 2021 Workshop and Challenges: this https URL
链接:https://arxiv.org/abs/2105.07809
摘要:随着移动相机的质量开始在现代智能手机中扮演至关重要的角色,用于改善移动照片各个感知方面的ISP算法正受到越来越多的关注。在这次Mobile AI挑战赛中,目标是开发一个端到端的基于深度学习的图像信号处理(ISP)管道,它可以取代传统的手工设计ISP,并在智能手机NPU上实现接近实时的性能。为此,主办方向参与者提供了一个全新的学习型ISP数据集,该数据集由Sony IMX586 Quad Bayer移动传感器和专业的1.02亿像素中画幅相机拍摄的RAW-RGB图像对组成。所有模型的运行时间均在MediaTek Dimensity 1000+平台上进行评估,该平台配有一个能够加速浮点和量化神经网络的专用AI处理单元。所提出的解决方案与上述NPU完全兼容,能够在60-100毫秒内处理全高清照片,同时获得高保真结果。本文详细描述了在这一挑战赛中开发的所有模型。
摘要:As the quality of mobile cameras starts to play a crucial role in modern smartphones, more and more attention is now being paid to ISP algorithms used to improve various perceptual aspects of mobile photos. In this Mobile AI challenge, the target was to develop an end-to-end deep learning-based image signal processing (ISP) pipeline that can replace classical hand-crafted ISPs and achieve nearly real-time performance on smartphone NPUs. For this, the participants were provided with a novel learned ISP dataset consisting of RAW-RGB image pairs captured with the Sony IMX586 Quad Bayer mobile sensor and a professional 102-megapixel medium format camera. The runtime of all models was evaluated on the MediaTek Dimensity 1000+ platform with a dedicated AI processing unit capable of accelerating both floating-point and quantized neural networks. The proposed solutions are fully compatible with the above NPU and are capable of processing Full HD photos under 60-100 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.
其他(13篇)
【1】 The Boombox: Visual Reconstruction from Acoustic Vibrations
标题:“音箱”:声学振动的视觉重建
作者:Boyuan Chen,Mia Chiquier,Hod Lipson,Carl Vondrick
机构:Columbia University
备注:Website: boombox.cs.columbia.edu
链接:https://arxiv.org/abs/2105.08052
摘要:我们介绍了Boombox,一种利用声学振动来重建其内部物体图像的容器。当物体与容器相互作用时,会产生微小的声学振动,其精确的振动特性取决于箱体和物体的物理属性。我们演示了如何利用这个附带信号来预测视觉结构。经过学习后,即使相机无法看到盒子内部,我们的方法仍然有效。虽然我们使用低成本、低功耗的接触式麦克风来检测振动,但我们的结果表明,从多模态数据中学习能够将廉价的声学传感器转变为丰富的视觉传感器。鉴于容器无处不在,我们相信将感知能力集成到容器中将在人机交互和机器人技术中催生新的应用。我们的项目网站是:boombox.cs.columbia.edu
摘要:We introduce The Boombox, a container that uses acoustic vibrations to reconstruct an image of its inside contents. When an object interacts with the container, they produce small acoustic vibrations. The exact vibration characteristics depend on the physical properties of the box and the object. We demonstrate how to use this incidental signal in order to predict visual structure. After learning, our approach remains effective even when a camera cannot view inside the box. Although we use low-cost and low-power contact microphones to detect the vibrations, our results show that learning from multi-modal data enables us to transform cheap acoustic sensors into rich visual sensors. Due to the ubiquity of containers, we believe integrating perception capabilities into them will enable new applications in human-computer interaction and robotics. Our project website is at: boombox.cs.columbia.edu
【2】 A Light Stage on Every Desk
标题:每张桌子上的灯光舞台(Light Stage)
作者:Soumyadip Sengupta,Brian Curless,Ira Kemelmacher-Shlizerman,Steve Seitz
机构:University of Washington
链接:https://arxiv.org/abs/2105.08051
摘要:每次你坐在电视或显示器前,你的脸都会被时变的光照模式主动照亮。本文提出利用这种时变光照,在任何新的光照条件下对你的脸进行合成重打光。在此过程中,我们从Debevec等人的灯光舞台(light stage)工作中获得灵感,他们首次展示了对在受控光照环境中捕获的人物进行重打光的能力。现有的灯光舞台需要昂贵的、房间规模的球形捕捉架,全世界只有少数几个实验室拥有;相比之下,我们演示了如何从普通电视或桌面显示器获取有用的数据。我们不采用令用户不适的快速闪烁光模式,而是对用户观看YouTube视频或其他标准内容时的图像进行操作。我们在给定用户的图像和显示器模式上训练深度网络,学习预测该用户在任意目标光照(显示器模式)下的图像。实验评估表明,我们的方法能产生逼真的重打光效果。视频结果见http://grail.cs.washington.edu/projects/Light_Stage_on_Every_Desk/.
摘要:Every time you sit in front of a TV or monitor, your face is actively illuminated by time-varying patterns of light. This paper proposes to use this time-varying illumination for synthetic relighting of your face with any new illumination condition. In doing so, we take inspiration from the light stage work of Debevec et al., who first demonstrated the ability to relight people captured in a controlled lighting environment. Whereas existing light stages require expensive, room-scale spherical capture gantries and exist in only a few labs in the world, we demonstrate how to acquire useful data from a normal TV or desktop monitor. Instead of subjecting the user to uncomfortable rapidly flashing light patterns, we operate on images of the user watching a YouTube video or other standard content. We train a deep network on images plus monitor patterns of a given user and learn to predict images of that user under any target illumination (monitor pattern). Experimental evaluation shows that our method produces realistic relighting results. Video results are available at http://grail.cs.washington.edu/projects/Light_Stage_on_Every_Desk/.
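A minimal sketch of the training setup the abstract describes might look like the following; the network shape, pattern resolution and conditioning-by-concatenation are all our assumptions, not the paper's architecture.
```python
import torch
from torch import nn

class RelightNet(nn.Module):
    """Toy model: predict a face under a target monitor pattern."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))
    def forward(self, face, target_pattern):
        # condition on the target illumination by channel concatenation
        pattern = nn.functional.interpolate(target_pattern,
                                            size=face.shape[-2:])
        return self.net(torch.cat([face, pattern], dim=1))

model = RelightNet()
face = torch.rand(1, 3, 128, 128)   # frame captured under a known pattern
target = torch.rand(1, 3, 16, 16)   # desired monitor pattern
pred = model(face, target)          # predicted face under the target pattern
loss = nn.functional.l1_loss(pred, torch.rand_like(pred))  # ground truth stub
loss.backward()
```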
【3】 StrobeNet: Category-Level Multiview Reconstruction of Articulated Objects
标题:StrobeNet:关节对象的类别级多视图重建
作者:Ge Zhang,Or Litany,Srinath Sridhar,Leonidas Guibas
机构:ShanghaiTech University, NVIDIA, Brown University, Stanford University
备注:preprint
链接:https://arxiv.org/abs/2105.08016
摘要:我们提出StrobeNet,一种从一幅或多幅无位姿(unposed)RGB图像中对铰接物体进行类别级三维重建的方法。重建一般的铰接物体类别具有重要应用,但也具有挑战性,因为物体在形状、铰接状态、外观和拓扑结构上可能有很大变化。我们通过建立类别级铰接规范化的思想来解决这个问题——将观测映射到一个规范化的铰接状态,从而实现无需对应关系的多视图聚合。我们的端到端可训练神经网络从物体的一幅或多幅无位姿图像中估计特征丰富的规范三维点云、铰接关节和部件分割。这些中间估计被用于生成最终的隐式三维重建。即使物体在大基线的图像中以不同的铰接状态被观察到,我们的方法也能重建物体,并支持重建形状的动画。对不同物体类别的定量和定性评估表明,我们的方法能够获得较高的重建精度,尤其是在增加更多视图时。
摘要:We present StrobeNet, a method for category-level 3D reconstruction of articulating objects from one or more unposed RGB images. Reconstructing general articulating object categories has important applications, but is challenging since objects can have wide variation in shape, articulation, appearance and topology. We address this by building on the idea of category-level articulation canonicalization -- mapping observations to a canonical articulation which enables correspondence-free multiview aggregation. Our end-to-end trainable neural network estimates feature-enriched canonical 3D point clouds, articulation joints, and part segmentation from one or more unposed images of an object. These intermediate estimates are used to generate a final implicit 3D reconstruction. Our approach reconstructs objects even when they are observed in different articulations in images with large baselines, and supports animation of the reconstructed shapes. Quantitative and qualitative evaluations on different object categories show that our method is able to achieve high reconstruction accuracy, especially as more views are added.
【4】 Global Wheat Head Dataset 2021: an update to improve the benchmarking wheat head localization with more diversity
标题:2021年全球小麦穗数据集:以更高的多样性改进小麦穗定位基准
作者:Etienne DAVID,Mario Serouart,Daniel Smith,Simon Madec,Kaaviya Velumani,Shouyang Liu,Xu Wang,Francisco Pinto Espinosa,Shahameh Shafiee,Izzat S. A. Tahir,Hisashi Tsujimoto,Shuhei Nasuda,Bangyou Zheng,Norbert Kichgessner,Helge Aasen,Andreas Hund,Pouria Sadhegi-Tehran,Koichi Nagasawa,Goro Ishikawa,Sébastien Dandrifosse,Alexis Carlier,Benoit Mercatoris,Ken Kuroki,Haozhou Wang,Masanori Ishii,Minhajul A. Badhon,Curtis Pozniak,David Shaner LeBauer,Morten Lilimo,Jesse Poland,Scott Chapman,Benoit de Solan,Frédéric Baret,Ian Stavness,Wei Guo
备注:8 pages, 2 figures, 1 table
链接:https://arxiv.org/abs/2105.07660
摘要:全球小麦穗检测(GWHD)数据集创建于2020年,汇集了从多种采集平台和7个国家/机构获取的4700幅RGB图像中标注的193634个小麦穗。借助在Kaggle上举办的相关竞赛,GWHD成功吸引了计算机视觉和农业科学界的关注。从2020年的首次经验来看,已经确定了一些改进的方向,特别是在数据规模、麦穗多样性和标签可靠性方面。为了解决这些问题,2020年的数据集被重新检查、重新标注,并通过添加来自另外5个国家的1722幅图像进行了扩充,新增了81553个小麦穗。因此,我们在2021年发布新版本的全球小麦穗检测(GWHD)数据集,它比2020年版本更大、更多样、噪声更小。GWHD 2021现已在http://www.global-wheat.com/ 公开提供,并在AIcrowd上组织了一个新的数据挑战赛,以利用这个更新的数据集。
摘要:The Global Wheat Head Detection (GWHD) dataset was created in 2020 and has assembled 193,634 labelled wheat heads from 4,700 RGB images acquired from various acquisition platforms and 7 countries/institutions. With an associated competition hosted in Kaggle, GWHD has successfully attracted attention from both the computer vision and agricultural science communities. From this first experience in 2020, a few avenues for improvements have been identified, especially from the perspective of data size, head diversity and label reliability. To address these issues, the 2020 dataset has been reexamined, relabeled, and augmented by adding 1,722 images from 5 additional countries, allowing for 81,553 additional wheat heads to be added. We would hence like to release a new version of the Global Wheat Head Detection (GWHD) dataset in 2021, which is bigger, more diverse, and less noisy than the 2020 version. The GWHD 2021 is now publicly available at http://www.global-wheat.com/ and a new data challenge has been organized on AIcrowd to make use of this updated dataset.
【5】 Rethinking "Batch" in BatchNorm
标题:对BatchNorm中"批"概念的再思考
作者:Yuxin Wu,Justin Johnson
机构:Facebook AI Research
备注:Tech report
链接:https://arxiv.org/abs/2105.07576
摘要:BatchNorm是现代卷积神经网络中的一个关键组成部分。它对"批"而非单个样本进行操作的独特性质,带来了与深度学习中大多数其他操作显著不同的行为。因此,它导致了许多隐藏的陷阱,这些陷阱会以微妙的方式对模型性能产生负面影响。本文全面回顾了视觉识别任务中的这些问题,并指出解决这些问题的关键是重新思考BatchNorm中"批"这一概念的不同选择。通过呈现这些陷阱及其缓解措施,我们希望这篇综述能够帮助研究人员更有效地使用BatchNorm。
摘要:BatchNorm is a critical building block in modern convolutional neural networks. Its unique property of operating on "batches" instead of individual samples introduces significantly different behaviors from most other operations in deep learning. As a result, it leads to many hidden caveats that can negatively impact model's performance in subtle ways. This paper thoroughly reviews such problems in visual recognition tasks, and shows that a key to address them is to rethink different choices in the concept of "batch" in BatchNorm. By presenting these caveats and their mitigations, we hope this review can help researchers use BatchNorm more effectively.
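One of the batch-dependent behaviors the report examines can be demonstrated in a few lines: the same input is normalized with per-batch statistics in train mode but with running population statistics in eval mode, so outputs depend on what else is in the batch.
```python
import torch
from torch import nn

bn = nn.BatchNorm1d(1)
x = torch.tensor([[1.0], [5.0]])

bn.train()
y_train = bn(x)  # normalized with this batch's mean/variance
bn.eval()
y_eval = bn(x)   # normalized with accumulated running statistics

print(y_train.flatten(), y_eval.flatten())  # generally different values
```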
【6】 AgeFlow: Conditional Age Progression and Regression with Normalizing Flows
标题:AgeFlow:带归一化流的条件年龄递进和回归
作者:Zhizhong Huang,Shouzhen Chen,Junping Zhang,Hongming Shan
机构:Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Institute of Science and Technology for Brain-inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
备注:IJCAI 2021
链接:https://arxiv.org/abs/2105.07239
摘要:年龄递进和回归的目的分别是合成给定人脸图像在老化和年轻化效果下的照片级真实外观。现有的基于生成对抗网络(GANs)的方法存在以下三个主要问题:1)不稳定的训练在生成的人脸中引入强烈的鬼影伪影;2)不成对的训练导致性别、种族等人脸属性发生意外变化;3)非双射的年龄映射增加了人脸变换的不确定性。为了克服这些问题,本文提出了一个名为AgeFlow的新框架,以综合基于流的模型和GANs两者的优点。AgeFlow包含三个部分:通过可逆神经网络将给定人脸映射到潜在空间的编码器,将源潜在向量转换为目标潜在向量的新型可逆条件翻译模块(ICTM),以及使用同一编码器网络从目标潜在向量重构生成人脸的解码器;所有部分都是可逆的,从而实现双射的年龄映射。ICTM的新颖之处有两方面。首先,我们提出一种属性感知的知识蒸馏,在保持其他无关属性不变的同时学习年龄递进的操纵方向,从而缓解人脸属性的意外变化。其次,我们提出在潜在空间中使用GANs,以确保学习到的潜在向量与真实向量不可区分,这比传统在图像域中使用GANs容易得多。实验结果表明,在两个基准数据集上,该方法的性能优于现有的基于GANs的方法。源代码位于https://github.com/Hzzone/AgeFlow.
摘要:Age progression and regression aim to synthesize photorealistic appearance of a given face image with aging and rejuvenation effects, respectively. Existing generative adversarial networks (GANs) based methods suffer from the following three major issues: 1) unstable training introducing strong ghost artifacts in the generated faces, 2) unpaired training leading to unexpected changes in facial attributes such as genders and races, and 3) non-bijective age mappings increasing the uncertainty in the face transformation. To overcome these issues, this paper proposes a novel framework, termed AgeFlow, to integrate the advantages of both flow-based models and GANs. The proposed AgeFlow contains three parts: an encoder that maps a given face to a latent space through an invertible neural network, a novel invertible conditional translation module (ICTM) that translates the source latent vector to target one, and a decoder that reconstructs the generated face from the target latent vector using the same encoder network; all parts are invertible achieving bijective age mappings. The novelties of ICTM are two-fold. First, we propose an attribute-aware knowledge distillation to learn the manipulation direction of age progression while keeping other unrelated attributes unchanged, alleviating unexpected changes in facial attributes. Second, we propose to use GANs in the latent space to ensure the learned latent vector indistinguishable from the real ones, which is much easier than traditional use of GANs in the image domain. Experimental results demonstrate superior performance over existing GANs-based methods on two benchmarked datasets. The source code is available at https://github.com/Hzzone/AgeFlow.
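The invertibility that enables bijective age mappings comes from flow-style blocks. Below is a generic affine coupling layer, a standard building block of flow models, shown only as a sketch of the mechanism; it is not AgeFlow's ICTM.
```python
import torch
from torch import nn

class AffineCoupling(nn.Module):
    """Minimal invertible coupling block of the kind flow models stack."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))
    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)       # scale and shift from x1
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=1)
    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=1)       # same s, t are recomputable
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=1)

layer = AffineCoupling(8)
z = torch.randn(4, 8)
assert torch.allclose(layer.inverse(layer(z)), z, atol=1e-5)  # bijective
```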
【7】 FloorPlanCAD: A Large-Scale CAD Drawing Dataset for Panoptic Symbol Spotting
标题:FloorPlanCAD:用于全景符号定位的大规模CAD图形数据集
作者:Zhiwen Fan,Lingjie Zhu,Honghua Li,Xiaohao Chen,Siyu Zhu,Ping Tan
机构:Alibaba Group, Simon Fraser University
备注:17 pages, 16 figures
链接:https://arxiv.org/abs/2105.07147
摘要:获取大量多样的计算机辅助设计(CAD)图纸对于开发符号识别(symbol spotting)算法至关重要。在本文中,我们提出了FloorPlanCAD,一个大规模真实世界CAD图纸数据集,包含超过10000个平面图,涵盖从住宅到商业建筑。数据集中的CAD图纸均以矢量图形表示,这使我们能够提供30个物体类别的线粒度标注。借助这些标注,我们引入了全景符号识别任务,它不仅需要识别可数物体(thing)的实例,还需要识别不可数背景(stuff)的语义。为了解决这一任务,我们提出了一种将图卷积网络(GCNs)与卷积神经网络(CNNs)相结合的新方法,它能同时捕获非欧几里德和欧几里德特征,并可端到端训练。所提出的CNN-GCN方法在语义符号识别任务上取得了最先进(SOTA)的性能,并帮助我们为全景符号识别任务建立了一个基线网络。我们的贡献有三个方面:1)据我们所知,所提出的CAD图纸数据集是同类中的首个;2)全景符号识别任务将物体实例的识别和背景语义的识别视为同一个识别问题;3)我们提出了一种基于新型CNN-GCN方法的全景符号识别基线方案,该方法在语义符号识别上取得了SOTA性能。我们相信这些贡献将推动相关领域的研究。
摘要:Access to large and diverse computer-aided design (CAD) drawings is critical for developing symbol spotting algorithms. In this paper, we present FloorPlanCAD, a large-scale real-world CAD drawing dataset containing over 10,000 floor plans, ranging from residential to commercial buildings. CAD drawings in the dataset are all represented as vector graphics, which enable us to provide line-grained annotations of 30 object categories. Equipped by such annotations, we introduce the task of panoptic symbol spotting, which requires to spot not only instances of countable things, but also the semantic of uncountable stuff. Aiming to solve this task, we propose a novel method by combining Graph Convolutional Networks (GCNs) with Convolutional Neural Networks (CNNs), which captures both non-Euclidean and Euclidean features and can be trained end-to-end. The proposed CNN-GCN method achieved state-of-the-art (SOTA) performance on the task of semantic symbol spotting, and help us build a baseline network for the panoptic symbol spotting task. Our contributions are three-fold: 1) to the best of our knowledge, the presented CAD drawing dataset is the first of its kind; 2) the panoptic symbol spotting task considers the spotting of both thing instances and stuff semantic as one recognition problem; and 3) we presented a baseline solution to the panoptic symbol spotting task based on a novel CNN-GCN method, which achieved SOTA performance on semantic symbol spotting. We believe that these contributions will boost research in related areas.
【8】 Move2Hear: Active Audio-Visual Source Separation
标题:Move2Hear:有源音视频源分离
作者:Sagnik Majumder,Ziad Al-Halah,Kristen Grauman
机构:The University of Texas at Austin, Facebook AI Research
链接:https://arxiv.org/abs/2105.07142
摘要:我们引入了主动视听源分离问题:智能体必须智能地移动,以便更好地分离来自环境中感兴趣物体的声音。智能体会同时听到多个音频源(例如,在嘈杂的家中,有人在走廊里讲话),并且必须在有限的时间预算内,利用它的眼睛和耳朵自动分离出目标物体发出的声音。为了实现这一目标,我们引入了一种强化学习方法,以预测的音频分离质量的改善为引导,训练控制智能体相机和麦克风位置的移动策略。我们在增强现实(系统已经与目标物体位于同一位置)和移动机器人(智能体从距目标物体任意远的地方开始)两种场景中演示了我们的方法。利用最先进的三维逼真视听模拟,我们证明了我们的模型能够为音频源分离找到收益最大的最小移动序列。项目:http://vision.cs.utexas.edu/projects/move2hear.
摘要:We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and must use its eyes and ears to automatically separate out the sounds originating from the target object within a limited time budget. Towards this goal, we introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time, guided by the improvement in predicted audio separation quality. We demonstrate our approach in scenarios motivated by both augmented reality (system is already co-located with the target object) and mobile robotics (agent begins arbitrarily far from the target object). Using state-of-the-art realistic audio-visual simulations in 3D environments, we demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation. Project: http://vision.cs.utexas.edu/projects/move2hear.
【9】 Regularized Deep Linear Discriminant Analysis
标题:正则化深线性判别分析
作者:Hongwei Chen,Wen Lu
机构:Hubei University of Technology, China
链接:https://arxiv.org/abs/2105.07129
摘要:作为经典线性判别分析(LDA)的非线性扩展,深度线性判别分析(DLDA)用基于特征值的损失函数代替原有的分类交叉熵(CCE)损失函数,使深度神经网络(DNN)能够学习线性可分隐式表示。本文首先指出DLDA侧重于训练潜在子空间中所有维度的协同辨别能力,而较少注重训练单个维度的可分离能力。为了改进DLDA算法,提出了一种基于类内散布矩阵的正则化方法,增强了各维的识别能力,并保持了它们之间的互补性。在STL-10、CIFAR-10和小儿肺炎胸片数据集上的实验结果表明,本文提出的正则化深度线性判别分析(RDLDA)方法优于DLDA和以CCE为目标的常规神经网络方法。为了进一步提高RDLDA在局部空间的识别能力,提出了一种子类RDLDA算法。
摘要:As a non-linear extension of the classic Linear Discriminant Analysis(LDA), Deep Linear Discriminant Analysis(DLDA) replaces the original Categorical Cross Entropy(CCE) loss function with eigenvalue-based loss function to make a deep neural network(DNN) able to learn linearly separable hidden representations. In this paper, we first point out DLDA focuses on training the cooperative discriminative ability of all the dimensions in the latent subspace, while put less emphasis on training the separable capacity of single dimension. To improve DLDA, a regularization method on within-class scatter matrix is proposed to strengthen the discriminative ability of each dimension, and also keep them complement each other. Experiment results on STL-10, CIFAR-10 and Pediatric Pneumonic Chest X-ray Dataset showed that our proposed regularization method Regularized Deep Linear Discriminant Analysis(RDLDA) outperformed DLDA and conventional neural network with CCE as objective. To further improve the discriminative ability of RDLDA in the local space, an algorithm named Subclass RDLDA is also proposed.
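The within-class scatter matrix that the proposed regularizer acts on is standard LDA machinery and can be computed as follows; the regularization term itself is omitted here, since its exact form is specific to the paper.
```python
import numpy as np

def within_class_scatter(X, y):
    """S_w = sum over classes c of sum over x in c of (x - mu_c)(x - mu_c)^T."""
    d = X.shape[1]
    S_w = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)   # center each class at its own mean
        S_w += diff.T @ diff
    return S_w

# toy check on random 2-class data
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = np.repeat([0, 1], 10)
print(within_class_scatter(X, y).shape)  # (4, 4)
```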
【10】 Can self-training identify suspicious ugly duckling lesions?
标题:自我训练能否识别可疑的丑小鸭病变?
作者:Mohammadreza Mohseni,Jordan Yap,William Yolland,Arash Koochek,M Stella Atkins
机构:School of Computing Science, Simon Fraser University, MetaOptima Technology Inc, Department of Skin Science and Dermatology, University of British Columbia, Banner Health
备注:Accepted at Sixth ISIC Skin Image Analysis Workshop @ CVPR 2021
链接:https://arxiv.org/abs/2105.07116
摘要:一种常用的黑色素瘤临床检测方法是识别"丑小鸭"痣的存在,即与同一患者其他病变外观不同的皮肤病变。与手工筛查方法相比,自动检测和分析这些病变将有助于研究的标准化。然而,很难获得由专家标注的丑小鸭病变图像。因此,我们提出使用自监督机器学习来自动检测异常病变。我们首先从广域皮肤图像中自动检测并提取所有病变,然后基于自动识别的特征,为患者图像中每个检测到的病变计算一个嵌入。这些嵌入随后被用于计算L2距离,以此衡量不相似度。利用这种深度学习方法,丑小鸭病变被识别为离群点,提示检查医生应给予更多关注。我们通过与皮肤科医生的比较进行评估,在保留测试集上获得了72.1%的灵敏度和94.2%的诊断准确率。
摘要:One commonly used clinical approach towards detecting melanomas recognises the existence of Ugly Duckling nevi, or skin lesions which look different from the other lesions on the same patient. An automatic method of detecting and analysing these lesions would help to standardize studies, compared with manual screening methods. However, it is difficult to obtain expertly-labelled images for ugly duckling lesions. We therefore propose to use self-supervised machine learning to automatically detect outlier lesions. We first automatically detect and extract all the lesions from a wide-field skin image, and calculate an embedding for each detected lesion in a patient image, based on automatically identified features. These embeddings are then used to calculate the L2 distances as a way to measure dissimilarity. Using this deep learning method, Ugly Ducklings are identified as outliers which should deserve more attention from the examining physician. We evaluate through comparison with dermatologists, and achieve a sensitivity rate of 72.1% and diagnostic accuracy of 94.2% on the held-out test set.
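The outlier-scoring step is straightforward to sketch: given per-lesion embeddings for one patient, score each lesion by its mean L2 distance to the others. The snippet below uses random embeddings as a stand-in for the learned ones.
```python
import numpy as np

def ugly_duckling_scores(embeddings):
    """Mean L2 distance from each lesion embedding to the patient's others."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)   # pairwise L2 distances
    np.fill_diagonal(dists, np.nan)         # ignore self-distance
    return np.nanmean(dists, axis=1)        # high score => likely outlier

rng = np.random.default_rng(1)
emb = rng.standard_normal((12, 128))
emb[3] += 4.0                               # make one lesion dissimilar
print(np.argmax(ugly_duckling_scores(emb)))  # -> 3
```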
【11】 A Large Visual, Qualitative and Quantitative Dataset of Web Pages
标题:一个可视化的、定性的、定量的Web页面大数据集
作者:Christian Mejia-Escobar,Miguel Cazorla,Ester Martinez-Martin
机构:Central University of Ecuador, Quito, Ecuador; Institute for Computer Research, University of Alicante, Spain
链接:https://arxiv.org/abs/2105.07113
摘要:万维网不仅是当今最重要的通信和信息平台之一,也是科学研究日益关注的领域。这推动了许多需要大量数据的工作和项目。然而,目前还没有一个同时整合网页参数和视觉外观的数据集,因为其收集在时间和精力上都是一项昂贵的任务。在各种计算机工具和编程脚本的支持下,我们创建了一个包含49438个网页的大型数据集。它由视觉、文本和数值数据类型组成,覆盖全球所有国家,并考虑了艺术、娱乐、经济、商业、教育、政府、新闻、媒体、科学和环境等广泛主题,涵盖不同的文化特征和多样的设计偏好。在本文中,我们描述了收集、调试和发布最终产品的过程,该产品可免费获取。为了证明数据集的有用性,我们展示了一个用于检测错误网页的二元分类模型和一个基于主题的多类网页分类模型,这两个问题都使用卷积神经网络来解决。
摘要:The World Wide Web is not only one of the most important platforms of communication and information at present, but also an area of growing interest for scientific research. This motivates a lot of work and projects that require large amounts of data. However, there is no dataset that integrates the parameters and visual appearance of Web pages, because its collection is a costly task in terms of time and effort. With the support of various computer tools and programming scripts, we have created a large dataset of 49,438 Web pages. It consists of visual, textual and numerical data types, includes all countries worldwide, and considers a broad range of topics such as art, entertainment, economy, business, education, government, news, media, science, and environment, covering different cultural characteristics and varied design preferences. In this paper, we describe the process of collecting, debugging and publishing the final product, which is freely available. To demonstrate the usefulness of our dataset, we expose a binary classification model for detecting error Web pages, and a multi-class Web subject-based categorization, both problems using convolutional neural networks.
【12】 NeLF: Practical Novel View Synthesis with Neural Light Field
标题:NELF:实用的神经光场新视图合成
作者:Celong Liu,Zhong Li,Junsong Yuan,Yi Xu
机构:OPPO US Research Center, InnoPeak Technology Inc., Palo Alto, California, USA; University at Buffalo, State University of New York, Buffalo, New York, USA
备注:13 pages, 12 figures
链接:https://arxiv.org/abs/2105.07112
摘要:在本文中,我们提出了一个实用且稳健的深度学习解决方案,用于复杂场景的新视图合成。在我们的方法中,连续场景被表示为一个光场,即一组光线,每条光线都有对应的颜色。我们采用光场的4D参数化,将光场表述为一个把4D坐标映射到相应颜色值的4D函数,并训练一个深度全连接网络来优化该函数。然后,使用这个场景特定的模型来合成新视图。以往的光场方法通常需要密集的视图采样才能可靠地渲染高质量的新视图。我们的方法通过采样光线并直接从网络查询每条光线的颜色来渲染新视图,因此能够使用非常稀疏的输入图像集实现快速光场渲染。我们的方法在保持交互式帧率的同时取得了最先进的新视图合成结果。
摘要:In this paper, we present a practical and robust deep learning solution for the novel view synthesis of complex scenes. In our approach, a continuous scene is represented as a light field, i.e., a set of rays, each of which has a corresponding color. We adopt a 4D parameterization of the light field. We then formulate the light field as a 4D function that maps 4D coordinates to corresponding color values. We train a deep fully connected network to optimize this function. Then, the scene-specific model is used to synthesize novel views. Previous light field approaches usually require dense view sampling to reliably render high-quality novel views. Our method can render novel views by sampling rays and querying the color for each ray from the network directly; thus enabling fast light field rendering with a very sparse set of input images. Our method achieves state-of-the-art novel view synthesis results while maintaining an interactive frame rate.
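The central object is a learned function from 4D ray coordinates to color. A minimal stand-in is sketched below; the paper's network is deeper and has its own parameterization details, so treat the layer sizes and two-plane coordinates as assumptions.
```python
import torch
from torch import nn

# fully connected network F(u, v, s, t) -> RGB regressing a ray's color
mlp = nn.Sequential(
    nn.Linear(4, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3), nn.Sigmoid())

rays = torch.rand(1024, 4)    # (u, v, s, t) two-plane ray coordinates
colors = torch.rand(1024, 3)  # supervision derived from the input images
loss = nn.functional.mse_loss(mlp(rays), colors)
loss.backward()
# at render time, novel views come from sampling rays and querying mlp(rays)
```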
【13】 Advances in Artificial Intelligence to Reduce Polyp Miss Rates during Colonoscopy
标题:人工智能降低结肠镜检查息肉漏检率的研究进展
作者:Michael Yeung,Evis Sala,Carola-Bibiane Schönlieb,Leonardo Rundo
机构:Department of Radiology, University of Cambridge, Cambridge, United Kingdom; School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom
链接:https://arxiv.org/abs/2105.07467
摘要:背景与情境:人工智能有可能通过降低结直肠癌结肠镜筛查中息肉的漏检率来帮助胃肠科医生。新发现:我们引入了一种新的深度神经网络架构Focus U-Net,它在五个包含结肠镜检查所获息肉图像的公共数据集上实现了最先进的息肉分割性能。局限性:该模型已在结肠镜检查期间拍摄的图像上得到验证,但还需要在实时视频数据上验证以确保泛化能力。影响:一旦在实时视频数据上得到验证,我们的息肉分割算法就可以集成到结肠镜检查实践中,通过减少漏检息肉的数量来帮助胃肠科医生。
摘要:BACKGROUND AND CONTEXT: Artificial intelligence has the potential to aid gastroenterologists by reducing polyp miss detection rates during colonoscopy screening for colorectal cancer. NEW FINDINGS: We introduce a new deep neural network architecture, the Focus U-Net, which achieves state-of-the-art performance for polyp segmentation across five public datasets containing images of polyps obtained during colonoscopy. LIMITATIONS: The model has been validated on images taken during colonoscopy but requires validation on live video data to ensure generalisability. IMPACT: Once validated on live video data, our polyp segmentation algorithm could be integrated into colonoscopy practice and assist gastroenterologists by reducing the number of polyps missed
相关工具