new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Oct 29

Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

In this work, we introduce Unique3D, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability. Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large 2D diffusion models, but they usually suffer from long per-case optimization time with inconsistent issues. Recent works address the problem and generate better 3D results either by finetuning a multi-view diffusion model or training a fast feed-forward model. However, they still lack intricate textures and complex geometries due to inconsistency and limited generated resolution. To simultaneously achieve high fidelity, consistency, and efficiency in single image-to-3D, we propose a novel framework Unique3D that includes a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images with their normal maps, a multi-level upscale process to progressively improve the resolution of generated orthographic multi-views, as well as an instant and consistent mesh reconstruction algorithm called ISOMER, which fully integrates the color and geometric priors into mesh results. Extensive experiments demonstrate that our Unique3D significantly outperforms other image-to-3D baselines in terms of geometric and textural details.

  • 8 authors
·
May 30, 2024

OrthoDoc: Multimodal Large Language Model for Assisting Diagnosis in Computed Tomography

Multimodal large language models (MLLMs) have achieved significant success in the general field of image processing. Their emerging task generalization and freeform conversational capabilities can greatly facilitate medical diagnostic assistance, helping patients better understand their conditions and enhancing doctor-patient trust. Computed Tomography (CT) is a non-invasive imaging technique used to capture the internal mechanisms of a patient's condition and is widely utilized. However, in past research, the complex textural features of this imaging data have made accurate interpretation by algorithms challenging, impeding the performance of general LLMs in diagnostic assistance. To address this, we developed OrthoDoc, a MLLM designed for CT diagnostics. OrthoDoc is trained on 120,000 CT images and diagnostic reports and includes a Retrieval-Augmented Generation (RAG) module capable of effectively mitigating model hallucinations. This module is informed by extensive medical literature, textbooks, and explanatory data. Thus, OrthoDoc not only processes complex CT images but also stores, understands, and reasons over medical knowledge and language. In extensive experiments, OrthoDoc outperforms commercial models led by GPT-4, demonstrating superior diagnostic capabilities and accuracy. Specifically, OrthoDoc significantly surpasses existing models in the diagnosis of common orthopedic conditions such as fractures, arthritis, and tumors. Additionally, OrthoDoc exhibits robust generalization and stability when handling rare and complex cases.

  • 2 authors
·
Aug 30, 2024

Intensive Vision-guided Network for Radiology Report Generation

Automatic radiology report generation is booming due to its huge application potential for the healthcare industry. However, existing computer vision and natural language processing approaches to tackle this problem are limited in two aspects. First, when extracting image features, most of them neglect multi-view reasoning in vision and model single-view structure of medical images, such as space-view or channel-view. However, clinicians rely on multi-view imaging information for comprehensive judgment in daily clinical diagnosis. Second, when generating reports, they overlook context reasoning with multi-modal information and focus on pure textual optimization utilizing retrieval-based methods. We aim to address these two issues by proposing a model that better simulates clinicians' perspectives and generates more accurate reports. Given the above limitation in feature extraction, we propose a Globally-intensive Attention (GIA) module in the medical image encoder to simulate and integrate multi-view vision perception. GIA aims to learn three types of vision perception: depth view, space view, and pixel view. On the other hand, to address the above problem in report generation, we explore how to involve multi-modal signals to generate precisely matched reports, i.e., how to integrate previously predicted words with region-aware visual content in next word prediction. Specifically, we design a Visual Knowledge-guided Decoder (VKGD), which can adaptively consider how much the model needs to rely on visual information and previously predicted text to assist next word prediction. Hence, our final Intensive Vision-guided Network (IVGN) framework includes a GIA-guided Visual Encoder and the VKGD. Experiments on two commonly-used datasets IU X-Ray and MIMIC-CXR demonstrate the superior ability of our method compared with other state-of-the-art approaches.

  • 8 authors
·
Feb 6, 2024

One Flight Over the Gap: A Survey from Perspective to Panoramic Vision

Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360 field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs exhibit remarkable differences from perspective images in geometric projection, spatial distribution, and boundary continuity, making it challenging for direct domain adaption from perspective methods. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers in two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panoramic specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. We hope that our work can provide new insight and forward looking perspectives to advance the development of panoramic vision technologies. Our project page is https://insta360-research-team.github.io/Survey-of-Panorama

  • 11 authors
·
Sep 4