Meet Rodin: a new artificial intelligence (AI) framework for generating 3D digital avatars from various input sources

Generative models are becoming the de facto solution for many challenging tasks in computer science. They represent one of the most promising ways to analyze and synthesize visual data. Stable Diffusion is the best-known generative model for producing beautiful, lifelike images from a complex input prompt. The architecture is based on diffusion models (DMs), which have shown phenomenal generative power for images and video. Rapid advances in diffusion and generative modeling are fueling a revolution in 2D content creation. The mantra is simple enough: “If you can describe it, you can visualize it,” or rather, “If you can describe it, the model can paint it for you.” It’s truly amazing what generative models are capable of.

While 2D content is already a demanding workload for DMs, 3D content poses additional challenges that go beyond the extra dimension. Generating 3D content, such as avatars, at the same quality as 2D content is difficult because the memory and processing costs of producing the rich detail required for high-quality avatars can be prohibitive.

With technology driving the use of digital avatars in movies, games, the metaverse, and the 3D industry, allowing anyone to create a digital avatar can be beneficial. This is the motivation that drives the development of this work.

The authors propose the Roll-out diffusion network (Rodin) to address the issue of creating a digital avatar. An overview of the model is shown in the figure below.

The input to the model can be an image, random noise, or a text description of the desired avatar. A latent vector z is then derived from the given input and used in the diffusion process. The diffusion process consists of several denoising steps: random noise is first added to the initial state or image, and the noise is then removed step by step to obtain a much sharper result.
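The add-noise-then-denoise idea can be illustrated with a few lines of NumPy. This is only a minimal sketch of a generic DDPM-style forward process, not Rodin's actual code: the linear noise schedule, the step count, the tensor shape, and the function names `make_alpha_bar`, `add_noise`, and `predict_x0` are all assumptions made for illustration. In a real model, a trained network would estimate the noise; here the true noise is passed back in to show that the clean signal is recoverable once the noise is known.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward step: blend the clean sample with Gaussian noise at timestep t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def predict_x0(x_t, t, alpha_bar, eps_hat):
    """Invert the forward step given a noise estimate eps_hat."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((3, 8, 8))        # stand-in for a clean sample (hypothetical shape)
x_t, eps = add_noise(x0, 500, alpha_bar, rng)
x0_rec = predict_x0(x_t, 500, alpha_bar, eps)  # uses the true noise in place of a trained network
```

The later the timestep, the smaller `alpha_bar[t]` becomes and the less of the original signal survives, which is why generation proceeds over many small denoising steps rather than one.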

The difference here lies in the 3D nature of the desired content. The diffusion process works as usual, but instead of targeting a 2D image, the diffusion model generates the coarse geometry of the avatar, followed by a diffusion upsampler for detail synthesis.

Computational and memory efficiency is one of the goals of this work. To achieve this, the authors exploit the tri-plane representation of a neural radiance field, which, compared to voxel grids, offers a significantly smaller memory footprint without sacrificing expressiveness.
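To get a feel for the memory argument, the sketch below compares the number of stored values in a dense voxel grid with that of three axis-aligned feature planes, and shows one simple way to look up a feature for a 3D point. The resolution, channel count, and function names are hypothetical, and the lookup uses nearest-neighbor sampling with summation across planes for brevity; it is an illustration of the general tri-plane idea, not the authors' implementation.

```python
import numpy as np

def voxel_floats(res, channels):
    """Number of values stored by a dense res^3 voxel grid."""
    return res ** 3 * channels

def triplane_floats(res, channels):
    """Number of values stored by three res^2 feature planes."""
    return 3 * res ** 2 * channels

def query_triplane(planes, point):
    """Project a point in [0, 1)^3 onto the xy, xz, and yz planes and sum the features.

    Nearest-neighbor lookup is a simplification chosen for this sketch.
    """
    res = planes["xy"].shape[0]
    i, j, k = (min(int(c * res), res - 1) for c in point)
    return planes["xy"][i, j] + planes["xz"][i, k] + planes["yz"][j, k]

res, channels = 64, 8  # illustrative values, not taken from the paper
rng = np.random.default_rng(0)
planes = {
    name: rng.standard_normal((res, res, channels)).astype(np.float32)
    for name in ("xy", "xz", "yz")
}
feature = query_triplane(planes, (0.25, 0.5, 0.75))
ratio = voxel_floats(res, channels) / triplane_floats(res, channels)
```

Under these assumptions the voxel grid stores `res / 3` times as many values as the planes, so the saving grows linearly with resolution.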

Another diffusion model is then trained to upsample the produced tri-plane representation to the desired resolution. Finally, a lightweight MLP decoder consisting of 4 fully connected layers is used to generate an RGB volumetric image.
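As a rough picture of what a decoder with 4 fully connected layers might look like, here is a small NumPy sketch. Only the layer count comes from the description above. The input and hidden widths, the ReLU activations, the color-plus-density output split, and the names `init_mlp` and `decode` are assumptions borrowed from common radiance-field decoders, and the weights are random rather than trained.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    return [
        (rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def decode(features, params):
    """Map per-point feature vectors to RGB in [0, 1] and a non-negative density."""
    h = features
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)            # ReLU on hidden layers
    w, b = params[-1]
    out = h @ w + b
    rgb = 1.0 / (1.0 + np.exp(-out[..., :3]))     # sigmoid keeps colors in [0, 1]
    density = np.log1p(np.exp(out[..., 3:]))      # softplus keeps density non-negative
    return rgb, density

rng = np.random.default_rng(0)
params = init_mlp([32, 64, 64, 64, 4], rng)       # five sizes give 4 fully connected layers
features = rng.standard_normal((1024, 32))        # stand-in for sampled per-point features
rgb, density = decode(features, params)
```

A decoder this small is cheap to evaluate at every sampled point, which fits the stated goal of keeping computation and memory low.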

Some results are shown below.

Compared to the state-of-the-art approaches shown, Rodin produces the sharpest digital avatars. No artifacts are visible in the shared Rodin samples, unlike those of the other techniques.

This was the summary of Rodin, a new framework for easily generating 3D digital avatars from various input sources. If you are interested, you can find more information in the links below.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest news on AI research, cool AI projects and more.

Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning and QoS/QoE assessment.
