[3D Machine Learning] - 3D data representations

Hello! This is the first article of a series on 3D machine learning, a topic I am very interested in. 3D machine learning is a fascinating field of research at the intersection of mathematics, machine learning, computer vision and computer graphics that deals with processing and understanding 3D geometry automatically with a computer. Giving computers such capabilities is useful for many applications, including 3D scene modeling and understanding for virtual and augmented reality, video games and robotics, among others. It can also serve the preservation of our cultural heritage, as it makes it possible to 3D scan (both the geometry and the reflectance parameters of) cultural resources that may become fragile with time. The 3D scans of artifacts can then be shown to the public without risking damage to the originals. Biology is another field that can benefit from 3D deep learning, as recent examples demonstrate. In particular, it has been shown that drug discovery can be accelerated with generative neural networks that propose new molecules likely to have certain properties based on the 3D structure of their atoms [6-7]. This first article deals with the basics of 3D data representations, a very important topic on which all 3D machine learning algorithms depend.

Nowadays, 3D models are very common and are employed in a wide range of applications related to computer vision and graphics. However, machine learning models that work directly on 3D data to perform classification and regression tasks are still lagging behind models that work on images and videos. There are several reasons for this. One is that building large-scale datasets of 3D models is very time consuming and requires either expert 3D modelers or specialized 3D scanners that are not widely available. Another comes from the data structures that are employed to represent 3D data.

Indeed, images and videos have a natural representation with pixels that are organized along regular grids, which makes it very easy to work on such data. For example, as shown in Figure 1, an image has 3 dimensions (height, width and color channels) and a video has 4 dimensions (temporal dimension, height, width and color channels) along which the pixels are organized. In this grid structure, the sampling along each dimension is done at equal distances, a property that is exploited by the convolution operation in convolutional neural networks in order to extract meaningful representations.
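To make this concrete, here is a minimal sketch in PyTorch (the shapes and layer parameters are purely illustrative, not tied to any specific model) showing how images and videos are stored as grid-organized tensors, and how a 2D convolution consumes that grid directly:

import torch
import torch.nn as nn

image = torch.rand(3, 224, 224)      # (color channels, height, width)
video = torch.rand(16, 3, 224, 224)  # (frames, color channels, height, width)

# A 2D convolution slides a small kernel along the regularly sampled
# height/width grid; it expects a batch dimension: (N, C, H, W).
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
features = conv(image.unsqueeze(0))  # -> shape (1, 8, 224, 224)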

Figure 1: Tensor representation of images and videos. Arrays, images and videos are data structures organized along grids, which makes common machine learning operations (e.g., convolutions, pooling) work well on such data.

Unfortunately, such a grid structure does not exist for 3D geometric data. As a result, traditional machine learning operations such as convolutions and pooling cannot be employed directly on this type of data. This is the reason why the field of geometric deep learning has emerged, with the goal of extending these operations (and, by extension, neural networks) to unstructured data representations.

The choice of a data structure to represent 3D geometry is therefore crucial, as it determines the type of algorithms that can be employed to learn from 3D data. This is why, before looking at machine learning model architectures for 3D data, it is important to understand the pros and cons of each 3D data representation. Note that although this article presents a few important 3D data representations, it is not exhaustive. For a more complete review of 3D data representations, please refer to [4].

3D data representations

Three-dimensional geometric data is very interesting in that there exist many ways of representing it, each representation having its own advantages and drawbacks. Examples of such representations include:

  • Multi-view representations: a multi-view representation of a 3D shape is a set of 2D images that correspond to pictures (or renderings) of a given 3D shape from different viewpoints. Such a data structure is very convenient as it can be processed directly with 2D convolutional neural networks. However, multi-view representations depend on lighting, and a high number of images may be required to cover all the angles of a given 3D shape (e.g., an object with many cavities requires many pictures so that every cavity is visible in at least one of them).

  • RGB-D images: an RGB-D image is a color image that additionally contains depth information at each pixel, i.e., the distance between the camera and the observed object. This distance encodes information about the 3D geometry from a fixed point of view. Such RGB-D images can easily be captured with relatively cheap hardware (e.g., a Microsoft Kinect camera), which is why many datasets of RGB-D images exist.

  • Point clouds (Figure 2 - left): a set of 3D points with coordinates (x, y, z). This is a very common representation, as it is the one usually captured by 3D scanners.

  • Meshes (Figure 2 - right): a set of vertices connected to each other to form triangles (or sometimes quadrilaterals). The set of triangles forms a 2D surface embedded in 3D space. It is a very common data structure, usually employed by 3D renderers. It makes it possible, for instance, to compute a normal vector at every point on the surface, which can then be used by rendering algorithms (e.g., to compute lighting effects); a minimal face-normal computation is sketched after this list.

  • Voxel grid representation: the term voxel means “volume element” and is an extension of 2D pixels (“picture elements”) to 3D. In 2D, a sampled image is a rectangle subdivided into a grid of pixels, each pixel holding a color value. The same idea can be extended to represent a 3D shape. More specifically, to represent a 3D shape with voxels, we start by taking a 3D bounding box (a rectangular parallelepiped) that contains the 3D shape. We then subdivide this rectangular parallelepiped (the sampling stage) into a grid of smaller volume elements (voxels), each of them holding an occupancy value that tells whether the voxel is inside the shape or not. By keeping only the voxels that are inside the shape, we get a sampled version of the original 3D shape (Figure 2 - center); a minimal voxelization sketch is also given after this list. Note that it is also possible to attach additional information to each voxel, such as a color or an opacity value.
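As announced above, here are two short NumPy sketches; the data in both is purely illustrative, not taken from any real model. The first computes per-face normals of a triangle mesh using the cross product of two edge vectors:

import numpy as np

vertices = np.array([[0., 0., 0.],
                     [1., 0., 0.],
                     [0., 1., 0.],
                     [0., 0., 1.]])
faces = np.array([[0, 1, 2],
                  [0, 1, 3]])  # each row indexes the three vertices of a triangle

v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
normals = np.cross(v1 - v0, v2 - v0)                       # one normal per triangle
normals /= np.linalg.norm(normals, axis=1, keepdims=True)  # normalize to unit length

The second sketch illustrates the sampling stage of the voxel grid representation in its simplest form: a (here random) point cloud is mapped to a binary occupancy grid and the fraction of occupied voxels is reported:

points = np.random.rand(10000, 3)  # (N, 3) point cloud with (x, y, z) coordinates
resolution = 30                    # 30 x 30 x 30 voxel grid

# Axis-aligned bounding box of the shape.
mins, maxs = points.min(axis=0), points.max(axis=0)

# Map every point to the index of the voxel that contains it.
idx = ((points - mins) / (maxs - mins + 1e-9) * resolution).astype(int)
idx = np.clip(idx, 0, resolution - 1)

# Occupancy grid: True wherever at least one point falls inside the voxel.
grid = np.zeros((resolution,) * 3, dtype=bool)
grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True

print(f"{grid.mean():.2%} of the voxels are occupied")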

Figure 2: 3D representations of the Stanford bunny. (Left) Point cloud representation. (Middle) Voxel representation. (Right) Mesh representation. Image taken from [1].

The case of the voxel representation is interesting, as it is a volumetric data structure organized along a 3-dimensional grid. As a result, regular 3D convolutional neural networks can be trained directly on voxel representations of 3D shapes. The downside, however, is their high memory requirement, which grows as O(n^3) for a subdivision into n voxels along each dimension. In addition, a high voxel resolution is required to faithfully represent a 3D shape, which is very inefficient as the grid then usually contains a majority of empty voxels, as shown in Figure 3.
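A quick back-of-the-envelope computation illustrates this cubic growth, assuming one float32 occupancy value (4 bytes) per voxel:

for n in (30, 64, 128, 256):
    print(f"{n}^3 voxel grid: {n**3 * 4 / 1e6:.1f} MB")
# 30^3: 0.1 MB | 64^3: 1.0 MB | 128^3: 8.4 MB | 256^3: 67.1 MB

Doubling the resolution multiplies the memory footprint by eight, which is why dense voxel grids quickly become impractical at high resolutions.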

Figure 3: Voxel grid representations of the chair shown on the left, with increasing voxel resolution. Only the occupied voxels within the bounding box of the shape are represented. (Center-left) Voxel grid resolution of 30 x 30 x 30. (Center-right) Voxel grid resolution of 64 x 64 x 64. (Right) Voxel grid resolution of 128 x 128 x 128. The percentage denotes the fraction of occupied voxels within the bounding box. As can be seen, the higher the resolution, the better the approximation of the original shape, but the lower the fraction of occupied voxels. For instance, for the image on the right, only 2.41% of the 128 x 128 x 128 voxels are occupied by the shape, meaning that 97.59% are empty! A high-resolution voxel representation is therefore very inefficient, as the memory is mostly occupied by empty voxels. Image taken from [2].

In addition to these explicit data representations, there also exist implicit representations of 3D shapes. One of them is the signed distance function.

  • Signed distance functions: a signed distance function F takes as input a 3D point P = (x, y, z) and outputs the distance from P to the closest point on the 3D surface. The function F(x, y, z) is zero on the surface, negative inside the surface and positive outside of it. The function F therefore encodes the position of the surface in 3D space. Note that a signed distance function can be converted into a mesh representation with the marching cubes algorithm; a minimal example follows.
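Here is a minimal NumPy sketch of the signed distance function of a sphere of radius r centered at the origin, following the sign convention above; the mesh-extraction step uses scikit-image's marching cubes implementation, and the grid bounds and resolution are arbitrary choices for illustration:

import numpy as np
from skimage import measure  # scikit-image

def sphere_sdf(p, r=1.0):
    # p has shape (..., 3); returns the signed distance to the sphere surface.
    return np.linalg.norm(p, axis=-1) - r

print(sphere_sdf(np.array([0., 0., 0.])))  # -1.0 : inside
print(sphere_sdf(np.array([1., 0., 0.])))  #  0.0 : on the surface
print(sphere_sdf(np.array([2., 0., 0.])))  #  1.0 : outside

# Sample the function on a dense grid and extract the zero level set
# as a triangle mesh with the marching cubes algorithm.
axis = np.linspace(-1.5, 1.5, 64)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
verts, faces, _, _ = measure.marching_cubes(sphere_sdf(grid), level=0.0)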

Figure 4: Signed distance function. The signed distance function is zero on the surface of the bunny, negative inside (blue region) and positive outside (red region). Image taken from [3].

There exist many other ways of representing 3D geometry (e.g., hierarchical structure representations [5]), but this article has covered the main representations that anyone interested in the field of 3D deep learning should know. In the next articles, we will look at recent algorithms and machine learning models designed to work with these data representations.

Bibliography

[1] - A Deep Learning Method for 3D Object Classification Using the Wave Kernel Signature and A Center Point of the 3D-Triangle Mesh. Long Hoang, Suk-Hwan Lee, Oh-Heum Kwon and Ki-Ryong Kwon. Electronics 2019, 8(10), 1196

[2] - FPNN: Field Probing Neural Networks for 3D Data. Yangyan Li, Sören Pirk, Hao Su, Charles R. Qi, Leonidas J. Guibas. NIPS 2016

[3] - DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, Steven Lovegrove. CVPR 2019

[4] - A survey on deep learning advances on different 3D data representations. Eman Ahmed, Alexandre Saint, Abd El Rahman Shabayek, Kseniya Cherenkova, Rig Das, Gleb Gusev, Djamila Aouada and Bjorn Ottersten. arXiv preprint arXiv:1808.01462, 2018.

[5] - StructEdit: Learning structural shape variations. Kaichun Mo, Paul Guerrero, Li Yi, Hao Su, Peter Wonka, Niloy J. Mitra, and Leonidas J. Guibas. CVPR 2020

[6] - 3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks. Denis Kuzminykh, Daniil Polykovskiy, Artur Kadurin, Alexander Zhebrak, Ivan Baskov, Sergey Nikolenko, Rim Shayakhmetov and Alex Zhavoronkov. Molecular Pharmaceutics 15.10 (2018): 4378-4385.

[7] - The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology. Artur Kadurin, Alexander Aliper, Andrey Kazennov, Polina Mamoshina, Quentin Vanhaelen, Kuzma Khrabrov, and Alex Zhavoronkov. Oncotarget 8, no. 7 (2017): 10883.