
Learning for 3D Vision

Assignment 1: Rendering Basics with PyTorch3D

Questions: Github Assignment 1

1. Practicing with Cameras

1.1 360-degree Renders

Usage:

python -m submissions.src.render_360 --num_frames 100 --fps 15 --output_file submissions/360-render.gif
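The 360-degree render simply sweeps the camera azimuth through a full circle around the mesh and writes the frames to a GIF. A minimal sketch of that loop, assuming a PyTorch3D mesh renderer, lights, and the loaded cow mesh already exist (the function name and defaults here are illustrative, not the exact code in render_360):

import imageio
import numpy as np
from pytorch3d.renderer import FoVPerspectiveCameras, look_at_view_transform

def render_turntable(mesh, renderer, lights, num_frames=100, fps=15, dist=3.0,
                     output_file="submissions/360-render.gif", device="cuda"):
    frames = []
    # Place the camera on a circle around the object, one azimuth per frame.
    for azim in np.linspace(0, 360, num_frames, endpoint=False):
        R, T = look_at_view_transform(dist=dist, elev=0.0, azim=azim)
        cameras = FoVPerspectiveCameras(R=R, T=T, device=device)
        image = renderer(mesh, cameras=cameras, lights=lights)[0, ..., :3].cpu().numpy()
        frames.append((image * 255).astype(np.uint8))
    imageio.mimsave(output_file, frames, fps=fps)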

360-degree gif video that shows many continuous views of the provided cow mesh:

image

1.2 Re-creating the Dolly Zoom

Usage:

python -m starter.dolly_zoom --num_frames 20 --output_file submissions/dolly.gif
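The dolly zoom keeps the subject's projected size constant while the field of view changes, which fixes the camera distance as a function of FOV. A minimal sketch of that relationship (subject_width is an assumed constant; the frame loop itself follows the starter script):

import numpy as np

def dolly_zoom_distance(fov_deg, subject_width=5.0):
    # Keep the projected width constant: width = 2 * distance * tan(fov / 2)
    # => distance = width / (2 * tan(fov / 2))
    return subject_width / (2.0 * np.tan(0.5 * np.radians(fov_deg)))

# As the FOV sweeps from narrow to wide, the camera moves closer to the subject.
for fov in np.linspace(5, 120, 5):
    print(f"fov={fov:6.1f} deg -> distance={dolly_zoom_distance(fov):6.2f}")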

My recreated Dolly Zoom effect:

image

2. Practicing with Meshes

2.1 Constructing a Tetrahedron

Usage:

python -m submissions.src.mesh_practice --shape tetrahedron --num_frames 50 --fps 15 --output_file submissions/tetrahedron.gif --image_size 512

360-degree gif animation of tetrahedron:

image

Number of vertices = 4
Number of faces = 4
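With only 4 vertices and 4 triangular faces, the mesh can be written out by hand. A minimal sketch of the construction in PyTorch3D (the coordinates and the flat gray per-vertex texture are illustrative, not necessarily the exact values behind the GIF above):

import torch
from pytorch3d.renderer import TexturesVertex
from pytorch3d.structures import Meshes

# 4 vertices of a regular tetrahedron (illustrative coordinates).
verts = torch.tensor([
    [ 1.0,  1.0,  1.0],
    [-1.0, -1.0,  1.0],
    [-1.0,  1.0, -1.0],
    [ 1.0, -1.0, -1.0],
])
# 4 triangular faces, each listing 3 vertex indices.
faces = torch.tensor([
    [0, 1, 2],
    [0, 3, 1],
    [0, 2, 3],
    [1, 3, 2],
])
# A constant per-vertex color so the mesh can be rendered with a standard mesh renderer.
textures = TexturesVertex(verts_features=0.6 * torch.ones_like(verts)[None])
mesh = Meshes(verts=[verts], faces=[faces], textures=textures)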

2.2 Constructing a Cube

Usage:

python -m submissions.src.mesh_practice --shape cube --num_frames 50 --fps 15 --output_file submissions/cube.gif --image_size 512

360-degree gif animation of cube:

image

Number of vertices = 8
Number of faces = 12

3. Re-texturing a mesh

Chosen colors:
color1 = [0, 1, 1]
color2 = [1, 1, 0]
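The GIF below blends these two colors across the mesh. A minimal sketch of one standard way to do that, linearly interpolating from color1 to color2 along the z-extent of the vertices (the axis choice and helper name are assumptions for illustration; the exact scheme used for the render lives in retexturing_mesh):

import torch
from pytorch3d.renderer import TexturesVertex

def blend_colors(verts, color1=(0.0, 1.0, 1.0), color2=(1.0, 1.0, 0.0)):
    # verts: (V, 3) vertex positions of the mesh.
    z = verts[:, 2]
    alpha = ((z - z.min()) / (z.max() - z.min()))[:, None]   # 0 at z_min, 1 at z_max
    c1 = torch.tensor(color1, dtype=verts.dtype)
    c2 = torch.tensor(color2, dtype=verts.dtype)
    colors = alpha * c2 + (1.0 - alpha) * c1                  # per-vertex RGB blend
    return TexturesVertex(verts_features=colors[None])        # (1, V, 3)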

Usage:

python -m submissions.src.retexturing_mesh --num_frames 50 --fps 15 --output_file submissions/retexture_mesh.gif --image_size 512

Gif of rendered mesh:

image

4. Camera Transformations

$1)$ Rotate about z-axis by -90 degrees:

$R_{relative} = [[\cos(-\pi/2), -\sin(-\pi/2), 0], [\sin(-\pi/2), \cos(-\pi/2), 0], [0, 0, 1]]$

Use original translation matrix: $T_{relative} = [0, 0, 0]$

image

$2)$ Keep original rotation matrix: $R_{relative} = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]$

Move along z-axis by 2: $T_{relative} = [0, 0, 2]$

image

$3)$ Keep original rotation matrix: $R_{relative} = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]$

Move along x-axis by 0.5 and along y-axis by -0.5: $T_{relative} = [0.5, -0.5, 0]$

image

$4)$ Rotate about the y-axis by 90 degrees: $R_{relative} = [[\cos(\pi/2), 0, \sin(\pi/2)], [0, 1, 0], [-\sin(\pi/2), 0, \cos(\pi/2)]]$

Move along x-axis by -3 and along z-axis by 3: $T_{relative} = [-3, 0, 3]$

image
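Each of the four cases above only changes R_relative and T_relative; they are then composed with the fixed base camera extrinsics (R_0, T_0) from the starter code before rendering. A minimal sketch of one common composition (an assumption for illustration; the starter's camera_transforms code defines the exact convention):

import torch
from pytorch3d.renderer import FoVPerspectiveCameras

def build_camera(R_relative, T_relative, R_0, T_0, device="cuda"):
    # R_relative: 3x3 nested list, T_relative: length-3 list (the values listed above).
    # R_0, T_0: base camera rotation (3x3) and translation (3,) tensors.
    R_rel = torch.tensor(R_relative, dtype=torch.float32)
    T_rel = torch.tensor(T_relative, dtype=torch.float32)
    R = R_rel @ R_0                   # rotate the base camera orientation
    T = R_rel @ T_0 + T_rel           # rotate the base translation, then offset it
    return FoVPerspectiveCameras(R=R.unsqueeze(0), T=T.unsqueeze(0), device=device)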

5. Rendering Generic 3D Representations

5.1 Rendering Point Clouds from RGB-D Images

Usage:

python -m submissions.src.pcl_render --image_size 512

Gif of point cloud corresponding to the first image:

image

Gif of point cloud corresponding to the second image:

image

Gif of point cloud formed by the union of the first 2 point clouds:

image

5.2 Parametric Functions

Usage:

python -m submissions.src.torus_render --function parametric --image_size 512 --num_samples 500

Parametric equations of Torus:

\[x = (R + r\cos\theta)\cos\phi \\ y = (R + r\cos\theta)\sin\phi \\ z = r\sin\theta\]

where \(\theta \in [0,2\pi) \\ \phi \in [0,2\pi)\)

The major radius $R$ is the distance from the center of the tube to the center of the torus, and the minor radius $r$ is the radius of the tube.
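A minimal sketch of sampling these equations into a PyTorch3D point cloud (a regular grid over θ and φ; the coordinate-based colors are just for visualization):

import math
import torch
from pytorch3d.structures import Pointclouds

def torus_point_cloud(R=1.0, r=0.5, num_samples=500, device="cuda"):
    theta = torch.linspace(0, 2 * math.pi, num_samples)
    phi = torch.linspace(0, 2 * math.pi, num_samples)
    theta, phi = torch.meshgrid(theta, phi, indexing="ij")

    # Parametric equations of the torus from above.
    x = (R + r * torch.cos(theta)) * torch.cos(phi)
    y = (R + r * torch.cos(theta)) * torch.sin(phi)
    z = r * torch.sin(theta)

    points = torch.stack((x.flatten(), y.flatten(), z.flatten()), dim=1)
    # Normalize coordinates into [0, 1] and reuse them as RGB colors.
    colors = (points - points.min()) / (points.max() - points.min())
    return Pointclouds(points=[points.to(device)], features=[colors.to(device)])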

360-degree gif of torus, with visible hole:

image

Parametric equations of Superquadric Surface:

\[x = a(\cos\theta)^m(\cos\phi)^n \\ y = b(\cos\theta)^m(\sin\phi)^n \\ z = c(\sin\theta)^m\]

where \(\theta \in [-\frac{\pi}{2}, \frac{\pi}{2}] \\ \phi \in [0,2\pi)\)

360-degree gif of Superquadric Surface:

image

5.3 Implicit Surfaces

Usage:

python -m submissions.src.torus_render --function implicit --image_size 512

Implicit equation of torus:

\[F(X,Y,Z) = (R - \sqrt{X^2+Y^2})^2 + Z^2 - r^2\]
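To render this implicit surface, F is evaluated on a regular voxel grid and its zero level set is extracted with marching cubes. A minimal sketch using the PyMCubes package (the grid bounds and resolution are assumptions):

import mcubes
import numpy as np
import torch
from pytorch3d.renderer import TexturesVertex
from pytorch3d.structures import Meshes

def implicit_torus_mesh(R=1.0, r=0.5, voxel_size=64, bound=2.0):
    # Evaluate F(X, Y, Z) on a regular grid covering [-bound, bound]^3.
    coords = np.linspace(-bound, bound, voxel_size)
    X, Y, Z = np.meshgrid(coords, coords, coords, indexing="ij")
    F = (R - np.sqrt(X**2 + Y**2)) ** 2 + Z**2 - r**2

    # Extract the zero level set (the torus surface) with marching cubes.
    vertices, faces = mcubes.marching_cubes(F, 0)
    vertices = vertices / (voxel_size - 1) * (2 * bound) - bound   # grid indices -> world coordinates

    verts = torch.tensor(vertices, dtype=torch.float32)
    faces = torch.tensor(faces.astype(np.int64))
    textures = TexturesVertex(verts_features=0.7 * torch.ones_like(verts)[None])
    return Meshes(verts=[verts], faces=[faces], textures=textures)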

360-degree gif of torus, with visible hole:

image

Implicit equation of Superquadric Surface:

\[F(X,Y,Z) = \bigg(\bigg(\bigg\rvert\frac{X}{a}\bigg\rvert \bigg)^\frac{2}{n} + \bigg(\bigg\rvert\frac{Y}{b}\bigg\rvert \bigg)^\frac{2}{n} \bigg)^\frac{n}{m} + \bigg(\bigg\rvert\frac{Z}{c}\bigg\rvert \bigg)^\frac{2}{m} - 1\]

360-degree gif of Superquadric Surface:

image

Tradeoffs between rendering as a mesh vs a point cloud:

  • Method of Generation:
    • Point Clouds: Formed by directly sampling a parametric function.
    • Meshes: Built by voxelizing a 3D space, sampling an implicit function, and then extracting surfaces using the Marching Cubes algorithm.
  • Rendering Speed:
    • Point Clouds: Faster to render since they simply use sampled points without extra processing.
    • Meshes: Slower because they need additional steps like voxelization and surface extraction before rendering.
  • Accuracy & Visual Quality:
    • Point Clouds: More accurate at capturing fine details because each point represents a sampled location. However, they don’t have surfaces, making shading and texturing more difficult.
    • Meshes: Can be less accurate due to voxelization, but increasing the resolution can improve precision. They also provide continuous surfaces, which makes them easier to texture and shade.
  • Computational Efficiency:
    • Point Clouds: Easier to rotate, scale, and modify since they are just a collection of points.
    • Meshes: More computationally expensive to modify because updating a mesh requires adjusting vertex positions and their connections.

6. Do Something Fun

Here is a 360-degree view of a cottage, along with a dolly zoom view:

Usage:

python -m submissions.src.fun --function full --image_size 512 --output_file submissions/cottage_render_360.gif

image

Usage:

python -m submissions.src.fun --function dolly --image_size 512 --output_file submissions/cottage_dolly.gif

image

(Extra Credit) 7. Sampling Points on Meshes

image

image

image

image

image


Assignment 2: Single View to 3D

Questions: Github Assignment 2

0. Setup

Downloaded the ShapeNet single-class dataset, unzipped it, and set the appropriate path in dataset_location.py.

1. Exploring loss functions

1.1. Fitting a voxel grid

To align a predicted voxel grid with a target shape, I used a binary cross-entropy (BCE) loss. A 3D voxel grid consists of 0 (empty) and 1 (occupied) values, making this a per-voxel binary classification problem in which the network predicts an occupancy probability for each voxel.

Implementation:

import torch

def voxel_loss(voxel_src, voxel_tgt):
	# voxel_src: b x h x w x d, predicted occupancy probabilities (after a sigmoid)
	# voxel_tgt: b x h x w x d, binary ground-truth occupancies

	# BCE between predicted probabilities and the target occupancies.
	loss = torch.nn.functional.binary_cross_entropy(voxel_src, voxel_tgt.float())

	return loss

Usage:

python fit_data.py --type 'vox'

| Ground Truth | Optimized Voxel |
| --- | --- |
| image | image |

I ran the fitting for 10000 iterations.

1.2. Fitting a point cloud

Usage:

python fit_data.py --type 'point'

| Ground Truth | Optimized Point Cloud |
| --- | --- |
| image | image |
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
	# point_cloud_src, point_cloud_tgt: b x n_points x 3

	# Nearest-neighbor (squared) distances in both directions, i.e. a symmetric Chamfer loss.
	p1 = knn_points(point_cloud_src, point_cloud_tgt, K=1)
	p2 = knn_points(point_cloud_tgt, point_cloud_src, K=1)
	loss_chamfer = torch.sum(p1.dists) + torch.sum(p2.dists)

	return loss_chamfer

1.3. Fitting a mesh

Usage:

python fit_data.py --type 'mesh'

| Ground Truth | Optimized Mesh |
| --- | --- |
| image | image |
from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh_src):
	# Laplacian smoothing regularizer on the predicted mesh.
	loss_laplacian = mesh_laplacian_smoothing(mesh_src)

	return loss_laplacian

2. Reconstructing 3D from single view

2.1. Image to voxel grid

| Input RGB | Ground Truth Mesh | Ground Truth Voxel | Predicted 3D Voxel |
| --- | --- | --- | --- |
| image | image | image | image |
| image | image | image | image |
| image | image | image | image |
| image | image | image | image |

I implemented the decoder architecture from the paper Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images.

For the encoder, I used the pre-trained ResNet-18 model, which computes a set of features for the decoder to recover the 3D shape of the object.

The decoder transforms the 2D image features into a 3D volume. I implemented a slightly modified version of the Pix2Vox-F decoder from the paper above. The decoder input has size [batch_size x 512] and the output is [batch_size x 32 x 32 x 32]. It contains five 3D transposed convolutional layers: the first four have kernel size $4^3$ with stride $2$ and padding $1$, and the last has kernel size $1^3$. Each transposed convolutional layer is followed by a LeakyReLU activation, except for the last, which is followed by a sigmoid. The number of output channels per layer follows the Pix2Vox-F configuration: 128 -> 64 -> 32 -> 8 -> 1.
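A minimal sketch of a decoder matching that description (the reshape of the 512-dimensional feature into a 64-channel 2x2x2 volume before the first transposed convolution is my assumption; the layer sizes and activations follow the text above):

import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose3d(64, 128, kernel_size=4, stride=2, padding=1),  # 2^3 -> 4^3
            nn.LeakyReLU(0.2),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),  # 4^3 -> 8^3
            nn.LeakyReLU(0.2),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),   # 8^3 -> 16^3
            nn.LeakyReLU(0.2),
            nn.ConvTranspose3d(32, 8, kernel_size=4, stride=2, padding=1),    # 16^3 -> 32^3
            nn.LeakyReLU(0.2),
            nn.ConvTranspose3d(8, 1, kernel_size=1),                          # 1^3 kernel, 32^3 kept
            nn.Sigmoid(),
        )

    def forward(self, feat):                      # feat: (batch_size, 512)
        x = feat.view(-1, 64, 2, 2, 2)            # lift the feature vector to a coarse 3D volume
        return self.layers(x).squeeze(1)          # (batch_size, 32, 32, 32) occupancy probabilities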

I trained the model for 10000 iterations, with the default batch size of 32 and learning rate of 4e-4.

Usage:

python train_model.py --type 'vox' --max_iter 10000 --save_freq 500 

python eval_model.py --type 'vox' --load_checkpoint

2.2. Image to point cloud

| Input RGB | Ground Truth Mesh | Ground Truth Point Cloud | Predicted 3D Point Cloud |
| --- | --- | --- | --- |
| image | image | image | image |
| image | image | image | image |
| image | image | image | image |
| image | image | image | image |

I used an approach similar to the Pix2Vox-F decoder that I implemented above. The ResNet-18 model encodes the input images into feature maps, and a decoder reconstructs the 3D shape of the object from them.

The decoder takes an input of size [batch_size x 512] and produces an output of size [batch_size x n_points x 3]. The decoder architecture consists of 4 fully connected layers, three of which are followed by a LeakyReLU activation function.
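A minimal sketch of such a decoder (the hidden-layer widths are assumptions; only the input/output sizes and the four-FC-layer structure come from the description above):

import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    def __init__(self, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.layers = nn.Sequential(
            nn.Linear(512, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 2048), nn.LeakyReLU(0.2),
            nn.Linear(2048, 4096), nn.LeakyReLU(0.2),
            nn.Linear(4096, n_points * 3),        # final layer has no activation
        )

    def forward(self, feat):                      # feat: (batch_size, 512)
        out = self.layers(feat)
        return out.view(-1, self.n_points, 3)     # (batch_size, n_points, 3)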

I trained the model for 2000 iterations, with the default batch size of 32 and learning rate of 4e-4.

Usage:

python train_model.py --type 'point' --max_iter 2000 --save_freq 500 --n_points 5000 

python eval_model.py --type 'point' --load_checkpoint --n_points 5000

2.3. Image to mesh

| Input RGB | Ground Truth Mesh | Predicted Mesh |
| --- | --- | --- |
| image | image | image |
| image | image | image |
| image | image | image |
| image | image | image |

Unlike the image-to-voxel and image-to-point-cloud models, which decode the image features directly into a shape, the meshes are constructed by deforming an icosphere mesh. The purpose of the decoder is to refine this initial mesh by outputting a per-vertex displacement vector.

The decoder architecture I implemented is very similar to the image-to-point-cloud decoder described above. It takes an input of size [batch_size x 512] and gives an output of size [batch_size x num_vertices x 3]. It consists of 4 fully connected layers, three of which are followed by a ReLU activation function.
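A minimal sketch of predicting the per-vertex displacements and applying them to the icosphere with PyTorch3D's offset_verts (the hidden-layer widths and the usage example are assumptions):

import torch
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    def __init__(self, num_vertices):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 4096), nn.ReLU(),
            nn.Linear(4096, num_vertices * 3),
        )

    def forward(self, feat, src_mesh):
        # feat: (batch_size, 512); src_mesh: the icosphere extended to the batch size.
        deform = self.layers(feat).view(-1, 3)    # packed per-vertex displacement vectors
        return src_mesh.offset_verts(deform)      # deformed mesh prediction

# Example: a level-4 icosphere as the initial shape for a batch of 2 images.
template = ico_sphere(level=4)
decoder = MeshDecoder(num_vertices=template.verts_packed().shape[0])
src_mesh = template.extend(2)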

I trained the model for 2000 iterations, with the default batch size of 32 and a learning rate of 4e-4.

Usage:

python train_model.py --type 'mesh' --max_iter 2000 --save_freq 500

python eval_model.py --type 'mesh' --load_checkpoint

2.4. Quantitative comparisons

F1-score curves at different thresholds:

| Voxel Grid | Point Cloud | Mesh |
| --- | --- | --- |
| image | image | image |

From the above plots, we can infer that the point cloud model performed the best, giving the highest F1-score, followed by the mesh model and the voxel model.

Intuitively, I think the reason the point cloud outperformed the voxel and mesh models is because it aligns well with the evaluation method, which compares points directly from the network output to the ground truth. Since point clouds don’t need to define surfaces or connections, they are more flexible and avoid errors caused by surface sampling. This makes them easier to optimize and more accurate in reconstruction.

The mesh model performed slightly worse primarily due to the challenges associated with sampling points from a continuous surface. Unlike point clouds, where each output point is directly predicted, meshes require proper face orientation and connectivity. Due to this, the sampled points might not always align perfectly with the ground truth, especially when dealing with complex geometries like thin structures (legs of the chair).

The voxel model has the lowest F1-score because representing 3D space at a fixed voxel resolution limits detail and accuracy. Fine details can be lost at this resolution, and points sampled from the voxelized shape may not align perfectly with the object’s actual surface, affecting evaluation results.

2.5. Analyze effects of hyperparameter variations

I have chosen to vary the w_smooth hyperparameter and analyze the changes in the mesh model prediction. The default value of w_smooth is 0.1, and its results are shown in Section 2.3. I sampled four other values of w_smooth: 0.001, 0.01, 1, and 10.

| Input RGB | Ground Truth Mesh | w_smooth=0.001 | w_smooth=0.01 | w_smooth=0.1 (Default) | w_smooth=1 | w_smooth=10 |
| --- | --- | --- | --- | --- | --- | --- |
| image | image | image | image | image | image | image |
| image | image | image | image | image | image | image |
| image | image | image | image | image | image | image |

F1-score curves for different variations in w_smooth for the mesh model:

| w_smooth=0.001 | w_smooth=0.01 | w_smooth=0.1 (Default) | w_smooth=1 | w_smooth=10 |
| --- | --- | --- | --- | --- |
| Avg F1-score@0.05 = 72.977 | Avg F1-score@0.05 = 72.133 | Avg F1-score@0.05 = 70.951 | Avg F1-score@0.05 = 71.834 | Avg F1-score@0.05 = 72.337 |
| image | image | image | image | image |

For low values of w_smooth = 0.001, 0.01:

  • They preserve the fine details but introduce noise and distortions
  • They result in rough and fragmented surfaces
  • They show a slightly higher F1-score because they retain geometric details

For high values of w_smooth = 1, 10:

  • They produce cleaner and smoother meshes with reduced artifacts
  • They over-smooth the surface, causing loss of sharp details
  • The F1-score improves slightly, likely because the smoothed surface points are more likely to fall within the threshold radius of the ground truth

The default value of w_smooth = 0.1 falls in between the above two categories.

2.6. Interpret your model

Per-Point Error Visualization

| Input RGB | Ground Truth Point Cloud | Predicted 3D Point Cloud | Per-Point Error |
| --- | --- | --- | --- |
| image | image | image | image |
| image | image | image | image |

I used per-point error to gauge how well each predicted 3D point matches its corresponding point in the ground truth. In my approach, I compute the distance between each point in my reconstructed point cloud and its nearest neighbor in the ground truth. Then, I color-code these distances such that points with very small errors appear in cool colors (blue), while those with larger errors show up in warm colors (red).
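A minimal sketch of that computation, using pytorch3d's knn_points for the nearest-neighbor lookup and a matplotlib colormap for the blue-to-red coding (the function name and the jet colormap are illustrative choices):

import torch
from matplotlib import cm
from pytorch3d.ops import knn_points

def per_point_error_colors(pred_points, gt_points):
    # pred_points, gt_points: (1, N, 3) predicted and ground-truth point clouds.
    # Squared distance from each predicted point to its nearest ground-truth neighbor.
    dists = knn_points(pred_points, gt_points, K=1).dists[0, :, 0]   # (N,)

    # Normalize to [0, 1] and map through a blue (low error) -> red (high error) colormap.
    errors = (dists - dists.min()) / (dists.max() - dists.min() + 1e-8)
    colors = cm.jet(errors.cpu().numpy())[:, :3]                      # (N, 3) RGB
    return torch.tensor(colors, dtype=torch.float32)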

From the above gifs, we can clearly see that some regions are rendered in cool tones, which tells me that my model is accurately capturing those parts of the object, such as the seat surface or the main body of the chair. On the other hand, areas highlighted in warm colors reveal where the model struggles, like along the thin chair legs or at complex curves of the backrest.

This visualization pinpoints the exact regions that need improvement.

Failure Case Analysis

In analyzing my 3D voxel model’s predictions, I noticed that while it reconstructs the backrest of chairs quite well, it struggles significantly with legs, seats, and unusual shapes. These failure cases provide valuable insight into the model’s learning behavior and what its limitations are.

  1. Legs: Chair legs vary widely in shape, thickness, and placement across different samples in the dataset. Some chairs have four standard legs, while others may have a single central support or a complex curved base. Because the model tries to generalize patterns across the dataset, it struggles to reconstruct legs consistently. Additionally, legs are usually thin and small compared to the rest of the chair, and this makes them more prone to voxelization errors.

  2. Seats: Many chair designs have gaps in them, which makes it challenging for the model to learn and also more prone to voxelization errors. Since the model tries to reconstruct a smoothed version of objects, it often fails to represent holes correctly, either closing them off entirely or introducing unexpected artifacts.

  3. Unusual shapes: Some chairs in the dataset have very unique designs. Since the model is trained on a limited dataset, it may not have seen enough similar examples to generalize well.

3. Exploring other architectures / datasets

3.3 Extended dataset for training

I updated dataset_location.py to point to the extended dataset containing three classes (chair, car, and plane).

| category | #instances |
| --- | --- |
| airplane | 36400 |
| car | 31460 |
| chair | 61000 |
| total | 128860 |

I trained and evaluated the point cloud model with n_points = 5000.

Quantitative evaluation:

| Point Cloud trained on one class | Point Cloud trained on three classes |
| --- | --- |
| image | image |

Qualitative evaluation comparing “training on one class” vs “training on three classes”:

| Input RGB | Ground Truth Point Cloud | Predicted 3D Point Cloud for 1 Class Training | Predicted 3D Point Cloud for 3 Classes Training |
| --- | --- | --- | --- |
| image | image | image | image |
| image | image | image | image |
| image | image | image | image |

3D consistency and diversity of output samples:

Training the model on a single class, like chairs, results in more consistent and refined reconstructions. Since the model only sees one object type during training, it gets really good at capturing the details and structure unique to that class. However, this also means that the model becomes highly specialized. So when it is faced with a completely new object type (such as an airplane or car), it struggles because it hasn’t learned to handle the variation.

On the other hand, training on multiple classes (airplanes, cars, and chairs) allows the model to adapt better to different object shapes. Instead of focusing on one type, it learns general patterns that apply across different categories. This makes it more versatile when reconstructing new objects.

So in conclusion, single-class models tend to produce more uniform outputs because they have learned a very specific structural representation but lack adaptability. Multi-class models generate more diverse outputs because they have seen various object types and have learned to adapt to different shapes but at the cost of some fine-grained details.


Assignment 3: Part-1 Neural Volume Rendering

Questions: Github Assignment 3

0. Transmittance Calculation

image

1. Differentiable Volume Rendering

1.3. Ray sampling

Usage:

python volume_rendering_main.py --config-name=box

image

image

1.4. Point sampling

Usage:

python volume_rendering_main.py --config-name=box

image

1.5. Volume rendering

Usage:

python volume_rendering_main.py --config-name=box

image

image

2. Optimizing a basic implicit volume

2.1. Random ray sampling

import numpy as np

def get_random_pixels_from_image(n_pixels, image_size, camera):
    xy_grid = get_pixels_from_image(image_size, camera)

    # Random subsampling of pixel coordinates (np.random.choice samples with replacement).
    xy_grid_sub = xy_grid[np.random.choice(xy_grid.shape[0], n_pixels)].to("cuda")

    return xy_grid_sub.reshape(-1, 2)[:n_pixels]

2.2. Loss and training

loss = torch.nn.functional.mse_loss(rgb_gt, out['feature'])

Usage:

python volume_rendering_main.py --config-name=train_box

Box center: (0.2502, 0.2506, -0.0005)
Box side lengths: (2.0051, 1.5036, 1.5034)

2.3. Visualization

Usage:

python volume_rendering_main.py --config-name=train_box

image

3. Optimizing a Neural Radiance Field (NeRF)

Usage:

python volume_rendering_main.py --config-name=nerf_lego

image

4. NeRF Extras

4.1 View Dependence

Usage:

python volume_rendering_main.py --config-name=nerf_materials

image

Trade-offs between increased view dependence and generalization quality:

  • Adding view dependence allows the model to capture complex lighting effects like reflections and translucency. But excessive view dependence can create inconsistencies when interpolating between views, which can make the rendering look unnatural.
  • If the network heavily relies on viewing direction, it may overfit to the specific camera angles in the training data. This can lead to poor generalization to unseen viewpoints.
  • It can increase the network’s complexity, requiring more parameters and training time.

Assignment 3: Part-2 Neural Surface Rendering

5. Sphere Tracing

My implementation of the SphereTracingRenderer class uses the sphere tracing algorithm to find intersections between rays and an implicit surface of a torus defined by a signed distance field. The algorithm iteratively updates points along each ray by moving in the direction of the ray by the amount of the SDF value at the current point. This process continues until the maximum number of iterations is reached or the SDF value becomes very close to zero (threshold of 1e-6), indicating a surface intersection.
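A minimal, self-contained sketch of that loop (variable names are illustrative; the actual SphereTracingRenderer class wraps this in the renderer interface and uses the resulting mask for shading):

import torch

def sphere_trace(sdf, origins, directions, max_iters=64, eps=1e-6, far=10.0):
    # origins, directions: (N, 3) ray origins and unit ray directions.
    # sdf: callable mapping (N, 3) points to (N, 1) signed distances.
    t = torch.zeros(origins.shape[0], 1, device=origins.device)   # distance marched per ray
    points = origins.clone()
    for _ in range(max_iters):
        dist = sdf(points)                    # signed distance at the current points
        t = t + dist                          # safe to step forward by exactly that amount
        points = origins + t * directions
    # Rays whose SDF value is (near) zero hit the surface; the rest escaped past the far bound.
    mask = (sdf(points).abs() < eps) & (t < far)
    return points, mask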

Usage:

python -m surface_rendering_main --config-name=torus_surface

image

6. Optimizing a Neural SDF

Usage:

python -m surface_rendering_main --config-name=points_surface

| Input | lr=0.0001 | lr=0.001 | lr=0.00001 |
| --- | --- | --- | --- |
| image | image | image | image |
| Loss | 0.001279 | 0.000428 | 0.001635 |

eikonal_loss = ((gradients.norm(2, dim=1) - 1.0) ** 2).mean()

Implementation:

Input (XYZ points) -> Harmonic Embedding -> Layer 1 (Linear + ReLU)
                      -> Layer 2 (Linear + ReLU)
                      -> Layer 3 (Linear + ReLU)
                      -> ...
                      -> Layer N (Linear + ReLU)
                      -> Linear SDF (Output: Signed Distance Function)

7. VolSDF

Usage:

python -m surface_rendering_main --config-name=volsdf_surface

  • Alpha: Scales the overall density. A higher value increases the density, while a lower value reduces it.
  • Beta: Controls how quickly the density changes with distance from the surface. A smaller beta results in a sharper transition, while a larger beta smooths the transition.
def sdf_to_density(signed_distance, alpha, beta):
    return torch.where(
            signed_distance > 0,
            0.5 * torch.exp(-signed_distance / beta),
            1 - 0.5 * torch.exp(signed_distance / beta),
        ) * alpha
| Geometry | Result |
| --- | --- |
| image | image |

Loss = 0.006958

The above renders are for values alpha=10.0 and beta=0.05.

When alpha=10.0 and beta is changed:

| beta | beta=0.05 | beta=0.1 | beta=0.5 |
| --- | --- | --- | --- |
| Geometry | image | image | image |
| Render | image | image | image |
| Loss | 0.006958 | 0.010227 | 0.020789 |

When beta=0.05 and alpha is changed:

| alpha | alpha=1 | alpha=10 | alpha=50 |
| --- | --- | --- | --- |
| Geometry | image | image | image |
| Render | image | image | image |
| Loss | 0.022317 | 0.006958 | 0.004329 |

How does high beta bias your learned SDF? What about low beta?

High beta makes the transition between occupied and free space more gradual, leading to a smoother SDF. This can cause a bias where surfaces appear more diffused rather than sharp.

Low beta results in a sharper transition, meaning the SDF will be more precise in distinguishing surfaces, but it can also lead to unstable gradients and more difficult optimization.

Would an SDF be easier to train with volume rendering and low beta or high beta? Why?

An SDF is generally easier to train with volume rendering when using a high beta. This is because high beta values cause a larger number of points along each ray to have non-zero density, allowing gradients to be backpropagated through more points simultaneously. This leads to denser gradients and faster convergence during training.

Training with a low beta can be more challenging because it forces the network to learn very sharp transitions, which means only points very close to the surface contribute significantly to the rendering. This can lead to sparse gradients and slower convergence.

Would you be more likely to learn an accurate surface with high beta or low beta? Why?

You are more likely to learn an accurate surface with a low beta. A low beta encourages sharp boundaries and a more precise surface representation, as the density function closely approximates a step function. High beta values, on the other hand, lead to smoother surfaces, which can be less accurate.

Implementation:

Input (SDF Feature + XYZ Embedding) -> Layer 1 (Linear + ReLU)  
                      -> Layer 2 (Linear + ReLU)  
                      -> Layer 3 (Linear + ReLU)  
                      -> ...  
                      -> Layer N (Linear + ReLU)  
                      -> Linear RGB (Output: 3D Color Prediction)  

8. Neural Surface Extras

8.3 Alternate SDF to Density Conversions

Logistic density distribution function:

\[\phi_s(x) = \frac{se^{-sx}}{(1+e^{-sx})^2}\]
def neus_sdf_to_density(signed_distance, s):
    return s * torch.exp(-s * signed_distance) / ((1 + torch.exp(-s * signed_distance))**2) 
| s | s=10 | s=50 |
| --- | --- | --- |
| Geometry | image | image |
| Render | image | image |
| Loss | 0.005590 | 0.006529 |

A low value of s results in a blurrier render, while a higher value of s makes it look sharper.

This post is licensed under CC BY 4.0 by the author.