
Learning for 3D Vision

Assignment 1: Rendering Basics with PyTorch3D

Questions: Github Assignment 1

1. Practicing with Cameras

1.1 360-degree Renders

Usage:

python -m submissions.src.render_360 --num_frames 100 --fps 15 --output_file submissions/360-render.gif

A 360-degree gif showing continuous views of the provided cow mesh:

image
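For reference, a minimal sketch of how such a 360-degree render loop can be written with PyTorch3D, assuming a MeshRenderer (renderer) and the loaded cow mesh already exist; the distance and elevation values are illustrative:

import numpy as np
from pytorch3d.renderer import FoVPerspectiveCameras, look_at_view_transform

def render_360_frames(mesh, renderer, num_frames=100, dist=3.0, elev=30.0, device="cuda"):
    # Sweep the azimuth over a full circle and render one frame per angle.
    frames = []
    for azim in np.linspace(0, 360, num_frames, endpoint=False):
        R, T = look_at_view_transform(dist=dist, elev=elev, azim=azim)
        cameras = FoVPerspectiveCameras(R=R, T=T, device=device)
        image = renderer(mesh, cameras=cameras)  # (1, H, W, 4)
        frames.append((image[0, ..., :3].cpu().numpy() * 255).astype(np.uint8))
    return frames  # e.g. written out as a gif with imageio.mimsave(output_file, frames)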

1.2 Re-creating the Dolly Zoom

Usage:

python -m starter.dolly_zoom --num_frames 20 --output_file submissions/dolly.gif

My recreated Dolly Zoom effect:

image

2. Practicing with Meshes

2.1 Constructing a Tetrahedron

Usage:

python -m submissions.src.mesh_practice --shape tetrahedron --num_frames 50 --fps 15 --output_file submissions/tetrahedron.gif --image_size 512

360-degree gif animation of tetrahedron:

image

Number of vertices = 4
Number of faces = 4
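For reference, a minimal sketch of how the tetrahedron can be built as a PyTorch3D mesh; the specific vertex coordinates and the single-color per-vertex texture are illustrative:

import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import TexturesVertex

# 4 vertices of a regular tetrahedron and its 4 triangular faces
# (face winding chosen so the normals point outward).
verts = torch.tensor(
    [[1.0, 1.0, 1.0], [1.0, -1.0, -1.0], [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
)
faces = torch.tensor([[0, 1, 2], [0, 3, 1], [0, 2, 3], [1, 3, 2]], dtype=torch.int64)

# Uniform gray per-vertex color so the mesh can be shaded and rendered.
textures = TexturesVertex(verts_features=0.7 * torch.ones_like(verts)[None])
tetrahedron = Meshes(verts=[verts], faces=[faces], textures=textures)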

2.2 Constructing a Cube

Usage:

python -m submissions.src.mesh_practice --shape cube --num_frames 50 --fps 15 --output_file submissions/cube.gif --image_size 512

360-degree gif animation of cube:

image

Number of vertices = 8
Number of faces = 12

3. Re-texturing a mesh

Chosen colors:
color1 = [0, 1, 1]
color2 = [1, 1, 0]

Usage:

python -m submissions.src.retexturing_mesh --num_frames 50 --fps 15 --output_file submissions/retexture_mesh.gif --image_size 512

Gif of rendered mesh:

image
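For reference, a minimal sketch of the re-texturing step, assuming the usual approach of linearly blending from color1 to color2 along the z-axis of a loaded cow mesh (the choice of blending axis is my assumption):

import torch
from pytorch3d.renderer import TexturesVertex

color1 = torch.tensor([0.0, 1.0, 1.0])  # cyan
color2 = torch.tensor([1.0, 1.0, 0.0])  # yellow

verts = mesh.verts_packed()                                   # (V, 3), mesh loaded elsewhere
z = verts[:, 2]
alpha = ((z - z.min()) / (z.max() - z.min())).unsqueeze(1)    # (V, 1), 0 at z_min, 1 at z_max
colors = alpha * color2 + (1 - alpha) * color1                # per-vertex linear blend

mesh.textures = TexturesVertex(verts_features=colors.unsqueeze(0))  # (1, V, 3)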

4. Camera Transformations

1) Rotate about the z-axis by -90 degrees:

$R_{relative} = [[\cos(-\pi/2), -\sin(-\pi/2), 0], [\sin(-\pi/2), \cos(-\pi/2), 0], [0, 0, 1]]$

Use the original translation vector: $T_{relative} = [0, 0, 0]$

image

2) Keep original rotation matrix: $R_{relative} = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]$

Move along z-axis by 2: $T_{relative} = [0, 0, 2]$

image

3) Keep original rotation matrix: $R_{relative} = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]$

Move along x-axis by 0.5 and along y-axis by -0.5: $T_{relative} = [0.5, -0.5, 0]$

image

4) Rotate about the y-axis by 90 degrees: $R_{relative} = [[\cos(\pi/2), 0, \sin(\pi/2)], [0, 1, 0], [-\sin(\pi/2), 0, \cos(\pi/2)]]$

Move along x-axis by -3 and along z-axis by 3: $T_{relative} = [-3, 0, 3]$

image

5. Rendering Generic 3D Representations

5.1 Rendering Point Clouds from RGB-D Images

Usage:

python -m submissions.src.pcl_render --image_size 512

Gif of point cloud corresponding to the first image:

image

Gif of point cloud corresponding to the second image:

image

Gif of point cloud formed by the union of the first 2 point clouds:

image

5.2 Parametric Functions

Usage:

python -m submissions.src.torus_render --function parametric --image_size 512 --num_samples 500

Parametric equations of Torus:

\[\begin{aligned} x &= (R + r\cos\theta)\cos\phi \\ y &= (R + r\cos\theta)\sin\phi \\ z &= r\sin\theta \end{aligned}\]

where $\theta \in [0, 2\pi)$ and $\phi \in [0, 2\pi)$.

The major radius $R$ is the distance from the center of the tube to the center of the torus, and the minor radius $r$ is the radius of the tube.
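A minimal sketch of how these parametric equations can be sampled into a PyTorch3D point cloud; the default R and r values and the position-based coloring are illustrative:

import math
import torch
from pytorch3d.structures import Pointclouds

def sample_torus_points(R=1.0, r=0.4, num_samples=500):
    # Sample theta (around the tube) and phi (around the central axis) on a grid.
    theta = torch.linspace(0, 2 * math.pi, num_samples)
    phi = torch.linspace(0, 2 * math.pi, num_samples)
    theta, phi = torch.meshgrid(theta, phi, indexing="ij")

    x = (R + r * torch.cos(theta)) * torch.cos(phi)
    y = (R + r * torch.cos(theta)) * torch.sin(phi)
    z = r * torch.sin(theta)

    points = torch.stack((x, y, z), dim=-1).reshape(-1, 3)
    colors = (points - points.min()) / (points.max() - points.min())  # simple position-based colors
    return Pointclouds(points=[points], features=[colors])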

360-degree gif of torus, with visible hole:

image

Parametric equations of Superquadric Surface:

\[\begin{aligned} x &= a(\cos\theta)^m(\cos\phi)^n \\ y &= b(\cos\theta)^m(\sin\phi)^n \\ z &= c(\sin\theta)^m \end{aligned}\]

where $\theta \in [-\frac{\pi}{2}, \frac{\pi}{2}]$ and $\phi \in [0, 2\pi)$.

360-degree gif of Superquadric Surface:

image

5.3 Implicit Surfaces

Usage:

python -m submissions.src.torus_render --function implicit --image_size 512

Implicit equation of torus:

\[F(X,Y,Z) = (R - \sqrt{X^2+Y^2})^2 + Z^2 - r^2\]
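A minimal sketch of how this implicit function can be evaluated on a voxel grid and converted into a mesh with marching cubes; the grid bounds, resolution, and use of the mcubes package are assumptions:

import mcubes
import torch
from pytorch3d.structures import Meshes

def torus_implicit_mesh(R=1.0, r=0.4, voxel_size=64, bound=1.6):
    # Evaluate F(X, Y, Z) on a dense grid of points.
    xs = torch.linspace(-bound, bound, voxel_size)
    X, Y, Z = torch.meshgrid(xs, xs, xs, indexing="ij")
    F = (R - torch.sqrt(X ** 2 + Y ** 2)) ** 2 + Z ** 2 - r ** 2

    # Extract the zero level set, then rescale vertices from grid indices to world coordinates.
    vertices, faces = mcubes.marching_cubes(F.numpy(), 0)
    vertices = torch.tensor(vertices, dtype=torch.float32) * (2 * bound / (voxel_size - 1)) - bound
    faces = torch.tensor(faces.astype("int64"))
    return Meshes(verts=[vertices], faces=[faces])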

360-degree gif of torus, with visible hole:

image

Implicit equation of Superquadric Surface:

\[F(X,Y,Z) = \left(\left\lvert\frac{X}{a}\right\rvert^\frac{2}{n} + \left\lvert\frac{Y}{b}\right\rvert^\frac{2}{n}\right)^\frac{n}{m} + \left\lvert\frac{Z}{c}\right\rvert^\frac{2}{m} - 1\]

360-degree gif of Superquadric Surface:

image

Tradeoffs between rendering as a mesh vs a point cloud:

  • Method of Generation:
    • Point Clouds: Formed by directly sampling a parametric function.
    • Meshes: Built by voxelizing a 3D space, sampling an implicit function, and then extracting surfaces using the Marching Cubes algorithm.
  • Rendering Speed:
    • Point Clouds: Faster to render since they simply use sampled points without extra processing.
    • Meshes: Slower because they need additional steps like voxelization and surface extraction before rendering.
  • Accuracy & Visual Quality:
    • Point Clouds: More accurate at capturing fine details because each point represents a sampled location. However, they don’t have surfaces, making shading and texturing more difficult.
    • Meshes: Can be less accurate due to voxelization, but increasing the resolution can improve precision. They also provide continuous surfaces, which makes them easier to texture and shade.
  • Computational Efficiency:
    • Point Clouds: Easier to rotate, scale, and modify since they are just a collection of points.
    • Meshes: More computationally expensive to modify because updating a mesh requires adjusting vertex positions and their connections.

6. Do Something Fun

Here is a 360-degree view of a cottage, along with a dolly-zoom view:

Usage:

python -m submissions.src.fun --function full --image_size 512 --output_file submissions/cottage_render_360.gif

image

Usage:

python -m submissions.src.fun --function dolly --image_size 512 --output_file submissions/cottage_dolly.gif

image

(Extra Credit) 7. Sampling Points on Meshes

image

image

image

image

image


Assignment 2: Single View to 3D

Questions: Github Assignment 2

0. Setup

Downloaded the shapenet single-class dataset. Unzipped the dataset and set the appropriate path in dataset_location.py.

1. Exploring loss functions

1.1. Fitting a voxel grid

To align a predicted voxel grid with a target shape, I used a binary cross-entropy (BCE) loss function. A 3D voxel grid consists of 0 (empty) and 1 (occupied) values, making this a binary classification problem where we predict occupancy probabilities of each voxel.

Implementation:

import torch

def voxel_loss(voxel_src, voxel_tgt):
	# voxel_src: b x h x w x d, predicted occupancy probabilities in [0, 1]
	# voxel_tgt: b x h x w x d, ground-truth occupancy (0 or 1)

	loss = torch.nn.functional.binary_cross_entropy(voxel_src, voxel_tgt.float())

	return loss

Usage:

python fit_data.py --type 'vox'
Ground Truth | Optimized Voxel
image | image

I optimized the voxel grid for 10000 iterations.

1.2. Fitting a point cloud

Usage:

python fit_data.py --type 'point'
Ground Truth | Optimized Point Cloud
image | image
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
	# point_cloud_src, point_cloud_tgt: b x n_points x 3

	# Nearest-neighbour (K=1) squared distances in both directions
	p1 = knn_points(point_cloud_src, point_cloud_tgt)
	p2 = knn_points(point_cloud_tgt, point_cloud_src)
	loss_chamfer = torch.mean(torch.sum(p1.dists + p2.dists))

	return loss_chamfer

1.3. Fitting a mesh

Usage:

python fit_data.py --type 'mesh'
Ground Truth | Optimized Mesh
image | image
from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh_src):
	# Laplacian smoothness regularizer on the predicted mesh
	loss_laplacian = mesh_laplacian_smoothing(mesh_src)

	return loss_laplacian

2. Reconstructing 3D from single view

2.1. Image to voxel grid

Input RGB | Ground Truth Mesh | Ground Truth Voxel | Predicted 3D Voxel
image | image | image | image
image | image | image | image
image | image | image | image
image | image | image | image

I implemented the decoder architecture from the paper Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images.

For the encoder, I used the pre-trained ResNet-18 model, which computes a set of features for the decoder to recover the 3D shape of the object.

The decoder transforms the information in the 2D feature maps into a 3D volume. I implemented a slightly modified version of the Pix2Vox-F decoder from the above paper. Its input is of size [batch_size x 512] and its output is [batch_size x 32 x 32 x 32]. The decoder contains five 3D transposed convolutional layers: the first four have kernel size $4^3$, stride $2$, and padding $1$, while the last has kernel size $1^3$. Each transposed convolutional layer is followed by a LeakyReLU activation, except for the last layer, which is followed by a sigmoid. The number of output channels per layer follows the Pix2Vox-F configuration: 128 -> 64 -> 32 -> 8 -> 1.
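A minimal sketch of such a decoder; reshaping the 512-dimensional feature into a (64, 2, 2, 2) starting volume is my assumption, while the layer configuration follows the description above:

import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    # Maps a (B, 512) image feature to a (B, 32, 32, 32) occupancy grid.
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose3d(64, 128, kernel_size=4, stride=2, padding=1),  # 2 -> 4
            nn.LeakyReLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),  # 4 -> 8
            nn.LeakyReLU(),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),   # 8 -> 16
            nn.LeakyReLU(),
            nn.ConvTranspose3d(32, 8, kernel_size=4, stride=2, padding=1),    # 16 -> 32
            nn.LeakyReLU(),
            nn.ConvTranspose3d(8, 1, kernel_size=1),                          # keeps 32^3, 1 channel
            nn.Sigmoid(),
        )

    def forward(self, feat):
        x = feat.view(-1, 64, 2, 2, 2)    # unpack the 512-dim feature into a small 3D volume
        return self.layers(x).squeeze(1)  # (B, 32, 32, 32) occupancy probabilities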

I trained the model for 10000 iterations, with the default batch size of 32 and learning rate of 4e-4.

Usage:

python train_model.py --type 'vox' --max_iter 10000 --save_freq 500 

python eval_model.py --type 'vox' --load_checkpoint

2.2. Image to point cloud

Input RGB | Ground Truth Mesh | Ground Truth Point Cloud | Predicted 3D Point Cloud
image | image | image | image
image | image | image | image
image | image | image | image
image | image | image | image

I used an approach similar to the Pix2Vox-F decoder that I implemented above. The ResNet-18 model encodes the input images into feature maps, and a decoder reconstructs the 3D shape of the object from them.

The decoder takes an input of size [batch_size x 512] and produces an output of size [batch_size x n_points x 3]. The architecture comprises 4 fully connected layers, three of which are followed by a LeakyReLU activation.
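A minimal sketch of this kind of fully connected decoder; the hidden width and the choice of leaving the last layer without an activation are assumptions:

import torch
import torch.nn as nn

class PointCloudDecoder(nn.Module):
    # Maps a (B, 512) image feature to a (B, n_points, 3) point cloud.
    def __init__(self, n_points=5000, hidden=1024):
        super().__init__()
        self.n_points = n_points
        self.layers = nn.Sequential(
            nn.Linear(512, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, n_points * 3),  # final layer: raw 3D coordinates
        )

    def forward(self, feat):
        return self.layers(feat).reshape(-1, self.n_points, 3)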

I trained the model for 2000 iterations, with the default batch size of 32 and learning rate of 4e-4.

Usage:

python train_model.py --type 'point' --max_iter 2000 --save_freq 500 --n_points 5000 

python eval_model.py --type 'point' --load_checkpoint --n_points 5000

2.3. Image to mesh

Input RGB | Ground Truth Mesh | Predicted Mesh
image | image | image
image | image | image
image | image | image
image | image | image

Unlike the image-to-voxel and image-to-point-cloud cases, where the decoder predicts the output representation directly, here the mesh is constructed by deforming an initial icosphere mesh. The decoder's job is to refine this initial mesh by outputting a per-vertex displacement vector.

The decoder architecture I implemented is very similar to the image-to-point-cloud decoder described above. It takes an input of size [batch_size x 512] and produces an output of size [batch_size x num_vertices x 3]. It comprises 4 fully connected layers, three of which are followed by a ReLU activation.

I trained the model for 2000 iterations, with the default batch size of 32 and a learning rate of 4e-4.

Usage:

python train_model.py --type 'mesh' --max_iter 2000 --save_freq 500

python eval_model.py --type 'mesh' --load_checkpoint

2.4. Quantitative comparisons

F1-score curves at different thresholds:

Voxel Grid | Point Cloud | Mesh
image | image | image

From the above plots, we can infer that the point cloud model performed the best, giving the highest F1-score, followed by the mesh model and the voxel model.

Intuitively, I think the point cloud model outperformed the voxel and mesh models because it aligns well with the evaluation method, which compares points directly from the network output to the ground truth. Since point clouds don’t need to define surfaces or connectivity, they are more flexible and avoid errors caused by surface sampling. This makes them easier to optimize and more accurate in reconstruction.

The mesh model performed slightly worse primarily due to the challenges associated with sampling points from a continuous surface. Unlike point clouds, where each output point is directly predicted, meshes require proper face orientation and connectivity. Due to this, the sampled points might not always align perfectly with the ground truth, especially when dealing with complex geometries like thin structures (legs of the chair).

The voxel model has the lowest F1-score because representing 3D space at a fixed voxel resolution limits detail and accuracy. Fine details can be lost, and the sampled points may not always align perfectly with the object’s actual surface, which affects the evaluation results.

2.5. Analyze the effects of hyperparameter variations

I chose to vary the w_smooth hyperparameter and analyze the changes in the mesh model’s predictions. The default value of w_smooth is 0.1, and its results are shown in Section 2.3. I sampled 4 other values of w_smooth: 0.001, 0.01, 1, and 10.

Input RGB | Ground Truth Mesh | w_smooth=0.001 | w_smooth=0.01 | w_smooth=0.1 (Default) | w_smooth=1 | w_smooth=10
image | image | image | image | image | image | image
image | image | image | image | image | image | image
image | image | image | image | image | image | image

F1-score curves for different variations in w_smooth for the mesh model:

w_smooth=0.001 | w_smooth=0.01 | w_smooth=0.1 (Default) | w_smooth=1 | w_smooth=10
Avg F1-score@0.05 = 72.977 | Avg F1-score@0.05 = 72.133 | Avg F1-score@0.05 = 70.951 (Default) | Avg F1-score@0.05 = 71.834 | Avg F1-score@0.05 = 72.337
image | image | image | image | image

For low values of w_smooth = 0.001, 0.01:

  • They preserve fine details but introduce noise and distortions
  • The resulting surfaces are rough and fragmented
  • They show a slightly higher F1-score because they retain geometric details

For high values of w_smooth = 1, 10:

  • They produce cleaner, smoother meshes with fewer artifacts
  • They over-smooth the surface, causing a loss of sharp details
  • The F1-score still improves slightly, likely because the smoothed vertices tend to fall within the threshold radius of the ground truth

The default value of w_smooth = 0.1 falls in between the above two categories.

2.6. Interpret your model

Per-Point Error Visualization

Input RGB | Ground Truth Point Cloud | Predicted 3D Point Cloud | Per-Point Error
image | image | image | image
image | image | image | image

I used per-point error to gauge how well each predicted 3D point matches its corresponding point in the ground truth. In my approach, I compute the distance between each point in my reconstructed point cloud and its nearest neighbor in the ground truth. Then, I color-code these distances such that points with very small errors appear in cool colors (blue), while those with larger errors show up in warm colors (red).
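A minimal sketch of this per-point error computation, assuming (1, N, 3) point cloud tensors and using matplotlib's jet colormap for the cool-to-warm coloring:

import torch
import matplotlib.pyplot as plt
from pytorch3d.ops import knn_points

def per_point_error_colors(pred_points, gt_points):
    # pred_points, gt_points: (1, N, 3)
    # Distance from each predicted point to its nearest ground-truth neighbor.
    sq_dists = knn_points(pred_points, gt_points, K=1).dists.squeeze(-1)  # (1, N), squared distances
    errors = sq_dists.sqrt()
    normed = (errors - errors.min()) / (errors.max() - errors.min() + 1e-8)

    # Map normalized errors to blue (low) -> red (high) colors usable as point cloud features.
    rgb = plt.cm.jet(normed.squeeze(0).cpu().numpy())[:, :3]
    return torch.tensor(rgb, dtype=torch.float32).unsqueeze(0)  # (1, N, 3)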

From the above gifs, we can clearly see that some regions are rendered in cool tones, which tells me that my model is accurately capturing those parts of the object, such as the seat surface or the main body of the chair. On the other hand, areas highlighted in warm colors reveal where the model struggles, like along the thin chair legs or at complex curves of the backrest.

This visualization pinpoints the exact regions that need improvement.

Failure Case Analysis

In analyzing my 3D voxel model’s predictions, I noticed that while it reconstructs the backrest of chairs quite well, it struggles significantly with legs, seats, and unusual shapes. These failure cases provide valuable insight into the model’s learning behavior and what its limitations are.

  1. Legs: Chair legs vary widely in shape, thickness, and placement across different samples in the dataset. Some chairs have four standard legs, while others may have a single central support or a complex curved base. Because the model tries to generalize patterns across the dataset, it struggles to reconstruct legs consistently. Additionally, legs are usually thin and small compared to the rest of the chair, and this makes them more prone to voxelization errors.

  2. Seats: Many chair designs have gaps in them, which makes it challenging for the model to learn and also more prone to voxelization errors. Since the model tries to reconstruct a smoothed version of objects, it often fails to represent holes correctly, either closing them off entirely or introducing unexpected artifacts.

  3. Unusual shapes: Some chairs in the dataset have very unique designs. Since the model is trained on a limited dataset, it may not have seen enough similar examples to generalize well.

3. Exploring other architectures / datasets

3.3 Extended dataset for training

I updated dataset_location.py to point to the extended dataset containing three classes (chair, car, and plane).

category | #instances
airplane | 36400
car | 31460
chair | 61000
total | 128860

I trained and evaluated the point cloud model with n_points = 5000.

Quantitative evaluation:

Point Cloud trained on one class | Point Cloud trained on three classes
image | image

Qualitative evaluation comparing “training on one class” vs. “training on three classes”:

Input RGB | Ground Truth Point Cloud | Predicted 3D Point Cloud for 1 Class Training | Predicted 3D Point Cloud for 3 Classes Training
image | image | image | image
image | image | image | image
image | image | image | image

3D consistency and diversity of output samples:

Training the model on a single class, like chairs, results in more consistent and refined reconstructions. Since the model only sees one object type during training, it gets really good at capturing the details and structure unique to that class. However, this also means that the model becomes highly specialized. So when it is faced with a completely new object type (such as an airplane or car), it struggles because it hasn’t learned to handle the variation.

On the other hand, training on multiple classes (airplanes, cars, and chairs) allows the model to adapt better to different object shapes. Instead of focusing on one type, it learns general patterns that apply across different categories. This makes it more versatile when reconstructing new objects.

So in conclusion, single-class models tend to produce more uniform outputs because they have learned a very specific structural representation but lack adaptability. Multi-class models generate more diverse outputs because they have seen various object types and have learned to adapt to different shapes but at the cost of some fine-grained details.


Assignment 3: Part-1 Neural Volume Rendering

Questions: Github Assignment 3

0. Transmittance Calculation

image

1. Differentiable Volume Rendering

1.3. Ray sampling

Usage:

python volume_rendering_main.py --config-name=box

image

image

1.4. Point sampling

Usage:

python volume_rendering_main.py --config-name=box

image

1.5. Volume rendering

Usage:

python volume_rendering_main.py --config-name=box

image

image

2. Optimizing a basic implicit volume

2.1. Random ray sampling

def get_random_pixels_from_image(n_pixels, image_size, camera):
    xy_grid = get_pixels_from_image(image_size, camera)

    # Random subsampling of pixel coordinates
    xy_grid_sub = xy_grid[np.random.choice(xy_grid.shape[0], n_pixels)].to("cuda")

    return xy_grid_sub.reshape(-1, 2)[:n_pixels]

2.2. Loss and training

loss = torch.nn.functional.mse_loss(rgb_gt, out['feature'])

Usage:

python volume_rendering_main.py --config-name=train_box

Box center: (0.2502, 0.2506, -0.0005)
Box side lengths: (2.0051, 1.5036, 1.5034)

2.3. Visualization

Usage:

python volume_rendering_main.py --config-name=train_box

image

3. Optimizing a Neural Radiance Field (NeRF)

Usage:

python volume_rendering_main.py --config-name=nerf_lego

image

4. NeRF Extras

4.1 View Dependence

Usage:

python volume_rendering_main.py --config-name=nerf_materials

image

Trade-offs between increased view dependence and generalization quality:

  • Adding view dependence allows the model to capture complex lighting effects such as reflections and translucency. However, excessive view dependence can create inconsistencies when interpolating between views, which can make the rendering look unnatural.
  • If the network heavily relies on viewing direction, it may overfit to the specific camera angles in the training data. This can lead to poor generalization to unseen viewpoints.
  • It can increase the network’s complexity, requiring more parameters and training time.

Assignment 3: Part-2 Neural Surface Rendering

5. Sphere Tracing

My implementation of the SphereTracingRenderer class uses the sphere tracing algorithm to find intersections between rays and an implicit surface of a torus defined by a signed distance field. The algorithm iteratively updates points along each ray by moving in the direction of the ray by the amount of the SDF value at the current point. This process continues until the maximum number of iterations is reached or the SDF value becomes very close to zero (threshold of 1e-6), indicating a surface intersection.
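The core update, in isolation, looks roughly like the sketch below; the real SphereTracingRenderer also handles ray bundles and bookkeeping, and sdf_fn stands in for the torus SDF:

import torch

def sphere_trace(origins, directions, sdf_fn, max_iters=64, eps=1e-6):
    # origins, directions: (N, 3); sdf_fn maps (N, 3) points to (N, 1) signed distances.
    points = origins.clone()
    for _ in range(max_iters):
        dist = sdf_fn(points)                # signed distance at the current points
        points = points + dist * directions  # march each ray forward by its SDF value
    mask = sdf_fn(points).abs() < eps        # rays that converged onto the surface
    return points, mask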

Usage:

python -m surface_rendering_main --config-name=torus_surface

image

6. Optimizing a Neural SDF

Usage:

python -m surface_rendering_main --config-name=points_surface
Input | lr=0.0001 | lr=0.001 | lr=0.00001
image | image | image | image
Loss | 0.001279 | 0.000428 | 0.001635

eikonal_loss = ((gradients.norm(2, dim=1) - 1.0) ** 2).mean()

Implementation:

Input (XYZ points) -> Harmonic Embedding -> Layer 1 (Linear + ReLU)
                      -> Layer 2 (Linear + ReLU)
                      -> Layer 3 (Linear + ReLU)
                      -> ...
                      -> Layer N (Linear + ReLU)
                      -> Linear SDF (Output: Signed Distance Function)

7. VolSDF

Usage:

python -m surface_rendering_main --config-name=volsdf_surface
  • Alpha: Scales the overall density. A higher value increases the density, while a lower value reduces it.
  • Beta: Controls how quickly the density changes with distance from the surface. A smaller beta results in a sharper transition, while a larger beta smooths the transition.
def sdf_to_density(signed_distance, alpha, beta):
    return torch.where(
            signed_distance > 0,
            0.5 * torch.exp(-signed_distance / beta),
            1 - 0.5 * torch.exp(signed_distance / beta),
        ) * alpha
Geometry | Result
image | image
Loss: 0.006958

The above renders are for values alpha=10.0 and beta=0.05.

When alpha=10.0 and beta is changed:

beta | beta=0.05 | beta=0.1 | beta=0.5
Geometry | image | image | image
Render | image | image | image
Loss | 0.006958 | 0.010227 | 0.020789

When beta=0.05 and alpha is changed:

alpha | alpha=1 | alpha=10 | alpha=50
Geometry | image | image | image
Render | image | image | image
Loss | 0.022317 | 0.006958 | 0.004329

How does high beta bias your learned SDF? What about low beta?

High beta makes the transition between occupied and free space more gradual, leading to a smoother SDF. This can cause a bias where surfaces appear more diffused rather than sharp.

Low beta results in a sharper transition, meaning the SDF will be more precise in distinguishing surfaces, but it can also lead to unstable gradients and more difficult optimization.

Would an SDF be easier to train with volume rendering and low beta or high beta? Why?

An SDF is generally easier to train with volume rendering when using a high beta. This is because high beta values cause a larger number of points along each ray to have non-zero density, allowing gradients to be backpropagated through more points simultaneously. This leads to denser gradients and faster convergence during training.

Training with a low beta can be more challenging because it forces the network to learn very sharp transitions, which means only points very close to the surface contribute significantly to the rendering. This can lead to sparse gradients and slower convergence.

Would you be more likely to learn an accurate surface with high beta or low beta? Why?

You are more likely to learn an accurate surface with a low beta. A low beta encourages sharp boundaries and a more precise surface representation, as the density function closely approximates a step function. High beta values, on the other hand, lead to smoother surfaces, which can be less accurate.

Implementation:

Input (SDF Feature + XYZ Embedding) -> Layer 1 (Linear + ReLU)  
                      -> Layer 2 (Linear + ReLU)  
                      -> Layer 3 (Linear + ReLU)  
                      -> ...  
                      -> Layer N (Linear + ReLU)  
                      -> Linear RGB (Output: 3D Color Prediction)  

8. Neural Surface Extras

8.3 Alternate SDF to Density Conversions

Logistic density distribution function:

\[\phi_s(x) = \frac{se^{-sx}}{(1+e^{-sx})^2}\]
def neus_sdf_to_density(signed_distance, s):
    return s * torch.exp(-s * signed_distance) / ((1 + torch.exp(-s * signed_distance))**2) 
s | s=10 | s=50
Geometry | image | image
Render | image | image
Loss | 0.005590 | 0.006529

A low s results in a blurrier render, while a higher s makes it look sharper.


Assignment 4

Questions: Github Assignment 4

1. 3D Gaussian Splatting

1.1 3D Gaussian Rasterization

Run:

python render.py --gaussians_per_splat 1024

Output GIF:

image

1.2 Training 3D Gaussian Representations

Run:

python train.py --gaussians_per_splat 2048 --device cuda

I modified the run_training function to improve performance and reduce CUDA memory usage. Specifically, I:

  • Reduced the number of Gaussians from 10k to 7k by subsampling the points file, lowering the overall GPU memory footprint.

  • Ensured key Gaussian parameters (opacities, scales, colours, and means) were trainable and set up an optimizer with different learning rates for each parameter group (as mentioned in 1.2.1)

  • Wrapped the forward pass in autocast and used GradScaler to scale the loss, which both reduced memory usage and accelerated computation

I called the scene.render method to generate the predicted image from the input camera parameters, using the specified image size, background color, and the number of Gaussians per splat. I then computed the L1 loss between the rendered (predicted) image and the ground truth image.
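A minimal sketch of what one such mixed-precision training iteration can look like; the data access pattern and the exact scene.render call are placeholders, not the actual starter-code API:

import torch

scaler = torch.cuda.amp.GradScaler()

for itr in range(num_iters):  # num_iters, cameras, gt_images, optimizer, scene: assumed to exist
    camera, gt_image = cameras[itr % len(cameras)], gt_images[itr % len(cameras)]
    optimizer.zero_grad()

    with torch.cuda.amp.autocast():        # mixed-precision forward pass
        pred_image = scene.render(camera)  # illustrative call; the real signature takes more arguments
        loss = torch.nn.functional.l1_loss(pred_image, gt_image)

    scaler.scale(loss).backward()          # scale the loss before backprop
    scaler.step(optimizer)
    scaler.update()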

Learning rates that I used for each parameter:

  • pre_act_opacities = 0.001
  • pre_act_scales = 0.001
  • colours = 0.02
  • means = 0.0002

Number of iterations that I trained the model for = 1000

Mean PSNR: 27.356

Mean SSIM: 0.915

Training final render GIF:

image

Training progress GIF:

image

1.3 Extensions

1.3.1 Rendering Using Spherical Harmonics

Run:

python render.py --gaussians_per_splat 1024

GIF:

Original | Spherical Harmonics
image | image

RGB image comparisons of the renderings obtained from both the cases:

Frame | Original | Spherical Harmonics
000 | image | image
015 | image | image
021 | image | image
030 | image | image

Differences that can be observed:

  • Frames 000 and 015: The spherical harmonics rendering looks more photorealistic because the angular variations in color better capture how the material responds to illumination from different directions.
  • Frames 021 and 030: The spherical harmonics rendering looks more glossy and reflective because the added directional sensitivity leads to more dynamic and detailed shading.

2. Diffusion-guided Optimization

2.1 SDS Loss + Image Optimization

Run:

python Q21_image_optimization.py --sds_guidance 1
Prompt | Without Guidance | With Guidance
Iterations | 400 | 2000
“a hamburger” | image | image
“a standing corgi dog” | image | image
“a fish in a pan” | image | image
“a mansion” | image | image

2.2 Texture Map Optimization for Mesh

Run:

python Q22_mesh_optimization.py

In order to reduce the CUDA memory footprint, I reduced the image size to 256x256.

Prompt | Initial Mesh GIF | Final Mesh GIF
“a tiger” | image | image
“a zebra” | image | image

2.3 NeRF Optimization

I performed no CUDA memory optimization here. All values were kept at their defaults as pulled from GitHub.

Prompt: “a standing corgi dog”

Run:

python Q23_nerf_optimization.py --prompt "a standing corgi dog" --lambda_entropy 1e-2 --lambda_orient 1e-2 --latent_iter_ratio 0.1 

Rendered RGB video:

Rendered depth video:

Prompt: “a hamburger”

Run:

python Q23_nerf_optimization.py --prompt "a hamburger" --iters 2500 --lambda_entropy 1e-3 --lambda_orient 1e-2 --latent_iter_ratio 0.2 

Rendered RGB video:

Rendered depth video:

Prompt: “a mansion”

Run:

python Q23_nerf_optimization.py --prompt "a mansion" --iters 2500 --lambda_entropy 1e-2 --lambda_orient 1e-2 --latent_iter_ratio 0.1 

Rendered RGB video:

Rendered depth video:

2.4 Extensions

2.4.1 View-dependent text embedding

Prompt: “a standing corgi dog”

Run:

python Q23_nerf_optimization.py --prompt "a standing corgi dog" --lambda_entropy 1e-2 --lambda_orient 1e-2 --latent_iter_ratio 0.1 --view_dep_text 1

Rendered RGB video:

Rendered depth video:

Prompt: “a hamburger”

Run:

python Q23_nerf_optimization.py --prompt "a hamburger" --iters 2500 --lambda_entropy 1e-3 --lambda_orient 1e-2 --latent_iter_ratio 0.2 --view_dep_text 1

Rendered RGB video:

Rendered depth video:

Comparing with 2.3:

In 2.3, the NeRF model used one fixed text embedding for all views. This led to consistent but somewhat flat results and sometimes produced artifacts like multiple front faces (standing corgi dog example) because the model didn’t know how to adjust for different angles.

With view-dependent text conditioning, different embeddings (like front, side, and back) are used based on the camera angle. This helps the model adjust lighting, highlights, and reflections for each view, resulting in more realistic and 3D-consistent images.

The “hamburger” looks really good with view-dependent conditioning, while the “standing corgi dog” did not turn out as well, likely because it needs more training to capture all its details.


Assignment 5

Questions: Github Assignment 5

1. Classification Model

Run:

python train.py --task cls

python eval_cls.py

The model was trained for 250 epochs, and the best model was saved at epoch 115.

Train loss = 1.1288

Test accuracy of best model = 0.9696

Visualization of a few random test point clouds and their predicted classes:

Correct Prediction | Chairs | Vases | Lamps
Point Cloud | image | image | image
Point Cloud | image | image | image

Visualization of a failure prediction for each class:

Correct Label | Vase | Lamp | Vase
Point Cloud | image | image | image
Incorrect Prediction | Lamp | Vase | Lamp

The successful results show that the model is able to pick out important features. For example, it correctly identifies the distinct structure of chairs, vases, and lamps. But in some cases, the shapes can be ambiguous or similar, which causes the model to misclassify objects. This can happen when the point sampling misses some important details or when the features are too similar between classes.

2. Segmentation Model

Run:

1
2
3
python train.py --task seg

python eval_seg.py

The model was trained for 250 epochs, and the best model was saved at epoch 225.

Train loss = 13.0442

Test accuracy of best model = 0.8966

Visualization of correct segmentation predictions along with their corresponding ground truth:

Accuracy | 0.9317 | 0.9147 | 0.9308
Ground Truth | image | image | image
Good Segmentation Predictions | image | image | image

Visualization of incorrect segmentation predictions along with their corresponding ground truth:

Accuracy | 0.5604 | 0.5572 | 0.4702
Ground Truth | image | image | image
Bad Segmentation Predictions | image | image | image

The good segmentation predictions show that the model can distinguish the chair’s seat, back, and legs fairly accurately. In the failure cases, we can see that certain parts, such as the seat or back, merge with the legs or other regions. A possible reason is that if the training data includes ambiguous or poorly differentiated boundaries, the model may struggle to learn to differentiate segments in these areas.

3. Robustness Analysis

3.1 Rotate Input Point Clouds

3.1.1 Classification Model

Visualization of a few random test point clouds and their predicted classes:

Angle | Accuracy | Chairs | Vases | Lamps
10 degrees | 0.958 | Successful image | Successful image | Successful image
10 degrees | 0.958 | Successful image | Successful image | Successful image
20 degrees | 0.894 | Successful image | Successful image | Successful image
20 degrees | 0.894 | Wrongly predicted as lamp image | Successful image | Successful image
30 degrees | 0.679 | Successful image | Successful image | Successful image
30 degrees | 0.679 | Wrongly predicted as lamp image | Wrongly predicted as lamp image | Successful image
60 degrees | 0.299 | Wrongly predicted as vase image | Successful image | Wrongly predicted as chair image
60 degrees | 0.299 | Wrongly predicted as lamp image | Wrongly predicted as lamp image | Successful image

From the observations above, we see that the model classifies most rotated objects correctly when the rotation is small (10 or 20 degrees). However, as the rotation angle becomes larger (30 or 60 degrees), the accuracy drops and the model starts confusing objects. This happens because the model wasn’t trained with any rotational variations and so the learned features become orientation-dependent.

3.1.2 Segmentation Model

Angle | Model | X | X | X | X
10 degrees | Ground Truth | image | image | image | image
10 degrees | Prediction | Success: 0.9185 image | Success: 0.9134 image | Success: 0.9055 image | Failure: 0.4798 image
20 degrees | Ground Truth | image | image | image | image
20 degrees | Prediction | Success: 0.9396 image | Success: 0.9191 image | Success: 0.9307 image | Failure: 0.5906 image
30 degrees | Ground Truth | image | image | image
30 degrees | Prediction | Success: 0.9093 image | Success: 0.9429 image | Failure: 0.4707 image
60 degrees | Ground Truth | image | image | image
60 degrees | Prediction | Failure: 0.4707 image | Failure: 0.5283 image | Failure: 0.5478 image

From the observations above, we see that the model segments most portions of the rotated objects well when the rotation is small (10 or 20 degrees). However, as the rotation angle becomes larger (30 or 60 degrees), we see a massive accuracy drop. This happens because the model wasn’t trained with any rotational variations and so the learned features become orientation-dependent. The rotations make it difficult for the network to correctly differentiate between segments.

3.2 Different Number of Points per Object

3.2.1 Classification Model

Number of Points | Accuracy | Chairs | Vases | Lamps
5000 | 0.968 | Successful image | Successful image | Successful image
1000 | 0.9695 | Successful image | Successful image | Successful image
500 | 0.9622 | Successful image | Successful image | Successful image
100 | 0.882 | Successful image | Successful image | Successful image

We see that the classification model performs well even with fewer points. The drop in accuracy when the number of points is reduced to 100 or below is expected because fewer points reduce geometric detail, which makes it harder to distinguish specific features. Still, even at 100 points, the outline structure is preserved, allowing the model to classify successfully in many cases.

3.2.2 Segmentation Model

Number of Points | Model | X | X | X | X
5000 | Ground Truth | image | image | image | image
5000 | Prediction | Success: 0.9448 image | Success: 0.992 image | Success: 0.9658 image | Failure: 0.5366 image
1000 | Ground Truth | image | image | image | image
1000 | Prediction | Success: 0.952 image | Success: 0.988 image | Success: 0.904 image | Failure: 0.515 image
500 | Ground Truth | image | image | image | image
500 | Prediction | Success: 0.934 image | Success: 0.998 image | Success: 0.926 image | Failure: 0.484 image
100 | Ground Truth | image | image | image | image
100 | Prediction | Success: 0.92 image | Success: 0.96 image | Success: 0.91 image | Failure: 0.58 image

We see that the segmentation model performs well even with fewer points. While the model performs well even with 500 points, the performance becomes more unstable at 100 points, especially in complex or ambiguous regions. This is because fewer points reduce the structural detail, making it harder for the model to distinguish fine boundaries and specific object features like thin legs and armrests.

4. Bonus Question - Locality

I implemented a general Transformer framework for both the classification model and the segmentation model.

The classification model predicts a single label for an entire point cloud. Each point’s 3D coordinates are passed through two linear layers: one to embed the input features, and another to learn positional information. These two embeddings are added together and fed into a standard Transformer Encoder. After processing, I apply max pooling across all points to extract a global feature vector. This is then passed through fully connected layers with batch normalization, ReLU activation, and dropout. The log softmax over the class scores is then returned.
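A minimal sketch of this classification architecture; the embedding width, number of heads and layers, and head dimensions are assumptions:

import torch
import torch.nn as nn

class ClsTransformer(nn.Module):
    # Point cloud classification with a standard Transformer encoder.
    def __init__(self, num_classes=3, d_model=128, nhead=4, num_layers=4):
        super().__init__()
        self.feat_embed = nn.Linear(3, d_model)  # embed the raw XYZ coordinates
        self.pos_embed = nn.Linear(3, d_model)   # learned positional information from XYZ
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, points):                   # points: (B, N, 3)
        x = self.feat_embed(points) + self.pos_embed(points)
        x = self.encoder(x)                      # (B, N, d_model)
        x = x.max(dim=1).values                  # max pool over points -> global feature
        return torch.log_softmax(self.head(x), dim=-1)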

The segmentation model outputs a class label for each point. I embed the 3D points and add positional information before passing them through a Transformer Encoder. Instead of pooling the features, I use 1D convolution layers to generate per-point predictions. These outputs are also passed through a log softmax to get per-point class probabilities.

4.1 Classification Model using Transformers

Run:

python train.py --task cls

python eval_cls.py --load_checkpoint best_model --output_dir "./output/Q4/cls"

The model was trained for 100 epochs, and the best model was saved at epoch 18.

Train loss = 30.6177

Test accuracy of best model = 0.9454

Visualization of a few random test point clouds and their predicted classes using Transformers:

Correct PredictionChairsVasesLamps
Point Cloudimageimageimage
Point Cloudimageimageimage

Visualization of a failure prediction for each class using Transformers:

Correct LabelVaseLampLamp
Point Cloudimageimageimage
Incorrect PredictionLampVaseChair

The model was run for only 100 epochs because training was taking a long time. I also did not have time to run the segmentation Transformer model; however, its training started without errors and would only have needed to be left running for ~8 hours on my laptop.

Moreover, the Transformer model reached a high accuracy of 94% after training for just 100 epochs, while the classification model without a Transformer reached 97% after 250 epochs.
