3D CAD Generation
CAD Model Generation using Vision-Language Models (VLMs)
This project explores CAD model generation with Vision-Language Models (VLMs) by translating natural-language requirements into parametric, editable CAD programs rather than static meshes. A common limitation of generative 3D outputs is that they can look plausible yet fail to meet practical design needs such as dimensional control, symmetry, manufacturability, and easy iteration.
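For instance, a small CadQuery program (an illustrative sketch, not a model from the project) keeps dimensions, symmetry, and features directly editable:

```python
import cadquery as cq

# Illustrative parametric plate: every dimension is a named parameter,
# so an edit like "make it 10 mm wider" is a one-line change.
length, width, thickness = 80.0, 40.0, 6.0
hole_dia, fillet_r = 5.0, 3.0

result = (
    cq.Workplane("XY")
    .box(length, width, thickness)
    .edges("|Z")
    .fillet(fillet_r)  # uniform corner fillets keep the part symmetric
    .faces(">Z")
    .workplane()
    .rect(length - 16, width - 16, forConstruction=True)
    .vertices()
    .hole(hole_dia)  # four mounting holes placed symmetrically
)
```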
Overview
We designed and implemented an end-to-end pipeline that turns a user’s prompt and target media (image/video) into a structured CAD program, then improves it through iterative evolution and feedback.
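Each candidate program has to be executed before it can be rendered and scored. A minimal sketch of that step, assuming the convention (used later in the evolution loop) that generated programs assign their final shape to `result`:

```python
import cadquery as cq

def execute_cad_program(code: str) -> cq.Workplane:
    """Execute a generated CadQuery program and return the shape it
    assigns to `result`.

    NOTE: generated code is untrusted; in practice it should run in a
    sandboxed subprocess rather than a bare exec().
    """
    namespace = {"cq": cq, "cadquery": cq}
    exec(code, namespace)
    shape = namespace.get("result")
    if shape is None:
        raise ValueError("generated program did not define `result`")
    return shape
```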
Fitness / Scoring
We score each candidate using an LLM-based visual critique conditioned on the target media and six canonical renders (top, bottom, front, back, left, right).
- Input: target (image/video) + 6 rendered views (PNG) + a structured rubric prompt
- Output (strict JSON):
  - `keep`: what already matches the target
  - `improve`: concrete, actionable edits to reduce mismatch
  - `score` (1–10): overall resemblance using the full scale
The `score` is used as the primary fitness signal, while `improve` provides targeted guidance for the next generation (e.g., proportions, feature placement, curvature/fillets, thickness/clearances).
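A sketch of this scoring step is below. `call_vlm` is a placeholder for whatever multimodal API the critic runs on, and while the pipeline produces PNG renders, CadQuery's built-in SVG exporter (with its `projectionDir` option) stands in here so the example stays self-contained:

```python
import json
import cadquery as cq
from cadquery import exporters

# Six canonical view directions: top, bottom, front, back, left, right.
VIEWS = {
    "top": (0, 0, 1), "bottom": (0, 0, -1),
    "front": (0, -1, 0), "back": (0, 1, 0),
    "left": (-1, 0, 0), "right": (1, 0, 0),
}

def render_views(shape: cq.Workplane, stem: str = "candidate") -> list[str]:
    """Render one image per canonical direction (SVG as an illustrative
    stand-in for the pipeline's PNG renders)."""
    paths = []
    for name, direction in VIEWS.items():
        path = f"{stem}_{name}.svg"
        exporters.export(shape, path, opt={"projectionDir": direction})
        paths.append(path)
    return paths

RUBRIC = (
    "Compare the candidate renders against the target. Reply with strict "
    'JSON only: {"keep": "...", "improve": "...", "score": <1-10>}. '
    "Use the full 1-10 scale."
)

def critique(target_media: str, view_paths: list[str], call_vlm) -> dict:
    """call_vlm(prompt, images) is a placeholder for the multimodal call."""
    raw = call_vlm(prompt=RUBRIC, images=[target_media, *view_paths])
    report = json.loads(raw)  # strict JSON is required of the critic
    if not {"keep", "improve", "score"} <= report.keys():
        raise ValueError("critique is missing required keys")
    return report
```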
Evolutionary Refinement (Top-2 Guided Evolution)
We refine designs with a simple top-2 loop driven by critique scores.
- Select parents (Top-2): pick the two highest-scoring candidates from critique JSONs.
- Build context: provide the target media, original prompt, and both parents’ artifacts (CadQuery code + 6-view renders).
- Generate offspring: the model outputs a new executable CadQuery program (one that defines `result`) per call; we sample multiple offspring in parallel for diversity.
- Iterate: render and critique the new candidates, then repeat the loop for further improvements (see the sketch after this list).
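One generation of this loop might look like the following sketch; `generate_offspring` is a placeholder for the LLM call that receives both parents' code and renders, and the `critiques` record layout is assumed for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def top2_evolution_step(critiques: list[dict], generate_offspring,
                        n_offspring: int = 4) -> list[str]:
    """Run one generation of top-2 guided evolution.

    critiques: records like {"code": <CadQuery source>, "views": [...],
        "report": {"keep": ..., "improve": ..., "score": ...}}.
    generate_offspring(parents): placeholder for the LLM call that is
        given both parents' artifacts and returns a new CadQuery program.
    """
    # Select parents: the two highest-scoring candidates.
    parents = sorted(critiques, key=lambda c: c["report"]["score"], reverse=True)[:2]
    # Sample several offspring in parallel for diversity.
    with ThreadPoolExecutor(max_workers=n_offspring) as pool:
        futures = [pool.submit(generate_offspring, parents) for _ in range(n_offspring)]
        return [f.result() for f in futures]
```

Each offspring is then rendered and critiqued as in the scoring step above, and the loop repeats until the score plateaus or the iteration budget is spent.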