Abhishek Bagade's blog

This is Abhishek Bagade's blog, where I maintain my hobby projects and other notes. I try to add posts related to GATE, Arduino projects, and general ML side projects.

Image and Text to Scale-Accurate Parametric 3D Models

AI-assisted research synthesis. Verify critical claims with primary sources.

Status: Completed
Last updated: 2026-04-19T18:48:28+05:30
Mode: method-survey

Summary

Overview

This note surveys how to build a system that converts either:

  1. images with scale information, or
  2. natural language requests

into scale-accurate, editable parametric 3D models.

The target here is not generic mesh output. It is closer to real CAD: B-rep solids, CAD command programs, or other structured parametric representations that can be modified, measured, and manufactured.

That distinction matters. A large fraction of “text-to-3D” and “image-to-3D” progress is aimed at meshes, NeRFs, Gaussian splats, or visually plausible shapes. Those are useful for graphics, but they are not good enough when you need dimensioned, editable parts with manufacturable geometry.

Background

There are now several overlapping research directions:

The central difficulty is that parametric CAD combines:

This makes CAD much less forgiving than code generation or mesh generation.

For your use case, there are three especially important distinctions:

  1. Image-to-3D is not image-to-CAD. A good mesh prior is not the same as a good parametric model.
  2. Single-view estimation is not scale-accurate reverse engineering. Scale needs explicit cues or calibrated measurement.
  3. Assembly/edit tasks are not object generation tasks. “Battery cover with a GoPro mount” is better framed as template retrieval + constrained composition than as fully novel generation.

Core Analysis

Problem framing

You are really asking for one system with at least three subproblems:

  1. Geometry recovery
    • infer object shape from words or images
  2. Scale recovery
    • infer true dimensions from cues such as fiducials, known objects, camera calibration, or user measurements
  3. Parametric reconstruction or composition
    • turn the shape into editable CAD operations, dimensions, sketches, and feature relationships

These can be solved in different ways depending on the input mode.

Pipeline mental model

The most practical pipeline today is:

  1. Interpret the input
    • text prompt, one or more photos, optional reference part IDs, optional known dimensions
  2. Estimate or anchor scale
    • fiducial marker, ruler, ArUco tag, known dimension, camera intrinsics, or multi-view reconstruction
  3. Classify the task type
    • pure generation, reverse engineering, edit existing part, or compose existing parts
  4. Retrieve priors
    • reference CAD models, part libraries, object templates, feature libraries, mounting standards
  5. Synthesize CAD program or parametric structure
    • CadQuery, FreeCAD Python, OpenSCAD, STEP-oriented structured representation, or patch-based CAD decoder
  6. Execute and validate
    • compile the geometry, check solid validity, measure dimensions, compare against image/text constraints
  7. Repair or clarify
    • ask for missing dimensions, fix invalid geometry, or refine with visual/numeric feedback

This can be summarized as:

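\(\hat{P} = \operatorname*{argmax}_P \; \text{Fit}(P, X, T, S, R) \quad \text{s.t. } \text{valid}(P)=1\)

Plain English: choose the CAD program or parametric model $P$ that best fits the images, text, scale cues, and retrieved references, while remaining geometrically valid.

Variables: $P$ is the candidate CAD program or parametric model; $X$, $T$, $S$, and $R$ are, in the order the plain-English reading lists them, the images, the text, the scale cues, and the retrieved references; $\text{valid}(P)=1$ means the program produces geometrically valid output.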
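As a minimal sketch of the selection rule above, assume each candidate program has already been executed and scored. The `Candidate` class, its `fit` score, and its `valid` flag are hypothetical names used only for illustration; how the fit is actually computed is left open here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate CAD program P."""
    name: str    # label for the program
    fit: float   # stands in for Fit(P, X, T, S, R); higher is better
    valid: bool  # stands in for valid(P) == 1, e.g. the kernel built a solid


def select_best(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Return the highest-fit candidate among the valid ones, or None."""
    valid_only = [c for c in candidates if c.valid]
    return max(valid_only, key=lambda c: c.fit, default=None)


candidates = [
    Candidate("a", fit=0.91, valid=False),  # best fit, but invalid geometry
    Candidate("b", fit=0.84, valid=True),
    Candidate("c", fit=0.62, valid=True),
]
best = select_best(candidates)
print(best)
```

The point of the hard filter is that validity is a constraint, not a term in the score: an invalid program never wins, however well it matches the inputs.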

A simple scale anchoring equation looks like:

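\(\text{scale} = \frac{d_{\text{real}}}{d_{\text{image}}} \cdot f\)

Plain English: the global scale can be estimated from a known real-world distance, the corresponding image measurement, and camera calibration terms.

Variables: $d_{\text{real}}$ is the known real-world distance, $d_{\text{image}}$ is the corresponding measurement in the image, and $f$ stands for the camera calibration terms.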

The exact implementation depends on whether you use calibrated monocular geometry, multi-view reconstruction, or fiducial pose estimation.
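A minimal sketch of the simplest case, assuming a single image in which a reference of known length (a ruler segment, say) lies in the same plane as the feature being measured and that plane is roughly parallel to the sensor. The function names and the numbers are illustrative assumptions. The second function uses the standard pinhole relation Z = f · D / d, with the focal length expressed in pixels, to recover the distance to the reference.

```python
def mm_per_pixel(d_real_mm: float, d_image_px: float) -> float:
    """Scale factor from one known distance (same-plane, fronto-parallel assumption)."""
    if d_real_mm <= 0 or d_image_px <= 0:
        raise ValueError("distances must be positive")
    return d_real_mm / d_image_px


def depth_from_reference(focal_px: float, d_real_mm: float, d_image_px: float) -> float:
    """Pinhole relation Z = f * D / d for a fronto-parallel reference of size D."""
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    return focal_px * mm_per_pixel(d_real_mm, d_image_px)


# Illustrative numbers: a 50 mm ruler segment spans 400 px in the photo.
scale = mm_per_pixel(50.0, 400.0)        # mm per pixel in the reference plane
feature_mm = 240.0 * scale               # a 240 px feature in that same plane
depth_mm = depth_from_reference(1000.0, 50.0, 400.0)  # assumed focal length of 1000 px
print(scale, feature_mm, depth_mm)
```

Any feature that is out of the reference plane, or any noticeable perspective tilt, breaks the same-plane assumption, which is where fiducial pose estimation or multi-view reconstruction comes in.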

Method families

1. Script-first text-to-CAD generation

What it does: maps text into executable CAD code such as CadQuery, FreeCAD Python, or OpenSCAD.

How it helps: gives you editable parametric outputs immediately and lets you verify geometry by execution.

Why it exists: LLMs already know Python reasonably well, and script-based CAD languages are much easier to target than opaque CAD kernels or raw B-rep graphs.
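To make the script-first idea concrete, here is a small sketch that renders a CadQuery script from explicit dimensions and checks only its Python syntax. The plate-with-hole part and the `render_plate_script` helper are hypothetical. The generated script uses CadQuery's `Workplane.box`, `faces`, `workplane`, and `hole` calls, but it is not executed here, so this sketch does not need CadQuery installed.

```python
def render_plate_script(length_mm: float, width_mm: float,
                        thickness_mm: float, hole_diameter_mm: float) -> str:
    """Return CadQuery source text for a rectangular plate with a centered through-hole."""
    dims = (length_mm, width_mm, thickness_mm, hole_diameter_mm)
    if min(dims) <= 0:
        raise ValueError("all dimensions must be positive")
    if hole_diameter_mm >= min(length_mm, width_mm):
        raise ValueError("hole does not fit inside the plate")
    return (
        "import cadquery as cq\n"
        f"length, width, thickness = {length_mm!r}, {width_mm!r}, {thickness_mm!r}\n"
        f"hole_diameter = {hole_diameter_mm!r}\n"
        "result = (\n"
        "    cq.Workplane('XY')\n"
        "    .box(length, width, thickness)\n"
        "    .faces('>Z').workplane()\n"
        "    .hole(hole_diameter)\n"
        ")\n"
    )


script = render_plate_script(60.0, 40.0, 3.0, 5.0)
compile(script, "<generated>", "exec")  # syntax check only; nothing is executed
print(script)
```

Keeping the dimensions as named variables at the top of the generated script is what makes the output editable: changing a number and re-running the script regenerates the part.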

What it is good at:

What it does not solve well:

Representative work:

2. Image-to-CAD through factorization

What it does: splits image-to-CAD into subproblems such as discrete structure prediction plus continuous parameter prediction.

How it helps: reduces the difficulty of directly predicting a full CAD program from pixels.

Why it exists: CAD outputs mix discrete and continuous structure, which is hard to learn end-to-end.
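One way to picture the factorization is as two stages that meet at a shared data structure: a discrete stage proposes the operation sequence with named parameter slots, and a continuous stage fills the slots with numbers. The sketch below is an assumption about how such a hand-off could look, not the representation any particular paper uses; `OpTemplate` and `bind_parameters` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OpTemplate:
    """Discrete structure: an operation type plus the parameter slots it needs."""
    op: str
    slots: Tuple[str, ...]


def bind_parameters(structure: List[OpTemplate],
                    values: Dict[str, float]) -> List[dict]:
    """Fill each operation's slots with continuous values predicted separately."""
    program = []
    for template in structure:
        missing = [s for s in template.slots if s not in values]
        if missing:
            raise KeyError(f"{template.op} is missing values for {missing}")
        step = {"op": template.op}
        step.update({s: values[s] for s in template.slots})
        program.append(step)
    return program


# Stage 1 output (discrete): which operations, in which order.
structure = [OpTemplate("sketch_rect", ("width", "height")),
             OpTemplate("extrude", ("depth",))]
# Stage 2 output (continuous): numeric values for the slots, in mm.
values = {"width": 40.0, "height": 25.0, "depth": 8.0}
program = bind_parameters(structure, values)
print(program)
```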

What it is good at:

What it does not solve well:

Representative work:

3. Direct B-rep or CAD-native generation

What it does: generates CAD-native geometry directly, such as B-rep structures or editable surface patches.

How it helps: avoids lossy mesh intermediates and can produce cleaner engineering-style geometry.

Why it exists: meshes are insufficient for editable CAD and manufacturing workflows.

What it is good at:

What it does not solve well:

Representative work:

4. Agent-aided CAD generation

What it does: uses an LLM or multimodal model as a planner/coder controlling a real CAD engine, then validates and repairs the result using programmatic checks.

How it helps: moves the burden from latent memorization to tool use, retrieval, validation, and iterative correction.

Why it exists: CAD errors are often easier to detect with a compiler, kernel measurements, and render checks than with a purely learned loss.
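The generate, validate, repair cycle can be sketched as a bounded loop. In this toy version the validator only checks Python syntax and the reviser is a hard-coded stub; in a real harness the validator would run the CAD kernel and the reviser would be a model call. All names here are hypothetical.

```python
from typing import Callable, List, Tuple

Validator = Callable[[str], List[str]]     # returns failure messages, empty if OK
Reviser = Callable[[str, List[str]], str]  # returns revised code


def repair_loop(code: str, validate: Validator, revise: Reviser,
                max_rounds: int = 3) -> Tuple[str, List[str], int]:
    """Revise code until validation passes or the round budget is spent."""
    failures = validate(code)
    rounds = 0
    while failures and rounds < max_rounds:
        code = revise(code, failures)
        failures = validate(code)
        rounds += 1
    return code, failures, rounds


def syntax_validator(code: str) -> List[str]:
    """Toy stand-in for kernel execution: report Python syntax errors only."""
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError as exc:
        return [f"SyntaxError: {exc.msg}"]
    return []


def stub_reviser(code: str, failures: List[str]) -> str:
    """Toy stand-in for a model call: close one missing parenthesis."""
    return code + ")"


final_code, remaining, rounds_used = repair_loop(
    "x = (1 + 2", syntax_validator, stub_reviser)
print(final_code, remaining, rounds_used)
```

Passing the exact failure messages to the reviser, and capping the number of rounds, are the two design choices that matter: the first gives the model something concrete to fix, the second stops a loop that is not converging.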

What it is good at:

What it does not solve well:

Representative work:

Representative methods

Text-to-CadQuery and CAD-Coder

These are among the clearest signals that code-first CAD generation is a serious path. Both target CadQuery, which is important because it is Pythonic, executable, and measurable. CAD-Coder adds chain-of-thought and geometric reward, while Text-to-CadQuery shows that fine-tuning larger code-capable models materially improves performance.

This is a strong argument that if your target output is parametric CAD, choosing the right output language matters more than trying to train a generic 3D model first.

ProCAD and clarification-first systems

ProCAD is especially relevant to your use case because natural language CAD requests are often incomplete. For example, “Xbox controller battery cover with a GoPro mount” leaves many questions open:

The paper’s core insight is correct: do not hallucinate missing dimensions. Ask or infer only where justified.
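A clarification-first step can be as simple as refusing to default any required field. The sketch below is an illustration of that idea, not ProCAD's method; the field names and question wording are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical required fields for the battery-cover example, with the
# question to ask when a field is missing.
REQUIRED_FIELDS: Dict[str, str] = {
    "controller_model": "Which exact controller model is the cover for?",
    "mount_position": "Where on the cover should the mount sit?",
    "wall_thickness_mm": "What wall thickness should the cover use?",
}


def clarification_questions(spec: Dict[str, Optional[object]],
                            required: Dict[str, str]) -> List[str]:
    """Return one question per required field that is absent or None."""
    return [question for field, question in required.items()
            if spec.get(field) is None]


spec = {"controller_model": "hypothetical-model-x", "mount_position": None}
questions = clarification_questions(spec, REQUIRED_FIELDS)
for q in questions:
    print(q)
```

Generation would proceed only once `questions` is empty, so every dimension in the final model traces back to the user, a retrieved reference, or a measurement.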

CADSmith-style harnesses

CADSmith is one of the strongest current signals for your question because it argues directly for a harness over fine-tuning in many practical settings. It uses:

This is very close to what you should build first.

It is also the clearest connection to your Opus comment: the paper uses a stronger Claude Opus judge over a Claude Sonnet generator to reduce confirmation bias.

Img2CAD, CADCrafter, and CADDreamer

These show that image-conditioned parametric CAD is becoming real, but also expose the current constraints:

The main lesson is not “a foundation model will solve this end-to-end.” The lesson is that image-to-CAD needs decomposition, geometry priors, and strong validation.

DreamCAD and parametric surface generation

DreamCAD is impressive because it claims multimodal CAD generation from text, images, and point clouds with editable patch-based surfaces and STEP output. But its representation is still not the same as a feature-history CAD model. It is closer to a CAD-friendly surface generator than to a fully faithful editable design history.

That means it is exciting as a geometry backbone, but it does not remove the need for a harness if you want exact product edits and compositions.

Tradeoffs

End-to-end fine-tuned model

Pros:

Cons:

Best fit:

Harness around strong existing models

Pros:

Cons:

Best fit:

Hybrid path

This is the best long-term plan:

This mirrors what the best current CAD papers are converging on: code-first generation, verifier-guided repair, and post-training with geometric reward.

Practical guidance

Verdict

Verdict: Only worth doing if you constrain the domain and build a harness around existing SOTA models plus CAD tools.

More explicitly:

Why this verdict follows from the evidence

In other words: there is room for a useful system, but not for a naive one.

Best implementation strategies

Build a multimodal pipeline with these modules:

  1. Input parser
    • text request, photos, optional known measurements, optional existing CAD references
  2. Scale subsystem
    • ArUco/AprilTag/ruler detection
    • camera calibration or structure-from-motion if multi-view
    • optional manual dimension confirmation UI
  3. Task router
    • classify as: generate from text, reverse-engineer from image, edit existing CAD, compose reference parts
  4. Retrieval subsystem
    • fetch templates from STEP/FreeCAD/CadQuery library
    • use CAD similarity search for known parts
    • fetch standards such as GoPro mount geometry, screw standards, dovetails, snap fits
  5. CAD planner/coder
    • use frontier model (e.g. Opus 4.7 or similar) to produce a structured spec and then CAD code
  6. Kernel execution + validation
    • OpenCascade / CadQuery / FreeCAD execution
    • validate bounding box, volume, face counts, wall thickness, mounting positions, boolean success
  7. Repair loop
    • cheaper model or same model revises code from exact failure report
  8. Judge loop
    • stronger model or separate critic reviews rendered views + measurements
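For the validation module, a minimal sketch is a function that compares kernel measurements against target dimensions and emits an exact failure report that the repair loop can consume. The measurement names, the single shared tolerance, and the numbers are assumptions for illustration; a real system would read the measured values from the CAD kernel.

```python
from typing import Dict, List


def validation_report(measured: Dict[str, float], targets: Dict[str, float],
                      tol_mm: float = 0.2) -> List[str]:
    """Return one message per target that is unmeasured or out of tolerance."""
    failures = []
    for name, target in targets.items():
        if name not in measured:
            failures.append(f"{name}: not measured")
            continue
        error = measured[name] - target
        if abs(error) > tol_mm:
            failures.append(
                f"{name}: measured {measured[name]:.2f}, "
                f"target {target:.2f}, off by {error:+.2f}"
            )
    return failures


targets = {"bbox_x_mm": 60.0, "bbox_y_mm": 40.0, "wall_mm": 1.6}
measured = {"bbox_x_mm": 60.5, "bbox_y_mm": 40.1}  # wall thickness not measured
report = validation_report(measured, targets)
for line in report:
    print(line)
```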

This is the highest-probability path.

Option B — Reference-model composition system for edit tasks

This is the best answer to prompts like:

create xbox controller battery cover with a gopro mount baked in

Pipeline:

  1. identify object family and exact product variant
  2. retrieve or reconstruct base battery-cover geometry
  3. retrieve GoPro mount reference geometry and mating constraints
  4. choose attach surface and orientation
  5. solve attachment constraints and clearance
  6. generate merged CAD feature tree
  7. validate printability and fit dimensions

This is much more reliable than free-form generation because the system only needs to infer how to combine known parts, not invent everything from scratch.
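Steps 4 and 5 can be illustrated with a deliberately simplified 2D stand-in: treat the attach surface and the mount footprint as axis-aligned rectangles, require an edge margin, and center the mount. Real attachment solving works on 3D faces and mating constraints; the function name and the dimensions below are hypothetical.

```python
from typing import Tuple


def center_mount(surface_w: float, surface_h: float,
                 mount_w: float, mount_h: float,
                 margin: float) -> Tuple[float, float]:
    """Return the (x, y) offset of the mount's corner that centers it on the surface.

    Raises ValueError if the footprint plus margin does not fit.
    """
    if mount_w + 2 * margin > surface_w or mount_h + 2 * margin > surface_h:
        raise ValueError("mount footprint does not fit with the requested margin")
    return ((surface_w - mount_w) / 2, (surface_h - mount_h) / 2)


# Illustrative numbers in mm: a 60 x 40 flat region, a 20 x 15 mount base,
# and a 3 mm minimum distance to the edge.
offset = center_mount(60.0, 40.0, 20.0, 15.0, 3.0)
print(offset)
```

Failing loudly when the mount does not fit is the useful behavior here: it turns a silent bad merge into a question for the user or a trigger to pick another attach surface.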

Option C — Fine-tune a dedicated text-to-CAD model

Do this only if:

Good targets:

Bad target for a first effort:

Option D — Fine-tune image-to-CAD for constrained categories

Best when all of these are true:

For example:

How to use Opus 4.7 specifically

The most defensible use of Claude Opus 4.7 is as:

Why:

But do not assume “Opus 4.7 alone solves CAD.” Public evidence still favors tool use + validation + retrieval.

For a practical v1:

Concrete build plans

MVP 1 — text + reference composition

Focus only on:

This avoids the hard image perception problem and gets you to value quickly.

MVP 2 — photo-guided constrained reverse engineering

Add:

This is much more realistic than “upload one random phone photo and get exact CAD.”

MVP 3 — assembly-aware editing harness

Support:

This is likely the highest-value business workflow.

Open questions

Evidence and Sources

Core text-to-CAD / agentic sources

Image-to-CAD / CAD-native geometry sources

Practical systems and products

Uncertainties and Competing Views

Practical Takeaways

References

  1. Claude Opus 4.7 — current frontier model positioning across coding, vision, and agents.
  2. CadQueryEval — public benchmark for natural-language-to-CadQuery generation.
  3. OpenECAD: An Efficient Visual Language Model for Editable 3D-CAD Design — image-conditioned editable CAD generation with VLM fine-tuning.
  4. Img2CAD: Reverse Engineering 3D CAD Models from Images through VLM-Assisted Conditional Factorization — factorized image-to-CAD.
  5. Text2CAD: Text to 3D CAD Generation via Technical Drawings — text to technical drawings to CAD reconstruction.
  6. CADDreamer: CAD Object Generation from Single-view Images — single-view image to CAD B-rep.
  7. Text-to-CadQuery: A New Paradigm for CAD Generation with Scalable Large Model Capabilities — fine-tuned LLMs for CadQuery generation.
  8. CAD-Coder: Text-to-CAD Generation with Chain-of-Thought and Geometric Reward — CoT + GRPO for text-to-CAD.
  9. CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images — image-to-parametric CAD from unconstrained images.
  10. DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces — multimodal CAD generation from text, images, and point clouds.
  11. GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation — graph-based B-rep generation.
  12. CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation — multimodal conditional B-rep generation on mmABC.
  13. PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis — self-training CAD program synthesis.
  14. Clarify Before You Draw: Proactive Agents for Robust Text-to-CAD Generation — clarification-first CAD agents.
  15. Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding — FutureCAD and B-rep grounding.
  16. CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation — multi-agent validated CAD generation.
  17. TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning — RL-trained CAD tool-use agents.
  18. Agent-Aided Design for Dynamic CAD Models — assembly-aware agentic CAD with constraints.
  19. CADAM — open-source text/image-to-CAD web application.
  20. Text-to-.step — practical FreeCAD-based text-to-STEP system.
  21. FreeCAD MCP — Claude-to-FreeCAD integration.
  22. MCP-FreeCAD Integration — AI assistant integration with FreeCAD via MCP.