Paper Banana - AI with Surya channel



Video: https://www.youtube.com/watch?v=MtA7CGOXnko

A summary of the video transcript covering the PaperBanana framework.

PaperBanana: Automated Academic Illustration for AI Scientists

PaperBanana is a new framework developed by a team at Google and Peking University. It is designed to generate publication-quality diagrams and infographics directly from plain text descriptions, overcoming the limitations of standard one-shot image generation models.


🚫 The Problem with Current Models

Current state-of-the-art models (referred to as Nano Banana Pro in the video) operate on a “one-shot” basis:

  • You give a prompt → you get an image.
  • Issue: If a label is misspelled, a connection is missing, or the color scheme is off, you have to regenerate the entire image and hope for the best.

💡 The PaperBanana Solution: Agentic Workflow

PaperBanana wraps the base image generator in a multi-agent pipeline. It does not just prompt; it plans, executes, and refines.

The 5-Agent Architecture

  1. Retriever Agent: Finds relevant reference examples to guide the system.
  2. Planner Agent: Acts as the cognitive core, translating context into detailed layout descriptions.
  3. Stylist Agent: Enforces aesthetic standards and design guidelines.
  4. Visualizer Agent: Generates the actual image (uses Nano Banana Pro internally).
  5. Critic Agent: Reviews the output, critiques it, and sends it back for revision.

The Loop

The system operates on a Generate → Critique → Refine loop (typically 3 iterations). It self-corrects specific details like arrow direction, color coding, and text labels without needing manual user intervention.
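The loop above can be sketched in a few lines. Note that the official code is unreleased, so every class and method name below is a hypothetical illustration of the pattern (a critic gating a bounded refine loop), not PaperBanana's actual API; the critic here approves on the third round purely to demonstrate the flow.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Generate → Critique → Refine loop.
# Agent names follow the video; all identifiers are illustrative
# assumptions, since the official implementation is not yet released.

@dataclass
class Critique:
    approved: bool
    notes: str = ""

class Visualizer:
    """Stands in for the image generator (e.g. Nano Banana Pro)."""
    def render(self, plan: str, feedback: str = "") -> str:
        # A real system would call an image model; we return a text stub.
        return f"image({plan}|{feedback})"

class Critic:
    """Stub critic: requests fixes twice, then approves (round 3)."""
    def __init__(self):
        self.rounds = 0

    def review(self, image: str) -> Critique:
        self.rounds += 1
        if self.rounds < 3:
            return Critique(False, f"fix arrows (round {self.rounds})")
        return Critique(True)

def paperbanana_loop(plan: str, max_iters: int = 3) -> tuple[str, int]:
    """Generate, critique, and refine for up to max_iters rounds.

    `plan` stands in for the Planner/Stylist output that precedes
    generation in the full 5-agent pipeline.
    """
    viz, critic = Visualizer(), Critic()
    feedback = ""
    image = ""
    for i in range(1, max_iters + 1):
        image = viz.render(plan, feedback)
        critique = critic.review(image)
        if critique.approved:
            return image, i
        feedback = critique.notes  # fold the critique into the next render
    return image, max_iters
```

The key design point is that the critic's notes become input to the next render, so specific defects (a mislabeled arrow, a wrong color) are targeted rather than regenerating blindly from scratch.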


🧪 Demos & Capabilities

1. Transformer Architecture (Sample Input)

The video demonstrated generating a diagram of the Transformer architecture.

  • Result: High fidelity to the actual architecture.
  • Details: Correctly separated Encoder/Decoder layers into different color palettes, used dashed lines for residual connections, and placed “Pre-LN” annotations correctly.
  • Process: The system iterated three times, refining the “Sparse Attention Context” arrow and label placement automatically.

2. Google Agent Development Kit (Custom Input)

The presenter fed the system a raw text description of a hypothetical “ADK Agent” system involving orchestration, research agents, BigQuery, and Pandas.

  • Result: A complex, professional system design diagram.
  • Details: It correctly mapped relationships between the User, Orchestrator, and Sub-agents. It visually represented specific tools (BigQuery, Pandas) and protocols (A2A) mentioned in the text.

📊 Benchmarks & Performance

The paper evaluated the model on four dimensions: Faithfulness, Conciseness, Readability, and Aesthetics.

| Metric | Vanilla Model (One-shot) | PaperBanana (Agentic) | Human Experts |
| --- | --- | --- | --- |
| Overall Score | 43.2 | 60.2 | N/A |
| Key Finding | Struggled with specifics. | Beats humans in Conciseness, Readability, & Aesthetics. | Win only on Faithfulness (Intent). |

🚀 Use Cases

While built for researchers, this technology is applicable to:

  • Solutions Architects: System design documentation.
  • Product Managers: Feature flowcharts.
  • Founders: Pitch deck visuals.
  • Developers: Pipeline visualization.

⚠️ Important Notes

  • Code Status: The video utilized an unofficial open-source implementation (hosted on GitHub/Antigravity). The official code from Google/Peking University has not been released yet (expected circa Jan 30, 2026).
  • Underlying Tech: PaperBanana is a framework/wrapper; it uses existing strong image models (like Nano Banana Pro) to do the actual pixel generation.

Link to research paper: https://arxiv.org/abs/2601.23265

Unofficial GitHub to clone: https://github.com/llmsresearch/paperbanana