D-Robotics 2026 GSoC Ideas List
This post presents the D-Robotics 2026 Google Summer of Code (GSoC) project ideas list, intended to give prospective GSoC contributors clear and concise guidance on project directions.
Idea 1: Unified Embodied AI Dataset Standardization & Cross-Robot Generalization Benchmark
Abstract / Brief Summary
The diversity of data formats (e.g., DROID, BridgeData, RT-1) and robot morphologies currently hinders the development of generalist Vision-Language-Action (VLA) models. This project aims to build an automated pipeline to convert heterogeneous embodied datasets into the standardized LeRobot v2.1 format, inspired by the AIRoA MoMa dataset. Furthermore, the student will develop a simulation-based benchmark (using Habitat-Sim or OmniGibson) to evaluate how foundational policies generalize across different robot embodiments (e.g., D-Robotics RDK platforms vs. standard research robots).
Project Description
- The Problem: Current embodied AI datasets are highly fragmented in terms of storage formats, control frequencies, and observation spaces. For instance, datasets like Open X-Embodiment contain 22+ embodiments, but transferring these skills to new hardware like the D-Robotics RDK X5 requires complex re-calibration and data alignment. Existing resources also often lack the hierarchical annotations (sub-goals/primitive actions) necessary for long-horizon task learning.
- The Solution: We propose a "Standardization & Benchmark" framework. The student will first develop tools to re-sample and synchronize multi-modal streams (RGB, proprioception, force-torque) into a unified format. Following the AIRoA MoMa schema, the pipeline will utilize VLMs to auto-generate hierarchical sub-goal labels. Finally, the student will establish a benchmark suite to test policy zero-shot transferability across different robot degrees-of-freedom (DoF) and sensor configurations.
- The Impact: This project will provide the open-source community with a "Universal Data Bridge." It enables developers to easily fine-tune large-scale VLA models (like RT-X or π0) on D-Robotics hardware, accelerating the deployment of robust mobile manipulation agents in real-world environments.
Key Deliverables
The contributor is expected to deliver the following:
- Step 1: Multi-Dataset Converter (Python)
- Develop a modular tool to convert at least 3 major datasets (e.g., DROID, OXE, BridgeData) to LeRobot v2.1 format.
- Implement synchronized re-sampling (e.g., to 30Hz) and multi-view RGB alignment.
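As a sketch of the re-sampling step, the snippet below aligns an irregularly-timed stream onto a uniform 30 Hz clock using nearest-neighbor lookup. The function name `resample_stream` and the nearest-neighbor strategy are illustrative assumptions, not part of the LeRobot v2.1 spec; a real converter would also handle interpolation for continuous signals:

```python
import numpy as np

def resample_stream(timestamps, values, target_hz=30.0):
    """Resample an irregularly-sampled stream (e.g. proprioception)
    onto a uniform clock by nearest-neighbor lookup."""
    t0, t1 = timestamps[0], timestamps[-1]
    grid = np.arange(t0, t1, 1.0 / target_hz)          # uniform target clock
    idx = np.searchsorted(timestamps, grid)            # right neighbor index
    idx = np.clip(idx, 0, len(timestamps) - 1)
    prev = np.clip(idx - 1, 0, None)                   # left neighbor index
    # pick whichever original sample is closer in time
    closer_prev = (grid - timestamps[prev]) < (timestamps[idx] - grid)
    idx[closer_prev] = prev[closer_prev]
    return grid, values[idx]
```

The same routine can be applied per-modality (RGB frame indices, joint states, force-torque) so that every stream shares one 30 Hz timeline before being written into the unified format.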
- Step 2: Automated Hierarchical Labeling Pipeline
- Integrate a VLM-based (e.g., Gemini or GPT-4V) module to label "Short Horizon Tasks" and "Primitive Actions" for raw trajectories.
- Generate a sample "Enhanced Dataset" hosted on Hugging Face.
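A minimal skeleton of such a labeling pipeline might look like the following. `SubGoal`, `label_trajectory`, the fixed 32-frame window, and the `query_vlm` callable are all hypothetical placeholders; in practice `query_vlm` would wrap a Gemini or GPT-4V API call and the segmentation would be driven by the AIRoA MoMa schema rather than fixed windows:

```python
from dataclasses import dataclass

@dataclass
class SubGoal:
    start: int   # first frame index of the segment
    end: int     # last frame index (inclusive)
    label: str   # e.g. "reach toward the mug"

def label_trajectory(frames, query_vlm, window=32):
    """Split a trajectory into fixed windows and ask a VLM for a
    sub-goal label per window. `query_vlm` is any callable mapping
    a list of frames to a short text label."""
    subgoals = []
    for start in range(0, len(frames), window):
        end = min(start + window, len(frames)) - 1
        subgoals.append(SubGoal(start, end, query_vlm(frames[start:end + 1])))
    return subgoals
```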
- Step 3: Cross-Robot Benchmark Suite
- Create a testing environment in Habitat-Sim or OmniGibson featuring at least two different robot URDFs (e.g., Fetch and a D-Robotics mobile base).
- Define standard metrics for "Transfer Success Rate" and "Morphology Adaptation Latency".
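As one possible realization of the first metric, the sketch below aggregates zero-shot rollout outcomes per embodiment; the function name and result format are our own assumptions for illustration:

```python
from collections import defaultdict

def transfer_success_rate(results):
    """results: iterable of (embodiment, success: bool) pairs from
    zero-shot rollouts. Returns per-embodiment success rates."""
    counts = defaultdict(lambda: [0, 0])   # embodiment -> [successes, trials]
    for embodiment, success in results:
        counts[embodiment][0] += int(success)
        counts[embodiment][1] += 1
    return {e: s / n for e, (s, n) in counts.items()}
```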
- Step 4: Technical Report
- A comprehensive report analyzing the impact of data standardization on model performance across different hardware platforms.
Skills Required
- Mandatory: Python, experience with Embodied AI datasets (LeRobot, OXE, or similar).
- Important: Knowledge of Multi-modal data processing (time-synchronization, re-sampling) and ROS 2.
- Preferred: Familiarity with 3D simulators (Habitat-Sim, OmniGibson) and VLA models (RT-1/RT-2).
Mentorship & Difficulty
- Difficulty: Large (350 hours). This involves significant data engineering, simulation setup, and multi-modal model integration.
- Possible Mentors: Longtao Wu
- Hardware Requirement: The contributor will use a D-Robotics RDK X5 / RDK Ultra for real-world validation (or we can provide remote access to our hardware logs and simulation environments).
Idea 2: Lightweight Grounding DINO: Backbone Optimization & BPU Acceleration on RDK X5
Abstract / Brief Summary
Grounding DINO is currently a state-of-the-art framework for open-set object detection. However, its original backbone (Swin Transformer) is computationally expensive and unfriendly to edge AI accelerators like the Horizon BPU. This project aims to replace the heavy backbone with BPU-friendly CNN architectures (e.g., ResNet50 or EfficientNet), prune the model, and deploy it on the D-Robotics RDK X5 to achieve real-time zero-shot detection for robotics applications.
Project Description
The Problem: The RDK X5 features a powerful 10 TOPS BPU (Brain Processing Unit) that excels at processing CNN-based structures. The standard Grounding DINO relies on Swin-Transformer, which involves complex window-shifted attention mechanisms that suffer from high latency and poor quantization support on edge hardware.
The Solution: We propose a "Backbone Replacement" strategy. The student will modify the Grounding DINO architecture by swapping the Swin-T backbone with a standard ResNet-50 or EfficientNet-B0/B3. These CNN backbones are fully supported by the Horizon OpenExplorer toolchain and can be efficiently quantized to INT8.
The Impact: By enabling Grounding DINO on RDK X5, we provide the developer community with a powerful "Text-to-Object" perception tool. This allows low-cost robots to understand complex commands (e.g., "Find the red plastic bottle") without relying on cloud APIs.
Key Deliverables
The student is expected to deliver the following:
- Step 1: Model Re-Architecting (PyTorch)
- Replace the Swin-Transformer backbone with ResNet50 or EfficientNet in the official Grounding DINO codebase.
- Align the feature pyramid network (FPN) outputs to match the detection head requirements.
- Fine-tune or transfer weights to ensure basic accuracy recovery.
- Step 2: Toolchain Adaptation (OpenExplorer)
- Convert the modified PyTorch model to ONNX.
- Use Horizon OpenExplorer to optimize the model and quantize it to INT8 (.bin model generation).
- Solve potential operator fallback issues (ensure that >90% of ops run on the BPU).
- Step 3: On-Board Inference SDK
- Provide a C++ or Python inference script using the hbbpu library.
- (Optional/Advanced) Wrap the inference engine into a ROS2 Node.
- Step 4: Benchmark Report
- A comparison report: Original Swin-T vs. Optimized CNN Backbone on RDK X5 (Latency, FPS, and Accuracy drop).
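A simple latency/FPS harness for such a comparison could look like the sketch below; `infer` is an illustrative callable that would wrap the actual BPU or PyTorch forward pass:

```python
import time

def benchmark(infer, inputs, warmup=10, iters=100):
    """Measure mean latency (ms) and throughput (FPS) of an
    inference callable running one forward pass per invocation."""
    for _ in range(warmup):            # warm caches / lazy allocations
        infer(inputs)
    start = time.perf_counter()
    for _ in range(iters):
        infer(inputs)
    elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / iters
    return latency_ms, 1000.0 / latency_ms
```

Running the same harness against the Swin-T baseline and the optimized CNN build, on identical input resolutions, yields directly comparable latency/FPS columns for the report.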
Skills Required
- Mandatory: Python, PyTorch (deep understanding of model definition and loading pre-trained weights).
- Important: Basic knowledge of Model Quantization (INT8) and ONNX.
- Preferred: C++ (for high-performance inference) and ROS2 basics.
Mentorship & Difficulty
- Difficulty: Medium (150 hours). This is not a simple deployment; it involves genuine model surgery.
- Possible Mentors: Guosheng Xu
- Hardware Requirement: The student will need access to an RDK X5 board (or we can provide remote access/simulation logs).
