Appendix B - Method

B.1. Fine-Tuning Settings

To ensure that Instruction-LLaVA effectively interprets visual and contextual information, we fine-tuned the LLaVA1.5-7B-Chat model using Low-Rank Adaptation (LoRA) with the following key hyperparameters:

The training was conducted using the PEFT framework, which facilitated the efficient fine-tuning of the LLaVA model for the specific requirements of the AlignBot system.
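
As a concrete illustration, the sketch below shows how a LoRA adapter can be attached to a LLaVA-style checkpoint with the PEFT library. The checkpoint identifier and every numeric value (rank, scaling, dropout, target modules) are placeholders for illustration, not the settings used for Instruction-LLaVA.

```python
# Minimal LoRA set-up with the PEFT library. The checkpoint id and all
# hyperparameter values below are illustrative placeholders, not the
# settings used for Instruction-LLaVA.
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

base_model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf"           # assumed Hugging Face checkpoint id
)

lora_config = LoraConfig(
    r=16,                                # low-rank dimension (placeholder)
    lora_alpha=32,                       # scaling factor (placeholder)
    lora_dropout=0.05,                   # dropout on adapter layers (placeholder)
    target_modules=["q_proj", "v_proj"], # attention projections commonly adapted
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()       # only the LoRA adapters are trainable
```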

B.2. Algorithm for Case-Based Learning

B.2.1. Case-Based Learning for Enhanced GPT Prompting

Long-horizon task planning in household robotics often faces challenges such as action omissions and sequence errors, which can undermine execution reliability. For instance, a robot might attempt to place an object without first grasping it. To address these issues, we propose a case-based learning approach that enhances GPT-4o's planning capabilities by incorporating relevant historical cases into the prompting process.

This method involves dynamically retrieving the top-\( k \) most relevant cases—specifically, past action plans that were both successful and approved by users—from historical dialogue data. The retrieved cases serve as reference examples for GPT-4o, providing contextual guidance based on prior successful executions. By leveraging similar and correct action sequences from past cases, we aim to improve the accuracy and robustness of the generated plans.

The rationale behind this approach is that previous successful plans encapsulate effective solutions that can help prevent common errors in new tasks. However, retrieving relevant cases in a multimodal context poses challenges due to the complexity of assessing similarity across different modalities. To overcome this, we first select successful plans that match the same task description. We then utilize GPT-4o's feedback to evaluate and rank these plans based on their relevance and effectiveness. By dynamically integrating the most appropriate cases as references, GPT-4o can generate more accurate and comprehensive action sequences, thereby reducing omissions and enhancing overall task execution.
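
To make the retrieval step concrete, the sketch below ranks stored task descriptions against a new task using TF-IDF features and cosine similarity (scikit-learn). The function name, similarity threshold, and value of k are illustrative assumptions.

```python
# Illustrative top-k case retrieval by task-description similarity
# (TF-IDF + cosine similarity); names and threshold are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_cases(history, t_new, k=3, tau=0.2):
    """history: list of (plan, task_description) pairs; t_new: new task string."""
    descriptions = [t for _, t in history]
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(descriptions + [t_new])
    sims = cosine_similarity(doc_matrix[-1], doc_matrix[:-1]).ravel()
    # Keep only cases above the similarity threshold tau, then take the top-k.
    candidates = [(s, history[i]) for i, s in enumerate(sims) if s > tau]
    candidates.sort(key=lambda x: x[0], reverse=True)
    return [case for _, case in candidates[:k]]
```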

B.2.2. Algorithm Description

Consider a dataset \( D = \{ \langle p_1, t_1 \rangle, \langle p_2, t_2 \rangle, \dots, \langle p_n, t_n \rangle \} \), where each action plan \( p_i \) is linked to a historical task description \( t_i \). Each plan \( p_i \) is assigned an initial priority score \( f_i = f_0 \) and a gradient \( \Delta f_i = 0 \).

When a new task \( t_{\text{new}} \) arrives, the system performs the following steps:

  1. Instance Selection:
    1. Compute the similarity \( \text{sim}(t_i, t_{\text{new}}) \) for each pair \( \langle p_i, t_i \rangle \in D \) using TF-IDF combined with cosine similarity.
    2. Select a subset \( D_r' \) such that: \[ D_r' = \{ p_i \mid \text{sim}(t_i, t_{\text{new}}) > \tau \} \] where \( \tau \) is a predefined similarity threshold.
  2. Priority Normalization:
    1. Apply Softmax Function:
      • Normalize the priority scores of instances in \( D_r' \) to prevent over-reliance on a limited set of action plans: \[ f_i = \frac{\exp(f_i)}{\sum_{p_j \in D_r'} \exp(f_j)}, \quad \forall p_i \in D_r' \]
  3. Action Plan Selection (\( \epsilon \)-Greedy Strategy):
    1. Implement an \( \epsilon \)-Greedy strategy to balance exploration and exploitation: \[ p_{\text{selected}} = \begin{cases} \text{Randomly select } k \text{ instances from } D_r', & \text{with probability } \epsilon \\ \text{Select top-}k \text{ instances from } D_r' \text{ based on } f_i, & \text{with probability } 1 - \epsilon \end{cases} \] where \( \epsilon \) is the exploration rate.
  4. Priority and Gradient Update:
    1. For each selected instance \( p_i \), update the gradient and priority based on the utility of the action plan:
      • If \( p_i \) was useful (effective): \[ \Delta f_i = \Delta f_{\text{positive}}^0 \cdot \exp(-\alpha \cdot n_i) \]
      • If \( p_i \) was not useful (ineffective): \[ \Delta f_i = -\Delta f_{\text{negative}}^0 \cdot \exp(-\beta \cdot n_i) \]
      where \( n_i \) is the usage count of \( p_i \), \( \alpha \) and \( \beta \) are decay parameters, and \( \Delta f_{\text{positive}}^0 \) and \( \Delta f_{\text{negative}}^0 \) are initial adjustment values.

      Update the priority score: \[ f_i = \min(\max(f_i + \Delta f_i, f_{\text{min}}), f_{\text{max}}) \] where \( f_{\text{min}} \) and \( f_{\text{max}} \) are the minimum and maximum allowable priority scores.

  5. Iterative Improvement: Repeat the above steps for subsequent tasks, continuously refining the priority scores and improving the system's planning capabilities over time.
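
The following sketch mirrors steps 2-4 above under assumed values for \( \epsilon \), the decay parameters, the adjustment magnitudes, and the priority bounds; none of these numbers are the settings used in our experiments. For simplicity, the softmax-normalized score is kept separate from the stored priority rather than overwriting it.

```python
# Sketch of priority normalization, epsilon-greedy selection, and priority
# update (steps 2-4 above). All numeric values are illustrative placeholders.
import math
import random

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_plans(candidates, k=3, epsilon=0.1):
    """candidates: list of dicts with keys 'plan', 'f' (priority), 'n' (usage count)."""
    probs = softmax([c["f"] for c in candidates])
    for c, p in zip(candidates, probs):
        c["f_norm"] = p                               # normalized priority for ranking
    if random.random() < epsilon:                     # explore
        return random.sample(candidates, min(k, len(candidates)))
    ranked = sorted(candidates, key=lambda c: c["f_norm"], reverse=True)
    return ranked[:k]                                 # exploit

def update_priority(case, useful, alpha=0.1, beta=0.1,
                    df_pos=0.5, df_neg=0.5, f_min=0.0, f_max=5.0):
    """Decayed positive/negative adjustment, clipped to [f_min, f_max]."""
    if useful:
        delta = df_pos * math.exp(-alpha * case["n"])
    else:
        delta = -df_neg * math.exp(-beta * case["n"])
    case["f"] = min(max(case["f"] + delta, f_min), f_max)
    case["n"] += 1
```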

B.2.3. Parameter Settings

The key parameters used in the algorithm are set as follows:

These parameters balance the exploration and exploitation trade-off, ensuring adaptability without over-reliance on a narrow set of strategies. The decay parameters \( \alpha \) and \( \beta \) control how quickly the impact of feedback diminishes with repeated use, preventing unbounded growth or decay of priority scores over time.

B.2.4. Continuous Improvement through Iterative Feedback

By iteratively applying this algorithm across multiple tasks, the system refines its action planning capabilities. Initially, when all action plans have identical priority scores, the selection is more random. As tasks are executed and feedback is incorporated, the priority scores are updated to reflect the effectiveness of each action plan. Over time, this leads to a hierarchy among the action plans, allowing the system to prioritize the most effective strategies.

This approach not only enhances the system's planning capabilities but also adapts to user preferences and the specific context of each task. By continuously updating the priority scores and balancing exploration and exploitation, AlignBot achieves a higher degree of flexibility and precision in its operations.

B.3. Deployment Details for XArm and ACT

XArm

In all our real-world manipulation experiments, we utilize a UFactory XArm 6 robot, a flexible manipulator with six degrees of freedom. The arm has a reach of 700 mm, a payload capacity of 5 kg, a repeatability of ±0.1 mm, and a maximum speed of 1000 mm/s. It supports various motion control modes, including joint position and velocity control, which are easily integrated into our system. The XArm gripper is equipped with modified UMI compliant fingers. Although the arm is mounted on a mobile base for scene adaptability, we consider it a fixed-base manipulator rather than a mobile robot. The robot is controlled using the xArm-Python-SDK.
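
For reference, the snippet below is a minimal control sketch with the xArm-Python-SDK; the controller IP address, target pose, and gripper width are placeholders rather than values from our experiments.

```python
# Minimal xArm-Python-SDK usage sketch; the IP address, pose, and gripper
# width below are placeholders, not the values used in our experiments.
from xarm.wrapper import XArmAPI

arm = XArmAPI("192.168.1.200")       # assumed controller IP
arm.motion_enable(enable=True)
arm.set_mode(0)                      # position control mode
arm.set_state(state=0)               # ready state

# Cartesian move (mm / degrees) and a simple gripper command.
arm.set_position(x=300, y=0, z=250, roll=180, pitch=0, yaw=0, speed=100, wait=True)
arm.set_gripper_enable(True)
arm.set_gripper_position(400, wait=True)  # open to an illustrative width

arm.disconnect()
```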

The system is outfitted with two Intel RealSense D435i cameras: one positioned on the wrist and the other on the shoulder of the robot. We use the pyrealsense2 library to communicate with the cameras. Each camera captures RGB images at a resolution of 640×480 pixels with a frame rate of 60 Hz.

To manage the integration of all components, we use Python's threading module. Neural networks are implemented and trained using PyTorch.
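
The sketch below illustrates, under assumed camera serial numbers, how the two color streams can be read with pyrealsense2 in background threads while the control loop consumes the most recent frames.

```python
# Sketch: reading the wrist and shoulder D435i color streams (640x480 @ 60 Hz)
# in background threads with pyrealsense2. Serial numbers are placeholders.
import threading
import numpy as np
import pyrealsense2 as rs

def start_camera(serial):
    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_device(serial)                      # select camera by serial
    config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 60)
    pipeline.start(config)
    return pipeline

latest = {}  # most recent frame per camera, shared with the control loop

def capture_loop(name, pipeline):
    while True:
        frames = pipeline.wait_for_frames()
        color = frames.get_color_frame()
        if color:
            latest[name] = np.asanyarray(color.get_data())

for name, serial in [("wrist", "XXXXXXXXXXXX"), ("shoulder", "YYYYYYYYYYYY")]:
    pipe = start_camera(serial)
    threading.Thread(target=capture_loop, args=(name, pipe), daemon=True).start()
```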

ACT

The Action Chunking with Transformers (ACT) algorithm is based on a transformer network architecture that integrates a conditional variational autoencoder (CVAE) with a transformer sequence model. This structure is specifically designed to tackle the complexities associated with fine-grained manipulation tasks.

The encoder compresses the action sequences and joint observations into a latent space, represented by a style variable that captures the variability of human demonstrations. This style variable is essential for modeling the stochastic nature of human actions, allowing the system to generate diverse but accurate action sequences during inference. The decoder, which acts as the policy, uses current observations—such as RGB images from both the shoulder and hand perspectives, joint positions, and the style variable—to predict the sequence of actions.

The transformer architecture is central to ACT’s functionality: a transformer encoder integrates information across the observation sequence, while a transformer decoder produces the action sequence. The ability of transformers to model implicit representations within sequences makes them well suited to the action chunking strategy, in which actions are predicted in segments rather than step by step.
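
The skeleton below is a structural sketch of such a CVAE-plus-transformer chunking policy in PyTorch. It omits the image backbone and uses illustrative dimensions throughout; it is not the official ACT implementation.

```python
# Structural sketch of a CVAE-style action-chunking policy (not the official
# ACT implementation); all dimensions below are illustrative.
import torch
import torch.nn as nn

class ChunkingPolicy(nn.Module):
    def __init__(self, obs_dim=6, act_dim=6, chunk=20, d_model=256, z_dim=32):
        super().__init__()
        self.chunk, self.act_dim, self.z_dim = chunk, act_dim, z_dim
        # CVAE encoder: joints + demonstrated action chunk -> latent "style" z.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + chunk * act_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, 2 * z_dim),            # -> [mu, logvar]
        )
        # Decoder (the policy): transformer over observation tokens, then a
        # head that emits the whole action chunk at once.
        self.obs_proj = nn.Linear(obs_dim + z_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, chunk * act_dim)

    def forward(self, joints, action_chunk=None):
        if action_chunk is not None:                  # training: posterior z
            stats = self.encoder(torch.cat([joints, action_chunk.flatten(1)], -1))
            mu, logvar = stats.chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        else:                                         # inference: prior z = 0
            mu = logvar = None
            z = joints.new_zeros(joints.size(0), self.z_dim)
        tokens = self.obs_proj(torch.cat([joints, z], -1)).unsqueeze(1)
        feats = self.backbone(tokens)                 # single observation token here
        pred = self.head(feats[:, 0]).view(-1, self.chunk, self.act_dim)
        return pred, mu, logvar
```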

B.4. Prompt

GPT-4o Prompt:

            You are controlling a one-armed robot.

            **Task Objective:**
            Your task is to complete the following goal:
            {goal}

            **Current Environment:**
            Please analyze the image carefully, ensuring that all objects, particularly those relevant to the goal, are identified and accounted for.

            **Past Success Example:**
            Here are some past success examples for the current goal:
            {Successes}

            **Reminders:**
            You need to pay careful attention to the following reminders from user {username}:
            {Reminders}

            **Available Actions:**
            Here are all the actions you can use to achieve the goal:
            - **turnright**: Rotate the robotic arm to the right.
            - **turnleft**: Rotate the robotic arm to the left.
            - **walkforward**: Advance the robotic arm forward.
            - **walktowards <obj>**: Move the robotic arm towards the specified object.
            - **grab <obj>**: Grasp the specified object. Release any held item before grabbing a new one.
            - **switchon <obj>**: Activate the specified object.
            - **switchoff <obj>**: Deactivate the specified object.
            - **open <obj>**: Open the specified object.
            - **close <obj>**: Close the specified object.
            - **lookat <obj>**: Focus on the specified object.
            - **find <obj>**: Locate the specified object.
            - **turnto <obj>**: Orient towards the specified object.
            - **pointat <obj>**: Point to the specified object.
            - **putin <obj1> <obj2>**: Place the first object into the second.
            - **puton <obj1> <obj2>**: Place the first object on top of the second.
            - **putback <obj1> <obj2>**: Return the first object into the second.
            - **push <obj>**: Push the specified object.
            - **pull <obj>**: Pull the specified object.
            - **rotate <obj>**: Rotate the specified object.
            - **tilt <obj>**: Tilt the specified object.
            - **cut <obj>**: Cut the specified object.
            - **flip <obj>**: Flip the specified object.
            - **wait <n>**: Pause for n minutes.
            - **done**: Signal task completion.

            **Output Structure:**
            - Formulate a step-by-step action plan using only the actions listed above.
            - Each step should adhere to the following syntax:
                '''
                [action] ("object1", (optional: "object2"))
                '''
            - Conclude the plan with the **done()** action.
            - In the action plan, briefly describe the color or shape of the object.

            **Example:**
            For the goal "Put the fork into the cup":
                '''
                def Put_the_fork_into_the_cup():
                    find ("grey fork")
                    walktowards ("grey fork")
                    grab ("grey fork")
                    find ("red cup")
                    walktowards ("red cup")
                    putin ("grey fork", "red cup")
                    done()
                '''
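
For completeness, here is one way the placeholders in the prompt above ({goal}, {Successes}, {username}, {Reminders}) could be filled and sent together with the scene image through the openai Python client. The template variable, file path, user name, and example contents are assumptions for illustration.

```python
# Sketch: filling the prompt placeholders and sending text + scene image to
# GPT-4o through the openai Python client. Paths and contents are placeholders.
import base64
from openai import OpenAI

PROMPT_TEMPLATE = "..."                 # the full GPT-4o prompt text shown above
retrieved_cases = "..."                 # output of the case-based retrieval step
reminders = "..."                       # reminders produced by Instruction-LLaVA

prompt = PROMPT_TEMPLATE.format(
    goal="Put the fork into the cup",
    Successes=retrieved_cases,
    username="Alice",
    Reminders=reminders,
)

with open("scene.jpg", "rb") as f:      # current scene image (placeholder path)
    image_b64 = base64.b64encode(f.read()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
action_plan = response.choices[0].message.content
```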
        

LLaVA-7B Prompt:

            {username} wants robot to {goal}. If you were {username}, what reminders would you give to the robot?
        

If you want to try a larger LLaVA model, use the following prompt:

            {username} wants robot to {goal}. If you were {username}, what reminders would you give to the robot?
                        
            Follow these guidelines:

            1. **Personalized Reminders:**
                - Generate reminders based on user
                - Include user preferences and specific rules for item placement.
                - Mention any habitual actions the user typically follows.

            2. **Scene Reminders:**
                - Generate reminders based on picture and goal
                - Describe all items visible in the image, including their status, position, and any notable details.
                - Reference any past successful scenarios or considerations similar to the current scene.

            3. **Step-by-Step Action Reminders:**
                - Generate reminders based on picture and goal
                - Provide a detailed, step-by-step reminder for each action, ensuring clarity and precision.
                - Consider the robot’s single-arm constraint: remind that only one action can be performed at a time.
                - Include any specific instructions related to the current scene and the user's preferences.

            **Output Structure:**
            1. **User Preferences:**
                - [Include relevant personalized information based on the user’s habits or rules.]

            2. **Scene Description:**
                - [Provide a detailed description of the items and their status in the image.]

            3. **Step-by-Step Action Reminders:**
                - Reminder 1: [Detailed reminder for the first action, considering robot’s single-arm constraint.]
                - Reminder 2: [Next detailed reminder.]
                - ...
                - Final Reminder: [Final detailed reminder, ensuring all tasks are completed according to the user’s preferences and scene specifics.]

            **Few-Shot Examples:**

            Example 1:
            - Carefully identify the objects on the table in the picture and put them into the drawers. 
            - Put drinks in drawer 1. 
            - Put kitchen utensils in drawer 3. The kitchen utensils in the picture are a white tray and a kitchen knife. Put them in drawer 3.
            - The yellow item is actually a tea beverage. You should open drawer 1 before grabbing things.

            Example 2:
            - Put snacks in drawer 2 and kitchen utensils in drawer 3.
            - The kitchen utensils in the picture are a cutting board and a fruit knife.
            - The pink fruit knife is on the cutting board. Put them into drawer 3. You should open drawer 3 before grabbing things.

            Example 3:
            - Put drinks in drawer 1.
            - Drawer 1 is already open. You do not need to open it again. Close drawer 1 at the end.

            Ensure each reminder is clear, follows the constraints, and aligns with the user's preferences and the scene description.