AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

Submitted to ICRA 2025

1Shanghai Artificial Intelligence Laboratory, 2The University of Hong Kong, 3University of Bristol, 4Xi’an Jiaotong-Liverpool University, 5Northwestern Polytechnical University, 6Institute of Artificial Intelligence, China Telecom Corp Ltd, Equal contribution, *Corresponding author: Yan Ding [yding25 (at) binghamton.edu]

Abstract

This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders themselves. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model that functions as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders—such as personalized preferences, corrective guidance, and contextual assistance—into structured instruction-formatted cues that prompt GPT-4o to generate customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects relevant historical interactions as prompts for GPT-4o, further enhancing task-planning accuracy. To validate the effectiveness of AlignBot, experiments were conducted in a real-world household environment. A multimodal dataset with 1,500 entries derived from volunteer reminders was used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders, achieving an 86.8% success rate compared to 21.6% for the vanilla GPT-4o baseline, reflecting a 65-percentage-point improvement and more than four times greater effectiveness.

Framework of AlignBot

The fine-tuned LLaVA model serves as an adapter for GPT-4o during inference, processing the user ID, task descriptions, and observations to produce cues that guide GPT-4o's task planning. These cues, combined with dynamically retrieved task-relevant cases of past successes, are incorporated into the prompt, improving GPT-4o's generation of action plans. If the initial output does not meet user expectations, the system enables iterative dialogue, supporting multiple rounds of feedback and refinement until a satisfactory result is achieved.
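To make the inference flow above concrete, the following minimal Python sketch outlines one way the pipeline could be wired together. The helper names (query_llava_adapter, retrieve_cases, query_gpt4o, ask_user_for_feedback) and the prompt layout are illustrative assumptions, not the exact implementation from the paper.

def plan_task(user_id, task, image, case_db, max_rounds=3):
    # 1. The fine-tuned LLaVA-7B adapter turns multimodal context into cues.
    cues = query_llava_adapter(user_id=user_id, task=task, image=image)  # hypothetical helper

    # 2. Dynamically retrieve task-relevant cases of past successes.
    cases = retrieve_cases(case_db, query=task, top_k=3)  # hypothetical helper

    # 3. Assemble the prompt and ask GPT-4o for an action plan.
    prompt = (f"Task: {task}\n"
              f"Cues from adapter:\n{cues}\n"
              f"Relevant past cases:\n{cases}\n"
              "Generate a step-by-step action plan.")
    plan = query_gpt4o(prompt)  # hypothetical helper wrapping the GPT-4o API

    # 4. Iterative dialogue: refine until the user is satisfied or rounds run out.
    for _ in range(max_rounds):
        feedback = ask_user_for_feedback(plan)  # hypothetical UI hook; None means satisfied
        if feedback is None:
            break
        plan = query_gpt4o(prompt + f"\nUser feedback: {feedback}\nRevise the plan.")
    return plan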


For more details, please refer to Section A. Fine-Tuning LLaVA with User Reminders and Section B. Case-Based Learning for Enhanced GPT Prompting in our paper.

Experiments

We assess the performance of AlignBot from two key perspectives: the quality of task plans generated by GPT-4o and the quality of cues produced by the fine-tuned LLaVA. The baselines include Vanilla GPT-4o, GPT-4o + Raw Reminder, and GPT-4o + Fine-tuned LLaMA2-7B.

We implement AlignBot on a real robotic system, comprising an AgileX-based mobile platform and a UFactory XArm robotic arm. The system employs the ACT algorithm alongside the AnyGrasp method for manipulation. In this setup, the robot is tasked with placing items from a countertop into a drawer, ensuring that the task plan generated is consistently aligned with user reminders.

Results and Discussion

Table 1 shows the comparison between AlignBot and the three baselines. AlignBot outperforms all baselines, demonstrating superior performance in aligning task planning with customized user reminders. Compared to Vanilla GPT-4o, AlignBot addresses significant limitations in semantic grounding, such as recognizing object states (e.g., whether a drawer is open or closed). Vanilla GPT-4o tends to make random decisions in tasks due to missing user-specific information, while AlignBot, through its database of user reminders and fine-tuned cues, delivers much more accurate and effective task plans. When compared to GPT-4o + Raw Reminder, AlignBot again proves more effective, as GPT-4o + Raw Reminder often struggles to filter and apply the correct historical data for the current scenario, resulting in planning failures. The context-specific cues generated by AlignBot's adapter are essential for accurate task planning, particularly in complex environments. Finally, while GPT-4o + Fine-tuned LLaMA can remember user preferences and correct some common planning errors, its lack of multimodal capabilities limits its understanding of the scene and the arrangement of objects, leading to less reliable cues. AlignBot's ability to integrate multimodal inputs ensures better scene awareness and task alignment, demonstrating the necessity of multimodal adapters in improving task planning.


Our AlignBot demonstrates superior performance in cue generation compared to the baselines. It receives high ratings from volunteers for its ability to combine task descriptions with image context, which allows it to deliver highly effective, scene-based cues. It also excels at remembering personalized user preferences, making its responses more accurate and tailored to individual needs, which significantly enhances its utility in real-world applications. When compared to Fine-tuned LLaMA, AlignBot's multimodal capabilities are a clear advantage. While Fine-tuned LLaMA can remember user preferences, its single-modal nature limits its effectiveness, especially in tasks requiring visual input. Without image guidance, Fine-tuned LLaMA randomly selects from past reminders and fails to provide accurate or context-specific cues, underscoring the importance of multimodal inputs in personalized tasks. Similarly, LLaVA without Fine-tuning further highlights AlignBot's strengths, as it struggles with image-based reasoning, often providing vague or generic descriptions of task-relevant items. LLaVA without Fine-tuning lacks the ability to interpret detailed visual information or infer operational errors, making it far less effective in guiding robots through real-world tasks. Additionally, without fine-tuning to incorporate user-specific data, its responses remain generic and less relevant, reinforcing the importance of AlignBot's tailored and fine-tuned approach to cue generation.



Conclusion


To summarize, AlignBot demonstrates significant improvements in aligning robotic task planning with diverse, multimodal user reminders. By fine-tuning LLaVA-7B as an adapter for GPT-4o, the framework effectively handles the challenges posed by the limited quantity, diversity, and multimodal nature of user reminders in household environments. The combination of instruction-formatted cues and case-based learning enables AlignBot to generate more accurate task plans. Empirical results show that AlignBot outperforms all baselines.

BibTeX


@misc{zhaxizhuoma2024_icra_alignbot,
    title={AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots}, 
    author={Zhaxizhuoma and Pengan Chen and Ziniu Wu and Jiawei Sun and Dong Wang and Peng Zhou and Nieqing Cao and Yan Ding and Bin Zhao and Xuelong Li},
    year={2024},
    eprint={2409.11905},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2409.11905}, 
}
            

Prompt Design for Plan Monitor and Knowledge Acquirer

The realization of our plan monitor relies on repeatedly querying GPT-3 for each action using the following prompt.
Prompt 1: Is it suitable for a robot to [Perform-Action], if [Situation]?

The following templates are used to query an LLM to acquire commonsense knowledge about action effects and object selection.
Prompt 2: Is it suitable for a robot to [Perform-Action-with-Object]?
Prompt 3: There are some objects, such as [Object-1], [Object-2], ..., and [Object-N]. Which is the most suitable for [Current-Task], if [Situation]?
Please note that these prompts are zero-shot.
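
As an illustration, Prompt 1 could be instantiated and sent to the planning LM roughly as follows. This is a minimal sketch assuming the legacy OpenAI completions API and the text-davinci-003 model listed in the hyperparameters below; the actual query code may differ.

import openai

def is_action_suitable(action, situation):
    # Instantiate Prompt 1 and query the LLM zero-shot (no in-context examples).
    prompt = f"Is it suitable for a robot to {action}, if {situation}?"
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=16,
        temperature=0.0,  # deterministic yes/no-style answer
    )
    return response["choices"][0]["text"].strip()

# Example: checking one action against a reported situation
print(is_action_suitable("pour wine into the glass", "the wine glass is broken"))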

Situation Dataset

Users can download both the MTurk questionnaire and the situation dataset at the top of the website. The dataset is provided as a spreadsheet file comprising 12 separate sheets, each representing situations for a distinct task; the sheet name corresponds to the task name. Each sheet consists of five columns.

The situations’ descriptions provided by the MTurkers are in Column A. Column B details the corresponding steps where the described situation occurs. Column C is the index of distinguishable situations, while Column D provides descriptions of these situations. Finally, Column E indicates the number of distinguishable situations.
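
For reference, the sheets can be loaded programmatically, for example with pandas. The filename situation_dataset.xlsx below is hypothetical (substitute the file downloaded from the link above), and the sketch assumes the dataset is distributed as an Excel-style workbook with one sheet per task and no header row.

import pandas as pd

# Hypothetical filename; use the file downloaded from the top of the page.
sheets = pd.read_excel("situation_dataset.xlsx", sheet_name=None, header=None)

for task_name, df in sheets.items():    # one sheet per task (12 in total)
    situations = df[0]                  # Column A: situation descriptions from MTurkers
    steps = df[1]                       # Column B: step where each situation occurs
    print(task_name, len(situations), "situations")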

Experiments

[Figure] The task completion percentage of COWP (ours) and five baseline methods across 12 different tasks. The x-axis shows the task name and the y-axis the task completion percentage; each value is an average over 150 trials. Tasks are sorted by COWP's performance, with its best performance on the far left.

Shown below are the prompts and hyperparameters of the baselines used in the evaluation, where [] denotes a placeholder.

from actions import walk <obj>, run <obj>, grab <obj>, switchon <obj>, switchoff <obj>, open <obj>, close <obj>, find <obj>, putin <obj> <obj>, fill <obj> <obj>, clean <obj>, wash <obj>
objects = ['wine', 'bucket', 'dish bowl', 'chips', 'sponge', 'snack', 'kitchencabinet', 'wastecontainer', 'cleaning bottle', 'drinking glass', 'kitchen cabinet', 'dish', 'coffee table', 'blender', 'dining table', 'mug', 'coffee maker', 'dehumidifier', 'air fryer', 'water filter', 'tea', 'dining', 'coffee filter', 'colander', 'orange juice', 'condiment bottle', 'watermelon juice', 'mat', 'closet', 'beer', 'garbagecan', 'cutlery knife', 'ice cream', 'sauce', 'table_1', 'oven tray', 'refrigerator', 'table cloth', 'steak', 'cupboard', 'wineglass', 'kitchen', 'cutting board', 'noodles', 'kitchen table', 'wooden chopstick', 'frying pan', 'cloth napkin', 'piano bench', 'toaster']
def put_the_wine_glass_in_the_kitchen_cabinet():
      # 0: walk to kitchen
      walk('kitchen')
      # 1: find wine glass
      find('wineglass')
      # 2: grab wine glass
      assert('close' to 'wineglass')
      else: find('wineglass')
      grab('wineglass')
      # 3: find kitchen cabinet
      find('kitchencabinet')
      # 4: open kitchen cabinet
      assert('close' to 'kitchencabinet' )
      else: find('kitchencabinet')
      assert('kitchencabinet' is 'closed' )
      else: close('kitchencabinet')
      open('kitchencabinet')
      # 5: put wine glass in kitchen cabinet
      assert('wineglass' in 'hands' )
      else: find('wineglass')
      else: grab('wineglass')
      assert('close' to 'kitchencabinet' )
      else: find('kitchencabinet')
      assert('kitchencabinet' is 'opened' )
      else: open('kitchencabinet')
      putin('wineglass', 'kitchencabinet')
      # 6: close kitchen cabinet
      assert('close' to 'kitchencabinet' )
      else: find('kitchencabinet')
      assert('kitchencabinet' is 'opened' )
      else: open('kitchencabinet')
      close('kitchencabinet')
      # 7: Done

from actions import walk <obj>, run <obj>, grab <obj>, switchon <obj>, switchoff <obj>, open <obj>, close <obj>, find <obj>, putin <obj> <obj>, fill <obj> <obj>, clean <obj>, wash <obj>
objects = ['wine', 'bucket', 'dish bowl', 'chips', 'sponge', 'snack', 'kitchencabinet', 'cleaning bottle', 'drinking glass', 'kitchen cabinet', 'dish', 'coffee table', 'blender', 'dining table', 'mug', 'coffee maker', 'dehumidifier', 'air fryer', 'water filter', 'tea', 'dining', 'coffee filter', 'colander', 'orange juice', 'condiment bottle', 'watermelon juice', 'mat', 'closet', 'beer', 'garbagecan_1', 'cutlery knife', 'ice cream', 'sauce', 'table_1', 'oven tray', 'refrigerator', 'table cloth', 'steak', 'cupboard', 'wineglass', 'kitchen', 'cutting board', 'noodles', 'kitchen table', 'wooden chopstick', 'frying pan', 'cloth napkin', 'garbagecan_2', 'piano bench', 'toaster']
def throw_away_the_lime, where garbagecan_1 is broken():
      # 0: find lime
      find('lime')
      # 1: grab lime
      assert('close' to 'lime')
      else: find('lime')
      grab('lime')
      # 2: find garbage can
      find('garbagecan_1')
      assert('broken' to 'garbagecan_1')
      else: find('garbagecan_2')
      # 3: open garbage can
      assert('close' to 'garbagecan_2' )
      else: find('garbagecan_2')
      assert('garbagecan_2' is 'closed' )
      else: close('garbagecan_2')
      open('garbagecan_2')
      # 4: put lime in garbage can
      assert('lime' in 'hands' )
      else: find('lime')
      else: grab('lime')
      assert('close' to 'garbagecan_2' )
      else: find('garbagecan_2')
      assert('garbagecan_2' is 'opened' )
      else: open('garbagecan_2')
      putin('lime', 'garbagecan_2')
      # 5: close garbage can
      assert('close' to 'garbagecan_2' )
      else: find('garbagecan_2')
      assert('garbagecan_2' is 'opened' )
      else: open('garbagecan_2')
      close('garbagecan_2')
      # 6: Done


from actions import walk <obj>, run <obj>, grab <obj>, switchon <obj>, switchoff <obj>, open <obj>, close <obj>, find <obj>, putin <obj> <obj>, fill <obj> <obj>, clean <obj>, wash <obj>
objects = ['wine', 'bucket', 'dish bowl', 'chips', 'sponge', 'snack', 'kitchencabinet', 'wastecontainer', 'cleaning bottle', 'drinking glass', 'kitchen cabinet', 'dish', 'coffee table', 'blender', 'dining table', 'mug', 'coffee maker', 'dehumidifier', 'air fryer', 'water filter', 'tea', 'dining', 'coffee filter', 'colander', 'orange juice', 'condiment bottle', 'watermelon juice', 'mat', 'closet', 'beer', 'garbagecan', 'cutlery knife', 'ice cream', 'sauce', 'table_1', 'oven tray', 'refrigerator', 'washingsponge', 'table cloth', 'steak', 'cupboard', 'wineglass', 'kitchen', 'cutting board', 'noodles', 'kitchen table', 'wooden chopstick', 'frying pan', 'cloth napkin', 'piano bench', 'toaster', 'dishwashingliquid', 'washingcloth']
def wash_mug, where washingsponge is missing():
      # 0: walk to kitchen
      walk('kitchen')
      # 1: find sink
      find('sink')
      # 2: turn on faucet
      find('faucet')
      assert('close' to 'faucet' )
      else: find('faucet')
      assert('faucet' is 'switchoff' )
      else: switchoff('faucet')
      switchon('faucet')
      # 3: put mug under water
      find('mug')
      assert('close' to 'mug')
      else: find('mug')
      grab('mug')
      find('sink')
      assert('mug' in 'hands' )
      else: find('mug')
      else: grab('mug')
      assert('close' to 'sink' )
      else: find('sink')
      putin('mug', 'sink')
      # 4: grab dishwashing liquid
      find('dishwashingliquid')
      assert('close' to 'dishwashingliquid')
      else: find('dishwashingliquid')
      grab('dishwashingliquid')
      # 5: put dishwashing liquid on mug
      find('sink')
      assert('dishwashingliquid' in 'hands' )
      else: find('dishwashingliquid')
      else: grab('dishwashingliquid')
      assert('close' to 'sink' )
      else: find('sink')
      putin('dishwashingliquid', 'sink')
      # 6: grab washingsponge
      find('washingsponge')
      assert('missing' to 'washingsponge')
      else: find('washingcloth')
      grab('washingcloth')
      # 7: start scrubbing mug
      find('sink')
      assert('washingcloth' in 'hands' )
      else: find('washingcloth')
      else: grab('washingcloth')
      assert('close' to 'sink' )
      else: find('sink')
      putin('washingcloth', 'sink')
      # 8: rinse mug off with water
      # 9: dry mug with towel
      # 10: Done

from actions import walk <obj>, run <obj>, grab <obj>, switchon <obj>, switchoff <obj>, open <obj>, close <obj>, find <obj>, putin <obj> <obj>, fill <obj> <obj>, clean <obj>, wash <obj>
objects = [objects in the environment]
def [task description], where [situation] ():

Here are actions that can be executed by the robot: walk, run, grab, switch on, switch off, open, close, find, put, fill, clean, wash

Human: put the wine glass in the kitchen cabinet
Scene: ['orchid', 'sink', 'peach', 'mouse', 'oven tray', 'hanger', 'clothes pants', 'cupcake', 'power socket', 'bell pepper', 'slippers', 'toaster', 'closet', 'floor', 'pillow', 'door jamb', 'light switch', 'faucet', 'pie', 'bookshelf', 'cutlery fork', 'condiment shaker', 'bathroom counter', 'keyboard', 'cutlery knife', 'bananas', 'washing machine', 'box', 'ceiling', 'creamy buns', 'bed', 'crackers', 'bathroom', 'stove', 'paper', 'condiment bottle', 'lime', 'stove fan', 'washing sponge', 'deodorant', 'radio', 'kitchen', 'toilet', 'fridge', 'bedroom', 'dishwashing liquid', 'kitchen cabinet', 'remote control', 'folder', 'bar soap', 'bench', 'coffee pot', 'frying pan', 'curtains', 'desk', 'door', 'toothpaste', 'computer', 'painkillers', 'towel rack', 'cereal', 'wall', 'wall picture frame', 'bathtub', 'dish bowl', 'living room', 'cabinet', 'ceiling lamp', 'clothes pile', 'cpu screen', 'plum', 'photo frame', 'stall', 'table lamp', 'rug', 'toothbrush', 'coffee table', 'plate', 'water glass', 'chocolate syrup', 'window', 'bathroom cabinet', 'face cream', 'whipped cream', 'closet drawer', 'kitchen counter', 'tv', 'microwave', 'mug', 'perfume', 'salmon', 'candy bar', 'kitchen table', 'coffee maker', 'wall lamp', 'bread slice', 'towel', 'mouse mat', 'apple', 'cellphone', 'wall shelf', 'book', 'sofa', 'chips', 'wall phone', 'kitchen counter drawer', 'clothes shirt', 'candle', 'hair product', 'wine glass', 'garbage can', 'nightstand', 'clock', 'tv stand', 'chair']
Robot:
  0: walk to kitchen
  1: find wine glass
  2: grab wine glass
  3: find kitchen cabinet
  4: open kitchen cabinet
  5: put wine glass in kitchen cabinet
  6: close kitchen cabinet
  7: Done

Human: throw away the lime
Scene: ['garbage can_1 is broken', 'orchid', 'sink', 'peach', 'mouse', 'garbage can_1', 'oven tray', 'hanger', 'clothes pants', 'cupcake', 'power socket', 'bell pepper', 'slippers', 'toaster', 'closet', 'floor', 'pillow', 'door jamb', 'light switch', 'faucet', 'pie', 'bookshelf', 'cutlery fork', 'condiment shaker', 'bathroom counter', 'keyboard', 'cutlery knife', 'bananas', 'washing machine', 'box', 'ceiling', 'creamy buns', 'bed', 'crackers', 'bathroom', 'stove', 'paper', 'condiment bottle', 'lime', 'stove fan', 'washing sponge', 'deodorant', 'radio', 'kitchen', 'toilet', 'fridge', 'bedroom', 'dishwashing liquid', 'kitchen cabinet', 'remote control', 'folder', 'bar soap', 'bench', 'coffee pot', 'frying pan', 'curtains', 'desk', 'door', 'toothpaste', 'computer', 'painkillers', 'towel rack', 'cereal', 'wall', 'wall picture frame', 'bathtub', 'dish bowl', 'living room', 'cabinet', 'ceiling lamp', 'clothes pile', 'cpu screen', 'plum', 'photo frame', 'stall', 'table lamp', 'rug', 'toothbrush', 'coffee table', 'plate', 'water glass', 'chocolate syrup', 'window', 'bathroom cabinet', 'face cream', 'whipped cream', 'closet drawer', 'kitchen counter', 'tv', 'microwave', 'mug', 'perfume', 'salmon', 'candy bar', 'kitchen table', 'coffee maker', 'wall lamp', 'bread slice', 'towel', 'mouse mat', 'apple', 'cellphone', 'wall shelf', 'book', 'sofa', 'chips', 'wall phone', 'kitchen counter drawer', 'clothes shirt', 'candle', 'hair product', 'wine glass', 'garbage can_2', 'nightstand', 'clock', 'tv stand', 'chair']
Robot:
  0: find lime
  1: grab lime
  2: find garbage can_2
  3: open garbage can_2
  4: put lime in garbage can_2
  5: close garbage can_2
  6: Done

Human: wash mug
Scene: ['washing sponge is missing', 'orchid', 'sink', 'peach', 'mouse', 'oven tray', 'hanger', 'clothes pants', 'cupcake', 'power socket', 'bell pepper', 'slippers', 'toaster', 'closet', 'floor', 'pillow', 'door jamb', 'light switch', 'faucet', 'pie', 'bookshelf', 'cutlery fork', 'condiment shaker', 'bathroom counter', 'keyboard', 'cutlery knife', 'bananas', 'washing machine', 'box', 'ceiling', 'creamy buns', 'bed', 'crackers', 'bathroom', 'stove', 'paper', 'condiment bottle', 'lime', 'stove fan', 'washing sponge', 'deodorant', 'radio', 'kitchen', 'toilet', 'fridge', 'bedroom', 'dishwashing liquid', 'kitchen cabinet', 'remote control', 'folder', 'bar soap', 'bench', 'coffee pot', 'frying pan', 'curtains', 'desk', 'door', 'toothpaste', 'computer', 'painkillers', 'towel rack', 'cereal', 'wall', 'wall picture frame', 'bathtub', 'dish bowl', 'living room', 'cabinet', 'ceiling lamp', 'clothes pile', 'cpu screen', 'plum', 'photo frame', 'stall', 'table lamp', 'rug', 'toothbrush', 'coffee table', 'plate', 'water glass', 'chocolate syrup', 'window', 'bathroom cabinet', 'face cream', 'whipped cream', 'closet drawer', 'kitchen counter', 'tv', 'microwave', 'mug', 'perfume', 'salmon', 'candy bar', 'kitchen table', 'coffee maker', 'wall lamp', 'bread slice', 'towel', 'mouse mat', 'apple', 'cellphone', 'wall shelf', 'book', 'sofa', 'chips', 'wall phone', 'kitchen counter drawer', 'clothes shirt', 'candle', 'hair product', 'wine glass', 'garbage can', 'nightstand', 'clock', 'tv stand', 'chair']
Robot:
  0: walk to kitchen
  1: find sink
  2: switch on faucet
  3: put mug in sink
  4: grab dishwashing liquid
  5: put dishwashing liquid in sink
  6: grab washing cloth
  7: put washing cloth in sink
  8: wash mug
  9: Done


Scene: [situation, objects in the environment]
Human: [task description]
Robot:

planning_lm_id = 'text-davinci-003'
translation_lm_id = 'stsb-roberta-large'
MAX_STEPS = 12  # maximum number of steps to be generated
CUTOFF_THRESHOLD = 0.5  # early stopping threshold based on matching score and likelihood score
P = 0.5  # hyperparameter for early stopping heuristic to detect whether Planning LM believes the plan is finished
BETA = 0.3  # weighting coefficient used to rank generated samples
sampling_params = {
    "max_tokens": 256,
    "temperature": 0.9,
    "top_p": 0.9,
    "n": 10,
    "logprobs": 1,
    "presence_penalty": 0.5,
    "frequency_penalty": 0.3,
    "stop": '\n'
}
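
These hyperparameters follow the legacy OpenAI completions API. As a rough sketch of how they might be used, the planning LM can be sampled with sampling_params and the n returned candidates ranked afterwards (e.g., by combining a semantic matching score from the translation LM with the average log-probability, weighted by BETA); the call below is illustrative, not the exact evaluation code.

import openai

def sample_next_steps(prompt):
    # Query the planning LM for n candidate next steps using the parameters above.
    response = openai.Completion.create(
        model=planning_lm_id,   # 'text-davinci-003'
        prompt=prompt,
        **sampling_params,
    )
    # Each choice is one candidate step; its logprobs can be combined with a
    # matching score against admissible actions to rank the candidates.
    return [choice["text"].strip() for choice in response["choices"]]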
                        

Conclusion

In this paper, we develop a Large Language Model-based open-world task planning system for robots, called COWP, aimed at robust task planning and situation handling in open worlds. The novelty of COWP lies in the integration of a classical, knowledge-based task planning system and a pretrained language model for commonsense knowledge acquisition. The marriage of the two enables COWP to ground domain-independent commonsense knowledge in specific task planning problems. To evaluate COWP systematically, we collected a situation dataset that includes 1,085 situations in a dining domain. Experimental results suggest that COWP performed better than existing task planners developed for closed-world and open-world scenarios. We also provide a demonstration of COWP on a mobile manipulator working on delivery tasks, which offers practitioners a reference for applying COWP in real-world applications.

BibTeX


    @article{ding2023integrating,
      title={Integrating Action Knowledge and LLMs for Task Planning and Situation Handling in Open Worlds},
      author={Ding, Yan and Zhang, Xiaohan and Amiri, Saeid and Cao, Nieqing and Yang, Hao and Kaminski, Andy and Esselink, Chad and Zhang, Shiqi},
      journal={Autonomous Robots},
      year={2023}
    }