Personal rating: 6.5 out of 10
Alibaba's open-source record is well known in the industry, mostly for show, but the fact that they answer at least some of the issues is worth 4 points. Another point goes to the quality of the paper itself: it is indeed written more clearly than Ferret-UI, yet because the paper gives no detailed experimental analysis of its individual components, it only merits a quick read. In fact, many of the formulas do not quite match the prompts given in the paper, although the code implementation is quite detailed. Overall it is a rough piece of work, but one worth reading and worth trying.
Quoting a question and answer from one of the repo's issues:
Hi,
I am currently reading your paper on Mobile-Agent-v2 and am impressed with the multi-agent architecture proposed.
While going through the paper, I noticed the detailed explanation of the visual perception module's design, including the use of the text recognition tool (ConvNextViT-document), icon recognition tool (GroundingDINO), and icon description tool (Qwen-VL-Int4).
However, I couldn't find any specific data regarding the accuracy of this module in recognizing, localizing, and describing UI elements.
Could you please provide some insights into the performance of the visual perception module?
Thank you for your time and valuable insights!

Reply from the authors:
Hello. Since Mobile-Agent-v2 is the first framework to use OCR, SAM and VLLM for mobile icon detection, screen text recognition and mobile icon description, we cannot directly obtain performance from the conclusions of existing work. We have not done specific experiments to quantify the effects of these models. However, from our experience, the current performance of OCR and VLLM is overflowing. On the contrary, the performance of SAM for icon detection is still insufficient. To address this, the solution we adopted is oversampling, that is, using a lower confidence threshold to obtain a large number of icons, and the decision agent will do the screening in the final decision.
 

Experiment Setup

  • Setup: text recognition uses ConvNextViT-document, icon recognition uses GroundingDINO, and icon description uses Qwen-VL-Int4. The planning agent runs on GPT-4; the decision and reflection agents run on GPT-4V. Evaluation is dynamic: on both Harmony OS and Android, a range of instructions covering system apps, external apps, and multi-app scenarios is designed, with success rate, completion rate, decision accuracy, and reflection accuracy as metrics.
  • Results: compared with Mobile-Agent, Mobile-Agent-v2 improves the task completion rate significantly, especially on advanced instructions, and knowledge injection improves it further. Ablations show that the planning agent matters most to the overall framework, while the reflection agent and the memory unit are also indispensable. The multi-agent architecture gives Mobile-Agent-v2 a clear advantage on long operation sequences.
 

Abstract

Mobile device operation tasks are becoming a popular multimodal AI application scenario. Current multimodal large language models (MLLMs), limited by their training data, struggle to serve as operation assistants. MLLM-based agents, which extend their capabilities through tool invocation, are gradually being applied to this scenario. However, mobile device operation tasks involve two major navigation challenges: task progress navigation and focus content navigation. Existing single-agent architectures struggle with these because of overly long token sequences and interleaved text-image data formats. To address this, we propose Mobile-Agent-v2, a mobile device operation assistant architecture that achieves effective navigation through multi-agent collaboration. The architecture consists of three agents: a planning agent, a decision agent, and a reflection agent. The planning agent condenses the lengthy, interleaved image-text history of operations and screen summaries into a pure-text task progress, which is passed to the decision agent; this reduces context length and makes it easier for the decision agent to navigate the task progress. We also design a memory unit, updated by the decision agent as the task progresses, to retain the focus content. In addition, the reflection agent observes the outcome of each operation and corrects erroneous ones.
Experiments show that, compared with the single-agent Mobile-Agent architecture, Mobile-Agent-v2 improves the task completion rate by more than 30%. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
Planning agent
  • Condenses the operation history and screen summaries into a short, pure-text task progress
 
Decision agent
  • Updates the memory unit, which retains the focus content
Reflection agent
  • Observes the results of operations and corrects erroneous ones
 
Figure 1: Mobile device operation tasks require navigating focus content and task progress from history operation sequences, where the focus content comes from previous screens. As the number of operations increases, the length of the input sequences grows, making it extremely challenging for a single-agent architecture to manage these two types of navigation effectively.
 

Introduction

 
Multimodal large language models (MLLMs), represented by GPT-4V, have demonstrated remarkable capabilities across domains. With the rapid development of LLM-based agents, MLLM-based agents that use various visual perception tools to overcome the limitations of MLLMs in specific application scenarios have become a research hotspot.
Automated operation of mobile devices, as a practical multimodal application scenario, is becoming a major technological shift for AI smartphones. However, existing MLLMs face challenges here because of their limited abilities in screen recognition, operation, and localization. To address this, prior work uses MLLM-based agent architectures to give MLLMs the ability to perceive and operate mobile device UIs. For example, AppAgent works around the localization limitation by extracting clickable positions from the device's XML files, but its dependence on UI files limits its applicability to other platforms and devices. To remove the dependence on underlying UI files, Mobile-Agent proposes localization via visual perception tools: an MLLM perceives the screen and generates operations, and the visual perception tools determine where on the screen to perform them.
Mobile device operation tasks involve multi-step sequential processing. Starting from the initial screen, the operator must execute a series of consecutive operations until the instruction is completed. This process poses two major challenges. First, to plan the intent of the next operation, the current task progress must be inferred from the operation history. Second, some operations may require task-relevant information from historical screens; for example, writing a sports news article may require the match results queried earlier. We call such important information the focus content, which likewise must be retrieved from historical screens. However, as the task progresses, the lengthy, interleaved image-text history of operations and screens taken as input significantly degrades the effectiveness of navigation in a single-agent architecture.
This paper proposes Mobile-Agent-v2, a mobile device operation assistant that achieves effective navigation through multi-agent collaboration. It contains three agents: a planning agent, a decision agent, and a reflection agent. The planning agent generates the task progress from the operation history. To retain the focus content from historical screens, a memory unit is designed to record the relevant information. When generating an operation, the decision agent consults this unit, checks the current screen for focus content, and updates the memory accordingly. Because the decision agent cannot reflect on the previous screen itself, a reflection agent is designed to observe the screens before and after the decision agent's operation, judge whether the operation met expectations, and, if not, take measures to re-execute it.
The main contributions are as follows:
  1. We propose the multi-agent architecture Mobile-Agent-v2, which alleviates the navigation difficulties inherent to single-agent frameworks in mobile device operation tasks.
  2. We design a planning agent that generates task progress from the operation history, helping the decision agent generate operations effectively; we also design a memory unit and a reflection agent to address focus-content navigation and the lack of reflection capability. The memory unit is updated with focus content by the decision agent, and the reflection agent evaluates the decision agent's operations and generates remedial measures when they do not meet expectations.
  3. We dynamically evaluate Mobile-Agent-v2 across multiple operating systems, language environments, and applications. Experiments show significant performance improvements, and injecting manual operation knowledge can further enhance performance.
 

Related Work

Multi-agent applications
  • AutoGPT, BabyAGI: decompose a task into multiple sub-tasks and execute them sequentially with multiple agents
  • HiveMind: a multi-agent framework where each agent focuses on a specific type of reasoning (deductive, inductive, analogical) for complex reasoning tasks
  • Luo et al.: a multi-agent approach to the entity-alignment problem in multimodal knowledge-graph construction
  • Multi-agent frameworks applied to robot control and autonomous driving, showing strong task execution and collaboration across scenarios

LLM-based UI operation agents
  • WebGPT: an agent that interacts with web pages to answer complex questions, using a search engine and web-content extraction tools, with behavior optimized by reinforcement learning
  • Toolformer: introduces external tool calls (e.g., a calculator and a search engine) to strengthen LLMs on web operation tasks
  • AppAgent: the first to automate mobile app operation with LLMs, parsing the app's XML layout files to identify clickable UI elements and generate operation sequences. Limitation: heavy reliance on XML availability limits its generality across platforms and devices
  • Mobile-Agent: a vision-based approach that needs no XML files, locating UI elements with object detection and text recognition and generating operations with an LLM. Limitation: does not yet use a multi-agent architecture for mobile device operation tasks
 

Mobile-Agent-v2

 
Firstly, the planning agent updates the task progress, allowing the decision agent to navigate the progress of the current task. The decision agent then operates based on the current task progress, current screen state, and the reflection (if the last operation is erroneous). Subsequently, the reflection agent observes the screens before and after the operation to determine if the operation meets expectations.
 
 

Visual Perception Module

Even for state-of-the-art MLLMs, screen recognition remains challenging when handled end-to-end. We therefore integrate a visual perception module to improve screen recognition. The module uses three tools: a text recognition tool, an icon recognition tool, and an icon description tool. Feeding a screenshot into this module yields the text and icons present on the screen together with their coordinates. Schematically, $P_t = \mathrm{VP}(S_t)$, where $S_t$ is the screenshot at the $t$-th iteration and $P_t$ denotes the corresponding perception result.
Frankly, this part of the paper is written very roughly.
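To make the module concrete, here is a minimal sketch of the perception step under my reading of the paper; run_ocr, run_grounding_dino, describe_icon, box_center and crop_image are placeholder wrappers around the three tools, not the repo's actual functions.
```python
# Minimal sketch of the visual perception step (P_t). All helpers below are
# placeholders for the three tools named in the paper, not the repo's identifiers.
from dataclasses import dataclass

@dataclass
class PerceptionItem:
    center: tuple   # (x, y) pixel coordinates of the element's center
    content: str    # recognized text, or a natural-language icon description

def perceive_screen(screenshot_path: str, icon_conf_threshold: float = 0.2) -> list:
    """Return the on-screen text/icon elements with their coordinates."""
    items = []

    # 1) Text recognition: OCR yields (box, text) pairs.
    for box, text in run_ocr(screenshot_path):
        items.append(PerceptionItem(center=box_center(box), content=text))

    # 2) Icon recognition: GroundingDINO with a deliberately low confidence
    #    threshold ("oversampling"); the decision agent filters the results later.
    icon_boxes = run_grounding_dino(screenshot_path, prompt="icon",
                                    threshold=icon_conf_threshold)

    # 3) Icon description: a VLM (Qwen-VL-Int4) captions each cropped icon.
    for box in icon_boxes:
        crop = crop_image(screenshot_path, box)
        items.append(PerceptionItem(center=box_center(box),
                                    content=describe_icon(crop)))
    return items
```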
 

Memory Unit

The memory unit serves as a short-term memory module that is updated as the task progresses

Planning Agent

Let the operation produced by the decision agent at step $t$ be $O_t$. Before the decision agent produces its next decision, the planning agent observes the previous operation and updates the task progress from $TP_{t-1}$ to $TP_t$; the task progress records the sub-tasks that have already been completed. The updated task progress is then passed to the decision agent, which can thus see which parts of the task remain unfinished and decide the next operation. As shown in Figure 3, the planning agent's input consists of four parts: the user instruction $Ins$, the focus content $FC_{t-1}$ from the memory unit, the previous operation $O_{t-1}$, and the previous task progress $TP_{t-1}$; schematically,
$TP_t = \mathrm{PA}(Ins,\ FC_{t-1},\ O_{t-1},\ TP_{t-1})$
 
Building a reliable planning agent is not as easy as it might seem. Here I ran a quick test on Doubao and GPT-4o:
GPT-4o:
 
Doubao:
After verification, the flight information Doubao returned turned out to be fabricated.
 

Decision Agent

 
The decision agent's inputs (cf. Figure 3):
  • $Ins$: the user instruction
  • $TP_{t-1}$: the task progress from the previous step
  • $FC_{t-1}$: the focus content from the previous step
  • $RF_{t-1}$: the reflection result from the reflection agent at the previous step
  • $S_t$: the screenshot at step $t$
  • $P_t$: the perception result for the screenshot at step $t$
 
The decision agent's operation space (a small parsing sketch follows the list):
  • Open app: if the current page is the home screen, open the app named <app_name>
  • Tap: tap position (x, y)
  • Swipe: swipe from position (x1, y1) to position (x2, y2)
  • Type: if the keyboard is activated, type the string <text> into the input box
  • Home: return to the home screen
  • Stop: if the decision agent judges that all requirements of the instruction have been completed, it can end the whole operation process with this action
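As an illustration of how compact this operation space is, here is a hedged sketch of mapping such action strings to ADB commands; the regexes and the execute_action / adb helpers are mine, not the repo's.
```python
# Sketch: parse one textual action from the decision agent and run it via ADB.
import re
import subprocess

def adb(*args: str) -> None:
    subprocess.run(["adb", "shell", *args], check=True)

def execute_action(action: str) -> bool:
    """Returns False when the agent chooses Stop."""
    if action.startswith("Open app"):
        app = re.search(r"\((.*?)\)", action).group(1)
        # Opening an app is normally done by locating its name on the desktop
        # via the perception module; here we only record the intent.
        print(f"open app: {app}")
    elif action.startswith("Tap"):
        x, y = map(int, re.findall(r"\d+", action)[:2])
        adb("input", "tap", str(x), str(y))
    elif action.startswith("Swipe"):
        x1, y1, x2, y2 = map(int, re.findall(r"\d+", action)[:4])
        adb("input", "swipe", str(x1), str(y1), str(x2), str(y2))
    elif action.startswith("Type"):
        text = re.search(r"\((.*)\)", action).group(1)
        adb("input", "text", text.replace(" ", "%s"))  # ADB encodes spaces as %s
    elif action.startswith("Home"):
        adb("input", "keyevent", "KEYCODE_HOME")
    elif action.startswith("Stop"):
        return False
    return True
```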
 

Reflection Agent

Even with the visual perception module, Mobile-Agent-v2 can still produce unexpected operations; in some cases the MLLMs hallucinate severely. We therefore design a reflection agent that observes the screen states before and after an operation to decide whether the operation meets expectations.
As shown in Figure 3, after each operation the reflection agent produces one of three reflection results: harmful operation, ineffective operation, or correct operation.
  • Harmful operation: the operation takes the phone to a page irrelevant to the task, e.g., the agent meant to contact A through the messaging app but opened B's contact page instead.
  • Ineffective operation: the page does not change before and after the operation, e.g., the agent meant to tap an icon but tapped the wrong position.
  • Correct operation: the operation matches the decision agent's expectation and moves the task toward completing the user's request.
If an operation is judged harmful, the page is restored to the state before that operation. If it is ineffective, the page stays as it is. Neither harmful nor ineffective operations are recorded in the operation history, so that the agents will not later condition on them. Correct operations are recorded in the operation history.
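A minimal sketch of that decision logic, assuming placeholder names (press_back, history, last_step) rather than the repo's identifiers:
```python
def handle_reflection(outcome: str, last_step: dict, history: list) -> None:
    """outcome is 'A' (correct), 'B' (harmful) or 'C' (ineffective), as in the prompt."""
    if outcome == "A":
        # Correct operation: record it so later planning/decision steps can use it.
        history.append(last_step)
    elif outcome == "B":
        # Harmful operation: roll the page back to the state before the operation
        # and do NOT record the operation.
        press_back()
    elif outcome == "C":
        # Ineffective operation: the page did not change; keep the current state
        # and simply do not record the operation.
        pass
```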
 

Experiments

Visual Perception Module:
  • OCR: ConvNextViT-document
  • Icon recognition: GroundingDINO
  • Icon description: Qwen-VL-Int4
 
MLLMs:
  • Planning agent: it does not require screen perception, so GPT-4 is used
  • Decision agent: it must decide based on the screen, so GPT-4V is used
  • Reflection agent: it must judge based on the screen, so GPT-4V is used
 

Evaluation Method

Evaluation scope:

Two operating systems: Harmony OS and Android (which can be operated through ADB)
Two languages: English and non-English
5 system apps and 5 popular external apps; for each app, two basic instructions and two advanced instructions are designed. Basic instructions are simple operations visible directly on the app's page, while advanced instructions are relatively more complex.
Cross-app operation is also evaluated, with two basic and two advanced cross-app instructions.
In total, the English and non-English scenarios together contain 88 task instructions: 40 for system apps, 40 for external apps, and 8 cross-app.

Evaluation Metrics

Success rate (SR): an instruction counts as successful only when all of its requirements are completed; the success rate is the number of successful instructions divided by the total number of instructions.
Completion rate (CR): even when a hard instruction cannot be finished, the operations that were executed correctly along the way still matter; the completion rate is the number of correctly executed operations divided by the number of ground-truth operations.
Decision accuracy (DA): the number of correct decisions divided by the total number of decisions.
Reflection accuracy (RA): the number of correct reflections divided by the total number of reflections.
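The paper does not release an evaluation script, so the following is only my own sketch of the four metrics as defined above; the task-record fields and whether DA/RA are pooled across tasks or averaged per task are assumptions.
```python
def metrics(tasks: list) -> dict:
    """Each task dict is assumed to look like:
    {'success': bool, 'correct_ops': int, 'gt_ops': int,
     'correct_decisions': int, 'decisions': int,
     'correct_reflections': int, 'reflections': int}"""
    return {
        "SR": sum(t["success"] for t in tasks) / len(tasks),
        "CR": sum(t["correct_ops"] for t in tasks) / sum(t["gt_ops"] for t in tasks),
        "DA": sum(t["correct_decisions"] for t in tasks) / sum(t["decisions"] for t in tasks),
        "RA": sum(t["correct_reflections"] for t in tasks) / sum(t["reflections"] for t in tasks),
    }
```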
 
 
Implementation details: the random seed is fixed and the temperature is set to 0. In addition, the authors introduce a mechanism called knowledge injection: operation hints related to the user instruction. Knowledge is injected only when the agents fail to complete an instruction; when an instruction can be completed without help, nothing is injected.
 
 

Results

 
On both English and non-English tasks, Mobile-Agent-v2 clearly improves over the first-generation Mobile-Agent, and knowledge injection helps further.
GPT-4V is still far ahead of Qwen-VL-Max. The evaluation here uses single-step SR and DA, because some models do not support sequential (multi-image) input. "GPT-4V w/o agent" means using GPT-4V alone as the operation assistant, dropping the multi-agent structure in front of it. The second table shows which agent matters most; reflection contributes the least.
 
A complete execution example
 
 

System Prompt Examples

Planning Agent


The prompt for the planning agent during the first operation

System
You are a helpful AI mobile phone operating assistant.
User
### Background ###
There is an user’s instruction which is: {User’s instruction}. You are a mobile phone operating assistant and are operating the user’s mobile phone.
### Hint ###
There are hints to help you complete the user’s instructions. The hints are as follow: If you want to tap an icon of an app, use the action "Open app"
### Current operation ###
To complete the requirements of user’s instruction, you have performed an operation. Your operation thought and action of this operation are as follows: Operation thought: {Last operation thought} Operation action: {Last operation}
### Response requirements ###
Now you need to combine all of the above to generate the "Completed contents". Completed contents is a general summary of the current contents that have been completed. You need to first focus on the requirements of user’s instruction, and then summarize the contents that have been completed.
### Output format ###
Your output format is:
### Completed contents ###
Generated Completed contents. Don’t output the purpose of any operation. Just summarize the contents that have been actually completed in the ### Current operation ###. (Please use English to output)
 

The prompt for the planning agent during subsequent operations.

System You are a helpful AI mobile phone operating assistant.
User
### Background ###
There is an user’s instruction which is: {User’s instruction}. You are a mobile phone operating assistant and are operating the user’s mobile phone.
### Hint ###
There are hints to help you complete the user’s instructions. The hints are as follow: If you want to tap an icon of an app, use the action "Open app"
### History operations ###
To complete the requirements of user’s instruction, you have performed a series of operations. These operations are as follow: Step-1: [Operation thought: {operation thought 1}; Operation action: {operation 1}] Step-2: [Operation thought: {operation thought 2}; Operation action: {operation 2}] ......
### Progress thinking ###
After completing the history operations, you have the following thoughts about the progress of user’s instruction completion: Completed contents: {Last "Completed contents"}
### Response requirements ###
Now you need to update the "Completed contents". Completed contents is a general summary of the current contents that have been completed based on the ### History operations ###.
### Output format ###
Your output format is:
### Completed contents ###
Updated Completed contents. Don’t output the purpose of any operation. Just summarize the contents that have been actually completed in the ### History operations ###.
 

Decision Agent's Prompt

The reflection result is not shown in this prompt; my guess is that when reflection judges an operation unsuccessful, the process simply restarts from the previous step. I also could not find the Instruction in it.
System
You are a helpful AI mobile phone operating assistant. You need to help me operate the phone to complete the user's instruction.
User
### Background ###
This image is a phone screenshot. Its width is {Lateral resolution} pixels and its height is {Vertical resolution} pixels. The user’s instruction is: {User’s instruction}.
 
### Screenshot information ###
In order to help you better perceive the content in this screenshot, we extract some information on the current screenshot through system files. This information consists of two parts: coordinates; content. The format of the coordinates is [x, y], x is the pixel from left to right and y is the pixel from top to bottom; the content is a text or an icon description respectively. The information is as follow: (x1, y1); text or icon: text content or icon description ......
### Keyboard status ###
We extract the keyboard status of the current screenshot and it is whether the keyboard of the current screenshot is activated. The keyboard status is as follow: The keyboard has not been activated and you can’t type. or The keyboard has been activated and you can type.
### Hint ###
There are hints to help you complete the user’s instructions. The hints are as follow: If you want to tap an icon of an app, use the action "Open app"
### History operations ###
Before reaching this page, some operations have been completed. You need to refer to the completed operations to decide the next operation. These operations are as follow: Step-1: [Operation thought: {operation thought 1}; Operation action: {operation 1}] ......
### Progress ###
After completing the history operations, you have the following thoughts about the progress of user’s instruction completion: Completed contents: {Task progress from planning agent}
### Response requirements ###
Now you need to combine all of the above to perform just one action on the current page. You must choose one of the six actions below: Open app (app name): If the current page is desktop, you can use this action to open the app named "app name" on the desktop. Tap (x, y): Tap the position (x, y) in current page. Swipe (x1, y1), (x2, y2): Swipe from position (x1, y1) to position (x2, y2). Unable to Type. You cannot use the action "Type" because the keyboard has not been activated. If you want to type, please first activate the keyboard by tapping on the input box on the screen. or Type (text): Type the "text" in the input box. Home: Return to home page. Stop: If you think all the requirements of user’s instruction have been completed and no further operation is required, you can choose this action to terminate the operation process.
### Output format ###
Your output consists of the following three parts: ### Thought ###
Think about the requirements that have been completed in previous operations and the requirements that need to be completed in the next one operation. ### Action ###
You can only choose one from the six actions above. Make sure that the coordinates or text in the "()".
### Operation ###
Please generate a brief natural language description for the operation in Action based on your Thought.
 
 
 

Reflection Agent

System
You are a helpful AI mobile phone operating assistant.
User
These images are two phone screenshots before and after an operation. Their widths are {Lateral resolution} pixels and their heights are {Vertical resolution} pixels. In order to help you better perceive the content in this screenshot, we extract some information on the current screenshot through system files. The information consists of two parts, consisting of format: coordinates; content. The format of the coordinates is (x, y), x is the pixel from left to right and y is the pixel from top to bottom; the content is a text or an icon description respectively The keyboard status is whether the keyboard of the current page is activated.
### Before the current operation ###
Screenshot information: (x1, y1); text or icon: text content or icon description ...... Keyboard status: The keyboard has not been activated. or The keyboard has been activated.
 
### After the current operation ###
Screenshot information: (x1, y1); text or icon: text content or icon description ...... Keyboard status: The keyboard has not been activated. or The keyboard has been activated.
 
### Current operation ###
The user’s instruction is: {User’s instruction}. You also need to note the following requirements: If you want to tap an icon of an app, use the action "Open app". In the process of completing the requirements of instruction, an operation is performed on the phone. Below are the details of this operation: Operation thought: {Last operation thought} Operation action: {Last operation}
### Response requirements ###
Now you need to output the following content based on the screenshots before and after the current operation: Whether the result of the "Operation action" meets your expectation of "Operation thought"? A: The result of the "Operation action" meets my expectation of "Operation thought". B: The "Operation action" results in a wrong page and I need to return to the previous page. C: The "Operation action" produces no changes.
### Output format ###
Your output format is:
### Thought ###
Your thought about the question
### Answer ###
A or B or C
 

A Quick Look at the Code

Main function structure

Step 1: load the models

The vision-language model (used for icon description) has two options:
  • qwen-vl-chat
  • qwen-vl-chat-int4
 
The OCR and icon-detection models are loaded as well.
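The repo loads these through ModelScope pipelines; the sketch below substitutes calls I am reasonably confident exist (Hugging Face for Qwen-VL-Chat-Int4, the groundingdino package, a ModelScope OCR pipeline). The config paths and the OCR model id are placeholders, not the repo's exact values.
```python
# Hedged sketch of the model-loading step; paths and ids marked below are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Icon description: Qwen-VL-Chat (the int4-quantized variant keeps GPU memory low).
vl_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat-Int4", trust_remote_code=True)
vl_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat-Int4",
                                                device_map="auto",
                                                trust_remote_code=True).eval()

# Icon detection: GroundingDINO (config / checkpoint paths are placeholders).
from groundingdino.util.inference import load_model
dino_model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")

# Text recognition: a ModelScope OCR pipeline (the paper uses ConvNextViT-document;
# OCR_MODEL_ID should be set to the corresponding ModelScope model id).
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
OCR_MODEL_ID = "..."  # placeholder, not the repo's exact id
ocr = pipeline(Tasks.ocr_recognition, model=OCR_MODEL_ID)
```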
 
 

Step 2: initialize parameters and paths

Per-step ("single") state: current thought, current summary, current action, current memory, current insight
Accumulated ("collection") state: thought history, summary history, action history
Create the temporary files and the screenshot folder
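In code this amounts to roughly the following; the variable names are mine and need not match the repo exactly.
```python
import os

# per-step ("single") state
thought, summary, action, memory, insight = "", "", "", "", ""
completed_requirements = ""          # task progress maintained by the planning agent

# accumulated ("collection") state
thought_history, summary_history, action_history = [], [], []

# working directories for screenshots and temporary files
os.makedirs("./screenshot", exist_ok=True)
os.makedirs("./temp", exist_ok=True)
```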
 

Main loop

 
When the iteration count equals 1, initialize:
  • the screenshot file
  • the perception result of the screenshot
  • the keyboard status
 
When the iteration count is greater than 1, run inference and update the state of each agent.
Here add_info is the hint telling the agents that, when they want to open an app, they should use the Open App action rather than a Tap.
 
 
After the operation, the state is updated:
  • perception_infos, width, height, keyboard
  • the screenshot files
 
 
Reflection step
 
If the reflection verdict is A, everything went as planned; otherwise (B, a harmful operation, or C, an ineffective operation) some step must have gone wrong. If it is B, the page can be rolled back.
When the result is A, the step succeeded, so the thought, summary, and action are all appended to their histories. In addition, get_process_prompt is called to plan the next step.
 
In the reflection step there is also a fresh chat_planning session. init_memory_chat is just an ordinary system prompt, the same idea as init_reflection_chat: simply a new conversation context. A condensed sketch of the whole loop follows.
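Putting the pieces together, here is a condensed sketch of the main loop as I read it; the helper names follow the walkthrough (get_action_prompt, get_reflect_prompt, get_process_prompt), but their signatures and the call_gpt4 / call_gpt4v / parse_action_output / archive_screenshot wrappers are assumptions, not the repo's code.
```python
# Condensed sketch of the main loop; helper names and signatures are assumptions.
instruction = "..."   # the user's instruction
add_info = 'If you want to tap an icon of an app, use the action "Open app"'

iteration = 0
while True:
    iteration += 1
    if iteration == 1:
        # first pass: capture the screen and build the initial perception state
        get_screenshot()
        perception_infos, width, height, keyboard = get_perception_infos("screenshot.jpg")

    # decision agent: choose the next action from screen + progress + history + hint
    prompt_action = get_action_prompt(instruction, perception_infos, width, height,
                                      keyboard, summary_history, action_history,
                                      completed_requirements, add_info)
    thought, summary, action = parse_action_output(
        call_gpt4v(prompt_action, images=["screenshot.jpg"]))
    execute_action(action)                      # ADB tap / swipe / type / home ...

    # keep the pre-operation state, then re-perceive the screen
    last_perception_infos, last_keyboard = perception_infos, keyboard
    last_screenshot = archive_screenshot("screenshot.jpg")  # placeholder: keep the old image
    get_screenshot()
    perception_infos, width, height, keyboard = get_perception_infos("screenshot.jpg")

    # reflection agent: compare the screens before and after the operation
    prompt_reflect = get_reflect_prompt(instruction, last_perception_infos, perception_infos,
                                        last_keyboard, keyboard, summary, action,
                                        width, height)
    reflect = call_gpt4v(prompt_reflect, images=[last_screenshot, "screenshot.jpg"])

    if "A" in reflect:
        # correct operation: record it and let the planning agent update the progress
        thought_history.append(thought)
        summary_history.append(summary)
        action_history.append(action)
        prompt_planning = get_process_prompt(instruction, thought_history, summary_history,
                                             action_history, completed_requirements, add_info)
        completed_requirements = call_gpt4(prompt_planning)
    elif "B" in reflect:
        back()   # harmful operation: roll back to the previous page, record nothing
    # "C": ineffective operation, keep the current page and record nothing

    if "Stop" in action:
        break
```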
 
 

Key helper functions

get_action_prompt | Decision Agent Prompt

 

get_memory_prompt

 

get_reflect_prompt

The input consists of the two screenshots and their corresponding descriptions (the parsed screen elements and the keyboard status).
The user's instruction is then appended.
Finally, the model is asked to make the A/B/C judgment.
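A simplified sketch of how such a prompt could be assembled; the actual template is the reflection-agent prompt quoted above, and this function signature (and the assumed (coordinate, content) shape of the perception entries) is an assumption.
```python
def get_reflect_prompt(instruction, infos_before, infos_after,
                       keyboard_before, keyboard_after, summary, action,
                       width, height):
    # each info entry is assumed to be a (coordinate, content) pair
    def fmt(infos):
        return "\n".join(f"{coord}; text or icon: {content}" for coord, content in infos)

    return (
        "These images are two phone screenshots before and after an operation. "
        f"Their widths are {width} pixels and their heights are {height} pixels.\n"
        "### Before the current operation ###\n"
        f"Screenshot information:\n{fmt(infos_before)}\nKeyboard status: {keyboard_before}\n"
        "### After the current operation ###\n"
        f"Screenshot information:\n{fmt(infos_after)}\nKeyboard status: {keyboard_after}\n"
        "### Current operation ###\n"
        f"The user's instruction is: {instruction}.\n"
        f"Operation thought: {summary}\nOperation action: {action}\n"
        "Answer A, B or C as described above."
    )
```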
 

add_response_two_image
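The walkthrough leaves this section empty; as I understand it, the helper appends a user turn carrying two images (the before/after screenshots) to the chat history in the OpenAI vision-message format. The sketch below is my guess at its shape, not the repo's exact code.
```python
import base64

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def add_response_two_image(role: str, prompt: str, chat: list, images: tuple) -> list:
    # append one chat turn whose content holds the text plus two base64-encoded images
    chat.append({
        "role": role,
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encode_image(images[0])}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encode_image(images[1])}"}},
        ],
    })
    return chat
```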

 

get_process_prompt

Inputs:
  • instruction: the user instruction
  • thought_history: the detailed version of the operation reasoning (not actually used in the prompt; only its length is checked)
  • summary_history: the condensed version of the operation reasoning
  • action_history: the operation history
  • completed_requirements: a string holding the task progress so far
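A hedged sketch of get_process_prompt that mirrors the two planning-agent prompts quoted earlier (one template for the first operation, another for subsequent operations); the exact wording and branching in the repo differ.
```python
def get_process_prompt(instruction, thought_history, summary_history, action_history,
                       completed_requirements, add_info):
    prompt = ("### Background ###\n"
              f"There is an user's instruction which is: {instruction}. "
              "You are a mobile phone operating assistant and are operating the user's mobile phone.\n")
    if add_info:
        prompt += f"### Hint ###\n{add_info}\n"

    if len(thought_history) == 1:
        # first operation: summarize just the single operation performed so far
        prompt += ("### Current operation ###\n"
                   f"Operation thought: {summary_history[-1]}\n"
                   f"Operation action: {action_history[-1]}\n")
    else:
        # subsequent operations: list every step plus the previous "Completed contents"
        prompt += "### History operations ###\n"
        for i, (s, a) in enumerate(zip(summary_history, action_history), 1):
            prompt += f"Step-{i}: [Operation thought: {s}; Operation action: {a}]\n"
        prompt += ("### Progress thinking ###\n"
                   f"Completed contents: {completed_requirements}\n")

    prompt += ("### Response requirements ###\n"
               'Now you need to update the "Completed contents" based on the operations above.\n'
               "### Output format ###\n### Completed contents ###\n")
    return prompt
```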
 