
AgentLM - Building an open source LLM agent


Practical Uses

1. Startups can build open source assistants for customer service at low cost.

2. Companies can build internal assistants to automate boring, repetitive tasks - HR assistant, scheduling assistant, AI secretary, etc.

3. Startups can build no-code & low-code platforms that enable business teams to build their own apps.

Pre-requisite definitions


Agent

An agent is an entity capable of perceiving its environment, making decisions, and taking actions.

LLM Agent

An autonomous system (or assistant) that uses an LLM as its central brain.

It has three core functionalities - planning, memory and the ability to use external APIs or tools to solve problems.

A few examples - Auto-GPT, GPT-Engineer, BabyAGI


Open source LLMs like LLaMA 2 and Vicuna perform very poorly as agents when compared to commercial LLMs like GPT-4 and GPT-3.5.


An open source LLM like LLaMA 2 is fine-tuned in a hybrid manner for agent use, with a lightweight dataset called AgentInstruct.

The AgentInstruct dataset of specialized interaction trajectories is mixed with high quality general data in a fixed ratio. This enhances agent capabilities while ensuring that the general capabilities of the LLM are not undermined.

How this approach is different

Some current approaches use either prompt engineering or a rule-based framework defined for each particular task. Both are slow and take a lot of manual effort to scale or generalise.

Other current approaches fine-tune on instruction datasets built for particular tasks. However, these methods compromise the general capabilities of the LLM, which are a must for use as an agent.

In this approach (AgentTuning), a hybrid fine-tuning strategy combines an agent instruction-tuning dataset (AgentInstruct) with a high quality open-source general instruction dataset. This combined tuning ensures specialization while maintaining the generalisation capabilities of the LLM.
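One simple way to picture the hybrid mix is to draw each training example from the agent dataset with a fixed probability and from the general dataset otherwise. The sketch below illustrates that idea only. The function name `sample_mixed_batch`, the parameter `agent_fraction` and the toy data are hypothetical, and this is not the AgentTuning training code.

```python
import random


def sample_mixed_batch(agent_data, general_data, agent_fraction, batch_size, rng):
    """Draw a batch in which each example comes from agent_data with
    probability agent_fraction and from general_data otherwise."""
    if not 0.0 <= agent_fraction <= 1.0:
        raise ValueError("agent_fraction must be between 0 and 1")
    batch = []
    for _ in range(batch_size):
        # rng.random() is in [0, 1), so 0.0 never picks agent data and 1.0 always does.
        source = agent_data if rng.random() < agent_fraction else general_data
        batch.append(rng.choice(source))
    return batch


if __name__ == "__main__":
    # Toy stand-ins for the two datasets (hypothetical examples).
    agent_examples = ["agent-trajectory-1", "agent-trajectory-2"]
    general_examples = ["general-chat-1", "general-chat-2", "general-chat-3"]
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(sample_mixed_batch(agent_examples, general_examples, 0.2, 8, rng))
```

A small `agent_fraction` keeps most of each batch general-purpose, which is the intuition behind specialising without losing general capabilities.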

How the model was tuned

  1. The AgentInstruct dataset was constructed for 6 tasks (AlfWorld, WebShop, Mind2Web, Knowledge Graph, Operating System and Database) in 3 stages -

    1. Instruction Construction - For the tasks with datasets available (the first 4), the training split was used for all 3 stages. For the OS task, where direct interaction is a challenge, GPT-4 was prompted to come up with OS-related tasks, each with an explanation of the task, a reference solution and an evaluation script. Then, another GPT-4 instance (the solver) was prompted with the task and its trajectory was collected. After the task was completed, the reference solution was run and its result was compared with the solver’s result using the evaluation script. Only trajectories where the reference solution and the solver’s solution gave the same answer were kept. For the Database task, a task derivation strategy was used, where GPT-4 was applied to the BIRD dataset to build instructions and trajectories. Since BIRD only contains SELECT data, the other types of database operations (INSERT, UPDATE and DELETE) were constructed with a similar self-instruct approach.

    2. Trajectory Construction - GPT-4 (gpt-4-0613) was used as the agent for trajectory interaction. For the Mind2Web task, due to the large number of instructions and budget constraints, ChatGPT (gpt-3.5-turbo-0613) was partially employed for interactions. A 1-shot evaluation approach was used, primarily because of the stringent output format requirements in agent tasks. For each task, a complete interaction process from the training set was provided as the example. The interaction process has two main parts. First, the model was given a task description and a successful 1-shot example. Then, the actual interaction began. The model was supplied with the current instruction and the necessary information. Based on this and previous feedback, the model formed a thought and took an action. The environment then provided feedback, including possible changes or new information. This cycle continued until the model either achieved its goal or reached its token limit. If the model repeated the same output three times consecutively, it was considered a repetitive failure. If the model’s output format was wrong, the BLEU metric was used to compare it with all possible action choices, and the closest match was picked as the model’s action for that step. ReAct was employed as the reasoning framework, so the model outputted a chain-of-thought (CoT) explanation (referred to as a thought) before producing the final action. Consequently, every action in the collected interaction trajectories was accompanied by a detailed explanation trace, enabling the model to learn the reasoning process leading to the action. For trajectories generated using task derivation without thoughts, GPT-4 was used to supplement them with thoughts for consistency with ReAct prompting.

    3. Trajectory Filtering - High-quality trajectories were automatically selected based on the reward. For all tasks except Mind2Web, only trajectories with a final reward of r = 1, indicating complete correctness, were kept. Due to the difficulty of the Mind2Web task, a threshold of r ≥ 2/3 was used there to ensure a sufficient number of trajectories.

  2. For the general dataset, English-language conversations were selectively extracted from the ShareGPT dataset, yielding 57,096 conversations with GPT-3.5 and 3,670 with GPT-4. Because GPT-4 responses are of superior quality, a sampling ratio of 1:4 between GPT-4 and GPT-3.5 was adopted for better performance.

  3. The base model is denoted π_0, representing the probability distribution π_0(y | x) of response y given instruction and history x. Two datasets were considered: the AgentInstruct dataset D_agent and the general dataset D_general. The mixture ratio of D_agent and D_general is defined as η. To determine the best η, a scan was done from 0 to 1 in intervals of 0.1 on the 7B model, and ultimately η = 0.2 was chosen for final training since it performed best on held-out tasks.

  4. The chat version of the open Llama 2 models (Llama-2-{7,13,70}b-chat) was used as the base model, given its better instruction-following capabilities than the base versions and its commendable performance on traditional NLP tasks.

  5. Following Vicuna, all data was standardised into a multi-turn chatbot-style format, making it convenient to mix data from different sources.

  6. During fine-tuning, the loss was computed only on the model’s output.

  7. Models of sizes 7B, 13B, and 70B were fine-tuned using Megatron-LM. A learning rate of 5e-5 was used for the 7B and 13B models, and 1e-5 for the 70B model. The batch size was set to 64 with a sequence length of 4,096.

  8. The AdamW optimizer was used with a cosine learning rate scheduler and 2% warm-up steps.

  9. For efficient training, tensor parallelism was employed for the 7B and 13B models, and for the 70B model, pipeline parallelism was also utilised.
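The reward-based filtering from stage 3 of the dataset construction above can be sketched as a per-task threshold check. The `Trajectory` class, its fields and the function names are hypothetical. Only the thresholds (r = 1 by default, r ≥ 2/3 for Mind2Web) come from the description above.

```python
from dataclasses import dataclass

# Thresholds taken from the text: full correctness by default,
# a relaxed threshold for the harder Mind2Web task.
DEFAULT_THRESHOLD = 1.0
TASK_THRESHOLDS = {"Mind2Web": 2 / 3}


@dataclass
class Trajectory:
    """Hypothetical container for one collected interaction trajectory."""
    task: str
    reward: float  # final reward, assumed to lie in [0, 1]


def keep_trajectory(trajectory):
    """Return True if the trajectory's final reward meets its task threshold."""
    threshold = TASK_THRESHOLDS.get(trajectory.task, DEFAULT_THRESHOLD)
    return trajectory.reward >= threshold


def filter_trajectories(trajectories):
    """Keep only the trajectories that pass the reward check, in order."""
    return [t for t in trajectories if keep_trajectory(t)]


if __name__ == "__main__":
    collected = [
        Trajectory("WebShop", 1.0),
        Trajectory("WebShop", 0.9),
        Trajectory("Mind2Web", 0.7),
        Trajectory("Mind2Web", 0.5),
    ]
    for kept in filter_trajectories(collected):
        print(kept)
```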
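Computing the loss only on the model’s output is commonly done by masking every token that does not belong to a model turn. The plain-Python sketch below shows the idea and is not Megatron-LM code. The role names, the helper names and the toy token IDs are assumptions, and `IGNORE_INDEX = -100` simply mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # assumed sentinel; matches PyTorch's default ignore_index


def build_labels(turns):
    """Flatten a multi-turn conversation into input_ids and labels.

    turns is a list of (role, token_ids) pairs. Tokens from "assistant"
    turns keep their IDs as labels; all other tokens are masked out.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


def masked_mean_nll(token_nll, labels):
    """Average the per-token negative log-likelihoods over unmasked positions."""
    kept = [nll for nll, label in zip(token_nll, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    # Toy conversation with made-up token IDs.
    conversation = [("user", [1, 2, 3]), ("assistant", [4, 5])]
    ids, labels = build_labels(conversation)
    print(ids, labels)
    print(masked_mean_nll([1.0, 1.0, 1.0, 2.0, 4.0], labels))
```

With this masking, the prompt and environment feedback still condition the model but do not contribute to the gradient.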
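A cosine learning rate schedule with 2% warm-up can be sketched as below. The text only states the scheduler type, the warm-up fraction and the peak learning rates. The linear shape of the warm-up, the decay to a minimum of 0, the total step count and the function name `lr_at_step` are all assumptions made for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_fraction=0.02, min_lr=0.0):
    """Learning rate at a given step: linear warm-up, then cosine decay.

    Assumes a linear warm-up over the first warmup_fraction of steps and a
    cosine decay from peak_lr down to min_lr over the remaining steps.
    """
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp in case step exceeds total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # 5e-5 is the peak rate quoted for the 7B and 13B models;
    # 1000 total steps is a made-up number for the demo.
    for step in (0, 10, 19, 20, 500, 1000):
        print(step, lr_at_step(step, total_steps=1000, peak_lr=5e-5))
```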

How this approach was evaluated



