Monday, February 12, 2024

Introducing GoodAI LTM Benchmark

As part of our research efforts in continual learning, we are open-sourcing a benchmark for testing agents’ ability to perform tasks that involve advanced use of memory over very long conversations. Among other things, we evaluate the agent’s performance on tasks that require dynamic upkeep of memories or integration of information over long periods of time.

We show that the availability of information is a necessary, but not sufficient, condition for solving these tasks. In our initial benchmark, our conversational LTM agents with an 8k context are comparable to the long-context GPT-4-1106 with 128k tokens. In a larger benchmark with 10 times higher memory requirements, our conversational LTM agents with an 8k context perform 13% better than GPT-4-turbo with its 128k-token context, at less than 16% of the cost.

We believe that these results illustrate the usefulness of LTM as a tool that not only extends the context window of LLMs, but also makes it dynamic and helps the LLM reason about its past knowledge, and therefore better integrate the information in its conversation history. We expect that LTM will ultimately allow agents to learn better and make them capable of life-long learning.

Motivation

At GoodAI, we are developing LLM agents that can learn continually from interactions with the user and the environment. Our goal is to create agents capable of life-long learning: agents that constantly gather knowledge from every new experience and leverage all past knowledge to act and learn better in the future. In the past, we organized the GoodAI Challenge, specifically the Gradual Learning round in 2017, to stimulate ideas on continual learning.

While pursuing this goal, we quickly realized that we needed a way to objectively measure our progress on LLM agents’ ability to learn continually. We often found ourselves trying different solutions to the same problem without knowing which one to choose: the methods were usually different, but the results felt equivalent or not significantly different. In addition, most existing benchmarks fell short for our purposes. They either focus strongly on testing LLM-specific capabilities, such as mathematical reasoning or instruction following, or they are centered on testing specific methods and tools: vector databases, prompting, information placement within the context, or performance on question-answering tasks based on static memories or factual knowledge.

In short, most benchmarks focused on aspects that were LLM-, method-, or implementation-specific, and we wanted something that we wouldn’t need to throw away and rewrite from scratch in the future. Instead, we needed a frame of reference that could stand the test of time and evolve as we discovered new caveats in our own agents and translated them into new goals to achieve. A stable benchmark for a constantly changing agent: an incremental, continual, and conversational benchmark.

For these reasons, we developed the GoodAI LTM Benchmark, a framework that can test conversational agents’ abilities to learn and adapt in realistic scenarios and over long periods of time.

For more details, continue to the full GoodAI Blog Post

Github: https://github.com/GoodAI/goodai-ltm-benchmark

Discord: https://discord.gg/Pfzs7WWJwf

Authors: David Castillo, Joseph Davidson, Finlay Gray, José Solorzano, and Marek Rosa  

Thank you for reading this blog!

 

Best,
Marek Rosa
CEO, Creative Director, Founder at Keen Software House
CEO, CTO, Founder at GoodAI

 

For more news:
Space Engineers: www.SpaceEngineersGame.com
Keen Software House: www.keenswh.com
VRAGE Engine: www.keenswh.com/vrage/
GoodAI: www.GoodAI.com
Personal Blog: blog.marekrosa.org

 

Personal bio:

Marek Rosa is the founder and CEO of GoodAI, a general artificial intelligence R&D company, and Keen Software House, an independent game development studio founded in 2010 and best known for its best-selling title Space Engineers (over 5 million copies sold). Space Engineers has the 4th largest Workshop on Steam, with over 500K mods, ships, stations, worlds, and more!

Marek has been interested in game development and artificial intelligence since childhood. He started his career as a programmer and later transitioned to a leadership role. After the success of Keen Software House titles, Marek was able to found GoodAI in 2014 with a $10 million personal investment.

Both companies now have over 100 engineers, researchers, artists, and game developers.

Marek's primary focus includes Space Engineers, the VRAGE3 engine, the AI People game, long-term memory systems (LTM), an LLM-powered personal assistant with LTM named Charlie Mnemonic, and the Groundstation.

GoodAI's mission is to develop AGI - as fast as possible - to help humanity and understand the universe. One of the commercial stepping stones is the "AI People" game, which features LLM-driven AI NPCs. These NPCs are grounded in the game world, interacting dynamically with the game environment and with other NPCs, and they possess long-term memory and developing personalities. GoodAI also works on autonomous agents that can self-improve and solve any task that a human can.