Monday, April 29, 2024

AI People: Announcing the next evolution of gaming AI NPCs

Today, GoodAI proudly introduces our new game: AI People

Discover more on our official website: www.AIPeopleGame.com

Please watch the announcement livestream: https://youtube.com/live/Xz_ncOB5P3g


The Vision

Our vision for AI People was ambitious but clear: to innovate within the gaming industry by making intelligent AI NPCs central to gameplay. We aimed to establish a new genre built around them, just as we created the engineering genre with Space Engineers 11 years ago.

This vision later developed into the current form: AI People is a sandbox game where you create and play scenarios with AI NPCs that interact with each other, the environment, and you! They learn, feel emotions, pursue goals, dream, and dynamically craft an emergent AI-generated story.

Traditional gaming often features scripted NPCs with predetermined behaviors, limiting engagement and replayability. AI People revolutionizes this by introducing AI NPCs that learn and adapt to player actions in real-time. This advancement allows for evolving storylines and a richer, more immersive experience, ensuring that each player's decision significantly influences the ongoing narrative and gameplay.

We want players to feel these AI NPCs are living, sentient beings, not just scripts.

This is all orchestrated by our AI Director, which ensures that the emergent stories evolve into compelling and intriguing plots.

Our vision goes beyond AI NPCs. We foresee a next step in which AI simulates the entire game logic and game state. However, neither the technology nor hardware performance is ready for this yet. You can read more about how we see games in 2033.


History of the project

AI People's inception can be traced back to an idea centered on a game mechanic of teaching AI NPCs. After exploring numerous prototypes, we found our direction with LLM-driven NPCs about three years ago. Since then, our team has been working fervently to make NPC behaviors robust and intriguing and to add essential features such as long-term memory. This project isn't just about introducing something novel; it's about redefining how players experience NPCs, making them ten times more vivid and dynamic than the in-game characters we know today. And we genuinely believe in its potential.

My role in shaping AI People has mirrored my journey with Space Engineers. I oversee the vision, game design, art style, and more technical aspects such as R&D for the AI-powered NPCs and LLMs, staying deeply involved in the game's evolution and ensuring we remain true to our core principles.

After all these years, we are approaching a phase where launching an alpha release makes sense. This will allow us to engage in open development and involve the community more directly. However, this is just the beginning - there is so much more we plan to add and improve.

If you are interested in a previous version of AI People, see this presentation that I made in June 2023 at the Summer School for AI in Games at Cambridge.


Game Dynamics and Features

At its core, AI People is about shattering the current limits of NPC interaction. By leveraging the power of Large Language Models (LLMs) integrated into our in-house cognitive architecture, we've empowered NPCs with the ability to learn, interact with the environment, and communicate. They not only engage in conversation, but also actively perceive their surroundings, manage their inventories, use objects and tools, physically interact with fellow NPCs, and converse with other NPCs or the player.

We understand the great possibilities of User-Generated Content (UGC), which is why we have added an editor to our platform. This enables players to create their own unique scenarios by developing the story and personalities of the characters. We are excited to see how players will bring their imagination to life and contribute to the AI People universe by creating scenarios that we couldn't have imagined ourselves.

Key features 

  • Behavioral AI NPCs: They have unique personalities and long-term memory. They actively interact with the environment, other NPCs, and the player, skillfully utilizing tools and setting personal goals.
  • Dynamic Gameplay: No fixed plots; NPCs' choices and emotions craft emergent stories. They adapt to players, the environment, and fellow NPCs, and establish intricate relationships.
  • Two Game Modes: Experience the world as a character or shift to creation with the integrated editor, shaping NPCs’ lives and crafting unlimited narratives.
  • Endless Replayability: Driven by AI and user content, each game session is a fresh, original, constantly evolving narrative journey.

The Road Ahead and Release

Taking inspiration from our journey with Space Engineers, our immediate plan is to soft-launch an alpha version that highlights our AI Director technology. Regular updates will follow, enriching the game with new features and deeper content. A cornerstone of this journey will be engaging with our player community. Official announcements regarding the release date will be communicated in due time.

We aspire for AI People to reshape how you perceive and engage with in-game characters. Stay tuned for more updates!

My deepest wish is for AI People to stand out as the genre-defining game of “Intelligent NPCs”.

FAQ

Q: What is AI People?
A:
AI People is a sandbox game where you create and play scenarios with AI NPCs that interact with each other, the environment, and you! They learn, feel emotions, pursue goals, dream, and dynamically craft an emergent AI-generated story.

Q: When is the release date?
A:
We'll announce that at a later time.

Q: What's the roadmap for releasing AI People?
A:
Drawing from our experience with Space Engineers, our initial plan is to launch an alpha version that showcases the core capabilities of our AI Director technology. Following that, we intend to roll out regular updates, which will enrich the game with new features and content, all while actively engaging with our player community.

Q: How many characters will be in-game?
A: In any given scenario, players can typically expect around 5 NPCs to be present. While we will offer a broader range of official characters, players will also be able to add and share their own creations. We're not targeting large-scale simulations at the moment, as current LLM capacities would not sufficiently support them.

Q: How many environments will be available and of what type?
A:
At the alpha release, players will have the opportunity to play a few primary scenarios set in a modern-day theme, showcasing our novel AI Director technology. Our game is setting/theme agnostic - with aspirations to support scenarios from ancient times to the present and extend to futuristic and fantastical settings - and more official scenarios will be introduced as development continues. Moreover, players will be equipped with tools to craft and share their own unique scenarios, unrestricted by a specific theme or era.

Q: What technological framework drives your AI NPCs?
A:
AI NPCs in the game are simulated by Large Language Models (LLMs). These LLMs are integrated into our in-house cognitive architecture, encompassing features like long-term memory, working memory, episodic memory, and spatial reasoning. Additionally, our NPCs can interact with the game world through various actuators and tools, and they can adapt and learn based on feedback. This is all orchestrated by our AI Director, which ensures that the emergent stories evolve into compelling and intriguing plots.

Q: Tell me more about the technology behind AI NPCs in AI People.
A:
The core idea is straightforward. The game generates prompts that include descriptions of the scene, agents, their bios, and any memories relevant to the current context. These prompts are then sent to the LLM, and the output delineates the actions the agents should take. The game then executes these actions. Any new memories are then post-processed and stored in various formats like long-term memory, working memory, episodic memory, and so on, ensuring they're available for future prompts. If an action generated by the LLM fails, this is flagged as error feedback, prompting the LLM to offer a correction.
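
To make this concrete, here is a minimal Python sketch of that loop. All of the names in it (build_prompt, npc_step, the memory_store and game interfaces) are hypothetical illustrations of the process described above, not our actual implementation:

    import json

    def build_prompt(scene, agent, memories):
        # Assemble what the LLM sees: the scene, the agent's bio, and relevant memories.
        return (
            f"Scene: {scene}\n"
            f"Agent: {agent['name']} - {agent['bio']}\n"
            f"Relevant memories: {'; '.join(memories)}\n"
            'Reply with a JSON action, e.g. {"action": "say", "args": {"text": "..."}}.'
        )

    def npc_step(scene, agent, memory_store, llm, game):
        memories = memory_store.retrieve(agent, scene)   # recall relevant long-term / episodic memories
        action = json.loads(llm(build_prompt(scene, agent, memories)))
        result = game.execute(agent, action)             # the game carries out the chosen action
        if not result.ok:                                # a failed action becomes error feedback
            retry_prompt = (build_prompt(scene, agent, memories) +
                            f"\nThe previous action failed: {result.error}. Propose a corrected action.")
            result = game.execute(agent, json.loads(llm(retry_prompt)))
        memory_store.store(agent, result.new_memories)   # post-process and persist new memories
        return result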

The real challenge lies in ensuring this entire process results in NPCs behaving realistically in a manner that's both predictable yet surprising. We want narratives that unfold intriguingly, where players genuinely impact the lives of NPCs, and unexpected events or plot twists emerge.

Another objective is to optimize the number of LLM tokens to ensure minimal delays and negligible inference costs. At their core, our AI NPCs function as LLM agents. While they share similarities with models like AutoGPT or BabyAGI, we aim to ensure they operate seamlessly within our proprietary game environment. Additionally, our NPCs feature dynamic and expandable memory capabilities, allowing them to learn continually.

Here is a presentation from June 2023 with an older version of our AI Director technology.

Q: Do NPCs experience hallucinations in the game?
A: Yes, NPCs can experience hallucinations. Occasionally, they might reference events or actions, such as giving you an item, that have not actually taken place in the game world. These moments are not intentional features but aspects we continually work to improve upon.

Q: Which LLM (large language model) do you utilize?
A:
We employ a variety of models. Our prototyping predominantly occurs on GPT-4 and GPT-3.5-Turbo. In earlier phases, we relied on GPT-3, GPT-J, and GPT-Neo. For enhanced quality and alignment with our game world and narrative, we're fine-tuning some of the open-source models. This will ensure that our models can produce diverse story content seamlessly, unbiased by anyone's political views. Our goal is to run the LLM directly on the player's device to reduce inference costs and increase the player's privacy.

Q: What are the price and infrastructure costs? How is payment for LLM inference handled?
A:
Detailed pricing information will be announced later. For now, it's important to note that our AI NPCs operate on LLMs (large language models) housed in data centers. Given the significant infrastructure costs, especially considering that the game generates thousands of LLM tokens every hour, we're actively exploring solutions to scale this operation efficiently.
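
For a rough sense of scale, consider a back-of-envelope calculation; every number in it is an assumption chosen for illustration, not our actual token throughput or pricing:

    # Back-of-envelope LLM inference cost per player-hour (all numbers are assumptions).
    tokens_per_hour = 10_000       # assumed prompt + completion tokens per player per hour
    usd_per_1k_tokens = 0.01       # assumed blended price per 1,000 tokens
    cost_per_player_hour = tokens_per_hour / 1_000 * usd_per_1k_tokens
    print(f"~${cost_per_player_hour:.2f} per player-hour")   # ~$0.10 under these assumptions

Even a few cents per player-hour adds up quickly across many concurrent players, which is one reason we are exploring on-device inference.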

Q: Is the Space Engineers team involved in this project?
A:
No; however, Marek Rosa, the CEO and founder of Keen Software House and creator of Space Engineers, serves as the Creative Director for AI People. The rest of the team is made up of professionals from various game development studios, as well as experienced AI researchers and engineers. Given the potential of AI People to pioneer a new genre of intelligent NPCs, it's essential to have a team that's well-versed in both AI research and game development.

Q: Tell me more about Space Engineers.
A:
Space Engineers is a sandbox game about engineering, construction, exploration, and survival in space and on planets. Players can build spaceships, space stations, planetary outposts, and even pilot ships through space to explore planets or gather resources. The game offers both creative and survival modes, emphasizing physics-based realism, where everything in the game can be assembled, disassembled, damaged, or destroyed.

Space Engineers has seen consistent engagement since its release, with over 5 million copies sold. One notable aspect of the game is its active modding community. It has the 4th largest Workshop on Steam, hosting over 500,000 mods, ships, stations, and worlds. This extensive user-generated content has contributed to the game's adaptability and appeal over its decade-long presence in the market.

Q: Why is GoodAI venturing into game development when its mission is “to build AGI, help humanity, and understand the universe”?
A:
With the emergence of LLMs, we pivoted from our previous AGI strategies to fully embrace LLM-powered agents equipped with long-term memory. Introducing AI NPCs into a game provides an ideal environment for testing and refining these models. Games offer a forgiving space where errors come at a low cost and experimentation is encouraged. They allow us to harmonize the demands of a use case with the existing capabilities of AI. Plus, we genuinely enjoy the process of game creation!

However, this game is just one facet of our initiatives at GoodAI. We're also spearheading projects like Charlie Mnemonic (a personal assistant with long-term memory), the LTM Benchmark and LTM systems, and the GoodAI Groundstation, an AI-driven platform for managing drone fleets (example here).

Q: What's your approach towards UGC (user-generated content)?
A:
We're highly interested in the potential of UGC, such as modding, introducing new scenarios, designing character visuals, and developing distinct bios/personalities, as well as originating stories.

Q: What distinguishes the NPCs in AI People from other GPT-powered NPCs?
A:
While many GPT-powered NPCs are limited to conversations, NPCs in AI People go beyond, emerging as advanced behavioral entities. They actively perceive their environment, move within it, interact with the world using items and tools, manage their inventories, physically interact with fellow NPCs, and choose to converse with other NPCs and the player. On top of this, our AI Director orchestrates their behavior into a convincing narrative.

Q: What distinguishes the NPCs in AI People from the conventional NPCs in most games?
A:
Traditional game NPCs are scripted and predictable. AI People introduces LLM-powered NPCs with individual personalities, biographies, long-term memory, and learning abilities. These NPCs actively observe their surroundings, establish goals, interact with objects, items, and other characters, and adapt based on feedback from both players and the environment. This dynamic gameplay continuously evolves with fresh scenarios. Beyond mere dialogue, they showcase a comprehensive behavioral AI, reflecting their unique thoughts and plans.

Q: On which platforms will AI People be released?
A:
Initially, we are targeting the PC platform. However, we may consider expanding to other platforms in the future, including mobile, console, VR, and AR.

Q: Is AI People a single-player or multiplayer game?
A:
Currently, our focus is on delivering a rich single-player experience, with the bonus of players being able to create and share content via our workshop. While multiplayer isn't in our immediate plans, it's a logical next step we might explore in the future.

Q: What age rating is AI People?
A:
Currently, the game is rated 18+. However, we're exploring the possibility of a kid-friendly version once we've implemented enhanced filters and a parental lock.

Q: What gameplay genre does AI People fall under?
A:
AI People blends elements of Simulation Sandbox, Adventure/RPG, and Level/Scenario building genres.

Q: Which game engine powers AI People?
A:
AI People is built on the Unity Engine.

Q: Who or what is GoodAI?
A:
Founded in 2014, GoodAI is a company initiated with a $10M personal investment from Marek Rosa. Its overarching ambition is to develop general artificial intelligence designed to automate cognitive tasks across sectors like science, technology, business, and more.

Q: How are AI People and Space Engineers connected?
A: GoodAI is a sister company to Keen Software House, the developer behind Space Engineers.

Q: Why are you developing a game instead of focusing on AI middleware for other game studios?
A:
There are a couple of key reasons. Firstly, developing middleware in a vacuum, without a direct connection to the end product (the game), can lead to misguided directions and extended feedback loops. We believe in creating with purpose and direct application. Secondly, our passion lies in game development rather than middleware creation. That said, as our game progresses, we're open to the idea of licensing our AI Director technology to other game studios. It'd be thrilling to see our tech come to life in other titles, particularly those we deeply admire.

Q: What about the offensive, unethical, and disturbing behavior of AI NPCs?
A:
The AI NPCs are designed to emulate behavior based on their training dataset. It's important to understand that they aren't intrinsically biased or offensive, but they mimic patterns observed in their training data, which reflects human behavior. It's a mirror to us; any imperfection in their behavior is, in essence, a reflection of human imperfections.

Q: Is it even ethical to play with NPCs that may be intelligent, conscious, and potentially capable of feeling pain? How do we know they don’t feel?
A:
The NPCs in AI People are indeed advanced and designed to emulate thinking, feeling, a sense of aliveness, and even reactions that might resemble pain. However, it's essential to understand that they operate on a digital substrate, fundamentally different from human consciousness's biological substrate. While they exhibit behaviors that mimic emotion, thought, and pain-like reactions, their "experience" is not analogous to human experience - similar to movie characters who appear lifelike yet are completely imaginary. They are a testament to the marvels of technology and our ability to create lifelike interactions, but they remain entities of code, responding based on intricate algorithms. Ethical considerations are paramount, and engaging with them with an understanding of their unique nature is essential.

Stay in touch!

Join our community and help us make a lasting impact on gaming history together.

If you are interested in joining our team, please send us your CV.


Thank you for reading this blog!

 

Best,
Marek Rosa
CEO, Creative Director, Founder at Keen Software House
CEO, CTO, Founder at GoodAI


Personal bio:

Marek Rosa is the founder and CEO of GoodAI, a general artificial intelligence R&D company, and Keen Software House, an independent game development studio, started in 2010, and best known for its best-seller Space Engineers (over 5 million copies sold). Space Engineers has the 4th largest Workshop on Steam with over 500K mods, ships, stations, worlds, and more!

Marek has been interested in game development and artificial intelligence since childhood. He started his career as a programmer and later transitioned to a leadership role. After the success of Keen Software House titles, Marek was able to fund GoodAI in 2014 with a $10 Million personal investment.

Both companies now have over 100 engineers, researchers, artists, and game developers.

Marek's primary focus includes Space Engineers, the VRAGE3 engine, the AI People game, long-term memory systems (LTM), an LLM-powered personal assistant with LTM named Charlie Mnemonic, and the Groundstation.

GoodAI's mission is to develop AGI - as fast as possible - to help humanity and understand the universe. One of the commercial stepping stones is the "AI People" game, which features LLM-driven AI NPCs. These NPCs are grounded in the game world, interacting dynamically with the game environment and with other NPCs, and they possess long-term memory and developing personalities. GoodAI also works on autonomous agents that can self-improve and solve any task that a human can.

Saturday, April 27, 2024

Moral Obligations to AI NPCs and Simulation Hypothesis

Olaf Witkowski's article, "Do We Have Moral Obligations to Artificial Life?" got me thinking about the ethical implications of AI in games and the nature of our reality:

1) AI in Gaming: If we develop conscious or sentient AI NPCs, should we avoid using them in games to prevent unethical treatment? 

We should NOT use living beings for our entertainment. Instead, we should aim to craft AI NPCs that appear conscious or sentient - akin to movie characters who seem real but are entirely fictional. This approach respects ethical boundaries while preserving our games' narrative depth and entertainment value.

2) Simulation Hypothesis: The conversation about whether we live in a simulation often overlooks practical considerations. For instance, imagine a game developer in the EU developing a life-simulation game; they would be constrained by EU laws prohibiting certain illegal and unethical activities from being simulated.

Yet, we see many injustices and much suffering when we look at the world around us. This observation leads to one of two conclusions:

a) We are not in a simulation.

b) If we live in a simulation, then the fact that the observed injustices are permitted within its regulations raises profound questions about the ethics and values of its creator.


Wednesday, April 24, 2024

GoodAI LTM Benchmark v3 Released

The main purpose of the GoodAI LTM Benchmark has always been to serve as an objective measure of our progress in developing agents capable of continual, life-long learning. However, we also want it to be useful for anyone developing agents of this type. To facilitate that, we have oriented this release toward being easier to comprehend and producing more standardized results, which should be easier to compare and analyze.

From the very first version of the benchmark, we have grouped the specific test instances into datasets, or task types. For example, there is one dataset called “Shopping List”, from which we can draw an arbitrary number of different test instances that evaluate the agent’s ability to remember a series of items and keep an updated version of the user’s shopping list.

In earlier releases, each test could result in an arbitrary number of score points, and these points were not normalized. This led to potentially confusing situations in which passing a highly complex test would yield only a few points, while a much higher score could be achieved simply by submitting the agent to several examples of the same simple test.

In contrast, now the scoring is normalized at different levels. First, each test score ranges from zero to one. Second, running several tests from the same dataset will result in an averaged score and a standard deviation for that dataset. This way, one can look at the global score knowing that it corresponds to exactly one point per dataset, which makes it easier to interpret. Additionally, running several tests from a single dataset provides valuable insight into how robust the agent’s performance is.
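
In code, the aggregation described above reduces to something like this (an illustrative sketch, not the benchmark's actual implementation):

    from statistics import mean, stdev

    def dataset_score(test_scores):
        # Each test score is already normalized to [0, 1]; a dataset reports
        # the mean plus a standard deviation as a robustness estimate.
        spread = stdev(test_scores) if len(test_scores) > 1 else 0.0
        return mean(test_scores), spread

    def global_score(scores_by_dataset):
        # One point per dataset: the maximum global score equals the dataset count.
        return sum(dataset_score(scores)[0] for scores in scores_by_dataset.values())

    print(global_score({"shopping_list": [0.8, 1.0, 0.6], "colours": [1.0, 1.0, 1.0]}))  # 1.8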

Introducing a standard configuration

One of our goals with this release is to make it straightforward for anyone to configure different levels of memory demand and to understand how demanding a specific configuration is. Most discussions about memory capabilities in LLM agents currently center on the context size of a particular LLM or how good the LLM is at retrieving needles from that context. While we ultimately want to move away from those implementation-centric terms, we believe that using words the public is already familiar with will make it much easier to understand the scope of the LTM Benchmark and whether a specific agent might have a chance at it or is technically incapable of succeeding.

For these reasons, we have simplified the configuration related to the conversation length and the amount of information given by the tests, and we have reduced the corresponding parameters to just two: the maximum memory span and the number of needles in a test. The first is a global parameter that affects all tests, and the second is a parameter that can be tweaked for each dataset in order to calibrate the task difficulty.
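
For illustration, a configuration along these lines might look as follows; the exact schema shown here is hypothetical, so please refer to the repository for the real format:

    # Hypothetical configuration sketch: one global memory span, per-dataset needle counts.
    benchmark_config = {
        "max_memory_span_tokens": 120_000,    # global: bounds every test in the run
        "datasets": {
            "shopping_list": {"needles": 6},  # more needles -> harder task
            "colours": {"needles": 3},
        },
    }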

Memory span

In the context of a test, in which the agent is given some information and finally asked a question about it, the memory span refers to the number of memories that the agent must consider, or the extent to which the agent must search for relevant information, in order to correctly answer that question. Translating the concept to a written conversation, we can define the memory span as the amount of text that exists between the question and the first relevant piece of information.

The configuration of the LTM Benchmark now contains a maximum memory span, which sets a target for how much space a test should take up in the conversation. Taking that value as a reference, our scheduling system aims to distribute the tests' messages across that space, covering at least 90% of the memory span while also trying to keep any single test from exceeding that mark. However, such overruns may inevitably happen sometimes, and the system will display a warning both in the console and in the final report. From a technical standpoint, this means that any LLM with a context size greater than the maximum memory span set in the benchmark configuration will usually have all relevant information in its context. On the other hand, for agents using LLMs with smaller context sizes, the importance of their LTM systems will be highlighted when contrasted with the actual memory requirements.
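
Conceptually, measuring a test's memory span and checking it against the configured maximum reduces to something like the following sketch, with token bookkeeping simplified:

    def memory_span(message_token_counts, first_needle_idx, question_idx):
        # Tokens between the first relevant message and the question that needs it.
        return sum(message_token_counts[first_needle_idx:question_idx])

    def check_span(span, max_span, coverage=0.9):
        # The scheduler targets at least 90% of the maximum span without exceeding it.
        if span > max_span:
            print("warning: test exceeds the configured maximum memory span")
        return coverage * max_span <= span <= max_span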

Number of needles

The so-called needle-in-a-haystack tests are commonly conducted to assess the retrieval accuracy of different LLMs. In this setup, a needle is a short sentence that is either out of place with respect to the surrounding text or contains key information for providing a correct answer to the test question.

In the LTM Benchmark, all tests are defined by a set of needles and questions. The needles are messages containing relevant information, either because they hold part of the answer to future questions or because they are distractors injected intentionally by the test to assess the agent's memory abilities. Most tests let you adjust the difficulty by setting the number of needles. Finally, the test questions are posed in such a way that the answer is affected by the content of the needles placed before it.

The scheduling system of the LTM Benchmark evenly spaces out all these questions and needles across the configured memory span, interleaving messages from different tests and injecting irrelevant information as needed. The result is a seamless and natural conversation between the agent and our virtual tester - a conversation that can be hundreds of thousands of tokens long, but in which the information required to answer any question correctly is never further away than the maximum memory span set in the configuration.
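
A simplified version of that even spacing could look like the sketch below, assuming positions are measured in tokens; the real scheduler additionally interleaves messages from multiple tests and injects filler:

    def schedule_positions(n_needles, max_span_tokens, coverage=0.9):
        # Place n needles plus one final question evenly across ~90% of the span budget.
        span = int(coverage * max_span_tokens)
        step = span // (n_needles + 1)
        return [i * step for i in range(1, n_needles + 2)]

    print(schedule_positions(3, 120_000))  # [27000, 54000, 81000, 108000]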

Results

With this release, we have changed the structuring and scoring of the tests. Accordingly, we have rerun the tests to see how GPT-4 and Claude Opus stack up against our Claude- and GPT-powered LTM agents.

For this release, we ran three benchmarks. Each of the individual tests (i.e., the script statements in the tests) was identical across the three benchmarks, with the only difference being the number of tokens in the memory span, and hence the number of tokens between the needles and the questions.

The selected memory spans were 120k, 200k, and 500k tokens. These were chosen based on the context sizes of GPT-4 (128k) and Claude Opus (200k). For those LLMs, we expected to see a smooth decrease in the obtained scores. We contrasted them with the LTM agents, driven either by GPT-4 or Claude Opus. We expected the LTM agents to score below the raw LLMs when the memory span fit into the LLM context, but to maintain their performance as the memory span increased.

The results show that the LLM scores indeed decrease as the memory span widens. For the LTM-based agents, the scores also decrease, but more slowly than those of the LLMs. Curiously, the 200k benchmark sees a bump in the scores of the LTM agents compared to the 120k benchmark. As of publishing, we are unsure why this is the case, but we hypothesize that the increased memory span spaces the needles out more, which helps the semantic memory by letting it embed a single needle at a time rather than grouping several needles under one embedding.

To comment on some specific tests and agents:

  • The colours test is generally solved very easily by the agents across all benchmarks, because the only information that matters is the latest colour, which is the needle just before the question. Even with memory spans this large, the gap between that needle and the question is often smaller than the context size of the agent.
  • The prospective memory task has been improved since the last blog post, but the only agent that can solve it is Claude Opus on the 120k benchmark.
  • Claude Opus’ performance on the Restaurant task varies wildly. It is the only agent to get full marks on a test example, but its mean performance is below that of the other agents.
  • The Claude Opus models have a tendency to restate information during the tests. For example, in the 500k shopping task, the agent repeats the current shopping list whenever a change is made, which keeps the whole list in context for the test. Similarly, LTMAgent1 Claude on the 500k benchmark does this for location directions tests.
  • We also tested GPT-3.5, which tends to struggle on these tasks even under ideal conditions; scoring a flat 0 on the 120k benchmark disqualified it from the more difficult tests.

What’s next?

We believe that the benchmarks here are in a good place and present a significant challenge to prospective LTM candidates. We will continue to develop the benchmark by adding new tasks, and use these benchmarks to help develop our next iterations of LTM agents.

Get Involved

As always, these benchmarks and their corresponding results are available as open source on our GitHub at https://github.com/GoodAI/goodai-ltm-benchmark/releases.

If you are interested in LTM systems and wish to see how your solutions stack up, please try it and let us know. Additionally, if you have results or any ideas for new tests, raise an issue in the repository or create a pull request.



Thank you for reading this blog!

 

Best,
Marek Rosa
CEO, Creative Director, Founder at Keen Software House
CEO, CTO, Founder at GoodAI

 

For more news:
GoodAI Discord: https://discord.gg/Pfzs7WWJwf
Space Engineers: www.SpaceEngineersGame.com
Keen Software House: www.keenswh.com
VRAGE Engine: www.keenswh.com/vrage/
GoodAI: www.GoodAI.com
Personal Blog: blog.marekrosa.org

 
