Engineering the Flow of Artificial Thoughts

Just like humans, AI also needs space to think — space that we can create together

Feb 01, 2024

Not all answers come to us immediately. Sometimes we need to think, get additional information, or seek help from others. There are also cases where our actions do not yield the desired result, and then we must entirely change our perspective and assumptions, searching for an entirely different path.

While writing the next paragraphs of this post, I find myself modifying or even outright deleting them, replacing them with others. Sometimes it is also useful for me to write down a few points or a draft of what I want to write about. This helps me in shaping my thoughts, which directly translates into the quality of the content I create. This process also works the other way around, and my subsequent publications positively influence the way I think. Moreover, the fact that I create so much content is not accidental and is related to one of Jordan Peterson's speeches, where he directly spoke about the connection between thinking and writing.

In "Exploring Artificial Senses in Real Life", I wrote that in the case of large language models, the thinking process takes place during content generation or more precisely — completing the processed text. In other words, one could say that LLMs always "think out loud." Therefore, when our instructions force the model to immediately provide us with the expected answer, we almost eliminate the "thinking" process, which can positively affect the quality of the generated result.

Just like with humans, in the case of simple tasks, it is possible to provide an immediate correct answer. Naturally, as complexity increases, the situation begins to change. This is well illustrated by the following example of a relatively simple classification of queries. Here, the task of the model is to determine whether the user's message contains a direct command requiring contact with an external service. The problem is that a direct command may be present in messages that do not require it, and the model must notice the difference.

In the case of a prompt aiming to generate an immediate response, the risk of misclassification is high. The most mistakes occur for messages in which a direct command is given, but their overall meaning is different. Misclassification also occurs when there is a command, but it is not related to the need to connect with external services.

When GPT-4-Turbo does not have space to "think," the response to a sample message is almost always wrong. However, when we expand our prompt with a request to think about the answer beforehand, the situation changes significantly. Of course, on a larger scale, errors still occur, and we should always consider that we are talking only about increasing the chance of getting the correct answer.

The above behavior is quite obvious because LLM generates text by predicting the next token, considering the content so far. Moreover, the process of selecting a token is done anew each time. This means that the content of the response is influenced by both our entire instruction and every subsequent fragment generated by the model. This is clearly seen in the screenshots below, where after the token "are," the probability of the token "you" occurring was determined to be as high as 98.86%.

Since the model considers the entire content so far, there is nothing to prevent us from providing a fragment that increases the probability of obtaining the correct answer. In the example below, I included an additional section with "reflection" and a request to consider it when generating the answer. As a result, the model correctly classified the user's message, but this time it generated only one token!

So here, the space for "thinking" is not in the content generated by the model, but in the instruction provided by us. As a result, there is nothing to prevent the final response of the model from consisting of a series of previous queries, whose results will be included as the context of the final interaction.

In practical scenarios that include processing a larger amount of information, the above knowledge may lead us to want to include larger data sets that would increase the space for "thinking," offering a higher quality of answers. Unfortunately, such an approach will have a very negative impact on both the speed of operation, higher costs, and above all — it will work contrary to our expectations because currently LLMs still have significant limitations related to the ability to maintain attention, as can be read in Lost in the Middle.

The problem related to the ability to process a larger context is attracting a lot of attention, and many indications are that it has been largely addressed by Claude 2.1 and a certain prompt modification described on the Anthropic blog. As a result, we have seen an increase in effectiveness in working with long context from 27% to as much as 98%!

The example of Claude 2.1 shows us how the capabilities related to content processing by LLMs are changing. And although it may seem now that we can freely load any information, in large quantities, we must remember what I said in my earlier post about LLM's attention being scattered by irrelevant context. This is seen below, where a sentence about music was accidentally loaded into the context and was considered information about the user.

When designing a system whose task is to combine many queries to LLMs, considering dynamic data loading, we can easily reach the above situation unless we consider it already at the stage of organizing the knowledge base and the way it is searched.

This all leads us to the following observations:

The quality of responses generated by LLMs is significantly influenced by the content we provide and that which is generated by the model. The more effectively we lead the "narrative" of the entire interaction, the greater the chance of obtaining the desired result. Unfortunately, the whole process is based on quite intuitive action, as we cannot clearly determine the influence of individual content elements on the model's behavior.
The second factor directly affecting the quality of responses is the "thinking time" we give the model. In this case, this time is expressed in the form of the length of the processed content. Depending on the case, we may want to put more emphasis on the part developed by us, and at other times on the model itself. In both cases, our attention must be focused on leading the model in the right direction.
Although usually increasing the space for "thinking" leads LLMs to better results, in practice we will not always care about this, due to the need to balance between the quality of the response and the time and costs of its generation.
At the foundation of all techniques leading the model to desired responses lie quality data and a precise way of delivering it. Therefore, this area deserves the most of our active attention.

And ultimately, our perspective on LLMs should unequivocally include interactions that go beyond immediate response generation and strive towards designing systems capable of flexible, possibly autonomous operation. Then we move from thinking of LLMs as chatbots to the realm of a new operating system, as Andrej wrote about.

Latent Space Activation

Getting to know LLMs quickly leads us into the area of Prompt Engineering, filled with various techniques for interacting with the model to obtain the best possible results. Some of them are so general that they can work in many scenarios, for example, it is worth looking at prompts such as:

...explain your reasoning step by step
...verify your answer step by step
... Answer questions as truthfully as possible using the context below and nothing else. If you don't know the answer, say, "I don't know."
... Here is the most relevant sentence in the context: (...)

Similarly, useful are techniques that involve a gradual approach to the final answer, such as Chain of Thought or Tree of Thoughts. Knowledge about them is undoubtedly useful for controlling the behavior of the model, but it is also worth directing some of our attention to building a broader perspective. A great example is David Shapiro and his thoughts on Latent Space Activation, which is the space in the model that concentrates a specific type of knowledge or skills.

David excellently uses well-known concepts to shape a certain image of how large language models work. An example could be the types of thinking presented by Daniel Kahneman in Thinking Fast and Slow. Using such concepts, we can design interactions with models at a slightly higher level, which currently makes a lot of sense, considering the fact that we know little about the specific factors influencing their operation.

My own experiences also suggest certain simplifications in understanding LLMs. One of them concerns focusing on the way content is generated, that is, predicting the next fragment. Since the model focuses on this task, the entire interaction should remain as close as possible to this activity.

Therefore, instead of designing instructions that draw the context of a given conversation, explaining its rules and important details, I focus directly on the conversation itself, which is to be supplemented by the generated content.

Such an approach usually allows for leading the narrative for the entire interaction with the model in a very natural way, considering many details and nuances that do not have to be mentioned explicitly. An example could be a specific tone of the model's speech visible in the second example. Even though I did not mention that it should be very colloquial, the model, continuing the previous content, naturally chose the right direction.

I choose a similar strategy when I want the generated response to be written in JSON format. I wrote more about this in User Interfaces may change because "AI knows what we mean". There, instead of describing the rules for generating responses and their structures, I simply created a JSON object, the content of which was at the same time a description of what I cared about. Such an approach helps me maintain high conciseness and precision of speech, while presenting the model with the result I need.

Naturally, I do not have specific studies that confirm my assumptions and can only rely on my own experiences. However, in the face of still very limited knowledge about LLMs that we currently have, designing our own strategy for working with models seems to make a lot of sense, especially when we base it on understanding the fundamentals known to us.

Pitfalls of Combining Prompts

Building logic that includes multiple queries to LLMs can indeed lead to significantly better results. At the same time, the same path can lead us to the opposite outcome.

In the case of simple, direct interactions with the model, we can quickly catch any mistakes or errors in reasoning and make the necessary corrections. However, in the case of complex and often partially autonomous actions of the model, protecting against such challenges becomes much more difficult. What's worse, if an error occurs at the beginning of the entire process, it is very likely that its subsequent stages will also not end successfully.

The most common problem I encounter is overwriting the model's behavior through content that is supposed to be processed. In the example below, we see how the initial instruction containing a request to correct errors in the text was overwritten by the user's message content, which sounds like a direct command.

Such situations can occur during the automatic processing of content, where we usually rely on dividing the original into smaller parts and making queries to the model for each of them. To prevent this, we can both refine our original instruction, clearly separating it from the rest of the conversation. Then the model's attention will focus on the original command, and although its dispersion will still be possible, we are again talking about reducing the likelihood of such an event.

Sometimes it can be difficult to predict behaviors that deviate from initial assumptions. Then, the model itself can also be involved in the verification of the result, which can compare the generated outcome with the initial instruction. The concept of such mechanisms is much broader and is referred to as "guardrails," whose task is to ensure that the responses generated by the model do not exceed certain norms.

Problems related to implementing planned logic may be associated not only with changes in the model's behavior but also with its non-deterministic nature, which may result in not obtaining the expected response format or experiencing a different kind of error during its generation. Therefore, introducing mechanisms capable of making corrections without the need for human involvement becomes critical. Just as following only the "happy path" is not advisable in coding, here we must also remember to handle potential errors.

Ultimately, regardless of the precision of the designed system, we see that at every step we are dealing with the probability of certain situations occurring. For this reason, when designing systems that engage LLM in any logic elements, it is necessary to include human involvement at least at the last stage related to the verification of the results developed by our system.

Downplaying the role of LLMs and programming support

Despite the continuous growth in the capabilities of models, their efficiency, the availability of open-source versions, or the increasingly lower prices of providers, it should be kept in mind that the role of LLM does not have to cover all actions. Wherever code has worked so far, it is worth continuing to use it. Especially since, as I wrote in Plain text is what you need, LLM can independently reach for various tools, taking the burden of different operations off themselves.

Therefore, instead of focusing solely on prompts, space for "thinking," or building complex logic, it is worth redirecting part of the attention to creating tools and instructions for the model to use them.

Deciding when to engage the model directly and when to combine it with external services requires a good understanding of the available options, including both the AI field and market solutions. However, knowledge of what we have at our disposal is not enough; it is also necessary to monitor new opportunities that appear almost every week, often changing the rules of the game.

When integrating LLM with external tools, the foundation will almost always be the API and generating queries and their content in JSON format. Very helpful in these processes are Function Calling present, among others, in models from OpenAI, and JSON mode. In the first case, the model can independently indicate the function and data needed to run it, and in the second, we simply increase the chance of obtaining a correct object that we can use in the application code.

Controlling the dream of a large language model

I am increasingly encountering the perspective that LLM hallucinations are not a flaw in their operation, but are the way models function. Jon Stokes wrote about this in more detail, so I refer you to his article. Meanwhile, for us, this is a critical hint that redirects our attention from the programming world where we can describe everything precisely, to a space where nothing is certain, and often impossible to explain. Such a change will primarily help to set our expectations correctly, which in turn facilitates exploring the available possibilities and limitations of models.

In conclusion, I have prepared a list aimed at directing attention to places that have helped me integrate LLM with the logic of my applications and automation over the past few months.

LLM can conduct conversations with itself, which can be described as self-reflection, including preparing to give a response, enriching the original query, or verifying its own statements.
The effectiveness of models is highest where their role is not to generate, but to process existing content. In almost every case, my use of LLM involves transforming data or possibly actions related to understanding the conveyed content.
Limiting the involvement of the model can include not only the use of external tools, but also transferring actions directly to me. Then I design their operation so that they do not aim for a final result, but only prepare the necessary information in a form that will facilitate the task for me.
Although I continuously monitor emerging prompt design techniques and available tools, I usually strive to develop my own solutions. Due to the current pace of model development, it is difficult to rely on external tools, which are characterized by moderate stability and difficulty in keeping up with the latest model capabilities. To move quickly myself, I avoid creating complex solutions whose foundations are difficult for me to modify.
No source of knowledge replaces my own experiences and building solutions that help me in various tasks. On the other hand, practice alone is not sufficient and must be directly linked to knowledge from publications by people involved in model development.
I always work with the latest available versions of OpenAI models and continuously monitor the progress of competing companies searching for new strategies, techniques, or capabilities.
The logic I design involving LLM usually does not go beyond a dozen actions, and where possible, I strive to maintain low complexity of instructions and dependencies between queries to LLM. This is a strategy for limited AI behavior debugging capabilities.
Whenever I design interactions involving the model, I use the model itself. In other words, I collaborate with AI in designing prompts and finding solutions that allow for maintaining low complexity in achieving the set assumptions.

In summarizing the topic of controlling the behavior of a model, we are talking about an area that poses many questions for which no one currently has definitive answers. Therefore, we need to balance what is commonly known with what we ourselves can discover on the path to achieving the results that matter to us. In such a situation, the highest effectiveness seems to stem from understanding the nature of models and the characteristics they exhibit, often entering the realm of psychology and many human traits.

That's all for this time,
Adam