De cerca, nadie es normal

Chain-of-Verification (CoVe): An Approach for Reducing Hallucinations in LLM Outcomes

Posted: February 25th, 2024 | Filed under: Artificial Intelligence, Natural Language Processing

When working with the generative linguistic capabilities of LLMs and with prompt engineering, one of the main challenges to tackle is the risk of hallucinations. In the fourth quarter of 2023, a group of researchers from Meta AI tested and published a new approach to reduce them in LLM outputs: Chain-of-Verification (CoVe).

The researchers set out to show that language models can deliberate on the responses they give in order to correct their mistakes. In the Chain-of-Verification (CoVe) method the model first drafts an initial response; then plans verification questions to fact-check its draft; then answers those questions independently, so that the answers are not biased by other responses; and finally generates its verified, improved response.

Setting the Stage

Large Language Models (LLMs) are trained on huge corpora of text documents with billions of tokens of text. It has been shown that as the number of model parameters increases, accuracy improves, and larger models generate more correct factual statements.

However, even the largest models can still fail, particularly on lesser-known, long-tail facts, i.e. those that occur relatively rarely in the training corpora. In those cases where the model is incorrect, it instead generates an alternative response that is typically plausible-looking but wrong: a hallucination.

The current wave of language modeling research goes beyond next-word prediction and focuses on the models’ ability to reason. Improved performance in reasoning tasks can be gained by encouraging language models to first generate internal thoughts or reasoning chains before responding, as well as by updating their initial response through self-critique. This is the line of research followed by the Chain-of-Verification (CoVe) method: given an initial draft response, the model first plans verification questions to check its work, and then systematically answers those questions in order to finally produce an improved, revised response.

The Chain-of-Verification Approach

This approach assumes access to a base LLM that is capable of being prompted with general instructions in either a few-shot or zero-shot fashion. A key assumption in this method is that this language model, when suitably prompted, can both generate and execute a plan of how to verify itself in order to check its own work, and finally incorporate this analysis into an improved response.

The process entails four core steps:

1. Generate Baseline Response: Given a query, generate the response using the LLM.

2. Plan Verifications: Given both query and baseline response, generate a list of verification questions that could help to self-analyze if there are any mistakes in the original response.

3. Execute Verifications: Answer each verification question in turn, and hence check the answer against the original response to check for inconsistencies or mistakes.

4. Generate Final Verified Response: Given the discovered inconsistencies (if any), generate a revised response incorporating the verification results.

Conditioned on the original query and the baseline response, the model is prompted to generate a series of verification questions that test the factual claims in the original baseline response. For example, if the response contains the statement “The Mexican–American War was an armed conflict between the United States and Mexico from 1846 to 1848”, then one possible verification question to check those dates could be “When did the Mexican American war start and end?” It is important to highlight that verification questions are not templated: the language model is free to phrase them in any form it wants, and they do not have to closely match the phrasing of the original text.

Given the planned verification questions, the next step is to answer them in order to assess if any hallucinations exist: the model is used to check its own work. In their paper, the Meta AI researchers investigated several variants of verification execution: Joint, 2-Step, Factored and Factor+Revise.

  1. Joint: in this method, the aforementioned planning and execution steps (2 and 3) are accomplished with a single LLM prompt, whereby the few-shot demonstrations include both verification questions and their answers immediately after the questions.
  2. 2-Step: in this method, verification questions are generated in a first step and answered in a second step, where crucially the context given to the LLM prompt contains only the questions and not the original baseline response, so the model cannot repeat those answers directly.
  3. Factored: this method consists of answering all questions independently as separate prompts. Those prompts do not contain the original baseline response and are hence not prone to simply copying or repeating it.
  4. Factor+Revise: in this method, after answering the verification questions, the overall CoVe pipeline has to cross-check, either implicitly or explicitly, whether those answers indicate an inconsistency with the original response. For example, if the original baseline response contained the phrase “It followed in the wake of the 1845 U.S. annexation of Texas. . . ” and CoVe generated a verification question such as “When did Texas secede from Mexico?”, which would be answered with 1836, then an inconsistency should be detected by this step.

Finally, in the last of the four steps, the improved response that takes verification into account is generated. This step conditions on all of the previous reasoning steps (the baseline response and the verification question-answer pairs), so that corrections can be made.
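The four steps can be sketched in a few lines of Python. This is only an illustration of the factored variant under stated assumptions, not the researchers' implementation: the `llm` argument is a hypothetical callable that takes a prompt and returns a completion, and the prompt wording is invented for the example.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> str:
    """Run the four CoVe steps, using the factored execution variant."""
    # Step 1: generate the baseline response.
    baseline = llm(f"Answer the question.\nQuestion: {query}")

    # Step 2: plan verifications, conditioned on query and baseline.
    plan = llm(
        "List verification questions, one per line, that fact-check the answer.\n"
        f"Question: {query}\nAnswer: {baseline}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # Step 3: execute verifications. Each question gets its own prompt,
    # and the baseline is deliberately left out so it cannot be copied.
    answers = [llm(f"Answer concisely.\nQuestion: {q}") for q in questions]

    # Step 4: generate the final response from the draft plus the evidence.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        "Revise the draft so that it is consistent with the verification results.\n"
        f"Question: {query}\nDraft: {baseline}\nVerification:\n{evidence}"
    )
```

The design point worth noticing is step 3: the prompts there contain only the verification question, which is what keeps a hallucinated date in the draft from leaking into its own fact-check.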

In conclusion, Chain-of-Verification (CoVe) is an approach to reduce hallucinations in a large language model by having it deliberate on its own responses and self-correct them. By breaking the verification down into a set of simpler questions, LLMs are able to answer verification questions with higher accuracy than when answering the original query. In addition, when answering the set of verification questions, controlling the attention of the model so that it cannot attend to its previous answers (factored CoVe) helps avoid copying the same hallucinations.

That said, CoVe does not remove hallucinations completely from the generated output. While this approach gives clear improvements, the upper bound on the improvement is limited by the overall capabilities of the model, e.g. in identifying and knowing what it knows. In this regard, letting language models use external tools (for instance, RAG) to gain further information beyond what is stored in their weights would very likely yield promising results.


Curtailing Hallucinations in Large Language Models

Posted: December 9th, 2023 | Filed under: Artificial Intelligence

In 2020 Meta (then known as Facebook) published the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, which came up with a framework called retrieval-augmented generation (RAG) to give LLMs access to information beyond their training data.

Large language models can be inconsistent. Sometimes they give a perfect answer to a question, but other times they regurgitate random facts from their training data.

Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources. In other words, it fills a gap in how LLMs work. Under the hood, LLMs are neural networks, typically measured by how many parameters they contain. An LLM’s parameters essentially represent the general patterns of how humans use words to form sentences. That deep understanding makes LLMs useful in responding to general prompts extremely fast. Nonetheless, it does not serve users who want a deeper dive into a current or more specific topic. Retrieval-augmented generation (RAG) gives models sources they can cite, so users can check any claims. That builds trust. What’s more, the technique can help models clear up ambiguity in a user query.

The roots of the technique go back at least to the early 1970s. That’s when researchers in information retrieval (IR) prototyped what they called question-answering systems, apps that use natural language processing to access text initially in narrow topics.

Implementing RAG in an LLM-based question answering system has two main benefits: It ensures that the model has access to the most current, reliable facts, and that users have access to the model’s sources, ensuring that the accuracy of responses can be easily checked.

By grounding an LLM on a set of external, verifiable facts, the model has fewer opportunities to “hallucinate” or give misleading information. RAG allows LLMs to build on a specialized body of knowledge to answer questions in a more accurate way. It also reduces the need for users to continuously train the model on new data and update its parameters as circumstances evolve. In this way, RAG can lower the computational and financial costs of running LLM-powered chatbots in an enterprise setting.

RAG has two phases: retrieval and content generation. In the retrieval phase, algorithms search for and retrieve snippets of information relevant to the user’s prompt or question. In an open-domain, consumer setting, those facts can come from indexed documents on the internet; in a closed-domain, enterprise setting, a narrower set of sources is typically used for added security and reliability.
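As a rough illustration of the retrieval phase and of the augmented prompt it feeds, here is a minimal sketch. Everything in it is an assumption made for the example: the word-overlap scoring is a toy stand-in for a real retriever (which would typically use dense embeddings or a search index), and the corpus, function names, and prompt wording are hypothetical.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, passage: str) -> int:
    """Toy relevance score: number of tokens shared by query and passage."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum((q & p).values())


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Retrieval phase: return the k passages most relevant to the query."""
    ranked = sorted(corpus, key=lambda passage: overlap_score(query, passage), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Attach numbered sources to the user's question so the model can cite them."""
    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered sources and cite them.\n"
        f"Sources:\n{sources}\n"
        f"Question: {query}"
    )


# Hypothetical three-passage corpus.
corpus = [
    "RAG was introduced in a 2020 paper on knowledge-intensive NLP tasks.",
    "Diplomacy is a seven-player board game.",
    "Retrieval fetches passages relevant to the user question.",
]
question = "When was RAG introduced?"
prompt = build_prompt(question, retrieve(question, corpus, k=1))
```

The resulting `prompt` string is what would be handed to the language model in the generation phase; numbering the sources is one simple way to let the answer point back to them.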

This collection of outside information is sent to the language model along with the user’s request. During the generative phase, the LLM draws on the augmented prompt and on its internal representation of its training data to create an engaging answer tailored to the user at that moment. The answer can then be passed to a chatbot together with links to its original sources.

[Figure: graphical representation of the entire procedure.]

Summing up, customer queries are not always straightforward. They can be ambiguously worded, complex, or require knowledge the model either doesn’t have or can’t easily parse. These are the conditions in which LLMs are prone to making things up. LLMs need to be explicitly trained to recognize questions they can’t answer, and a model may need to see thousands of examples of questions that can and can’t be answered. Only then can the model learn to identify an unanswerable question, and probe for more detail until it hits on a question that it has the information to answer. RAG is currently the best-known tool for grounding LLMs on the latest, verifiable information, and it allows LLMs to go one step further by greatly reducing the need to feed and retrain the model on fresh examples.


On Graphs of Thoughts (GoT), Prompt Engineering, and Large Language Models

Posted: November 5th, 2023 | Filed under: Artificial Intelligence

For a long time it seemed that the main application of graphs in the computing landscape was ontology engineering, so when my colleague Mihael shared with me the paper “Graph of Thoughts: Solving Elaborate Problems with Large Language Models”, published at the end of August, I thought we might be on the right path to rediscovering the power of these structures to represent knowledge. In the aforementioned paper, the authors harness the graph abstraction as a key mechanism that enhances prompting capabilities in LLMs.

Prompt engineering is one of the central new domains of large language model research. However, designing effective prompts is a challenging task. Graph of Thoughts (GoT) is a new paradigm that enables the LLM to solve different tasks effectively without any model updates. The key idea is to model the LLM reasoning as a graph, where thoughts are vertices and dependencies between thoughts are edges.

Human task solving is often non-linear, and it involves combining intermediate solutions into final ones, or changing the flow of reasoning upon discovering new insights. For example, a person could explore a certain chain of reasoning, backtrack and start a new one, then realize that a certain idea from the previous chain could be combined with the currently explored one, and merge them both into a new solution, taking advantage of their strengths and eliminating their weaknesses. GoT reflects this, so to speak, anarchic reasoning process with its graph structure.

Nonetheless, let’s take a step back: besides Graph of Thoughts, there are other approaches for prompting: 

  1. Input-Output (IO): a straightforward approach in which we use an LLM to turn an input sequence x into the output y directly, without any intermediate thoughts.
  2. Chain-of-Thought (CoT): one introduces intermediate thoughts a1, a2,… between x and y. This strategy was shown to significantly enhance various LLM tasks over the plain IO baseline, such as mathematical puzzles or general mathematical reasoning.
  3. Multiple CoTs: generating k independent CoTs, and returning the one with the best output according to a given metric.
  4. Tree of Thoughts (ToT): it enhances Multiple CoTs by modeling the process of reasoning as a tree of thoughts. A single tree node represents a partial solution. Based on a given node, the thought generator constructs a given number k of new nodes. Then, the state evaluator generates scores for each such new node.
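To make the Tree of Thoughts loop more concrete, here is a small sketch under loud assumptions. In a real setting the thought generator and the state evaluator would both be LLM calls; here they are replaced by deterministic toy functions (pick numbers from 1 to 3 so that three of them sum to a hypothetical target of 7), and keeping only the best `beam_width` nodes per level is one possible search strategy among several.

```python
from typing import Callable, List, Tuple

Thought = Tuple[int, ...]  # a partial solution: the numbers chosen so far
TARGET = 7  # hypothetical goal for the toy task


def generate(node: Thought) -> List[Thought]:
    """Thought generator (stub): build k = 3 new nodes from a given node."""
    return [node + (step,) for step in (1, 2, 3)]


def evaluate(node: Thought) -> float:
    """State evaluator (stub): higher is better, 0 means the target is reached."""
    return -abs(TARGET - sum(node))


def tree_of_thoughts(
    root: Thought,
    generate: Callable[[Thought], List[Thought]],
    evaluate: Callable[[Thought], float],
    depth: int,
    beam_width: int,
) -> Thought:
    """Expand the tree level by level, keeping the best-scored nodes each time."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for node in frontier for child in generate(node)]
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]


best = tree_of_thoughts((), generate, evaluate, depth=3, beam_width=2)
```

Setting `beam_width=1` would follow a single greedy path, which is closer in spirit to a single chain; with a wider beam, several partial solutions stay alive and compete at each level.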

[Figure: the prompting schemes above, explained in a more visual way. Image taken from the paper “Graph of Thoughts: Solving Elaborate Problems with Large Language Models”]

The design and implementation of GoT, according to the authors, consists of four main components: the Prompter, the Parser, the Graph Reasoning Schedule (GRS), and the Thought Transformer:

  • The Prompter prepares the prompt to be sent to the LLM, using a use-case specific graph encoding. 
  • The Parser extracts information from the LLM’s thoughts, and updates the graph structure accordingly. 
  • The GRS specifies the graph decomposition of a given task, i.e., it prescribes the transformations to be applied to LLM thoughts, together with their order and dependencies. 
  • The Thought Transformer applies the transformations to the graph, such as aggregation, generation, refinement, or backtracking. 
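A toy sketch may help to picture three of the transformations named above (generation, refinement, and aggregation) on the sorting use case. It is not the authors' framework: the `Thought` class and the function names are hypothetical, and Python's `sorted` and `heapq.merge` stand in for what would be LLM calls.

```python
from dataclasses import dataclass, field
from heapq import merge
from typing import List


@dataclass
class Thought:
    """A vertex of the graph; `parents` holds its incoming edges."""
    content: List[int]
    parents: List["Thought"] = field(default_factory=list)


def generate(parent: Thought, parts: int) -> List[Thought]:
    """Generation: split one thought into several child thoughts."""
    size = max(1, -(-len(parent.content) // parts))  # ceiling division
    return [
        Thought(parent.content[i:i + size], [parent])
        for i in range(0, len(parent.content), size)
    ]


def refine(thought: Thought) -> Thought:
    """Refinement: improve a thought (an LLM would sort the chunk here)."""
    return Thought(sorted(thought.content), [thought])


def aggregate(thoughts: List[Thought]) -> Thought:
    """Aggregation: merge several thoughts into one new thought."""
    return Thought(list(merge(*(t.content for t in thoughts))), list(thoughts))


root = Thought([5, 2, 9, 1, 7, 3, 8])
chunks = generate(root, parts=3)
final = aggregate([refine(chunk) for chunk in chunks])
```

The `final` vertex has three parents, which is exactly what a chain or a tree cannot express: in those structures every thought has at most one predecessor.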

Finally, the authors evaluate GoT on four use cases (sorting, keyword counting, set operations, and document merging), and compare it to other prompting schemes in terms of quality, cost, latency, and volume. The authors show that GoT outperforms other schemes, especially for tasks that can be naturally decomposed into smaller subtasks that are solved individually and then merged into a final solution.

Summing up, another breath of fresh air in this hectically evolving world of AI, this time combining abstract reasoning, linguistics, and computer science. Pas mal at all.


On Natural Language Processing, Game Theory, and Diplomacy

Posted: April 11th, 2023 | Filed under: Artificial Intelligence

Beyond GPT in its different evolutions, there are other LLMs (as stated in Large Language Models (LLMs): an Ontological Leap in AI) developed with a perfectly defined industry focus in mind. This is the case of CICERO.

In November 2022, the Meta Fundamental AI Research Diplomacy Team (FAIR) and researchers from other academic institutions published the seminal paper Human-level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning, laying the foundations for CICERO. 

CICERO is an AI agent that can use language to negotiate, persuade, and work with people to achieve strategic goals, much as humans do. It was the first AI to achieve human-level performance in the strategy game Diplomacy.

Diplomacy is a complex strategy game, involving both cooperation and competition; its no-press variant, played without communication between players, has served as a benchmark for multi-agent AI research. It is a 7-player zero-sum cooperative/competitive board game, featuring simultaneous moves and a heavy emphasis on negotiation and coordination. In the game, a map of Europe is divided into 75 provinces. 34 of these provinces contain supply centers (SCs), and the goal of the game is for a player to control a majority (18) of the SCs. Each player begins the game controlling three or four supply centers and an equal number of units. Importantly, all actions occur simultaneously: players write down their orders and then reveal them at the same time. This makes Diplomacy an imperfect-information game in which an optimal policy may need to be stochastic in order to prevent predictability.

Diplomacy is a game about people rather than pieces. It is designed in such a way that cooperation with other players is almost essential to achieve victory, even though only one player can ultimately win. It requires players to master the art of understanding other people’s motivations and perspectives; to make complex plans and adjust strategies; and then to use natural language to reach agreements with other people and to persuade them to form partnerships and alliances.

How Was Cicero Developed by FAIR?

In two-player zero-sum (2p0s) settings, principled self-play algorithms ensure that a player will not lose in expectation regardless of the opponent’s strategy, as shown by John von Neumann in 1928 in his work Zur Theorie der Gesellschaftsspiele.
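A tiny worked example shows what such a guarantee looks like. The game below is matching pennies, chosen here purely for illustration (it does not appear in the paper): a deterministic strategy can be exploited by an opponent who anticipates it, whereas a 50/50 random strategy secures an expected payoff of 0 whatever the opponent does.

```python
from typing import Sequence

# Row player's payoff in matching pennies: +1 if both coins match, -1 otherwise.
PAYOFF = [
    [1.0, -1.0],
    [-1.0, 1.0],
]


def worst_case_value(row_strategy: Sequence[float], payoff: Sequence[Sequence[float]]) -> float:
    """Expected payoff of a mixed row strategy against a best-responding opponent.

    It is enough to check the opponent's pure strategies, because a best
    response to a fixed mixed strategy is always attained by one of them.
    """
    n_cols = len(payoff[0])
    return min(
        sum(p * payoff[r][c] for r, p in enumerate(row_strategy))
        for c in range(n_cols)
    )


always_heads = worst_case_value([1.0, 0.0], PAYOFF)  # predictable, so exploitable
fair_coin = worst_case_value([0.5, 0.5], PAYOFF)     # unpredictable
```

Here `always_heads` evaluates to -1.0 and `fair_coin` to 0.0, which is the sense in which a well-chosen stochastic policy does not lose in expectation.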

Theoretically, any finite 2p0s game (such as chess, go, or poker) can be solved via self-play given sufficient computing power and memory. However, in games involving cooperation, self-play alone no longer guarantees good performance when playing with humans, even with infinite computing power and memory. The clearest example of this is language. A self-play agent trained from scratch without human data in a cooperative game involving free-form communication channels would almost certainly not converge to using English, for instance, as the medium of communication. For this reason, the researchers developed a self-play reinforcement learning algorithm, named RL-DiL-piKL, that provides a model of human play while simultaneously training an agent that responds well to this human model. RL-DiL-piKL was used to train an agent named Diplodocus. In a 200-game No-press Diplomacy tournament involving 62 human participants, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo rating system, a method for calculating the relative skill levels of players in zero-sum games.
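For readers unfamiliar with Elo, the sketch below shows the standard two-player form of the rating update. How the tournament adapted Elo to a seven-player game is not described here, so this is only the textbook version, and the K-factor of 32 is an assumed conventional value.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Return both new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    Whatever A gains, B loses, so the sum of ratings stays constant.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


new_a, new_b = update(1500.0, 1500.0, score_a=1.0)
```

Two equally rated players each have an expected score of 0.5, so a win moves the winner up by half the K-factor: `new_a` is 1516.0 and `new_b` is 1484.0.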

What Are the Implications of This Breakthrough?

Despite being almost drowned out by the advent of GPT in its different versions, this is, firstly, an astonishing advance in the field of negotiation, and more particularly in the realm of diplomacy. Never has an AI model performed so brilliantly in a fuzzy environment, seasoned by information asymmetries, common-sense reasoning, ambiguous natural language, and statistical modeling. Secondly, and more importantly, this is further evidence that we are in a completely new AI era in which machines can scale knowledge, and are already doing so.

These LLMs have caused a deep shift: we went from attempting to encode human-distilled insights into machines to delegating the learning process itself to machines. AI is ushering in a world in which decisions are made in three primary ways: by humans (which is familiar), by machines (which is becoming familiar), and by collaboration between humans and machines (which is not only unfamiliar but also unprecedented). We will begin to give AI fewer specific instructions about how exactly to achieve the goals we assign it. Much more frequently we will present AI with ambiguous goals and ask: “How, based on your conclusions, should we proceed?”

AI promises to transform all realms of human experience. And the core of its transformations will ultimately occur at the philosophical level, transforming how humans understand reality and our roles within it. In an age in which machines increasingly perform tasks only humans used to be capable of: what, then, will constitute our identity as human beings? 

With the rise of AI, the definition of the human role, human aspirations, and human fulfillment will change. For humans accustomed to a monopoly on complex intelligence, AI will challenge self-perception. To make sense of our place in this world, our emphasis may need to shift from the centrality of human reason to the centrality of human dignity and autonomy. Human-AI collaboration does not occur between peers. Our task will be to understand the transformations that AI brings to human experience, the challenges it presents to human identity, and which aspects of these developments require regulation or counterbalancing by other human commitments.

The AI revolution is here to stay. Unless we develop new concepts to explain, interpret, and organize its consequent transformations, we will be unprepared to navigate them. We must rely on our most solid resources (reason, moral and ethical values, tradition…) to adapt our relationship with reality so that it remains human.


Large Language Models (LLMs): an Ontological Leap in AI

Posted: December 27th, 2022 | Filed under: Artificial Intelligence, Natural Language Processing

More than the quasi-human interaction and the practically infinite use cases that could be covered with it, OpenAI’s ChatGPT has provided an ontological jolt of a depth that transcends the realm of AI itself.

Large language models (LLMs), such as GPT-3, YUAN 1.0, BERT, LaMDA, Wordcraft, HyperCLOVA, Megatron-Turing Natural Language Generation, or PanGu-Alpha represent a major advance in artificial intelligence and, in particular, toward the goal of human-like artificial general intelligence. LLMs have been called foundational models; i.e., the infrastructure that made LLMs possible –the combination of enormously large data sets, pre-trained transformer models, and the requirement of significant computing power– is likely to be the basis for the first general purpose AI technologies.

In May 2020, OpenAI released GPT-3 (Generative Pre-trained Transformer 3), an artificial intelligence system based on deep learning techniques that can generate text. It analyzes text by means of a neural network, each layer of which analyzes a different aspect of the samples it is provided with; e.g., meanings of words, relations of words, sentence structures, and so on. It assigns arbitrary numerical values to words and then, after analyzing large amounts of text, calculates the likelihood that one particular word will follow another. Amongst other tasks, GPT-3 can write short stories, novels, news reports, scientific papers, code, and mathematical formulas. It can write in different styles and imitate the style of the text prompt. It can also answer content-based questions; i.e., it learns the content of texts and can articulate this content. And it can also produce concise summaries of lengthy passages.
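The idea of calculating the likelihood that one word follows another can be illustrated with a bigram counter. This is a drastic simplification, offered only as a sketch: GPT-3 is a transformer that works on sub-word tokens and conditions on long contexts rather than on the single preceding word, and the sample sentence is invented.

```python
from collections import Counter, defaultdict
from typing import Dict


def bigram_probabilities(text: str) -> Dict[str, Dict[str, float]]:
    """Estimate P(next word | current word) from raw co-occurrence counts."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }


model = bigram_probabilities("the cat sat on the mat and the cat slept")
```

In this sample, "the" is followed by "cat" twice and by "mat" once, so `model["the"]` assigns "cat" a probability of two thirds and "mat" one third.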

OpenAI and the like endow machines with structuralist equipment: a formal, logical analysis of language as a system, so that machines can participate in language. GPT-3 and other transformer-based language models stand in direct continuity with the linguist Saussure’s work: language comes into view as a logical system to which the speaker is merely incidental. These LLMs give rise to a new concept of language, implicit in which is a new understanding of human and machine. OpenAI, Google, Facebook, and Microsoft are effectively catalysts, triggering a disruption of the old concepts we have been living by so far: a machine with linguistic capabilities is simply a revolution.

Nonetheless, critiques of LLMs have appeared as well. The usual one is that no matter how good they may appear to be at using words, they do not have true language; drawing on the pioneering work of the philologist Zipf, critics have stated that they are just technical systems made up of data, statistics, and predictions.

According to the linguist Emily Bender, “a language model is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.” By contrast, we human beings are intentional subjects who can make things into objects of thought by inventing and endowing meaning.

Machine learning engineers in companies like OpenAI, Google, Facebook, or Microsoft have experimentally established a concept of language at the center of which the human does not need to be. According to this new concept, language is a system organized by an internal combinatorial logic that is independent of whoever speaks (human or machine). They have undermined one of the most deeply rooted axioms in Western philosophy: humans have what animals and machines do not have, language and logos.

Some data: on average, humans publish about seventy million posts a month on the content management platform WordPress. That amounts to about fifty-six billion words a month, or 1.8 billion words a day. GPT-3, before its scintillating launch, was producing around 4.5 billion words a day, more than twice what humans on WordPress were producing collectively. And that is just GPT-3; there are other LLMs. We are exposed to a flood of non-human words. What will it mean to be surrounded by a multitude of non-human forms of intelligence? How can we relate to these astonishingly powerful content-generator LLMs? Do machines require semantics or even a will to communicate with us?

These are philosophical questions that cannot be solved with an engineering approach alone. The scope is much wider and the stakes are extremely high. Besides mastering and learning our human languages, LLMs can make us reflect on and question the nature of language, knowledge, and intelligence. Large language models illustrate, for the first time in the history of AI, that language understanding can be decoupled from all the sensorial and emotional features we, human beings, share with each other. It seems we are gradually entering a new epoch in AI.