diff --git a/docs/RAG.md b/docs/RAG.md
new file mode 100644
index 0000000..184f024
--- /dev/null
+++ b/docs/RAG.md
@@ -0,0 +1,299 @@
+# RAG
+Retrieval Augmented Generation (RAG) is a method of minimizing LLM hallucinations and extending the model's context
+without consuming a significant portion of the context length. It uses documents and other additional resources that you
+provide to give the model more context for all of your prompts.
+
+Loki has a built-in vector database and full-text search engine to support RAG knowledge bases for your queries.
+
+The generated knowledge bases are stored in the `rag` subdirectory of your Loki configuration directory. The location of
+this directory varies by system, so you can use the following command to find your RAG directory:
+
+```shell
+loki --info | grep 'rags_dir' | awk '{print $2}'
+```
+
+## Quick Links
+
+- [Usage](#usage)
+  - [Persistent RAG](#persistent-rag)
+  - [Ephemeral RAG](#ephemeral-rag)
+- [How It Works](#how-it-works)
+  - [1. Build](#1-build)
+  - [2. Lookup](#2-lookup)
+    - [2a. Reranking (Optional)](#2a-reranking-optional)
+  - [3. Prompt](#3-prompt)
+- [Supported Document Sources](#supported-document-sources)
+- [Document Loaders](#document-loaders)
+  - [Document Loader Usage](#document-loader-usage)
+- [Advanced Customizations](#advanced-customizations)
+  - [Embedding Model](#embedding-model)
+  - [Reranker](#reranker)
+  - [Chunk Size](#chunk-size)
+    - [Trade-Offs](#chunk-size-trade-offs)
+  - [Chunk Overlap](#chunk-overlap)
+  - [Top K](#top-k)
+    - [Trade-Offs](#top-k-trade-offs)
+  - [RAG Template](#rag-template)
+
+---
+
+## Usage
+There are two ways to use RAG in Loki: a persistent RAG that can be loaded on demand for queries, and an ephemeral one
+for adding RAG to a single specific query.
+
+### Persistent RAG
+In the REPL, persistent RAG is initialized via the `.rag` command:
+
+![Persistent RAG example](./images/rag/persistent-rag.gif)
+
+The generated RAG is then saved to the `rag` subdirectory of the Loki configuration, and can be loaded whenever you
+want that knowledge base via either `.rag <name>` or `loki --rag <name>`.
+
+### Ephemeral RAG
+Short-lived RAG that is only used for a single session or query is loaded using `.file`/`--file`.
+
+You can use it either to execute a prompt from a file or for temporary RAG. The difference is the usage of the `--`
+separator. If you only specify a filename and no `--` separator, Loki reads the file contents and passes them as the
+query to the model. Otherwise, the `--` separator marks the end of the list of documents to load into the ephemeral
+RAG, and what follows is the query to pass to the model.
+
+```shell
+.file prompt.md                               # Read the file as a prompt
+.file %% -- translate the last reply to Italian
+.file `git diff` -- generate a commit message
+```
+
+![Ephemeral RAG Example](./images/rag/ephemeral-rag.gif)
+
+This RAG is only visible to the current session and is no longer accessible once the session ends.
+
+#### The `%%` Document Type
+In addition to the usual documents that can be specified for persistent RAG, ephemeral RAG has a special `%%` value.
+This value references the content of the last reply, so you can use it like this:
+
+```shell
+.file %% -- translate the last reply to Italian
+```
+
+The `--` indicates that this is the end of your documents and the beginning of your prompt.
+
+#### The `cmd` Document Type
+Loki also lets you use command outputs for ephemeral RAG input. Simply enclose the command in backticks:
+
+```shell
+.file `git diff` -- generate a commit message
+```
+
+The `--` indicates that this is the end of your documents and the beginning of your prompt.
+
+## How It Works
+#### 1. Build
+When you define RAG, Loki will first "build" the RAG. This means that Loki will consume the documents you specified and
+generate [embeddings](https://huggingface.co/spaces/hesamation/primer-llm-embedding) for that text. This essentially
+just means that Loki translates the document into a language the LLM can understand.
+
+These embeddings are then stored in an in-memory vector database.
+
+#### 2. Lookup
+Loki sits between you and the model. So when you submit a prompt to the model, before Loki ever sends it, it will first
+convert your prompt into embeddings (LLM language), and look for relevant snippets of text in the vector database.
+
+Loki then passes the top `n` snippets of text that it finds in the vector database as additional context to the model
+before your prompt.
+
+#### 2a. Reranking (Optional)
+The lookup for relevant snippets of text uses embeddings to find text that is semantically similar to your prompt, and
+returns the top `n` results. This often works fairly well, but these top results aren't always the most relevant for
+answering the specific query.
+
+Reranking takes these initial results (say, the top 20-100 text snippets) and re-scores them using a more
+sophisticated model. The reranker model ranks documents by their actual usefulness for answering the query to ensure
+the most relevant context is passed to the model alongside your query.
+
+This reranking model can be customized for each RAG you build in Loki. See the [Reranker](#reranker) section
+below for more details on how to customize this.
+
+#### 3. Prompt
+Finally, the text snippets that were looked up in RAG are passed to the model as additional context to your prompt,
+giving the model query-specific context to answer your question.
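The build and lookup steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Loki's implementation: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the "vector database" is just an in-memory list.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Step 1 (build): embed every chunk and keep the vectors in memory."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def lookup(index: list[tuple[str, Counter]], prompt: str, n: int = 2) -> list[str]:
    """Step 2 (lookup): embed the prompt and return the n most similar chunks."""
    query = embed(prompt)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:n]]


index = build([
    "Loki stores knowledge bases on disk",
    "Cats sleep most of the day",
    "Embeddings map text to vectors",
])
top = lookup(index, "how are knowledge bases stored", n=1)
```

A real embedding model captures meaning rather than exact word overlap, but the flow is the same: embed once at build time, embed the prompt at query time, and keep the closest chunks as context.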
+
+## Supported Document Sources
+Loki supports a number of document sources that can be used for RAG:
+
+| Source                   | Example                                                               | Comments                                                                   |
+|--------------------------|-----------------------------------------------------------------------|----------------------------------------------------------------------------|
+| Files                    | `/tmp/dir1/file1;/tmp/dir1/file2`                                     |                                                                            |
+| Directory                | `/tmp/dir`                                                            | Picks up all files in a directory and all its subdirectories               |
+| Directory (extensions)   | `/tmp/dir2/**/*.{md,txt}`                                             | Finds all files in all subdirectories with the specified extensions        |
+| Recursive Filename       | `/tmp/*/LOKI.md`                                                      | Picks up every file named `LOKI.md` that matches the glob pattern          |
+| URL                      | `https://www.ohdsi.org/data-standardization/`                         | Downloads and loads the specified webpage into the knowledge base          |
+| Recursive URL (Websites) | `https://github.com/OHDSI/Vocabulary-v5.0/wiki/**`                    | Crawls all pages under the given URL and loads them into the knowledge base |
+| Document Loader (custom) | `jina:https://cloud.google.com/bigquery/docs/reference/standard-sql/` | Use a custom document loader to parse the given document                   |
+
+## Document Loaders
+Loki only has built-in support for loading text files, but that functionality can be extended to read all kinds of
+files into your knowledge bases. These custom loaders are used both by RAG and for documents specified using
+`.file`/`--file`.
+
+In the global configuration file, you can specify loaders for specific document types using the `document_loaders`
+setting. Each loader is defined by specifying a name and then a command that Loki will execute to load the document.
+
+The following variables are interpolated at runtime by Loki and can be used as placeholders in your command definitions:
+* `$1` (Required) - The input file
+* `$2` (Optional) - The output file. If omitted, `stdout` is used as the output destination
+
+**Note:** It is your responsibility to ensure that any tools used to parse documents into text that Loki can read are
+installed on your system and are available on your `$PATH`. Loki does not have any built-in way of installing
+dependencies for document loaders for you.
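To make the `$1` placeholder concrete, here is a rough sketch of how a loader command could be expanded and run when no `$2` output file is given, so the text is read from `stdout`. The `run_loader` helper is hypothetical and how Loki performs the substitution internally is an assumption; the example uses `cat` as a trivial loader and assumes a POSIX shell.

```python
import os
import shlex
import subprocess
import tempfile


def run_loader(command_template: str, input_path: str) -> str:
    """Replace $1 with the shell-quoted input path, run the command, return its stdout."""
    command = command_template.replace("$1", shlex.quote(input_path))
    result = subprocess.run(command, shell=True, capture_output=True, text=True, check=True)
    return result.stdout


# Create a throwaway input file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello from a loader\n")

text = run_loader("cat $1", handle.name)  # "cat" stands in for pdftotext, pandoc, etc.
os.unlink(handle.name)
```

Whatever the command prints becomes the document text, which is why the tool must be installed and on your `$PATH`.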
+
+The following are some example loaders:
+```yaml
+document_loaders:
+  pdf: 'pdftotext $1 -'                         # Use pdftotext to convert a PDF file to text
+                                                # (see https://poppler.freedesktop.org for details on how to install pdftotext)
+  docx: 'pandoc --to plain $1'                  # Use pandoc to convert a .docx file to text
+                                                # (see https://pandoc.org for details on how to install pandoc)
+  jina: 'curl -fsSL https://r.jina.ai/$1 -H "Authorization: Bearer {{JINA_API_KEY}}"' # Use Jina to translate a website into text;
+                                                # Requires a Jina API key to be added to the Loki vault
+  git: >                                        # Use yek to load a git repository into the knowledge base (https://github.com/bodo-run/yek)
+    sh -c "yek $1 --json | jq 'map({ path: .filename, contents: .content })'"
+```
+
+### Document Loader Usage
+Once you have your loaders defined, you can specify when Loki should use them by prefixing any RAG file/directory/URI
+with the name of the loader.
+
+**Example: Load a git repo into RAG**
+![Git Repo Loader Example](./images/rag/git-loader.png)
+
+**Example: Use pdf loader for ephemeral RAG**
+```shell
+$ loki --file pdf:some-file.pdf
+```
+
+## Advanced Customizations
+For those familiar with RAG, Loki exposes a handful of advanced global settings that can be used to tweak your default
+RAG configurations.
+
+### Embedding Model
+When Loki queries your RAG knowledge bases, it needs to first convert your query into embeddings. By default, Loki uses
+the same embedding model that was used to create the knowledge base in the first place.
+
+This can be customized to any other embedding model available in your configured clients by changing the
+`rag_embedding_model` setting in your global Loki configuration file:
+
+```yaml
+rag_embedding_model: null # Specifies the embedding model used for context retrieval
+```
+
+### Reranker
+By default, Loki uses [Reciprocal Rank Fusion (RRF)](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion)
+to merge vector and keyword search results.
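For readers unfamiliar with RRF, the idea is that each document earns `1 / (k + rank)` from every result list it appears in, and the sums decide the final order. The sketch below shows the general technique only; the constant `k = 60` is the commonly used default and an assumption here, not a documented Loki value, and the chunk names are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each document scores 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)


vector_hits = ["chunk-a", "chunk-b", "chunk-c"]   # hypothetical vector search results
keyword_hits = ["chunk-b", "chunk-d", "chunk-a"]  # hypothetical keyword search results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that rank well in both lists (here `chunk-b` and `chunk-a`) rise to the top, while documents found by only one search fall behind them.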
+
+You can change the default reranker to any reranking model in your configured clients by changing the value of the
+`rag_reranker_model` setting in your global configuration file:
+
+```yaml
+rag_reranker_model: null # null (the default) means no reranker model is set and RRF is used
+```
+
+### Chunk Size
+In the context of RAG, the chunk size is the maximum length of each text chunk (measured in characters) that is created
+when splitting documents. In Loki, this defaults to `2000` characters.
+
+You can specify a different global default by setting the `rag_chunk_size` property in your global configuration file:
+
+```yaml
+rag_chunk_size: null # Defines the size of chunks for document processing in characters
+```
+
+#### Chunk Size Trade-Offs
+Keep in mind the following trade-offs when changing the chunk size:
+
+* **Smaller chunks (e.g. 256 characters):** More precise retrieval, better semantic focus, but may lack context or split
+  important information
+* **Larger chunks (e.g. 1024 characters):** More context preserved, fewer chunks to manage, but less precise matching
+  and more noise in retrieved documents
+
+### Chunk Overlap
+Chunk overlap in RAG is the number of characters that overlap between consecutive chunks to maintain continuity.
+
+---
+
+**Example:** If the following sentence is cut off at the end of one chunk
+
+`I was doing fine until someone brought up`
+
+You'll ideally want that full sentence to be picked up at the beginning of the next chunk to make sure the full meaning
+is captured. So in this example, if your chunk overlap is 42 characters, then the start of the next chunk would look
+like this:
+
+`I was doing fine until someone brought up the game. `
+
+---
+
+Often, this value is 10%-20% of the chunk size.
+
+By default, in Loki, this value is 5% of the chunk size. 
You can override this and specify the default chunk overlap (in
+characters) that Loki should use as a global default by setting the `rag_chunk_overlap` property in the global Loki
+configuration file:
+
+```yaml
+rag_chunk_overlap: null # Defines the overlap between chunks
+```
+
+### Top K
+In RAG, `top_k` represents the top `k` chunks to return from the vector database query. It is like searching for
+something on Google and only caring about the top 10 results: those results are what you use for your context.
+
+In Loki, the default value for this is `5`. You can customize this global default by setting the `rag_top_k` property in
+your global configuration file:
+
+```yaml
+rag_top_k: 5 # Specifies the number of documents to retrieve for answering queries
+```
+
+#### Top K Trade-Offs
+When customizing this value, keep in mind the following trade-offs so you get the best performance:
+
+* **Lower top_k (e.g. 3):** Faster, more focused context, lower cost, but risks missing relevant information
+* **Higher top_k (e.g. 10):** More comprehensive coverage, but more noise, higher latency, increased token costs, and
+  potential context window constraints
+
+### RAG Template
+When you use RAG in Loki, Loki looks up the relevant chunks of text and adds them as context to your query before
+sending it to the model. The format of this context is determined by the `rag_template` setting in your global Loki
+configuration file.
+
+This template utilizes two placeholders:
+* `__INPUT__`: The user's actual query
+* `__CONTEXT__`: The context retrieved from RAG
+
+These placeholders are replaced with the corresponding values, and the resulting text is what's actually passed to
+the model at query time.
+
+The default template that Loki uses is the following:
+
+```text
+Answer the query based on the context while respecting the rules. (user query, some textual context and rules, all inside xml tags)
+
+<context>
+__CONTEXT__
+</context>
+
+<rules>
+- If you don't know, just say so.
+- If you are not sure, ask for clarification.
+- Answer in the same language as the user query.
+- If the context appears unreadable or of poor quality, tell the user then answer as best as you can.
+- If the answer is not in the context but you think you know the answer, explain that to the user then answer with your own knowledge.
+- Answer directly and without using xml tags.
+</rules>
+
+<user_query>
+__INPUT__
+</user_query>
+```
+
+You can customize this template by specifying the `rag_template` setting in your global Loki configuration file. Your
+template *must* include both the `__INPUT__` and `__CONTEXT__` placeholders in order for it to be valid.
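The placeholder substitution described above can be pictured with the following sketch. It is an illustration only: the `render` helper, the shortened template, and the choice to join chunks with blank lines are all assumptions, not Loki's actual code.

```python
import re

# A shortened, hypothetical template; a real one must contain both placeholders.
TEMPLATE = """Answer the query based on the context.

Context:
__CONTEXT__

Query:
__INPUT__"""


def render(template: str, context_chunks: list[str], user_input: str) -> str:
    """Validate the template, then fill both placeholders in a single pass."""
    for placeholder in ("__CONTEXT__", "__INPUT__"):
        if placeholder not in template:
            raise ValueError(f"template is missing {placeholder}")
    values = {
        "__CONTEXT__": "\n\n".join(context_chunks),  # joining with blank lines is an assumption
        "__INPUT__": user_input,
    }
    # One regex pass, so placeholder-like text inside a chunk is never re-expanded.
    return re.sub(r"__CONTEXT__|__INPUT__", lambda match: values[match.group(0)], template)


prompt = render(TEMPLATE, ["chunk one", "chunk two"], "What is Loki?")
```

The validation step mirrors the requirement that a custom template include both placeholders; a template missing either one is rejected before any text is sent to the model.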