Step 2: Load data into Ragie

This section will teach you how to load data into Ragie using the data ingest APIs.

In this section, we will load data into Ragie from a directory of files. The example files that we are using are available in our ragie-examples GitHub repository. For this example, we will use Ragie's multipart file uploader API, but when you build your application, you can ingest data into Ragie using any of three methods:

  1. The multipart file uploader API
  2. The raw content JSON API
  3. Using a connector to keep data synchronized from a source

Here is some code to accomplish what we just described. It is similar to, but not exactly the same as, the code in our GitHub repository. For this example, we've removed error handling and some output so we can focus on only the data ingest logic. For a more robust example, see the tutorial in ragie-examples.

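// Run as an ES module (for example, a .mjs file) so that top-level await works.
import fs from "node:fs";
import path from "node:path";
import { readFile } from "node:fs/promises";

const apiKey = "<YOUR API KEY>";
const directory = "path/to/directory";

// List the entries in the directory; this assumes it contains only files.
const files = fs.readdirSync(directory);

for (const file of files) {
  // Read the file from disk and wrap its bytes in a Blob for the form upload.
  const filePath = path.join(directory, file);
  const blob = new Blob([await readFile(filePath)]);

  // Build the multipart body: the file itself plus JSON-encoded metadata.
  const formData = new FormData();
  formData.append("file", blob, file);
  formData.append("metadata", JSON.stringify({ title: file, environment: "tutorial" }));

  const response = await fetch("https://api.ragie.ai/documents", {
    method: "POST",
    headers: {
      authorization: `Bearer ${apiKey}`,
      accept: "application/json",
    },
    body: formData,
  });

  if (!response.ok) {
    throw new Error(`Upload failed for ${file} (HTTP ${response.status})`);
  }
}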

Each document in the directory is loaded into Ragie and then processed. The amount of time it takes to process each document can vary based on the type of document that is passed to Ragie as well as the mode in which the document is processed.

After this runs successfully, we will have added all of our podcasts into Ragie. We can then use this data later in our application to generate content that includes data the LLM has not been trained on. This is the beauty of RAG and how Ragie helps you build these types of applications quickly.

Using Metadata

Notice that in our example above, we added some metadata to our API call. You can attach metadata to any document that you send to Ragie. Metadata can be used later during retrieval to filter results. Typical uses of metadata include attaching an "environment" or including an "organization_id" to segment your customer data. Metadata may also include more information about the document that you want to reference in your application, such as the title of the document or the source URL of the document. What you put in metadata is up to you and your application.
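
As a rough illustration of attaching metadata, the sketch below wraps the form-building step from the upload example in a small helper. The helper name `buildUploadForm`, the file name, and the `organization_id` value are hypothetical and chosen only for illustration; the "file" and "metadata" field names come from the upload example above. It assumes Node 18 or later, where `FormData` and `Blob` are globals.

```javascript
// Hypothetical helper (not part of any Ragie SDK): builds the multipart
// body for one upload and attaches caller-supplied metadata.
function buildUploadForm(fileName, contents, metadata) {
  const formData = new FormData();
  formData.append("file", new Blob([contents]), fileName);
  // Metadata is sent as a JSON string, as in the upload example.
  formData.append("metadata", JSON.stringify(metadata));
  return formData;
}

const form = buildUploadForm("episode.txt", "transcript text", {
  title: "episode.txt",
  environment: "tutorial",
  organization_id: "org_123", // hypothetical tenant identifier
});

// Inspect the metadata field before sending the form with fetch().
console.log(form.get("metadata"));
```

Keeping the metadata in one object per upload makes it easier to apply the same keys to every document, which matters if you plan to filter on those keys during retrieval.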

Querying document status

You can query a document to see whether it has been fully processed. When you see a status of ready, the document has been fully processed and is ready to be used in retrieval.

const apiKey = "<YOUR API KEY>";
const id = "21881de1-65f7-4816-b571-3ef69661c375";

// Retrieve document status
(async () => {
  const response = await fetch(`https://api.ragie.ai/documents/${id}`, {
    headers: { authorization: `Bearer ${apiKey}` },
  });

  const data = await response.json();

  console.log(data);
})();

Sample output looks like this:

{
  "id": "21881de1-65f7-4816-b571-3ef69661c375",
  "created_at": "2024-07-05T22:21:21.099975Z",
  "updated_at": "2024-07-05T22:21:37.521837Z",
  "status": "ready",
  "name": "All In Pod Episode E112.pdf",
  "metadata": {
    "title": "All In Pod Episode E112.pdf",
    "environment": "tutorial"
  },
  "chunk_count": 86,
  "external_id": null
}

It is typical to poll for the document status to know when a document is ready. In the future, Ragie will support webhooks that can push document status changes to your application.
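
A minimal polling sketch, assuming the document endpoint and the "ready" status shown above, might look like the following. The function name `waitForReady`, its options (`intervalMs`, `maxAttempts`, `fetchImpl`), and their default values are assumptions made for this example, not part of the Ragie API. It treats any status other than "ready" as "keep waiting", so a real application may also want to stop early on a failure status.

```javascript
// Hypothetical helper: polls the document endpoint until the document
// reports a status of "ready", or gives up after maxAttempts requests.
async function waitForReady(id, apiKey, options = {}) {
  // fetchImpl is injectable so the function can be exercised without a network.
  const { intervalMs = 2000, maxAttempts = 30, fetchImpl = fetch } = options;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetchImpl(`https://api.ragie.ai/documents/${id}`, {
      headers: { authorization: `Bearer ${apiKey}` },
    });
    if (!response.ok) {
      throw new Error(`Status request failed (HTTP ${response.status})`);
    }

    const data = await response.json();
    if (data.status === "ready") {
      return data;
    }

    // Not ready yet: pause before the next request.
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }

  throw new Error(`Document ${id} was not ready after ${maxAttempts} attempts`);
}

// Usage (inside an async function or ES module):
//   const doc = await waitForReady("<DOCUMENT ID>", "<YOUR API KEY>");
//   console.log(doc.status);
```

A fixed interval keeps the sketch short; for large batches, a longer interval or a backoff between attempts would reduce the number of requests.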

Explaining what Ragie did for you

With a simple API call, Ragie has done the following for you:

  • Extract all of the information from the documents. By default, Ragie uses "fast" mode to extract data. This is just fine for text files, but Ragie can do much more. In "hi_res" mode, Ragie will analyze all portions of the document and use multiple extraction methods to squeeze every bit of information out of a document. This includes using LLM steps and OCR to pull data from an image as well as using LLM steps to pull data from charts and graphs. You can change the resolution mode for ingest by including the "mode" parameter in the request. Keep in mind that there is a speed tradeoff: "hi_res" mode extracts more but takes longer than "fast" mode.
  • Create chunks to be used later in your prompt's context window (if you don't know what that means, don't worry, we will explain this in the Retrieve Chunks section of the tutorial). Ragie uses the latest and greatest chunking techniques so you don't have to worry about which one to choose. We are always experimenting with new methods to make sure you get the most out of your data.
  • Index each file for retrieval via a semantic search query. Ragie handles the tedium of creating embeddings and storing them in a vector search database.
  • Create summaries and index them in a special Summary Index. This is out of scope for this basic tutorial. If you have configured instructions to be used in Entity Extraction, then entities will be extracted as well. If you don't understand what this means, don't worry. These are some of the more advanced features of Ragie and they are described in the "Advanced Features" section. You can get started using Ragie without knowing what this all means, but as you get deeper into using RAG and Ragie, we think you will find these features extremely useful.
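
To make the extraction step concrete, here is a small sketch of choosing the resolution mode at upload time. It assumes the "mode" parameter is sent as a form field alongside the file in the multipart request; the text above names the parameter and its two values but does not show where it goes, so treat that placement as an assumption. The helper name `buildUploadFormWithMode` is hypothetical.

```javascript
// Hypothetical helper: builds an upload form that requests a specific
// extraction mode. Only the two modes named above are accepted.
function buildUploadFormWithMode(fileName, contents, mode = "fast") {
  if (mode !== "fast" && mode !== "hi_res") {
    throw new Error(`Unknown mode: ${mode}`);
  }

  const formData = new FormData();
  formData.append("file", new Blob([contents]), fileName);
  // Assumption: "mode" travels as a multipart form field next to the file.
  formData.append("mode", mode);
  return formData;
}

// Request the slower, more thorough extraction for a document with charts.
const hiResForm = buildUploadFormWithMode("report.pdf", "file bytes here", "hi_res");
console.log(hiResForm.get("mode"));
```

The resulting form can be passed as the `body` of the same POST request used in the upload example.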

Let's recap

In this section, you learned how to:

  1. Use the multipart document upload API to load data into Ragie. You learned that this is one of three ways that Ragie can ingest data.
  2. Use the document API to query for document status.
  3. Use Ragie for heavy lifting that you would ordinarily need to build yourself. This work is abstracted away from you as a developer so you can focus on building your application.