Step 2: Load data into Ragie

This section teaches you how to load data into Ragie using the data ingest APIs.

In this section, we will load data into Ragie from a directory that has some files in it. The example files that we are using are available in our ragie-examples GitHub repository. For this example, we will use Ragie's multipart file uploader API, but when you build your application, you can ingest data into Ragie using any of three methods:

  1. The multipart file uploader API
  2. The raw content JSON API
  3. Using a connector to keep data synchronized from a source

Here is some code to accomplish what we just described. It is similar to, but not exactly the same as, the code in our GitHub repository. For this example, we've removed most error handling and some output so we can focus on the data ingest logic. For a more robust example, see the tutorial in ragie-examples.

import fs from "node:fs";
import path from "node:path";
import { readFile } from "node:fs/promises";

const apiKey = "<YOUR API KEY>";
const directory = "path/to/directory";

// List every entry in the directory we want to ingest.
const files = fs.readdirSync(directory);

for (const file of files) {
  // Read the file from disk and wrap it in a Blob for the multipart body.
  const filePath = path.join(directory, file);
  const blob = new Blob([await readFile(filePath)]);

  // Attach the file and its metadata as multipart form fields.
  const formData = new FormData();
  formData.append("file", blob, file);
  formData.append("metadata", JSON.stringify({ title: file, scope: "tutorial" }));

  // Do not set a content-type header here; fetch adds the multipart boundary itself.
  const response = await fetch("https://api.ragie.ai/documents", {
    method: "POST",
    headers: {
      authorization: `Bearer ${apiKey}`,
      accept: "application/json",
    },
    body: formData,
  });

  if (!response.ok) {
    throw new Error("Upload failed");
  }
}

Each document in the directory is loaded into Ragie and then processed. The amount of time it takes to process each document can vary based on the type of document that is passed to Ragie as well as the mode in which the document is processed.

After this runs successfully, we will have added all of our podcasts into Ragie. We can then use this data later in our application to generate content that includes data the LLM has not been trained on. This is the beauty of RAG and how Ragie helps you build these types of applications quickly.
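
The upload loop above throws a generic error when a request fails. As a minimal sketch of slightly more informative error handling, the helper below wraps the same POST request and includes the HTTP status and response body in the error. The function name `uploadDocument` and the injectable `fetchImpl` parameter are hypothetical, added only for illustration and testing. The sketch also assumes the upload response body is JSON, as requested by the accept header.

```javascript
// Upload one prepared FormData to the documents endpoint used in this section.
// `fetchImpl` is a hypothetical parameter that defaults to the global fetch,
// so the helper can be exercised without making a network call.
async function uploadDocument(formData, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl("https://api.ragie.ai/documents", {
    method: "POST",
    headers: {
      authorization: `Bearer ${apiKey}`,
      accept: "application/json",
    },
    body: formData,
  });

  if (!response.ok) {
    // Surface the status code and body text to make failures easier to debug.
    const detail = await response.text();
    throw new Error(`Upload failed (${response.status}): ${detail}`);
  }

  // Assumption: the response body is JSON describing the created document.
  return response.json();
}
```

Inside the loop, the `fetch` call and the `response.ok` check could then be replaced with `await uploadDocument(formData, apiKey)`.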

Using Metadata
Notice that in our example above, we added some metadata to our API call. You can attach metadata to any document that you send to Ragie. Metadata can be used later during retrieval to filter results. Typical uses of metadata include attaching a "scope" or an "environment", or including an "organization_id" to segment your customer data. Metadata may also include more information about the document that you want to reference in your application, such as the title of the document or the source URL of the document. What you put in metadata is up to you and your application.
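
As an illustration of the metadata described above, here is a minimal sketch of a helper that builds the metadata object for one file. The helper name `buildMetadata`, its option names, and the `org_123` value are hypothetical. The keys `title`, `scope`, `environment`, and `organization_id` follow the examples in this section.

```javascript
// Build a metadata object for one document, including only the keys provided.
// This is an illustrative helper, not part of any Ragie SDK.
function buildMetadata(file, { scope, environment, organizationId } = {}) {
  const metadata = { title: file };
  if (scope) metadata.scope = scope;
  if (environment) metadata.environment = environment;
  if (organizationId) metadata.organization_id = organizationId;
  return metadata;
}

// The result is serialized the same way as in the upload example:
// formData.append("metadata", JSON.stringify(metadata));
const metadata = buildMetadata("notes.pdf", {
  scope: "tutorial",
  organizationId: "org_123", // hypothetical customer identifier
});
console.log(JSON.stringify(metadata));
// prints {"title":"notes.pdf","scope":"tutorial","organization_id":"org_123"}
```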

Querying document status

You may query the document to see if it has been fully processed. When you see a status of ready, the document has been fully processed and is ready to be used in retrieval.

const apiKey = "<YOUR API KEY>";
const id = "21881de1-65f7-4816-b571-3ef69661c375";

// Retrieve document status
(async () => {
  const response = await fetch(`https://api.ragie.ai/documents/${id}`, {
    headers: { authorization: `Bearer ${apiKey}` },
  });

  const data = await response.json();

  console.log(data);
})();

Sample output looks like this:

{
  "id": "21881de1-65f7-4816-b571-3ef69661c375",
  "created_at": "2024-07-05T22:21:21.099975Z",
  "updated_at": "2024-07-05T22:21:37.521837Z",
  "status": "ready",
  "name": "All In Pod Episode E112.pdf",
  "metadata": {
    "title": "All In Pod Episode E112.pdf",
    "scope": "tutorial"
  },
  "chunk_count": 86,
  "external_id": null
}

It is typical to poll for the document status to know when a document is ready. In the future, Ragie will support webhooks that can push document status changes to your application.
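
The polling described above can be sketched as follows. This is a minimal example under stated assumptions, not a definitive implementation: the function names `getDocument` and `waitUntilReady`, the 2 second interval, and the 30 attempt limit are all hypothetical choices. It uses the same document endpoint as the status example and stops when the status is "ready".

```javascript
const apiKey = "<YOUR API KEY>";

// Fetch one document record from the same endpoint as the status example.
async function getDocument(id) {
  const response = await fetch(`https://api.ragie.ai/documents/${id}`, {
    headers: { authorization: `Bearer ${apiKey}` },
  });
  if (!response.ok) {
    throw new Error(`Status check failed: ${response.status}`);
  }
  return response.json();
}

// Poll until the document reports a status of "ready" or attempts run out.
// `getDoc` is injectable so the loop can be tested without network access.
async function waitUntilReady(
  id,
  { getDoc = getDocument, intervalMs = 2000, maxAttempts = 30 } = {}
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const doc = await getDoc(id);
    if (doc.status === "ready") {
      return doc;
    }
    // Wait before the next check, except after the final attempt.
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Document ${id} was not ready after ${maxAttempts} attempts`);
}
```

A caller would then write `const doc = await waitUntilReady(id);` before running retrievals that depend on the document. Capping the number of attempts keeps a stuck document from blocking the application indefinitely.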

What did Ragie do?

  • Ragie extracted all of the information from your documents. By default, Ragie uses “fast” mode to extract data. “Fast” mode is perfect for text documents, but will miss some information in complex documents that include images, charts, and tables. Ragie has a “hi_res” mode for processing complex documents that uses a combination of models, such as multi-modal LLMs and OCR, to extract non-textual information. This allows your generative AI applications to use the typically inaccessible information embedded in charts, graphs, and other non-textual content. Keep in mind that there is a speed tradeoff: “hi_res” mode extracts more information but takes longer than “fast” mode.
  • Ragie created optimized chunks from your documents for use in your prompt’s context window (you’ll see an example of this later in the tutorial). Naive approaches to chunking can be fairly simple, but optimal chunking is a complex and rapidly evolving area of research. Ragie stays at the forefront of this research and implements the most promising techniques.
  • Ragie indexed the generated chunks for semantic retrieval in a vector search database. This allows your AI applications to perform natural language queries of your data that return the most accurate and relevant information.
  • Ragie created summaries of your documents and indexed them in a Summary Index, to improve retrieval results. Learn more about the Summary Index.
  • If you’ve created Entity Extraction Instructions, Ragie may have extracted structured data from your documents. Learn more about Entity Extraction.
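
To make the “fast” versus “hi_res” choice above concrete, here is a minimal sketch of building the upload form with a processing mode. The helper name `buildUploadForm` is hypothetical, and the sketch assumes the upload endpoint reads the mode from a form field named "mode"; this section does not state the field name, so check the API reference before relying on it.

```javascript
// Build the multipart form for one file, with an explicit processing mode.
// Uses the global Blob and FormData available in Node 18+ and in browsers.
function buildUploadForm(blob, fileName, { mode = "fast" } = {}) {
  const formData = new FormData();
  formData.append("file", blob, fileName);
  formData.append(
    "metadata",
    JSON.stringify({ title: fileName, scope: "tutorial" })
  );
  // Assumption: the processing mode is passed as a "mode" form field.
  formData.append("mode", mode);
  return formData;
}

// Example: request "hi_res" processing for a document with charts and tables.
const form = buildUploadForm(new Blob(["example"]), "report.pdf", {
  mode: "hi_res",
});
```

The resulting form can be passed as the `body` of the same POST request used in the upload example.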

Let's recap

In this section, you learned how to:

  1. Use the multipart document upload API to load data into Ragie. You learned that this is one of three ways that Ragie can ingest data.
  2. Use the document API to query for document status.
  3. Use Ragie for heavy lifting that you would ordinarily need to build yourself. This work is abstracted away from you as a developer so you can focus on building your application.