Entity Extraction

RAG works well when the answer to a query depends on the semantic content of a document. However, it falls short on tasks like "extract all emails" or "extract all contacts", which require pulling precise, semi-structured information out of unstructured data. Ragie's LLM-powered entity extraction feature addresses this need.

The entity extraction feature lets you create instructions within Ragie's platform. An instruction is a natural language prompt that tells Ragie what action to take on a document. Once an instruction is created, an LLM automatically applies it to every document that is created or updated. Extracted entities can then be retrieved using our APIs.

Let's walk through a detailed, step-by-step example of how to use entity extraction within Ragie's platform.

1. Create an Instruction

First, create an instruction using Ragie's create instruction API.

For example, an instruction to extract all emails from a document would look like this:

{
  "name": "emails",
  "active": true,
  "scope": "chunk",
  "prompt": "extract all emails from the document",
  "entity_schema": {"type": "object","properties": {"emails": {"type": "array","items": {"type": "string"}}}},
  "filter": { "type": "customer_doc" }
}
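
As a rough sketch, the instruction above could be sent with Python's standard library. The endpoint URL `https://api.ragie.ai/instructions`, the bearer-token header, and the `RAGIE_API_KEY` environment variable are assumptions that this guide does not state; check the API reference for the exact request shape. The request is only built here, and the network call is left commented out.

```python
import json
import os
import urllib.request

# Assumed endpoint; confirm against the create instruction API reference.
INSTRUCTIONS_URL = "https://api.ragie.ai/instructions"

instruction = {
    "name": "emails",
    "active": True,
    "scope": "chunk",
    "prompt": "extract all emails from the document",
    "entity_schema": {
        "type": "object",
        "properties": {"emails": {"type": "array", "items": {"type": "string"}}},
    },
    "filter": {"type": "customer_doc"},
}


def build_create_instruction_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the instruction as JSON."""
    return urllib.request.Request(
        INSTRUCTIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_instruction_request(
    instruction, os.environ.get("RAGIE_API_KEY", "YOUR_API_KEY")
)
# To actually send it:
# with urllib.request.urlopen(request) as response:
#     created = json.load(response)
```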

There are four core components to an instruction.

Prompt
- The prompt is written in natural language and tells the LLM what data to extract. More detailed and precise prompts produce better results. You can also include few-shot examples in the prompt.
- For example, a prompt could be "Extract all the contacts in the document" or "Extract the vendor name which follows Payable to. For example Payable to Acme Corp".

Scope
- Extraction can run at the chunk level or at the document level.
- Use chunk-level scope when fine-grained extraction is desired, such as "extract all emails". Internally, the document is split into smaller chunks and entity extraction runs on each chunk. For entity extraction, we use specialized document chunking with no overlap, so that overlapping text does not produce duplicate results.
- Use document-level scope when the document needs to be analyzed as a whole, for example "categorize the tone of this document into friendly or professional".
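
To illustrate why overlap matters for chunk-scoped extraction, here is a toy sketch. The word-based splitter and the regex email extractor are stand-ins for illustration only, not Ragie's chunker or LLM. They show how overlapping chunks can report the same entity twice, while non-overlapping chunks do not.

```python
import re

# Deliberately simple email pattern, good enough for the demo text below.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `size` words; neighbours share `overlap` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def extract_emails(chunks: list[str]) -> list[str]:
    """Run the toy extractor on every chunk and concatenate the results."""
    return [email for chunk in chunks for email in EMAIL_RE.findall(chunk)]


text = "Contact ann@example.com for billing or bob@example.com for support today"

no_overlap = extract_emails(chunk_words(text, size=4))
with_overlap = extract_emails(chunk_words(text, size=4, overlap=2))

print(no_overlap)    # each email appears once
print(with_overlap)  # bob@example.com appears twice
```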

Entity Schema

This is the JSON schema definition of the entity generated by an instruction. It can be as simple or as complex as the entity to be extracted.
The root of the entity schema must be an object. If you intend to extract a list, wrap it in an object, as in the "Contacts List" example below.
Sample JSON schema for extracting emails:

  "entity_schema": {
    "type": "object","properties": {"emails": {"type": "array","items": {"type": "string"}}}
  }
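
Because the root must be an object, a list-shaped result needs a wrapper property. The helper below is a hypothetical convenience (the name `wrap_list_schema` is not part of Ragie's API) that builds such a schema from an item schema.

```python
import json


def wrap_list_schema(property_name: str, item_schema: dict) -> dict:
    """Return an object-rooted JSON schema whose single property is an array."""
    return {
        "type": "object",
        "properties": {
            property_name: {"type": "array", "items": item_schema},
        },
    }


emails_schema = wrap_list_schema("emails", {"type": "string"})
print(json.dumps(emails_schema, indent=2))
```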

Sample JSON schema for extracting contacts:

"entity_schema": {
  "title": "Contacts List",
  "type": "object",
  "properties": {
    "contacts": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "firstName": {
            "type": "string",
            "description": "The person's first name"
          },
          "lastName": {
            "type": "string",
            "description": "The person's last name"
          },
          "address": {
            "type": "object",
            "properties": {
              "streetAddress": {
                "type": "string",
                "description": "The street address"
              },
              "city": {
                "type": "string",
                "description": "The city"
              },
              "state": {
                "type": "string",
                "description": "The state or province"
              },
              "postalCode": {
                "type": "string",
                "description": "The postal or ZIP code"
              },
              "country": {
                "type": "string",
                "description": "The country"
              }
            },
            "description": "The address of the person"
          },
          "phoneNumber": {
            "type": "string",
            "description": "The person's phone number",
            "pattern": "^\\+?[0-9\\-\\.\\(\\)\\s]+$"
          },
          "email": {
            "type": "string",
            "description": "The person's email address",
            "format": "email"
          }
        },
        "required": ["firstName", "lastName"],
        "additionalProperties": false
      }
    }
  },
  "required": ["contacts"],
  "additionalProperties": false
}
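
For orientation, an entity that satisfies this schema might look like the dictionary below. The sample values are invented, and the checks are a hand-rolled sketch using only the standard library rather than a full JSON Schema validator; they cover only the `required` fields of a contact and the `phoneNumber` pattern.

```python
import re

# Same pattern as in the schema, written as a Python raw string.
PHONE_PATTERN = re.compile(r"^\+?[0-9\-\.\(\)\s]+$")

sample_entity = {
    "contacts": [
        {
            "firstName": "Ada",
            "lastName": "Lovelace",
            "phoneNumber": "+1 (555) 010-9999",
            "email": "ada@example.com",
        }
    ]
}


def check_contact(contact: dict) -> list[str]:
    """Return a list of problems found in one contact (empty if none)."""
    problems = [
        f"missing {field}"
        for field in ("firstName", "lastName")
        if field not in contact
    ]
    phone = contact.get("phoneNumber")
    if phone is not None and not PHONE_PATTERN.match(phone):
        problems.append("phoneNumber does not match pattern")
    return problems


problems = [p for contact in sample_entity["contacts"] for p in check_contact(contact)]
print(problems)
```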

Filter

An optional metadata filter that is applied to documents as they are created or updated. If the filter matches the document's metadata, the instruction runs on the document; otherwise, the document is skipped. See the metadata filters documentation to learn more.

2. Upload Documents

Next, upload documents to Ragie using our create documents API. We support entity extraction on all document types, including images.
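
Below is a hedged sketch of an upload whose metadata would satisfy the `customer_doc` filter from step 1. The `https://api.ragie.ai/documents/raw` endpoint, its `data` and `metadata` fields, and the bearer-token header are assumptions modeled on a JSON-style upload; consult the create documents API reference for the actual request format (file uploads typically use multipart form data instead). The request is built but not sent.

```python
import json
import urllib.request

# Assumed JSON upload endpoint; confirm against the create documents API reference.
RAW_DOCUMENTS_URL = "https://api.ragie.ai/documents/raw"


def build_upload_request(text: str, metadata: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that uploads raw text with metadata."""
    body = {"data": text, "metadata": metadata}  # assumed field names
    return urllib.request.Request(
        RAW_DOCUMENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


upload = build_upload_request(
    "Reach us at support@example.com",
    {"type": "customer_doc"},  # matches the filter on the "emails" instruction
    "YOUR_API_KEY",
)
```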

Our document ingestion pipeline automatically runs every active instruction on each document as it is created or updated, provided that the instruction either omits a filter or has a filter that matches the document's metadata.
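
This rule can be summarized in a few lines. The sketch is simplified: it treats a filter as a set of exact key/value matches, which is an assumption made for illustration; real metadata filters may support richer operators. The function names are hypothetical.

```python
def filter_matches(metadata_filter: dict, metadata: dict) -> bool:
    """True when every key in the filter equals the document's metadata value."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def should_run(instruction: dict, metadata: dict) -> bool:
    """Apply the rule: the instruction is active, and has no filter or a matching one."""
    if not instruction.get("active", False):
        return False
    metadata_filter = instruction.get("filter")
    return metadata_filter is None or filter_matches(metadata_filter, metadata)


emails = {"name": "emails", "active": True, "filter": {"type": "customer_doc"}}
tone = {"name": "tone", "active": True}  # no filter: runs on every document

print(should_run(emails, {"type": "customer_doc"}))  # True
print(should_run(emails, {"type": "invoice"}))       # False
print(should_run(tone, {"type": "invoice"}))         # True
```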

Please note that on document updates, previously extracted entities for a document are deleted and replaced by entities extracted from the new version.
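
In other words, extraction results behave like a replace, not an append. The in-memory store below is a hypothetical model of that behavior, not Ragie's storage layer.

```python
# Maps a document ID to the entities extracted from its latest version.
entities_by_document: dict[str, list[dict]] = {}


def on_document_processed(document_id: str, extracted: list[dict]) -> None:
    """Store the latest extraction, discarding anything from earlier versions."""
    entities_by_document[document_id] = list(extracted)


on_document_processed("doc_1", [{"emails": ["old@example.com"]}])
on_document_processed("doc_1", [{"emails": ["new@example.com"]}])  # document updated

print(entities_by_document["doc_1"])  # only the entities from the new version remain
```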

3. Retrieve Entities

Once entities have been extracted, you can retrieve them using our APIs. There are two APIs for retrieving extracted entities:

- Get all extracted entities for a document. This API returns all extracted entities for a particular document across all active instructions.
- Get all extracted entities for an instruction. This API returns all extracted entities for a particular instruction across all documents.
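
A minimal sketch of calling both retrieval APIs with the standard library follows. The base URL, the paths `/documents/{document_id}/entities` and `/instructions/{instruction_id}/entities`, and the bearer-token header are assumptions; verify them in the API reference. The IDs in the commented example are placeholders.

```python
import json
import urllib.request

BASE_URL = "https://api.ragie.ai"  # assumed base URL


def document_entities_url(document_id: str) -> str:
    """Assumed path: entities for one document across all instructions."""
    return f"{BASE_URL}/documents/{document_id}/entities"


def instruction_entities_url(instruction_id: str) -> str:
    """Assumed path: entities for one instruction across all documents."""
    return f"{BASE_URL}/instructions/{instruction_id}/entities"


def fetch_json(url: str, api_key: str) -> dict:
    """GET a URL with a bearer token (assumed auth scheme) and decode the JSON body."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (requires a real API key and IDs):
# entities = fetch_json(document_entities_url("<document_id>"), "<api_key>")
```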