Metadata & Filters

Default metadata on each document

The following metadata properties are set by default on every document. You cannot overwrite them; if you supply a property with one of these names, it is ignored.

{
  "document_id": string,
  "document_type": string,      // e.g. pdf, doc, docx, img, jpg
  "document_source": string,    // one of api, google_drive
  "document_name": string,      // human-readable document name given during upload, or derived if none was given
  "document_uploaded_at": int   // seconds since unix epoch
}

In addition, the following metadata properties are included for each Connector document.

{
  "_source_created_at": int, // e.g. 1726374688. seconds since unix epoch. When the connector document was created
  "_source_updated_at": int  // e.g. 1726374688. seconds since unix epoch. When the connector document was last updated
}
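
All of the timestamp properties above are integers counting seconds since the Unix epoch. As a minimal sketch of converting between those integers and timezone-aware datetimes in Python (the helper names `epoch_to_utc` and `utc_to_epoch` are hypothetical and not part of any Ragie SDK):

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def utc_to_epoch(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(moment.timestamp())


# 1726374688 is the example value used in the comments above.
created = epoch_to_utc(1726374688)
print(created.isoformat())  # 2024-09-15T04:31:28+00:00
```

Going the other way, `utc_to_epoch` produces an integer that can be compared against these properties.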

Different Connectors might have their own additional metadata; check the relevant Connector page for more information.


📘

Reserved Metadata Keys

Apart from the default properties above, Ragie reserves metadata keys with a leading underscore for internal metadata. If you try to add metadata with such a key, you will get a 422 error. For example, _name is rejected as a metadata key.
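
A client can catch both cases before uploading. The following is a rough local pre-flight check, not Ragie's own validation: the function name `check_metadata_keys` is hypothetical, and the set of default keys is copied from the list at the top of this page.

```python
# Default properties listed on this page; duplicates of these are ignored.
DEFAULT_KEYS = {
    "document_id",
    "document_type",
    "document_source",
    "document_name",
    "document_uploaded_at",
}


def check_metadata_keys(metadata):
    """Split metadata into (usable, ignored_keys, rejected_keys).

    ignored_keys  -- default properties, which cannot be overwritten
    rejected_keys -- keys with a leading underscore, which would cause a 422
    """
    usable, ignored, rejected = {}, [], []
    for key, value in metadata.items():
        if key.startswith("_"):
            rejected.append(key)
        elif key in DEFAULT_KEYS:
            ignored.append(key)
        else:
            usable[key] = value
    return usable, ignored, rejected


usable, ignored, rejected = check_metadata_keys(
    {"genre": "drama", "_name": "x", "document_id": "abc"}
)
print(usable, ignored, rejected)
```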

Supported metadata types

You can associate a metadata payload with each document, as key-value pairs in a JSON object where keys are strings and values are one of:

  • String
  • Number (integer or floating point; converted to a 64-bit floating point)
  • Boolean (true, false)
  • List of String
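
As an illustration, the type rules in this list can be checked locally before upload. This is a sketch under assumptions: `is_supported_value` is a hypothetical helper, and whether an empty list is accepted is not stated here, so the sketch simply lets it through. The last lines show one consequence of the 64-bit floating point conversion: integers above 2**53 may not survive exactly.

```python
def is_supported_value(value):
    """Return True if value is a string, number, boolean, or list of strings."""
    # bool is a subclass of int in Python; both are supported types anyway.
    if isinstance(value, (str, bool, int, float)):
        return True
    if isinstance(value, list):
        # Only lists whose items are all strings are supported.
        return all(isinstance(item, str) for item in value)
    return False


print(is_supported_value(["comedy", "documentary"]))  # True
print(is_supported_value([2019, 2020]))               # False: list of numbers
print(is_supported_value({"nested": "object"}))       # False: nested object

# A 64-bit float has a 53-bit significand, so 2**53 + 1 rounds to 2**53.
print(float(2**53 + 1) == float(2**53))               # True
```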

Metadata-based filtering in Ragie is a pre-filter, which means retrieval is guaranteed to return top_k results if that many matching results exist. The only exception is when re-ranking is turned on: re-ranking supports at most top_k results, so fewer may be returned.

There is a maximum of 1000 values per metadata object.

Metadata query language

The following metadata operators are supported during retrieval:

  • $eq - Equal to (number, string, boolean)
  • $ne - Not equal to (number, string, boolean)
  • $gt - Greater than (number)
  • $gte - Greater than or equal to (number)
  • $lt - Less than (number)
  • $lte - Less than or equal to (number)
  • $in - In array (string or number)
  • $nin - Not in array (string or number)

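Metadata filters can be combined with $and and $or, each of which takes a list of filters:

  • $and - All of the listed filters match
  • $or - At least one of the listed filters matches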
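
The operand types listed for each operator, plus the $and and $or combinators used in the examples on this page, can be checked on the client before a retrieval is sent. The sketch below is one possible local validator, not Ragie's implementation: `validate_filter` is a hypothetical name, and treating a bare value such as {"genre": "comedy"} as shorthand for $eq is inferred from the examples on this page.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _check_operand(op, operand):
    """Raise ValueError if operand has the wrong type for operator op."""
    if op in ("$eq", "$ne"):
        ok = isinstance(operand, (str, bool)) or _is_number(operand)
    elif op in ("$gt", "$gte", "$lt", "$lte"):
        ok = _is_number(operand)
    elif op in ("$in", "$nin"):
        ok = isinstance(operand, list) and all(
            isinstance(item, str) or _is_number(item) for item in operand
        )
    else:
        raise ValueError(f"unsupported operator: {op!r}")
    if not ok:
        raise ValueError(f"invalid operand for {op}: {operand!r}")


def validate_filter(flt):
    """Recursively validate a filter; returns None or raises ValueError."""
    if not isinstance(flt, dict):
        raise ValueError("a filter must be a JSON object")
    for key, value in flt.items():
        if key in ("$and", "$or"):
            if not isinstance(value, list):
                raise ValueError(f"{key} expects a list of filters")
            for sub_filter in value:
                validate_filter(sub_filter)
        elif isinstance(value, dict):
            for op, operand in value.items():
                _check_operand(op, operand)
        else:
            # Assumption: a bare value is shorthand for $eq.
            _check_operand("$eq", value)


validate_filter({"$or": [{"genre": "drama"}, {"year": {"$gte": 2020}}]})
```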


Examples

Using arrays of strings as metadata values or as metadata filters

A document with metadata payload…

{ "genre": ["comedy", "documentary"] }

…means "genre" takes on both values.

Retrievals with the following filters will match the document:

{"genre":"comedy"}
{"genre": {"$in":["documentary","action"]}}
{"$and": [{"genre": "comedy"}, {"genre":"documentary"}]}

Retrievals with the following filter will not match the document:

{ "$and": [{ "genre": "comedy" }, { "genre": "drama" }] }

Retrievals with the following filters are invalid. They do not match the document and instead result in an error during retrieval:

# INVALID QUERY:
{"genre": ["comedy", "documentary"]}

# INVALID QUERY:
{"genre": {"$eq": ["comedy", "documentary"]}}
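
The array behaviour described above can be modelled with a small local matcher. This is only a sketch of the semantics as this page describes them, not how Ragie evaluates filters: `matches` is a hypothetical function, it covers only bare values, $eq, $in, $and and $or, and treating a missing key as a non-match is an assumption.

```python
def _test(op, stored, operand):
    """Evaluate one operator against a stored scalar or list of strings."""
    values = stored if isinstance(stored, list) else [stored]
    if op == "$eq":
        return operand in values               # any stored value equals it
    if op == "$in":
        return any(v in operand for v in values)
    raise ValueError(f"operator not covered by this sketch: {op}")


def matches(metadata, flt):
    """Return True if metadata satisfies flt; raise ValueError if flt is invalid."""
    for key, cond in flt.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in cond)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in cond)
        else:
            if not isinstance(cond, dict):
                cond = {"$eq": cond}           # bare value is shorthand for $eq
            for op, operand in cond.items():
                if isinstance(operand, list) and op != "$in":
                    raise ValueError("a list operand is only valid with $in")
            if key not in metadata:
                ok = False                     # assumption: missing key never matches
            else:
                ok = all(_test(op, metadata[key], operand)
                         for op, operand in cond.items())
        if not ok:
            return False
    return True


doc = {"genre": ["comedy", "documentary"]}
print(matches(doc, {"genre": "comedy"}))                                      # True
print(matches(doc, {"genre": {"$in": ["documentary", "action"]}}))            # True
print(matches(doc, {"$and": [{"genre": "comedy"}, {"genre": "drama"}]}))      # False
```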

More example filter expressions

A comedy, documentary, or drama:

{
  "genre": { "$in": ["comedy", "documentary", "drama"] }
}

A drama from 2020 or later:

{
  "genre": { "$eq": "drama" },
  "year": { "$gte": 2020 }
}

A drama from 2020 or later (equivalent to the previous example):

{
  "$and": [{ "genre": { "$eq": "drama" } }, { "year": { "$gte": 2020 } }]
}
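
The equivalence between the two forms can be expressed as a mechanical rewrite. A minimal sketch, assuming a filter is held as a Python dict (`to_explicit_and` is a hypothetical helper name):

```python
def to_explicit_and(flt):
    """Rewrite a filter with several top-level keys as an explicit $and."""
    if len(flt) <= 1:
        return flt  # nothing to combine
    # Each top-level key becomes its own single-key filter inside $and.
    return {"$and": [{key: value} for key, value in flt.items()]}


implicit = {"genre": {"$eq": "drama"}, "year": {"$gte": 2020}}
print(to_explicit_and(implicit))
```

Python dicts preserve insertion order, so the conditions appear inside $and in the same order as the original keys.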

A drama, or any movie from 2020 or later:

{
  "$or": [{ "genre": { "$eq": "drama" } }, { "year": { "$gte": 2020 } }]
}