Metadata & Filters

Default metadata on each document

The following metadata properties are set by default on every document. You cannot overwrite them: if your metadata payload includes a property with one of these names, that property is ignored.

{
  "document_id": string,
  "document_type": string,      // one of pdf, doc, docx, img, jpg, etc
  "document_source": string,    // one of api, google_drive
  "document_name": string,      // human readable document name given during upload or derived if none was given
  "document_uploaded_at": int   // seconds since unix epoch
}
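
Because document_uploaded_at is expressed in seconds since the Unix epoch, client code usually needs to convert it before displaying it. The following is a minimal sketch using only the Python standard library; the function name and the sample values are hypothetical and chosen for illustration.

```python
from datetime import datetime, timezone


def uploaded_at_to_datetime(metadata: dict) -> datetime:
    """Convert document_uploaded_at (seconds since the Unix epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(metadata["document_uploaded_at"], tz=timezone.utc)


# Hypothetical default metadata, shaped like the schema above.
example = {
    "document_id": "doc-123",
    "document_type": "pdf",
    "document_source": "api",
    "document_name": "report.pdf",
    "document_uploaded_at": 1700000000,
}

print(uploaded_at_to_datetime(example).isoformat())  # 2023-11-14T22:13:20+00:00
```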


Supported metadata types

You can associate a metadata payload with each document, as key-value pairs in a JSON object where keys are strings and values are one of:

  • String
  • Number (integer or floating point; converted to a 64-bit floating-point value)
  • Boolean (true, false)
  • List of strings

Metadata-based filtering in Ragie is a pre-filter, which means it guarantees retrieval of top_k results if that many matching results exist. The only exception is when re-ranking is turned on: re-ranking supports at most top_k results, so fewer may be returned.

A metadata object can contain at most 1000 values.
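
The rules above can be checked on the client before a document is uploaded. The sketch below is one possible pre-flight check, not Ragie's own validation: the function name validate_metadata is hypothetical, and the way values are counted against the 1000-value limit (one per scalar, one per list element) is an assumption, since the counting rule is not spelled out here.

```python
# Default properties that cannot be overwritten (from the list of default metadata).
RESERVED_KEYS = {
    "document_id",
    "document_type",
    "document_source",
    "document_name",
    "document_uploaded_at",
}
MAX_VALUES = 1000


def validate_metadata(metadata: dict) -> dict:
    """Return a cleaned copy of a metadata payload, or raise on unsupported input."""
    cleaned = {}
    value_count = 0
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} must be a string")
        if key in RESERVED_KEYS:
            continue  # mirrors the documented behavior: duplicates of defaults are ignored
        if isinstance(value, (bool, str)):
            # bool is checked before numbers because bool is a subclass of int in Python
            cleaned[key] = value
            value_count += 1
        elif isinstance(value, (int, float)):
            cleaned[key] = float(value)  # numbers are stored as 64-bit floats
            value_count += 1
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            cleaned[key] = list(value)
            value_count += len(value)  # ASSUMPTION: each list element counts as one value
        else:
            raise TypeError(f"unsupported value for {key!r}: {value!r}")
    if value_count > MAX_VALUES:
        raise ValueError(f"metadata has {value_count} values, limit is {MAX_VALUES}")
    return cleaned


print(validate_metadata({"genre": ["comedy", "documentary"], "year": 2020, "document_id": "x"}))
```

Nested objects, null values, and lists of numbers are rejected here because they are not among the supported value types.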


Metadata query language

The following metadata operators are supported during retrieval:

  • $eq - Equal to (number, string, boolean)
  • $ne - Not equal to (number, string, boolean)
  • $gt - Greater than (number)
  • $gte - Greater than or equal to (number)
  • $lt - Less than (number)
  • $lte - Less than or equal to (number)
  • $in - In array (string or number)
  • $nin - Not in array (string or number)

Metadata filters can be combined with the $and and $or operators, each of which takes a list of filters.
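
To make the operator semantics concrete, here is a small local evaluator written in Python. It is an illustrative sketch of how such filters can be interpreted, not Ragie's implementation. The function names are hypothetical, the filter is assumed to be valid, and two behaviors are assumptions rather than documented facts: a field that is missing from the metadata never matches, and a list-valued field matches a comparison when any of its elements does.

```python
NUMERIC_OPS = {
    "$gt": lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt": lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _equal(a, b) -> bool:
    # keep booleans and numbers apart (True == 1 in plain Python)
    return isinstance(a, bool) == isinstance(b, bool) and a == b


def _check(op: str, actual, operand) -> bool:
    values = actual if isinstance(actual, list) else [actual]
    if op == "$eq":
        return any(_equal(v, operand) for v in values)
    if op == "$ne":
        return not any(_equal(v, operand) for v in values)
    if op == "$in":
        return any(_equal(v, o) for v in values for o in operand)
    if op == "$nin":
        return not any(_equal(v, o) for v in values for o in operand)
    if op in NUMERIC_OPS:
        return any(_is_number(v) and NUMERIC_OPS[op](v, operand) for v in values)
    raise ValueError(f"unsupported operator: {op}")


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if the metadata satisfies every top-level condition of the filter."""
    for key, cond in flt.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in cond)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in cond)
        elif key not in metadata:
            ok = False  # ASSUMPTION: a missing field never matches
        elif isinstance(cond, dict):
            ok = all(_check(op, metadata[key], operand) for op, operand in cond.items())
        else:
            ok = _check("$eq", metadata[key], cond)  # bare value is shorthand for $eq
        if not ok:
            return False
    return True


doc = {"title": "Example", "year": 2021.0, "published": True}
print(matches(doc, {"year": {"$gte": 2020}}))
print(matches(doc, {"$or": [{"year": {"$lt": 2000}}, {"published": True}]}))
```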


Examples

Using arrays of strings as metadata values or as metadata filters

A document with metadata payload…

{ "genre": ["comedy", "documentary"] }

…means that "genre" takes on both values.

Retrievals with the following filters will match the document:

{"genre":"comedy"}
{"genre": {"$in":["documentary","action"]}}
{"$and": [{"genre": "comedy"}, {"genre":"documentary"}]}

Retrievals with the following filter will not match the document:

{ "$and": [{ "genre": "comedy" }, { "genre": "drama" }] }

Retrievals with the following filters will not match the document because the filters are invalid: a list cannot be the operand of an equality test, whether implicit or written with $eq. They will result in an error during retrieval:

# INVALID QUERY:
{"genre": ["comedy", "documentary"]}

# INVALID QUERY:
{"genre": {"$eq": ["comedy", "documentary"]}}
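
Since invalid filters fail at retrieval time, it can help to catch them earlier in client code. The following Python sketch checks a filter against the operator list and operand types described in this page. It is a hypothetical helper (filter_errors is not part of any Ragie SDK), and the server may apply additional rules that are not captured here.

```python
COMPARISON_OPS = {"$eq", "$ne", "$gt", "$gte", "$lt", "$lte", "$in", "$nin"}
LOGICAL_OPS = {"$and", "$or"}


def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _operand_ok(op: str, operand) -> bool:
    if op in ("$eq", "$ne"):
        return isinstance(operand, (str, bool)) or _is_number(operand)
    if op in ("$gt", "$gte", "$lt", "$lte"):
        return _is_number(operand)
    # $in and $nin take an array of strings or numbers
    return isinstance(operand, list) and all(
        isinstance(o, str) or _is_number(o) for o in operand
    )


def filter_errors(flt, path: str = "") -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    if not isinstance(flt, dict):
        return [f"{path or '<root>'}: filter must be an object"]
    errors = []
    for key, cond in flt.items():
        where = f"{path}.{key}" if path else key
        if key in LOGICAL_OPS:
            if not isinstance(cond, list):
                errors.append(f"{where}: expected a list of filters")
            else:
                for i, sub in enumerate(cond):
                    errors.extend(filter_errors(sub, f"{where}[{i}]"))
        elif isinstance(cond, dict):
            for op, operand in cond.items():
                if op not in COMPARISON_OPS:
                    errors.append(f"{where}: unsupported operator {op}")
                elif not _operand_ok(op, operand):
                    errors.append(f"{where}: invalid operand for {op}")
        elif not _operand_ok("$eq", cond):
            errors.append(f"{where}: invalid operand for implicit $eq")
    return errors


print(filter_errors({"genre": ["comedy", "documentary"]}))
print(filter_errors({"genre": {"$in": ["comedy", "documentary"]}}))
```

The first call reports a problem because a bare list is treated as an equality test; the second is the valid way to express "any of these values".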

More example filter expressions

A comedy, documentary, or drama:

{
  "genre": { "$in": ["comedy", "documentary", "drama"] }
}

A drama from 2020 or later:

{
  "genre": { "$eq": "drama" },
  "year": { "$gte": 2020 }
}

A drama from 2020 or later (equivalent to the previous example):

{
  "$and": [{ "genre": { "$eq": "drama" } }, { "year": { "$gte": 2020 } }]
}

A drama, or any movie from 2020 or later:

{
  "$or": [{ "genre": { "$eq": "drama" } }, { "year": { "$gte": 2020 } }]
}
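
When filters are assembled in application code, small helpers keep the nesting readable. The sketch below builds the last three examples as plain Python dictionaries and serializes them to JSON. The helper names (all_of, any_of, to_explicit_and) are hypothetical and are not part of any Ragie SDK; how the resulting JSON is passed to a retrieval request is outside the scope of this sketch.

```python
import json


def all_of(*filters: dict) -> dict:
    """Combine filters so that every one of them must match."""
    return {"$and": list(filters)}


def any_of(*filters: dict) -> dict:
    """Combine filters so that at least one of them must match."""
    return {"$or": list(filters)}


def to_explicit_and(flt: dict) -> dict:
    """Rewrite a filter with several top-level keys as an explicit $and."""
    if len(flt) <= 1:
        return flt
    return {"$and": [{key: cond} for key, cond in flt.items()]}


is_drama = {"genre": {"$eq": "drama"}}
recent = {"year": {"$gte": 2020}}

implicit = {**is_drama, **recent}   # a drama from 2020 or later (implicit AND)
explicit = all_of(is_drama, recent)  # the same condition written with $and
either = any_of(is_drama, recent)    # a drama, or any movie from 2020 or later

print(json.dumps(explicit))
print(json.dumps(either))
print(to_explicit_and(implicit) == explicit)  # True
```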