Metadata & Filters
Default metadata on each document
The following metadata properties are set by default on each document. You cannot overwrite these properties. If you supply a property with one of these names, it will be ignored.
{
  "document_id": string,
  "document_type": string, // one of pdf, doc, docx, img, jpg, etc.
  "document_source": string, // one of api, google_drive
  "document_name": string, // human-readable document name given during upload, or derived if none was given
  "document_uploaded_at": int // seconds since the Unix epoch
}
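As a minimal sketch of working with these defaults on the client side, the helpers below drop reserved keys from a custom payload before upload and convert document_uploaded_at to a UTC datetime. The names RESERVED_KEYS, strip_reserved, and uploaded_at_utc are hypothetical and not part of any Ragie SDK. Ragie already ignores duplicate properties, so stripping them locally only makes that behavior explicit.

```python
from datetime import datetime, timezone

# Default property names listed above. These cannot be overwritten.
RESERVED_KEYS = {
    "document_id",
    "document_type",
    "document_source",
    "document_name",
    "document_uploaded_at",
}


def strip_reserved(custom_metadata: dict) -> dict:
    """Return a copy of custom_metadata without the reserved default keys."""
    return {k: v for k, v in custom_metadata.items() if k not in RESERVED_KEYS}


def uploaded_at_utc(document_metadata: dict) -> datetime:
    """Convert document_uploaded_at (seconds since the Unix epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(
        document_metadata["document_uploaded_at"], tz=timezone.utc
    )
```

For example, strip_reserved({"genre": "drama", "document_id": "abc"}) keeps only the "genre" key.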
Supported metadata types
You can associate a metadata payload with each document, as key-value pairs in a JSON object where keys are strings and values are one of:
- String
- Number (integer or floating point; converted to a 64-bit floating point)
- Boolean (true or false)
- List of strings
Metadata-based filtering is a pre-filter in Ragie, which means it guarantees retrieval of top_k results if they exist. The only exception is when re-ranking is turned on. Re-ranking supports at most top_k results.
There is a maximum of 1000 values per metadata object.
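The rules above can be checked locally before upload. The following is a sketch only: validate_metadata and MAX_VALUES are hypothetical names, and the way values are counted toward the 1000 limit (one per scalar, one per list element) is an assumption, since the text does not define it.

```python
MAX_VALUES = 1000  # limit stated above; the counting rule below is an assumption


def validate_metadata(metadata: dict) -> None:
    """Raise ValueError if metadata breaks the supported-type rules."""
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a JSON object")
    count = 0
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise ValueError(f"key {key!r} is not a string")
        if isinstance(value, list):
            # Lists may contain only strings.
            if not all(isinstance(item, str) for item in value):
                raise ValueError(f"{key!r}: lists may contain only strings")
            count += len(value)
        elif isinstance(value, (str, bool, int, float)):
            count += 1
        else:
            raise ValueError(f"{key!r}: unsupported type {type(value).__name__}")
    if count > MAX_VALUES:
        raise ValueError(f"{count} values exceeds the limit of {MAX_VALUES}")
```

A payload such as {"genre": ["comedy", "documentary"], "year": 2020, "released": True} passes, while a list of numbers or a nested object raises ValueError.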
Metadata query language
The following metadata operators are supported during retrieval:
- $eq - Equal to (number, string, boolean)
- $ne - Not equal to (number, string, boolean)
- $gt - Greater than (number)
- $gte - Greater than or equal to (number)
- $lt - Less than (number)
- $lte - Less than or equal to (number)
- $in - In array (string or number)
- $nin - Not in array (string or number)
Metadata filters can be combined with the $and and $or operators.
Examples
Using arrays of strings as metadata values or as metadata filters
A document with metadata payload…
{ "genre": ["comedy", "documentary"] }
…means that "genre" takes on both values.
Retrievals with the following filters will match the document:
{"genre":"comedy"}
{"genre": {"$in":["documentary","action"]}}
{"$and": [{"genre": "comedy"}, {"genre":"documentary"}]}
Retrievals with the following filter will not match the document:
{ "$and": [{ "genre": "comedy" }, { "genre": "drama" }] }
Retrievals with the following filters are invalid and will result in an error during retrieval:
# INVALID QUERY:
{"genre": ["comedy", "documentary"]}
# INVALID QUERY:
{"genre": {"$eq": ["comedy", "documentary"]}}
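To make the matching rules above concrete, here is a small local model of the filter semantics. It is an illustrative sketch, not Ragie's implementation: the function matches and its helpers are hypothetical names, and treating a missing metadata key as a non-match for every operator is an assumption the text does not confirm.

```python
from typing import Any


def _same(a: Any, b: Any) -> bool:
    """Equality that keeps booleans distinct from numbers (True != 1)."""
    return isinstance(a, bool) == isinstance(b, bool) and a == b


def _is_number(v: Any) -> bool:
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def _check(op: str, operand: Any) -> None:
    """Reject operands that the documented operators do not accept."""
    if op in ("$in", "$nin"):
        if not isinstance(operand, list):
            raise ValueError(f"{op} expects a list")
    elif op in ("$eq", "$ne"):
        if isinstance(operand, (list, dict)):
            raise ValueError(f"{op} expects a scalar value")
    elif op in ("$gt", "$gte", "$lt", "$lte"):
        if not _is_number(operand):
            raise ValueError(f"{op} expects a number")
    else:
        raise ValueError(f"unknown operator {op!r}")


def _test(values: list, op: str, operand: Any) -> bool:
    """Apply one operator to the stored values (a list field has several)."""
    if op == "$eq":
        return any(_same(v, operand) for v in values)
    if op == "$ne":
        return not any(_same(v, operand) for v in values)
    if op == "$in":
        return any(_same(v, o) for v in values for o in operand)
    if op == "$nin":
        return not any(_same(v, o) for v in values for o in operand)
    nums = [v for v in values if _is_number(v)]
    if op == "$gt":
        return any(v > operand for v in nums)
    if op == "$gte":
        return any(v >= operand for v in nums)
    if op == "$lt":
        return any(v < operand for v in nums)
    return any(v <= operand for v in nums)  # "$lte"


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if metadata satisfies flt; raise ValueError if flt is invalid."""
    if not isinstance(flt, dict):
        raise ValueError("a filter must be a JSON object")
    results = []
    for key, cond in flt.items():
        if key in ("$and", "$or"):
            if not isinstance(cond, list):
                raise ValueError(f"{key} expects a list of filters")
            parts = [matches(metadata, sub) for sub in cond]
            results.append(all(parts) if key == "$and" else any(parts))
            continue
        if isinstance(cond, list):
            raise ValueError(f"{key!r}: a bare list is not a valid filter value")
        # A bare scalar is shorthand for $eq.
        ops = cond if isinstance(cond, dict) else {"$eq": cond}
        for op, operand in ops.items():
            _check(op, operand)
        if key not in metadata:
            results.append(False)  # assumption: a missing key never matches
            continue
        stored = metadata[key]
        values = stored if isinstance(stored, list) else [stored]
        results.append(all(_test(values, op, arg) for op, arg in ops.items()))
    return all(results)
```

With doc = {"genre": ["comedy", "documentary"]}, calling matches(doc, {"genre": "comedy"}) returns True, the comedy-and-drama $and filter returns False, and both invalid queries above raise ValueError.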
More example filter expressions
A comedy, documentary, or drama:
{
  "genre": { "$in": ["comedy", "documentary", "drama"] }
}
A drama from 2020 or later:
{
  "genre": { "$eq": "drama" },
  "year": { "$gte": 2020 }
}
A drama from 2020 or later (equivalent to the previous example):
{
  "$and": [{ "genre": { "$eq": "drama" } }, { "year": { "$gte": 2020 } }]
}
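The equivalence between the two forms above can be expressed as a small rewrite. This is a sketch under the assumption that several top-level keys in one filter object are always combined with AND, as the two drama examples indicate. The helper name to_explicit_and is hypothetical.

```python
def to_explicit_and(flt: dict) -> dict:
    """Rewrite a filter with several top-level keys as an explicit $and."""
    if len(flt) <= 1:
        return dict(flt)  # nothing to combine
    return {"$and": [{key: cond} for key, cond in flt.items()]}


implicit = {"genre": {"$eq": "drama"}, "year": {"$gte": 2020}}
explicit = to_explicit_and(implicit)
# explicit == {"$and": [{"genre": {"$eq": "drama"}}, {"year": {"$gte": 2020}}]}
```

Writing the $and explicitly is mostly a readability choice. It becomes necessary when the same key appears in more than one clause, because a JSON object cannot repeat a key.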
A drama or a movie from 2020 or later:
{
  "$or": [{ "genre": { "$eq": "drama" } }, { "year": { "$gte": 2020 } }]
}