Sync Filter

A Sync Filter lets you specify which files to skip during ingestion.

For example, if you have a Google Drive folder with lots of text files that you want to ingest, plus the occasional image or PDF that you don't, you can use a Sync Filter to filter out the images and PDFs.

A Sync Filter acts on document metadata to decide whether a file should be ingested. You can add a glob or a list of globs to each metadata key.

If any glob matches for any metadata key, that file will be skipped.
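
The rule above can be sketched in a few lines of Python. This is only an illustration, not the service's implementation: the function name should_skip is hypothetical, and it assumes the globs behave like Python's fnmatch patterns (case-sensitive, with * also matching "/"), which the docs do not state.

```python
from fnmatch import fnmatchcase


def should_skip(metadata, sync_filter):
    """Return True if any glob under any filter key matches the metadata value.

    Hypothetical helper: assumes fnmatch-style, case-sensitive globs.
    """
    for key, globs in sync_filter.items():
        if key not in metadata:
            continue  # assumption: a key missing from the metadata never matches
        # A filter value may be a single glob or a list of globs.
        patterns = [globs] if isinstance(globs, str) else globs
        value = str(metadata[key])
        if any(fnmatchcase(value, pattern) for pattern in patterns):
            return True
    return False


metadata = {
    "source_name": "invoice.pdf",
    "folder": "files",
    "file_path": "/My Drive/files/invoice.pdf",
}

print(should_skip(metadata, {"file_path": "*.pdf"}))             # True
print(should_skip(metadata, {"file_path": ["*.png", "*.jpg"]}))  # False
```

Running a candidate filter through a local check like this before saving it can help catch a glob that is broader than intended.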

Examples

Suppose we have the following document metadata for a Google Drive file. See the list of metadata you can filter on here

{
    "source_type": "google_drive",
    "connector_id": "b12237e3-2cdc-4c8b-a06f-f0e8f26ff1d1",
    "created_at": "2026-04-03T22:43:58.641000+00:00",
    "source_name": "invoice.pdf",
    "folder": "files",
    "file_path": "/My Drive/files/invoice.pdf",
    "file_path_array": [
        "My Drive",
        "files",
        "invoice.pdf"
    ],
    "folder_path": "/My Drive/files",
    "created_by_email": "[email protected]",
    "created_by_name": "Andrey"
}

Filter out PDFs

To filter out all PDFs, the sync filter can be

{
    "file_path": "*.pdf"
}

This matches any file path that ends in .pdf, such as

  • invoice.pdf
  • sample.pdf

It will NOT match

  • dog.png
  • data.docx

Filter out certain names

What if you want to reject only invoice-like file names but still ingest other PDFs?

You can do

{
  "file_path": "*invoice.pdf"
}

This matches

  • invoice.pdf
  • 2025_invoice.pdf

It will NOT match

  • ivoice.pdf - this has the wrong spelling
  • invoice.doc

Filter out multiple extensions

What if you want to filter out several file extensions, such as all PNGs, JPEGs, JPGs, and PDFs?

You can do

{
    "file_path": ["*.png", "*.jpg", "*.jpeg", "*.pdf"]
}

This matches

  • dog.png
  • invoice.pdf
  • cat.jpeg
  • cat.jpg

It does NOT match

  • report.doc
  • report_2026.txt
  • report.md
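
As a rough local check of the list above, the same globs can be run through Python's fnmatch module. This assumes the Sync Filter's glob semantics are close to fnmatch's, which is not confirmed here. In particular, fnmatchcase is case-sensitive, so a name like DOG.PNG would not match "*.png" in this sketch; whether the service treats case the same way is unknown.

```python
from fnmatch import fnmatchcase

# Globs from the sync filter above.
patterns = ["*.png", "*.jpg", "*.jpeg", "*.pdf"]

# File names from the two lists above.
names = [
    "dog.png", "invoice.pdf", "cat.jpeg", "cat.jpg",
    "report.doc", "report_2026.txt", "report.md",
]

# A file is skipped as soon as any one glob in the list matches it.
skipped = [n for n in names if any(fnmatchcase(n, p) for p in patterns)]
kept = [n for n in names if n not in skipped]

print(skipped)  # ['dog.png', 'invoice.pdf', 'cat.jpeg', 'cat.jpg']
print(kept)     # ['report.doc', 'report_2026.txt', 'report.md']
```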

Skipping a folder

Let's skip the folder "old_documents" in our Google Drive.

Suppose we have this folder structure

My Drive

  • documents
    • old_documents
      • doc_1.pdf
    • invoice.pdf
    • etc

You can do

{
    "file_path": "*/old_documents/*"
}

This matches

  • My Drive/documents/old_documents/doc_1.pdf

It does NOT match

  • My Drive/documents/invoice.pdf
  • My Drive/invoices/invoice2026.pdf
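
The folder glob relies on * being able to span "/" separators, so that the leading * can cover "My Drive/documents". Python's fnmatch behaves that way, which makes it a convenient stand-in for a quick check. This is a sketch under that assumption, using paths based on the folder structure above; it is not a statement about the service's actual glob engine.

```python
from fnmatch import fnmatchcase

pattern = "*/old_documents/*"

# Paths based on the folder structure above.
paths = [
    "My Drive/documents/old_documents/doc_1.pdf",
    "My Drive/documents/invoice.pdf",
    "My Drive/invoices/invoice2026.pdf",
]

# In fnmatch, "*" matches any run of characters, including "/",
# so the leading "*" can span several folder levels.
results = {path: fnmatchcase(path, pattern) for path in paths}

for path, matched in results.items():
    print(f"{path}: {'skipped' if matched else 'ingested'}")
```

Only the first path contains the "/old_documents/" segment, so it is the only one this sketch reports as skipped.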

Multiple Keys

Suppose you want to filter out any files that are in an "old_documents" folder or were created by [email protected].

You can do

{
  "file_path": "*/old_documents/*",
  "created_by_email": "[email protected]"
}

As long as either metadata key matches, the file will be skipped.