This cookbook shows how to filter Morphik documents on typed metadata fields, including dates, decimals, booleans, arrays, and nested objects.
Prerequisites
  • Install the Morphik SDK: pip install morphik
  • Provide credentials via a Morphik URI (for example morphik://your-app:token@api.morphik.ai)
  • Basic understanding of document ingestion

1. Ingest Documents with Rich Typed Metadata

Morphik accepts typed metadata values at ingestion. The filters in the later sections rely on these types:
from datetime import date, datetime, timezone
from decimal import Decimal
from morphik import Morphik

client = Morphik("morphik://your-app:token@api.morphik.ai")

# Rich metadata with multiple types
metadata = {
    # Strings
    "region": "andes",
    "project_code": "hydro-life-2024",

    # Dates and datetimes
    "fieldwork_date": date(2024, 9, 18),
    "monitoring_window_start": datetime(2024, 9, 18, 9, 10, tzinfo=timezone.utc),
    "monitoring_window_end": datetime(2024, 9, 18, 17, 35, tzinfo=timezone.utc),

    # Numbers
    "hazard_score": 41,                    # Integer
    "ph_reading": Decimal("6.3"),          # Decimal (precise)
    "water_depth_cm": 12.4,                # Float
    "samples_collected": 18,

    # Boolean
    "is_priority_site": True,

    # Arrays
    "tags": ["wildlife", "flood-risk", "community"],

    # Nested objects
    "sensor_loadout": {
        "drone": "Skydio X10",
        "camera": "multispectral",
        "thermal_gain": 0.43,
    },
}

# Ingest document with metadata
doc = client.ingest_text(
    content="Laguna Amazonas boardwalk inspection for wetlands buffers...",
    filename="laguna-amazonas-field-brief.md",
    metadata=metadata,
    use_colpali=True,
)

# Wait for completion
doc.wait_for_completion(timeout_seconds=150)
print(f"Ingested: {doc.external_id}")

2. Build Complex Filters

Combine multiple operators to create sophisticated queries:
from datetime import date

# Complex filter with multiple conditions
filters = {
    "$and": [
        # Exact match
        {"project_code": {"$eq": "hydro-life-2024"}},

        # Field value is one of the listed options
        {"region": {"$in": ["andes"]}},

        # Date range (>= September 15, 2024)
        {"fieldwork_date": {"$gte": date(2024, 9, 15).isoformat()}},

        # Number range (<= 45)
        {"hazard_score": {"$lte": 45}},

        # Boolean match
        {"is_priority_site": True},

        # Array contains value
        {"tags": {"$contains": {"value": "wildlife"}}},

        # Decimal comparison
        {"ph_reading": {"$lte": "6.5"}},
    ]
}

Filtering by Folder Name

Documents ingested with a folder_name parameter can be filtered on folder_name like any other metadata field. This allows cross-folder queries and pattern matching:
# Filter specific folder
filters = {"folder_name": "reports"}

# Query multiple folders
filters = {
    "folder_name": {"$in": ["reports", "invoices", "contracts"]}
}

# Exclude archived folders
filters = {
    "folder_name": {"$nin": ["archived", "drafts", "test"]}
}

# Pattern matching on folder names
filters = {
    "folder_name": {"$regex": {"pattern": "^project_", "flags": "i"}}
}

# Combine folder with other metadata
filters = {
    "$and": [
        {"folder_name": {"$in": ["legal", "compliance"]}},
        {"priority": {"$gte": 70}},
        {"status": "active"},
        {"year": 2024}
    ]
}
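
To preview which folder names a $regex pattern would select before sending the filter, the pattern can be tried locally. This is only a sketch: it assumes the "i" flag means case-insensitive matching and that the server's regex dialect behaves like Python's re module for simple anchored patterns, neither of which is confirmed here. The helper folder_matches is hypothetical and not part of the Morphik SDK.

```python
import re


def folder_matches(folder_name: str, pattern: str, flags: str = "") -> bool:
    """Local approximation of a $regex filter (hypothetical helper, not an SDK call)."""
    # Assumption: "i" in the flags string requests case-insensitive matching.
    re_flags = re.IGNORECASE if "i" in flags else 0
    return re.search(pattern, folder_name, re_flags) is not None


folders = ["project_alpha", "Project_Beta", "archived", "my_project_x"]
selected = [name for name in folders if folder_matches(name, "^project_", "i")]
print(selected)  # ['project_alpha', 'Project_Beta']
```

The ^ anchor is why my_project_x is excluded: the pattern must match at the start of the folder name.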

3. List Documents with Filters

Find documents matching your criteria:
# Query documents with filters
response = client.list_documents(
    filters=filters,
    include_total_count=True,
    completed_only=True
)

print(f"\nFound {response.total_count} matching documents:")
for doc in response.documents:
    print(f"- {doc.filename}")
    print(f"  Hazard Score: {doc.metadata.get('hazard_score')}")
    print(f"  Tags: {doc.metadata.get('tags')}")

4. Retrieve Chunks with Filters

Get document chunks that match your metadata filters:
# Retrieve filtered chunks
chunks = client.retrieve_chunks(
    query="Summarize wildlife or flood risks that impact the wetlands buffer program",
    filters=filters,
    k=4,
    padding=1,
    use_colpali=True,
)

print(f"\nRetrieved {len(chunks)} filtered chunks:")
for chunk in chunks:
    print(f"\nChunk {chunk.chunk_number} from {chunk.filename} (score={chunk.score:.3f})")
    print(f"Content preview: {chunk.content[:200]}...")
    print(f"Metadata: {chunk.metadata}")
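
When several chunks come from the same file, it can help to keep only the highest-scoring chunk per file. The sketch below uses a stand-in ChunkStub dataclass so it runs without a Morphik connection; it only relies on the filename, chunk_number, score, and content attributes printed above. Both ChunkStub and best_chunk_per_file are illustrative names, not SDK types.

```python
from dataclasses import dataclass


@dataclass
class ChunkStub:
    """Stand-in for a retrieved chunk, carrying only the attributes used above."""
    filename: str
    chunk_number: int
    score: float
    content: str


def best_chunk_per_file(chunks):
    """Return a dict mapping each filename to its highest-scoring chunk."""
    best = {}
    for chunk in chunks:
        current = best.get(chunk.filename)
        if current is None or chunk.score > current.score:
            best[chunk.filename] = chunk
    return best


example_chunks = [
    ChunkStub("laguna-amazonas-field-brief.md", 0, 0.71, "Boardwalk inspection..."),
    ChunkStub("laguna-amazonas-field-brief.md", 1, 0.84, "Flood-risk notes..."),
    ChunkStub("other-site.md", 0, 0.52, "Unrelated site..."),
]
top = best_chunk_per_file(example_chunks)
for filename, chunk in top.items():
    print(f"{filename}: chunk {chunk.chunk_number} (score={chunk.score:.3f})")
```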

Supported Filter Operators

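| Operator | Description | Example |
| --- | --- | --- |
| $eq | Exact match | {"status": {"$eq": "active"}} |
| $in | Value in array | {"region": {"$in": ["andes", "altiplano"]}} |
| $nin | Value not in array | {"folder_name": {"$nin": ["archived", "drafts", "test"]}} |
| $gte | Greater than or equal | {"date": {"$gte": "2024-01-01"}} |
| $lte | Less than or equal | {"score": {"$lte": 45}} |
| $gt | Greater than | {"temperature": {"$gt": 0}} |
| $lt | Less than | {"count": {"$lt": 100}} |
| $contains | Array contains value | {"tags": {"$contains": {"value": "urgent"}}} |
| $regex | Pattern match with optional flags | {"folder_name": {"$regex": {"pattern": "^project_", "flags": "i"}}} |
| $and | All conditions must match | {"$and": [condition1, condition2]} |
| $or | Any condition must match | {"$or": [condition1, condition2]} |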
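
For unit-testing filter logic without a server, the operator semantics in the table can be approximated in plain Python. The sketch below is not Morphik's implementation. It assumes that a bare value means equality, that several operators on one field must all hold, and that dates are stored as ISO strings (which compare correctly as text when they share one format). The function name matches is hypothetical.

```python
from collections.abc import Mapping
from typing import Any

# Each operator mapped to a plain Python predicate (local approximation).
_COMPARATORS = {
    "$eq": lambda actual, expected: actual == expected,
    "$gt": lambda actual, expected: actual > expected,
    "$gte": lambda actual, expected: actual >= expected,
    "$lt": lambda actual, expected: actual < expected,
    "$lte": lambda actual, expected: actual <= expected,
    "$in": lambda actual, expected: actual in expected,
    "$contains": lambda actual, expected: expected["value"] in actual,
}


def matches(metadata: Mapping[str, Any], filters: Mapping[str, Any]) -> bool:
    """Return True if metadata satisfies filters under the assumptions above."""
    for key, condition in filters.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif key not in metadata:
            return False
        elif isinstance(condition, Mapping):
            # Operator form such as {"$lte": 45}; every operator must hold.
            actual = metadata[key]
            if not all(_COMPARATORS[op](actual, exp) for op, exp in condition.items()):
                return False
        elif metadata[key] != condition:
            # Bare value such as {"is_priority_site": True} is treated as equality.
            return False
    return True


sample = {
    "region": "andes",
    "fieldwork_date": "2024-09-18",
    "hazard_score": 41,
    "is_priority_site": True,
    "tags": ["wildlife", "flood-risk"],
}
passing = {"$and": [
    {"region": {"$in": ["andes", "altiplano"]}},
    {"fieldwork_date": {"$gte": "2024-09-15"}},
    {"hazard_score": {"$lte": 45}},
    {"is_priority_site": True},
    {"tags": {"$contains": {"value": "wildlife"}}},
]}
failing = {"$or": [{"hazard_score": {"$lt": 40}}, {"region": "altiplano"}]}
print(matches(sample, passing))  # True
print(matches(sample, failing))  # False
```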

Use Cases

Complex metadata filtering is ideal for:
  • Document management systems with multi-dimensional categorization
  • Compliance and audit systems requiring date-based queries
  • Scientific data repositories with measurements and precise numerical filtering
  • Multi-tenant applications with scope-based isolation
  • Time-series document collections with date range queries
  • Hierarchical data with nested metadata structures

Best Practices

1. Use Appropriate Types

Use the correct Python types for metadata:
# ✅ Correct
metadata = {
    "date": date(2024, 9, 15),        # Use date objects
    "price": Decimal("19.99"),        # Use Decimal for precision
    "is_active": True,                # Use bool for flags
}

# ❌ Avoid
metadata = {
    "date": "2024-09-15",            # String instead of date
    "price": 19.99,                  # Float loses precision
    "is_active": "true",             # String instead of bool
}
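
The precision point is standard Python behaviour and can be checked with the standard library alone. A Decimal built from a string keeps the written value, while a float (or a Decimal built from a float) carries the binary approximation:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Decimals built from strings keep the exact written value.
decimal_total = Decimal("0.1") + Decimal("0.2")
print(decimal_total == Decimal("0.3"))  # True

# Building a Decimal from a float preserves the float's rounding error,
# so pass a string instead.
print(Decimal(19.99) == Decimal("19.99"))  # False
```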

2. Convert Dates for Filtering

Convert date objects to ISO 8601 strings with .isoformat() when building filters:
# ✅ Correct
{"fieldwork_date": {"$gte": date(2024, 9, 15).isoformat()}}

# ❌ Wrong
{"fieldwork_date": {"$gte": date(2024, 9, 15)}}  # Date object won't work
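
To avoid calling .isoformat() by hand on every condition, a filter can be built with native Python values and normalized once before use. The helper below is a sketch with a hypothetical name, to_filter_value. It converts dates and datetimes to ISO strings and, following the string form used for the ph_reading comparison earlier, converts Decimal values to strings. Whether the SDK already performs some of this conversion itself is not confirmed here.

```python
from datetime import date, datetime, timezone
from decimal import Decimal


def to_filter_value(value):
    """Recursively convert dates and Decimals inside a filter to strings."""
    # datetime is a subclass of date, so one check covers both types.
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, Decimal):
        return str(value)
    if isinstance(value, dict):
        return {key: to_filter_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_filter_value(item) for item in value]
    return value


raw_filters = {
    "$and": [
        {"fieldwork_date": {"$gte": date(2024, 9, 15)}},
        {"monitoring_window_start": {"$gte": datetime(2024, 9, 18, 9, 10, tzinfo=timezone.utc)}},
        {"ph_reading": {"$lte": Decimal("6.5")}},
        {"hazard_score": {"$lte": 45}},
    ]
}
clean_filters = to_filter_value(raw_filters)
print(clean_filters["$and"][0])  # {'fieldwork_date': {'$gte': '2024-09-15'}}
```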

3. Combine Operators Strategically

  • Use $and for required conditions that must all match
  • Use $in when a field can have multiple possible values
  • Use range operators ($gte, $lte) for numerical and date filtering
  • Use $contains for array membership checks
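
The guidelines above can be sketched with two small builder functions, all_of and any_of (hypothetical names, not SDK functions). The example nests an $or group inside an $and group; the operator table lists both operators, but nesting one inside the other is an assumption not shown elsewhere in this cookbook.

```python
def all_of(*conditions):
    """Wrap conditions so that every one of them must match."""
    return {"$and": list(conditions)}


def any_of(*conditions):
    """Wrap conditions so that at least one of them must match."""
    return {"$or": list(conditions)}


# Required: the project code and the wildlife tag.
# Flexible: either a priority site or a hazard score of 40 or more.
filters = all_of(
    {"project_code": "hydro-life-2024"},
    any_of(
        {"is_priority_site": True},
        {"hazard_score": {"$gte": 40}},
    ),
    {"tags": {"$contains": {"value": "wildlife"}}},
)
print(filters)
```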

4. Index Important Fields

Frequently filtered fields benefit from proper indexing. Consider performance when adding many metadata fields.

Running the Example

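# Set your Morphik URI (the script must read MORPHIK_URI from the environment
# instead of hardcoding the connection string)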
export MORPHIK_URI="morphik://your-app:your-token@api.morphik.ai"

# Run your Python script with the code above
python your_script.py
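
The ingestion example earlier passes the URI as a literal string, so exporting MORPHIK_URI only has an effect if the script reads it. One way to do that is sketched below. The helper load_morphik_uri is hypothetical, and the morphik:// prefix check is based only on the URI format shown in this cookbook.

```python
import os


def load_morphik_uri(env=None):
    """Read the connection string from MORPHIK_URI, failing early if it is missing."""
    env = os.environ if env is None else env
    uri = env.get("MORPHIK_URI", "")
    if not uri.startswith("morphik://"):
        raise RuntimeError("Set MORPHIK_URI to a morphik:// connection string")
    return uri


# In your script, the result would replace the hardcoded string:
#   from morphik import Morphik
#   client = Morphik(load_morphik_uri())
example_uri = load_morphik_uri({"MORPHIK_URI": "morphik://your-app:your-token@api.morphik.ai"})
print(example_uri)
```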