Morphik comes with built-in support for running both embeddings and completions locally, ensuring your data never leaves your machine. Choose between two powerful local inference engines:
  • Lemonade - Windows-only, optimized for AMD GPUs and NPUs
  • Ollama - Cross-platform (Windows, macOS, Linux), supports various hardware
Both are pre-configured in Morphik and can be selected through the UI or configuration file.

Why Local Inference?

Running models locally provides several key advantages:
  • Complete Privacy: Your data never leaves your machine
  • No API Costs: Eliminate ongoing API expenses
  • Low Latency: No network round-trips for inference
  • Offline Capability: Work without internet connectivity
  • Hardware Acceleration: Leverage your local GPU, NPU, or specialized AI processors

Lemonade

Run embeddings & completions locally with AMD GPU/NPU acceleration

Lemonade SDK provides high-performance local inference on Windows, with optimizations for AMD hardware. It exposes an OpenAI-compatible API and is already configured in Morphik.
Built-in Support: Lemonade models are pre-configured in morphik.toml for both embeddings and completions. Simply install Lemonade Server and select the models in the UI.
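Because the API is OpenAI-compatible, any OpenAI-style HTTP client can talk to Lemonade directly. As a minimal sketch, assuming Lemonade is listening on port 8020 (the port used in the configuration below) and substituting a model name reported by /api/v1/models:

# Hypothetical model name: replace with one returned by /api/v1/models
curl http://localhost:8020/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR-MODEL", "messages": [{"role": "user", "content": "Hello"}]}'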

System Requirements

  • Windows 10/11 only (x86/x64)
  • 8GB+ RAM (16GB recommended)
  • Python 3.10+
  • Optional but recommended:
    • AMD Ryzen AI 300 series (NPU acceleration)
    • AMD Radeon 7000/9000 series (GPU acceleration)

Quick Start

1. Download Lemonade

Download and install Lemonade from the official site: lemonade-server.ai.
2. Start Lemonade Server

Start the Lemonade server following its documentation. Make sure it is running and note the port (examples here assume 8020). The API is OpenAI-compatible (e.g., /api/v1/models); you can verify it as shown below.
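To confirm the server is reachable, list the available models:

curl http://localhost:8020/api/v1/models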
3. Configure Morphik - Two Options

Option 1: Configure via the UI

  1. Open the Morphik UI and go to Settings → API Keys
  2. Select "Lemonade" (🍋). No API key is required
  3. Enter the host and port where Lemonade is running
  4. Open Chat and use the model selector pill (top left) to pick a Lemonade model
Running inside Docker? Use host.docker.internal instead of localhost for the host field.
If you are not using a vision-capable model, turn off ColPali in chat settings (Settings → ColPali) to avoid vision-dependent paths.
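For example, with Morphik in Docker and Lemonade on the host at port 8020, the host field (or an api_base in morphik.toml) becomes:

http://host.docker.internal:8020/api/v1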

Option 2: Edit morphik.toml

You can also set Lemonade models directly in morphik.toml so they're used by default. Register them under [registered_models] (TOML inline tables must stay on a single line) and ensure the api_base points to your Lemonade server:

[registered_models]
lemonade_qwen = { model_name = "openai/Qwen2.5-VL-7B-Instruct-GGUF", api_base = "http://localhost:8020/api/v1", vision = true }
lemonade_embedding = { model_name = "openai/nomic-embed-text-v1-GGUF", api_base = "http://localhost:8020/api/v1" }

[completion]
model = "lemonade_qwen"

[embedding]
model = "lemonade_embedding"
If your system has under 16GB RAM, prefer models under ~4B parameters or smaller quantizations (e.g., Q4/Q5). Larger models may fail to load or will be very slow on low-memory systems.
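To sanity-check the embedding model outside Morphik, you can call the OpenAI-style embeddings route at the same api_base. A sketch, assuming your Lemonade build exposes /api/v1/embeddings and that the model name matches one listed by /api/v1/models:

# Hypothetical model name: replace with one returned by /api/v1/models
curl http://localhost:8020/api/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR-EMBEDDING-MODEL", "input": "hello world"}'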

Performance Tips

  • Model Quantization: Use GGUF quantized models for better performance
  • Low-memory systems: Under 16GB RAM, prefer models under 4B parameters
  • Hardware Acceleration: Lemonade automatically detects and uses AMD GPUs/NPUs when available
  • Memory Management: Models are cached after first download

Troubleshooting

Connection issues:
  • Verify server health: curl http://localhost:8020/health
  • List models: curl http://localhost:8020/api/v1/models
  • For Docker: use host.docker.internal instead of localhost
  • Check firewall settings for port 8020

Model loading issues:
  • Ensure sufficient disk space (5-15GB per model)
  • Try smaller quantized versions (Q4, Q5)
  • Check model compatibility with lemonade list

Performance issues:
  • Use GGUF quantized models for better performance
  • Monitor GPU/NPU usage with system tools
  • Adjust batch size and context length in model config