AI · April 25, 2026 · 8 min read

From Tickets to Tokens: Preparing Zendesk Data for RAG Pipelines


If you are building a custom LLM support bot, your Zendesk history is your most valuable training set. However, most teams realize too late that feeding raw JSON from the Zendesk API into a model is a textbook case of "Garbage In, Garbage Out."

If you're extracting Zendesk data for AI workflows specifically because your support tool is changing, see our migration field guide.

Why Your Vector DB Hates Raw JSON

Zendesk ticket data is noisy. A single JSON object for a ticket contains system headers, automated trigger notifications, and CSS-heavy HTML signatures. If you feed this directly into an embedding model (like text-embedding-3-small), you waste tokens on noise and dilute the signal of the actual resolution.

The "Relational JSONL" Standard

To build a high-fidelity RAG (Retrieval-Augmented Generation) pipeline, your data needs three things:

1. Thread Flattening

You need to convert nested ticket events into a clean, chronological dialogue.

2. Metadata Enrichment

Every "chunk" of text must be tagged with metadata so your bot understands the context of the advice it's giving: who wrote it, what kind of account they came from, and how critical the issue was.

3. PII Sanitization

Sending customer passwords, card numbers, or health records to a third-party LLM can put you in breach of your SOC 2 commitments or of regulations such as HIPAA.
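
The first two requirements can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes each comment carries `author_id`, `body`, `public`, and `created_at` fields and that the ticket carries `requester_id`, and it treats any author other than the requester as an agent. The function names `flatten_thread` and `to_record` are hypothetical.

```python
def flatten_thread(ticket, comments):
    """Turn a ticket's comments into a chronological list of dialogue turns."""
    turns = []
    # Assumes uniform ISO 8601 UTC timestamps ("2026-01-05T10:00:00Z"),
    # which sort correctly as plain strings.
    for comment in sorted(comments, key=lambda c: c["created_at"]):
        if not comment.get("public", True):
            continue  # internal notes are not part of the customer dialogue
        # Assumption: anyone who is not the requester is treated as an agent.
        role = "customer" if comment["author_id"] == ticket["requester_id"] else "agent"
        text = " ".join(comment["body"].split())  # collapse stray whitespace
        if text:
            turns.append({"role": role, "content": text, "created_at": comment["created_at"]})
    return turns


def to_record(ticket, comments):
    """Attach ticket-level metadata to the flattened conversation."""
    return {
        "ticket_id": ticket["id"],
        "organization_id": ticket.get("organization_id"),
        "priority": ticket.get("priority"),
        "conversation": flatten_thread(ticket, comments),
    }
```

Keeping the metadata on the same record as the conversation means every downstream chunk can inherit it without a second lookup.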

Exporting Zendesk Data for OpenAI Fine-tuning

OpenAI's fine-tuning API expects JSONL where each line is a {"messages": [...]} object in chat format. Raw Zendesk ticket JSON does not map to this directly; you need thread flattening first. Getting the raw data out reliably means contending with the quirks of the Zendesk Incremental Export API: rate limits, ghost pages, and cursor pagination edge cases that silently corrupt your output if not handled correctly.

Each Zendesk ticket becomes a training example where the conversation thread is the messages array: customer turns become {"role": "user", "content": "..."} and agent replies become {"role": "assistant", "content": "..."}. Internal notes are excluded. The ticket subject becomes the first system message.

Two things break this if you skip preprocessing: (1) HTML signatures in agent replies inject thousands of tokens of noise per example, (2) automated trigger responses (SLA notifications, autoresponders) appear as "assistant" turns and teach the model to produce robotic non-answers. Strip both before export.
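
As a rough sketch of that mapping, the snippet below converts already-flattened turns into one chat-format training example. It assumes each turn is a dict with `role` ("customer" or "agent") and `content`, and that automated replies have been marked upstream with a hypothetical `automated` flag. The tag stripper is deliberately crude: it removes markup only, so detecting and dropping the signature text itself would need additional logic.

```python
import json
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Crude markup removal: drop tags and collapse whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def to_training_example(subject, turns):
    """Build one {"messages": [...]} example from a flattened ticket."""
    messages = [{"role": "system", "content": subject}]
    for turn in turns:
        if turn.get("automated"):  # hypothetical flag set during preprocessing
            continue
        role = "user" if turn["role"] == "customer" else "assistant"
        content = strip_html(turn["content"])
        if content:
            messages.append({"role": role, "content": content})
    return {"messages": messages}


def to_jsonl(examples):
    """Serialize examples one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
```

Each call to `to_training_example` yields one line of the final file, so a ticket with only automated replies ends up as a system message with no assistant turn and is worth filtering out before upload.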

PII is the other hard constraint. The OpenAI fine-tuning pipeline sends data to OpenAI's servers. Customer email addresses, phone numbers, and account identifiers in ticket bodies need to be redacted or pseudonymized before the file is assembled.

Zendesk + Claude RAG: Schema and Setup

Claude's context window is large enough to process multiple tickets in a single prompt, which changes the RAG retrieval strategy. Rather than retrieving one chunk per query, you can retrieve 3–5 related tickets and pass them together — Claude will synthesize across them rather than pattern-match to a single example.

The optimal schema for Claude RAG has each document as a self-contained ticket object:

{
  "ticket_id": 8821,
  "subject": "Webhook not firing on ticket update",
  "channel": "api",
  "organization": "Acme Corp",
  "priority": "high",
  "conversation": [
    {"role": "user", "content": "..."},
    {"role": "agent", "content": "..."}
  ],
  "resolution_time_hours": 4,
  "tags": ["webhooks", "api-v2"]
}

The organization, priority, and tags fields give Claude enough context to calibrate its answer without needing a separate metadata lookup. Embed the flattened conversation text, store the full JSON object as the payload alongside the vector, retrieve by embedding similarity, and pass the full object in the prompt.
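
To make the multi-ticket approach concrete, here is a small sketch of the prompt assembly step. It assumes retrieval has already returned a list of ticket objects in the schema above; the function name `build_prompt`, the `<ticket>` wrapper, and the instruction wording are illustrative choices, and the actual model call is left out.

```python
import json


def build_prompt(question, tickets, max_tickets=5):
    """Wrap each retrieved ticket object in a tagged block and append the question."""
    blocks = []
    for ticket in tickets[:max_tickets]:
        ticket_id = ticket["ticket_id"]
        body = json.dumps(ticket, indent=2)
        blocks.append(f'<ticket id="{ticket_id}">\n{body}\n</ticket>')
    context = "\n\n".join(blocks)
    return (
        "Use the support tickets below to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
```

Passing whole ticket objects rather than isolated chunks keeps the priority, organization, and tags next to the conversation they describe.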

Zendesk to LangChain: A Practical Pipeline

LangChain's retrievers and vector stores work with Document objects, each with page_content (string) and metadata (dict). Zendesk tickets map cleanly to this: the flattened conversation thread is page_content, and everything else goes into metadata.

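import json

from langchain_core.documents import Document

# Assumes a preprocessed export with one ticket per line, a flattened
# "conversation_text" field, and custom fields already mapped to names.
# The file name is a placeholder.
with open("zendesk_tickets.jsonl", encoding="utf-8") as f:
    zendesk_jsonl = [json.loads(line) for line in f if line.strip()]

docs = [
    Document(
        # The flattened thread is the text that gets embedded.
        page_content=ticket["conversation_text"],
        # Everything else is kept as filterable metadata.
        metadata={
            "ticket_id": ticket["id"],
            "category": ticket["custom_fields"]["issue_category"],
            "resolved": ticket["status"] == "solved",
            "organization_id": ticket["organization_id"],
        },
    )
    for ticket in zendesk_jsonl
]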

The metadata dict is what enables filtered retrieval — querying only resolved tickets, or only tickets from a specific customer segment, without embedding those filters into the vector search itself. This is only possible if your export preserves custom field values as named columns rather than opaque IDs. A raw Zendesk export gives you custom_field_8321: "premium" — you need the mapped name (customer_segment: "premium") to write a useful metadata filter. For the underlying Postgres schema with foreign keys preserved, see the export format reference.
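
One way to do that mapping yourself is sketched below. It assumes the ticket carries custom fields as a list of id/value pairs and that you maintain a lookup from field ID to a readable name. The IDs and names in `FIELD_NAMES` are hypothetical apart from the 8321 example above.

```python
# Hypothetical lookup from Zendesk custom field ID to a readable name.
FIELD_NAMES = {
    8321: "customer_segment",
    9410: "issue_category",
}


def name_custom_fields(ticket, field_names=FIELD_NAMES):
    """Return {readable_name: value} for the custom fields we know how to name."""
    named = {}
    for field in ticket.get("custom_fields", []):
        name = field_names.get(field["id"])
        # Skip unmapped fields and fields the agent never filled in.
        if name is not None and field["value"] is not None:
            named[name] = field["value"]
    return named
```

The resulting dict can be merged straight into a document's metadata, which is what makes a filter such as customer_segment equal to "premium" possible.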

Zendesk to Vector Databases: Pinecone, Weaviate, pgvector

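The embedding model you choose has a large effect on retrieval quality. For support conversation data, text-embedding-3-small (OpenAI) and voyage-3 (Voyage AI) are reasonable starting points: both cope with the mix of technical jargon and informal customer language typical of tickets, though it is worth benchmarking candidates on a sample of your own queries. Embed each flattened ticket as a single chunk where it fits within the model's input limit; if a thread is too long, split on turn boundaries rather than mid-message.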
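
For the occasional thread that is too long to embed in one piece, a simple way to avoid splitting mid-conversation is to break only between turns. The sketch below uses a character budget as a rough stand-in for a token limit. The `max_chars` default is an arbitrary assumption, and a real pipeline would count tokens with the tokenizer of its chosen embedding model.

```python
def render_turn(turn):
    """Render one dialogue turn as a single line of text."""
    return f"{turn['role']}: {turn['content']}"


def chunk_by_turns(turns, max_chars=8000):
    """Group turns into chunks under max_chars without ever splitting a turn."""
    chunks, current, size = [], [], 0
    for turn in turns:
        line = render_turn(turn)
        # Start a new chunk if adding this turn would exceed the budget.
        if current and size + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 for the newline separator
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Most tickets come back as a single chunk. A turn that alone exceeds the budget is kept whole rather than cut, so it still needs separate handling.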

At small to medium scale (under 500K tickets), pgvector is the pragmatic choice — it runs on your existing Postgres instance and requires zero new infrastructure. For larger corpora, Pinecone offers managed scaling with metadata filtering built in, while Weaviate gives you multi-tenancy and hybrid search (BM25 + vector) out of the box.

The schema shape is consistent across all three: a primary key, the embedding vector, the raw text payload, and a metadata field (a JSONB column in pgvector) holding ticket_id, created_at, resolution_status, and any custom field values you need for filtered retrieval. Evicta's JSONL output is structured for exactly this: every ticket is a flattened conversation with PII pre-flagged.
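
The record shape and the filter-then-rank pattern can be illustrated without any database at all. The in-memory sketch below is only a stand-in for what pgvector, Pinecone, or Weaviate do at scale. The class and function names are hypothetical, and the brute-force cosine ranking is for illustration rather than production use.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TicketRecord:
    ticket_id: int            # primary key
    embedding: list           # vector from your embedding model
    text: str                 # flattened conversation payload
    metadata: dict = field(default_factory=dict)  # filterable fields


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_vec, filters=None, k=3):
    """Apply exact-match metadata filters, then rank the survivors by similarity."""
    filters = filters or {}
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(r.embedding, query_vec), reverse=True)
    return ranked[:k]
```

Filtering on metadata before ranking is the same idea a managed store applies: a filter such as resolution_status equal to "solved" narrows the candidate set without touching the vectors.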

AI-Native Pre-Processing

Evicta's Premium tier was designed specifically for AI teams. We don't just "dump" data; we architect it for the token window. By delivering data in Relational JSONL, we ensure that every conversation is linked to the user who wrote it and the organization it belongs to. All PII sanitization happens in a zero-persistence environment, so nothing touches our disks.

The Bottom Line

The result is a vector database that actually understands your support history, not one that just parrots back raw API responses.


Frequently Asked Questions

Why is raw Zendesk JSON bad for RAG pipelines?

Raw Zendesk ticket JSON contains system headers, automated trigger notifications, and CSS-heavy HTML signatures. Feeding this directly into an embedding model (like text-embedding-3-small) dilutes the signal of actual resolutions, which is classic Garbage In, Garbage Out.

What is thread flattening for Zendesk data?

Thread flattening converts Zendesk's nested ticket events into a clean, chronological conversation structure. This lets your vector database ingest coherent dialogue instead of raw, unordered API blobs.

How do you handle PII when preparing Zendesk data for AI training?

Customer passwords, credit card numbers, and personal data must be sanitized before being fed into any third-party LLM. Skipping this step risks breaching your SOC 2 commitments or regulations such as HIPAA. Evicta's Premium tier includes PII sanitization as part of its AI-ready JSONL output.

How do I connect Zendesk data to OpenAI for fine-tuning?

OpenAI's fine-tuning API expects JSONL where each line is a {"messages": [...]} object in chat format. The conversion from Zendesk data requires three preprocessing steps: thread flattening (converting nested comments into a sequential conversation), HTML signature stripping (to avoid wasting tokens on noise), and PII redaction (since fine-tuning data goes to OpenAI's servers). Once preprocessed, each Zendesk ticket maps to one training example.

What is the best vector database for Zendesk support data?

At small to medium scale (under 500K tickets), pgvector is the pragmatic choice: it runs on your existing Postgres instance with zero new infrastructure. For larger corpora, Pinecone offers managed scaling with built-in metadata filtering, while Weaviate provides hybrid search (BM25 + vector) and multi-tenancy. For support conversation data specifically, text-embedding-3-small (OpenAI) and voyage-3 (Voyage AI) are solid starting points for the embedding model.
