If you are building a custom LLM support bot, your Zendesk history is your most valuable training set. However, most teams realize too late that feeding raw JSON from the Zendesk API straight into a model is a textbook case of "Garbage In, Garbage Out."
If you're extracting Zendesk data for AI workflows specifically because your support tool is changing, see our migration field guide.
Why Your Vector DB Hates Raw JSON
Zendesk ticket data is noisy. A single JSON object for a ticket contains system headers, automated trigger notifications, and CSS-heavy HTML signatures. If you feed this directly into an embedding model (like text-embedding-3-small), you waste tokens on noise and dilute the signal of the actual resolution.
The "Relational JSONL" Standard
To build a high-fidelity RAG (Retrieval-Augmented Generation) pipeline, your data needs three things:
1. Thread Flattening
You need to convert nested ticket events into a clean, chronological dialogue.
2. Metadata Enrichment
Every chunk of text must be tagged with metadata so your bot understands the context of the advice it's giving: who wrote it, what kind of account they came from, and how critical the issue was.
3. PII Sanitization
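Customer passwords, card numbers, and other personal data must not reach a third-party LLM. Leaking them can put your SOC 2 attestation at risk and breach obligations under PCI DSS (card data) or HIPAA (protected health information).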
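The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production sanitizer: the ticket shape (`comments`, `author_role`, `public`, `created_at`) is a hypothetical simplified structure, and the two regexes only catch obvious emails and card-like digit runs.

```python
import re

# Hypothetical, deliberately simple patterns; real PII detection needs more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def sanitize(text):
    """Replace obvious emails and card-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub("[CARD]", text)


def flatten_thread(comments):
    """Sort public comments by timestamp and render one line per turn."""
    public = [c for c in comments if c.get("public", True)]
    public.sort(key=lambda c: c["created_at"])  # ISO 8601 strings sort in time order
    return "\n".join(
        f'{c["author_role"]}: {sanitize(c["body"])}' for c in public
    )


def to_record(ticket):
    """Combine the flattened, sanitized thread with contextual metadata."""
    return {
        "text": flatten_thread(ticket["comments"]),
        "metadata": {
            "ticket_id": ticket["id"],
            "organization": ticket.get("organization"),
            "priority": ticket.get("priority"),
        },
    }
```

Each record produced this way carries one chronological dialogue plus the metadata needed for filtering later, which is the shape the rest of this guide assumes.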
Exporting Zendesk Data for OpenAI Fine-tuning
OpenAI's fine-tuning API expects JSONL where each line is a {"messages": [...]} object in chat format. Raw Zendesk ticket JSON does not map to this directly — you need thread flattening first. Getting the raw data out reliably means contending with the Zendesk Incremental Export API quirks — rate limits, ghost pages, and cursor pagination edge cases that silently corrupt your output if not handled correctly.
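A rough sketch of a defensive export loop is shown below. It assumes the cursor-based incremental ticket export, whose responses carry `tickets`, `after_cursor`, and `end_of_stream`. The `fetch_page` callable is a hypothetical stand-in for your own HTTP client, so the loop can be exercised without network access.

```python
import time


def export_tickets(fetch_page, max_retries=5):
    """Yield each ticket once, following cursors until end_of_stream.

    fetch_page(cursor) is a hypothetical callable returning
    (status_code, headers, body_dict) for one page of the export.
    """
    seen = set()
    cursor = None
    retries = 0
    while True:
        status, headers, body = fetch_page(cursor)
        if status == 429:
            # Rate limited: wait as instructed, then retry the same cursor.
            retries += 1
            if retries > max_retries:
                raise RuntimeError("rate limit retries exhausted")
            time.sleep(float(headers.get("Retry-After", 1)))
            continue
        if status != 200:
            raise RuntimeError(f"unexpected status {status}")
        retries = 0
        for ticket in body.get("tickets", []):
            # Keeps the first copy seen; a real pipeline may prefer the newest.
            if ticket["id"] not in seen:
                seen.add(ticket["id"])
                yield ticket
        if body.get("end_of_stream") or not body.get("after_cursor"):
            break
        cursor = body["after_cursor"]
```

Failing loudly on unexpected statuses, rather than skipping the page, is the design choice that keeps gaps from going unnoticed in the output.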
Each Zendesk ticket becomes a training example where the conversation thread is the messages array: customer turns become {"role": "user", "content": "..."} and agent replies become {"role": "assistant", "content": "..."}. Internal notes are excluded. The ticket subject becomes the first system message.
Two things break this if you skip preprocessing: (1) HTML signatures in agent replies inject thousands of tokens of noise per example, (2) automated trigger responses (SLA notifications, autoresponders) appear as "assistant" turns and teach the model to produce robotic non-answers. Strip both before export.
PII is the other hard constraint. The OpenAI fine-tuning pipeline sends data to OpenAI's servers. Customer email addresses, phone numbers, and account identifiers in ticket bodies need to be redacted or pseudonymized before the file is assembled.
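The mapping described in this section might look roughly like the sketch below. The comment fields (`public`, `author_role`, `automated`) and the signature marker (a line containing only `--`) are assumptions for illustration. PII redaction is not shown here and would need to run on each body before the file is written.

```python
import html
import json
import re

TAG_RE = re.compile(r"<[^>]+>")
# Assumption: signatures start after a line containing only "--".
SIGNATURE_RE = re.compile(r"\n--\s*\n.*", re.DOTALL)


def clean_body(body):
    """Strip HTML tags, unescape entities, and drop a trailing signature."""
    text = html.unescape(TAG_RE.sub("", body))
    return SIGNATURE_RE.sub("", text).strip()


def to_training_example(ticket):
    """Map one ticket to a {"messages": [...]} dict, or None if unusable."""
    messages = [{"role": "system", "content": ticket["subject"]}]
    for c in ticket["comments"]:
        # Skip internal notes and automated (trigger/autoresponder) comments.
        if not c["public"] or c.get("automated"):
            continue
        role = "assistant" if c["author_role"] == "agent" else "user"
        messages.append({"role": role, "content": clean_body(c["body"])})
    has_reply = any(m["role"] == "assistant" for m in messages)
    return {"messages": messages} if has_reply else None


def write_jsonl(tickets, path):
    """Write one JSON object per line, skipping tickets with no agent reply."""
    with open(path, "w", encoding="utf-8") as f:
        for ticket in tickets:
            example = to_training_example(ticket)
            if example is not None:
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Tickets without a human agent reply are dropped because they give the model nothing to learn from on the assistant side.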
Zendesk + Claude RAG: Schema and Setup
Claude's context window is large enough to process multiple tickets in a single prompt, which changes the RAG retrieval strategy. Rather than retrieving one chunk per query, you can retrieve 3–5 related tickets and pass them together — Claude will synthesize across them rather than pattern-match to a single example.
The optimal schema for Claude RAG has each document as a self-contained ticket object:
    {
      "ticket_id": 8821,
      "subject": "Webhook not firing on ticket update",
      "channel": "api",
      "organization": "Acme Corp",
      "priority": "high",
      "conversation": [
        {"role": "user", "content": "..."},
        {"role": "agent", "content": "..."}
      ],
      "resolution_time_hours": 4,
      "tags": ["webhooks", "api-v2"]
    }
The organization, priority, and tags fields give Claude enough context to calibrate its answer without needing a separate metadata lookup. Embed the flattened conversation text and store the full JSON object as the payload: retrieve by embedding similarity, then pass the full object in the prompt.
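One way to pass several retrieved tickets together is sketched below. The `build_prompt` helper, the `<ticket>` wrapper tags, and the instruction wording are illustrative assumptions, not a required format. The resulting string would be sent as the user message through whichever Claude client you use.

```python
import json


def build_prompt(question, retrieved_tickets, max_tickets=5):
    """Assemble a prompt that passes several full ticket objects as context."""
    blocks = []
    for i, ticket in enumerate(retrieved_tickets[:max_tickets], start=1):
        blocks.append(
            f'<ticket index="{i}">\n{json.dumps(ticket, indent=2)}\n</ticket>'
        )
    context = "\n".join(blocks)
    return (
        "Use the support tickets below to answer the question. "
        "Cite ticket_id values where relevant.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Capping the list at `max_tickets` keeps the prompt within the 3 to 5 ticket range suggested above, even if the retriever returns more.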
Zendesk to LangChain: A Practical Pipeline
LangChain's vector stores and retrievers work with Document objects, each with page_content (string) and metadata (dict); document loaders exist to produce them. Zendesk tickets map cleanly to this: the flattened conversation thread is page_content, and everything else goes into metadata.
    from langchain_core.documents import Document

    # zendesk_jsonl: iterable of preprocessed ticket dicts (one per JSONL line),
    # with custom fields already mapped to named keys.
    docs = [
        Document(
            page_content=ticket["conversation_text"],  # flattened thread
            metadata={
                "ticket_id": ticket["id"],
                "category": ticket["custom_fields"]["issue_category"],
                "resolved": ticket["status"] == "solved",
                "organization_id": ticket["organization_id"],
            },
        )
        for ticket in zendesk_jsonl
    ]
The metadata dict is what enables filtered retrieval: querying only resolved tickets, or only tickets from a specific customer segment, without embedding those filters into the vector search itself. This is only possible if your export preserves custom field values under readable names rather than opaque IDs. The raw Zendesk API identifies custom fields by numeric ID ({"id": 8321, "value": "premium"}), so you need the mapped name (customer_segment: "premium") to write a useful metadata filter. For the underlying Postgres schema with foreign keys preserved, see the export format reference.
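A small sketch of that ID-to-name mapping follows. It assumes custom fields arrive as a list of ID/value pairs. The `FIELD_NAMES` table is hypothetical and would normally be built from your own ticket field definitions.

```python
# Hypothetical mapping from custom field IDs to readable names.
FIELD_NAMES = {8321: "customer_segment", 9004: "issue_category"}


def name_custom_fields(raw_fields, field_names=FIELD_NAMES):
    """Convert [{"id": ..., "value": ...}] pairs into a name -> value dict."""
    named = {}
    for field in raw_fields:
        name = field_names.get(field["id"])
        # Drop unmapped IDs and empty values so filters never see opaque keys.
        if name is not None and field["value"] is not None:
            named[name] = field["value"]
    return named
```

The returned dict can be merged straight into a document's metadata, so a filter such as `customer_segment == "premium"` reads the same way in code as it does in conversation.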
Zendesk to Vector Databases: Pinecone, Weaviate, pgvector
The embedding model you choose strongly affects retrieval quality. For support conversation data, text-embedding-3-small (OpenAI) and voyage-3 (Voyage AI) are reasonable starting points for text that mixes technical jargon with informal customer language; benchmark candidates on a sample of your own tickets before committing. Embed each flattened ticket as a single chunk where it fits within the model's input limit, and if a long thread has to be split, split on turn boundaries rather than mid-message.
At small to medium scale (under 500K tickets), pgvector is the pragmatic choice — it runs on your existing Postgres instance and requires zero new infrastructure. For larger corpora, Pinecone offers managed scaling with metadata filtering built in, while Weaviate gives you multi-tenancy and hybrid search (BM25 + vector) out of the box.
The schema shape is consistent across all three: a primary key, the embedding vector, the raw text payload, and a metadata field (a JSONB column in pgvector) holding ticket_id, created_at, resolution_status, and any custom field values you need for filtered retrieval. Evicta's JSONL output is structured for exactly this: every ticket is a flattened conversation with PII pre-flagged.
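To make the shared schema shape concrete, here is a pure-Python sketch of the four fields and of filtered retrieval over them. `TicketRow` and `search` are illustrative names. A real deployment would rely on the database's own index and filter syntax rather than this linear scan.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TicketRow:
    id: int                 # primary key
    embedding: list         # embedding vector
    text: str               # raw flattened conversation
    metadata: dict = field(default_factory=dict)  # ticket_id, created_at, ...


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows, query_vec, top_k=3, **filters):
    """Filter on metadata first, then rank the survivors by cosine similarity."""
    candidates = [
        r for r in rows
        if all(r.metadata.get(k) == v for k, v in filters.items())
    ]
    candidates.sort(key=lambda r: cosine(r.embedding, query_vec), reverse=True)
    return candidates[:top_k]
```

Filtering before ranking mirrors what metadata filters do in the managed stores: the similarity search only ever sees rows that already satisfy the business constraint.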
AI-Native Pre-Processing
Evicta's Premium tier was designed specifically for AI teams. We don't just "dump" data; we structure it for the token window. By delivering data in Relational JSONL, we ensure that every conversation is linked to the user who wrote it and the organization it belongs to. All PII sanitization happens in a zero-persistence environment, so nothing touches our disks.
The Bottom Line
The payoff is a vector database that actually reflects your support history, not one that just parrots back raw API responses.
Frequently Asked Questions
Why is raw Zendesk JSON bad for RAG pipelines?
Raw Zendesk ticket JSON contains system headers, automated trigger notifications, and CSS-heavy HTML signatures. Feeding this directly into an embedding model (like text-embedding-3-small) dilutes the signal of actual resolutions — classic Garbage In, Garbage Out.
What is thread flattening for Zendesk data?
Thread flattening converts Zendesk's nested ticket events into a clean, chronological conversation structure. This lets your vector database ingest coherent dialogue instead of raw, unordered API blobs.
How do you handle PII when preparing Zendesk data for AI training?
Customer passwords, credit card numbers, and personal data must be sanitized before being fed into any third-party LLM. Skipping this step can put your SOC 2 attestation at risk and breach obligations under PCI DSS (card data) or HIPAA (protected health information). Evicta's Premium tier includes PII sanitization as part of its AI-ready JSONL output.
How do I connect Zendesk data to OpenAI for fine-tuning?
OpenAI's fine-tuning API expects JSONL where each line is a {"messages": [...]} object in chat format. The conversion from Zendesk data requires three preprocessing steps: thread flattening (converting nested comments into a sequential conversation), HTML signature stripping (to avoid wasting tokens on noise), and PII redaction (since fine-tuning data goes to OpenAI's servers). Once preprocessed, each Zendesk ticket maps to one training example.
What is the best vector database for Zendesk support data?
At small to medium scale (under 500K tickets), pgvector is the pragmatic choice: it runs on your existing Postgres instance with zero new infrastructure. For larger corpora, Pinecone offers managed scaling with built-in metadata filtering, while Weaviate provides hybrid search (BM25 + vector) and multi-tenancy. For support conversation data specifically, text-embedding-3-small (OpenAI) and voyage-3 (Voyage AI) are reasonable starting points; benchmark them on a sample of your own tickets before committing.