LycheeMem is a compact memory framework for LLM agents. It starts from efficient conversational memory—through structured organization, lightweight consolidation, and adaptive retrieval—and gradually extends toward action-aware, usage-aware memory for more capable agentic systems.
🔥 News
- [03/28/2026] Semantic memory has been upgraded to Compact Semantic Memory (SQLite + LanceDB), no Neo4j required. See /quick-start for details.
- [03/27/2026] OpenClaw Plugin is now available at /openclaw-plugin! Setup guide →
- [03/26/2026] MCP support is available at /mcp!
- [03/23/2026] LycheeMem is now open source: GitHub Repository →
📚 Memory Architecture
LycheeMem organizes memory into three complementary stores:
| Working Memory | Semantic Memory | Procedural Memory |
|---|---|---|
| (Episodic) | (Typed Action Store) | (Skills) |
💾 Working Memory
The working memory window holds the active conversation context for a session. It operates under a dual-threshold token budget:
- Warn threshold (70%) — triggers asynchronous background pre-compression; the current request is not blocked.
- Block threshold (90%) — the pipeline pauses and flushes older turns to a compressed summary before proceeding.
Compression produces summary anchors (past context, distilled) + raw recent turns (last N turns, verbatim). Both are passed downstream as the conversation history.
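A minimal sketch of the dual-threshold logic. The names here (`compress`, `RAW_RECENT_N`, `check_budget`) are illustrative, not LycheeMem's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

WARN_RATIO, BLOCK_RATIO = 0.70, 0.90
RAW_RECENT_N = 4  # hypothetical: how many verbatim recent turns to keep
_executor = ThreadPoolExecutor(max_workers=1)

def compress(turns: list[str]) -> list[str]:
    # Stand-in for LLM distillation: older turns collapse into one summary anchor,
    # the last N turns are kept verbatim.
    older, recent = turns[:-RAW_RECENT_N], turns[-RAW_RECENT_N:]
    return ["[summary] " + " / ".join(t[:40] for t in older)] + recent

def check_budget(turns: list[str], token_count: int, budget: int) -> list[str]:
    usage = token_count / budget
    if usage >= BLOCK_RATIO:
        return compress(turns)             # block: flush before proceeding
    if usage >= WARN_RATIO:
        _executor.submit(compress, turns)  # warn: pre-compress in background
    return turns
```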
🗺️ Semantic Memory — Compact Semantic Memory
Semantic memory is organized around typed, action-annotated MemoryRecords. The storage layer is SQLite (FTS5 full-text search) + LanceDB (vector index).
Memory Record Types
Each memory entry is stored as a MemoryRecord. The memory_type field distinguishes seven semantic categories:
| Type | Description |
|---|---|
| `fact` | Objective facts about the user, environment, or world |
| `preference` | User preferences (style, habits, likes/dislikes) |
| `event` | Specific events that have occurred |
| `constraint` | Conditions that must be respected |
| `procedure` | Reusable step-by-step procedures / methods |
| `failure_pattern` | Previously failed action paths and their causes |
| `tool_affordance` | Capabilities and applicable scenarios of tools/APIs |
Beyond text, every MemoryRecord carries action-facing metadata (tool_tags, constraint_tags, failure_tags, affordance_tags) and usage statistics (retrieval_count, action_success_count, etc.) to seed future reinforcement-learning signals.
Related MemoryRecords can be fused online by the Record Fusion Engine into a denser CompositeRecord; composite entries are ranked above source fragments during retrieval.
Four-Module Pipeline
Module 1: Compact Semantic Encoding
A single-pass pipeline that converts conversation turns into a list of MemoryRecords:
- Typed extraction — LLM extracts self-contained facts and assigns a semantic category to each record.
- Decontextualization — Pronouns and context-dependent phrases are expanded into full expressions, so each record is understandable without the original dialogue.
- Action metadata annotation — LLM annotates each record with `memory_type`, `tool_tags`, `constraint_tags`, `failure_tags`, `affordance_tags`, and other structured labels.
- `record_id = SHA256(normalized_text)` — naturally idempotent; duplicate content is deduplicated automatically.
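A minimal sketch of the idempotent id scheme (the normalization step itself is not shown):

```python
import hashlib

def record_id(normalized_text: str) -> str:
    # Same normalized text -> same id, so re-ingesting duplicate
    # content is a no-op rather than a second row.
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()

assert record_id("user prefers pg_dump") == record_id("user prefers pg_dump")
```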
Module 2: Record Fusion
Triggered online after each consolidation:
- FTS detects existing entries whose text is similar to the new records (candidate pool).
- LLM judges whether the candidate pool is worth merging (`synthesis_judge`).
- If yes, LLM executes the merge and produces a `CompositeRecord` written to both SQLite and LanceDB; original records are retained.
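A rough sketch of this fusion pass, with the judge, merge, and storage steps injected as callables; the real `synthesis_judge` prompt and write paths are internal to LycheeMem:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Record:
    record_id: str
    text: str

def fuse_online(new_records: list[Record],
                fts_search: Callable[[str, int], list[Record]],
                judge: Callable[[Record, list[Record]], bool],
                merge: Callable[[Record, list[Record]], Record],
                write: Callable[[Record], None]) -> None:
    """One fusion pass, triggered after each consolidation."""
    for rec in new_records:
        pool = fts_search(rec.text, 5)    # 1. FTS candidate pool
        if pool and judge(rec, pool):     # 2. synthesis_judge: worth merging?
            write(merge(rec, pool))       # 3. CompositeRecord -> SQLite + LanceDB
            # Source records are retained; composites rank above them at retrieval.
```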
Module 3: Action-Aware Search Planning
Before retrieval, ActionAwareRetrievalPlanner analyses the user query and emits a SearchPlan:
- `mode`: `answer` (factual Q&A) / `action` (needs execution) / `mixed`
- `semantic_queries`: content-facing search terms
- `pragmatic_queries`: action/tool/constraint-facing search terms
- `tool_hints`: tools likely needed for this request
- `required_constraints`: constraints that are missing
- `missing_slots`: parameters / slots that are absent
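A plausible shape for the plan as a dataclass. Field names follow the list above; the defaults and the temporal-filter type are guesses:

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class SearchPlan:
    mode: Literal["answer", "action", "mixed"]
    semantic_queries: list[str] = field(default_factory=list)
    pragmatic_queries: list[str] = field(default_factory=list)
    tool_hints: list[str] = field(default_factory=list)
    required_constraints: list[str] = field(default_factory=list)
    missing_slots: list[str] = field(default_factory=list)
    temporal_filter: tuple[str, str] | None = None  # used by the temporal channel below
```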
The plan drives five-channel recall:
- FTS channel — SQLite FTS5 keyword recall over `MemoryRecord` + `CompositeRecord`
- Semantic vector channel — LanceDB ANN over `semantic_text` embeddings
- Normalised vector channel — LanceDB ANN over `normalized_text` embeddings (for pragmatic queries)
- Tag filter channel — exact filter by `tool_hints` / `constraint_tags`
- Temporal channel — filter by `SearchPlan.temporal_filter` time window
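A rough sketch of the fan-out and de-duplication across channels, with each channel as a callable returning candidate records (names are illustrative):

```python
def recall(plan, channels):
    # channels: callables taking a SearchPlan and returning candidate records
    seen: dict[str, object] = {}
    for channel in channels:
        for rec in channel(plan):
            seen.setdefault(rec.record_id, rec)  # de-duplicate by content-hash id
    return list(seen.values())  # handed to MemoryScorer for ranking
```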
Module 4: Multi-Dimensional Scorer
Candidates from all channels are de-duplicated and ranked by MemoryScorer using a weighted linear combination:
$$\text{Score} = \alpha \cdot S_\text{sem} + \beta \cdot S_\text{action} + \gamma \cdot S_\text{temporal} + \delta \cdot S_\text{recency} + \eta \cdot S_\text{evidence} - \lambda \cdot C_\text{token}$$
| Weight | Meaning | Default |
|---|---|---|
| α | SemanticRelevance (vector distance → similarity) | 0.30 |
| β | ActionUtility (tag match score, mode-aware) | 0.25 |
| γ | TemporalFit (temporal reference match) | 0.15 |
| δ | Recency (memory freshness) | 0.10 |
| η | EvidenceDensity (evidence span density) | 0.10 |
| λ | TokenCost penalty (text length penalty) | 0.10 |
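A sketch of the weighted sum with the default weights from the table above; the feature values themselves stand in for MemoryScorer's real sub-scores:

```python
WEIGHTS = {"sem": 0.30, "action": 0.25, "temporal": 0.15,
           "recency": 0.10, "evidence": 0.10, "token_cost": 0.10}

def score(f: dict[str, float]) -> float:
    """Weighted linear combination; token cost is the only penalty term."""
    return (WEIGHTS["sem"] * f["sem"]
            + WEIGHTS["action"] * f["action"]
            + WEIGHTS["temporal"] * f["temporal"]
            + WEIGHTS["recency"] * f["recency"]
            + WEIGHTS["evidence"] * f["evidence"]
            - WEIGHTS["token_cost"] * f["token_cost"])

print(score({"sem": 0.8, "action": 0.5, "temporal": 0.0,
             "recency": 0.9, "evidence": 0.4, "token_cost": 0.2}))  # ≈ 0.475
```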
🛠️ Procedural Memory — Skill Store
The skill store preserves reusable how-to knowledge as structured skill entries, each carrying:
- Intent — a short description of what the skill does.
doc_markdown— a full Markdown document describing the procedure, commands, parameters, and caveats.- Embedding — a dense vector of the intent text, used for similarity search.
- Metadata — usage counters, last-used timestamp, preconditions.
Skill retrieval uses HyDE (Hypothetical Document Embeddings): the query is first expanded into a hypothetical ideal answer by the LLM, then that draft text is embedded to produce a query vector that matches well against stored procedure descriptions, even when the user's original phrasing is vague.
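A minimal HyDE sketch, where `llm` and `embed` are placeholders for whichever chat and embedding clients are configured:

```python
def hyde_query_vector(user_query: str, llm, embed) -> list[float]:
    # Draft a hypothetical ideal answer first...
    draft = llm(f"Write a short, ideal how-to answer for: {user_query}")
    # ...then embed the draft, not the raw query: the draft's vocabulary sits
    # much closer to stored procedure docs than vague user phrasing does.
    return embed(draft)

# query_vec = hyde_query_vector("that postgres backup thing", llm, embed)
# hits = skill_store.ann_search(query_vec, top_k=3)  # hypothetical store API
```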
⚙️ Pipeline
Every request passes through a fixed sequence of five agents. Four are synchronous stages in the LangGraph pipeline; one is a background post-processing task.
Stage 1 — WMManager
Rule-based agent (no LLM prompt). Appends the user turn to the session log, counts tokens, and fires compression if either threshold is crossed. Produces compressed_history and raw_recent_turns for downstream stages.
Stage 2 — SearchCoordinator
ActionAwareRetrievalPlanner first analyses the user query and produces a SearchPlan containing mode, semantic_queries, pragmatic_queries, tool_hints, and more. Five parallel recall channels (FTS full-text, semantic vector, normalised vector, tag filter, temporal filter) then query SQLite + LanceDB, and the resulting candidates are ranked by the six-dimensional Scorer formula before being merged into background_context. Skill sub-queries use HyDE embedding against the skill store.
Stage 3 — SynthesizerAgent
Acts as an LLM-as-Judge: scores every retrieved memory fragment on an absolute 0-1 relevance scale, discards fragments below the threshold (default 0.6), and fuses the survivors into a single dense background_context string. It also identifies skill_reuse_plan entries that can directly guide the final response. This stage outputs provenance — a citation list containing scoring breakdown and source references for each kept memory item.
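A sketch of the judge-and-filter core, with `judge_relevance` standing in for the actual LLM scoring prompt (the `source` field here is illustrative):

```python
from typing import Callable

THRESHOLD = 0.6  # default relevance cutoff

def synthesize(user_query: str, fragments: list[str],
               judge_relevance: Callable[[str, str], float]) -> tuple[str, list[dict]]:
    kept, provenance = [], []
    for frag in fragments:
        rel = judge_relevance(user_query, frag)  # absolute 0-1 score per fragment
        if rel >= THRESHOLD:
            kept.append(frag)
            provenance.append({"text": frag, "relevance": rel, "source": "semantic_memory"})
    # Fused background_context plus a citation entry for each kept item.
    return "\n".join(kept), provenance
```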
Stage 4 — ReasoningAgent
Receives compressed_history, background_context, and skill_reuse_plan and generates the final assistant reply. It appends the assistant turn back to the session store, completing the feedback loop.
Background — ConsolidatorAgent
Triggered immediately after ReasoningAgent completes, runs in a thread pool and does not block the response. It:
- Performs a novelty check — LLM judges whether the conversation introduced new information worth persisting. Skips consolidation for pure retrieval exchanges.
- Compact consolidation — calls `CompactSemanticEngine.ingest_conversation()`, which runs a single-pass encoder (typed extraction → decontextualization → action metadata annotation), writes `MemoryRecord`s to SQLite + LanceDB, then triggers Record Fusion to merge related entries into `CompositeRecord`s.
- Skill extraction — identifies successful tool-usage patterns in the conversation and adds skill entries to the skill store. Runs in parallel with compact consolidation (ThreadPoolExecutor); see the sketch after this list.
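A sketch of how the novelty gate and the two parallel tasks could fit together; the callables are placeholders for the real engine entry points:

```python
from concurrent.futures import ThreadPoolExecutor

def consolidate(turns, is_novel, ingest_conversation, extract_skills):
    # Novelty gate: skip consolidation for pure retrieval exchanges.
    if not is_novel(turns):
        return
    with ThreadPoolExecutor(max_workers=2) as pool:
        ingest = pool.submit(ingest_conversation, turns)  # compact consolidation
        skills = pool.submit(extract_skills, turns)       # skill extraction
        ingest.result()
        skills.result()
```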
⚡ Quick Start
Prerequisites
- Python 3.11+
- An LLM API key (OpenAI, Gemini, or any litellm-compatible provider)
Installation
git clone https://github.com/LycheeMem/LycheeMem.git
cd LycheeMem
pip install -e ".[dev]"
Configuration
Copy .env.example to .env and fill in your values:
# LLM — litellm format: provider/model
LLM_MODEL=openai/gpt-4o-mini
LLM_API_KEY=sk-...
LLM_API_BASE= # optional
# Embedder
EMBEDDING_MODEL=openai/text-embedding-3-small
EMBEDDING_DIM=1536
# Semantic memory storage paths (optional, defaults to data/ directory)
COMPACT_MEMORY_DB_PATH=data/compact_memory.db
COMPACT_VECTOR_DB_PATH=data/compact_vector
Supported LLM providers (via litellm):
`openai/gpt-4o-mini` · `gemini/gemini-3.0-flash` · `ollama_chat/qwen2.5` · any OpenAI-compatible endpoint
Start the Server
python main.py
# with hot-reload:
python main.py --reload
The API is served at http://localhost:8000. Interactive docs at /docs.
🎨 Web Demo
A frontend demo is included under web-demo/. It provides a chat interface alongside live views of semantic memory, skill library, and working memory state.
cd web-demo
npm install
npm run dev # served at http://localhost:5173
Make sure the backend is running on port 8000 (or update proxy settings in `web-demo/vite.config.ts`) before starting the frontend.
🦞 OpenClaw Plugin
LycheeMem ships a native OpenClaw plugin that gives any OpenClaw session persistent long-term memory with zero manual wiring.
What the plugin provides:
- `lychee_memory_smart_search` — default long-term memory retrieval entry point
- Automatic turn mirroring via hooks — the model does not need to call `append_turn` manually
  - User messages are appended automatically
  - Assistant messages are appended automatically
- `/new`, `/reset`, `/stop`, and `session_end` automatically trigger boundary consolidation
- Proactive consolidation on strong long-term knowledge signals
Under normal operation:
- The model only calls `lychee_memory_smart_search` when recalling long-term context
- The model may call `lychee_memory_consolidate` manually when an immediate persist is warranted
- The model does not need to call `lychee_memory_append_turn` at all
Quick Install
openclaw plugins install "/path/to/LycheeMem/openclaw-plugin"
openclaw gateway restart
See the full setup guide: openclaw-plugin/INSTALL_OPENCLAW.md
🔧 MCP
LycheeMem also exposes an HTTP MCP endpoint at http://localhost:8000/mcp.
- Available tools: `lychee_memory_search`, `lychee_memory_synthesize`, `lychee_memory_consolidate`
- Use `Authorization: Bearer <token>` if you want per-user memory isolation
- `lychee_memory_consolidate` only works for sessions that were already written through `/chat` or `/memory/reason`
MCP Transport
- `POST /mcp` handles JSON-RPC requests
- `GET /mcp` exposes the SSE stream used by some MCP clients
- The server returns `Mcp-Session-Id` during `initialize`; reuse that header on later requests
Authentication
If you want isolated memory per user, first obtain a JWT token from /auth/register or /auth/login, then send:
Authorization: Bearer <token>
Without a token, requests run with an empty user_id, so anonymous traffic shares the same namespace.
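A minimal Python sketch of the token flow, assuming `/auth/login` returns a JSON body whose token field is named `access_token` (the exact field name may differ):

```python
import requests

resp = requests.post("http://localhost:8000/auth/login",
                     json={"username": "alice", "password": "secret123"})
token = resp.json()["access_token"]  # assumed field name

# Send this header on every /mcp (or REST) request for per-user isolation.
mcp_headers = {"Authorization": f"Bearer {token}"}
```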
Client Configuration
For any MCP client that supports remote HTTP servers, configure the MCP URL as:
http://localhost:8000/mcp
Generic config example:
{
"mcpServers": {
"lycheemem": {
"url": "http://localhost:8000/mcp",
"headers": {
"Authorization": "Bearer <token>"
}
}
}
}
Manual JSON-RPC Flow
- Call `initialize`
- Reuse the returned `Mcp-Session-Id`
- Send `initialized`
- Call `tools/list`
- Call `tools/call`
Initialize example:
curl -i -X POST http://localhost:8000/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {
"protocolVersion": "2025-03-26",
"capabilities": {},
"clientInfo": {
"name": "debug-client",
"version": "0.1.0"
}
}
}'
Tool call example:
curl -X POST http://localhost:8000/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-H "Mcp-Session-Id: <session-id>" \
-d '{
"jsonrpc": "2.0",
"id": 2,
"method": "tools/call",
"params": {
"name": "lychee_memory_search",
"arguments": {
"query": "what tools do I use for database backups",
"top_k": 5,
"include_graph": true,
"include_skills": true
}
}
}'
Recommended MCP Usage Pattern
- Use `/chat` or `/memory/reason` with a stable `session_id` to write conversation turns.
- Use `lychee_memory_search` to retrieve relevant long-term memory.
- Use `lychee_memory_synthesize` to compress retrieval results into `background_context`.
- After the conversation ends, call `lychee_memory_consolidate` with the same `session_id`.
🔌 API Reference
POST /memory/search — Unified Memory Retrieval
Query both the semantic memory channel and the skill store in a single call.
// Request
{
"query": "what tools do I use for database backups",
"top_k": 5,
"include_graph": true,
"include_skills": true
}
// Response
{
"query": "...",
"graph_results": [
{
"anchor": {
"node_id": "compact_context",
"name": "CompactSemanticMemory",
"label": "Context",
"score": 1.0
},
"constructed_context": "...",
"provenance": [ { "id": "...", "source": "semantic_memory", "relevance": 0.91, ... } ]
}
],
"skill_results": [ { "id": "...", "intent": "pg_dump backup to S3", "score": 0.87, ... } ],
"total": 6
}
POST /memory/synthesize — Memory Fusion
Takes raw retrieval results and produces a fused memory context using LLM-as-Judge.
// Request
{
"user_query": "what tools do I use for database backups",
"graph_results": [...], // from /memory/search
"skill_results": [...]
}
// Response
{
"background_context": "User regularly uses pg_dump with a cron job...",
"skill_reuse_plan": [ { "skill_id": "...", "intent": "...", "doc_markdown": "..." } ],
"provenance": [ { "id": "...", "source": "semantic_memory", "relevance": 0.91, ... } ],
"kept_count": 4,
"dropped_count": 2
}
POST /memory/reason — Grounded Reasoning
Runs the ReasoningAgent given pre-synthesized context. Can be chained after /memory/synthesize for full pipeline control.
// Request
{
"session_id": "my-session",
"user_query": "what tools do I use for database backups",
"background_context": "User regularly uses pg_dump...",
"skill_reuse_plan": [...],
"append_to_session": true // write result to session history (default: true)
}
// Response
{
"final_response": "You typically use pg_dump scheduled via cron...",
"session_id": "my-session",
"wm_token_usage": 3412
}
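For full pipeline control, the three endpoints above can be chained end to end. A minimal sketch using the third-party `requests` library, reusing the request and response shapes shown above:

```python
import requests

BASE = "http://localhost:8000"
q = "what tools do I use for database backups"

# 1. Retrieve candidates from semantic memory and the skill store.
found = requests.post(f"{BASE}/memory/search",
                      json={"query": q, "top_k": 5,
                            "include_graph": True, "include_skills": True}).json()

# 2. Fuse the retrieval results into a dense background context.
fused = requests.post(f"{BASE}/memory/synthesize",
                      json={"user_query": q,
                            "graph_results": found["graph_results"],
                            "skill_results": found["skill_results"]}).json()

# 3. Generate the grounded reply.
reply = requests.post(f"{BASE}/memory/reason",
                      json={"session_id": "my-session", "user_query": q,
                            "background_context": fused["background_context"],
                            "skill_reuse_plan": fused["skill_reuse_plan"]}).json()
print(reply["final_response"])
```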
POST /memory/consolidate/{session_id} — Trigger Consolidation
Manually trigger memory consolidation for a session (normally runs automatically in the background after each chat turn).
curl -X POST http://localhost:8000/memory/consolidate/my-session
// Response
{ "message": "Consolidation done: 5 entities, 2 skills extracted." }
Usage Examples
# Basic single-turn demo (automatically registers 'demo_user')
python examples/api_pipeline_demo.py
# Multi-turn chat demo (3 consecutive turns, followed by consolidation)
python examples/api_pipeline_demo.py --multi-turn
# Custom query and user credentials
python examples/api_pipeline_demo.py --username alice --password secret123 \
--query "How do I backup my database with pg_dump?"
# Use a fixed session_id (useful for accumulating history across multiple runs)
python examples/api_pipeline_demo.py --session-id my-test-session