# Vector Store Long-Term Memory
Vector store memory gives agents the ability to remember and retrieve information from unbounded history — past conversations, user preferences, facts learned over time — without filling up the context window. Instead of including all past interactions verbatim, the agent retrieves only the most relevant pieces of history for the current query.
## How Vector Store Memory Works
- Storage: Each piece of information (conversation snippet, learned fact, user preference) is converted to a high-dimensional vector via an embedding model and stored in a vector database.
- Retrieval: When processing a new query, the query is embedded and a similarity search finds the most semantically relevant stored memories.
- Injection: Retrieved memories are inserted into the current prompt as context.
This gives agents access to effectively unlimited history while only paying the token cost of the retrieved relevant subset.
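The three-step loop above can be sketched end to end in plain Python. A toy bag-of-words vector stands in for a real embedding model, and an in-memory list with cosine similarity plays the role of the vector database; everything here (`embed`, `store`, `retrieve`) is illustrative, not a library API:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a real system uses an embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

memory_store: list[tuple[Counter, str]] = []  # (vector, original text)

def store(text: str) -> None:
    """Storage: embed the text and keep the vector alongside it."""
    memory_store.append((embed(text), text))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Retrieval: rank stored memories by similarity to the embedded query."""
    q = embed(query)
    ranked = sorted(memory_store, key=lambda m: cosine(q, m[0]), reverse=True)
    return [text for _, text in ranked[:k]]

store("Alice prefers concise technical answers.")
store("Alice is building a Kafka data pipeline.")
store("The weather in Paris was rainy last week.")

# Injection: only the most relevant memories enter the prompt
context = "\n".join(retrieve("How do I tune my Kafka consumers?", k=2))
prompt = f"Relevant past context:\n{context}\n\nQuestion: How do I tune my Kafka consumers?"
```

The token cost is bounded by `k`, not by how many memories have accumulated, which is the whole point of the pattern.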
## Setting Up a Vector Store
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

# Initialize embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a persistent vector store
vector_store = Chroma(
    collection_name="agent_memory",
    embedding_function=embeddings,
    persist_directory="./.chroma_db",  # Persists to disk
)

def store_memory(content: str, metadata: dict | None = None) -> str:
    """Store a new memory in the vector store."""
    doc = Document(page_content=content, metadata=metadata or {})
    ids = vector_store.add_documents([doc])
    return ids[0]

def retrieve_memories(query: str, k: int = 5, filter: dict | None = None) -> list[Document]:
    """Retrieve the k most relevant memories for a query."""
    return vector_store.similarity_search(query=query, k=k, filter=filter)

# Store some memories
store_memory(
    "User Alice prefers concise, technical explanations without analogies.",
    metadata={"user_id": "alice", "type": "preference", "timestamp": "2025-01-15"},
)
store_memory(
    "Alice is building a distributed data pipeline using Apache Kafka and Flink.",
    metadata={"user_id": "alice", "type": "project_context", "timestamp": "2025-01-15"},
)
store_memory(
    "Alice asked about Kafka consumer group rebalancing and was confused by partition assignment.",
    metadata={"user_id": "alice", "type": "knowledge_gap", "timestamp": "2025-01-15"},
)

# Retrieve relevant memories
memories = retrieve_memories("How should I configure my Kafka consumers?", k=3)
for doc in memories:
    print(f"Memory: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")
```
## Vector Store Memory in LangChain Agents
```python
from langchain.memory import VectorStoreRetrieverMemory
from langchain.chains import ConversationChain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Create retriever from vector store
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

# Wrap as LangChain memory
vector_memory = VectorStoreRetrieverMemory(
    retriever=retriever,
    memory_key="relevant_history",
    input_key="input",
)

# Prompt that uses retrieved memories
prompt_template = PromptTemplate(
    input_variables=["relevant_history", "input"],
    template="""You are a helpful assistant with access to relevant past context.

Relevant past context:
{relevant_history}

Current question: {input}

Provide a helpful response, taking into account the past context where relevant.""",
)

chain = ConversationChain(
    llm=llm,
    memory=vector_memory,
    prompt=prompt_template,
    verbose=True,
)

response = chain.invoke({"input": "How should I handle Kafka consumer rebalancing?"})
print(response["response"])
# Should reference Alice's previous confusion about partition assignment
```
## Production Vector Store Options
| Vector Store | Best For | Hosting |
|---|---|---|
| Chroma | Development, local use | Local / self-hosted |
| Pinecone | Production SaaS, large scale | Managed cloud |
| Weaviate | Hybrid search, GraphQL API | Self-hosted / cloud |
| pgvector | Already using PostgreSQL | Self-hosted |
| Qdrant | High performance, filtering | Self-hosted / cloud |
| FAISS | In-memory, no persistence needed | Local |
For most production applications, pgvector (if you're already on PostgreSQL) or Pinecone (if you want a fully managed solution) are the pragmatic choices.
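As a sketch of what the pgvector route involves (table and column names here are illustrative, not a prescribed schema), long-term memory reduces to an ordinary Postgres table with a vector column, and retrieval becomes a nearest-neighbor query:

```sql
-- Illustrative schema: one row per memory, embedding dimension 1536
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE agent_memory (
    id        BIGSERIAL PRIMARY KEY,
    content   TEXT NOT NULL,
    metadata  JSONB DEFAULT '{}',
    embedding VECTOR(1536)
);

-- Retrieval: k nearest neighbors by cosine distance (pgvector's <=> operator)
SELECT content, metadata
FROM agent_memory
ORDER BY embedding <=> $1  -- $1 is the query embedding, bound as a parameter
LIMIT 5;
```

The appeal is operational: memories live next to the rest of your application data, under the same backups, transactions, and access controls.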
## Hybrid Memory Architecture
Combine vector store (semantic long-term) with buffer (recent exact) memory for the best results:
```python
from langchain.memory import CombinedMemory, ConversationBufferWindowMemory

# Recent exact context (last 5 turns), rendered as a string for the string prompt
buffer_memory = ConversationBufferWindowMemory(
    k=5,
    memory_key="recent_history",
    input_key="input",
)

# Semantic long-term context
vector_memory = VectorStoreRetrieverMemory(
    retriever=retriever,
    memory_key="relevant_history",
    input_key="input",
)

# Combine both
combined_memory = CombinedMemory(memories=[buffer_memory, vector_memory])

# Prompt uses both memory types
combined_prompt = PromptTemplate(
    input_variables=["recent_history", "relevant_history", "input"],
    template="""Relevant past context:
{relevant_history}

Recent conversation:
{recent_history}

User: {input}
Assistant:""",
)
```
This pattern gives you the precision of exact recent context plus the breadth of semantic long-term retrieval — with bounded token usage regardless of conversation length.
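Stripped of the framework wrappers, the hybrid pattern is just two lookups merged into one prompt. In this sketch, `retrieve_semantic` is a hypothetical stand-in for any top-k similarity search, and a fixed-size deque plays the role of the buffer memory:

```python
from collections import deque

# Buffer memory: last 4 conversation lines, kept verbatim
recent_turns: deque[str] = deque(maxlen=4)

def retrieve_semantic(query: str, k: int = 2) -> list[str]:
    """Hypothetical stand-in for a vector-store similarity search."""
    long_term = [
        "Alice prefers concise technical answers.",
        "Alice is building a Kafka data pipeline.",
    ]
    return long_term[:k]

def build_prompt(user_input: str) -> str:
    """Merge semantic long-term hits and the recent window into one prompt."""
    relevant = "\n".join(retrieve_semantic(user_input))
    recent = "\n".join(recent_turns)
    return (
        f"Relevant past context:\n{relevant}\n\n"
        f"Recent conversation:\n{recent}\n\n"
        f"User: {user_input}\nAssistant:"
    )

recent_turns.append("User: What is a consumer group?")
recent_turns.append("Assistant: A set of consumers that share a topic's partitions.")
prompt = build_prompt("How should I handle rebalancing?")
```

Because the deque is capped at a fixed number of turns and the semantic lookup is capped at `k` results, the prompt size stays bounded no matter how long the conversation runs.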