10 min read

Building My Own RAG: AWS Bedrock, Titan Embeddings, and Mistral Models

How I built a custom RAG chatbot using AWS Bedrock Titan embeddings, Mistral language models, and file-based vector storage. A practical implementation with cosine similarity search and rate limiting.

AWS, RAG, Bedrock, Titan, Mistral, Embeddings, Next.js, TypeScript

Hey everyone! I'm excited to share how I built my own RAG (Retrieval-Augmented Generation) chatbot using AWS Bedrock services. This project combines AWS Titan embeddings with Mistral language models to create an intelligent chatbot that knows about my experience and projects. I'll walk you through the actual implementation details based on my working codebase.

Why Build My Own RAG?

I wanted to create a chatbot that could intelligently answer questions about my experience, skills, and projects. Instead of using off-the-shelf solutions, I decided to build it from scratch to:

  • Learn RAG internals by implementing every component myself
  • Use AWS Bedrock for both embeddings and language models
  • Keep it simple with file-based storage and cosine similarity
  • Add rate limiting to prevent abuse and manage costs

The Actual Architecture

My RAG system uses a straightforward but effective architecture:

  • AWS Bedrock Titan (Text Embeddings v2) - Generates 1024-dimensional vectors for semantic search using amazon.titan-embed-text-v2:0
  • Mistral Models (Language Generation) - Uses Mistral 7B Instruct and Mixtral models through Bedrock for response generation
  • File-Based Storage (JSON) - Stores embeddings and text chunks in JSON files for simplicity and portability
  • Cosine Similarity (Vector Search) - Implements cosine similarity to find the most relevant text chunks
  • Next.js API (Chat Endpoint) - RESTful API endpoint with rate limiting and error handling
  • Rate Limiting (Request Management) - In-memory rate limiting to prevent abuse (10 requests per minute per IP)

Implementation Details

1. Data Preparation

I started by creating structured data chunks from my professional information:

[
  {
    "id": "contact-001",
    "text": "Aniket Patil - Contact Information: Phone: +91 8421015314...",
    "source": "resume",
    "meta": {
      "category": "contact",
      "tags": ["contact-info", "phone", "email"]
    }
  },
  {
    "id": "experience-flipick-001",
    "text": "Current Role - Flipick Pvt Ltd, Pune: Team Lead...",
    "source": "experience",
    "meta": {
      "category": "experience",
      "tags": ["flipick", "team-lead", "ai-solutions"]
    }
  }
]

2. Embedding Generation

The embedding generation script processes chunks and creates vector representations:

// AWS Bedrock Titan Text Embeddings V2
const MODEL_ID = 'amazon.titan-embed-text-v2:0';

const embedding = await generateEmbedding(chunk.text);
// Returns 1024-dimensional vector

embeddingsData.push({
  id: chunk.id,
  text: chunk.text,
  embedding: embedding,
  source: chunk.source,
  wordCount: chunk.text.split(' ').length
});
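
The generateEmbedding helper used above isn't shown in the snippet. Here's a minimal sketch of one way to implement it with the Bedrock runtime client; the request and response fields (inputText, dimensions, normalize, embedding) follow the Titan Text Embeddings V2 model, and the client setup is my own simplification:

import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION ?? 'us-east-1' });

async function generateEmbedding(text: string): Promise<number[]> {
  const command = new InvokeModelCommand({
    modelId: MODEL_ID, // amazon.titan-embed-text-v2:0
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify({ inputText: text, dimensions: 1024, normalize: true }),
  });

  const response = await client.send(command);
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  return payload.embedding; // 1024-dimensional vector
}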

3. Similarity Search

When a user asks a question, the system finds the most relevant chunks:

  1. Converts the query to embeddings using the same Titan model
  2. Performs vector similarity search using cosine similarity
  3. Retrieves top-k most relevant chunks with scores
  4. Constructs context for the language model

// Generate embedding for user query
const userEmbedding = await generateEmbedding(message);

// Calculate cosine similarity with all chunks
const similarities = embeddings.map(chunk => ({
  ...chunk,
  similarity: cosineSimilarity(userEmbedding, chunk.embedding)
}));

// Sort by similarity and take top 4
similarities.sort((a, b) => b.similarity - a.similarity);
const topChunks = similarities.slice(0, 4);

Chat API Implementation

The chat endpoint handles requests with proper error handling and rate limiting:

// NextRequest and NextResponse come from 'next/server'
export async function POST(request: NextRequest) {
  // Rate limiting check
  const ip = request.ip || 'unknown';
  if (!checkRateLimit(ip)) {
    return NextResponse.json(
      { error: 'Rate limit exceeded. Please try again later.' },
      { status: 429 }
    );
  }

  // Parse the user message from the request body
  const { message } = await request.json();

  // Generate user query embedding
  const userEmbedding = await generateEmbedding(message);

  // Find similar chunks
  const topChunks = findTopSimilarChunks(userEmbedding, embeddings);

  // Generate response using Mistral
  const response = await generateMistralResponse(message, topChunks);

  return NextResponse.json({ response });
}

Rate Limiting Strategy

I implemented a simple but effective rate limiting system; a sketch of the limiter follows the list:

  • 10 requests per minute per IP address
  • In-memory tracking using Map data structure
  • Automatic cleanup of expired entries
  • 429 status code for exceeded limits
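
Here's roughly what checkRateLimit looks like. This is a simplified sketch of the bookkeeping, not the exact code: the Map key is the caller's IP and the window resets every 60 seconds.

const RATE_LIMIT = 10;        // requests per window
const WINDOW_MS = 60 * 1000;  // one minute

const requestCounts = new Map<string, { count: number; windowStart: number }>();

function checkRateLimit(ip: string): boolean {
  const now = Date.now();
  const entry = requestCounts.get(ip);

  // New client, or the previous window has expired: start a fresh window
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    requestCounts.set(ip, { count: 1, windowStart: now });
    return true;
  }

  if (entry.count >= RATE_LIMIT) return false; // over the limit, caller returns 429

  entry.count += 1;
  return true;
}

// Periodically drop expired entries so the Map doesn't grow unbounded
setInterval(() => {
  const now = Date.now();
  for (const [ip, entry] of requestCounts) {
    if (now - entry.windowStart > WINDOW_MS) requestCounts.delete(ip);
  }
}, WINDOW_MS);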

Advanced Features

Streaming Response Implementation

For a better user experience, I implemented real-time streaming responses using AWS Bedrock's native streaming API:

// Native AWS Bedrock streaming implementation
// (InvokeModelWithResponseStreamCommand comes from '@aws-sdk/client-bedrock-runtime')
const bedrockInput: InvokeModelWithResponseStreamCommandInput = {
  modelId: 'mistral.mistral-7b-instruct-v0:2',
  contentType: 'application/json',
  accept: 'application/json',
  body: JSON.stringify({
    prompt: `<s>[INST] ${systemPrompt}\n\nHuman: ${message} [/INST]`,
    max_tokens: 1000,
    temperature: 0.7,
  }),
};

const bedrockCommand = new InvokeModelWithResponseStreamCommand(bedrockInput);
const bedrockResponse = await bedrockClient.send(bedrockCommand);

// Stream response chunks in real-time
// (controller and encoder belong to the surrounding ReadableStream, shown below)
for await (const chunk of bedrockResponse.body) {
  if (chunk.chunk?.bytes) {
    const decodedChunk = new TextDecoder().decode(chunk.chunk.bytes);
    const parsedChunk = JSON.parse(decodedChunk);

    if (parsedChunk.outputs?.[0]?.text) {
      // Send the text chunk to the frontend immediately as an SSE event
      controller.enqueue(encoder.encode(`data: ${JSON.stringify({
        text: parsedChunk.outputs[0].text
      })}\n\n`));
    }
  }
}
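
The controller and encoder above belong to the surrounding stream. Roughly, the loop lives inside a ReadableStream's start callback and the stream is returned as a Server-Sent Events response; this is a simplified sketch with assumed header values:

const encoder = new TextEncoder();

const stream = new ReadableStream({
  async start(controller) {
    // ...the Bedrock streaming loop from above runs here,
    // calling controller.enqueue() for each text chunk...
    controller.close(); // end the SSE stream once Bedrock finishes
  },
});

return new Response(stream, {
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  },
});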

Session Management

The chat interface includes persistent session management with localStorage:

// Session persistence in localStorage
useEffect(() => {
  const saved = localStorage.getItem('aniket-chat-conversation');
  if (saved) {
    const parsed = JSON.parse(saved);
    setState(prev => ({
      ...prev,
      messages: parsed.messages.map(msg => ({
        ...msg,
        timestamp: new Date(msg.timestamp)
      }))
    }));
  }
}, []);

// Auto-save on message changes
useEffect(() => {
  if (state.messages.length > 0) {
    localStorage.setItem('aniket-chat-conversation', JSON.stringify({
      messages: state.messages,
      lastUpdated: new Date().toISOString()
    }));
  }
}, [state.messages]);

// New session functionality
const clearSession = () => {
  setState({ messages: [], isLoading: false });
  localStorage.removeItem('aniket-chat-conversation');
};

Frontend Integration

The React frontend handles streaming responses with real-time updates (a client-side sketch follows the list):

  • EventSource API - Handles Server-Sent Events for streaming
  • Real-time updates - Updates UI as chunks arrive
  • Error handling - Falls back to regular API if streaming fails
  • Source citations - Shows relevant document sources for responses
  • Context chunks - Displays relevant text excerpts used for generation
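
One way to consume the stream on the client is with fetch and a stream reader, since the chat endpoint is a POST. This is a sketch only; the endpoint path and helper names are assumptions for illustration:

async function streamChat(message: string, onText: (text: string) => void) {
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
  });

  // No stream available: the caller falls back to the regular API
  if (!response.ok || !response.body) throw new Error('Streaming failed');

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Each SSE event arrives as a line like: data: {"text":"..."}
    for (const line of decoder.decode(value, { stream: true }).split('\n')) {
      if (line.startsWith('data: ')) {
        const payload = JSON.parse(line.slice(6));
        if (payload.text) onText(payload.text); // update the UI as chunks arrive
      }
    }
  }
}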

Challenges and Solutions

Challenge 1: AWS Credentials Management

Managing AWS credentials securely in a Next.js application.

Solution: Used AWS SDK with server-side credential loading and proper error handling for missing credentials.
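
In practice that means the Bedrock client is created on the server and picks its credentials up from environment variables. A minimal sketch; the region default and error message are my own choices here:

import { BedrockRuntimeClient } from '@aws-sdk/client-bedrock-runtime';

// Fail fast with a clear message if credentials are missing
if (!process.env.AWS_ACCESS_KEY_ID || !process.env.AWS_SECRET_ACCESS_KEY) {
  throw new Error('AWS credentials are not configured. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.');
}

// The SDK reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment,
// so credentials never reach the browser bundle.
const bedrockClient = new BedrockRuntimeClient({
  region: process.env.AWS_REGION ?? 'us-east-1',
});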

Challenge 2: Embedding Generation Speed

AWS Bedrock has rate limits and embedding generation takes time.

Solution: Added 100ms delays between requests and implemented batch processing where possible.
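
The delay itself is just a timed pause between calls; a simplified sketch of the embedding loop:

// Small helper to pause between Bedrock calls
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

for (const chunk of chunks) {
  const embedding = await generateEmbedding(chunk.text);
  embeddingsData.push({ ...chunk, embedding });
  await sleep(100); // 100ms between requests to stay under Bedrock's rate limits
}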

Challenge 3: Context Window Management

Balancing relevant context with token limits for language models.

Solution: Limited to top-4 chunks and truncated long texts to stay within token limits.
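
A simplified sketch of how the context gets assembled; the character cap and prompt wording are illustrative, not the exact values from my code:

const MAX_CHUNK_CHARS = 1200; // illustrative cap to stay within the token budget

const context = topChunks
  .map(chunk => chunk.text.slice(0, MAX_CHUNK_CHARS))
  .join('\n\n---\n\n');

const systemPrompt = `Answer using only the following context about Aniket:\n\n${context}`;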

Key Technical Components

Cosine Similarity Implementation

function cosineSimilarity(vecA: number[], vecB: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;

  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }

  if (normA === 0 || normB === 0) return 0;
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

Error Handling

Robust error handling for AWS service issues and model failures (a sketch of the model fallback follows the list):

  • Model fallback - Tries multiple Mistral models if one fails
  • Rate limit errors - Returns proper HTTP status codes
  • Credential errors - Clear error messages for setup issues
  • Validation errors - Input validation and length limits
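
The fallback chain is just a loop over model IDs. A simplified sketch; the Mixtral ID is the standard Bedrock identifier and the helper name is illustrative:

const MODEL_IDS = [
  'mistral.mistral-7b-instruct-v0:2',
  'mistral.mixtral-8x7b-instruct-v0:1',
];

async function invokeWithFallback(body: string): Promise<string> {
  let lastError: unknown;
  for (const modelId of MODEL_IDS) {
    try {
      // InvokeModelCommand comes from '@aws-sdk/client-bedrock-runtime'
      const response = await bedrockClient.send(new InvokeModelCommand({
        modelId,
        contentType: 'application/json',
        accept: 'application/json',
        body,
      }));
      return new TextDecoder().decode(response.body);
    } catch (error) {
      lastError = error; // try the next model in the chain
    }
  }
  throw lastError;
}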

Performance Characteristics

  • 1024 - Embedding dimensions
  • 4 - Top chunks retrieved
  • 10 - Requests per minute limit

What Works Well

  • Accurate responses about my experience and skills
  • Cost effective - Only pay for actual API calls
  • Simple to maintain - No complex infrastructure
  • Fast development - Quick to iterate and test

AWS IAM Policies and Security

Proper AWS IAM permissions are crucial for the RAG system to function securely.

Security Best Practices

  • Principle of Least Privilege - Only grant access to specific models you need
  • Environment Variables - Never hardcode AWS credentials in your code
  • Server-side Only - Keep AWS credentials on the server-side, never expose them to the client
  • Model Access - Ensure models are enabled in your AWS Bedrock console
  • Region Specification - Specify the correct AWS region for your Bedrock models

Additional Production Features

Error Handling and Fallbacks

Robust error handling ensures the chat system remains reliable:

  • Model Fallback Chain - Tries multiple Mistral models if one fails
  • Streaming Fallback - Falls back to regular API if streaming fails
  • Rate Limit Handling - Returns proper HTTP status codes (429) for exceeded limits
  • Input Validation - Validates message length and content before processing
  • AWS Error Handling - Specific error messages for credential and quota issues

Performance Optimizations

Several optimizations ensure fast and cost-effective operation:

  • Embedding Caching - Embeddings are pre-computed and stored in JSON files
  • Top-K Retrieval - Only retrieves top 4 most relevant chunks per query
  • Token Limits - Enforces reasonable input/output token limits
  • Context Truncation - Truncates long text chunks to fit within limits
  • Batch Processing - Processes embedding generation in batches with delays

User Experience Enhancements

The frontend provides a polished chat experience:

  • Real-time Streaming - Responses appear as they're generated
  • Source Citations - Shows which documents were used for each response
  • Context Preview - Expandable context chunks show relevant excerpts
  • Session Persistence - Conversations are saved and restored automatically
  • Responsive Design - Works seamlessly on desktop and mobile devices
  • Loading States - Clear feedback during processing and streaming

Future Improvements

While the current implementation works well, there are several areas for enhancement:

  • Persistent storage - Move from JSON files to a proper database
  • Better rate limiting - Use Redis for distributed rate limiting
  • Response caching - Cache frequent queries to reduce costs
  • Conversation history - Track chat history for better context
  • Advanced chunking - Implement smarter text splitting strategies
  • Multi-user support - Add user authentication and personalized chats
  • Analytics - Track usage patterns and performance metrics