10 min read

Building My Own RAG: AWS Bedrock, Titan Embeddings, and Mistral Models

How I built a custom RAG chatbot using AWS Bedrock Titan embeddings, Mistral language models, and file-based vector storage. A practical implementation with cosine similarity search and rate limiting.

AWS, RAG, Bedrock, Titan, Mistral, Embeddings, Next.js, TypeScript

Hey everyone! I'm excited to share how I built my own RAG (Retrieval-Augmented Generation) chatbot using AWS Bedrock services. This project combines AWS Titan embeddings with Mistral language models to create an intelligent chatbot that knows about my experience and projects. I'll walk you through the actual implementation details based on my working codebase.

Why Build My Own RAG?

I wanted to create a chatbot that could intelligently answer questions about my experience, skills, and projects. Instead of using off-the-shelf solutions, I decided to build it from scratch to:

  • Learn RAG internals by implementing every component myself
  • Use AWS Bedrock for both embeddings and language models
  • Keep it simple with file-based storage and cosine similarity
  • Add rate limiting to prevent abuse and manage costs

The Actual Architecture

My RAG system uses a straightforward but effective architecture:

  • AWS Bedrock Titan (Text Embeddings v2) - Generates 1024-dimensional vectors for semantic search using amazon.titan-embed-text-v2:0
  • Mistral Models (Language Generation) - Uses Mistral 7B Instruct and Mixtral models through Bedrock for response generation
  • File-Based Storage (JSON) - Stores embeddings and text chunks in JSON files for simplicity and portability
  • Cosine Similarity (Vector Search) - Implements cosine similarity to find the most relevant text chunks
  • Next.js API (Chat Endpoint) - RESTful API endpoint with rate limiting and error handling
  • Rate Limiting (Request Management) - In-memory rate limiting to prevent abuse (10 requests per minute per IP)

Implementation Details

1. Data Preparation

I started by creating structured data chunks from my professional information:

[
  {
    "id": "contact-001",
    "text": "Aniket Patil - Contact Information: Phone: +91 8421015314...",
    "source": "resume",
    "meta": {
      "category": "contact",
      "tags": ["contact-info", "phone", "email"]
    }
  },
  {
    "id": "experience-flipick-001",
    "text": "Current Role - Flipick Pvt Ltd, Pune: Team Lead...",
    "source": "experience",
    "meta": {
      "category": "experience",
      "tags": ["flipick", "team-lead", "ai-solutions"]
    }
  }
]

2. Embedding Generation

The embedding generation script processes chunks and creates vector representations:

// AWS Bedrock Titan Text Embeddings V2
const MODEL_ID = 'amazon.titan-embed-text-v2:0';

const embedding = await generateEmbedding(chunk.text);
// Returns 1024-dimensional vector

embeddingsData.push({
  id: chunk.id,
  text: chunk.text,
  embedding: embedding,
  source: chunk.source,
  wordCount: chunk.text.split(' ').length
});
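
The generateEmbedding helper used above isn't shown in the snippet. Here's a minimal sketch of one way to implement it with the Bedrock runtime client; the request and response fields (inputText, dimensions, normalize, embedding) follow the Titan Text Embeddings V2 model, and the client setup is my own simplification:

import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const client = new BedrockRuntimeClient({ region: process.env.AWS_REGION ?? 'us-east-1' });

async function generateEmbedding(text: string): Promise<number[]> {
  const command = new InvokeModelCommand({
    modelId: MODEL_ID, // amazon.titan-embed-text-v2:0
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify({ inputText: text, dimensions: 1024, normalize: true }),
  });

  const response = await client.send(command);
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  return payload.embedding; // 1024-dimensional vector
}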

3. Similarity Search

When a user asks a question, the system finds the most relevant chunks:

  1. Converts the query to embeddings using the same Titan model
  2. Performs vector similarity search using cosine similarity
  3. Retrieves top-k most relevant chunks with scores
  4. Constructs context for the language model

// Generate embedding for user query
const userEmbedding = await generateEmbedding(message);

// Calculate cosine similarity with all chunks
const similarities = embeddings.map(chunk => ({
  ...chunk,
  similarity: cosineSimilarity(userEmbedding, chunk.embedding)
}));

// Sort by similarity and take top 4
similarities.sort((a, b) => b.similarity - a.similarity);
const topChunks = similarities.slice(0, 4);

Chat API Implementation

The chat endpoint handles requests with proper error handling and rate limiting:

// NextRequest and NextResponse come from 'next/server'
export async function POST(request: NextRequest) {
  // Rate limiting check
  const ip = request.ip || 'unknown';
  if (!checkRateLimit(ip)) {
    return NextResponse.json(
      { error: 'Rate limit exceeded. Please try again later.' },
      { status: 429 }
    );
  }

  // Parse the user message from the request body
  const { message } = await request.json();

  // Generate user query embedding
  const userEmbedding = await generateEmbedding(message);

  // Find similar chunks
  const topChunks = findTopSimilarChunks(userEmbedding, embeddings);

  // Generate response using Mistral
  const response = await generateMistralResponse(message, topChunks);

  return NextResponse.json({ response });
}

Rate Limiting Strategy

I implemented a simple but effective rate limiting system; a sketch of the limiter follows the list:

  • 10 requests per minute per IP address
  • In-memory tracking using Map data structure
  • Automatic cleanup of expired entries
  • 429 status code for exceeded limits
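
Here's roughly what checkRateLimit looks like. This is a simplified sketch of the bookkeeping, not the exact code: the Map key is the caller's IP and the window resets every 60 seconds.

const RATE_LIMIT = 10;        // requests per window
const WINDOW_MS = 60 * 1000;  // one minute

const requestCounts = new Map<string, { count: number; windowStart: number }>();

function checkRateLimit(ip: string): boolean {
  const now = Date.now();
  const entry = requestCounts.get(ip);

  // New client, or the previous window has expired: start a fresh window
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    requestCounts.set(ip, { count: 1, windowStart: now });
    return true;
  }

  if (entry.count >= RATE_LIMIT) return false; // over the limit, caller returns 429

  entry.count += 1;
  return true;
}

// Periodically drop expired entries so the Map doesn't grow unbounded
setInterval(() => {
  const now = Date.now();
  for (const [ip, entry] of requestCounts) {
    if (now - entry.windowStart > WINDOW_MS) requestCounts.delete(ip);
  }
}, WINDOW_MS);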

Advanced Features

Streaming Response Implementation

For a better user experience, I implemented real-time streaming responses using AWS Bedrock's native streaming API:

// Native AWS Bedrock streaming implementation
// (InvokeModelWithResponseStreamCommand comes from '@aws-sdk/client-bedrock-runtime')
const bedrockInput: InvokeModelWithResponseStreamCommandInput = {
  modelId: 'mistral.mistral-7b-instruct-v0:2',
  contentType: 'application/json',
  accept: 'application/json',
  body: JSON.stringify({
    prompt: `<s>[INST] ${systemPrompt}\n\nHuman: ${message} [/INST]`,
    max_tokens: 1000,
    temperature: 0.7,
  }),
};

const bedrockCommand = new InvokeModelWithResponseStreamCommand(bedrockInput);
const bedrockResponse = await bedrockClient.send(bedrockCommand);

// Stream response chunks in real-time
// (controller and encoder belong to the surrounding ReadableStream, shown below)
for await (const chunk of bedrockResponse.body) {
  if (chunk.chunk?.bytes) {
    const decodedChunk = new TextDecoder().decode(chunk.chunk.bytes);
    const parsedChunk = JSON.parse(decodedChunk);

    if (parsedChunk.outputs?.[0]?.text) {
      // Send the text chunk to the frontend immediately as an SSE event
      controller.enqueue(encoder.encode(`data: ${JSON.stringify({
        text: parsedChunk.outputs[0].text
      })}\n\n`));
    }
  }
}
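
The controller and encoder above belong to the surrounding stream. Roughly, the loop lives inside a ReadableStream's start callback and the stream is returned as a Server-Sent Events response; this is a simplified sketch with assumed header values:

const encoder = new TextEncoder();

const stream = new ReadableStream({
  async start(controller) {
    // ...the Bedrock streaming loop from above runs here,
    // calling controller.enqueue() for each text chunk...
    controller.close(); // end the SSE stream once Bedrock finishes
  },
});

return new Response(stream, {
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  },
});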

Session Management

The chat interface includes persistent session management with localStorage:

// Session persistence in localStorage
useEffect(() => {
  const saved = localStorage.getItem('aniket-chat-conversation');
  if (saved) {
    const parsed = JSON.parse(saved);
    setState(prev => ({
      ...prev,
      messages: parsed.messages.map(msg => ({
        ...msg,
        timestamp: new Date(msg.timestamp)
      }))
    }));
  }
}, []);

// Auto-save on message changes
useEffect(() => {
  if (state.messages.length > 0) {
    localStorage.setItem('aniket-chat-conversation', JSON.stringify({
      messages: state.messages,
      lastUpdated: new Date().toISOString()
    }));
  }
}, [state.messages]);

// New session functionality
const clearSession = () => {
  setState({ messages: [], isLoading: false });
  localStorage.removeItem('aniket-chat-conversation');
};

Frontend Integration

The React frontend handles streaming responses with real-time updates (a client-side sketch follows the list):

  • EventSource API - Handles Server-Sent Events for streaming
  • Real-time updates - Updates UI as chunks arrive
  • Error handling - Falls back to regular API if streaming fails
  • Source citations - Shows relevant document sources for responses
  • Context chunks - Displays relevant text excerpts used for generation
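
One way to consume the stream on the client is with fetch and a stream reader, since the chat endpoint is a POST. This is a sketch only; the endpoint path and helper names are assumptions for illustration:

async function streamChat(message: string, onText: (text: string) => void) {
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
  });

  // No stream available: the caller falls back to the regular API
  if (!response.ok || !response.body) throw new Error('Streaming failed');

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Each SSE event arrives as a line like: data: {"text":"..."}
    for (const line of decoder.decode(value, { stream: true }).split('\n')) {
      if (line.startsWith('data: ')) {
        const payload = JSON.parse(line.slice(6));
        if (payload.text) onText(payload.text); // update the UI as chunks arrive
      }
    }
  }
}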

Challenges and Solutions

Challenge 1: AWS Credentials Management

Managing AWS credentials securely in a Next.js application.

Solution: Used AWS SDK with server-side credential loading and proper error handling for missing credentials.
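
In practice that means the Bedrock client is created on the server and picks its credentials up from environment variables. A minimal sketch; the region default and error message are my own choices here:

import { BedrockRuntimeClient } from '@aws-sdk/client-bedrock-runtime';

// Fail fast with a clear message if credentials are missing
if (!process.env.AWS_ACCESS_KEY_ID || !process.env.AWS_SECRET_ACCESS_KEY) {
  throw new Error('AWS credentials are not configured. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.');
}

// The SDK reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment,
// so credentials never reach the browser bundle.
const bedrockClient = new BedrockRuntimeClient({
  region: process.env.AWS_REGION ?? 'us-east-1',
});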

Challenge 2: Embedding Generation Speed

AWS Bedrock has rate limits and embedding generation takes time.

Solution: Added 100ms delays between requests and implemented batch processing where possible.
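
The delay itself is just a timed pause between calls; a simplified sketch of the embedding loop:

// Small helper to pause between Bedrock calls
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

for (const chunk of chunks) {
  const embedding = await generateEmbedding(chunk.text);
  embeddingsData.push({ ...chunk, embedding });
  await sleep(100); // 100ms between requests to stay under Bedrock's rate limits
}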

Challenge 3: Context Window Management

Balancing relevant context with token limits for language models.

Solution: Limited to top-4 chunks and truncated long texts to stay within token limits.
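
A simplified sketch of how the context gets assembled; the character cap and prompt wording are illustrative, not the exact values from my code:

const MAX_CHUNK_CHARS = 1200; // illustrative cap to stay within the token budget

const context = topChunks
  .map(chunk => chunk.text.slice(0, MAX_CHUNK_CHARS))
  .join('\n\n---\n\n');

const systemPrompt = `Answer using only the following context about Aniket:\n\n${context}`;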

Key Technical Components

Cosine Similarity Implementation

function cosineSimilarity(vecA: number[], vecB: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;

  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }

  if (normA === 0 || normB === 0) return 0;
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

Error Handling

Robust error handling for AWS service issues and model failures (a sketch of the model fallback follows the list):

  • Model fallback - Tries multiple Mistral models if one fails
  • Rate limit errors - Returns proper HTTP status codes
  • Credential errors - Clear error messages for setup issues
  • Validation errors - Input validation and length limits
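
The fallback chain is just a loop over model IDs. A simplified sketch; the Mixtral ID is the standard Bedrock identifier and the helper name is illustrative:

const MODEL_IDS = [
  'mistral.mistral-7b-instruct-v0:2',
  'mistral.mixtral-8x7b-instruct-v0:1',
];

async function invokeWithFallback(body: string): Promise<string> {
  let lastError: unknown;
  for (const modelId of MODEL_IDS) {
    try {
      // InvokeModelCommand comes from '@aws-sdk/client-bedrock-runtime'
      const response = await bedrockClient.send(new InvokeModelCommand({
        modelId,
        contentType: 'application/json',
        accept: 'application/json',
        body,
      }));
      return new TextDecoder().decode(response.body);
    } catch (error) {
      lastError = error; // try the next model in the chain
    }
  }
  throw lastError;
}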

Performance Characteristics

  • 1024 - Embedding dimensions
  • 4 - Top chunks retrieved
  • 10 - Requests per minute limit

What Works Well

  • Accurate responses about my experience and skills
  • Cost effective - Only pay for actual API calls
  • Simple to maintain - No complex infrastructure
  • Fast development - Quick to iterate and test

AWS IAM Policies and Security

Proper AWS IAM permissions are crucial for the RAG system to function securely.

Security Best Practices

  • Principle of Least Privilege - Only grant access to specific models you need
  • Environment Variables - Never hardcode AWS credentials in your code
  • Server-side Only - Keep AWS credentials on the server-side, never expose them to the client
  • Model Access - Ensure models are enabled in your AWS Bedrock console
  • Region Specification - Specify the correct AWS region for your Bedrock models

Additional Production Features

Error Handling and Fallbacks

Robust error handling ensures the chat system remains reliable:

  • Model Fallback Chain - Tries multiple Mistral models if one fails
  • Streaming Fallback - Falls back to regular API if streaming fails
  • Rate Limit Handling - Returns proper HTTP status codes (429) for exceeded limits
  • Input Validation - Validates message length and content before processing
  • AWS Error Handling - Specific error messages for credential and quota issues

Performance Optimizations

Several optimizations ensure fast and cost-effective operation:

  • Embedding Caching - Embeddings are pre-computed and stored in JSON files
  • Top-K Retrieval - Only retrieves top 4 most relevant chunks per query
  • Token Limits - Enforces reasonable input/output token limits
  • Context Truncation - Truncates long text chunks to fit within limits
  • Batch Processing - Processes embedding generation in batches with delays

User Experience Enhancements

The frontend provides a polished chat experience:

  • Real-time Streaming - Responses appear as they're generated
  • Source Citations - Shows which documents were used for each response
  • Context Preview - Expandable context chunks show relevant excerpts
  • Session Persistence - Conversations are saved and restored automatically
  • Responsive Design - Works seamlessly on desktop and mobile devices
  • Loading States - Clear feedback during processing and streaming

Future Improvements

While the current implementation works well, there are several areas for enhancement:

  • Persistent storage - Move from JSON files to a proper database
  • Better rate limiting - Use Redis for distributed rate limiting
  • Response caching - Cache frequent queries to reduce costs
  • Conversation history - Track chat history for better context
  • Advanced chunking - Implement smarter text splitting strategies
  • Multi-user support - Add user authentication and personalized chats
  • Analytics - Track usage patterns and performance metrics