Hey everyone! I'm excited to share how I built my own RAG (Retrieval-Augmented Generation) chatbot using AWS Bedrock services. This project combines AWS Titan embeddings with Mistral language models to create an intelligent chatbot that knows about my experience and projects. I'll walk you through the actual implementation details based on my working codebase.
Why Build My Own RAG?
I wanted to create a chatbot that could intelligently answer questions about my experience, skills, and projects. Instead of using off-the-shelf solutions, I decided to build it from scratch to:
- Learn RAG internals by implementing every component myself
- Use AWS Bedrock for both embeddings and language models
- Keep it simple with file-based storage and cosine similarity
- Add rate limiting to prevent abuse and manage costs
The Actual Architecture
My RAG system uses a straightforward but effective architecture:
- AWS Bedrock Titan (Text Embeddings v2) - Generates 1024-dimensional vectors for semantic search using amazon.titan-embed-text-v2:0
- Mistral Models (Language Generation) - Uses Mistral 7B Instruct and Mixtral models through Bedrock for response generation
- File-Based Storage (JSON) - Stores embeddings and text chunks in JSON files for simplicity and portability
- Cosine Similarity (Vector Search) - Implements a cosine similarity calculation for finding relevant text chunks
- Next.js API (Chat Endpoint) - RESTful API endpoint with rate limiting and error handling
- Rate Limiting (Request Management) - In-memory rate limiting to prevent abuse (10 requests per minute per IP)
Implementation Details
1. Data Preparation
I started by creating structured data chunks from my professional information:
[ { "id": "contact-001", "text": "Aniket Patil - Contact Information: Phone: +91 8421015314...", "source": "resume", "meta": { "category": "contact", "tags": ["contact-info", "phone", "email"] } }, { "id": "experience-flipick-001", "text": "Current Role - Flipick Pvt Ltd, Pune: Team Lead...", "source": "experience", "meta": { "category": "experience", "tags": ["flipick", "team-lead", "ai-solutions"] } } ]
2. Embedding Generation
The embedding generation script processes chunks and creates vector representations:
```typescript
// AWS Bedrock Titan Text Embeddings V2
const MODEL_ID = 'amazon.titan-embed-text-v2:0';

const embedding = await generateEmbedding(chunk.text); // Returns a 1024-dimensional vector

embeddingsData.push({
  id: chunk.id,
  text: chunk.text,
  embedding: embedding,
  source: chunk.source,
  wordCount: chunk.text.split(' ').length
});
```
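For reference, here's a minimal sketch of what a `generateEmbedding` helper can look like with the AWS SDK v3 Bedrock runtime client. My actual setup differs slightly, so treat this as illustrative rather than a verbatim copy:

```typescript
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
} from '@aws-sdk/client-bedrock-runtime';

// Credentials come from the environment / IAM role, never hardcoded
const bedrockClient = new BedrockRuntimeClient({
  region: process.env.AWS_REGION ?? 'us-east-1',
});

async function generateEmbedding(text: string): Promise<number[]> {
  const command = new InvokeModelCommand({
    modelId: 'amazon.titan-embed-text-v2:0',
    contentType: 'application/json',
    accept: 'application/json',
    // Titan Text Embeddings V2 takes an inputText field and
    // returns a 1024-dimensional vector by default
    body: JSON.stringify({ inputText: text }),
  });

  const response = await bedrockClient.send(command);
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  return payload.embedding;
}
```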
3. Similarity Search
When a user asks a question, the system finds the most relevant chunks:
- Converts the query to embeddings using the same Titan model
- Performs vector similarity search using cosine similarity
- Retrieves top-k most relevant chunks with scores
- Constructs context for the language model
```typescript
// Generate embedding for user query
const userEmbedding = await generateEmbedding(message);

// Calculate cosine similarity with all chunks
const similarities = embeddings.map(chunk => ({
  ...chunk,
  similarity: cosineSimilarity(userEmbedding, chunk.embedding)
}));

// Sort by similarity and take top 4
similarities.sort((a, b) => b.similarity - a.similarity);
const topChunks = similarities.slice(0, 4);
```
Chat API Implementation
The chat endpoint handles requests with proper error handling and rate limiting:
```typescript
import { NextRequest, NextResponse } from 'next/server';

export async function POST(request: NextRequest) {
  // Rate limiting check
  const ip = request.ip || 'unknown';
  if (!checkRateLimit(ip)) {
    return NextResponse.json(
      { error: 'Rate limit exceeded. Please try again later.' },
      { status: 429 }
    );
  }

  // Parse the user's message from the request body
  const { message } = await request.json();

  // Generate user query embedding
  const userEmbedding = await generateEmbedding(message);

  // Find similar chunks
  const topChunks = findTopSimilarChunks(userEmbedding, embeddings);

  // Generate response using Mistral
  const response = await generateMistralResponse(message, topChunks);

  return NextResponse.json({ response });
}
```
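The `generateMistralResponse` helper wraps the non-streaming Bedrock call. Here's a minimal sketch, assuming the same `bedrockClient` as in the embedding sketch and Mistral's `[INST]` prompt format; the context layout is an assumption, not my exact prompt:

```typescript
import { InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

async function generateMistralResponse(
  message: string,
  topChunks: { text: string }[]
): Promise<string> {
  // Join the retrieved chunks into a single context block
  const context = topChunks.map(c => c.text).join('\n\n');

  const command = new InvokeModelCommand({
    modelId: 'mistral.mistral-7b-instruct-v0:2',
    contentType: 'application/json',
    accept: 'application/json',
    body: JSON.stringify({
      prompt: `<s>[INST] Answer using only this context:\n${context}\n\nQuestion: ${message} [/INST]`,
      max_tokens: 1000,
      temperature: 0.7,
    }),
  });

  const response = await bedrockClient.send(command);
  const payload = JSON.parse(new TextDecoder().decode(response.body));
  // Mistral on Bedrock returns { outputs: [{ text, stop_reason }] }
  return payload.outputs?.[0]?.text ?? '';
}
```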
Rate Limiting Strategy
I implemented a simple but effective rate limiting system (a sketch of the limiter follows the list):
- 10 requests per minute per IP address
- In-memory tracking using Map data structure
- Automatic cleanup of expired entries
- 429 status code for exceeded limits
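Here's roughly what that limiter looks like. This is a minimal sketch assuming a fixed one-minute window, not a verbatim copy of my implementation:

```typescript
const WINDOW_MS = 60_000;  // 1-minute window
const MAX_REQUESTS = 10;   // per IP per window

// Map of IP -> timestamps of requests inside the current window
const requestLog = new Map<string, number[]>();

function checkRateLimit(ip: string): boolean {
  const now = Date.now();
  // Dropping expired timestamps on each call doubles as cleanup
  const recent = (requestLog.get(ip) ?? []).filter(t => now - t < WINDOW_MS);

  if (recent.length >= MAX_REQUESTS) {
    requestLog.set(ip, recent);
    return false; // over the limit -> caller responds with 429
  }

  recent.push(now);
  requestLog.set(ip, recent);
  return true;
}
```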
Advanced Features
Streaming Response Implementation
For a better user experience, I implemented real-time streaming responses using AWS Bedrock's native streaming API:
```typescript
// Native AWS Bedrock streaming implementation
const bedrockInput: InvokeModelWithResponseStreamCommandInput = {
  modelId: 'mistral.mistral-7b-instruct-v0:2',
  contentType: 'application/json',
  accept: 'application/json',
  body: JSON.stringify({
    prompt: `<s>[INST] ${systemPrompt}\n\nHuman: ${message} [/INST]`,
    max_tokens: 1000,
    temperature: 0.7,
  }),
};

const bedrockCommand = new InvokeModelWithResponseStreamCommand(bedrockInput);
const bedrockResponse = await bedrockClient.send(bedrockCommand);

// Stream response chunks in real-time (this loop runs inside a ReadableStream,
// where `controller` enqueues SSE events and `encoder` is a TextEncoder)
for await (const chunk of bedrockResponse.body) {
  if (chunk.chunk?.bytes) {
    const decodedChunk = new TextDecoder().decode(chunk.chunk.bytes);
    const parsedChunk = JSON.parse(decodedChunk);
    if (parsedChunk.outputs?.[0]?.text) {
      // Send text chunk to frontend immediately
      controller.enqueue(
        encoder.encode(`data: ${JSON.stringify({ text: parsedChunk.outputs[0].text })}\n\n`)
      );
    }
  }
}
```
Session Management
The chat interface includes persistent session management with localStorage:
```typescript
// Session persistence in localStorage
useEffect(() => {
  const saved = localStorage.getItem('aniket-chat-conversation');
  if (saved) {
    const parsed = JSON.parse(saved);
    setState(prev => ({
      ...prev,
      messages: parsed.messages.map(msg => ({
        ...msg,
        timestamp: new Date(msg.timestamp)
      }))
    }));
  }
}, []);

// Auto-save on message changes
useEffect(() => {
  if (state.messages.length > 0) {
    localStorage.setItem('aniket-chat-conversation', JSON.stringify({
      messages: state.messages,
      lastUpdated: new Date().toISOString()
    }));
  }
}, [state.messages]);

// New session functionality
const clearSession = () => {
  setState({ messages: [], isLoading: false });
  localStorage.removeItem('aniket-chat-conversation');
};
```
Frontend Integration
The React frontend handles streaming responses with real-time updates (a consumption sketch follows the list):
- Server-Sent Events - Parses the SSE-formatted stream from the chat endpoint using a fetch stream reader (the native EventSource API only supports GET requests)
- Real-time updates - Updates UI as chunks arrive
- Error handling - Falls back to regular API if streaming fails
- Source citations - Shows relevant document sources for responses
- Context chunks - Displays relevant text excerpts used for generation
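One way to consume the `data: ...` events from a POST endpoint is with fetch's stream reader. This is a hedged sketch of the consuming side; the endpoint path, payload shape, and function names are assumptions:

```typescript
// Hypothetical consumer for the SSE-style stream emitted by the chat route
async function streamChat(message: string, onText: (text: string) => void) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
  });
  if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE events are separated by a blank line; each event starts with "data: "
    const events = buffer.split('\n\n');
    buffer = events.pop() ?? ''; // keep any partial event for the next read
    for (const event of events) {
      if (event.startsWith('data: ')) {
        const payload = JSON.parse(event.slice(6));
        if (payload.text) onText(payload.text); // append chunk to the UI
      }
    }
  }
}
```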
Challenges and Solutions
Challenge 1: AWS Credentials Management
Managing AWS credentials securely in a Next.js application.
Solution: Used AWS SDK with server-side credential loading and proper error handling for missing credentials.
Challenge 2: Embedding Generation Speed
AWS Bedrock has rate limits and embedding generation takes time.
Solution: Added 100ms delays between requests and implemented batch processing where possible.
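The pacing itself is simple; here's a sketch of the loop, reusing the hypothetical `generateEmbedding` helper from earlier and assuming a `chunks` array:

```typescript
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

for (const chunk of chunks) {
  const embedding = await generateEmbedding(chunk.text);
  embeddingsData.push({ id: chunk.id, text: chunk.text, embedding, source: chunk.source });
  await sleep(100); // stay under Bedrock's request rate limits
}
```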
Challenge 3: Context Window Management
Balancing relevant context with token limits for language models.
Solution: Limited to top-4 chunks and truncated long texts to stay within token limits.
Key Technical Components
Cosine Similarity Implementation
```typescript
function cosineSimilarity(vecA: number[], vecB: number[]): number {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
```
Error Handling
Robust error handling keeps the chat usable when AWS services or models misbehave (a fallback sketch follows the list):
- Model fallback - Tries multiple Mistral models if one fails
- Rate limit errors - Returns proper HTTP status codes
- Credential errors - Clear error messages for setup issues
- Validation errors - Input validation and length limits
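A minimal sketch of the fallback-chain idea; the candidate model IDs and the `invokeModel` wrapper are assumptions for illustration:

```typescript
// Hypothetical ordered list of Bedrock Mistral model IDs to try
const FALLBACK_MODELS = [
  'mistral.mistral-7b-instruct-v0:2',
  'mistral.mixtral-8x7b-instruct-v0:1',
];

async function generateWithFallback(prompt: string): Promise<string> {
  let lastError: unknown;
  for (const modelId of FALLBACK_MODELS) {
    try {
      return await invokeModel(modelId, prompt); // hypothetical InvokeModelCommand wrapper
    } catch (err) {
      lastError = err; // e.g. throttling or model-access errors; try the next model
    }
  }
  throw lastError;
}
```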
Performance Characteristics
What Works Well
- Accurate responses about my experience and skills
- Cost effective - Only pay for actual API calls
- Simple to maintain - No complex infrastructure
- Fast development - Quick to iterate and test
AWS IAM Policies and Security
Proper AWS IAM permissions are crucial for the RAG system to function securely.
Security Best Practices
- Principle of Least Privilege - Only grant access to specific models you need
- Environment Variables - Never hardcode AWS credentials in your code
- Server-side Only - Keep AWS credentials on the server-side, never expose them to the client
- Model Access - Ensure models are enabled in your AWS Bedrock console
- Region Specification - Specify the correct AWS region for your Bedrock models
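As an illustration, a least-privilege policy for this setup could look like the following; the region and model list are assumptions you'd adjust to your own account:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
        "arn:aws:bedrock:us-east-1::foundation-model/mistral.mistral-7b-instruct-v0:2",
        "arn:aws:bedrock:us-east-1::foundation-model/mistral.mixtral-8x7b-instruct-v0:1"
      ]
    }
  ]
}
```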
Additional Production Features
Error Handling and Fallbacks
Robust error handling ensures the chat system remains reliable:
- Model Fallback Chain - Tries multiple Mistral models if one fails
- Streaming Fallback - Falls back to regular API if streaming fails
- Rate Limit Handling - Returns proper HTTP status codes (429) for exceeded limits
- Input Validation - Validates message length and content before processing
- AWS Error Handling - Specific error messages for credential and quota issues
Performance Optimizations
Several optimizations ensure fast and cost-effective operation:
- Embedding Caching - Embeddings are pre-computed and stored in JSON files
- Top-K Retrieval - Only retrieves top 4 most relevant chunks per query
- Token Limits - Enforces reasonable input/output token limits
- Context Truncation - Truncates long text chunks to fit within limits (see the sketch after this list)
- Batch Processing - Processes embedding generation in batches with delays
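For example, the context-building step can cap each chunk before prompting the model. A minimal sketch, with the character budget and function name as assumptions:

```typescript
const MAX_CHARS_PER_CHUNK = 1200; // assumed budget; tune against the model's token limit

function buildContext(topChunks: { text: string; source: string }[]): string {
  return topChunks
    .map(c => {
      const text = c.text.length > MAX_CHARS_PER_CHUNK
        ? c.text.slice(0, MAX_CHARS_PER_CHUNK) + '…'
        : c.text;
      return `[${c.source}] ${text}`; // prefix with the source for citations
    })
    .join('\n\n');
}
```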
User Experience Enhancements
The frontend provides a polished chat experience:
- Real-time Streaming - Responses appear as they're generated
- Source Citations - Shows which documents were used for each response
- Context Preview - Expandable context chunks show relevant excerpts
- Session Persistence - Conversations are saved and restored automatically
- Responsive Design - Works seamlessly on desktop and mobile devices
- Loading States - Clear feedback during processing and streaming
Future Improvements
While the current implementation works well, there are several areas for enhancement:
- Persistent storage - Move from JSON files to a proper database
- Better rate limiting - Use Redis for distributed rate limiting
- Response caching - Cache frequent queries to reduce costs
- Conversation history - Track chat history for better context
- Advanced chunking - Implement smarter text splitting strategies
- Multi-user support - Add user authentication and personalized chats
- Analytics - Track usage patterns and performance metrics