Most businesses don't have an AI problem - they have an integration problem. The AI services exist, the APIs are documented, and the pricing is accessible. What's missing is a clear path from "we want to use AI" to "AI is running inside our actual workflows." That gap is almost always a systems integration challenge, not a technology one.
This article covers the practical mechanics of API AI integration: how to connect AI services to your existing stack without rebuilding everything, where the common failure points are, and what good architecture looks like in production.
Understanding What You're Actually Connecting
Before writing a single line of integration code, map out what you have. Most organisations are working with a combination of:
- Core business systems - CRMs, ERPs, accounting platforms, databases
- Communication tools - email, Slack, Teams, ticketing systems
- Data stores - structured databases, document storage, data warehouses
- Existing APIs - internal services, third-party platforms, legacy endpoints
AI services like OpenAI, Anthropic, Google Vertex AI, and AWS Bedrock all expose REST APIs. That means if your existing system can make an HTTP request, it can call an AI service. The technical barrier is lower than most teams expect.
The real complexity sits in three areas: authentication and credential management, data formatting and transformation, and handling the asynchronous nature of some AI workloads. Get these right and the rest is plumbing.
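To make the "anything that can make an HTTP request" point concrete, here's a minimal sketch of a raw chat call. It assumes the OpenAI chat completions endpoint, the requests library, and an API key in an environment variable; other providers follow the same shape with different URLs and payloads.

```python
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"

def build_chat_request(prompt, model="gpt-4o-mini"):
    # Assemble the headers and JSON body for an OpenAI-style chat call.
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return headers, payload

def call_chat_api(prompt):
    headers, payload = build_chat_request(prompt)
    resp = requests.post(API_URL, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

That's the whole integration surface at its simplest: an authenticated POST and some JSON parsing. Everything else in this article is about doing it safely at scale.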
Authentication and Credential Management Done Properly
Virtually every AI API uses key-based authentication: you pass an API key in a request header, typically as a Bearer token. The mistake most teams make is treating this key like a config file variable rather than a secret.
Authorization: Bearer sk-your-api-key-here
In production, API keys should never appear in source code, environment files committed to version control, or client-side JavaScript. Use a secrets manager - AWS Secrets Manager, Azure Key Vault, HashiCorp Vault, or even a properly configured environment variable system in your deployment pipeline.
A practical pattern for most Australian mid-market organisations:
- Store API keys in your cloud provider's secrets manager
- Grant your application's IAM role read access to those secrets
- Pull the key at runtime, not at build time
- Rotate keys on a schedule (quarterly is a reasonable default)
- Set up usage alerts so you notice anomalous spend before it becomes a billing problem
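The runtime-fetch step above can be sketched as a small in-process key loader. The cache wrapper itself is plain stdlib; the AWS Secrets Manager wiring shown in the comment is the assumed production fetch and the secret name is illustrative.

```python
import functools

def make_key_loader(fetch_secret):
    # Wrap the secrets-manager fetch in an in-memory cache so the key is
    # pulled once at runtime, not at build time and not on every request.
    @functools.lru_cache(maxsize=None)
    def load(secret_name):
        return fetch_secret(secret_name)
    return load

# Assumed production wiring (AWS Secrets Manager via boto3):
# import boto3
# sm = boto3.client("secretsmanager")
# get_key = make_key_loader(
#     lambda name: sm.get_secret_value(SecretId=name)["SecretString"]
# )
# api_key = get_key("prod/openai-api-key")  # hypothetical secret name
```

Because the fetch function is injected, key rotation only requires restarting the process (or adding a TTL to the cache), and the loader is trivial to unit test with a stub.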
Rate limiting is the other authentication-adjacent concern. Most AI APIs enforce both requests-per-minute and tokens-per-minute limits. Build retry logic with exponential backoff into your integration layer from day one. A simple implementation:
import time
import random

from openai import RateLimitError  # or your provider's equivalent exception

def call_with_retry(api_func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return api_func()
        except RateLimitError:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
    raise Exception("Max retries exceeded")
Structuring Your Integration Layer
The most common mistake in API AI integration projects is coupling AI calls directly to application code. A product manager asks for a feature, a developer adds an OpenAI call inline, and six months later you have AI logic scattered across 40 files with no consistent error handling, logging, or cost visibility.
The better approach is a dedicated integration layer - a service or module that owns all AI interactions. This gives you:
- A single place to swap providers - if you move from OpenAI to Anthropic, you change one file, not forty
- Centralised logging - every prompt, response, latency, and token count goes through one place
- Consistent error handling - timeouts, malformed responses, and content policy rejections are handled uniformly
- Cost tracking - you can attribute spend to specific features or teams
In practice, this looks like an internal AI service class or a lightweight internal API. Here's a simplified structure:
import time

class AIService:
    def __init__(self, client, model, logger):
        self.client = client
        self.model = model
        self.logger = logger

    def _build_messages(self, prompt, system_message):
        messages = []
        if system_message:
            messages.append({"role": "system", "content": system_message})
        messages.append({"role": "user", "content": prompt})
        return messages

    def complete(self, prompt, system_message=None, max_tokens=500):
        start = time.time()
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self._build_messages(prompt, system_message),
            max_tokens=max_tokens
        )
        self.logger.log({
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "latency_ms": (time.time() - start) * 1000,
            "model": self.model
        })
        return response.choices[0].message.content
This pattern works whether you're building in Python, Node.js, or C#. The principle is the same: AI is a dependency, and dependencies should be injected and abstracted.
Data Transformation and Context Management
AI APIs accept text. Your systems store structured data. The translation layer between these two realities is where most integration projects get messy.
Consider a practical scenario: a logistics company wants to use AI to summarise customer complaint tickets before they reach a support agent. Their CRM stores ticket data across six tables - customer record, order history, previous tickets, product details, shipping events, and agent notes.
The naive approach is to dump all of this into a prompt and hope the model figures it out. The production approach is to build a context assembly function that:
- Queries only the relevant fields for the current ticket
- Formats them into a consistent structure the model handles well
- Applies a token budget so you don't exceed context limits
- Includes a system message that tells the model its role and output format
def build_ticket_context(ticket_id, max_tokens=2000):
    # get_ticket, get_customer, get_recent_orders and format_orders are
    # illustrative data-access helpers for the CRM described above.
    ticket = get_ticket(ticket_id)
    customer = get_customer(ticket.customer_id)
    recent_orders = get_recent_orders(ticket.customer_id, limit=3)
    context = f"""\
Customer: {customer.name} (since {customer.created_date})
Issue category: {ticket.category}
Ticket description: {ticket.description}
Recent orders:
{format_orders(recent_orders)}
"""
    return truncate_to_token_budget(context, max_tokens)
Token budgeting matters for two reasons: cost and reliability. Models behave better with focused, relevant context than with everything you can throw at them.
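The truncate_to_token_budget helper referenced above can start out very simple. This is a sketch assuming roughly four characters per token as an estimate; in production you'd count real tokens with the provider's tokenizer (for example, tiktoken for OpenAI models) rather than guessing.

```python
def truncate_to_token_budget(text, max_tokens, chars_per_token=4):
    # Crude budget check using a ~4-characters-per-token heuristic.
    # Swap in a real tokenizer count for production use.
    budget_chars = max_tokens * chars_per_token
    if len(text) <= budget_chars:
        return text
    truncated = text[:budget_chars]
    # Cut on a line boundary so the model never sees half a field.
    return truncated.rsplit("\n", 1)[0]
```

Cutting on a line boundary matters more than it looks: a context that ends mid-field ("Recent orders: Order #4") invites the model to hallucinate the missing half.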
Handling Asynchronous Workloads
Not every AI task needs to happen in real time. Document analysis, batch summarisation, report generation, and data enrichment jobs are all better handled asynchronously.
Trying to run a 30-second AI processing job synchronously inside a web request is a reliable way to create timeout errors, frustrated users, and brittle infrastructure. The pattern to use instead:
- Accept the request and return a job ID immediately
- Push the work to a queue (SQS, RabbitMQ, Azure Service Bus)
- A worker process picks up the job, calls the AI API, and stores the result
- The client polls for completion or receives a webhook notification
This architecture also gives you natural rate limit management. Your worker can process jobs at a controlled pace rather than hammering the API with concurrent requests.
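The worker side of that pattern is just a drain loop. This sketch uses an in-process queue.Queue and a plain dict as stand-ins for the message queue and the results store, purely to show the shape; in production those would be SQS or RabbitMQ and a table such as DynamoDB, and the loop would poll forever instead of exiting when idle.

```python
import queue

def worker_loop(jobs, results, process, poll_timeout=0.1):
    # Pull jobs at a controlled pace; each job carries an id and a payload.
    while True:
        try:
            job = jobs.get(timeout=poll_timeout)
        except queue.Empty:
            return  # demo only - a real worker keeps polling
        try:
            output = process(job["payload"])  # the AI API call goes here
            results[job["id"]] = {"status": "done", "output": output}
        except Exception as exc:
            results[job["id"]] = {"status": "failed", "error": str(exc)}
```

Note that failures are recorded against the job ID rather than raised: the client polling for that ID gets a definitive "failed" status instead of waiting on a job that silently died.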
For Australian businesses running on AWS, a practical stack is API Gateway → Lambda → SQS → a Lambda worker → DynamoDB for results. The whole thing can be set up with infrastructure-as-code in a few hours and scales with minimal operational overhead.
Observability and Cost Control in Production
API AI integration doesn't end at deployment. In production, you need visibility into what's happening, what it's costing, and when things go wrong.
Minimum viable observability for an AI integration:
- Latency tracking - p50, p95, p99 response times per endpoint
- Error rates - distinguish between API errors, timeout errors, and content policy rejections
- Token consumption - by feature, by user tier, by time period
- Cost attribution - map spend back to business functions
Most AI providers give you usage dashboards, but these don't map to your application's structure. Build your own logging from the start. A structured log entry per AI call - including the feature that triggered it, the model used, token counts, and latency - gives you the data to optimise costs and diagnose problems.
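One workable shape for that per-call log entry is sketched below. The per-1K-token prices here are placeholders, not real rates - pull current numbers from your provider's price list - and the token counts are assumed to come from the usage block that OpenAI-style SDK responses include.

```python
import json
import time

# Placeholder pricing in USD per 1K tokens - substitute your provider's
# actual published rates before trusting the cost figures.
PRICE_PER_1K = {"example-model": {"prompt": 0.0005, "completion": 0.0015}}

def ai_call_log(feature, model, prompt_tokens, completion_tokens, started_at):
    prices = PRICE_PER_1K.get(model, {"prompt": 0.0, "completion": 0.0})
    cost = (prompt_tokens * prices["prompt"]
            + completion_tokens * prices["completion"]) / 1000
    # One structured JSON line per AI call, attributable to a feature.
    return json.dumps({
        "feature": feature,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "estimated_cost_usd": round(cost, 6),
        "latency_ms": round((time.time() - started_at) * 1000, 1),
    })
```

Because every entry carries a feature name, summing estimated_cost_usd grouped by feature in your log aggregator answers "what is this costing us, per workflow" without touching the provider's dashboard.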
Cost control in practice means setting hard limits. Most providers support spending caps or budget alerts - OpenAI and Anthropic expose usage limits in their dashboards, and AWS spend can be watched with billing alarms. Use them. Also consider:
- Caching responses for identical or near-identical prompts
- Using smaller, cheaper models for simpler tasks (classification, extraction) and larger models only where quality justifies the cost
- Implementing user-level rate limits if you're building a multi-tenant product
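Response caching for identical prompts can be as small as a hash-keyed dict in front of the call. This sketch handles exact matches only - near-identical prompts would need normalisation (whitespace, casing) before hashing - and caching is only safe for deterministic tasks like classification or extraction, not creative generation.

```python
import hashlib

class PromptCache:
    def __init__(self):
        self._store = {}

    def _key(self, model, prompt):
        # Hash model and prompt together so a model swap never serves
        # stale responses from the old model.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_call(self, model, prompt, call_api):
        k = self._key(model, prompt)
        if k not in self._store:
            self._store[k] = call_api(prompt)  # only hit the API on a miss
        return self._store[k]
```

In production you'd back this with Redis or DynamoDB and add a TTL, but the shape is the same: hash, look up, call only on a miss.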
A mid-sized professional services firm we worked with reduced their AI API spend by 40% simply by routing document classification tasks to a smaller model and only escalating to their primary model when the confidence score fell below a threshold. The quality difference for end users was negligible.
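That escalation pattern is straightforward to sketch. The two model functions here are hypothetical stand-ins that return a (label, confidence) pair; the threshold is something you'd tune against a labelled sample of your own documents, not a universal constant.

```python
def classify_with_escalation(text, cheap_model, strong_model, threshold=0.8):
    # Try the cheap model first; escalate only when it is unsure.
    label, confidence = cheap_model(text)
    if confidence >= threshold:
        return label, "cheap"
    label, _ = strong_model(text)
    return label, "strong"
```

The second element of the return value records which tier handled the request, so your logs can tell you what fraction of traffic actually escalates - the number that determines whether the routing pays for itself.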
What to Do Next
If you're planning an API AI integration project, start here:
This week:
- Audit your existing systems and identify the three workflows where AI would have the highest impact
- Check whether those systems expose APIs or webhooks you can hook into
- Set up a development API key with a hard spend limit and start testing your target AI service against real data samples
Before you build:
- Define your integration layer architecture before writing feature code
- Establish logging and observability requirements upfront, not as an afterthought
- Decide on your secrets management approach and document it
When you're ready to scale:
- Run a load test against your integration layer before going to production
- Set up cost alerts at 50% and 80% of your expected monthly budget
- Document your prompt templates and context assembly logic as if someone else will maintain it - because they will
The organisations getting the most value from AI right now aren't the ones with the most sophisticated models. They're the ones who've done the integration work carefully, built observable systems, and connected AI to the workflows where it actually changes outcomes.
If you want a technical review of your current stack or help scoping an integration project, get in touch with the Exponential Tech team.