I’ve been playing around with some AI models to generate images. And let me tell you, there’s no shortage of them out there. Thanks to platforms like Hugging Face, I don’t even have to train them myself. Just download, plug, and play. Sounds easy, right? Well, there’s a tiny little catch: these models are huge. We’re talking 10GB per model. That’s like downloading an entire season of your favorite TV show.
Now, to run these models, I need a lot of RAM, a ton of hard drive space, and, oh yeah, some GPUs. Not just any GPUs, but the kind that don’t come cheap. This is where things start getting interesting.
The Problem
Running these AI models requires GPUs. A lot of GPUs. And if you’ve ever checked the price of GPU instances on AWS or GCP, you’ll know they aren’t exactly giving them away. Renting these things by the hour is like renting a Ferrari: sure, it’s fast, but forget to return it on time and your wallet will cry.
So, I need a setup that allows my web application to use AI models without keeping these expensive machines running 24/7. The goal is simple: run the GPUs only when needed, and shut them down the moment the job is done. But how do I pull that off without manually starting and stopping instances every time? (Because let’s be honest, I’ll forget, and my credit card will suffer.)
Oh, and there’s another problem: I can’t handle parallel requests efficiently. If multiple users try to generate images at the same time, I’d need even more GPU RAM to process them simultaneously. And guess what? More GPU RAM means upgrading to a bigger, more expensive instance, which means doubling (or worse) the costs. So, yeah, parallel requests are a luxury I can’t afford… literally.
The Solution
The answer? Two separate applications:
- A web application that’s always online, ready to receive requests.
- A backend service that runs the AI models but only spins up when needed.
So, how do I control when to start and stop this second machine? Manually? No way. That’s a recipe for disaster (or, at least, an unexpected cloud bill). Instead, I use asynchronous messaging.
Both AWS and GCP provide messaging systems (SQS on AWS, Pub/Sub on GCP), and their queue metrics can drive an auto-scaling group. Here’s how it works (rough sketches follow the list):
- The web app sends a message to the queue when it needs to generate an image.
- If no GPU instance is running, the growing queue backlog triggers an auto-scaling rule that starts one.
- The GPU instance reads the queue, processes the requests, and, once done, shuts itself down. (Or, if startup time is too slow, it can wait for a few minutes before shutting down, just in case another request comes in.)
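To make the first two steps concrete, here’s a minimal sketch of the web-app side on AWS using boto3. The queue URL (and the image-jobs queue name inside it) is a placeholder for whatever queue you actually create. The scale-out itself lives outside the code: a CloudWatch alarm on the queue’s ApproximateNumberOfMessagesVisible metric tells the auto-scaling group to go from zero to one instance whenever messages are waiting.

```python
import json
import uuid

import boto3

# Placeholder queue URL for this sketch; swap in your real region/account/queue.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/image-jobs"

sqs = boto3.client("sqs")


def enqueue_image_job(prompt: str) -> str:
    """Called by the always-on web app: push a generation request onto the queue."""
    job_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "prompt": prompt}),
    )
    # The web app returns immediately; the GPU worker picks the job up whenever
    # the auto-scaling group brings an instance online.
    return job_id
```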
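And here’s a sketch of the other half: the worker that runs on the GPU instance. Again this assumes boto3, and the queue URL, the gpu-workers auto-scaling group name, and the generate_image() callable are all placeholders for my real setup. The worker long-polls the queue, processes whatever it finds, and once it has been idle for a few minutes it scales the group back down to zero, which terminates the very instance it’s running on.

```python
import json
import time

import boto3

# Placeholders for this sketch: real queue URL and auto-scaling group name go here.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/image-jobs"
ASG_NAME = "gpu-workers"
IDLE_SHUTDOWN_SECONDS = 300  # linger a few minutes in case more requests arrive


sqs = boto3.client("sqs")
autoscaling = boto3.client("autoscaling")


def run_worker(generate_image) -> None:
    """Runs on the GPU instance: drain the queue, then scale the group to zero."""
    idle_since = time.monotonic()
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling keeps the polling loop cheap
        )
        messages = resp.get("Messages", [])
        if not messages:
            if time.monotonic() - idle_since > IDLE_SHUTDOWN_SECONDS:
                # Queue has been empty long enough: shrink the group, which
                # terminates this instance and stops the billing clock.
                autoscaling.set_desired_capacity(
                    AutoScalingGroupName=ASG_NAME, DesiredCapacity=0
                )
                return
            continue
        for msg in messages:
            job = json.loads(msg["Body"])
            generate_image(job["prompt"], job["job_id"])
            # Delete only after the work is done, so a crash re-queues the job
            # once the visibility timeout expires.
            sqs.delete_message(
                QueueUrL=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            ) if False else sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
        idle_since = time.monotonic()
```

A nice side effect: because the worker pulls one message at a time, simultaneous requests simply line up in the queue instead of demanding more GPU RAM, which is exactly the trade-off I was worried about above.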
Boom. Problem solved. The GPUs only run when they’re needed, and I don’t have to remember to turn them off. My credit card thanks me.
Conclusion
This setup works perfectly as long as:
- The GPU usage isn’t constant (if I need it 24/7, I might as well keep a dedicated instance running).
- The startup and shutdown times don’t introduce too much latency.
If my application starts demanding more GPU power or real-time inference, I’ll revisit this architecture. But for now, this keeps my AI application running efficiently without burning a hole in my pocket.
And that, my friends, is how you keep your AI-powered dreams alive without going broke.


