AWS and Coursera are right about AI upskilling, but they’re underselling the hard part
New research from AWS and Coursera lands on a point most engineering teams already know from experience: AI projects slow down when the cloud, data, and security work underneath them is weak.
Companies still talk about AI skills as though the main gap is prompt engineering, model choice, or picking the right copilots. For teams shipping production systems, that’s rarely the hard part. The harder work sits lower in the stack: provisioning GPU-heavy workloads without burning cash, keeping training and retrieval data clean enough to trust, and securing a system that now spans model APIs, vector stores, secrets vaults, third-party services, and whatever internal data source got wired into a RAG pipeline last week.
If you’re running AI programs in 2026, the dependency chain is obvious. Cloud architecture, data governance, and cybersecurity sit inside the AI stack now.
Infrastructure discipline is the bottleneck
Enterprise AI is past the toy-demo stage. LLM apps, retrieval pipelines, fine-tuning jobs, and multimodal inference all assume a cloud-native runtime. That means elastic compute, fast storage, sane networking, and identity controls that hold up under load.
Teams hit the same problems over and over:
- GPU instances are available, but orchestration is sloppy and utilization is poor.
- The model works in testing, then retrieval quality falls apart because source data is stale, duplicated, or badly chunked.
- A prototype ships fast through a public endpoint with broad IAM permissions, then security has to clean up later.
- Costs spike because nobody set quotas, autoscaling rules, or observability around token usage and inference latency.
The AWS and Coursera framing is useful because it pulls the discussion back to system design. The people building AI products need working knowledge of VPCs, IAM, KMS, managed Kubernetes or equivalent orchestration, streaming ingestion, lineage, encryption, red teaming, policy enforcement, and runtime telemetry. That’s standard production work now.
Cloud skills now include GPU economics
The cloud part is bigger than learning AWS services. AI workloads have odd cost profiles, bursty demand, and very little tolerance for bad architecture.
A standard web app can absorb some waste. An LLM-backed service running on scarce accelerators usually can’t.
Teams need to think about:
- GPU scheduling and bin-packing on Kubernetes or managed platforms
- Autoscaling policies that reflect real inference traffic, not generic CPU thresholds
- Quantization and lower precision inference where quality still holds up
- Network design for private access to model endpoints, data stores, and internal services
- Storage throughput for training and embedding pipelines
- FinOps discipline tied to AI workloads, not a monthly billing autopsy
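The FinOps point is easy to make concrete with a rough serving-cost model. A minimal sketch, with entirely illustrative numbers (the hourly rate and throughput are assumptions, not real AWS pricing):

```python
def cost_per_million_tokens(hourly_rate_usd: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Rough serving cost: instance $/hr divided by effective token throughput.

    Plug in your own measured throughput and the on-demand or reserved
    rate for your accelerator; these are placeholders.
    """
    effective_tps = tokens_per_second * utilization
    tokens_per_hour = effective_tps * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# A hypothetical $12/hr GPU node pushing 400 tok/s at 30% utilization
# costs roughly 2.7x more per token than the same node kept at 80%.
low_util = cost_per_million_tokens(12.0, 400, 0.30)
high_util = cost_per_million_tokens(12.0, 400, 0.80)
```

Nothing sophisticated, but it makes the bin-packing and autoscaling bullets above a budgeting question rather than an abstraction: idle accelerators are the dominant cost term.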
This is one reason platform engineering is becoming central to enterprise AI. Companies that take this seriously are building internal AI platforms with golden paths: approved model gateways, standard telemetry, private networking defaults, central auth, and prebuilt templates for retrieval and evaluation. That reduces variance, which matters when teams are moving fast on expensive infrastructure.
It also avoids the usual mess where every product squad builds its own shaky MLOps stack.
Data work is still what teams underestimate
Bad data has always killed ML projects. Generative AI just gives teams a few more ways to break things.
RAG pipelines look forgiving at first glance. Ingest documents, create embeddings, store vectors, query at runtime. In practice, output quality depends on basic data engineering discipline that many AI teams still treat as somebody else’s job.
Data contracts matter. Schema validation matters. Lineage matters. Reproducible transformations matter. If you can’t say where a chunk of retrieval data came from, when it changed, whether it contains sensitive fields, and which embedding version produced it, you’re guessing.
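Those lineage questions map directly onto metadata you can attach to every retrieval chunk. A minimal sketch, with field names that are assumptions for illustration, not any standard:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ChunkLineage:
    """The questions a retrieval chunk should be able to answer: where it
    came from, when it changed, whether it is sensitive, and which
    embedding version produced its vector."""
    source_uri: str            # document or table the chunk was extracted from
    source_updated_at: datetime
    contains_pii: bool
    embedding_model: str       # e.g. "text-embedding-v2" (illustrative name)
    transform_version: str     # version of the chunking/cleaning pipeline

    def is_stale(self, max_age_days: int = 30) -> bool:
        """Flag chunks whose source hasn't been refreshed recently."""
        age = datetime.now(timezone.utc) - self.source_updated_at
        return age.days > max_age_days
```

Stamping this at ingestion time is cheap; reconstructing it after an incident is not.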
The source material points to lakehouse formats like Iceberg, Delta, and Hudi, along with CDC pipelines and validation gates. That’s the right direction. AI systems need the same boring controls analytics and ML platforms have needed for years, with tighter feedback loops and higher stakes when they fail.
A lightweight validation layer catches a lot early. A pandera schema in CI that rejects malformed training or retrieval data before it reaches model builds is simple and effective. The example in the source material does the right things: enforce schema shape, validate country and age ranges, and require a hashed email field so raw PII doesn’t leak into downstream exports.
That won’t solve every data problem. It does stop one of the worst habits in AI work: letting bad inputs slide because the deadline feels urgent.
Security gets stranger fast
Traditional app security was already hard. AI systems add stranger failure modes and more places to lose control.
The obvious problems are still there: weak IAM, public buckets, sloppy secrets handling, missing encryption. The AI stack adds more exposure:
- Prompt injection against retrieval-augmented apps
- Data exfiltration through model prompts and responses
- Unvetted third-party APIs in the inference path
- Vector stores holding sensitive embeddings with weak access controls
- Model artifacts and datasets entering the supply chain without signing or provenance checks
- Evaluation blind spots where safety regressions don’t block deployment
“Cybersecurity upskilling” sounds mild compared with the problem. Teams need operational security practices that fit AI workloads. Least-privilege IAM, private subnets, VPC endpoints, KMS-backed encryption, secret rotation, egress controls, and incident playbooks should be the default setup.
Policy-as-code matters here because it scales better than review meetings. An OPA/Rego rule that fails CI when a feature store bucket lacks KMS encryption or is exposed publicly is blunt, but blunt is fine. It works. Better that than hoping somebody catches the problem in a Terraform diff after midnight.
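A sketch of such a rule, assuming a simplified input document like `{"buckets": [{"name": ..., "public": ..., "kms_key_id": ...}]}` rather than raw Terraform plan output (the input shape and package name are illustrative):

```rego
package policy.feature_store

import rego.v1

# Deny any feature-store bucket that is publicly accessible.
deny contains msg if {
    some b in input.buckets
    b.public == true
    msg := sprintf("bucket %s is publicly accessible", [b.name])
}

# Deny any bucket without a KMS key configured for encryption.
deny contains msg if {
    some b in input.buckets
    not b.kms_key_id
    msg := sprintf("bucket %s lacks KMS encryption", [b.name])
}
```

CI fails when `deny` is non-empty. Blunt, as noted, but it turns the review meeting into a pipeline step.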
There’s also a compliance angle that’s getting harder to dodge. The EU AI Act, NIST AI RMF adoption, and sector-specific requirements are pushing teams to document model risk, data handling, and audit trails with a lot more rigor. If your AI program has no lineage, no access logs, and no answer on data residency, the production posture isn’t serious.
What a production AI stack actually needs
The most useful part of the AWS and Coursera framing is that it treats AI as a system.
A production enterprise LLM app usually needs all of this working together:
- Ingestion from transactional systems or streams into a lakehouse or equivalent storage layer
- Validation gates and data contracts to stop garbage before it spreads
- Tagging, encryption, and lineage for sensitive records
- Deterministic feature or embedding pipelines with versioned transforms
- A vector store with real access controls and auditability
- A model registry that tracks versions, eval results, datasets, and risk notes
- Deployment behind an API gateway with auth, rate limiting, and filtering
- Private networking and proper secret management
- Observability for latency, token consumption, failures, and safety metrics
- Continuous security testing, including prompt injection and jailbreak checks
Miss one of those layers and the system gets fragile quickly.
That’s why the skills-gap conversation matters. The modern AI stack cuts across backend, data, platform, ML, and security work in ways most org charts still don’t match.
A sensible 90-day plan
The source material outlines a 90-day upskilling path. It’s more realistic than most corporate training plans because it ties learning to delivery.
Weeks 1 to 3: build a secure runtime
Stand up a minimal AI service in a sandbox. Use a VPC, private subnets, secrets management, and a managed model endpoint. Add a basic retrieval API, tracing, and autoscaling.
The runtime needs to be real enough for networking, auth, and observability problems to show up early.
Weeks 4 to 6: tighten the data path
Add schema validation in CI. Set data contracts. Track lineage. Encrypt sensitive columns. Run automated PII checks on ingestion.
A lot of teams find out at this stage that they don’t actually know what’s feeding the model.
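The "automated PII checks" step can start as something as small as a regex pass over incoming text. A minimal sketch; the patterns are illustrative and nowhere near exhaustive, and a real deployment would use a dedicated DLP scanner:

```python
import re

# Deliberately simple patterns; real scanners cover far more PII types,
# formats, and locales than this.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_for_pii(text: str) -> list[str]:
    """Return the PII categories detected; gate ingestion on an empty list."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```

Even this crude version catches the embarrassing cases, and it gives the team a hook to swap in a proper scanner later without changing the pipeline shape.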
Weeks 7 to 9: enforce security baselines
Apply least privilege. Lock down private connectivity to data and model endpoints. Add prompt and response filters. Test egress controls. Run an AI red-team exercise.
If you haven’t tried to break your own RAG app, you should assume somebody else will.
Weeks 10 to 12: measure cost and reliability
Define SLOs for latency and error rates. Put dashboards and alerts in place. Add GPU quotas, scaling rules, and lower-precision inference where it makes sense.
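The SLO step can begin as a simple gate in a canary or CI check, before any dashboard exists. A sketch, assuming you already collect per-request latencies; the budget numbers are placeholders:

```python
import math

def p_quantile(samples: list[float], q: float) -> float:
    """Nearest-rank quantile; enough for a gate, not a metrics system."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[rank]

def slo_gate(latencies_ms: list[float],
             p95_budget_ms: float = 800.0,
             error_count: int = 0,
             error_budget: float = 0.01) -> bool:
    """True when the window is within budget; wire into a canary check."""
    p95 = p_quantile(latencies_ms, 0.95)
    error_rate = error_count / max(1, len(latencies_ms))
    return p95 <= p95_budget_ms and error_rate <= error_budget
```

Once this holds steadily, graduating to real dashboards, alerts, and GPU quota policies is an upgrade rather than a rescue.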
This is where pilots either become sustainable or turn into expensive hobbies.
What leaders should pay attention to
For tech leads and engineering managers, the takeaway is uncomfortable but straightforward: AI delivery depends on cross-functional maturity that many companies still don’t have.
You can’t staff this with a couple of ML engineers and hope the rest sorts itself out. Backend engineers need to care about private endpoints and service boundaries. Data engineers need to own contracts, lineage, and reproducibility. Platform teams need to provide paved roads instead of making every squad improvise. Security has to be in the build loop, not waiting at the release gate.
That’s why the AWS and Coursera research matters. It says something obvious, and that obvious point is still missing from budget plans and hiring discussions.
The model gets the demo. The foundation decides whether the system survives production.
Useful next reads and implementation paths
If this topic connects to a real workflow, these links give you the service path and a proof point.
- Build the data and cloud foundations that AI workloads need to run reliably.
- How pipeline modernization cut reporting delays by 63%.