Data Readiness 101 for AI Agents

Great agents are built on ordinary, well-described data. You don’t need a data lake or a year-long migration—just a handful of disciplined habits that make information findable, trustworthy, and usable. This guide shows non-experts how to prepare data so AI agents can act confidently without surprises.

TL;DR

Inventory sources and owners; mark what’s authoritative vs. reference.
Write light schemas (names, types, required fields) and save one example row.
Start with read-only access via service accounts; rotate and store secrets safely.
Add simple validations (not empty, dates in range, IDs unique) and stop on failure.
Handle sensitive data deliberately: mask in logs, restrict exports, keep an audit trail.

Why “Data Readiness” Matters

Agents turn goals into actions using tools—querying tables, reading files, calling APIs. If the underlying data is messy, undocumented, or hard to reach, agents produce guesswork and rework. By contrast, ordinary but consistent data lets agents do the boring parts quickly: pulling the right records, merging cleanly, and drafting outputs that humans can review with confidence.

Think minimum viable data (MVD): You’re not engineering a warehouse. You’re making today’s sources clear and dependable enough that an agent can do useful work this week.

1) Inventory Your Sources (and What’s “Authoritative”)

Start by mapping systems and the tables or endpoints agents will touch. Keep it short and plain English.

System	Primary tables/APIs	Authoritative for	Owner	Notes
SIS	`students`, `enrollments`	Student status	Registrar	IDs stable; dates in UTC
CRM	`contacts`, `activities`	Prospect touchpoints	Admissions Ops	Confirm emails are normalized
Helpdesk	`tickets`, `categories`	Issue history	IT Support	Category names drifted in 2023

Decide what’s authoritative (the source of truth), what’s reference (lookups like department codes), and what’s derived (reports or aggregates). Agents should prefer authoritative sources for facts they cite.
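If the inventory should be usable by an agent as well as by people, one option is a small lookup that returns the system to cite for a given fact. This is a minimal sketch built from the three rows in the table above; the `SOURCES` structure and the `authoritative_source` helper are hypothetical names chosen for illustration, not part of any framework.

```python
from typing import Optional

# Hypothetical in-memory copy of the inventory table above.
SOURCES = [
    {"system": "SIS", "tables": ["students", "enrollments"],
     "authoritative_for": "student status", "owner": "Registrar"},
    {"system": "CRM", "tables": ["contacts", "activities"],
     "authoritative_for": "prospect touchpoints", "owner": "Admissions Ops"},
    {"system": "Helpdesk", "tables": ["tickets", "categories"],
     "authoritative_for": "issue history", "owner": "IT Support"},
]

def authoritative_source(fact: str) -> Optional[dict]:
    """Return the inventory entry that is authoritative for `fact`, or None."""
    wanted = fact.strip().lower()
    for entry in SOURCES:
        if entry["authoritative_for"] == wanted:
            return entry
    # No authoritative system is recorded, so the agent should not guess.
    return None
```

Returning `None` for an unknown fact, rather than falling back to a default system, keeps the agent from citing a source nobody has vouched for.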

2) Lightweight Schemas (Names, Types, Required Fields)

Write the smallest useful description of each table or endpoint. Make field names predictable, types explicit, and required columns obvious. Include an example row to eliminate ambiguity.

{
  "table": "students",
  "primary_key": "student_id",
  "fields": {
    "student_id": { "type": "string", "required": true, "example": "S-001234" },
    "first_name": { "type": "string", "required": true },
    "last_name":  { "type": "string", "required": true },
    "email":      { "type": "string", "required": false },
    "status":     { "type": "string", "required": true, "enum": ["active","leave","alumni"] },
    "created_at": { "type": "datetime", "required": true, "timezone": "UTC" }
  }
}

Consistency here pays dividends later: agents can validate fields, operators can debug quickly, and new teammates won’t guess what a column means.
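To show how a light schema can be put to work, here is a minimal sketch that checks one row against a trimmed copy of the `students` schema above. The `check_row` function and its message format are assumptions made for illustration; it only covers required fields, string types, and enums.

```python
import json

# Trimmed copy of the schema above, kept inline so the example is self-contained.
SCHEMA = json.loads("""
{
  "table": "students",
  "primary_key": "student_id",
  "fields": {
    "student_id": {"type": "string", "required": true},
    "email":      {"type": "string", "required": false},
    "status":     {"type": "string", "required": true,
                   "enum": ["active", "leave", "alumni"]}
  }
}
""")

def check_row(row: dict, schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means the row passes."""
    problems = []
    for name, spec in schema["fields"].items():
        value = row.get(name)
        if value is None or value == "":
            # Missing values only matter for required fields.
            if spec.get("required"):
                problems.append(f"{name}: required field is missing")
            continue
        if spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"{name}: expected string, got {type(value).__name__}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
    return problems
```

Collecting every problem instead of stopping at the first one gives the reviewer the full picture in a single pass.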

3) Access & Least Privilege (Practical)

Agents don’t need broad access; they need the right access. Create service accounts with the minimum scopes, start read-only, and rotate credentials regularly. Store secrets outside of code (e.g., environment variables or a secrets manager).

service_account: svc-agent
systems:
  - name: helpdesk
    scopes: [read]
    notes: "Tier-0 analytics only; no ticket writes"
  - name: storage
    scopes: [write_draft]
    notes: "Draft PDFs to /Reports, approvals required for share"
rotation: 90_days
key_storage: "secrets manager"

Start read-only. Add write permissions only when the review path is clear and tested.
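A deny-by-default scope check and an environment-based secret lookup could look roughly like this. It is a sketch under assumptions: the `POLICY` dictionary mirrors the config above, and `is_allowed`, `load_secret`, and any environment variable name you pass in are hypothetical.

```python
import os

# Hypothetical in-code copy of the policy above: system -> granted scopes.
POLICY = {
    "helpdesk": {"read"},
    "storage": {"write_draft"},
}

def is_allowed(system: str, scope: str) -> bool:
    """True only if the scope was explicitly granted; unknown systems are denied."""
    return scope in POLICY.get(system, set())

def load_secret(name: str) -> str:
    """Read a credential from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing secret: {name}")
    return value
```

Because anything not listed is refused, adding a write scope later is an explicit, reviewable change to the policy rather than a side effect.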

4) Simple Validations (Stop on Failure)

Validations are small rules that catch big headaches. They’re easy to write and dramatically improve trust.

Not empty: The query returned at least one row.
Dates in range: The data matches the requested window.
Required columns present: e.g., id, created_at, status.
Primary key uniqueness: No duplicate IDs.

rules:
  - not_empty: { table: tickets }
  - date_range: { table: tickets, col: created_at, start: 2025-08-01, end: 2025-08-31 }
  - require_columns: { table: tickets, cols: ["id","category","created_at"] }
  - unique: { table: tickets, col: id }

If a rule fails, the agent should stop, log a reason, and route to a human reviewer. No silent improvisation.
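The four rules and the stop-on-failure behavior can be sketched in a few lines. This assumes rows arrive as a list of dictionaries with `created_at` already parsed to a `date`; the `ValidationError` class, the `validate_tickets` function, and the sample rows are hypothetical.

```python
from datetime import date

class ValidationError(Exception):
    """Raised when a rule fails so the caller can stop and hand off to a reviewer."""

# Illustrative rows only; real rows would come from the tickets query.
SAMPLE_ROWS = [
    {"id": 1, "category": "vpn", "created_at": date(2025, 8, 3)},
    {"id": 2, "category": "email", "created_at": date(2025, 8, 30)},
]

def validate_tickets(rows, required_cols, date_col, start, end, key):
    """Run the four checks in order; raise ValidationError on the first failure."""
    # not_empty: the query returned at least one row.
    if not rows:
        raise ValidationError("not_empty: query returned no rows")
    # require_columns: every row carries every required column.
    for col in list(required_cols) + [date_col, key]:
        if any(col not in row for row in rows):
            raise ValidationError(f"require_columns: missing {col}")
    # date_range: every row falls inside the requested window (inclusive).
    for row in rows:
        if not (start <= row[date_col] <= end):
            raise ValidationError(f"date_range: {row[date_col]} outside {start}..{end}")
    # unique: no duplicate primary keys.
    keys = [row[key] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValidationError(f"unique: duplicate values in {key}")
```

The caller wraps the call in `try`/`except ValidationError`, logs the message, and ends the run instead of continuing with partial data.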

5) PII & Policy (Be Deliberate)

Some fields deserve more caution. Map which columns contain sensitive data (PII, PHI, financial details). Mask them in logs, restrict exports, and limit where artifacts can be written. Keep track of who approved what.

Field	Type	Sensitivity	Masking	Export Allowed
email	string	PII	hash	internal only
dob	date	PII	redact	no
mrn	string	PHI	redact	no

log_shape: [timestamp, agent_id, tool, params_hash, rows, result_hash, actor]
masking:
  email: sha256
  dob: redacted
  mrn: redacted
retention:
  logs: 180 days
  artifacts: per system policy
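Applied in code, the masking rules above might look like the following sketch. The `MASKING` mapping, the `mask_record` function, and the `[REDACTED]` placeholder are hypothetical; the hash is a plain SHA-256 to match the config.

```python
import hashlib

# Hypothetical mapping that mirrors the masking config above.
MASKING = {"email": "sha256", "dob": "redacted", "mrn": "redacted"}

def mask_record(record: dict) -> dict:
    """Return a copy of `record` that is safe to write to logs."""
    safe = {}
    for field, value in record.items():
        rule = MASKING.get(field)
        if rule == "sha256":
            # Normalize first so the same address always yields the same digest.
            normalized = str(value).strip().lower().encode("utf-8")
            safe[field] = hashlib.sha256(normalized).hexdigest()
        elif rule == "redacted":
            safe[field] = "[REDACTED]"
        else:
            # Fields without a rule pass through unchanged.
            safe[field] = value
    return safe
```

A plain hash of an email address can be reversed by hashing candidate addresses and comparing, so teams with stricter requirements may prefer a keyed hash (HMAC) with the key held in the secrets manager.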

6) Artifacts to Prepare (Small but Mighty)

Package your readiness so agents (and people) can use it immediately:

data_sources.yaml
schemas/*.json
samples/*.csv
queries/*.sql
access_policies.md
validations.yml
glossary.md

These files aren’t bureaucracy; they’re shortcuts. They remove guesswork for whoever builds or operates the agent.

7) A Tiny Data Glossary (Make Terms Unambiguous)

Plain words beat tribal knowledge. Five minutes here prevents hours of debugging.

Authoritative: The source of truth for a fact (e.g., the SIS for enrollment status).
Reference: Lookups like department codes or categories.
Derived: Reports or aggregates built from source data.
Primary key: A unique field that identifies a record (e.g., student_id).
Freshness: How recent data is (e.g., “updated nightly at 01:00 UTC”).

Quick Wins (1–2 hours each)

Normalize dates to UTC and store the timezone info in the schema.
Enforce unique IDs and add a simple duplicate check.
Add last_updated to key tables.
Write one sample query per system and save the output CSV.

Immediate payoff: With these wins in place, many workflows can be piloted read-only in a week.
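The first quick win, normalizing dates to UTC, can be sketched with the standard library alone. This assumes source timestamps are ISO 8601 strings that carry an explicit offset; the `to_utc` helper is a hypothetical name, and it refuses naive timestamps rather than guessing a timezone.

```python
from datetime import datetime, timezone

def to_utc(text: str) -> str:
    """Convert an ISO 8601 timestamp with an explicit offset to UTC."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # A timestamp without an offset is ambiguous; stop instead of guessing.
        raise ValueError(f"timestamp has no offset: {text}")
    return parsed.astimezone(timezone.utc).isoformat()

print(to_utc("2025-08-31T20:30:00-04:00"))  # 2025-09-01T00:30:00+00:00
```

Note how an evening ticket on August 31 in a UTC-4 office lands in September once normalized, which is exactly the kind of boundary case that monthly reports trip over.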

Example: Readiness for a Monthly Ops Brief

Suppose you want an agent to compile a monthly helpdesk brief (volume, top categories, MoM change) and save a PDF. Minimum viable readiness looks like this:

Inventory: Helpdesk tickets (authoritative), owner = IT Support.
Schema: id (PK), category, created_at, status.
Access: service account with read-only access to the helpdesk; drafts written to /Reports/ (write_draft scope).
Validations: date range covers last month, non-empty, required columns present, unique id.
Artifacts: queries/tickets_last_month.sql, samples/tickets_2025-08.csv.

SELECT id, category, created_at, status
FROM tickets
WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
  AND created_at <  DATE_TRUNC('month', CURRENT_DATE)  -- half-open range keeps all of the last day
ORDER BY created_at;
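When the same window is computed in application code, for example to pass as query parameters or to feed the date range validation, a half-open range (start inclusive, end exclusive) avoids off-by-one mistakes at month boundaries. A minimal sketch; `last_month_window` is a hypothetical helper.

```python
from datetime import date

def last_month_window(today: date):
    """Return (start, end_exclusive) for the calendar month before `today`."""
    # The first day of the current month is the exclusive end of last month.
    end_exclusive = today.replace(day=1)
    if end_exclusive.month == 1:
        # January rolls back to December of the previous year.
        start = end_exclusive.replace(year=end_exclusive.year - 1, month=12)
    else:
        start = end_exclusive.replace(month=end_exclusive.month - 1)
    return start, end_exclusive
```

Filtering with `start <= created_at < end_exclusive` includes every ticket from the final day of the month, whatever time it was created.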

What Good Data Logs Look Like (for Operators)

When an agent runs, operators should see data-related facts at a glance:

Run #238 — Monthly Ops Brief
Data:
  source=tickets  rows=3,482  range=2025-08-01..2025-08-31  freshness=2025-09-01T01:14Z
Validations:
  not_empty: pass
  date_range: pass
  require_columns: pass
  unique(id): pass
Artifacts:
  /Reports/2025-08/ops-brief.pdf
  /Reports/2025-08/ops-brief-data.csv

This isn’t busywork; it’s how reviewers trust and approve outcomes quickly.

Troubleshooting Data Issues (Fast Paths)

Empty results? Check the date range and filter logic; run the sample query directly.
Duplicate IDs? Identify the source system; fix upstream or de-duplicate with a rule.
Type mismatches? Normalize types in the query (e.g., cast timestamps).
Stale data? Show a “data as of” timestamp; consider a freshness validation.

- freshness: { table: tickets, col: last_updated, max_age_hours: 36 }
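A freshness rule like the one above boils down to comparing the newest timestamp in the table against a maximum age. The sketch below assumes the caller has already fetched that newest timestamp as a timezone-aware `datetime`; `is_fresh` is a hypothetical helper, and `now` is passed in so the check is easy to test.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(latest_update: datetime, max_age_hours: int, now: datetime) -> bool:
    """True if the newest row was updated within `max_age_hours` of `now`."""
    return now - latest_update <= timedelta(hours=max_age_hours)

# Example: data refreshed at 01:14 UTC on Sept 1, checked at 10:00 UTC on Sept 2.
refreshed = datetime(2025, 9, 1, 1, 14, tzinfo=timezone.utc)
checked = datetime(2025, 9, 2, 10, 0, tzinfo=timezone.utc)
print(is_fresh(refreshed, 36, checked))  # True
```

A failed check should be handled like any other validation failure: stop, log the "data as of" timestamp, and route to a reviewer.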

Security & Privacy (Right-Sized)

Keep safeguards practical:

Secrets: environment variables or a secrets manager; no keys in code.
Scopes: define per system (read, write_draft, write_approved).
Logging: record what ran and why, without leaking PII.
Approvals: outward-facing actions (such as sharing a report) require a human click-through.

A Two-Week Readiness Plan

Week 1: inventory sources & owners; write schemas; set read-only access; add validations.
Week 2: create sample queries; save example rows; document masking; pilot one workflow read-only.

Quick FAQ

Do we need a data lake? No. Start with today’s sources and make them dependable.

What if our fields are inconsistent? Document current names, then stabilize them over time. An explicit schema beats a perfect one you don’t have.

How do we avoid hallucinations? Fetch facts via tools, prefer authoritative sources, and validate results. The agent should cite where numbers came from.

Who owns the schema? The business owner of the system. Make changes a lightweight review item, not a surprise.

Mini Glossary

MVD (Minimum Viable Data): The smallest set of docs, access, and checks that make a workflow safe to automate.
Validation: A rule the data must satisfy (e.g., non-empty, unique IDs).
Freshness: A promise about how recent the data is.
Masking: Hiding or hashing sensitive fields in logs and artifacts.

Data readiness is less about infrastructure and more about clarity. Inventory sources, write light schemas, use read-only access, add simple validations, and be deliberate with sensitive fields. Do that, and your first agent will look less like a prototype—and more like a teammate you can trust.