Structured vs Unstructured Data

1. Why Data Types Matter in AI

Every AI model — whether traditional ML or a generative model — learns from data. The type of data you have determines which models you can use, how you prepare the data, and which Google Cloud tools are most appropriate.

The GAIL exam tests your ability to recognize data types and match them to the right AI approach and tooling.


2. Structured Data

Structured data is organized in a predefined format, typically rows and columns, with a clear schema. Every record follows the same structure.

Characteristics

  • Stored in relational databases or spreadsheets
  • Has defined data types (integer, float, string, date)
  • Easily searchable and queryable with SQL
  • Each field has a clear, consistent meaning

Examples

Customer ID | Age | City | Purchase Amount | Churned
1001 | 34 | Milan | €120.50 | No
1002 | 28 | Rome | €45.00 | Yes
1003 | 52 | Turin | €310.00 | No
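
To make "queryable with SQL" concrete, here is a minimal sketch that loads the three example rows into an in-memory SQLite database and aggregates them. The table name `customers`, the column names, and the 0/1 encoding of the Churned column are assumptions made for illustration.

```python
import sqlite3

# In-memory database; the schema mirrors the example table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id     INTEGER PRIMARY KEY,
        age             INTEGER NOT NULL,
        city            TEXT    NOT NULL,
        purchase_amount REAL    NOT NULL,  -- euros
        churned         INTEGER NOT NULL   -- assumed encoding: 0 = No, 1 = Yes
    )
    """
)
rows = [
    (1001, 34, "Milan", 120.50, 0),
    (1002, 28, "Rome", 45.00, 1),
    (1003, 52, "Turin", 310.00, 0),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", rows)

# Because every row follows the same schema, one query can aggregate all of them.
summary = conn.execute(
    """
    SELECT churned, COUNT(*), AVG(purchase_amount)
    FROM customers
    GROUP BY churned
    ORDER BY churned
    """
).fetchall()
print(summary)  # [(0, 2, 215.25), (1, 1, 45.0)]
```

The same query pattern is what makes structured data a natural fit for SQL-based tooling: no parsing or model is needed to answer "what is the average purchase of customers who churned?".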

Other examples:

  • Financial transactions (amount, date, merchant, account ID)
  • Sensor readings (temperature, pressure, timestamp)
  • E-commerce orders (product ID, quantity, price, shipping address)
  • Survey responses with predefined choices

Use in AI

Structured data is the traditional home of classical machine learning:

  • Predict customer churn (classification)
  • Forecast sales revenue (regression)
  • Detect fraudulent transactions (anomaly detection)
  • Segment customers by purchase behavior (clustering)

Google Cloud tools:

  • BigQuery ML: run ML models directly on structured data using SQL
  • Vertex AI AutoML Tables: automatically build ML models from tabular data
  • Vertex AI Tabular: custom training on structured datasets


3. Unstructured Data

Unstructured data does not follow a predefined format or schema, so it cannot be easily organized into rows and columns.

Characteristics

  • No fixed schema or format
  • Requires preprocessing to extract meaning
  • Much more abundant than structured data (commonly estimated at 80-90% of all data generated)
  • Rich in information but harder to query directly

Examples by type

Type | Examples
Text | Emails, documents, articles, chat logs, social media posts, contracts
Images | Photos, medical scans, satellite imagery, product images
Audio | Call recordings, podcasts, voice messages
Video | Security footage, training videos, customer demos
Code | Source files, scripts, notebooks

Use in AI

Unstructured data is where deep learning and generative AI excel:

  • Summarize customer emails (text)
  • Detect tumors in X-rays (images)
  • Transcribe call center recordings (audio)
  • Generate product descriptions from images (multimodal)
  • Classify support tickets by topic (text)

Google Cloud tools:

  • Vertex AI: training and deploying models on unstructured data
  • Document AI: extract structured information from unstructured documents
  • Vision AI: image classification, object detection, OCR
  • Speech-to-Text API: transcribe audio to text
  • Natural Language API: sentiment analysis, entity extraction from text
  • Video Intelligence API: analyze and index video content
  • Gemini: natively handles text, images, audio, and video in one model


4. Semi-Structured Data

Semi-structured data is a third category worth knowing: it has some organizational structure but doesn’t fit neatly into rows and columns.

Examples

  • JSON files
  • XML documents
  • Log files
  • HTML pages
  • Emails with metadata headers + free-text body

Use in AI

Semi-structured data often requires parsing or preprocessing before use. For example:

  • Extracting fields from JSON API responses
  • Parsing log files to identify anomalies
  • Extracting structured data from HTML web pages (scraping)
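
The parsing step can be sketched with the Python standard library. The JSON payload, the log line format, and all field names below are hypothetical; real APIs and log formats will differ.

```python
import json
import re

# Hypothetical API response: nested keys, and "amount" is optional.
api_response = (
    '{"user": {"id": 1001, "city": "Milan"}, '
    '"events": [{"type": "purchase", "amount": 120.5}, {"type": "login"}]}'
)
payload = json.loads(api_response)

# Flatten into fixed-schema rows; a missing "amount" becomes None.
event_rows = [
    {
        "user_id": payload["user"]["id"],
        "event_type": event["type"],
        "amount": event.get("amount"),
    }
    for event in payload["events"]
]

# Hypothetical log format: "<date> <time> <LEVEL> <free-text message>".
log_lines = [
    "2024-05-01 12:00:03 INFO request served in 120ms",
    "2024-05-01 12:00:09 ERROR upstream timeout after 30s",
    "not a log line",
]
LOG_PATTERN = re.compile(r"^(\S+ \S+) (INFO|WARNING|ERROR) (.*)$")

parsed_logs = []
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match:  # skip lines that do not fit the expected pattern
        timestamp, level, message = match.groups()
        parsed_logs.append({"timestamp": timestamp, "level": level, "message": message})

error_count = sum(1 for entry in parsed_logs if entry["level"] == "ERROR")
print(event_rows)
print(error_count)
```

Once flattened, the rows have a fixed set of fields and can be loaded into a table like any other structured data.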


5. Side-by-Side Comparison

Aspect | Structured | Semi-Structured | Unstructured
Format | Fixed schema, rows/columns | Flexible schema, key-value | No schema
Storage | Relational databases, CSV | JSON, XML, NoSQL | Files, object storage
Query method | SQL | JSON queries, NoSQL queries | AI/ML models, search
Volume | ~10% of enterprise data | ~10% | ~80%
ML approach | Classical ML, AutoML | Varies | Deep learning, GenAI
Example | Sales table | Log file | Customer email
GCP tool | BigQuery, AutoML Tables | Firestore, BigQuery | Vertex AI, Gemini, Document AI

6. Data in the GenAI Pipeline

Understanding how both data types flow through a GenAI system is key for GAIL.

RAG Pipeline (most common GenAI architecture)

Unstructured data (PDFs, docs, emails)
        ↓
  Preprocessing & chunking
        ↓
  Embedding model (converts text → vectors)
        ↓
  Vector database (stores embeddings)
        ↓
  User query → Retrieve relevant chunks → Inject into prompt → LLM response
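
The flow above can be sketched end to end in a few lines. This is a toy illustration under loud assumptions: a word-count vector stands in for a real embedding model, a Python list stands in for a vector database, the chunker has no overlap, and the final LLM call is omitted. All function names and documents are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=8):
    """Split text into chunks of at most max_words words (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy embedding: sparse word counts, a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Refunds are issued within 14 days of a return request.",
    "Standard shipping takes 3 to 5 business days within Italy.",
]

# Stand-in for the vector database: a list of (chunk, vector) pairs.
index = [(piece, embed(piece)) for doc in documents for piece in chunk(doc)]

def retrieve(query, k=1):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [piece for piece, _ in ranked[:k]]

def build_prompt(query):
    """Inject the retrieved chunks into the prompt that would be sent to an LLM."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How long does shipping take?")
print(prompt)
```

In a production system each stand-in would be replaced by a managed component, but the order of steps stays the same: chunk, embed, store, retrieve, then prompt.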

Training / Fine-tuning Pipeline

Raw data (structured + unstructured)
        ↓
  Data cleaning & labeling
        ↓
  Feature engineering (structured) / Tokenization (unstructured)
        ↓
  Model training
        ↓
  Evaluation → Deployment
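
The split at the third step (feature engineering versus tokenization) can be illustrated with a small sketch. The records and sentences are made up, and the word-level tokenizer is a simplification: LLM tokenizers typically operate on subword units rather than whole words.

```python
# Structured path: turn typed columns into numeric features.
records = [
    {"age": 34, "city": "Milan"},
    {"age": 28, "city": "Rome"},
    {"age": 52, "city": "Turin"},
]
cities = sorted({r["city"] for r in records})
min_age = min(r["age"] for r in records)
max_age = max(r["age"] for r in records)

def to_features(record):
    """Min-max scale age to [0, 1] and one-hot encode city."""
    scaled_age = (record["age"] - min_age) / (max_age - min_age)
    one_hot = [1.0 if record["city"] == city else 0.0 for city in cities]
    return [scaled_age] + one_hot

feature_rows = [to_features(r) for r in records]

# Unstructured path: turn text into integer token ids.
texts = ["the order arrived late", "the refund arrived"]
vocab = {}
for text in texts:
    for word in text.split():
        vocab.setdefault(word, len(vocab))  # assign the next free id to new words

def tokenize(text):
    """Map each whitespace-separated word to its vocabulary id."""
    return [vocab[word] for word in text.split()]

token_ids = [tokenize(t) for t in texts]
print(feature_rows)
print(token_ids)
```

Both paths end in the same place: lists of numbers that a model can train on.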

Key insight for GAIL

LLMs primarily work with unstructured data: they read and generate text. However, many enterprise AI applications combine both:

  • An LLM generates a summary (unstructured output)
  • The result is stored and tagged in a database (structured)
  • Structured metadata (user ID, date, category) is used to filter which documents to retrieve in a RAG pipeline
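
The metadata filtering mentioned in the last point might look like the sketch below. The document store, its field names, and the `filter_candidates` helper are hypothetical; the idea is only that a structured filter narrows the candidate set before any similarity search runs.

```python
from datetime import date

# Hypothetical document store: structured metadata next to unstructured text.
documents = [
    {"category": "billing", "updated": date(2024, 3, 1), "text": "Invoices are sent monthly."},
    {"category": "billing", "updated": date(2021, 6, 15), "text": "Invoices are sent quarterly."},
    {"category": "shipping", "updated": date(2024, 2, 10), "text": "Orders ship within two days."},
]

def filter_candidates(docs, category, updated_after):
    """Keep only documents whose metadata matches the structured filter."""
    return [
        doc for doc in docs
        if doc["category"] == category and doc["updated"] > updated_after
    ]

candidates = filter_candidates(documents, "billing", date(2023, 1, 1))
print([doc["text"] for doc in candidates])
```

Here the outdated billing document is excluded by its date, so it can never be retrieved and injected into the prompt.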


7. Data Quality: Why It Matters

“Garbage in, garbage out” — the quality of your data directly determines model quality.

Key data quality dimensions

Dimension | Description | AI impact
Completeness | Are there missing values? | Missing data causes skewed predictions
Accuracy | Is the data correct? | Incorrect labels → wrong model behavior
Consistency | Is the same concept represented the same way? | Inconsistent labels confuse the model
Relevance | Is the data relevant to the task? | Irrelevant data adds noise
Timeliness | Is the data up to date? | Stale data causes model drift
Representativeness | Does data cover all groups fairly? | Unrepresentative data causes bias
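
A few of these dimensions can be checked with very simple code. The sketch below uses made-up records and an assumed set of allowed labels; real checks would be more thorough, and the per-city count is only a rough first look at representativeness.

```python
from collections import Counter

records = [
    {"customer_id": 1001, "age": 34, "city": "Milan", "churned": "No"},
    {"customer_id": 1002, "age": None, "city": "Rome", "churned": "yes"},
    {"customer_id": 1003, "age": 52, "city": "Turin", "churned": "N"},
    {"customer_id": 1004, "age": 41, "city": "Rome", "churned": "Yes"},
]

# Completeness: share of records with no missing (None) values.
complete = [r for r in records if all(value is not None for value in r.values())]
completeness = len(complete) / len(records)

# Consistency: labels that are not one of the agreed spellings (assumed here).
ALLOWED_LABELS = {"Yes", "No"}
inconsistent_ids = [r["customer_id"] for r in records if r["churned"] not in ALLOWED_LABELS]

# Representativeness (rough): how many records each city contributes.
city_counts = Counter(r["city"] for r in records)

print(completeness)      # 0.75
print(inconsistent_ids)  # [1002, 1003]
print(dict(city_counts))
```

Checks like these are cheap to run before training and catch problems that are much harder to diagnose from model behavior afterwards.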

8. Data Storage in Google Cloud

Knowing where different data types live in GCP is useful for the exam:

Data type | Typical GCP storage
Structured tabular data | BigQuery, Cloud SQL, Cloud Spanner
Semi-structured (JSON/logs) | Firestore, BigQuery, Cloud Logging
Unstructured files (text, images, video) | Cloud Storage (GCS)
Embeddings / vectors | Vertex AI Vector Search, AlloyDB
Streaming data | Pub/Sub + Dataflow

9. Key Vocabulary Cheat Sheet

Term | Definition
Structured data | Data organized in rows and columns with a fixed schema
Unstructured data | Data without a predefined format (text, images, audio, video)
Semi-structured data | Flexible schema data (JSON, XML, logs)
Schema | The predefined structure defining what fields exist and their types
Feature | An input variable used by an ML model (typically from structured data)
Embedding | A numerical vector representation of unstructured data (text, image)
Vector database | Database optimized for storing and searching embeddings
Tokenization | Splitting text into tokens for processing by an LLM
OCR | Optical Character Recognition: converting image text to machine-readable text
Data preprocessing | Cleaning and transforming raw data before model training or inference
Data drift | When production data diverges from training data over time
BigQuery ML | SQL-based ML directly on structured data in BigQuery
Document AI | Google Cloud service for extracting structure from unstructured documents