Structured vs Unstructured Data¶
1. Why Data Types Matter in AI¶
Every AI model — whether traditional ML or a generative model — learns from data. The type of data you have determines which models you can use, how you prepare the data, and which Google Cloud tools are most appropriate.
The GAIL exam tests your ability to recognize data types and match them to the right AI approach and tooling.
2. Structured Data¶
Structured data is data that is organized in a predefined format, typically rows and columns, with a clear schema. Every entry follows the same structure.
Characteristics¶
- Stored in relational databases or spreadsheets
- Has defined data types (integer, float, string, date)
- Easily searchable and queryable with SQL
- Each field has a clear, consistent meaning
Examples¶
| Customer ID | Age | City | Purchase Amount | Churned |
|---|---|---|---|---|
| 1001 | 34 | Milan | €120.50 | No |
| 1002 | 28 | Rome | €45.00 | Yes |
| 1003 | 52 | Turin | €310.00 | No |
Other examples: - Financial transactions (amount, date, merchant, account ID) - Sensor readings (temperature, pressure, timestamp) - E-commerce orders (product ID, quantity, price, shipping address) - Survey responses with predefined choices
Use in AI¶
Structured data is the traditional home of classical machine learning: - Predict customer churn (classification) - Forecast sales revenue (regression) - Detect fraudulent transactions (anomaly detection) - Segment customers by purchase behavior (clustering)
Google Cloud tools: - BigQuery ML — run ML models directly on structured data using SQL - Vertex AI AutoML Tables — automatically build ML models from tabular data - Vertex AI Tabular — custom training on structured datasets
3. Unstructured Data¶
Unstructured data is data that does not follow a predefined format or schema. It cannot be easily organized into rows and columns.
Characteristics¶
- No fixed schema or format
- Requires preprocessing to extract meaning
- Much more abundant than structured data (~80-90% of all data generated)
- Rich in information but harder to query directly
Examples by type¶
| Type | Examples |
|---|---|
| Text | Emails, documents, articles, chat logs, social media posts, contracts |
| Images | Photos, medical scans, satellite imagery, product images |
| Audio | Call recordings, podcasts, voice messages |
| Video | Security footage, training videos, customer demos |
| Code | Source files, scripts, notebooks |
Use in AI¶
Unstructured data is where deep learning and generative AI excel: - Summarize customer emails (text) - Detect tumors in X-rays (images) - Transcribe call center recordings (audio) - Generate product descriptions from images (multimodal) - Classify support tickets by topic (text)
Google Cloud tools: - Vertex AI — training and deploying models on unstructured data - Document AI — extract structured information from unstructured documents - Vision AI — image classification, object detection, OCR - Speech-to-Text API — transcribe audio to text - Natural Language API — sentiment analysis, entity extraction from text - Video Intelligence API — analyze and index video content - Gemini — natively handles text, images, audio, video in one model
4. Semi-Structured Data¶
A third category worth knowing — data that has some organizational structure but doesn’t fit neatly into rows and columns.
Examples¶
- JSON files
- XML documents
- Log files
- HTML pages
- Emails with metadata headers + free-text body
Use in AI¶
Often requires parsing/preprocessing before use. For example: - Extracting fields from JSON API responses - Parsing log files to identify anomalies - Extracting structured data from HTML web pages (scraping)
5. Side-by-Side Comparison¶
| Structured | Semi-Structured | Unstructured | |
|---|---|---|---|
| Format | Fixed schema, rows/columns | Flexible schema, key-value | No schema |
| Storage | Relational databases, CSV | JSON, XML, NoSQL | Files, object storage |
| Query method | SQL | JSON queries, NoSQL queries | AI/ML models, search |
| Volume | ~10% of enterprise data | ~10% | ~80% |
| ML approach | Classical ML, AutoML | Varies | Deep learning, GenAI |
| Example | Sales table | Log file | Customer email |
| GCP tool | BigQuery, AutoML Tables | Firestore, BigQuery | Vertex AI, Gemini, Document AI |
6. Data in the GenAI Pipeline¶
Understanding how both data types flow through a GenAI system is key for GAIL.
RAG Pipeline (most common GenAI architecture)¶
Unstructured data (PDFs, docs, emails)
↓
Preprocessing & chunking
↓
Embedding model (converts text → vectors)
↓
Vector database (stores embeddings)
↓
User query → Retrieve relevant chunks → Inject into prompt → LLM response
Training / Fine-tuning Pipeline¶
Raw data (structured + unstructured)
↓
Data cleaning & labeling
↓
Feature engineering (structured) / Tokenization (unstructured)
↓
Model training
↓
Evaluation → Deployment
Key insight for GAIL¶
LLMs primarily work with unstructured data — they read and generate text. However, many enterprise AI applications combine both: - A LLM generates a summary (unstructured output) - The result is stored and tagged in a database (structured) - Structured metadata (user ID, date, category) is used to filter which documents to retrieve in a RAG pipeline
7. Data Quality: Why It Matters¶
“Garbage in, garbage out” — the quality of your data directly determines model quality.
Key data quality dimensions¶
| Dimension | Description | AI impact |
|---|---|---|
| Completeness | Are there missing values? | Missing data causes skewed predictions |
| Accuracy | Is the data correct? | Incorrect labels → wrong model behavior |
| Consistency | Is the same concept represented the same way? | Inconsistent labels confuse the model |
| Relevance | Is the data relevant to the task? | Irrelevant data adds noise |
| Timeliness | Is the data up to date? | Stale data causes model drift |
| Representativeness | Does data cover all groups fairly? | Unrepresentative data causes bias |
8. Data in Google Cloud Storage¶
Knowing where different data types live in GCP is useful for the exam:
| Data type | Typical GCP storage |
|---|---|
| Structured tabular data | BigQuery, Cloud SQL, Cloud Spanner |
| Semi-structured (JSON/logs) | Firestore, BigQuery, Cloud Logging |
| Unstructured files (text, images, video) | Cloud Storage (GCS) |
| Embeddings / vectors | Vertex AI Vector Search, AlloyDB |
| Streaming data | Pub/Sub + Dataflow |
9. Key Vocabulary Cheat Sheet¶
| Term | Definition |
|---|---|
| Structured data | Data organized in rows and columns with a fixed schema |
| Unstructured data | Data without a predefined format (text, images, audio, video) |
| Semi-structured data | Flexible schema data (JSON, XML, logs) |
| Schema | The predefined structure defining what fields exist and their types |
| Feature | An input variable used by an ML model (typically from structured data) |
| Embedding | A numerical vector representation of unstructured data (text, image) |
| Vector database | Database optimized for storing and searching embeddings |
| Tokenization | Splitting text into tokens for processing by an LLM |
| OCR | Optical Character Recognition — converting image text to machine-readable text |
| Data preprocessing | Cleaning and transforming raw data before model training or inference |
| Data drift | When production data diverges from training data over time |
| BigQuery ML | SQL-based ML directly on structured data in BigQuery |
| Document AI | Google Cloud service for extracting structure from unstructured documents |