Document Classification
How ingested documents are split, classified by type, and routed to the appropriate storage containers.
Overview
Document classification is the second stage of the pipeline. It is an Azure Function, triggered by a queue message from the ingestion stage, that splits multi-page documents, identifies each document's type using an AI model, and routes each document to the correct storage container.
Steps
1. Queue Trigger — document-split-classification-queue
The function starts when a message arrives on the document-split-classification-queue. The ingestion stage pushes this message after a file has been successfully uploaded to Blob Storage.
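As an illustration of what the trigger handler might do first, the sketch below decodes and validates a queue message body. The message schema is not specified here, so the JSON shape, the `blob_name` field, and the `ClassificationRequest` and `parse_queue_message` names are all assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationRequest:
    """Hypothetical payload of a document-split-classification-queue message."""
    blob_name: str  # assumed field: name of the blob in Unprocessed-Files


def parse_queue_message(body: bytes) -> ClassificationRequest:
    """Decode a UTF-8 JSON message body and check the assumed 'blob_name' field."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("queue message must be a JSON object")
    blob_name = payload.get("blob_name")
    if not isinstance(blob_name, str) or not blob_name:
        raise ValueError("queue message is missing a non-empty 'blob_name'")
    return ClassificationRequest(blob_name=blob_name)
```

Rejecting malformed messages early keeps the later steps from operating on a missing or empty blob name.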
2. Read — Unprocessed Files Container
The document is read from the Unprocessed-Files container in Azure Blob Storage, where it was placed during ingestion.
3. Split — Document Splitter
The document is split into individual pages or logical units. Currently only PDF splitting is supported. TIFF splitting may be added in a future iteration.
4. Store Metadata — Database
Once split, the metadata for each resulting file (file name, page count, source reference, split timestamp) is persisted to the database for traceability and downstream use.
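The metadata fields listed above could be modelled as a small record, sketched below. The field names and types, the ISO 8601 UTC timestamp format, and the `SplitFileMetadata` and `build_split_metadata` names are assumptions; the actual database schema is not described in this document.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SplitFileMetadata:
    """Hypothetical record for one file produced by the splitter."""
    file_name: str
    page_count: int
    source_reference: str  # assumed: name of the original uploaded blob
    split_timestamp: str   # assumed: ISO 8601 string in UTC


def build_split_metadata(
    file_name: str,
    page_count: int,
    source_reference: str,
    now: Optional[datetime] = None,
) -> SplitFileMetadata:
    """Build a metadata record, stamping it with the current UTC time by default."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return SplitFileMetadata(file_name, page_count, source_reference, timestamp)
```

Calling `asdict(record)` yields a plain dictionary that most database clients can persist directly.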
5. Classify — AI Document Classifier
Each split document is passed to the AI classification model, which identifies the document type. Supported types include:
- Master Card
- Index Card
- Red Cross Card
Documents that cannot be confidently classified (because of damage, poor scan quality, or an unknown format) are treated as unclassified.
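One way to express "confidently classified" is a score threshold, sketched below. This assumes the classifier returns a score per label; the 0.8 threshold, the `Unclassified` label, and the `resolve_document_type` name are illustrative assumptions, not details of the actual model.

```python
from typing import Mapping

SUPPORTED_TYPES = ("Master Card", "Index Card", "Red Cross Card")
UNCLASSIFIED = "Unclassified"   # assumed label for the fallback case
CONFIDENCE_THRESHOLD = 0.8      # assumed value; tune against real scans


def resolve_document_type(
    scores: Mapping[str, float],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> str:
    """Return the best supported type, or UNCLASSIFIED if none is confident enough."""
    # Ignore any labels outside the supported set.
    known = {label: s for label, s in scores.items() if label in SUPPORTED_TYPES}
    if not known:
        return UNCLASSIFIED
    best_label, best_score = max(known.items(), key=lambda item: item[1])
    return best_label if best_score >= threshold else UNCLASSIFIED
```

A single threshold is the simplest policy; a per-type threshold would be a natural extension if some card types prove harder to recognise.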
6. Route — Type-Specific Blob Containers
Based on the classification result, each document is moved to a dedicated container:
| Document Type | Target Container |
|---|---|
| Master Card | master-cards |
| Index Card | index-cards (Not yet implemented) |
| Red Cross Card | red-cross-cards |
| Unknown / Damaged | unclassified-files |
Documents in unclassified-files require manual review and can be accessed separately to determine appropriate handling.
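The routing table above maps naturally onto a dictionary lookup with a fallback, as in this minimal sketch. The `target_container` function name is hypothetical, and the sketch does not model the "Not yet implemented" status of the index-cards route.

```python
# Container names are taken from the routing table.
CONTAINER_BY_TYPE = {
    "Master Card": "master-cards",
    "Index Card": "index-cards",  # listed as not yet implemented in the table
    "Red Cross Card": "red-cross-cards",
}
FALLBACK_CONTAINER = "unclassified-files"


def target_container(document_type: str) -> str:
    """Return the container for a document type, defaulting to unclassified-files."""
    return CONTAINER_BY_TYPE.get(document_type, FALLBACK_CONTAINER)
```

Defaulting to the fallback means an unexpected label from the classifier ends up in manual review rather than raising an error.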
7. Acknowledge & Advance Queue
Once classification and routing are complete, the message is acknowledged on document-split-classification-queue. A new message is then pushed to document-extraction-queue to trigger the next stage, AI data extraction.
Flow Summary
document-split-classification-queue
↓
Read from Unprocessed-Files (Blob Storage)
↓
Split Document (PDF)
↓
Store Metadata (Database)
↓
AI Classification
↓
Route to type-specific container
↓
Acknowledge queue → Push to document-extraction-queue
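The flow above can be sketched as a single function with each step injected as a callable, which keeps the ordering visible and makes it testable without Azure. All parameter names, the `#part-N` naming scheme, and the choice to push one extraction message per source document are assumptions for illustration. Acknowledgement is not shown: with an Azure Functions queue trigger, the message is removed once the function returns successfully.

```python
from typing import Callable, List

EXTRACTION_QUEUE = "document-extraction-queue"


def process_document(
    blob_name: str,
    read_blob: Callable[[str], bytes],
    split: Callable[[bytes], List[bytes]],
    store_metadata: Callable[[str, str], None],
    classify: Callable[[bytes], str],
    route: Callable[[bytes, str], str],
    enqueue: Callable[[str, str], None],
) -> List[str]:
    """Run read, split, store metadata, classify, route, then enqueue extraction."""
    data = read_blob(blob_name)              # read from Unprocessed-Files
    parts = split(data)                      # split into pages or logical units

    # Hypothetical naming scheme for the split files.
    names = [f"{blob_name}#part-{i}" for i in range(1, len(parts) + 1)]
    for name in names:
        store_metadata(name, blob_name)      # persist metadata before classifying

    containers = []
    for part in parts:
        document_type = classify(part)       # AI classification
        containers.append(route(part, document_type))

    enqueue(EXTRACTION_QUEUE, blob_name)     # trigger the extraction stage
    return containers
```

If any step raises, the function exits before enqueueing, so the extraction stage is never triggered for a partially processed document.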