Document Classification

How ingested documents are split, classified by type, and routed to the appropriate storage containers.

Overview

Document classification is the second stage of the pipeline. It is an Azure Function triggered by a queue event from the ingestion stage. It splits multi-page documents, identifies each document's type using AI, and routes each document to the correct storage container.

Steps

1. Queue Trigger — `document-split-classification-queue`

The function starts when a message arrives on `document-split-classification-queue`. The ingestion stage pushes this message after a file has been successfully uploaded to Blob Storage.
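The message schema is not specified here, so the following is only a minimal sketch of how the function might decode an incoming message. It assumes a JSON body; the field names `blob_name` and `container` and the class `SplitClassificationMessage` are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitClassificationMessage:
    """Hypothetical payload pushed by the ingestion stage."""
    blob_name: str   # assumed field: name of the uploaded blob
    container: str   # assumed field: container the blob was uploaded to


def parse_queue_message(body: bytes) -> SplitClassificationMessage:
    """Decode a UTF-8 JSON message body and check the assumed required fields."""
    payload = json.loads(body.decode("utf-8"))
    missing = [key for key in ("blob_name", "container") if key not in payload]
    if missing:
        raise ValueError(f"queue message is missing fields: {missing}")
    return SplitClassificationMessage(
        blob_name=payload["blob_name"],
        container=payload["container"],
    )
```

Rejecting malformed messages early keeps a bad payload from reaching the split and classification steps.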

2. Read — Unprocessed Files Container

The document is read from the Unprocessed-Files container in Azure Blob Storage, where it was placed during ingestion.

3. Split — Document Splitter

The document is split into individual pages or logical units. Currently only PDF splitting is supported. TIFF splitting may be added in a future iteration.
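As a rough illustration of the page-level case, the sketch below plans the output file names for a split without touching the PDF itself. The naming scheme (`<stem>_page_NNNN.pdf`) and the function `plan_page_splits` are assumptions, not the splitter's actual behaviour.

```python
from pathlib import PurePosixPath


def plan_page_splits(source_name: str, page_count: int) -> list[str]:
    """Return one hypothetical output name per page of a PDF source."""
    path = PurePosixPath(source_name)
    # Only PDF input is supported at present; anything else is rejected.
    if path.suffix.lower() != ".pdf":
        raise ValueError("only PDF splitting is supported")
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Assumed naming scheme: zero-padded, 1-based page index.
    return [f"{path.stem}_page_{n:04d}.pdf" for n in range(1, page_count + 1)]
```

For example, `plan_page_splits("scan-001.pdf", 2)` yields `scan-001_page_0001.pdf` and `scan-001_page_0002.pdf`.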

4. Store Metadata — Database

Once split, the metadata for each resulting file (file name, page count, source reference, split timestamp) is persisted to the database for traceability and downstream use.
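The database engine and schema are not named here. The sketch below uses SQLite purely as a stand-in to show the four metadata fields being persisted; the table name `split_files` and the column names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema holding the four fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS split_files (
    file_name        TEXT PRIMARY KEY,
    page_count       INTEGER NOT NULL,
    source_reference TEXT NOT NULL,
    split_timestamp  TEXT NOT NULL
)
"""


def store_split_metadata(conn: sqlite3.Connection, file_name: str,
                         page_count: int, source_reference: str) -> None:
    """Insert one metadata row per split file, stamped with the current UTC time."""
    split_timestamp = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO split_files "
            "(file_name, page_count, source_reference, split_timestamp) "
            "VALUES (?, ?, ?, ?)",
            (file_name, page_count, source_reference, split_timestamp),
        )
```

Keeping `source_reference` on every row is what lets a split file be traced back to the original upload.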

5. Classify — AI Document Classifier

Each split document is passed to the AI classification model, which identifies the document type. Supported types include:

Master Card
Index Card
Red Cross Card

Documents that cannot be confidently classified — due to damage, poor scan quality, or unknown format — are treated as unclassified.
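One way to express "confidently classified" is a confidence threshold on the model's prediction. The sketch below assumes the classifier returns a label and a score between 0 and 1; the threshold value of 0.8 and the function `resolve_document_type` are illustrative assumptions, not the values used in production.

```python
SUPPORTED_TYPES = {"Master Card", "Index Card", "Red Cross Card"}
UNCLASSIFIED = "Unclassified"
CONFIDENCE_THRESHOLD = 0.8  # assumed value, for illustration only


def resolve_document_type(predicted_label: str, confidence: float) -> str:
    """Accept the predicted label only if it is supported and confident enough."""
    if predicted_label not in SUPPORTED_TYPES:
        return UNCLASSIFIED  # unknown format
    if confidence < CONFIDENCE_THRESHOLD:
        return UNCLASSIFIED  # e.g. damaged page or poor scan quality
    return predicted_label
```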

6. Route — Type-Specific Blob Containers

Based on the classification result, each document is moved to a dedicated container:

Document Type	Target Container
Master Card	`master-cards`
Index Card	`index-cards` (Not yet implemented)
Red Cross Card	`red-cross-cards`
Unknown / Damaged	`unclassified-files`

Documents in `unclassified-files` require manual review; they can be accessed separately to determine appropriate handling.
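The routing table above can be sketched as a simple lookup with a fallback. The container names come from the table; the function name `target_container` is hypothetical, and how the not-yet-implemented `index-cards` route behaves in practice is not covered here.

```python
# Container names taken from the routing table.
CONTAINER_BY_TYPE = {
    "Master Card": "master-cards",
    "Index Card": "index-cards",  # marked "Not yet implemented" in the table
    "Red Cross Card": "red-cross-cards",
}
FALLBACK_CONTAINER = "unclassified-files"


def target_container(document_type: str) -> str:
    """Map a classification result to its container, defaulting to manual review."""
    return CONTAINER_BY_TYPE.get(document_type, FALLBACK_CONTAINER)
```

Using a default rather than raising on an unknown type means every document lands somewhere reviewable.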

7. Acknowledge & Advance Queue

Once classification and routing are complete, the message is acknowledged on `document-split-classification-queue`. A new event is then pushed to `document-extraction-queue` to trigger the next stage, AI data extraction.
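The ordering of this step can be illustrated with in-memory queues standing in for the two Azure queues. This is only a sketch: the function `complete_message` and the payload fields are hypothetical, and a real queue client would replace the `deque` objects.

```python
import json
from collections import deque


def complete_message(classification_queue: deque, extraction_queue: deque,
                     routed_blob: str, container: str) -> None:
    """Acknowledge the processed message, then enqueue the extraction event."""
    # Acknowledge: remove the message that has just been processed.
    classification_queue.popleft()
    # Advance: push an event for the extraction stage (assumed payload shape).
    extraction_queue.append(
        json.dumps({"blob_name": routed_blob, "container": container})
    )
```

Calling this only after routing has succeeded means a failure earlier in the function leaves the original message on the queue for a retry.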

Flow Summary

document-split-classification-queue
        ↓
Read from Unprocessed-Files (Blob Storage)
        ↓
Split Document (PDF)
        ↓
Store Metadata (Database)
        ↓
AI Classification
        ↓
Route to type-specific container
        ↓
Acknowledge queue → Push to document-extraction-queue
