
Document Classification

How ingested documents are split, classified by type, and routed to the appropriate storage containers.

Overview

Document classification is the second stage of the pipeline. It is an Azure Function, triggered by a queue event from the ingestion stage, that splits multi-page documents, identifies each document's type using AI, and routes each document to the correct storage container.

Steps

1. Queue Trigger — document-split-classification-queue

The function starts when a message arrives on the document-split-classification-queue. This event is pushed by the ingestion stage after a file has been successfully uploaded to Blob Storage.
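
The payload format of the queue message is not specified on this page. As a minimal sketch, assuming a UTF-8 JSON body with hypothetical `container` and `blob_name` fields, the function could decode the message like this:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitClassificationMessage:
    """Hypothetical shape of a message on document-split-classification-queue."""
    container: str   # assumed field: container that holds the uploaded file
    blob_name: str   # assumed field: name of the uploaded blob


def parse_queue_message(body: bytes) -> SplitClassificationMessage:
    """Decode a UTF-8 JSON queue message into a typed object."""
    payload = json.loads(body.decode("utf-8"))
    missing = {"container", "blob_name"} - payload.keys()
    if missing:
        # Fail early so a malformed message is not silently processed.
        raise ValueError(f"queue message is missing fields: {sorted(missing)}")
    return SplitClassificationMessage(payload["container"], payload["blob_name"])
```

Validating the message before any storage access keeps malformed events from reaching the splitter.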

2. Read — Unprocessed Files Container

The document is read from the Unprocessed-Files container in Azure Blob Storage, where it was placed during ingestion.

3. Split — Document Splitter

The document is split into individual pages or logical units. Currently only PDF splitting is supported. TIFF splitting may be added in a future iteration.
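
The format restriction can be sketched as a check that runs before splitting. This example assumes one output file per page and a hypothetical naming scheme (`<stem>_p<NNNN>.pdf`); the page extraction itself is left to whichever PDF library the project uses.

```python
from pathlib import PurePosixPath

# Per the text above, only PDF splitting is currently supported.
SUPPORTED_EXTENSIONS = {".pdf"}


def plan_split(blob_name: str, page_count: int) -> list[str]:
    """Return hypothetical output file names, one per page of the source document."""
    path = PurePosixPath(blob_name)
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"splitting is not supported for {blob_name!r}")
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Assumed naming scheme: source stem plus a zero-padded page number.
    return [f"{path.stem}_p{page:04d}.pdf" for page in range(1, page_count + 1)]
```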

4. Store Metadata — Database

Once split, the metadata for each resulting file (file name, page count, source reference, split timestamp) is persisted to the database for traceability and downstream use.
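
A metadata record with the four fields listed above might look like the following sketch. The database engine and schema are not named on this page, so an in-memory SQLite database and a hypothetical `split_files` table stand in for them here.

```python
import sqlite3
from dataclasses import astuple, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SplitFileMetadata:
    file_name: str
    page_count: int
    source_reference: str
    split_timestamp: str  # ISO 8601 string in UTC


def save_metadata(conn: sqlite3.Connection, record: SplitFileMetadata) -> None:
    """Insert one metadata row; table and column names are illustrative only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS split_files ("
        "file_name TEXT PRIMARY KEY, page_count INTEGER, "
        "source_reference TEXT, split_timestamp TEXT)"
    )
    # Parameter placeholders avoid building SQL from file names.
    conn.execute("INSERT INTO split_files VALUES (?, ?, ?, ?)", astuple(record))
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real database
record = SplitFileMetadata(
    file_name="scan-001_p0001.pdf",
    page_count=1,
    source_reference="Unprocessed-Files/scan-001.pdf",
    split_timestamp=datetime.now(timezone.utc).isoformat(),
)
save_metadata(conn, record)
```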

5. Classify — AI Document Classifier

Each split document is passed to the AI classification model, which identifies the document type. Supported types include:

  • Master Card
  • Index Card
  • Red Cross Card

Documents that cannot be confidently classified — due to damage, poor scan quality, or unknown format — are treated as unclassified.
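
The fallback to unclassified can be sketched as a confidence check on the model output. The threshold value and the `Prediction` shape below are assumptions for illustration; the page does not state how confidence is measured or which cut-off is used.

```python
from dataclasses import dataclass

# Types named on this page; the real classifier may support more.
KNOWN_TYPES = {"Master Card", "Index Card", "Red Cross Card"}
UNCLASSIFIED = "Unclassified"
CONFIDENCE_THRESHOLD = 0.8  # assumed value, not taken from the source


@dataclass(frozen=True)
class Prediction:
    """Hypothetical classifier output: a label and a score between 0 and 1."""
    label: str
    confidence: float


def resolve_document_type(
    prediction: Prediction, threshold: float = CONFIDENCE_THRESHOLD
) -> str:
    """Return the predicted type, or UNCLASSIFIED if the result is not trusted."""
    if prediction.label not in KNOWN_TYPES or prediction.confidence < threshold:
        return UNCLASSIFIED
    return prediction.label
```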

6. Route — Type-Specific Blob Containers

Based on the classification result, each document is moved to a dedicated container:

Document Type        Target Container
Master Card          master-cards
Index Card           index-cards (not yet implemented)
Red Cross Card       red-cross-cards
Unknown / Damaged    unclassified-files

Documents in unclassified-files require manual review and can be accessed separately to determine appropriate handling.
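
The routing rule amounts to a lookup from document type to container name. In this sketch, sending Index Card documents to unclassified-files while index-cards is not yet implemented is an assumption; the page does not say how those documents are handled in the meantime.

```python
CONTAINER_BY_TYPE = {
    "Master Card": "master-cards",
    "Index Card": "index-cards",
    "Red Cross Card": "red-cross-cards",
}
NOT_YET_IMPLEMENTED = {"Index Card"}
FALLBACK_CONTAINER = "unclassified-files"


def target_container(document_type: str) -> str:
    """Map a classification result to the name of its target container."""
    if document_type in NOT_YET_IMPLEMENTED:
        # Assumption: hold these for manual review until the container exists.
        return FALLBACK_CONTAINER
    # Unknown or damaged documents have no entry and fall through to the default.
    return CONTAINER_BY_TYPE.get(document_type, FALLBACK_CONTAINER)
```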

7. Acknowledge & Advance Queue

Once classification and routing are complete, the message is acknowledged on document-split-classification-queue. A new event is then pushed to document-extraction-queue to trigger the next stage — AI data extraction.

Flow Summary

document-split-classification-queue
  ↓
Read from Unprocessed-Files (Blob Storage)
  ↓
Split Document (PDF)
  ↓
Store Metadata (Database)
  ↓
AI Classification
  ↓
Route to type-specific container
  ↓
Acknowledge queue → Push to document-extraction-queue
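
The flow above can be sketched as a single handler that calls each step in order. Every callable passed in here is a hypothetical stand-in for the real storage, database, classifier, and queue clients, which are not described on this page.

```python
from typing import Callable, Iterable


def handle_message(
    blob_name: str,
    read: Callable[[str], bytes],
    split: Callable[[bytes], Iterable[bytes]],
    store_metadata: Callable[[str, int], None],
    classify: Callable[[bytes], str],
    route: Callable[[bytes, str], None],
    acknowledge: Callable[[], None],
    push_next: Callable[[str], None],
) -> int:
    """Process one queue message and return the number of split documents."""
    data = read(blob_name)                      # read from Unprocessed-Files
    parts = list(split(data))                   # split the document
    for index, _ in enumerate(parts, start=1):
        store_metadata(blob_name, index)        # one metadata record per part
    for part in parts:
        route(part, classify(part))             # classify, then route
    acknowledge()                               # acknowledge the current message
    push_next(blob_name)                        # notify document-extraction-queue
    return len(parts)
```

Passing the steps in as callables keeps the ordering logic testable without access to Azure resources.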
