Data Extraction
How classified documents are processed by Azure Document Intelligence to extract structured data into the database.
Overview
Data extraction is the third stage of the pipeline. It is handled by the File Data Extraction Service — an Azure Function that picks up classified documents, runs them through an AI model to extract structured fields, persists the results to the database, and hands off to the next stage.
Steps
1. Queue Trigger — document-extraction-queue
The function is triggered by a message on the document-extraction-queue, pushed at the end of the classification stage once a document has been classified and stored in its type-specific container.
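As a rough illustration of what the trigger receives, the sketch below parses a queue message into a typed request. The message schema is not specified in this document, so the JSON shape and the field names `document_id`, `document_type`, and `blob_name` are assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRequest:
    """Hypothetical payload of a document-extraction-queue message."""
    document_id: str
    document_type: str
    blob_name: str


def parse_extraction_message(body: str) -> ExtractionRequest:
    """Decode a JSON queue message and fail fast if a required field is absent."""
    payload = json.loads(body)
    required = ("document_id", "document_type", "blob_name")
    missing = [key for key in required if key not in payload]
    if missing:
        raise ValueError(f"queue message is missing fields: {', '.join(missing)}")
    return ExtractionRequest(
        document_id=str(payload["document_id"]),
        document_type=str(payload["document_type"]),
        blob_name=str(payload["blob_name"]),
    )
```

Rejecting malformed messages at this point keeps a bad payload from reaching the fetch and extraction steps.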
2. Fetch — Type-Specific Blob Container
The service reads the queue message to identify the document, then fetches the file from the appropriate Blob Storage container (e.g. master-cards, index-cards, red-cross-cards).
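One way to resolve the container is a plain lookup table keyed by document type. The container names below come from the examples above; the document-type keys are hypothetical, since the actual identifiers used in the queue message are not given here.

```python
# Container names are taken from the examples in this document.
# The document-type keys are assumed for illustration only.
CONTAINER_BY_TYPE = {
    "master-card": "master-cards",
    "index-card": "index-cards",
    "red-cross-card": "red-cross-cards",
}


def resolve_container(document_type: str) -> str:
    """Return the Blob Storage container that holds documents of this type."""
    try:
        return CONTAINER_BY_TYPE[document_type]
    except KeyError:
        raise ValueError(
            f"no container configured for document type {document_type!r}"
        ) from None
```

An unknown type raises an error instead of falling back to a default container, so a misclassified document is not silently read from the wrong place.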
3. Extract — Azure Document Intelligence
The document is sent to the corresponding Azure Document Intelligence model. Each document type has its own trained model to ensure accurate field extraction. The model returns structured data such as the ex-code, registration code, family members, and other fields specific to the document type.
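The per-type routing could look roughly like the sketch below. It assumes the `client` argument behaves like `DocumentAnalysisClient` from the `azure-ai-formrecognizer` package, whose `begin_analyze_document` returns a poller. The model IDs and document-type keys are hypothetical, because the names of the trained models are not given in this document.

```python
from typing import Any, Dict

# Hypothetical model IDs, one per document type.
MODEL_BY_TYPE = {
    "red-cross-card": "red-cross-card-model",
    "master-card-front": "front-master-card-model",
}


def extract_fields(client: Any, document_type: str, content: bytes) -> Dict[str, Dict[str, Any]]:
    """Run the type-specific model and flatten its fields to plain dictionaries.

    `client` is assumed to expose begin_analyze_document(model_id, document=...),
    as DocumentAnalysisClient does.
    """
    model_id = MODEL_BY_TYPE[document_type]
    poller = client.begin_analyze_document(model_id, document=content)
    result = poller.result()  # blocks until the analysis has finished
    if not result.documents:
        return {}
    # Keep the confidence next to each value so later stages can flag weak fields.
    return {
        name: {"value": field.value, "confidence": field.confidence}
        for name, field in result.documents[0].fields.items()
    }
```

Passing the client in as an argument keeps the function easy to exercise with a stub in unit tests.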
4. Persist — Azure SQL Database
The extracted data is written to Azure SQL. Each document type maps to its own dedicated set of tables, one for the card itself and one for its family members:
| Document Type | Tables |
|---|---|
| Red Cross Card | red-cross-card, red-cross-family-members |
| Master Card (Front) | front-master-card, front-master-card-family-members |
Each record is linked to its source document for traceability.
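The sketch below illustrates the parent and child write for a Red Cross Card, with every row carrying the source document ID. It uses Python's built-in `sqlite3` as a stand-in for Azure SQL so that it runs anywhere. The table names come from the table above, but the column names (`document_id`, `registration_code`, `member_name`) and the shape of `fields` are assumptions.

```python
import sqlite3
from typing import Any, Mapping


def create_demo_schema(conn: sqlite3.Connection) -> None:
    """Create illustrative tables; the real Azure SQL schema is not shown in this document."""
    conn.execute(
        "CREATE TABLE [red-cross-card] (document_id TEXT PRIMARY KEY, registration_code TEXT)"
    )
    conn.execute(
        "CREATE TABLE [red-cross-family-members] (document_id TEXT, member_name TEXT)"
    )


def persist_red_cross_card(
    conn: sqlite3.Connection, document_id: str, fields: Mapping[str, Any]
) -> None:
    """Write the card row and its family-member rows in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO [red-cross-card] (document_id, registration_code) VALUES (?, ?)",
            (document_id, fields.get("registration_code")),
        )
        conn.executemany(
            "INSERT INTO [red-cross-family-members] (document_id, member_name) VALUES (?, ?)",
            [(document_id, name) for name in fields.get("family_members", [])],
        )
```

Writing both tables inside a single transaction avoids leaving a card row without its family members if the second insert fails.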
5. Advance Queue — document-cleansing-queue
Once the data is persisted, a new event is pushed to the document-cleansing-queue to trigger the next stage — data cleansing and validation.
Flow Summary
document-extraction-queue
↓
Fetch document from Blob Storage
↓
Send to Azure Document Intelligence (type-specific model)
↓
Persist extracted data → Azure SQL
↓
Push to document-cleansing-queue
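
The flow above can be sketched as a single function that wires the steps together. This is a minimal outline, not the service's actual code: the four callables stand in for Blob Storage, Document Intelligence, Azure SQL, and the document-cleansing-queue, and the message fields are assumed.

```python
import json
from typing import Any, Callable, Mapping


def run_extraction(
    message_body: str,
    fetch: Callable[[str, str], bytes],
    extract: Callable[[str, bytes], Mapping[str, Any]],
    persist: Callable[[str, str, Mapping[str, Any]], None],
    enqueue_cleansing: Callable[[str], None],
) -> None:
    """Run fetch, extract, persist, and hand-off in order for one queue message."""
    request = json.loads(message_body)
    document_id = request["document_id"]
    document_type = request["document_type"]

    content = fetch(document_type, request["blob_name"])
    fields = extract(document_type, content)
    persist(document_id, document_type, fields)

    # Reached only if persist did not raise, so the cleansing stage
    # is never triggered for data that was not saved.
    enqueue_cleansing(
        json.dumps({"document_id": document_id, "document_type": document_type})
    )
```

Because each dependency is injected, any step can be replaced with a stub when testing the ordering and failure behaviour.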