
Data Cleansing

How extracted data is cleaned and standardized using business rules before being used downstream.

Overview

Data cleansing is the fourth stage of the pipeline. It is handled by an Azure Function triggered by the document-cleansing-queue. The function applies a set of business rules defined by the business team to normalize and clean each extracted record before it is used downstream.

Steps

1. Queue Trigger — document-cleansing-queue

The function is triggered by a message on the document-cleansing-queue, pushed at the end of the data extraction stage once a record has been persisted to Azure SQL.
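The message schema is not described on this page. As a rough illustration only, the sketch below assumes a JSON payload carrying a hypothetical `record_id` field that identifies the row persisted to Azure SQL. The function name and the field name are assumptions, not the actual contract.

```python
import json


def parse_cleansing_message(body: bytes) -> int:
    """Return the record ID carried by a queue message.

    Assumes a JSON payload such as {"record_id": 123}. The real message
    schema is not documented here, so treat the field name as hypothetical.
    """
    payload = json.loads(body.decode("utf-8"))
    record_id = payload["record_id"]  # hypothetical field name
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(record_id, bool) or not isinstance(record_id, int):
        raise ValueError(f"record_id must be an integer, got {record_id!r}")
    return record_id
```

Validating the payload before touching the database keeps a malformed message from triggering a lookup with a meaningless key.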

2. Process — Row-by-Row Cleansing

Each database row is processed individually. Business rules are applied to every field to ensure the data is consistent, valid, and standardized.
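One way to picture row-by-row processing is a mapping from field names to an ordered list of rule functions. The sketch below is illustrative Python, not the pipeline's code: the field names (`full_name`, `country`) and the two placeholder rules (`trim`, `upper`) are assumptions chosen only to show the shape of the loop.

```python
from typing import Callable, Dict, List, Optional

Rule = Callable[[Optional[str]], Optional[str]]


def trim(value: Optional[str]) -> Optional[str]:
    """Placeholder rule: remove leading and trailing whitespace."""
    return value.strip() if value is not None else None


def upper(value: Optional[str]) -> Optional[str]:
    """Placeholder rule: convert to upper case."""
    return value.upper() if value is not None else None


# Hypothetical mapping of field name to the rules applied to it, in order.
FIELD_RULES: Dict[str, List[Rule]] = {
    "full_name": [trim],
    "country": [trim, upper],
}


def cleanse_row(row: dict) -> dict:
    """Return a cleaned copy of one row; the input row is left unchanged."""
    cleaned = dict(row)
    for field, rules in FIELD_RULES.items():
        if field not in cleaned:
            continue
        value = cleaned[field]
        for rule in rules:
            value = rule(value)
        cleaned[field] = value
    return cleaned
```

For example, `cleanse_row({"id": 1, "full_name": "  Ali ", "country": " jor "})` leaves `id` alone and cleans the other two fields. Working on a copy keeps the raw values available until the cleaned row is ready to be written.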

3. Apply Business Rules — Stored Procedures

Most cleansing logic is implemented as SQL stored procedures. Each stored procedure targets specific fields and applies the transformation rules provided by the business team. Examples include:

Country Code Normalization

Single-letter country prefixes are expanded to their standard three-letter codes:

Raw Value (starts with)    Normalized Value
P                          PAL
L                          LEB
J                          JOR
Y                          YEM
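The prefix table above can be sketched as a lookup on the first character. This is an illustrative Python version, not the stored procedure itself. It assumes the whole value is replaced by the three-letter code, that matching ignores case and surrounding whitespace, and that values with an unlisted prefix are left untouched. None of these details are confirmed by the rule as stated.

```python
from typing import Optional

# Prefix-to-code mapping taken from the table above.
COUNTRY_PREFIXES = {"P": "PAL", "L": "LEB", "J": "JOR", "Y": "YEM"}


def normalize_country_code(raw: Optional[str]) -> Optional[str]:
    """Map a value to its three-letter code based on its first letter."""
    if raw is None:
        return None
    value = raw.strip()
    if not value:
        return raw
    # Assumption: unknown prefixes are returned unchanged rather than rejected.
    return COUNTRY_PREFIXES.get(value[0].upper(), raw)
```

Because the lookup keys on the first letter only, a value that is already normalized (for example `LEB`) maps to itself, so the rule is safe to run more than once on the same row.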

Special Character Removal

Fields such as ex-code and registration-code must contain digits only. All other characters, including letters, spaces, and punctuation, are stripped from these fields.
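A minimal Python sketch of this rule follows. In the pipeline the logic lives in a stored procedure, so the function name here is hypothetical, and the choice to keep only ASCII digits is an assumption.

```python
import re
from typing import Optional

# Matches every character that is not an ASCII digit.
_NON_DIGITS = re.compile(r"[^0-9]")


def strip_non_numeric(value: Optional[str]) -> Optional[str]:
    """Remove every character except the ASCII digits 0-9."""
    if value is None:
        return None
    return _NON_DIGITS.sub("", value)
```

For example, `strip_non_numeric("12-34/A")` returns `"1234"`. The pattern uses `[^0-9]` rather than `\D` because, for Python `str` patterns, `\d` also matches non-ASCII decimal digits such as Arabic-Indic numerals, so `\D` would leave them in place. Whether such digits should be removed or converted is a business decision this sketch does not settle.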

4. Update — Write Cleaned Values Back

Once all stored procedures have run for a given row, the cleaned field values are written back to the database record, replacing the raw extracted values.
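The write-back step can be sketched as a single parameterized UPDATE per row. The example uses Python's built-in `sqlite3` module as a stand-in for Azure SQL so that it runs anywhere. The table name `records`, the `id` key, and the column names are all hypothetical.

```python
import sqlite3
from typing import Dict, Optional

# Hypothetical set of columns the cleansing stage is allowed to overwrite.
CLEANSABLE_COLUMNS = {"country", "ex_code", "registration_code"}


def write_back(conn: sqlite3.Connection, record_id: int,
               cleaned: Dict[str, Optional[str]]) -> int:
    """Overwrite raw values with cleaned ones; return the number of rows updated."""
    unknown = set(cleaned) - CLEANSABLE_COLUMNS
    if unknown:
        raise ValueError(f"unexpected columns: {sorted(unknown)}")
    if not cleaned:
        return 0
    columns = sorted(cleaned)
    # Column names come from the allow-list above; values are bound parameters.
    assignments = ", ".join(f"{column} = ?" for column in columns)
    params = [cleaned[column] for column in columns] + [record_id]
    with conn:  # commits on success, rolls back if the statement raises
        cursor = conn.execute(
            f"UPDATE records SET {assignments} WHERE id = ?", params
        )
    return cursor.rowcount
```

Updating all cleaned fields in one statement means a row is never left half cleaned if the function fails partway through. A return value of 0 signals that no row matched the given ID.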

Flow Summary

document-cleansing-queue

Fetch row from Azure SQL

Apply business rules (Stored Procedures)
  - Normalize country codes
  - Strip non-numeric characters from codes
  - ... (additional rules)

Update row with cleaned values
