# HR Data Quality — synthetic employee data cleaning

A skills-demonstration project: generate a realistic but **synthetic** HR
dataset, deliberately damage it with the kinds of defects real HR data carries,
then profile, clean, and visualise it with a documented, reproducible pipeline.

> **All data in this project is synthetic.** It is produced by Python/Faker for
> demonstration only. It is not real employee data and is not affiliated with
> any employer. Every dataset, workbook, and web page is labelled as such.

This project reframes data-management, data-quality QA, Excel, SQL and Python
skills around an HR employee-lifecycle scenario. It does **not** claim HR job
experience, nor proficiency in any specific HRIS (e.g. Workday or SAP); the
field names simply mirror common HR data concepts.

---

## 1. Scenario

An HR shared-service team receives an employee extract that has accumulated the
usual real-world quality problems: duplicates from repeated imports, dates typed
in several formats, departments spelled inconsistently, broken manager links,
and impossible values. Before the data can support headcount reporting or
lifecycle administration, it has to be profiled, cleaned to a controlled model,
and have anything it cannot safely fix routed to a review queue rather than
guessed at.

## 2. Data model

16 fields mirroring a typical HR record (full definitions in
[`DATA_DICTIONARY.md`](DATA_DICTIONARY.md)):

`employee_id`, `first_name`, `last_name`, `email`, `hire_date`,
`termination_date`, `employment_status`, `contract_type`, `job_title`,
`department`, `cost_center`, `org_unit`, `manager_id`, `location`,
`salary_band`, `is_synthetic`.

Controlled vocabularies are defined for `department` (10 values),
`contract_type`, `employment_status`, `org_unit`, and `salary_band`
(`B1`–`B7`). These are the single source of truth used both to generate the
data and to validate it during cleaning.

## 3. Defects injected (on purpose)

From a base of **5,000** internally-consistent records, the generator injects:

| Defect | Approx. rate |
|---|---|
| Duplicate records (exact + casing/whitespace near-duplicates) | ~2% (100 extra rows) |
| Inconsistent date formats (`dd/mm/yyyy`, `yyyy.mm.dd`, `dd Mon yyyy`, `dd-mm-yyyy`) | ~25% of dates |
| Inconsistent casing / whitespace in text fields | ~20% |
| Inconsistent department naming (`HR` / `Human Resources` / `h.r.` …) | ~18% |
| Missing `cost_center` / `manager_id` | ~5–6% each |
| Orphaned `manager_id` (points to a non-existent employee) | ~3% |
| Impossible dates (`termination_date` before `hire_date`) | ~5% of terminations |
| Out-of-range `salary_band` (`B0`, `B9`, `X` …) | ~3% |
| Malformed emails | ~4% |

The result is **5,100** raw rows. Generation is seeded (`seed = 42`), so the
defects — and therefore every downstream number — are identical on every run.

## 4. Profiling (SQL)

The raw extract is loaded into a portable **SQLite** database and profiled by
[`sql/profiling.sql`](sql/profiling.sql) — counts of exact duplicates, duplicate
keys, blank fields, orphaned manager references, out-of-domain salary bands, and
distinct department spellings (29 spellings collapse to 10 canonical values).
Results are captured to `data/profiling.json`.

Date-logic defects are profiled in Python rather than SQL, because the raw table
holds mixed free-text date formats that SQLite's `date()` cannot parse; the
typed cleaned table is then re-checked in SQL (see §6).

## 5. Cleaning rules

Each rule is named, **logged with a before/after count**, and follows one
principle: **standardise deterministically, but flag — never invent.** Anything
that cannot be safely corrected is flagged and routed to a quarantine review set
instead of being imputed or dropped.

| Rule | What it does | Decision |
|---|---|---|
| **R01** | Trim & collapse whitespace in all text fields | Pure standardisation |
| **R02** | Casing: names/titles Title Case, email lower, ID/code fields UPPER (org_unit keeps its canonical mixed case) | Pure standardisation |
| **R03** | De-duplicate by `employee_id` and by identical rows | Keep the **most complete** record (fewest blanks), tie-break first |
| **R04** | Parse mixed date formats to ISO 8601 | Flag any unparseable date |
| **R05** | Normalise `department`, `contract_type`, `employment_status` to controlled vocabularies | Flag values that map to nothing |
| **R06** | Impossible dates (`termination` < `hire`) | Clear the termination date and **flag** (preserves the otherwise-valid row) |
| **R07** | Validate `salary_band` against the domain | Clear and **flag** out-of-range bands (quarantine) |
| **R08** | Missing `cost_center` / `manager_id` | **Flag, never impute.** Managers (referenced by others) are allowed to have no manager |
| **R09** | Orphaned `manager_id` (no matching employee) | Clear and **flag** for review |
| **R10** | Reconcile `employment_status` with `termination_date` | Clear stray dates on active staff; flag contradictions |
| **R11** | Validate email format | Rebuild from the name (ASCII-folded) where possible, else flag |
| **R12** | Validate `employee_id` format & uniqueness | Flag any violation (primary-key integrity) |

### Why flag instead of impute?
In HR administration a wrong value is worse than a missing one — an invented
cost center or manager flows into payroll, reporting and org structure. So the
pipeline only auto-corrects where the correct value is unambiguous (formatting,
casing, known vocabulary). Everything else is surfaced for a human.

## 6. Results (seed 42, n = 5,000)

| Metric | Value |
|---|---|
| Raw rows in | **5,100** |
| Duplicates removed (R03) | **100** |
| Clean records (passed every gate) | **4,174** |
| Quarantined for review | **826** |
| Cleaning rules applied | **12** |
| Emails repaired (R11) | 364 |

Flag breakdown of the quarantine set: missing manager 261 · missing cost center
250 · invalid salary band 150 · orphaned manager 150 · impossible dates 48 ·
status inconsistency 48 · invalid email 34. (A row can carry more than one flag.)

After cleaning, the SQL verification gates in
[`sql/cleaning.sql`](sql/cleaning.sql) all return **0** against the cleaned
table: no duplicate IDs, no bad ID formats, all departments and salary bands in
vocabulary, no impossible or non-ISO dates, no orphaned managers, and every row
labelled `SYNTHETIC`.

## 7. Outputs

```
data/
  raw_employees.csv          raw, messy dataset
  cleaned_employees.csv      cleaned rows + qa_status + qa_flags
  quarantine_records.csv     flagged rows only
  employees.db               SQLite: raw_employees, cleaned_employees, quarantine_records
  profiling.json             pre-clean SQL profiling counts
  cleaning_log.json          machine-readable rule-by-rule impact
  summary.json               pre-aggregated metrics for the dashboard
  sample_compare.json        raw-vs-cleaned example records
excel/
  hr_raw_dataset.xlsx        Raw + Data dictionary (SYNTHETIC banner)
  hr_cleaned_dataset.xlsx    Cleaned + QA summary + Change log + Data dictionary
web/
  before-after.html          cleaning-impact + record comparison
  dashboard.html             interactive workforce dashboard
  project.css, assets/       styling, vendored Plotly, data bundle, app.js
sql/
  schema.sql, profiling.sql, cleaning.sql
scripts/
  config.py, generate.py, profile_data.py, clean.py, export.py, excel_utils.py, pipeline.py
```

## 8. How to re-run

```bash
# from this folder, with Python 3.12
py -3 -m pip install -r requirements.txt   # use python3 on macOS/Linux
py -3 scripts/pipeline.py                  # regenerate every artifact (seed 42)
py -3 scripts/pipeline.py --n 1000         # smaller run
```

The web pages are **static**. View them via any static server, e.g. from the
repository root:

```bash
py -3 -m http.server 8000
# open http://localhost:8000/projects/hr-data-quality/web/dashboard.html
```

The dashboard reads pre-aggregated JSON (bundled as `assets/data-bundle.js` so
the pages also open directly from the file system) and a locally vendored copy
of Plotly — no CDN, no backend, no build step.

## 9. Tools

Python (pandas, NumPy, Faker, openpyxl), SQLite/SQL, Plotly (charts authored and
themed in Python), and static HTML/CSS/vanilla JS. Dependencies are pinned in
[`requirements.txt`](requirements.txt).
