Advanced Problem
Multilingual handwriting systems degrade quickly when normalization is inconsistent across data sources. You get duplicated symbols, conflicting stroke orders, and broken downstream analytics.
Step 1: Define canonical entry schema
{
  "char": "愛",
  "unicode": "U+611B",
  "language": "ja",
  "strokes": ["..."]
}
Step 2: Build deterministic normalization transforms
def normalize_entry(entry):
    # Normalize each field so the same character from different sources
    # compares equal: trim whitespace, uppercase the code point, lowercase
    # the language tag, and copy strokes so the result never aliases input.
    return {
        "char": entry["char"].strip(),
        "unicode": entry["unicode"].upper(),
        "language": entry["language"].lower(),
        "strokes": list(entry.get("strokes", [])),
    }