normalize function
normalize converts a string to a specified Unicode normalization form.
Signatures
normalize(str)
normalize(str, form)
Parameter | Type | Description |
---|---|---|
str | string | The string to normalize. |
form | keyword | The Unicode normalization form: NFC, NFD, NFKC, or NFKD (unquoted, case-insensitive keywords). Defaults to NFC. |
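A small sketch of the two signatures, based on the parameter descriptions above: the one-argument call should match an explicit NFC, and the form keyword should be accepted in any letter case. The column aliases are illustrative only.

```sql
-- The form is written as a bare keyword, not as a quoted string.
SELECT
    normalize(E'e\u0301') = normalize(E'e\u0301', NFC)      AS default_is_nfc,
    normalize(E'e\u0301', nfc) = normalize(E'e\u0301', NFC) AS case_insensitive;
```

Both columns are expected to be true if the function behaves as documented.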
Return value
normalize returns a string.
Details
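Unicode normalization converts the different code point sequences that can represent the same character into a single canonical form. For example, "é" can be stored as the single code point U+00E9 or as "e" followed by the combining acute accent U+0301. Normalization is useful when comparing strings that may have been encoded differently.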
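As a rough sketch of the comparison use case, the query below builds the same visible character in two ways with the E'...' escape syntax and compares the results before and after normalization. The column aliases are illustrative only, and the raw comparison assumes the default code-point-wise string equality.

```sql
-- Two encodings of "é": precomposed U+00E9 versus "e" followed by U+0301.
SELECT
    E'\u00e9' = E'e\u0301'                       AS raw_equal,
    normalize(E'\u00e9') = normalize(E'e\u0301') AS normalized_equal;
```

Under that assumption, raw_equal should be false and normalized_equal should be true, because NFC composes "e" plus U+0301 into U+00E9.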
The four normalization forms are:
- NFC (Normalization Form Canonical Composition): Canonical decomposition, followed by canonical composition. This is the default and most commonly used form.
- NFD (Normalization Form Canonical Decomposition): Canonical decomposition only. Characters are decomposed into their constituent parts.
- NFKC (Normalization Form Compatibility Composition): Compatibility decomposition, followed by canonical composition. This applies more aggressive transformations, converting compatibility variants to standard forms.
- NFKD (Normalization Form Compatibility Decomposition): Compatibility decomposition only.
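The difference between the canonical and compatibility forms can be sketched with the ligature U+FB01, which has only a compatibility decomposition. The query assumes the E'...' escape syntax, and the column aliases are illustrative only.

```sql
-- U+FB01 (LATIN SMALL LIGATURE FI) has no canonical decomposition,
-- so NFC and NFD leave it alone while NFKC and NFKD expand it to "fi".
SELECT
    normalize(E'\ufb01', NFC)  = E'\ufb01' AS nfc_unchanged,
    normalize(E'\ufb01', NFD)  = E'\ufb01' AS nfd_unchanged,
    normalize(E'\ufb01', NFKC) = 'fi'      AS nfkc_expanded,
    normalize(E'\ufb01', NFKD) = 'fi'      AS nfkd_expanded;
```

Each column is expected to be true: the canonical forms keep the ligature as a single code point, and the compatibility forms replace it with the two letters f and i.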
Examples
Normalize a string using the default NFC form:
SELECT normalize('é') AS normalized;
normalized
------------
é
NFC combines a base character with its combining marks:
SELECT normalize('é', NFC) AS nfc;
nfc
-----
é
NFD decomposes into base character + combining accent:
SELECT normalize('é', NFD) = E'e\u0301' AS is_decomposed;
is_decomposed
---------------
t
NFKC replaces compatibility characters such as ligatures with their plain equivalents:
SELECT normalize('ﬁ', NFKC) AS decomposed;
decomposed
------------
fi
NFKC converts superscripts to regular characters:
SELECT normalize('x²', NFKC) AS normalized;
normalized
------------
x2