Status: Implemented
utf8.len
and other functions in UTF-8 library will do more rigorous validation of UTF-8 input, rejecting UTF-16 surrogates encoded into UTF-8.
We use the UTF-8 library utf8
from Lua 5.3. The implementation of this library mostly correctly validates UTF-8 input and either throws errors (e.g. from utf8.codes
) or returns nil
(for utf8.len
) for invalid input.
However, certain invalid UTF-8 sequences are treated as valid. This specifically includes surrogate characters encoded in UTF-8; for example, the string "\237\160\128"
represents an attempt to encode the surrogate 0xD800
into UTF-8.
This is impossible to do in UTF-8; Lua 5.3 accepts this (utf8.len
returns 1), and so does Luau.
This creates issues for any other API that correctly validates UTF-8. For example, Roblox extends Luau with a few extra functions, like utf8.nfcnormalize
, that perform UTF-8 validation correctly and reject the aforementioned string.
As a result, due to Roblox extensions, the utf8
library exposed in Roblox is inconsistent (this is also a problem in other Roblox-specific APIs like DataStores that reject invalid UTF-8 inputs that utf8.len
accepts).
We would also expect that in other environments, extra functions that handle UTF-8 inputs and validate them do the validation properly.
Lua 5.4 fixes this by changing the default behavior of utf8.len
and other UTF-8 functions to error when surrogates are present. It also fixes an issue in utf8.codes
that failed to properly reject overlong UTF-8 sequences in certain cases; Luau already contains a fix to this issue.
All functions from Luau utf8
library will correctly validate UTF-8 and reject inputs that have surrogates. The functions will use their existing error handling strategy: utf8.len
will return nil
followed by the byte offset, utf8.codepoint
and utf8.codes
will error.
Any code using these functions should already be prepared for handling errors: utf8
functions will already do a validation of UTF-8 input, the validation was just not fully comprehensive. It is expected that this will not cause any significant compatibility risk, because many other functions already reject surrogates and the use of utf8.
function family in general is rare.
Notably, this is a partial implementation of Lua 5.4 behavior. In addition to making the default behavior stricter, Lua 5.4 relaxes utf8.char
(allowing it to produce “codepoints” that exceed 0x10FFFF and are invalid per Unicode standard) unconditionally, and also provides an optional argument lax
to a few functions like utf8.len
that allows to accept both surrogates and out-of-range (larger than 0x10FFFF
) codepoints.
We do not propose to adopt the full Lua 5.4 behavior or introduce the new lax
argument for the following reasons:
luarocks
, we conclude that lax
argument is never used outside of Lua’s own tests. As such, it’s questionable whether we need to give a way to relax the checks.lax
defaults to false
). We agree with this choice but conclude that the compatibility risk, however small, exists regardless of the argument’s presence.utf8.char
, and conditionally on lax
for other utf8
functions), they allow easier production of invalid UTF-8 sequences.It is likely that 5.4 semantics change is motivated by only having two types of behavior: “strict” (proposed here) and “lax” (supporting a variant of UTF-8 that allows encoding any codepoint value). Because it is always possible to add the “lax” behavior later, and we do not want to establish the precedent and require this addition to all other functions (for example, what should nfcnormalize
do for surrogates?), we only specify the “strict” behavior here. The “lax” behavior could be added by a separate RFC if there’s significant demand for it; that would be fully backwards compatible.
After this change, in Roblox strings that are correctly validated by
utf8.len
should also work properly withutf8.nfcnormalize
,utf8.graphemes
, DataStore APIs and other functions that validate UTF-8 input.
Because this technically changes behavior, due to Hyrum’s law we might break some code that really wants to encode surrogates in UTF-8 despite the fact that only a couple functions support this.
Because this doesn’t fully align behavior with Lua 5.4, it is possible that we will get programs in the future that want the full Lua 5.4 treatment (allowing lax
). It is likely that this will happen because of larger codepoints for some odd custom encoding reasons though, not for surrogates specifically, and if it does we can add it later.
Other than doing nothing, we could implement full Lua 5.4 behavior (with lax
argument and relaxed range checks in some existing functions). This leaves the question as to what the behavior of Roblox-specific utf8 extensions like nfcnormalize
or graphemes
should be, as they would become inconsistent not just in implementation but also in interface without further changes.