Recently, we encountered an interesting issue related to field size validation when integrating newer systems with older backend systems, particularly involving Chinese characters and UTF-8 encoding.
Modern systems typically handle Chinese characters correctly as single characters. Thus, if you set a validation limit of, say, 255 characters on an input field, each Chinese character will count against 1 of that limit.
However, when these characters are passed to older backend systems still using UTF-8 encoding, each Chinese character may expand into multiple UTF-8 character slots. As a result, data originally within your set limit can unexpectedly exceed the backend’s field size, causing overflow errors or truncation.
You can see this in action below, comparing the results of encoding various characters to UTF-8:
A = \x41
3 = \x33
的 = \xe7\x9a\x84