std::format() fill character allowances

(Proposed resolution for LWG issues 3576 and 3639)

Introduction

This paper presents a proposed resolution for LWG issues 3576 and 3639, which concern the specification of fill characters in std::format().

This proposal follows prior discussion as recorded in the:

The current wording in [format.string.std]p1 restricts fill characters to "any character other than { or }". Depending on how "character" is interpreted, this may permit characters with a negative display width, characters with no display width, characters with a display width greater than one, characters with a varying display width, characters with an actual display width that differs from their estimated width, combining characters (with or without a non-combining lead character), decomposed characters, control characters, formatting characters, and emoji. The following table presents some examples of such characters.

Glyph | Estimated width | Code point(s) | Character name
>< | 1 | U+0007 | BELL
>< | 1 | U+0008 | BACKSPACE
> < | 1 | U+0009 | CHARACTER TABULATION
>​< | 1 | U+200B | ZERO WIDTH SPACE
>́< | 1 | U+0301 | COMBINING ACUTE ACCENT
>é< | 1 | U+00E9 | LATIN SMALL LETTER E WITH ACUTE
>é< | 1 | U+0065 U+0301 | LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
>è́̂̃̄< | 1 | U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 | LATIN SMALL LETTER E, COMBINING GRAVE ACCENT, COMBINING ACUTE ACCENT, COMBINING CIRCUMFLEX ACCENT, COMBINING TILDE, COMBINING MACRON
>ｅ< | 2 | U+FF45 | FULLWIDTH LATIN SMALL LETTER E
>ェ< | 2 | U+30A7 | KATAKANA LETTER SMALL E
>ｪ< | 1 | U+FF6A | HALFWIDTH KATAKANA LETTER SMALL E
>　< | 2 | U+3000 | IDEOGRAPHIC SPACE
>🤡< | 2 | U+1F921 | CLOWN FACE
>﷽< | 1 | U+FDFD | ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM
[ Note: The glyphs are presented in monospace font and may render inconsistently across browsers and operating systems. The glyphs are displayed between '<' and '>' characters to make it easier to see their presentation width. The estimated width value corresponds to [format.string.std]p11; that paragraph associates an estimated width of either one or two with all code points. In each case, the code point sequence constitutes a single extended grapheme cluster. — end note ]

It is likely that the displayed width differs from the estimated width for at least some of the cases above, most likely the last one. Unfortunately, no specification currently governs character display width; the actual width may vary based on font selection.

Use of a fill character with a display width other than one potentially prevents a std::format() implementation from properly aligning fields. Consider a format specification for a field of width four and a field argument with an estimated field width of one. The implementation is expected to insert fill characters to consume an estimated field width of three, but that is not possible if the fill character has an estimated field width of two. Portable behavior requires that the standard clarify the intended behavior for such characters.

A std::format() implementation must store or reference a fill character in some way. Fill character allowances may impose dynamic memory management requirements or increase the complexity of parsing standard format specifiers depending on implementation choices. Implementation choices may also cause fill character restrictions to be reflected in the ABI thus making it difficult to relax restrictions later. Portable behavior requires that the standard specify whether fill characters are restricted to those that are encoded as, for example, a single code unit, a single UCS scalar value, a stream-safe extended grapheme cluster [UAX#15], or an extended grapheme cluster of unbounded length.

Design considerations

Character encoding restrictions

Fill character allowances pose a performance and overhead tradeoff. Consider the following four options for fill character support.

  1. Allow any extended grapheme cluster (EGC).
  2. Allow any stream-safe EGC [UAX#15].
  3. Allow any single UCS scalar value.
  4. Allow any single UCS scalar value that is encoded using a single code unit.

The first option (any EGC) would require implementations to support EGCs that consist of an unbounded number of code points. This option implies dynamic memory management and would require implementations to identify EGC boundaries in the format string; a requirement that otherwise does not exist at present.

The second option (any stream-safe EGC) would require implementations to support EGCs that consist of up to 32 code points. This option allows an implementation to trade off dynamic memory allocation in favor of larger data structures, but still requires EGC boundary analysis of format strings.

The third option (any single UCS scalar value) avoids dynamic memory requirements and significant increases to sizes of data structures; the fill character could be stored in a single char32_t object.

The fourth option (any single code unit) reduces fill character storage requirements to a single code unit (char or wchar_t), but has the unfortunate side effect of making the permissible set of fill characters dependent on encoding. For example, U+00E9 (LATIN SMALL LETTER E WITH ACUTE) would be rejected in a UTF-8 encoded format string, but would be accepted in a UTF-16 encoded one. Similarly, U+1F921 (CLOWN FACE) would be rejected in a UTF-16 encoded format string, but accepted in a UTF-32 encoded one.
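The encoding dependence of the fourth option can be checked with a little arithmetic. The following is a minimal sketch, not part of the proposal; the function names are hypothetical, and the input is assumed to be a valid UCS scalar value.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units needed to encode a UCS scalar value in each encoding.
// Assumption: `c` is a valid scalar value (not a surrogate, at most U+10FFFF).
constexpr std::size_t utf8_units(char32_t c) {
    return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

constexpr std::size_t utf16_units(char32_t c) {
    return c < 0x10000 ? 1 : 2;
}

constexpr std::size_t utf32_units(char32_t) {
    return 1;
}

// Under option four, a fill character is accepted only if the count is 1.
static_assert(utf8_units(U'\u00E9') == 2, "U+00E9 needs two UTF-8 code units");
static_assert(utf16_units(U'\u00E9') == 1, "U+00E9 needs one UTF-16 code unit");
static_assert(utf16_units(U'\U0001F921') == 2, "U+1F921 needs a surrogate pair");
static_assert(utf32_units(U'\U0001F921') == 1, "U+1F921 needs one UTF-32 code unit");
```

The same scalar value therefore passes or fails a single-code-unit restriction depending only on the encoding of the format string.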

Estimated display width restrictions

The following behaviors represent possible options for formatting fields when the fill character has an estimated display width other than 1.

  1. Use an estimated width of one for the fill character regardless of the value specified in [format.string.std]p11.
  2. Overfill
    Insert fill characters until the estimated width of the formatted field argument and the fill characters meets or exceeds the field width.
  3. Underfill
    Insert fill characters so long as the estimated width of the formatted field argument and the fill characters does not exceed the field width.
  4. Pad with a different fill character
    Insert an alternate fill character known to have an estimated width of one when inserting the requested fill character would have caused the field width to be exceeded.
  5. Undefined, unspecified, or implementation-defined behavior
    Impose no portable behavior.
  6. Error
    Throw a format_error exception.

The following table illustrates the above options for std::format(">{:🤡^4}<\n", 'X'). Font selection will determine to what degree the results shown deviate from the reference alignment.

Behavioral choice | Result
(reference alignment) | >-X--<
Use an estimated width of one | >🤡X🤡🤡<
Overfill | >🤡X🤡<
Underfill | >X🤡<
Pad with a different fill character (space) | > X🤡<
Undefined, unspecified, or implementation-defined behavior | ???
Error | N/A
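The fill counts behind the first three rows can be reproduced with a short calculation. This is an illustrative sketch only: the names fill_policy, fill_count, padding, and center are hypothetical, and it assumes that each policy is applied independently to the padding before and after the field argument.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for three of the options listed above.
enum class fill_policy { assume_width_one, overfill, underfill };

// Number of fill characters emitted to cover `columns` columns of padding
// when each fill character has estimated width `fill_width` (1 or 2).
constexpr std::size_t fill_count(std::size_t columns, std::size_t fill_width,
                                 fill_policy policy) {
    return policy == fill_policy::assume_width_one ? columns
         : policy == fill_policy::overfill
               ? (columns + fill_width - 1) / fill_width  // round up
               : columns / fill_width;                    // round down
}

struct padding {
    std::size_t before;  // fill characters before the field argument
    std::size_t after;   // fill characters after the field argument
};

// Centered alignment: floor(n/2) columns before and ceil(n/2) columns after,
// where n is the number of columns left over by the field argument.
constexpr padding center(std::size_t field_width, std::size_t value_width,
                         std::size_t fill_width, fill_policy policy) {
    if (field_width <= value_width) return padding{0, 0};
    const std::size_t n = field_width - value_width;
    return padding{fill_count(n / 2, fill_width, policy),
                   fill_count(n - n / 2, fill_width, policy)};
}

// Field width 4, argument 'X' of width 1, fill U+1F921 of estimated width 2.
static_assert(center(4, 1, 2, fill_policy::assume_width_one).before == 1, "");
static_assert(center(4, 1, 2, fill_policy::assume_width_one).after == 2, "");
static_assert(center(4, 1, 2, fill_policy::overfill).before == 1, "");
static_assert(center(4, 1, 2, fill_policy::overfill).after == 1, "");
static_assert(center(4, 1, 2, fill_policy::underfill).before == 0, "");
static_assert(center(4, 1, 2, fill_policy::underfill).after == 1, "");
```

Padding with a different fill character would correspond to the underfill counts plus one alternate character for each column left uncovered.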

Existing practice

The following table illustrates existing behavior for several std::format() implementations when the example characters from the introduction are used as the fill character and with a field argument of 'X'. The first row provides a reference.

Code point(s) | Format string | Clang 15 trunk with libc++ | Gcc 11.2 with fmt 8.1.1 | MSVC 19.31
U+002D | ">{:-^4}<" | >-X--< | >-X--< | >-X--<
U+0007 | ">{:^4}<" | >X< | >X< | >X<
U+0008 | ">{:^4}<" | >X< | >X< | >X<
U+0009 | ">{: ^4}<" | > X   < | > X   < | > X   <
U+200B | ">{:​^4}<" | Error (1) | >​X​​< | >​X​​<
U+0301 | ">{:́^4}<" | Error (1) | >́X́́< | >́X́́<
U+00E9 | ">{:é^4}<" | Error (1) | >éXéé< | >éXéé<
U+0065 U+0301 | ">{:é^4}<" | Error (1) | Error (2) | Error (3)
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 | ">{:è́̂̃̄^4}<" | Error (1) | Error (2) | Error (3)
U+FF45 | ">{:ｅ^4}<" | Error (1) | >ｅXｅｅ< | >ｅXｅｅ<
U+30A7 | ">{:ェ^4}<" | Error (1) | >ェXェェ< | >ェXェェ<
U+FF6A | ">{:ｪ^4}<" | Error (1) | >ｪXｪｪ< | >ｪXｪｪ<
U+3000 | ">{:　^4}<" | Error (1) | >　X　　< | >　X　　<
U+1F921 | ">{:🤡^4}<" | Error (1) | >🤡X🤡🤡< | >🤡X🤡🤡<
U+FDFD | ">{:﷽^4}<" | Error (1) | >﷽X﷽﷽< | >﷽X﷽﷽<

1) Clang with libc++ restricts fill characters to characters that are encoded as a single code unit; an exception is thrown that, if not caught, aborts the process with the following error message.

libc++abi: terminating with uncaught exception of type std::__1::format_error: The format-spec should consume the input or end with a '}'

2) Gcc with fmt restricts fill characters to characters that are encoded as a single code point. Compilation fails with the following error.

error: call to non-'constexpr' function 'void fmt::v8::detail::error_handler::on_error(const char*)'

3) MSVC restricts fill characters to characters that are encoded as a single code point. Compilation is successful, but program execution terminates with an exit code of 3221226505 (0xC0000409: STATUS_STACK_BUFFER_OVERRUN).

All surveyed implementations assume an estimated width of 1 for fill characters regardless of the estimated width values specified in [format.string.std]p11.
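The observed behavior amounts to counting fill characters rather than columns. The following sketch models that behavior under simplifying assumptions: the function name is hypothetical, the field argument is assumed to be ASCII (one column per code unit), and the fill is passed as an already validated sequence of code units. It is not taken from any of the surveyed implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Pads `value` to `field_width` by inserting whole copies of `fill`,
// treating every fill character as one column wide.
// Assumption: `value` is ASCII, so its estimated width equals its size.
inline std::string pad_counting_fills(std::string_view value,
                                      std::size_t field_width, char align,
                                      std::string_view fill) {
    if (value.size() >= field_width) return std::string(value);
    const std::size_t n = field_width - value.size();
    // '<' pads after, '>' pads before, '^' puts floor(n/2) before.
    const std::size_t before = align == '>' ? n : align == '^' ? n / 2 : 0;
    std::string out;
    for (std::size_t i = 0; i < before; ++i) out.append(fill.data(), fill.size());
    out.append(value.data(), value.size());
    for (std::size_t i = before; i < n; ++i) out.append(fill.data(), fill.size());
    return out;
}
```

With a fill of "-" and a field width of 4, the argument "X" yields "-X--", matching the reference row. With the UTF-8 encoding of U+1F921 as the fill, the result contains three copies of the fill character even though each has an estimated width of two.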

Proposal

Implementation experience

This proposal has not been implemented.

Implementation impact

Some implementations, libc++ for example, will require changes to allow any single UCS scalar value to be specified as a fill character. This may impose new encoding awareness requirements on format string parsers so that fill characters encoded with more than one code unit are correctly decoded.
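As a rough illustration of the encoding awareness involved, the sketch below decodes one UCS scalar value from the start of a UTF-8 encoded std-format-spec and treats it as the fill only when an alignment option follows. All names here (fill_align, decode_utf8, parse_fill_align) are hypothetical, the format string is assumed to be UTF-8, and no existing implementation is claimed to work this way.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

struct fill_align {
    char32_t fill = U' ';      // default fill: U+0020 SPACE
    char align = '\0';         // '<', '>', '^', or '\0' if absent
    std::size_t consumed = 0;  // code units consumed from the spec
};

// Decodes one scalar value from the front of UTF-8 text `s`.
// Returns the number of code units read, or 0 if `s` is empty or ill-formed.
inline std::size_t decode_utf8(std::string_view s, char32_t& out) {
    if (s.empty()) return 0;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else return 0;                 // stray continuation or invalid lead byte
    if (s.size() < len) return 0;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
    constexpr char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    out = cp;
    return len;
}

inline bool is_align(char c) { return c == '<' || c == '>' || c == '^'; }

// Parses an optional fill-and-align at the start of `spec` (the text that
// follows ':' in a replacement field).
inline fill_align parse_fill_align(std::string_view spec) {
    fill_align result;
    char32_t cp = 0;
    const std::size_t n = decode_utf8(spec, cp);
    if (n != 0 && cp != U'{' && cp != U'}' && spec.size() > n && is_align(spec[n])) {
        result.fill = cp;        // scalar value followed by an alignment option
        result.align = spec[n];
        result.consumed = n + 1;
    } else if (!spec.empty() && is_align(spec[0])) {
        result.align = spec[0];  // alignment option without a fill
        result.consumed = 1;
    }
    return result;
}
```

For the spec "🤡^4" encoded as UTF-8, this sketch reports a fill of U+1F921, an alignment of '^', and five consumed code units; storing the fill in a char32_t keeps the result independent of the number of code units used to encode it.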

Acknowledgements

Thank you to Victor Zverovich, Corentin Jabot, and Peter Brett for their insights; their commentary shaped much of this proposal.

References

[N4910] "Working Draft, Standard for Programming Language C++", N4910, 2022.
https://wg21.link/n4910
[UAX#15] Ken Whistler,
"Unicode Standard Annex #15 - Unicode Normalization Forms",
Revision 51, Unicode 14.0.0, 2021.
https://www.unicode.org/reports/tr15/tr15-51.html

Wording

These changes are relative to N4910 [N4910].


Change in 22.14.2.2 [format.string.std] paragraph 1:

[…]
The syntax of format specifications is as follows:

[…]
fill:
any character other than <del>{</del><ins>U+007B LEFT CURLY BRACKET</ins> or <del>}</del><ins>U+007D RIGHT CURLY BRACKET</ins>
[…]

Change in 22.14.2.2 [format.string.std] paragraph 2:

The fill character is the character denoted by the fill specifier or, if the fill specifier is absent, U+0020 SPACE.

[Note 2: The fill character can be any character other than { or }. The presence of a <del>fill character</del><ins>fill specifier</ins> is signaled by the character following it, which must be one of the alignment options. If the second character of std-format-spec is not a valid alignment option, then it is assumed that <del>both the fill character and the alignment option are</del><ins>the fill-and-align specifier is</ins> absent. — end note]

Change in 22.14.2.2 [format.string.std] paragraph 3:
Drafting note: Note X and the notes that follow it will require renumbering.

The align specifier applies to all argument types. The meaning of the various alignment options is as specified in Table 64.

[ Example 1:
char c = 120;
string s0 = format("{:6}", 42);           // value of s0 is "    42"
string s1 = format("{:6}", 'x');          // value of s1 is "x     "
string s2 = format("{:*<6}", 'x');        // value of s2 is "x*****"
string s3 = format("{:*>6}", 'x');        // value of s3 is "*****x"
string s4 = format("{:*^6}", 'x');        // value of s4 is "**x***"
string s5 = format("{:6d}", c);           // value of s5 is "   120"
string s6 = format("{:6}", true);         // value of s6 is "true  "
string s7 = format("{:*>6}", "12345678"); // value of s7 is "12345678"
end example ]

[ Note 3: Unless a minimum field width is defined, the field width is determined by the size of the content and the alignment option has no effect. — end note ]

[ Note X: If the width of the formatting argument value exceeds the field width, then the alignment option has no effect. — end note ]

Table 64: Meaning of align options [tab:format.align]
Option | Meaning
< | Forces the field to be aligned to the start of the available space by inserting n fill characters after the formatting argument value where n is the number of fill characters needed to align to the field width. This is the default for non-arithmetic non-pointer types, charT, and bool, unless an integer presentation type is specified.
> | Forces the field to be aligned to the end of the available space by inserting n fill characters before the formatting argument value where n is the number of fill characters needed to align to the field width. This is the default for arithmetic types other than charT and bool, pointer types, or when an integer presentation type is specified.
^ | Forces the field to be centered within the available space by inserting ⌊n/2⌋ fill characters before and ⌈n/2⌉ fill characters after the formatting argument value, where n is the total number of fill characters <del>to insert</del><ins>needed to align to the field width</ins>.

Change in 22.14.2.2 [format.string.std] paragraph 11:

For a string in a Unicode encoding, implementations should estimate the width of a string as the sum of estimated widths of the first <del>code points</del><ins>UCS scalar values</ins> in its extended grapheme clusters. The extended grapheme clusters of a string are defined by UAX #29. The estimated width of the following <del>code points</del><ins>UCS scalar values</ins> is 2:

[ … ]

The estimated width of other <del>code points</del><ins>UCS scalar values</ins> is 1.