Discussion:
string_view, the standard library, and NUL-terminated strings.
Nicol Bolas
2017-10-24 15:10:01 UTC
The new mailing is out now, and there are a number of proposals for
`string_view`-izing aspects of the standard library. That's a good thought.

However, there are some concerns I have with certain aspects of it.

There are a number of APIs whose underlying implementations *require* a
NUL-terminated string (NTS). Or at the very least, where NUL characters
cannot legally appear.

Consider P0555 <http://wg21.link/P0555>: using string_view for
source_location. The source location is ultimately a path string. And NUL
characters cannot appear in path strings. Similarly, P0781
<http://wg21.link/P0781>, which gives us a `main` signature based on
`string_view`. I don't believe it's possible to put NUL characters into
command line parameters. And even if you can, I'm fairly sure most
applications will choke on them.

These aren't really "problems", in that they can't break anything. But they
are a bit slower, since they all require the system to give lengths to
these things. While `source_location` can statically determine the length
of its location strings, `main` cannot statically determine the lengths of
its command-line parameters. So those would have to be computed before
`main`. And doing computations before `main` isn't the best idea.
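
To make that cost concrete, here is a minimal sketch of the work a runtime would have to do before calling a `string_view`-based `main`. The name `make_args` is hypothetical and is not taken from P0781; the only point is that building each `string_view` from a `char*` has to walk the string to find its length.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not from P0781): turn argv into string_views.
std::vector<std::string_view> make_args(int argc, const char* const* argv) {
    std::vector<std::string_view> args;
    args.reserve(static_cast<std::size_t>(argc));
    for (int i = 0; i < argc; ++i) {
        // string_view(const char*) finds the length with a strlen-like
        // scan, so every argument is traversed once before main runs.
        args.emplace_back(argv[i]);
    }
    return args;
}
```

Each element stores a pointer and a length, so the result is also larger than the original `argv` array.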

What is more genuinely problematic are cases where you are passing a
`string_view` to an API where the underlying implementations consume NTSs.
All file operations, for example. This imposes overhead, since such an API
which consumes a `string_view` *must* allocate memory and copy it into a
buffer before using it.

So ironically, it is more efficient to pass a filename as a `std::string`
than as a `string_view`. (That holds for filesystems where paths are stored
in UTF-8 rather than UTF-16, of course. In the latter case, you're going to
be allocating memory unless you pass a `path` directly.)
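
A rough sketch of the difference, assuming the underlying call is `std::fopen` (any API that takes a NUL-terminated name would do). The helper names `make_terminated`, `open_from_view` and `open_from_string` are made up for illustration and are not part of any proposal.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: make a NUL-terminated copy of a view.
// The std::string constructor copies the characters and appends the NUL,
// which may allocate for anything longer than the small-string buffer.
std::string make_terminated(std::string_view name) {
    return std::string(name);
}

// Taking a string_view: a copy is needed before calling the C API,
// because the view may be a slice of a longer string.
std::FILE* open_from_view(std::string_view name) {
    const std::string terminated = make_terminated(name);
    return std::fopen(terminated.c_str(), "r");
}

// Taking a std::string: c_str() is already terminated, so no copy.
std::FILE* open_from_string(const std::string& name) {
    return std::fopen(name.c_str(), "r");
}
```

A view such as `std::string_view("file.txt.bak").substr(0, 8)` shows why the copy cannot be skipped: the byte after the view is `'.'`, not NUL.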

P0502 <http://wg21.link/P0502> essentially goes over all of the standard
library APIs that take `basic_string` and/or `char*` and replaces them
with `string_view`, without any consideration for this issue.

For example, this proposal suggests replacing the string constructors of
many `exception` types with a `string_view`. Well, if the user puts a
NUL character in such a string, then the string is going to be truncated,
even though the API appears to allow NUL characters (since `string_view`
allows them). This is due to the fact that `std::exception::what()` returns
an NTS. So long as that is the case, `exception`-derived classes *cannot*
use strings with NUL characters in them effectively.
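
The truncation is easy to observe with today's `std::string` constructor. This is a small sketch; `visible_length` is a hypothetical name used only here.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Returns the length of the message as seen through what(), which hands
// back a NUL-terminated const char* and therefore stops at the first NUL.
std::size_t visible_length(const std::string& message) {
    const std::runtime_error err(message);
    return std::strlen(err.what());
}
```

Passing `std::string("ab\0cd", 5)` gives a visible length of 2, even though the string that was handed in has five characters.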

String view, as a concept, is a really good idea. It's much safer than a
`const char*`. But `std::string_view` promises something that `const char*`
does not: embedded NUL characters. And we should not transition any API to
`string_view` *unless* it too can promise the use of embedded NUL
characters.

This is where an alternative, NUL-terminated string view class would come
in very handy. And thanks to the Range TS's Iterator/Sentinel model for
iterators and its associated algorithms, you can still interact with such a
class as a real range. It'd still retain the safety of a `const char*`, but
its type would tell you right away that NUL characters are not allowed.
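
As a minimal sketch of what such a class could look like (the names `zstring_view` and `zsentinel` are hypothetical, not `std::` types and not taken from any paper): `begin()` returns a plain pointer, `end()` returns an empty sentinel, and the comparison against the sentinel is what checks for the terminator. It assumes C++17, where range-based `for` accepts different types for begin and end.

```cpp
#include <cstddef>

// Hypothetical sentinel type: "the end is wherever the NUL is".
struct zsentinel {};

// An iterator is "not at the end" as long as it does not point at the NUL.
constexpr bool operator!=(const char* it, zsentinel) noexcept {
    return *it != '\0';
}

// Hypothetical NUL-terminated view: stores only a pointer, no length.
class zstring_view {
public:
    explicit constexpr zstring_view(const char* s) noexcept : ptr_(s) {}
    constexpr const char* c_str() const noexcept { return ptr_; }
    constexpr const char* begin() const noexcept { return ptr_; }
    constexpr zsentinel end() const noexcept { return {}; }
private:
    const char* ptr_;
};

// Counts characters by iterating the range; no stored size is needed.
inline std::size_t length(zstring_view s) {
    std::size_t n = 0;
    for (char c : s) {
        (void)c;
        ++n;
    }
    return n;
}
```

Because no length is stored, constructing the view costs nothing beyond copying the pointer, and `c_str()` can be handed straight to an NTS-consuming API.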
--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+***@isocpp.org.
To post to this group, send email to std-***@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/58cedd2a-56e3-4aa3-b3d1-725fa6fcdf7b%40isocpp.org.
Thiago Macieira
2017-10-24 21:27:35 UTC
Post by Nicol Bolas
Consider P0555 <http://wg21.link/P0555>: using string_view for
source_location. The source location is ultimately a path string. And NUL
characters cannot appear in path strings. Similarly, P0781
<http://wg21.link/P0781>, which gives us a `main` signature based on
`string_view`. I don't believe it's possible to put NUL characters into
command line parameters. And even if you can, I'm fairly sure most
applications will choke on them.
That was not P0781's objective. The author is merely proposing getting rid of
the plain pointers and C arrays, by wrapping them in more C++-like objects.

When that discussion turned up, I pointed out that having a cstring_view or
zstring_view would be even more beneficial, since on most architectures the
compiler could initialise the std::initializer_list<xstring_view> object in
constant memory, regardless of how many items exist in the list. That happens
at the expense of not pre-calculating the strings' sizes.

The author of the paper does address that, when he says that it's likely one
would do that anyway, so we may as well do it all in the beginning.

There's a significant expansion of memory used, though, and that's what worries
me more with the use of string_view. On a system with 128 kB command-lines,
the worst case scenario is 64k 1-byte arguments, in which case the kernel must
have already used 64k*8 = 512 kB of RAM for argv. By creating the
std::initializer_list, we will consume another 1 MB, for a total of 1.5 MB
before main() even starts. That's 18.75% of a standard stack size on Linux.
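
The arithmetic above can be checked with a few lines. This sketch assumes 8-byte pointers, a `string_view` made of one pointer plus one size (so twice the pointer size), and an 8 MB stack, which is the figure the 18.75% implies; the function names are made up for illustration.

```cpp
// Bytes used by an argv-style array of n pointers.
constexpr unsigned long long argv_bytes(unsigned long long n,
                                        unsigned long long pointer_size) {
    return n * pointer_size;
}

// Bytes used by n string_views, assuming each is a pointer plus a size.
constexpr unsigned long long view_bytes(unsigned long long n,
                                        unsigned long long pointer_size) {
    return n * 2 * pointer_size;
}

// Share of `whole` taken by `part`, in hundredths of a percent,
// so that 18.75% comes out as the integer 1875.
constexpr unsigned long long percent_x100(unsigned long long part,
                                          unsigned long long whole) {
    return part * 10000 / whole;
}
```

With 64k arguments and 8-byte pointers, `argv_bytes` gives 512 kB and `view_bytes` gives 1 MB. Together that is 1.5 MB, or 18.75% of an 8 MB stack.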

Moreover, the kernel does know if it is exhausting memory, in which case it
will fail to launch the application in the first place; we have no such
protection in the runtime.

$ ulimit -Ss 16
$ command true $(for ((i = 0; i < 8100; ++i)); do echo a; done)
zsh: Argument list too long: true

For that reason, I find it dangerous to use string_view and would support there
being a cstring_view or zstring_view. And if those exist, we should also
re-evaluate the use of string_view in APIs that eventually do call NTS OS APIs,
Post by Nicol Bolas
What is more genuinely problematic are cases where you are passing a
`string_view` to an API where the underlying implementations consume NTSs.
All file operations, for example. This imposes overhead, since such an API
which consumes a `string_view` *must* allocate memory and copy it into a
buffer before using it.
There's a way to avoid it: the runtime can detect if the string_view happens
to be an NTS. That requires reading past the end of the string_view, which
requires processor-specific knowledge and will possibly trigger warnings in
MSan and Valgrind. That means it should be implemented in assembly, not in
C++.

All you need to do is see if the last valid byte points to the end of a page.
If it does not, then you can read the next byte without crashing.
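
The address check itself is simple; a sketch of only that part follows. The function name and the explicit `page_size` parameter are assumptions for illustration. The actual read of the following byte is deliberately left out, since it falls outside what the language allows and is the part that would need to live in assembly.

```cpp
#include <cassert>
#include <cstdint>

// True when the byte following `last_byte_addr` lies on the same page,
// i.e. the last valid byte is not the final byte of its page.
// Only the address arithmetic is shown; nothing is dereferenced here.
constexpr bool next_byte_on_same_page(std::uintptr_t last_byte_addr,
                                      std::uintptr_t page_size) {
    assert(page_size != 0);
    return (last_byte_addr + 1) % page_size != 0;
}
```

With 4096-byte pages, a last byte at offset 4094 within a page passes the check, while one at offset 4095 does not.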
Post by Nicol Bolas
So ironically, it is more efficient to pass a filename as a `std::string`
than as a `string_view` (for filesystems where paths are stored in UTF-8
rather than UTF-16, of course. In the latter, you're going to be allocating
memory unless you pass a `path` directly).
Right. This is something zstring_view would be handy for.
Post by Nicol Bolas
For example, this proposal suggests replacing the string constructors of
many `exception` types with a `string_view`. Well, if the user puts a
NUL character in such a string, then the string is going to be truncated,
even though the API appears to allow NUL characters (since `string_view`
allows them). This is due to the fact that `std::exception::what()` returns
an NTS. So long as that is the case, `exception`-derived classes *cannot*
use strings with NUL characters in them effectively.
Well, that's the same problem as passing a std::string containing such a NUL
to a filesystem API. The API declares that it is not permitted, so it will
either throw, fail an assertion or just generally misbehave if you do. So I
don't specifically see a problem with embedded NULs.

The problem is that you'd be allowed to pass a non-terminated string to
std::exception, which has nowhere to store that length without breaking binary
compatibility. That means the std::terminate() handler could try to print past
the end of valid memory and cause a memory violation error.
Post by Nicol Bolas
String view, as a concept, is a really good idea. It's much safer than a
`const char*`. But `std::string_view` promises something that `const char*`
does not: embedded NUL characters. And we should not transition any API to
`string_view` *unless* it too can promise the use of embedded NUL
characters.
Again, I don't see a big issue with embedded NULs: you just forbid them in the
API contract. It's the lack of termination that is an issue.
Post by Nicol Bolas
This is where an alternative, NUL-terminated string view class would come
in very handy. And thanks to the Range TS's Iterator/Sentinel model for
iterators and its associated algorithms, you can still interact with such a
class as a real range. It'd still retain the safety of a `const char*`, but
its type would tell you right away that NUL characters are not allowed.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
--
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/4188670.k1hKsfDcxt%40tjmaciei-mobl1.
Nicol Bolas
2017-10-24 23:30:29 UTC
Post by Thiago Macieira
Post by Nicol Bolas
Consider P0555 <http://wg21.link/P0555>: using string_view for
source_location. The source location is ultimately a path string. And NUL
characters cannot appear in path strings. Similarly, P0781
<http://wg21.link/P0781>, which gives us a `main` signature based on
`string_view`. I don't believe it's possible to put NUL characters into
command line parameters. And even if you can, I'm fairly sure most
applications will choke on them.
That was not P0781's objective. The author is merely proposing getting rid of
the plain pointers and C arrays, by wrapping them in more C++-like objects.
I understand that. But it is an *effect* of that API, even though it is not
intended.

Post by Thiago Macieira
When that discussion turned up, I pointed out that having a cstring_view or
zstring_view would be even more beneficial, since on most architectures the
compiler could initialise the std::initializer_list<xstring_view> object in
constant memory, regardless of how many items exist in the list. That happens
at the expense of not pre-calculating the strings' sizes.
The author of the paper does address that, when he says that it's likely one
would do that anyway, so we may as well do it all in the beginning.
There's a significant expansion of memory used, though, and that's what worries
me more with the use of string_view. On a system with 128 kB command-lines,
the worst case scenario is 64k 1-byte arguments, in which case the kernel must
have already used 64k*8 = 512 kB of RAM for argv. By creating the
std::initializer_list, we will consume another 1 MB, for a total of 1.5 MB
before main() even starts. That's 18.75% of a standard stack size on Linux.
Moreover, the kernel does know if it is exhausting memory, in which case it
will fail to launch the application in the first place; we have no such
protection in the runtime.
$ ulimit -Ss 16
$ command true $(for ((i = 0; i < 8100; ++i)); do echo a; done)
zsh: Argument list too long: true
For that reason, I find it dangerous to use string_view and would support there
being a cstring_view or zstring_view. And if those exist, we should also
re-evaluate the use of string_view in APIs that eventually do call NTS OS APIs,
Post by Nicol Bolas
What is more genuinely problematic are cases where you are passing a
`string_view` to an API where the underlying implementations consume NTSs.
All file operations, for example. This imposes overhead, since such an API
which consumes a `string_view` *must* allocate memory and copy it into a
buffer before using it.
There's a way to avoid it: the runtime can detect if the string_view happens
to be an NTS. That requires reading past the end of the string_view, which
requires processor-specific knowledge and will possibly trigger warnings in
MSan and Valgrind. That means it should be implemented in assembly, not in
C++.
All you need to do is see if the last valid byte points to the end of a page.
If it does not, then you can read the next byte without crashing.
Post by Nicol Bolas
So ironically, it is more efficient to pass a filename as a `std::string`
than as a `string_view` (for filesystems where paths are stored in UTF-8
rather than UTF-16, of course. In the latter, you're going to be allocating
memory unless you pass a `path` directly).
Right. This is something zstring_view would be handy for.
Post by Nicol Bolas
For example, this proposal suggests replacing the string constructors of
many `exception` types with a `string_view`. Well, if the user puts a
NUL character in such a string, then the string is going to be truncated,
even though the API appears to allow NUL characters (since `string_view`
allows them). This is due to the fact that `std::exception::what()` returns
an NTS. So long as that is the case, `exception`-derived classes *cannot*
use strings with NUL characters in them effectively.
Well, that's the same problem as passing a std::string containing such a NUL
to a filesystem API.
True enough. Which is why:

1: There should not be implicit conversion to `z/cstring_view` from
`std::string` (or from any string type that can handle embedded NULs).

2: APIs that cannot handle embedded NUL characters should not take
`std::string` in their APIs.

Post by Thiago Macieira
The API declares that it is not permitted, so it will
either throw, fail an assertion or just generally misbehave if you do. So I
don't specifically see a problem with embedded NULs.
It's better to embed the contract in the type itself, where
possible/reasonable, than to have the type and contract be separate.
Especially if we already have the need for a type that has such a contract
by design.

To put it another way, I wouldn't suggest `c/zstring_view` simply because
it lets us express this contract within a type for such APIs. But since
we have identified places where we have a genuine need for
`c/zstring_view`, then we should use that class in places where we *do*
have such a contract.

If it's wrong to pass strings with embedded NUL characters to file APIs,
then it's just as wrong to do it to `std::exception`-derived types.

Post by Thiago Macieira
The problem is that you'd be allowed to pass a non-terminated string to
std::exception, which has nowhere to store that length without breaking binary
compatibility.
That's not a problem. `std::runtime_error` and other
`std::exception`-derived objects don't store a reference to the string;
they store a *copy* of it (they'd be non-functional otherwise, since odds
are good that any string they're provided will be destroyed by the time the
exception is caught). And P0502 doesn't change that.

So they can create a NUL-terminated byte sequence from a non-NUL-terminated
`string_view` just fine.

What's interesting is that the behavior of `std::runtime_error` with a
string with embedded NULs is well-defined. The post condition is defined in
terms of `strcmp`, so it will stop at the embedded NUL character.
--
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/4f96134d-00f3-4e88-865e-3205ed7ba68f%40isocpp.org.
Thiago Macieira
2017-10-24 23:39:32 UTC
Post by Nicol Bolas
Post by Thiago Macieira
That was not P0781's objective. The author is merely proposing getting rid of
the plain pointers and C arrays, by wrapping them in more C++-like objects.
I understand that. But it is an *effect* of that API, even though it is not
intended.
No, that's exactly the author's intention...
Post by Nicol Bolas
Post by Thiago Macieira
Well, that's the same problem as passing a std::string containing such a NUL
to a filesystem API.
1: There should not be implicit conversion to `z/cstring_view` from
`std::string` (or from any string type that can handle embedded NULs).
Correct, there can't be. But there can be an explicit conversion which can
fail.
Post by Nicol Bolas
2: APIs that cannot handle embedded NUL characters should not take
`std::string` in their APIs.
I disagree.

I think it's fine to take std::string and put extra constraints on top on what
the string can actually contain. The limitation isn't just for NUL, certain
other characters and combinations may be disallowed too. It's very normal for
filesystems to reject certain characters (Windows does that) and on some
systems it could additionally require properly-encoded UTF-8. Other APIs may
require only US-ASCII; others could be limited to [a-zA-Z_][A-Za-z0-9_]*.
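
As an illustration of a contract that is narrower than the type, here is a sketch of a check for the last pattern mentioned, `[a-zA-Z_][A-Za-z0-9_]*`. The function name `is_identifier` is hypothetical; an API taking `std::string` could run such a check on entry and reject anything else, embedded NULs included.

```cpp
#include <string>

// Hypothetical validator for the pattern [a-zA-Z_][A-Za-z0-9_]*.
// The parameter type allows any bytes; the contract is enforced here.
inline bool is_identifier(const std::string& s) {
    const auto is_letter = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    };
    const auto is_digit = [](char c) { return c >= '0' && c <= '9'; };

    if (s.empty() || !is_letter(s[0])) {
        return false;  // must start with a letter or underscore
    }
    for (char c : s) {
        if (!is_letter(c) && !is_digit(c)) {
            return false;  // rejects NUL along with everything else
        }
    }
    return true;
}
```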
Post by Nicol Bolas
Post by Thiago Macieira
The API declares that it is not permitted, so it will
either throw, fail an assertion or just generally misbehave if you do. So I
don't specifically see a problem with embedded NULs.
It's better to embed the contract in the type itself, where
possible/reasonable, than to have the type and contract be separate.
Especially if we already have the need for a type that has such a contract
by design.
Agreed in principle. But see above for the practice regarding std::string.
Post by Nicol Bolas
To put it another way, I wouldn't suggest `c/zstring_view` simply because
it lets us express this contract within a type for such APIs. But since
we have identified places where we have a genuine need for
`c/zstring_view`, then we should use that class in places where we *do*
have such a contract.
I would suggest either c/zstring_view or std::string simply because certain
functions can then work without allocating memory in a significant number of
OSes (most of the fs ones won't be noexcept since they are cancellation
points, but that's another story).
Post by Nicol Bolas
If it's wrong to pass strings with embedded NUL characters to file APIs,
then it's just as wrong to do it to `std::exception`-derived types.
Agreed.
Post by Nicol Bolas
What's interesting is that the behavior of `std::runtime_error` with a
string with embedded NULs is well-defined. The post condition is defined in
terms of `strcmp`, so it will stop at the embedded NUL character.
The behaviour should change: if embedded NULs are allowed, then a memcmp
should be required instead. Or don't allow them.
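
The difference between the two comparisons shows up as soon as a NUL is embedded. A small sketch, with hypothetical helper names:

```cpp
#include <cstring>
#include <string>

// Compares the way a strcmp-based postcondition would: stops at the
// first NUL in either string.
inline bool equal_as_nts(const std::string& a, const std::string& b) {
    return std::strcmp(a.c_str(), b.c_str()) == 0;
}

// Compares every byte, including any embedded NULs.
inline bool equal_as_bytes(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::memcmp(a.data(), b.data(), a.size()) == 0;
}
```

For `std::string("ab\0cd", 5)` and `std::string("ab\0xy", 5)`, the first function reports equality and the second does not.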
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
--
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/3392103.s8qlfG1qxy%40tjmaciei-mobl1.
Nicol Bolas
2017-10-25 02:59:43 UTC
Post by Thiago Macieira
Post by Nicol Bolas
Post by Thiago Macieira
That was not P0781's objective. The author is merely proposing getting rid of
the plain pointers and C arrays, by wrapping them in more C++-like objects.
I understand that. But it is an *effect* of that API, even though it is not
intended.
No, that's exactly the author's intention...
Post by Nicol Bolas
Post by Thiago Macieira
Well, that's the same problem as passing a std::string containing such a NUL
to a filesystem API.
1: There should not be implicit conversion to `z/cstring_view` from
`std::string` (or from any string type that can handle embedded NULs).
Correct, there can't be. But there can be an explicit conversion which can
fail.
It should not fail. By using the explicit conversion, you are certifying
that either:

1: There are no embedded NUL characters in the string.

2: You are fine with the `c/zstring_view` terminating at the first embedded
NUL character.

Now, we could have a checked conversion function that will throw if an
embedded NUL is encountered. But we shouldn't force people into spending
precious performance iterating through a string to verify something they
already know. The explicit conversion should not be more expensive than
`string::c_str()`.
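
A sketch of the two conversions side by side, under the assumption of a hypothetical `zstring_view` that stores only a pointer. Neither the class nor `checked_zstring_view` comes from a proposal; the names are made up for illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical NUL-terminated view holding just a pointer.
class zstring_view {
public:
    // Unchecked explicit conversion: as cheap as c_str(). If the string
    // has an embedded NUL, the view simply ends there.
    explicit zstring_view(const std::string& s) noexcept : ptr_(s.c_str()) {}
    const char* c_str() const noexcept { return ptr_; }
private:
    const char* ptr_;
};

// Checked alternative: one pass over the string, throws on embedded NUL.
inline zstring_view checked_zstring_view(const std::string& s) {
    if (s.find('\0') != std::string::npos) {
        throw std::invalid_argument("embedded NUL in string");
    }
    return zstring_view(s);
}
```

Callers who already know their string is clean use the constructor and pay nothing extra; callers handling untrusted input can opt into the scan.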
--
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/c839e13a-7351-4912-9480-e406f4df0a28%40isocpp.org.
Thiago Macieira
2017-10-25 03:55:37 UTC
Post by Nicol Bolas
Now, we could have a checked conversion function that will throw if an
embedded NUL is encountered. But we shouldn't force people into spending
precious performance iterating through a string to verify something they
already know. The explicit conversion should not be more expensive than
`string::c_str()`.
Fair enough, that makes sense.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
--
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/6715184.FECAYvuoeG%40tjmaciei-mobl1.
Erich Keane
2017-10-25 22:42:09 UTC
Post by Thiago Macieira
Post by Nicol Bolas
Consider P0555 <http://wg21.link/P0555>: using string_view for
source_location. The source location is ultimately a path string. And NUL
characters cannot appear in path strings. Similarly, P0781
<http://wg21.link/P0781>, which gives us a `main` signature based on
`string_view`. I don't believe it's possible to put NUL characters into
command line parameters. And even if you can, I'm fairly sure most
applications will choke on them.
That was not P0781's objective. The author is merely proposing getting rid of
the plain pointers and C arrays, by wrapping them in more C++-like objects.
When that discussion turned up, I pointed out that having a cstring_view or
zstring_view would be even more beneficial, since on most architectures the
compiler could initialise the std::initializer_list<xstring_view> object in
constant memory, regardless of how many items exist in the list. That happens
at the expense of not pre-calculating the strings' sizes.
The author of the paper does address that, when he says that it's likely one
would do that anyway, so we may as well do it all in the beginning.
This is DEFINITELY something I've considered in my time discussing this
(often with you in person!). My concern here is that those types are not
in the STL and I see little likelihood that they'll make it in.
Additionally, my hope is that OS vendors will take advantage of the
string-view constructor and start providing pointer/length pairs (since it
already HAS that information!) to applications that are string-view aware.
Post by Thiago Macieira
There's a significant expansion of memory used, though, and that's what worries
me more with the use of string_view. On a system with 128 kB command-lines,
the worst case scenario is 64k 1-byte arguments, in which case the kernel must
have already used 64k*8 = 512 kB of RAM for argv. By creating the
std::initializer_list, we will consume another 1 MB, for a total of 1.5 MB
before main() even starts. That's 18.75% of a standard stack size on Linux.
Absolutely, I think the 'transition time' will have a handful of gotchas.
I believe that in the meantime the developers are welcome to use argc/argv
if they have that concern.
Post by Thiago Macieira
Moreover, the kernel does know if it is exhausting memory, in which case it
will fail to launch the application in the first place; we have no such
protection in the runtime.
$ ulimit -Ss 16
$ command true $(for ((i = 0; i < 8100; ++i)); do echo a; done)
zsh: Argument list too long: true
For that reason, I find it dangerous to use string_view and would support there
being a cstring_view or zstring_view. And if those exist, we should also
re-evaluate the use of string_view in APIs that eventually do call NTS OS APIs,
Post by Nicol Bolas
What is more genuinely problematic are cases where you are passing a
`string_view` to an API where the underlying implementations consume NTSs.
All file operations, for example. This imposes overhead, since such an API
which consumes a `string_view` *must* allocate memory and copy it into a
buffer before using it.
There's a way to avoid it: the runtime can detect if the string_view happens
to be an NTS. That requires reading past the end of the string_view, which
requires processor-specific knowledge and will possibly trigger warnings in
MSan and Valgrind. That means it should be implemented in assembly, not in
C++.
Perhaps I'm misunderstanding, but since the string_view's array is a
single allocated array, reading 1 past it is specifically permitted by the
language.
--
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/75e7bf6d-b712-488f-8be8-58056feb38be%40isocpp.org.
Thiago Macieira
2017-10-25 23:36:54 UTC
Post by Erich Keane
This is DEFINITELY something I've considered in my time discussing this
(often with you in person!). My concern here is that those types are not
in the STL and I see little likely hood that they'll make it in.
Additionally, my hope is that OS vendors will take advantage of the
string-view constructor and start providing pointer/length pairs (since it
already HAS that information!) to applications that are string-view aware.
For Windows, that's easy, since the command-line is not passed as an array,
but as a very long Unicode string. It's up to the runtime to break it down.
There's a helper function that you call to do it (CommandLineToArgvW), so doing
that with length is definitely within the realm of acceptable.

The problem is on Unix, where the array comes from the kernel. I find it
extremely unlikely that an array with lengths will be passed. The operating
system does not and cannot know the signature of main(), as it can come from
any number of shared libraries or not exist at all, for other languages. So
the operating system would need to pass both arrays. Note how no program needs
both arrays, so this is always overhead. So I think you'll have a hard time
convincing OS developers to implement this, ever.

Even if they do, we'll have a very long inertia period where runtimes need to
adapt to the missing lengths and calculate the array themselves. You have not
Post by Erich Keane
Post by Thiago Macieira
Moreover, the kernel does know if it is exhausting memory, in which case it
will fail to launch the application in the first place; we have no such
protection in the runtime.
This is the same issue as alloca(): there's no way to tell we're exhausting
the stack.
Post by Erich Keane
Post by Thiago Macieira
There's a way to avoid it: the runtime can detect if the string_view happens
to be an NTS. That requires reading past the end of the string_view, which
requires processor-specific knowledge and will possibly trigger warnings in
MSan and Valgrind. That means it should be implemented in assembly, not in
C++.
Perhaps I'm misunderstanding, but since the string_view's array is a
single allocated array, reading 1 past it is specifically permitted by the
language.
The language permits forming the pointer, but not dereferencing it.
Dereferencing that one-past-the-last pointer is never permitted. Example:

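// Needs <sys/mman.h> and <string_view>; the MAP_FAILED check is omitted.
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// Point at the last byte of the (assumed 4096-byte) page.
char *str = static_cast<char *>(ptr) + 4095;
*str = 'a';

std::string_view view(str, 1);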

This view object has size() == 1, but dereferencing end() could cause a GPF,
since it jumps to the next page and that isn't mapped.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
--
You received this message because you are subscribed to the Google Groups "ISO C++ Standard - Future Proposals" group.
To unsubscribe from this group and stop receiving emails from it, send an email to std-proposals+***@isocpp.org.
To post to this group, send email to std-***@isocpp.org.
To view this discussion on the web visit https://groups.google.com/a/isocpp.org/d/msgid/std-proposals/1689318.6n5UohmHgG%40tjmaciei-mobl1.
Niall Douglas
2017-10-25 11:25:57 UTC
Post by Nicol Bolas
Consider P0555 <http://wg21.link/P0555>: using string_view for
source_location. The source location is ultimately a path string. And NUL
characters cannot appear in path strings. Similarly, P0781
<http://wg21.link/P0781>, which gives us a `main` signature based on
`string_view`. I don't believe it's possible to put NUL characters into
command line parameters. And even if you can, I'm fairly sure most
applications will choke on them.
NUL characters in paths are not legal on POSIX. But other operating systems
can and do support NUL characters in paths. Indeed, many if not most
filesystems on POSIX permit NUL in file entries, even though there is no
way of accessing those from a POSIX API. This makes for great "undeletable"
files.
Post by Nicol Bolas
What is more genuinely problematic are cases where you are passing a
`string_view` to an API where the underlying implementations consume NTSs.
All file operations, for example. This imposes overhead, since such an API
which consumes a `string_view` *must* allocate memory and copy it into a
buffer before using it.
I can't say if the committee will like AFIO for standardisation into the
File I/O TS next year, but if it does, then all path consuming operations
use https://ned14.github.io/afio/classafio__v2__xxx_1_1path__view.html and
this concern of yours becomes "not a problem". You may, or may not,
remember that afio::path_view reads the char after the view ends to see if
it needs to buffer copy. This is safe only because path_view explicitly
documents that it will do this, so users must always supply a source of
data where one char after the end is valid for reading.
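
A minimal sketch of the "peek one char past the end" idea described above. This is not AFIO's implementation; `with_c_path` is a hypothetical helper, and it is only valid under the same documented contract: the caller guarantees that the char just past the view is readable.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not AFIO's code).
// Contract: path.data()[path.size()] must be readable. Without that
// guarantee the peek below is undefined behaviour.
template <class F>
auto with_c_path(std::string_view path, F &&use)
{
    // Fast path: the view already happens to be NUL-terminated, so the
    // pointer can go to a C API directly, with no copy.
    if (path.data() != nullptr && path.data()[path.size()] == '\0')
        return use(path.data());

    // Slow path: make a terminated copy (may allocate).
    const std::string copy(path);
    return use(copy.c_str());
}
```

A view over a whole `std::string` takes the fast path; a view over a prefix of it takes the copying path.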

I say "not a problem" in quotes because sure, copying paths can be
expensive when they can be up to 64Kb long. But if that's happening
repeatedly in a loop, that's on the user's bad code. The afio::path_view
approach always makes it possible for the user to write maximally efficient
code given the other considerations and pressures which make it wise for
views not to be zero terminated.

There's no win-win here. You cannot take the full advantages of string
views without sacrificing null termination. And my hybrid approach of
permitted reads of chars off the end comes with its own problems,
specifically that that requirement is safe for filesystem paths, but asks
too much for some arbitrary string. Different use cases lead to different
optimisation balances.
Post by Nicol Bolas
So ironically, it is more efficient to pass a filename as a `std::string`
than as a `string_view` (for filesystems where paths are stored in UTF-8
rather than UTF-16, of course. In the latter, you're going to be allocating
memory unless you pass a `path` directly).
P0502 <http://wg21.link/P0502> essentially goes over all of the standard
library APIs that take `basic_string` and/or `char*` and replaces them
with `string_view`, without any considerations for this issue.
For example, this proposal suggests replacing the string constructors of
many `exception` types with a `string_view`. Well, if the user puts a
NUL-character in such a string, then the string is going to be truncated,
even though the API appears to allow NUL characters (since `string_view`
allows them). This is due to the fact that `std::exception::what()` returns
an NTS. So long as that is the case, `exception`-derived classes *cannot*
use strings with NUL characters in them effectively.
String view, as a concept, is a really good idea. It's much safer than a
`const char*`. But `std::string_view` promises something that `const char*`
does not: embedded NUL characters. And we should not transition any API to
`string_view` *unless* it too can promise the use of embedded NUL
characters.
This is where an alternative, NUL-terminated string view class would come
in very handy. And thanks to the Range TS's Iterator/Sentinel model for
iterators and its associated algorithms, you can still interact with such a
class as a real range. It'd still retain the safety of a `const char*`, but
its type would tell you right away that NUL characters are not allowed.
I am pretty much in agreement with all you say above oddly enough, but I
don't see the above as an automatic mandate for a zstring_view yet.

Retrofitting existing design patterns with string_view without thinking
through the potential buffer copying consequences for each and every
retrofit is highly unwise.

I suspect that if this is to be done right, some poor person needs to walk
through every individual retrofit and decide which:

1. NUL chars are permitted.
2. NUL chars have special meaning.
3. NUL chars are banned.

I'll also temporarily put on Marshall's hat for a moment when you ask him
about this, and say "you could just let the user decide when it is right
to type string_view.data()" i.e. trust the user to decay a view to a const
char * because they know that particular view will be zero terminated. In
other words, doing no retrofitting is also a reasonable design choice here.

Niall
Erich Keane
2017-10-25 22:36:10 UTC
Similarly, P0781 <http://wg21.link/P0781>, which gives us a `main`
signature based on `string_view`. I don't believe it's possible to put NUL
characters into command line parameters. And even if you can, I'm fairly
sure most applications will choke on them.
Author here: The intent is simply to "C++-ize" main so that it can be used
simply. I chose string_view because it is non-allocating and works with
most STL-like containers. As you note, it isn't possible for a NUL to
appear in a command line argument, since they are currently char*s. I
realize that using string_view changes this contract by no longer enforcing
this, but I suspect that this is OK, the contents of the command-line
arguments should be the OS's business, not the programming languages.
These aren't really "problems", in that they can't break anything. But
they are a bit slower, since they all require the system to give lengths to
these things. While `source_location` can statically determine the length
of its location strings, `main` cannot statically determine the lengths of
its command-line parameters. So those would have to be computed before
`main`. And doing computations before `main` isn't the best idea.
They CURRENTLY have to be computed before main. We have an interesting
cross-dependency here. The process launcher in all OS's I'm familiar with
currently provides 'argc/argv' pairs because that's what main requires. The
hope is that the OS vendors would be enabled to provide a better structure,
since the OS already knows the lengths! Additionally, to actually USE the
command line args, you have to iterate through their lengths anyway, the
patch simply moves this work before main. If it is a risk for your
application, you're welcome to use the existing signature.
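
A rough sketch of the work being moved before `main`. The function name `make_args` is hypothetical and P0781's actual mechanism may differ; the point is only that each `string_view` construction from a `char*` is one `strlen`-style scan.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical runtime shim (an assumption, not taken from P0781).
// Builds views over the existing argv strings; no characters are copied,
// but every argument is scanned once to find its length.
std::vector<std::string_view> make_args(int argc, char **argv)
{
    std::vector<std::string_view> args;
    args.reserve(static_cast<std::size_t>(argc));
    for (int i = 0; i < argc; ++i)
        args.emplace_back(argv[i]); // string_view(const char*) computes the length
    return args;
}
```

Note that this sketch allocates for the vector itself, which a real runtime might avoid.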
What is more genuinely problematic are cases where you are passing a
`string_view` to an API where the underlying implementations consume NTSs.
All file operations, for example. This imposes overhead, since such an API
which consumes a `string_view` *must* allocate memory and copy it into a
buffer before using it.
That is definitely an interesting consideration. To that, I would have 2
responses: First, string_view permits access to the underlying NTS, so said
API could simply use the char* from the string_view. Any API that takes a
std::string would already have to construct one, so constructing one from
string_view vs char* is no difference.
So ironically, it is more efficient to pass a filename as a `std::string`
than as a `string_view` (for filesystems where paths are stored in UTF-8
rather than UTF-16, of course. In the latter, you're going to be allocating
memory unless you pass a `path` directly).
P0502 <http://wg21.link/P0502> essentially goes over all of the standard
library APIs that take `basic_string` and/or `char*` and replaces them
with `string_view`, without any considerations for this issue.
For example, this proposal suggests replacing the string constructors of
many `exception` types with a `string_view`. Well, if the user puts a
NUL-character in such a string, then the string is going to be truncated,
even though the API appears to allow NUL characters (since `string_view`
allows them). This is due to the fact that `std::exception::what()` returns
an NTS. So long as that is the case, `exception`-derived classes *cannot*
use strings with NUL characters in them effectively.
String view, as a concept, is a really good idea. It's much safer than a
`const char*`. But `std::string_view` promises something that `const char*`
does not: embedded NUL characters. And we should not transition any API to
`string_view` *unless* it too can promise the use of embedded NUL
characters.
This is where an alternative, NUL-terminated string view class would come
in very handy. And thanks to the Range TS's Iterator/Sentinel model for
iterators and its associated algorithms, you can still interact with such a
class as a real range. It'd still retain the safety of a `const char*`, but
its type would tell you right away that NUL characters are not allowed.
Thiago Macieira
2017-10-25 23:20:29 UTC
Post by Erich Keane
That is definitely an interesting consideration. To that, I would have 2
responses: First, string_view permits access to the underlying NTS, so said
API could simply use the char* from the string_view. Any API that takes a
std::string would already have to construct one, so constructing one from
string_view vs char* is no difference.
The problem is determining that the char* that string_view is pointing to is
actually an NTS. You cannot officially verify it by dereferencing view.end(). So
strictly-conforming code needs to assume it isn't and then allocate memory.

So if we take these two examples:

int openfile(const std::string &name);
int main(int argc, char **argv)
{
int fd = openfile(argv[1]);
....
}

and

int openfile(const std::string_view &name);
int main(std::initializer_list<std::string_view> args)
{
int fd = openfile(args.begin()[1]);
....
}

BOTH allocate memory. The difference is where that allocation happens: in the
first example, it happens inside main(), by the std::string constructor; in the
second, it happens inside openfile().
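
A minimal sketch of where the allocation lands in the `string_view` version. `openfile_sv` is a hypothetical stand-in for the `openfile()` above; it uses `std::fopen` rather than a POSIX file descriptor purely to stay portable.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical stand-in for openfile(std::string_view), for illustration.
std::FILE *openfile_sv(std::string_view name)
{
    // name.data() is not guaranteed to be NUL-terminated, and conforming
    // code cannot check by reading *name.end(). So a terminated copy is
    // made first; this is the allocation (for names that do not fit in
    // any small-string buffer).
    const std::string terminated(name);
    return std::fopen(terminated.c_str(), "rb");
}
```

The copy happens even when the caller's buffer was already terminated, since the callee has no way to know.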

It gets worse if we do this:

int main(int argc, char **argv)
int main(std::initializer_list<std::string_view> args)
{
std::string name = std::string(args.begin()[1]) + "/index.html";
int fd = openfile(name);
....
}
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
Thiago Macieira
2017-10-25 23:22:42 UTC
Post by Thiago Macieira
int main(int argc, char **argv)
int main(std::initializer_list<std::string_view> args)
{
std::string name = std::string(args.begin()[1]) + "/index.html";
int fd = openfile(name);
....
}
Sorry, left-right hand synchronisation issue here. (I was holding Ctrl down
with the left hand when doing a Ctrl+V when I pressed Enter with the right
hand to insert a newline)

The second example was
int main(int argc, char **argv)
{
std::string name = std::string(argv[1]) + "/index.html";
int fd = openfile(name);
....
}

In these two examples, both allocate memory in main(). But the second
additionally allocates memory inside openfile(), despite the string that was
passed being null-terminated.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
Marc Mutz
2017-10-26 09:44:53 UTC
There are a number of APIs whose underlying implementations _require_
a NUL-terminated string (NTS). Or at the very least, where NUL
characters cannot legally appear.
These are two completely different pairs of shoes: NTS and embedded
NULs. There was a lot of talk about embedded NULs in this thread, which
I find surprising, since I consider embedded NULs a non-issue: a
"string" with embedded NULs is no longer a string, but binary data. Now
that we have std::byte, any API that deals with binary data should trade
in array_view<byte>/vector<byte>, not in string_view/string.

For NTS, the picture is a completely different one, and it seems to me
that a [zc]string_view definitely would solve the problem of
distinguishing between NTS and a char range.

But my main input here is the suggestion to ignore embedded NULs in
string and string_view. If there are 'string' APIs that actually deal
with binary data (sub-<<-level I/O, mainly), those should start to
embrace std::byte instead.

Thanks,
Marc
Nicol Bolas
2017-10-26 14:13:06 UTC
Post by Marc Mutz
There are a number of APIs whose underlying implementations _require_
a NUL-terminated string (NTS). Or at the very least, where NUL
characters cannot legally appear.
These are two completely different pairs of shoes: NTS and embedded
NULs. There was a lot of talk about embedded NULs in this thread, which
I find surprising, since I consider embedded NULs a non-issue: a
"string" with embedded NULs is no longer a string, but binary data.
You can consider it whatever you want, but that won't make it true. A
string with embedded NUL characters is a string with embedded NUL
characters. And there are plenty of APIs that call themselves "string APIs"
that can handle strings with embedded NUL characters. If you want to
consider them to not really be string handling functions, that's up to you.

But I doubt you're going to convince the world that std::regex is a "binary
data" parsing library, simply because it can deal with embedded NUL strings.
Marc Mutz
2017-10-26 17:34:30 UTC
Post by Nicol Bolas
Post by Nicol Bolas
There are a number of APIs whose underlying implementations _require_
a NUL-terminated string (NTS). Or at the very least, where NUL
characters cannot legally appear.
These are two completely different pairs of shoes: NTS and embedded
NULs. There was a lot of talk about embedded NULs in this thread, which
I find surprising, since I consider embedded NULs a non-issue: a
"string" with embedded NULs is no longer a string, but binary data.
You can consider it whatever you want, but that won't make it true. A
string with embedded NUL characters is a string with embedded NUL
characters. And there are plenty of APIs that call themselves "string
APIs" that can handle strings with embedded NUL characters. If you
want to consider them to not really be string handling functions,
that's up to you.
But I doubt you're going to convince the world that std::regex is a
"binary data" parsing library, simply because it can deal with
embedded NUL strings.
I don't know why you think you need to reply in such a harsh way, but
maybe I just need to back up a bit and explain in more detail:

C actually has a nice distinction between strings and binary data:
strcmp vs. memcmp, e.g. Consequently, C knows no such thing as embedded
NULs in strings. No C API supports them. NULs are the end marker, and
there exists no use-case for NULs embedded in C strings, not least
because they are impossible to represent. You _can_ deal with binary
data, but you must use the mem* family of functions, not the str* one.
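
To illustrate the str*/mem* split described above, a small sketch (the helper names are made up for this example): the str* functions stop at the first NUL, while the mem* functions compare exactly the number of bytes they are told to.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helpers, named only for this illustration.

// Compares as C strings: everything after the first NUL is ignored.
bool equal_as_c_strings(const char *x, const char *y)
{
    return std::strcmp(x, y) == 0;
}

// Compares as binary data: all n bytes count, NULs included.
bool equal_as_bytes(const void *x, const void *y, std::size_t n)
{
    return std::memcmp(x, y, n) == 0;
}
```

Two buffers such as `"abc\0def"` and `"abc\0xyz"` are equal under the first function and different under the second.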

Pascal strings (of which std::string/std::string_view are
implementations), otoh, do support embedded NULs. There's still no
use-case for NULs in strings (because the vast majority of APIs uses
NTSs for C compat), but since Pascal strings naturally support them, C++
started using std::string to hold binary data. There's
std::vector<unsigned char>, but for some reason or another std::string
was used. We have the same 'problem' in Qt with QByteArray, which
doubles as a container for UTF-8 strings and binary data.

Surely std::string and QByteArray are nice in that having one class for
both helps reduce the executable size of programs that need to deal with
both strings and binary data. And maybe that consideration played a role
in their design in the 90s. I don't know, but it's moot, anyway:

Because by now, in C++, we try to use type-rich interfaces. Using a
double to mean speed is just as wrong as having a 'string' type double
as a container for binary data and actual strings. What we were missing
was a data type for binary data, and we now have it: std::byte. It
therefore behooves us to explore whether we can't somehow go back to the
distinction C had, between strings and binary data. After all, it's a
bit embarrassing that C has richer interfaces than C++ :P

So, yes, I have some hope that we can start to make the distinction in
the type system (again), and eventually, in std2 or so, make embedded
NULs in strings unsupported, in favour of supporting them in
std::binary_data or simply generic container<std::byte>.

I think we will have a similar discussion for Qt 6, too, re: QByteArray.

Thanks,
Marc
Thiago Macieira
2017-10-26 20:43:11 UTC
Post by Marc Mutz
We have the same 'problem' in Qt with QByteArray, which
doubles as a container for UTF-8 strings and binary data.
Nitpick: QByteArray::toUpper & toLower are documented to read their input as
Latin1, not UTF-8.

More to the point: Qt recognises no string other than UTF-16 QString[*].
Everything else, including 8-bit locale-encoded strings, is considered simply
"binary data".

[*] QString can contain U+0000. That leads to trouble when passing them
directly to Win32, Java or Cocoa API.
Post by Marc Mutz
I think we will have a similar discussion for Qt 6, too, re: QByteArray.
Awaiting your post.

In any case, QByteArray and QVector share the allocation backend and a lot of
the front-end, so the excuse about code size is much smaller. The main reason
I see is the class name.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Software Architect - Intel Open Source Technology Center
Nicol Bolas
2017-10-27 03:30:21 UTC
Post by Marc Mutz
Post by Nicol Bolas
Post by Nicol Bolas
There are a number of APIs whose underlying implementations _require_
a NUL-terminated string (NTS). Or at the very least, where NUL
characters cannot legally appear.
These are two completely different pairs of shoes: NTS and embedded
NULs. There was a lot of talk about embedded NULs in this thread, which
I find surprising, since I consider embedded NULs a non-issue: a
"string" with embedded NULs is no longer a string, but binary data.
You can consider it whatever you want, but that won't make it true. A
string with embedded NUL characters is a string with embedded NUL
characters. And there are plenty of APIs that call themselves "string
APIs" that can handle strings with embedded NUL characters. If you
want to consider them to not really be string handling functions,
that's up to you.
But I doubt you're going to convince the world that std::regex is a
"binary data" parsing library, simply because it can deal with
embedded NUL strings.
I don't know why you think you need to reply in such a harsh way, but
strcmp vs. memcmp, e.g. Consequently, C knows no such thing as embedded
NULs in strings. No C API supports them.
No *standard* C API supports them. I know of at least one C API that
permits embedded NUL characters in strings: Lua.

Post by Marc Mutz
NULs are the end marker, and
there exists no use-case for NULs embedded in C strings, not least
because they are impossible to represent. You _can_ deal with binary
data, but you must use the mem* family of functions, not the str* one.
Pascal strings (of which std::string/std::string_view are
implementations), otoh, do support embedded NULs. There's still no
use-case for NULs in strings (because the vast majority of APIs uses
NTSs for C compat),
This is essentially tautological. You're basically saying that C uses
NUL-terminated strings, and everyone uses C APIs, so nobody uses NULs in
their strings. And therefore, anything with a NUL in it is not a "string"
but is "binary data" instead.

Well, we don't *have* to use C APIs anymore. We've got plenty of string
APIs that don't require NUL-termination now: `std::string`,
`std::iostream`, and `std::regex`. Even `std::from_chars` operates on a
pointer range and doesn't treat a NUL character as the end of that range
(that is, it doesn't treat it any differently from any other non-numerical
character).
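
A small sketch of that point, assuming a C++17 standard library that ships `<charconv>`. The helper names are invented for the example. A `std::string` carries its own length, so an embedded NUL is just another character, and `std::from_chars` stops at it only because it is not a digit.

```cpp
#include <charconv>
#include <cstddef>
#include <string>
#include <system_error>

// Hypothetical helper: builds the 5-char string '1','2',NUL,'3','4'.
// The literal is split so that "\0" is not read as part of a longer
// octal escape.
std::string make_with_nul()
{
    return std::string("12\0" "34", 5);
}

struct parse_result {
    int value;
    std::size_t consumed; // number of chars from_chars accepted
    bool ok;
};

// Parses a leading decimal integer from the whole [data, data + size)
// range; the range end, not a NUL, is what bounds the parse.
parse_result parse_prefix(const std::string &s)
{
    int value = 0;
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), value);
    return {value, static_cast<std::size_t>(ptr - s.data()), ec == std::errc{}};
}
```

Here `size()` reports 5 while `strlen(c_str())` would report 2, which is the `.size() != strlen()` mismatch discussed in this thread.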

So the assumption you're resting your conclusion on doesn't work. The NUL
character in C++ strings can have whatever meaning you want. The
possibility of its presence does not indicate that the supposed string type
is just "binary data".

Post by Marc Mutz
but since Pascal strings naturally support them, C++
started using std::string to hold binary data.
... we did? Can you point to some code that uses `std::string` for binary
data? Is this really a practice that's wide-spread in the C++ world?

I know that when I deal with binary data, I use `vector<unsigned char>` or
something similar. It's no more difficult to load raw binary data from
`iostream` or `FILE` handles into a `vector` than it is a `std::string`.
And it is no more difficult to write it out from a `vector` than a `string`.

Post by Marc Mutz
There's
std::vector<unsigned char>, but for some reason or another std::string
was used. We have the same 'problem' in Qt with QByteArray, which
doubles as a container for UTF-8 strings and binary data.
Surely std::string and QByteArray are nice in that having one class for
both helps reduce the executable size of programs that need to deal with
both strings and binary data. And maybe that consideration played a role
Because by now, in C++, we try to use type-rich interfaces. Using a
double to mean speed is just as wrong as having a 'string' type double
as a container for binary data and actual strings.
While you may not use a `double` to mean "speed", if you create a `speed`
type, it'll just be a wrapper around a `double`. Just like `string` and
`vector` are essentially the same container (and sane implementations share
a lot of machinery). `string` simply has a richer interface (and certain
potential optimizations).

Just like your hypothetical `speed` type.

We already have a distinction between "container for binary data" and
"actual strings". This distinction is clear to programmers, since we named
the thing that holds "actual strings" "std::string". You however have
defined "actual string" to mean "no embedded NUL characters". Remove this
needless definition, and there is no problem.

You seem to be acting like C++ is the only language out there that permits
embedded NUL characters in strings, that this is some off-the-wall concept
that nobody else uses. Python allows it. Lua allows it.
ECMAScript/JavaScript allows it. Are you saying that all of these languages
(and others I don't know about) got it wrong? That none of these languages
have "actual strings" and instead are just throwing "binary data" around?

It's *C* that got it wrong, not the rest of the world. And we should be
trying to avoid C-isms unless they are absolutely unavoidable (such as
filesystem APIs and the like). `c/zstring_view` is a means of annotating
such an API.
Marc Mutz
2017-10-27 06:30:25 UTC
Ok, it seems we agree on one thing: APIs that allow embedded NULs should
not use string_view (and can't use const char*).

For me, it follows from this that they should also not use std::string
(because of the .size() != strlen() issue), but it seems you don't
concur. Ok. Let's see.

You said you use std::vector<unsigned char> for binary data, so seeing
as we now have std::byte, maybe we can agree that APIs that allow
embedded NULs should use array_view<byte> these days?

Now, the question is: how do we express, in the type system, that
embedded NULs are _not_ acceptable?

And this seems to be where we differ. In the last post, you seem to
suggest that [cz]string_view would do the job.

I disagree insofar as I think [cz]string_view is a type for a different
constraint: NUL termination. This implies no embedded NULs, but is
stronger, and prevents the use of efficient substring operations that
involve cutting off the tail.

Maybe, we also agree on this: That there's a need for three interface
types:

1. binary data (array_view<byte>): embedded NULs allowed
2. ???: embedded NULs disallowed, not necessarily NUL-terminated
3. [cz]string_view: NUL-terminated (implies no embedded NULs)

I proposed '???' to be std2::string_view. Yes, std::string_view supports
embedded NULs. That's why I said that we can probably drop support for
embedded NULs only for std2, but I believe that we should, in std2. Then
each of the different constraints is represented as a separate type. If
we don't use string_view for this, we will have two types that do the
same thing (array_view<byte> and string_view), and need to invent a new
type for case (2).

Now, as pertains to the issue at hand, your concerns about adding
string_view all over the place in std, I see the following possible
resolution:

a. add [cz]string_view, and use that for APIs that require NTS
* until we have it, such APIs should not get a string_view overload
b. toughen up support for byte in string-like APIs (regex) in std
* then use array_view<byte> for APIs that allow embedded NULs
c. use string_view for the rest
c'. alternatively, define a new type for case 2's '???' above and use
that for the rest.
* IMO, this then requires the deprecation of string_view

Fair summary?

Thanks,
Marc
Nicol Bolas
2017-10-27 17:33:20 UTC
Post by Marc Mutz
Ok, it seems we agree on one thing: APIs that allow embedded NULs should
not use string_view (and can't use const char*).
No, quite the opposite. APIs that allow embedded NUL characters very much
*should* use `string_view`. Indeed, they pretty much *have to*, since the
only way to allow embedded NUL characters in an API is to take some form of
character *range*, rather than a single pointer. `to/from_chars` take two
pointers, `string_view` uses pointer+size, and `std::string` is a
full-fledged container of characters.

Post by Marc Mutz
For me, it follows from this that they should also not use std::string
(because of the .size() != strlen() issue), but it seems you don't
concur. Ok. Let's see.
You said you use std::vector<unsigned char> for binary data, so seeing
as we now have std::byte, maybe we can agree that APIs that allow
embedded NULs should use array_view<byte> these days?
No. String APIs should use a *string type*. Using a generic `array_view`
type for a string is highly misleading and does not effectively communicate
the intent of either the caller or the callee.

Even from a usability standpoint, such an API is a pain or is flat out
nonsensical. It makes no sense for `basic_cstring_view<charT>::substr` to
return an `array_view<byte>`. *Especially* since `charT` doesn't have to be
`char`; such a case makes the return value make no sense. No,
`basic_cstring_view<charT>::substr` should return
`basic_string_view<charT>`. Why?

Because *they're both strings*. `basic_cstring_view` is a NUL-terminated
string, and `basic_string_view` is not a NUL-terminated string. If you get
a sub-section of a NUL-terminated string, then that subsection isn't
NUL-terminated anymore.

But it is *still a string*; it did not magically become a generic array.
And it *certainly* did not become a generic array of *bytes*.
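
A rough sketch of that relationship, assuming a made-up `cstring_view_sketch` class. None of these names come from a proposal, and `suffix` is purely an illustrative addition showing the one sub-range that can keep the terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical sketch, not a proposed or standard type.
// Invariant: c_str()[size()] == '\0'.
class cstring_view_sketch {
public:
    // Only constructible from something known to be NUL-terminated.
    cstring_view_sketch(const char* nts) : view_(nts) {}

    const char* c_str() const { return view_.data(); }
    std::size_t size() const { return view_.size(); }

    // Dropping a prefix keeps the terminator, so the result is still an NTS.
    cstring_view_sketch suffix(std::size_t pos) const {
        assert(pos <= size());
        return cstring_view_sketch(view_.data() + pos);
    }

    // A general sub-range may cut off the tail, so it decays to string_view.
    // It is still a string, just no longer NUL-terminated.
    std::string_view substr(std::size_t pos,
                            std::size_t count = std::string_view::npos) const {
        return view_.substr(pos, count);
    }

    operator std::string_view() const { return view_; }

private:
    std::string_view view_;
};
```

With something like this, `substr` never hands back a byte array: the caller gets a `string_view` and keeps all the usual string operations.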

We added the `std::byte` type specifically to *stop* people from confusing
the two. A user who wants to work in bytes uses `byte`. A user who wants to
work with characters uses `char`. A user who wants a byte array can create
an array of `byte`s. A user who wants to use a string can create an array
of `char`.

Why do you want to return to those bad old days where we don't know what an API
means from the types it takes? An API that takes `string_view` takes a
*string*, not an array of bytes. It does string stuff; it doesn't do
arbitrary byte array stuff. An API that takes `span<const byte>` does
arbitrary byte stuff, not string stuff.

The presence or absence of NUL termination is *completely orthogonal* to
whether the range is a string or a byte array. `span<const byte>` is for
byte arrays; that type, and similar types, should not be used for strings.

Post by Marc Mutz
Now, the question is: how do we express, in the type system, that
embedded NULs are _not_ acceptable?
No, the question is: how do we express, in the type system, that the string
is NUL-terminated?

As someone else in the thread pointed out, there is a distinction between
string APIs that assume the given string is NUL-terminated and string APIs
that consider the NUL character to not be a valid character for their uses.
In the former case, passing a string with an embedded NUL character has
well-defined behavior: the string as interpreted by the API ends at the
first NUL character. In the latter case, passing a string containing a NUL
character has undefined behavior/throws an exception/errors out; and this
is treated no differently than providing a string with any other forbidden
character.

If an API forbids certain characters because it doesn't know what to do
with them, that's not something which should be handled in the type system.
Or at least, not in the standard library. It's too cumbersome to carry
around dozens of different string classes and so forth, just to deal with
different cases of which characters are allowed and forbidden.

NUL is special because the only reason for forbidding it *specifically* is
if the API interprets it as the end of string. That is, you can make an API
that only accepts, for example, ASCII letters. Such an API would forbid
NUL, but it isn't forbidding NUL specifically; it's forbidding *anything*
that isn't an ASCII letter.

Such an API simply doesn't know what to do if it gets a character outside
of its "allowed character" list. It yields undefined behavior or throws an
exception if the string has characters it doesn't accept.
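
The two kinds of API can be sketched side by side. Both functions below are hypothetical (the names `nts_length` and `require_ascii_letters` are invented here), and throwing `std::invalid_argument` is just one of the possible failure modes mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <string_view>

// Kind 1 (hypothetical): assumes NUL termination. An embedded NUL simply
// ends the string early; nothing is rejected.
inline std::size_t nts_length(const char* s) {
    return std::strlen(s);
}

// Kind 2 (hypothetical): takes a range and only accepts ASCII letters.
// NUL is rejected, but so is every other non-letter.
inline void require_ascii_letters(std::string_view s) {
    for (char c : s) {
        const bool letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        if (!letter) throw std::invalid_argument("non-letter character");
    }
}

inline void nul_handling_demo() {
    const std::string input("ab\0cd", 5);      // embedded NUL at index 2
    assert(nts_length(input.c_str()) == 2);    // truncated, not an error

    bool rejected = false;
    try {
        require_ascii_letters(input);
    } catch (const std::invalid_argument&) {
        rejected = true;                       // NUL is just another bad char
    }
    assert(rejected);
}
```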

APIs that assume NUL termination very much know what to do with NUL
characters: terminate the string. It is well-defined what happens if you
initialize a standard exception (e.g. `std::runtime_error`) with a string
that has embedded NUL characters: the string returned by `what` will be
truncated at the first NUL character.
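
A small check of that behavior, using `std::runtime_error` as the stand-in exception type (the function name `what_length_demo` is made up for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

inline std::size_t what_length_demo() {
    const std::string message("disk\0full", 9);   // embedded NUL at index 4
    assert(message.size() == 9);

    const std::runtime_error error(message);
    // what() hands back a const char*, so any reader stops at the first NUL.
    return std::strlen(error.what());
}
```

Here `what_length_demo()` returns 4: the `std::string` knew about all nine characters, but the NUL-terminated interface only exposes "disk".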

And you could indeed take the above ASCII-letter API and make it use
NUL-termination. But at the API level, it is explicitly saying that NULs
have a specific meaning; it's not forbidding them. And such an API would
take a NUL-terminated string as its input.

NUL-termination is common and specific enough that it can be handled in the
type system. I've yet to see an API that doesn't use NUL-termination yet
forbids NUL characters *exclusively*; generally speaking, if you have a
non-NTS API that forbids NUL, it probably also forbids other things too.

Post by Marc Mutz
And this seems to be where we differ. In the last post, you seem to
suggest that [cz]string_view would do the job.
I disagree insofar as I think [cz]string_view is a type for a different
constraint: NUL termination. This implies no embedded NULs, but is
stronger, and prevents the use of efficient substring operations that
involve cutting off the tail.
Maybe, we also agree on this: That there's a need for three interface
1. binary data (array_view<byte>): embedded NULs allowed
2. ???: embedded NULs disallowed, not necessarily NUL-terminated
3. [cz]string_view: NUL-terminated (implies no embedded NULs)
Can you provide an example of #2? Of an API that doesn't assume NUL
termination, but forbids NUL characters for some other reason? And an
example which forbids a plethora of characters that happens to include NUL
is not a valid example.

Post by Marc Mutz
I proposed '???' to be std2::string_view. Yes, std::string_view supports
embedded NULs. That's why I said that we can probably drop support for
embedded NULs only for std2, but I believe that we should, in std2. Then
each of the different constraints is represented as a separate type. If
we don't use string_view for this, we will have two types that do the
same thing (array_view<byte> and string_view), and need to invent a new
type for case (2).
Now, as pertains to the issue at hand, your concerns about adding
string_view all over the place in std, I see the following possible
a. add [cz]string_view, and use that for APIs that require NTS
* until we have it, such APIs should not get a string_view overload
Yes, that's the general idea.

Post by Marc Mutz
b. toughen up support for byte in string-like APIs (regex) in std
* then use array_view<byte> for APIs that allow embedded NULs
No. See the above "byte range/string" discussion.

That being said, adding `c/zstring_view` would require some changes to
`regex` to allow them to be used as input strings. But that's mainly about
changing it to use the Iterator/Sentinel paradigm internally (matches would
still use `string_view`/iterator pairs).
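
For readers unfamiliar with the Iterator/Sentinel idea, here is a minimal C++17 sketch. It says nothing about how `std::regex` is or would be implemented; `nul_sentinel` and `count_char` are invented names, and the only point is that one algorithm can serve both a bounded range and an NTS whose end is discovered on the fly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel: "end" is wherever the NUL character is.
struct nul_sentinel {};

inline bool operator==(const char* it, nul_sentinel) { return *it == '\0'; }
inline bool operator!=(const char* it, nul_sentinel s) { return !(it == s); }

// The end marker may have a different type than the iterator, so this works
// with (pointer, pointer) as well as (pointer, nul_sentinel).
template <class Iterator, class Sentinel>
std::size_t count_char(Iterator first, Sentinel last, char wanted) {
    std::size_t n = 0;
    for (; first != last; ++first) {
        if (*first == wanted) ++n;
    }
    return n;
}
```

Called as `count_char(ptr, nul_sentinel{}, 'a')`, no `strlen` pass is needed up front; called with two pointers, it behaves like an ordinary bounded algorithm.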

c. use string_view for the rest
Yes, but only to the extent that `b` doesn't exist. Thus, "the rest" means
APIs that don't assume NUL termination.