
std::string_view: The Duct Tape of String Types

Visual Studio 2017 contains support for std::string_view, a type added in C++17 to serve some of the roles previously served by const char * and const std::string& parameters. string_view is neither a “better const std::string&” nor a “better const char *”; it is neither a superset nor a subset of either. std::string_view is intended to be a kind of universal “glue”: a type describing the minimum common interface necessary to read string data. It doesn’t require that the data be null-terminated, and doesn’t place any restrictions on the data’s lifetime. This gives you type erasure for “free”, as a function accepting a string_view can be made to work with any string-like type, without making the function into a template, or constraining the interface of that function to a particular subset of string types.

tl;dr

string_view solves the “every platform and library has its own string type” problem for parameters. It can bind to any sequence of characters, so you can just write your function as accepting a string view:

void f(wstring_view); // string_view that uses wchar_t's

and call it without caring what string-like type the calling code is using (for (char*, length) argument pairs, just add {} around them):

// pass a std::wstring:
std::wstring s;                  f(s);

// pass a C-style null-terminated string (string_view is not null-terminated):
const wchar_t* ns = L"";         f(ns);

// pass a C-style character array of len characters (excluding null terminator):
const wchar_t* cs; size_t len;   f({cs, len});

// pass a WinRT string
winrt::hstring hs;       f(hs);

f is just an ordinary function, it doesn’t have to be a template.
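
The snippets above depend on WinRT, so here is a minimal portable sketch of the same idea using narrow strings. The names count_spaces and count_spaces_in_examples are hypothetical and exist only for illustration; the point is that one non-template function accepts a std::string, a null-terminated string, and a (pointer, length) pair.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: counts space characters in any narrow character sequence.
std::size_t count_spaces(std::string_view sv) noexcept {
    std::size_t n = 0;
    for (const char c : sv) {
        if (c == ' ') {
            ++n;
        }
    }
    return n;
}

// Each call binds a different kind of source to the same non-template function.
std::size_t count_spaces_in_examples() {
    const std::string owned{"a b c"};            // std::string
    const char* const literal = "d e";           // null-terminated C string
    const char buffer[] = {'f', ' ', 'g', ' '};  // not null-terminated
    return count_spaces(owned)                   // 2 spaces
         + count_spaces(literal)                 // 1 space
         + count_spaces({buffer, sizeof buffer});  // 2 spaces
}
```

None of the three calls allocates; each only forms a pointer and a length.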

string_view as a Generic String Parameter

Today, the most common “lowest common denominator” used to pass string data around is the null-terminated string (or as the standard calls it, the Null-Terminated Character Type Sequence). This has been with us since long before C++, and provides clean “flat C” interoperability. However, char* and its support library are associated with exploitable code, because length information is an in-band property of the data and susceptible to tampering. Moreover, the null used to delimit the length prohibits embedded nulls and causes one of the most common string operations, asking for the length, to be linear in the length of the string.
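
To make the embedded-null and linear-length points concrete, here is a small sketch. The names raw, c_length, and view are illustrative only. The same buffer is measured once with strlen and once through a string_view that carries its length out of band.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// A buffer with an embedded null: 'a', 'b', '\0', 'c', 'd', plus the terminator.
inline constexpr char raw[] = "ab\0cd";

// strlen scans until the first null, so it stops after "ab".
inline std::size_t c_length() noexcept { return std::strlen(raw); }

// The view stores the length separately, so size() is O(1) and covers all
// five characters, including the embedded null.
inline constexpr std::string_view view{raw, sizeof raw - 1};
```

Here c_length() reports 2 while view.size() reports 5, and view[3] is still reachable as 'c'.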

Sometimes const std::string& can be used to pass string data and erase the source, because it accepts std::string objects, const char * pointers, and string literals like “meow”. Unfortunately, const std::string& creates “impedance mismatches” when interacting with code that uses other string types. If you want to talk to COM, you need to use BSTR. If you want to talk to WinRT, you need HSTRING. For NT, UNICODE_STRING, and so on. Each programming domain makes up its own new string type, lifetime semantics, and interface, but a lot of text processing code out there doesn’t care about that. Allocating entire copies of the data to process just to make differing string types happy is suboptimal for performance and reliability.
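
As a rough sketch of how a domain-specific string type can plug into string_view-based code, consider the hypothetical counted_string below. It is not a real API; it only stands in for "some other string type" that stores a pointer and a length. A single non-allocating conversion operator is enough.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical domain string type: a pointer/length pair. It does not model
// any real platform type; it is an assumption made for illustration.
struct counted_string {
    const char* data;
    std::size_t length;

    // Non-allocating, non-throwing conversion. This is all the type needs in
    // order to be passed to functions that accept std::string_view.
    operator std::string_view() const noexcept { return {data, length}; }
};

// Ordinary non-template function; it knows nothing about counted_string.
bool starts_with_slash(std::string_view sv) noexcept {
    return !sv.empty() && sv.front() == '/';
}
```

A call such as starts_with_slash(counted_string{"/tmp", 4}) then works without copying the characters into a std::string first.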

Example: A Function Accepting std::wstring and winrt::hstring

Consider the following program. It has a library function compiled in a separate .cpp, which doesn’t handle all string types explicitly but still works with any string type.

// library.cpp
#include <stddef.h>
#include <string_view>
#include <algorithm>

size_t count_letter_Rs(std::wstring_view sv) noexcept {
    return std::count(sv.begin(), sv.end(), L'R');
}

// program.cpp
// compile with: cl /std:c++17 /EHsc /W4 /WX
//    /I"%WindowsSdkDir%Include\%UCRTVersion%\cppwinrt" .\program.cpp .\library.cpp
#include <stddef.h>
#include <string.h>
#include <iostream>
#include <stdexcept>
#include <string>
#include <string_view>

#pragma comment(lib, "windowsapp")
#include <winrt/base.h>

// Library function, the .cpp caller doesn't need to know the implementation
size_t count_letter_Rs(std::wstring_view) noexcept;

int main() {
    std::wstring exampleWString(L"Hello wstring world!");
    exampleWString.push_back(L'\0');
    exampleWString.append(L"ARRRR embedded nulls");
    winrt::hstring exampleHString(L"Hello HSTRING world!");

    // Performance and reliability is improved vs. passing std::wstring, as
    // the following conversions don't allocate and can't fail:
    static_assert(noexcept(std::wstring_view{exampleWString}));
    static_assert(noexcept(std::wstring_view{exampleHString}));

    std::wcout << L"Rs in " << exampleWString
        << L": " << count_letter_Rs(exampleWString) << L"\n";

    // note winrt::hstring->wstring_view implicit conversion when calling
    // count_letter_Rs
    std::wcout << L"Rs in " << std::wstring_view{exampleHString}
        << L": " << count_letter_Rs(exampleHString) << L"\n";
}

Output:

>.\program.exe
Rs in Hello wstring world! ARRRR embedded nulls: 4
Rs in Hello HSTRING world!: 1

The preceding example demonstrates a number of desirable properties of string_view (or wstring_view in this case):

vs. making count_letter_Rs some kind of template
Compile time and code size are reduced because only one instance of count_letter_Rs need be compiled. The interface of the string types in use need not be uniform, allowing types like winrt::hstring, MFC CString, or QString to work as long as a suitable conversion function is added to the string type.

vs. const char *
By accepting string_view, count_letter_Rs need not do a strlen or wcslen on the input. Embedded nulls work without problems, and there’s no chance of in-band null manipulation errors introducing bugs.

vs. const std::string&
As described in the comment in the example above, string_view avoids a separate allocation and potential failure mode, because it passes a pointer to the string’s data, rather than making an entire owned copy of that data.

string_view For Parsers

Another place where non-allocating, non-owning string pieces exposed as string_view can be useful is in parsing applications. For example, the C++17 std::filesystem::path implementation that comes with Visual C++ uses std::wstring_view internally when parsing and decomposing paths. The resulting string_views can be returned directly from functions like std::filesystem::path::filename(), where they are converted to a path, while functions like std::filesystem::path::has_filename(), which don’t actually need to make copies, become natural to write.

inline wstring_view parse_filename(const wstring_view text)
	{	// attempt to parse text as a path and return the filename if it exists; otherwise,
		// an empty view
	const auto first = text.data();
	const auto last = first + text.size();
	const auto filename = find_filename(first, last); // algorithm defined elsewhere
	return wstring_view(filename, last - filename);
	}

class path
	{
public:
	// [...]
	path filename() const
		{	// parse the filename from *this and return a copy if present; otherwise,
			// return the empty path
		return parse_filename(native());
		}
	bool has_filename() const noexcept
		{	// parse the filename from *this and return whether it exists
		return !parse_filename(native()).empty();
		}
	// [...]
	};

In the std::experimental::filesystem implementation written before string_view, path::filename() contains the parsing logic, and returns a std::experimental::filesystem::path. has_filename is implemented in terms of filename, as depicted in the standard, allocating a path to immediately throw it away.
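
The same split between "parse into a view" and "copy only if needed" can be sketched portably. The code below is a deliberate simplification and not the MSVC algorithm: it assumes '/' is the only separator, and the names parse_filename_sketch and has_filename_sketch are hypothetical.

```cpp
#include <string_view>

// Simplified stand-in for the parsing step: treats '/' as the only separator.
// A real path parser also has to handle drive letters, '\\', and more.
constexpr std::string_view parse_filename_sketch(std::string_view text) noexcept {
    const auto pos = text.rfind('/');
    if (pos == std::string_view::npos) {
        return text;  // no separator: the whole input is the filename
    }
    return text.substr(pos + 1);  // view into the caller's buffer, no copy
}

// Needs no allocation: it only asks whether the parsed view is empty.
constexpr bool has_filename_sketch(std::string_view text) noexcept {
    return !parse_filename_sketch(text).empty();
}
```

A caller that wants an owning result can construct a std::string from the returned view; a caller that only needs a yes/no answer never pays for one.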

Iterator Debugging Support

In debugging builds, MSVC’s string_view implementation is instrumented to detect many kinds of buffer management errors. The valid input range is stamped into string_view’s iterators when they are constructed, and unsafe iterator operations are blocked with a message describing what the problem was.

// compile with cl /EHsc /W4 /WX /std:c++17 /MDd .\test.cpp
#include <crtdbg.h>
#include <string_view>

int main() {
    // The next 3 lines cause assertion failures to go to stdout instead of popping a dialog:
    _set_abort_behavior(0, _WRITE_ABORT_MSG);
    _CrtSetReportMode(_CRT_ASSERT, _CRTDBG_MODE_FILE);
    _CrtSetReportFile(_CRT_ASSERT, _CRTDBG_FILE_STDOUT);

    // Do something bad with a string_view iterator:
    std::string_view test_me("hello world");
    (void)(test_me.begin() + 100); // dies
}

>cl /nologo /MDd /EHsc /W4 /WX /std:c++17 .\test.cpp
test.cpp

>.\test.exe
xstring(439) : Assertion failed: cannot seek string_view iterator after end

Now, this example might seem a bit obvious, because we’re clearly incrementing the iterator further than the input allows, but catching mistakes like this can make debugging much easier in something more complex. For example, a function expecting to move an iterator to the next ‘)’:

// compile with cl /EHsc /W4 /WX /std:c++17 /MDd .\program.cpp
#include <crtdbg.h>
#include <string_view>

using std::string_view;

string_view::iterator find_end_paren(string_view::iterator it) noexcept {
    while (*it != ')') {
        ++it;
    }

    return it;
}

int main() {
    _set_abort_behavior(0, _WRITE_ABORT_MSG);
    _CrtSetReportMode(_CRT_ASSERT, _CRTDBG_MODE_FILE);
    _CrtSetReportFile(_CRT_ASSERT, _CRTDBG_FILE_STDOUT);
    string_view example{"malformed input"};
    const auto result = find_end_paren(example.begin());
    (void)result;
}

>cl /nologo /EHsc /W4 /WX /std:c++17 /MDd .\program.cpp
program.cpp

>.\program.exe
xstring(358) : Assertion failed: cannot dereference end string_view iterator
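
Once the debug check has pointed at the bug, one possible fix is to give the function the whole view rather than a lone iterator, so it always knows where the input ends. The sketch below is only one option, and the name find_end_paren_pos is hypothetical.

```cpp
#include <string_view>

// Returns the position of the first ')' in text, or std::string_view::npos if
// there is none. Because the function receives the whole view, the search can
// never run past the end of the input.
constexpr std::string_view::size_type find_end_paren_pos(std::string_view text) noexcept {
    return text.find(')');
}
```

With the malformed input from the example, the function returns npos instead of walking off the end of the buffer.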

Pitfall #1: std::string_view doesn’t own its data, or extend lifetime

Because string_view doesn’t own its actual buffer, it’s easy to write code that assumes data will live a long time. An easy way to demonstrate this problem is to have a string_view data member. For example, a struct like the following is dangerous:

struct X {
    std::string_view sv; // Danger!
    explicit X(std::string_view sv_) : sv(sv_) {}
};

because a caller can expect to do something like:

int main() {
    std::string hello{"hello"};
    X example{hello + " world"}; // forms string_view to string destroyed at the semicolon
    putchar(example.sv[0]); // undefined behavior
}

In this example, the expression `hello + " world"` creates a temporary std::string, which is converted to a std::string_view before the constructor of X is called. X stores a string_view to that temporary string, and that temporary string is destroyed at the end of the full expression constructing `example`. At this point, it would be no different if X had tried to store a const char * which was deallocated. X really wants to extend the lifetime of the string data here, so it must make an actual copy.

There are of course conditions where a string_view member is fine; if you’re implementing a parser and are describing a data structure tied to the input, this may be OK, as std::regex does with std::sub_match. Just be aware that string_view’s lifetime semantics are more like that of a pointer.

Pitfall #2: Type Deduction and Implicit Conversions

Attempting to generalize functions to different character types by accepting basic_string_view instead of string_view or wstring_view prevents the intended use of implicit conversion. If we modify the program from earlier to accept a template instead of wstring_view, the example no longer works.

// program.cpp
// compile with: cl /std:c++17 /EHsc /W4 /WX
//    /I"%WindowsSdkDir%Include\%UCRTVersion%\cppwinrt" .\program.cpp
#include <stddef.h>
#include <string.h>
#include <algorithm>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>
#include <string_view>

#pragma comment(lib, "windowsapp")
#include <winrt/base.h>

template<class Char>
size_t count_letter_Rs(std::basic_string_view<Char> sv) noexcept {
    return std::count(sv.begin(), sv.end(),
        std::use_facet<std::ctype<Char>>(std::locale()).widen('R'));
}

int main() {
    std::wstring exampleWString(L"Hello wstring world!");
    winrt::hstring exampleHString(L"Hello HSTRING world!");
    count_letter_Rs(exampleWString); // no longer compiles; can't deduce Char
    count_letter_Rs(std::wstring_view{exampleWString}); // OK
    count_letter_Rs(exampleHString); // also no longer compiles; can't deduce Char
    count_letter_Rs(std::wstring_view{exampleHString}); // OK
}

In this example, we want exampleWString to be implicitly converted to a basic_string_view<wchar_t>. However, for that to happen we need template argument deduction to deduce Char == wchar_t, so that we get count_letter_Rs<wchar_t>. Template argument deduction runs before overload resolution and before any attempt to find conversion sequences, so it has no idea that basic_string is at all related to basic_string_view; deduction fails, and the program does not compile. As a result, prefer accepting a specialization of basic_string_view like string_view or wstring_view rather than a templatized basic_string_view in your interfaces.
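
If the implementation really should be shared across character types, one possible arrangement is to keep the template as a private detail and expose non-template overloads. This is only a sketch; the detail namespace and count_impl are hypothetical names, and the comparison against a plain 'R' or L'R' replaces the locale-based widen call from the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

namespace detail {
    // Shared implementation. Callers always pass an actual basic_string_view,
    // so Char can be deduced.
    template <class Char>
    std::size_t count_impl(std::basic_string_view<Char> sv, Char target) noexcept {
        return static_cast<std::size_t>(std::count(sv.begin(), sv.end(), target));
    }
}

// Non-template overloads: implicit conversions from std::string, const char *,
// std::wstring, and other convertible types work again.
inline std::size_t count_letter_Rs(std::string_view sv) noexcept {
    return detail::count_impl(sv, 'R');
}
inline std::size_t count_letter_Rs(std::wstring_view sv) noexcept {
    return detail::count_impl(sv, L'R');
}
```

Overload resolution picks the right function from the argument's conversion to string_view or wstring_view, and deduction only ever happens inside the overloads, where the argument is already a view.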

In Closing

We hope string_view will serve as an interoperability bridge to allow more C++ code to seamlessly communicate. We are always interested in your feedback. Should you encounter issues please let us know through Help > Report A Problem in the product, or via Developer Community. Let us know your suggestions through UserVoice. You can also find us on Twitter (@VisualC) and Facebook (msftvisualcpp).

