When programming in a JavaScript environment such as Node.js, you might recover raw data from the network and need to convert the bytes into strings. In a system such as Node.js, you may represent such raw bytes using a Buffer instance.
You can conveniently convert a Buffer instance into a…
When you recover textual content from the disk or from the network, you may expect it to be a Unicode string in UTF-8. It is the most common format. Unfortunately, not all sequences of bytes are valid UTF-8 and accepting invalid UTF-8 without validating it is a security risk.
How might you validate…
Modern web applications often use the http/https protocols. However, when the server and client needs to talk to each other in a symmetrical fashion, the WebSocket protocol might be preferable. For example, if you want to program a multiplayer video game, the WebSocket protocol is almost surely…
Suppose that you want to parse quickly 8-bit integers (0, 1, 2, …, 254, 255) from an ASCII/UTF-8 string. The problem comes up in the simdzone project lead by Jeroen Koekkoek (NLnet Labs). You are given a string and its length: e.g., ’22’ and length is 2. The naive approach in C might be:
int…
Conventional web applications use the http protocol (or the https variant). The http protocol is essentially asymmetrical: a client application such as a browser issues requests and the server responds. It is not possible for the server to initiate communication with the client. Certain types of…
Vitamin K2 supplements might reduce the risk of myocardial infarction (heart attacks) and of all-cause death (Hasific et al. 2022). You find vitamin K2 in some Gouda cheeses and in eggs.
Most of the water on Earth is salinated (in the oceans) and cannot be consumed. Fresh water is often scarce.…
Suppose that you want to check whether a character in C++ belongs to a fixed set, such as ‘\0’, ‘\x09’, ‘\x0a’,’\x0d’, ‘ ‘, ‘#’, ‘/’, ‘:’, ‘<‘, ‘>’, ‘?’, ‘@’, ‘[‘, ‘\’, ‘]’, ‘^’, ‘|’. A simple way is to generate a 256-byte…
In C++, suppose that you append to a string one character at a time:
while(my_string.size() <= 10'000'000) {
my_string += "a";
}
In theory, it might be possible for the C++ runtime library to implement this routine as the creation of a new string with each append: it could allocate…
The C++ library has long been organized around stream classes, at least when it comes to reading and parsing strings. But streams can be surprisingly slow. For example, if you want to parse numbers, then this C++ routine is close to being the worst possible choice for performance:
std::stringstream…
In about 10 years, Apple has multiplied by 19 the number of transistors in its mobile processors. It corresponds roughly to a steady rate of improvement of 34% per year on the number of transistors, or a doubling every 2.5 years. In real dollars, an iPhone has roughly a constant price: the price…
Computer software is typically deterministic on paper: if you run twice the same program with the same inputs, you should get the same outputs. In practice, the complexity of modern computing makes it unlikely that you could ever run twice the same program and get exactly the same result, down to…
The Web is a convenient interface to your software. Many times, if you have an existing application, you may want to allow Web access to it using HTTP. Or you may want to build a small specialized Web application. In such instances, you do not want to use an actual Web server (e.g., Apache or…
If I give a programmer a string such as "9223372036854775808" and I ask them to convert it to an integer, they might do the following in C++:
std::string s = ....
uint64_t val;
auto [ptr, ec] =
std::from_chars(s.data(), s.data() + s.size(), val);
if (ec != std::errc()) {} // I have an…
In software, we store strings of text as arrays of bytes in memory using one of the Unicode Transformation Formats (UTF), the most popular being UTF-8 and UTF-16. Windows, Java, C# and other systems common languages and systems default on UTF-16, whereas other systems and most of the web relies on…
A common problem in parsing is that you want to find all identifiers (e.g., variable names, function names) in a document quickly. There are typically some fixed rules. For example, it is common to allow ASCII letters and digits as well as characters like ‘_’ in the identifier, but to forbid…
Physicists have a published a paper with 5154 authors. The list of authors takes 24 pages out of the 33 pages. The lesson is that if someone tell you that they have published an important paper, you should ask how many authors there were and what their exact role was.
Vegatarians are at higher…
Though most strings online today follow the Unicode standard (e.g., using UTF-8), the Latin 1 standard is still in widespread inside some systems (such as browsers) as JavaScript strings are often stored as either Latin 1, UTF-8 or UTF-16 internally. Latin 1 captures the first 256 characters from…
Given a set of r random values from a large set (of size N), I have been using the formula 1-exp(-r**2/(2N)) to approximate the probability of a collision. It assumes that r is much smaller than N. The formula suggests that if you have hundreds of millions of random 64-bit numbers, you will start…