Strings and Regular Expressions

JavaScript strings have always been built on a 16-bit character encoding (UTF-16). Each 16-bit sequence is a code unit, and originally each code unit represented one character. String properties and methods such as length and charAt() all operate on these code units.
The goal of Unicode is to provide a globally unique identifier for every character in the world. These identifiers, called code points, are simply numbers starting from 0. If characters are limited to 16 bits, there are not enough values to cover every code point. A character encoding must encode code points into internally consistent code units. In UTF-16, a code point may be represented by more than one code unit.

Better Unicode support. In the past, 16 bits were sufficient to contain any character, so each 16-bit code unit represented one character. That stopped being true once Unicode introduced its extended character sets, and the encoding rules had to change.

UTF-16

The first $2^{16}$ code points are each represented by a single 16-bit code unit; this range is called the Basic Multilingual Plane (BMP). Code points beyond this range belong to the supplementary planes. To represent them, UTF-16 introduced surrogate pairs, in which two 16-bit code units together represent one code point. This means a string can contain two kinds of characters: BMP characters represented by a single 16-bit code unit, and supplementary plane characters represented by two 16-bit code units (32 bits in total), such as '𠮷' (String.fromCodePoint(134071)).
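
As a rough illustration of how a supplementary code point maps onto a surrogate pair, the sketch below applies the standard UTF-16 arithmetic. The function name toSurrogatePair is hypothetical and is not a built-in API.

```javascript
// Hypothetical helper: split a code point into its UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xFFFF) {
    // BMP code points fit in a single code unit.
    return [codePoint];
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> low surrogate
  return [high, low];
}

const pair = toSurrogatePair(134071); // code point of '𠮷'
console.log(pair); // [ 55362, 57271 ]
```

The two values are the same ones that charCodeAt() reports for each half of '𠮷'.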

In ECMAScript 5, all string operations are based on 16-bit code units. If UTF-16 text containing surrogate pairs is processed in the same way, the results may not be what you expect:

let text = "𠮷";
console.log(text.length); //2
console.log(/^.$/.test(text)); //false
console.log(text.charAt(0)); // "\uD842", a lone high surrogate (not printable)
console.log(text.charAt(1)); // "\uDFB7", a lone low surrogate (not printable)
console.log(text.charCodeAt(0)); // 55362
console.log(text.charCodeAt(1)); // 57271

The variable text contains a single character, but its length property is 2.
Because text is treated as two characters, a regular expression that matches a single character fails.
Neither of the two 16-bit code units is a printable character on its own, so charAt() does not return a usable string at either position.
charCodeAt() cannot identify the character either; it only returns the numeric value of each 16-bit code unit.

codePointAt() method

ES6 added the codePointAt() method to fully support UTF-16. It accepts a code unit position (not a character position) as its argument and returns the code point at that position in the string as an integer. For BMP characters, codePointAt() returns the same value as charCodeAt(); for non-BMP characters the results differ. In the string '𠮷a', the first character is non-BMP and occupies two code units, so the length property is 3. At position 0, codePointAt() returns the full code point even though it spans two code units; at position 1, the second half of the surrogate pair, it returns the same value as charCodeAt().

let text = '𠮷a'
console.log(text.length) // 3

console.log(text.charCodeAt(0)) // 55362
console.log(text.charCodeAt(1)) // 57271
console.log(text.charCodeAt(2)) // 97

console.log(text.codePointAt(0)) // 134071
console.log(text.codePointAt(1)) // 57271
console.log(text.codePointAt(2)) // 97

To check whether a character is represented by two code units (32 bits) rather than one, you can write the following function:

function is32Bit(c){
  return c.codePointAt(0) > 0xFFFF;
}

console.log(is32Bit('𠮷')) // true
console.log(is32Bit('a')) // false
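
Building on the same comparison against 0xFFFF, a minimal sketch of walking a string one code point at a time might look like the following. The helper name toCodePoints is an assumption made for illustration.

```javascript
// Hypothetical helper: collect the code points of a string by walking its code units.
function toCodePoints(text) {
  const result = [];
  for (let i = 0; i < text.length; i++) {
    const codePoint = text.codePointAt(i);
    result.push(codePoint);
    if (codePoint > 0xFFFF) {
      i++; // the character used two code units, so skip the low surrogate
    }
  }
  return result;
}

const points = toCodePoints('𠮷a');
console.log(points); // [ 134071, 97 ]
```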

fromCodePoint() method

String.fromCodePoint() does the reverse of codePointAt(): it returns the character for a given code point. It can be seen as an extended version of String.fromCharCode(). For all BMP characters the two methods produce the same result; the results differ only when a non-BMP code point is passed as an argument.

console.log(String.fromCodePoint(134071)) // 𠮷
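
To make the difference concrete, the short sketch below passes the same non-BMP code point to both methods. The variable names are illustrative only.

```javascript
const codePoint = 134071; // 0x20BB7, outside the BMP

const viaCodePoint = String.fromCodePoint(codePoint);
// fromCharCode() keeps only the low 16 bits, so 0x20BB7 becomes 0x0BB7.
const viaCharCode = String.fromCharCode(codePoint);

console.log(viaCodePoint === '𠮷');                  // true
console.log(viaCharCode.charCodeAt(0) === 0x0BB7);   // true
// For BMP values both methods agree.
console.log(String.fromCodePoint(97) === String.fromCharCode(97)); // true
```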

normalize() method

Another interesting aspect of Unicode is that different characters may be considered equivalent for the purposes of sorting or comparison. Text that looks the same and means the same can be made up of different code point sequences. Strings should therefore be converted to a common form with the normalize() method before they are compared.

Just remember, before comparing strings, always normalize them to the same form.

// values is assumed to be an array of strings
let normalized = values.map(function(text){
  return text.normalize();
});

normalized.sort(function(first, second){
  if (first < second) {
    return -1;
  } else if (first === second) {
    return 0;
  } else {
    return 1;
  }
});
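
A small example may help show why normalization matters. The sketch below compares two ways of writing "é"; the variable names are chosen for illustration.

```javascript
const composed = '\u00E9';    // "é" as a single code point
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent

// The strings render the same but are different code point sequences.
console.log(composed === decomposed); // false

// After normalizing both to the same form (NFC by default), they compare equal.
console.log(composed.normalize() === decomposed.normalize()); // true

// A specific form can also be requested, for example NFD (decomposed).
console.log(composed.normalize('NFD') === decomposed); // true
```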

Regular Expression u Modifier

The Unicode-aware u modifier switches a regular expression from code unit mode to character mode, so surrogate pairs are no longer treated as two characters and the expression behaves as expected. For example:

let text = '𠮷'

console.log(text.length) // 2
console.log(/^.$/.test(text)) // false
console.log(/^.$/u.test(text)) // true: with the u modifier the regular expression matches characters, so it can match the Japanese character
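
As a further sketch of the two modes, the same global pattern can be run with and without the u modifier against a string that mixes a non-BMP and a BMP character. The variable names are illustrative.

```javascript
const sample = '𠮷a';

const withoutU = sample.match(/./g); // code unit mode: two lone surrogates plus 'a'
const withU = sample.match(/./gu);   // character mode: '𠮷' and 'a'

console.log(withoutU.length); // 3
console.log(withU.length);    // 2

// With the u modifier, \u{...} escapes can name a code point directly.
console.log(/\u{20BB7}/u.test(sample)); // true
```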

Calculating Code Point Count

ES6 does not provide a direct way to count the code points in a string (length still returns the number of code units), but with the u modifier you can solve this with a regular expression.

// Long strings may cause performance problems; a string iterator can be used instead
function codePointLength(text){
  let rs = text.match(/[\s\S]/gu);
  return rs ? rs.length : 0;
}
// Detect whether the engine supports the u modifier
function hasRegExU(){
  try{
    var pattern = new RegExp(".", "u");
    return true;
  }catch(ex){
    return false;
  }
}
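
The string iterator mentioned in the comment above could be used roughly as follows. This is a sketch; the function name countCodePoints is hypothetical.

```javascript
// Hypothetical alternative: count code points with the string iterator,
// which steps through a string by code point rather than by code unit.
function countCodePoints(text) {
  let count = 0;
  for (const character of text) {
    count++;
  }
  return count;
}

console.log(countCodePoints('𠮷a')); // 2
console.log('𠮷a'.length);           // 3
```

This avoids building an array of matches, which is the reason the regular expression version can be slow on long strings.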

Substring Recognition in Strings

trim() Returns a new string with whitespace removed from both ends.
includes() Returns true if the specified text is found in the string, otherwise returns false.
startsWith() Returns true if the specified text is found at the beginning of the string, otherwise returns false.
endsWith() Returns true if the specified text is found at the end of the string, otherwise returns false.
repeat() Returns a new string consisting of the current string repeated a specified number of times.

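includes(), startsWith(), and endsWith() each accept two arguments: the first specifies the text to search for, and the second is optional and specifies the index at which to start the search (for endsWith(), the index at which the searched portion of the string ends). These methods only return a boolean. If you need the actual position of a substring within a string, you still need indexOf() or lastIndexOf().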
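
A brief sketch of the three search methods, including the optional second argument, is shown below. The sample string is arbitrary.

```javascript
const msg = "Hello world!";

console.log(msg.startsWith("Hello")); // true
console.log(msg.endsWith("!"));       // true
console.log(msg.includes("o"));       // true

console.log(msg.startsWith("o"));     // false
console.log(msg.includes("x"));       // false

// With a second argument:
console.log(msg.startsWith("o", 4));  // true  (matching starts at index 4)
console.log(msg.endsWith("o", 8));    // true  (the string is treated as "Hello wo")
console.log(msg.includes("o", 8));    // false (searching "rld!")
```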

repeat() method

ES6 also added a repeat() method, which accepts a number specifying how many times the string should be repeated and returns a new string made up of that many copies of the original. For example, it can be used to build indentation levels in a code formatter.

let indent = " ".repeat(4),
    indentLevel = 0;
// When the indentation needs to increase
let newIndent = indent.repeat(++indentLevel);
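
As one possible use in a formatter, the sketch below prefixes a line with a given indentation level. The helper indentLine and its parameters are hypothetical.

```javascript
// Hypothetical helper: prefix a line with `level` copies of an indentation unit.
function indentLine(line, level, unit = "    ") {
  return unit.repeat(level) + line;
}

const indented = indentLine("return x;", 2);
console.log(indented.length);  // 17 (8 spaces + 9 characters)
console.log("ab".repeat(3));   // "ababab"
console.log("ab".repeat(0));   // "" (an empty string)
```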

y Modifier in Regular Expressions

The y modifier sets the sticky property of a regular expression. It instructs the search to begin matching at the position given by the regular expression's lastIndex property. If there is no successful match at exactly that position, matching stops. The lastIndex property is only taken into account when calling methods of the regular expression object, such as exec() and test().
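
The behavior can be sketched with a small example. The sample text and variable names are chosen for illustration.

```javascript
const stickyText = "hello1 hello2 hello3";
const stickyPattern = /hello\d\s?/y;

console.log(stickyPattern.sticky); // true

// lastIndex starts at 0, so the first match is attempted at position 0.
const first = stickyPattern.exec(stickyText);  // matches "hello1 "
const afterFirst = stickyPattern.lastIndex;    // 7

// The next match must start exactly at position 7.
const second = stickyPattern.exec(stickyText); // matches "hello2 "

// Position 1 is the middle of a word, so the sticky match fails there
// and lastIndex is reset to 0.
stickyPattern.lastIndex = 1;
const third = stickyPattern.exec(stickyText);  // null
```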

Copying Regular Expressions

var reg1 = /ab/i,
    // Throws an error in ES5, works in ES6
    reg2 = new RegExp(reg1, "g");

// The flags property (new in ES6) returns the flags as a string
let re = /ab/g
console.log(re.source); // "ab"
console.log(re.flags); // "g"
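
To show what the copy looks like in practice, the sketch below checks the source and flags of both expressions. The variable names are illustrative.

```javascript
const original = /ab/i;
// In ES6 the second argument replaces the flags of the copied expression.
const copy = new RegExp(original, "g");

console.log(original.flags);      // "i"
console.log(copy.source);         // "ab"
console.log(copy.flags);          // "g"

console.log(original.test("AB")); // true  (case-insensitive)
console.log(copy.test("AB"));     // false (the i flag was not carried over)
```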

Template Literals

Multiline strings: A formal concept of multiline strings.
Basic string formatting: The ability to embed variable values into strings.
HTML escaping: The ability to insert safely substituted strings into HTML.

In template literals, there is no need to escape single or double quotes. If you want to use a backtick, escape it with \. Placeholders are written as ${expression}, most simply ${variableName} (referencing an undeclared variable throws an error). Template literals are themselves JavaScript expressions, so you can embed one template literal inside another, as shown below:

let name = "Nicholas",
message = `Hello ${
  `my name is ${name}`
}`;
console.log(message); // "Hello my name is Nicholas"
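
The other features listed above (multiline strings, expression placeholders, and escaped backticks) can be sketched as follows. The variable names are illustrative.

```javascript
// A newline inside the backticks becomes part of the string.
const multiline = `Multiline
string`;
console.log(multiline.length); // 16

// Any JavaScript expression can appear inside ${}.
const a = 2, b = 3;
const sum = `${a} + ${b} = ${a + b}`;
console.log(sum); // "2 + 3 = 5"

// Quotes need no escaping; backticks do.
const quoted = `"double", 'single' and \`backtick\``;
console.log(quoted.includes("`")); // true
```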

Using Tag Functions

function tag(literals, ...substitutions){
  // Return a string
}
// For example
let count = 10,
    price = 0.25,
    message = passthru`${count} items cost $${(count * price).toFixed(2)}.`;

If you have a function named passthru(), then as a template literal tag it receives 3 arguments.
The first is the literals array. The two placeholders split the string into three segments:

Before the first placeholder: the empty string ''
Between the first and second placeholders: ' items cost $'
After the second placeholder: '.'

The second argument is the interpreted value of count, 10, which becomes the first element of the substitutions array. The last argument is the interpreted value of (count * price).toFixed(2), "2.50", which becomes the second element of the substitutions array. The number of elements in substitutions is always one less than the length of literals.

function passthru(literals, ...substitutions){
  let result = '';
  // The number of substitutions determines how many times to loop
  for (let i = 0; i < substitutions.length; i++) {
    result += literals[i];
    result += substitutions[i];
  }
  // Append the last literal
  result += literals[literals.length - 1];
  return result;
}
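
The same interleaving loop can serve as the basis for the HTML escaping use case mentioned earlier. The sketch below is one possible approach, not a complete sanitizer; the names escapeHtml and safeHtml are hypothetical.

```javascript
// Hypothetical helper: escape the characters that are significant in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical tag: literals are trusted, substitutions are escaped.
function safeHtml(literals, ...substitutions) {
  let result = "";
  for (let i = 0; i < substitutions.length; i++) {
    result += literals[i];
    result += escapeHtml(substitutions[i]);
  }
  result += literals[literals.length - 1];
  return result;
}

const userInput = '<b>"hi"</b>';
const html = safeHtml`<p>${userInput}</p>`;
console.log(html); // <p>&lt;b&gt;&quot;hi&quot;&lt;/b&gt;</p>
```

Only the substituted values are escaped; the literal parts written by the developer pass through unchanged.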

String.raw()

Template tags can also access raw string information: through a tag, you can read the raw string as it was before escape sequences were converted into their equivalent characters. The simplest way to work with raw strings is the built-in String.raw() tag function.

let message1 = `Multiline\nstring`,
message2 = String.raw`Multiline\nstring`;
console.log(message1); // "Multiline
                       // string"
console.log(message2); // "Multiline\\nstring"

Raw string information is also passed to template tags. The first argument of a tag function is an array that has an additional property, raw, which is an array containing the raw equivalent information for each literal value. For example, literals[0] always has an equivalent literals.raw[0], which contains its raw string information.
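
As a sketch of how a tag might use the raw property, the function below imitates String.raw() by reading literals.raw instead of literals. The name rawLike is hypothetical.

```javascript
// Hypothetical tag that mimics String.raw by using literals.raw.
function rawLike(literals, ...substitutions) {
  let result = "";
  for (let i = 0; i < substitutions.length; i++) {
    result += literals.raw[i]; // raw text: escape sequences are left untouched
    result += substitutions[i];
  }
  result += literals.raw[literals.raw.length - 1];
  return result;
}

const rawMessage = rawLike`Multiline\nstring ${1 + 1}`;
console.log(rawMessage.includes("\n")); // false: there is no real newline
console.log(rawMessage.length);         // 19: the backslash and the n are two characters
```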

Theme test article, for testing purposes only. Published by: Walker. When reposting, please credit the source: https://walker-learn.xyz/archives/4309

In-depth Understanding of ES6 002 [Study Notes]
