In-depth Understanding of ES6 002 [Study Notes]

Strings and Regular Expressions

JavaScript strings have always been built on a 16-bit character encoding (UTF-16). Each 16-bit sequence is a code unit, and historically each code unit represented one character. String properties and methods such as length and charAt() all operate on these code units.
The goal of Unicode is to provide a globally unique identifier for every character in the world. That identifier, called a code point, is simply a number starting from 0. If characters were limited to 16 bits, there would not be enough code points to cover them all. A character encoding maps code points to code units in an internally consistent way. In UTF-16, a single code point may be represented by more than one code unit.

Better Unicode Support

In the past, 16 bits were sufficient to contain any character, so each 16-bit code unit represented exactly one character. It wasn't until Unicode introduced extended character sets that the encoding rules had to change.

UTF-16

The first $2^{16}$ code points are each represented by a single 16-bit code unit; this range is called the Basic Multilingual Plane (BMP). Code points beyond this range belong to the supplementary planes. To reach them, UTF-16 uses surrogate pairs, in which two 16-bit code units together represent one code point. A string can therefore contain two kinds of characters: BMP characters represented by a single 16-bit code unit, and supplementary-plane characters represented by two 16-bit code units (32 bits in total), such as '𠮷' (String.fromCodePoint(134071)).
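The surrogate pair arithmetic can be sketched in a few lines. The helper name toSurrogatePair is hypothetical (it is not part of the language); it only illustrates how a supplementary-plane code point such as 134071 is split into two 16-bit code units.

```javascript
// Hypothetical helper: split a code point into its UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xFFFF) {
    // BMP code points fit in a single 16-bit code unit.
    return [codePoint];
  }
  const offset = codePoint - 0x10000;      // 20 bits remain after subtracting
  const high = 0xD800 + (offset >> 10);    // top 10 bits go in the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits go in the low surrogate
  return [high, low];
}

console.log(toSurrogatePair(134071)); // [55362, 57271]
console.log(toSurrogatePair(97));     // [97]
```

The two numbers produced for 134071 are the same values that charCodeAt(0) and charCodeAt(1) report for '𠮷'.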

In ECMAScript 5, all string operations work on 16-bit code units. When a string contains surrogate pairs, those operations can produce unexpected results:

let text = "𠮷";
console.log(text.length); //2
console.log(/^.$/.test(text)); //false
console.log(text.charAt(0)); // lone high surrogate, not a printable character
console.log(text.charAt(1)); // lone low surrogate, not a printable character
console.log(text.charCodeAt(0)); // 55362
console.log(text.charCodeAt(1)); // 57271
  • The string in text contains a single character, but its length property is 2.
  • The string is treated as two characters, so a regular expression that matches a single character fails.
  • Neither of the two 16-bit code units represents a printable character on its own, so charAt() does not return a usable character.
  • charCodeAt() cannot identify the character either. It returns the numeric value of each 16-bit code unit.

codePointAt() method

ES6 adds the codePointAt() method with full UTF-16 support. It accepts a code unit position (not a character position) and returns the code point, an integer, found at that position. For BMP characters, codePointAt() returns the same value as charCodeAt(); for non-BMP characters the results differ. In the string '𠮷a', the first character is non-BMP and occupies two code units, so the length property is 3.

let text = '𠮷a'
console.log(text.length) // 3

console.log(text.charCodeAt(0)) // 55362
console.log(text.charCodeAt(1)) // 57271
console.log(text.charCodeAt(2)) // 97

console.log(text.codePointAt(0)) // 134071
console.log(text.codePointAt(1)) // 57271
console.log(text.codePointAt(2)) // 97

To check whether a character occupies two code units (32 bits), you can write the following function:

function is32Bit(c){
  // Code points above 0xFFFF need a surrogate pair
  return c.codePointAt(0) > 0xFFFF;
}

console.log(is32Bit('𠮷')) // true
console.log(is32Bit('a')) // false

fromCodePoint() method

String.fromCodePoint() returns the character for a given code point and can be seen as an extended version of String.fromCharCode(). For all BMP characters the two methods give the same result; they differ only when a non-BMP code point is passed as an argument.

console.log(String.fromCodePoint(134071)) // 𠮷
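A small comparison makes the difference visible. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
const nonBmp = 134071; // 0x20BB7, outside the BMP

const viaCodePoint = String.fromCodePoint(nonBmp);
const viaCharCode = String.fromCharCode(nonBmp); // keeps only the low 16 bits

console.log(viaCodePoint.length);         // 2 (a surrogate pair)
console.log(viaCodePoint.codePointAt(0)); // 134071
console.log(viaCharCode.length);          // 1
console.log(viaCharCode.charCodeAt(0));   // 2999 (0x0BB7)

// For a BMP code point both methods agree.
console.log(String.fromCodePoint(97) === String.fromCharCode(97)); // true
```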

normalize() method

Another interesting aspect of Unicode is that different characters may be considered equivalent for the purpose of sorting or comparison. The same text can be represented by different sequences of code points. Therefore, before comparing strings, you should first standardize them with the normalize() method.

Just remember, before comparing strings, always normalize them to the same form.

// values is assumed to be an array of strings
let normalized = values.map(function(text){
  return text.normalize();
});

normalized.sort(function(first, second){
  if(first < second){
    return -1;
  } else if (first === second) {
    return 0;
  } else {
    return 1;
  }
});
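To see why normalization matters, consider the letter "é", which can be written either as one code point or as "e" plus a combining accent. The sketch below uses that pair as an assumed example; it is not taken from the original notes.

```javascript
const composed = "\u00E9";    // "é" as a single code point
const decomposed = "e\u0301"; // "e" followed by a combining acute accent

// The two strings look identical on screen but compare as different.
console.log(composed === decomposed); // false

// After normalizing both to the same form (NFC is the default), they match.
console.log(composed.normalize() === decomposed.normalize()); // true

console.log(decomposed.normalize("NFC").length); // 1
console.log(composed.normalize("NFD").length);   // 2
```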

Regular Expression u Modifier

The Unicode-aware u modifier switches a regular expression from code unit mode to character mode, so it no longer treats a surrogate pair as two characters and behaves as expected. For example:

let text = '𠮷'

console.log(text.length) // 2
console.log(/^.$/.test(text)) // false
console.log(/^.$/u.test(text)) // true: with the u modifier the regular expression matches characters, so it matches the Japanese character

Calculating Code Point Count

ES6 still does not provide a direct way to get the number of code points in a string (length still returns the number of code units), but with the u modifier you can work it out with a regular expression.

// May be inefficient for long strings; a string iterator can be used instead
function codePointLength(text){
  let rs = text.match(/[\s\S]/gu);
  return rs ? rs.length : 0;
}
// Check whether the engine supports the u modifier
function hasRegExU(){
  try{
    var pattern = new RegExp(".","u");
    return true;
  }catch(ex){
    return false;
  }
}
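The string iterator mentioned in the comment above walks a string by code point rather than by code unit. A minimal sketch of that alternative follows; the function name codePointCount is hypothetical.

```javascript
// Hypothetical alternative to the regular expression approach:
// for...of iterates a string one code point at a time.
function codePointCount(text) {
  let count = 0;
  for (const _char of text) {
    count++;
  }
  return count;
}

console.log(codePointCount("𠮷a")); // 2
console.log("𠮷a".length);          // 3
console.log(codePointCount(""));    // 0
```

Because it does not build an intermediate array of matches, this version should be gentler on memory for long strings.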

Substring Recognition in Strings

  • trim()
  • includes() Returns true if the specified text is found in the string, otherwise returns false.
  • startsWith() Returns true if the specified text is found at the beginning of the string, otherwise returns false.
  • endsWith() Returns true if the specified text is found at the end of the string, otherwise returns false.
  • repeat() Returns a new string consisting of the current string repeated a specified number of times.

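includes(), startsWith(), and endsWith() each accept two arguments: the first specifies the text to search for, and the second is an optional index. For includes() and startsWith(), the search starts at that index; for endsWith(), the string is treated as if it ended at that index. If you need the actual position of a substring within a string, you still need to use the indexOf() or lastIndexOf() methods.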
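A short sketch of these methods in use, including the optional second argument. The sample string is an assumption chosen for illustration.

```javascript
const greeting = "Hello world!";

console.log(greeting.startsWith("Hello")); // true
console.log(greeting.endsWith("!"));       // true
console.log(greeting.includes("o"));       // true

// With the optional second argument:
console.log(greeting.startsWith("o", 4));  // true, index 4 is "o"
console.log(greeting.endsWith("o", 8));    // true, "Hello wo" ends with "o"
console.log(greeting.includes("o", 8));    // false, no "o" in "rld!"

// These methods return booleans only; use indexOf()/lastIndexOf() for positions.
console.log(greeting.indexOf("o"));        // 4
console.log(greeting.lastIndexOf("o"));    // 7
```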

repeat() method

ES6 also added a repeat() method, which accepts a number-type argument indicating the number of times the string should be repeated. The return value is a new string consisting of the current string repeated a specified number of times. For example, it can be used to create indentation levels in code formatters.

let indent = " ".repeat(4),
    indentLevel = 0;
// When the indentation needs to increase
let newIndent = indent.repeat(++indentLevel);

y Modifier in Regular Expressions

The y modifier sets the sticky property of a regular expression. It tells the search to start matching at the position given by the regular expression's lastIndex property; if there is no match at exactly that position, matching stops. The lastIndex property only comes into play when calling methods of the regular expression object, such as exec() and test().
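The difference from the g modifier is easiest to see side by side. The sample string and patterns below are assumptions made up for this sketch.

```javascript
const sample = "hello1 hello2 hello3";
const stickyPattern = /hello\d\s?/y;
const globalPattern = /hello\d\s?/g;

// Start both searches at index 1, which is the "e" in "hello1".
stickyPattern.lastIndex = 1;
globalPattern.lastIndex = 1;

const stickyMiss = stickyPattern.exec(sample);
const globalHit = globalPattern.exec(sample);

console.log(stickyMiss);              // null, nothing matches exactly at index 1
console.log(stickyPattern.lastIndex); // 0, reset after the failed match
console.log(globalHit[0]);            // "hello2 ", g searches forward from index 1

// From lastIndex 0 the sticky pattern matches at the very start.
const stickyFirst = stickyPattern.exec(sample);
console.log(stickyFirst[0]);          // "hello1 "
console.log(stickyPattern.lastIndex); // 7
console.log(stickyPattern.sticky);    // true
```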

Copying Regular Expressions

var reg1 = /ab/i,
    // Throws an error in ES5, works in ES6
    reg2 = new RegExp(reg1, "g");

let re = /ab/g
console.log(re.source); // "ab"
console.log(re.flags); // "g"
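When a regular expression is copied this way in ES6, the flags passed as the second argument replace the original flags rather than being added to them. A minimal sketch, with made-up variable names:

```javascript
const original = /ab/i;
const copy = new RegExp(original, "g"); // same source, flags replaced by "g"

console.log(original.toString()); // "/ab/i"
console.log(copy.toString());     // "/ab/g"

console.log(original.test("AB")); // true, case-insensitive
console.log(copy.test("AB"));     // false, the i flag was not carried over

console.log(copy.source);         // "ab"
console.log(copy.flags);          // "g"
```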

Template Literals

  • Multiline strings: A formal concept of multiline strings.
  • Basic string formatting: The ability to embed variable values into strings.
  • HTML escaping: The ability to insert safely substituted strings into HTML.

In template literals there is no need to escape single or double quotes; only a backtick must be escaped with a backslash (\`). Placeholders take the form ${expression}, most simply a variable name as in ${variableName} (referencing an undeclared variable throws an error). Template literals are themselves JavaScript expressions, so you can embed one template literal inside another, as shown below:

let name = "Nicholas",
    message = `Hello ${
      `my name is ${name}`
    }`;
console.log(message); // "Hello my name is Nicholas"
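The multiline and formatting features listed earlier can be combined in one literal. The names and values in this sketch are assumptions for illustration only.

```javascript
const item = "coffee";
const quantity = 3;
const unitPrice = 1.5;

// The line break inside the backticks becomes part of the string,
// and any expression may appear inside ${...}.
const receipt = `Item: ${item}
Total: ${(quantity * unitPrice).toFixed(2)}`;

console.log(receipt);
// Item: coffee
// Total: 4.50
console.log(receipt.split("\n").length); // 2
```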

Using Tag Functions

function tag(literals, ...substitutions){
  // Return a string
}
// For example
let count = 10,
    price = 0.25,
    message = passthru`${count} items cost $${(count * price).toFixed(2)}.`;

If you have a function named passthru(), then as a template literal tag it receives 3 arguments.
The first is the literals array. The two placeholders split the string into three segments:

  • Before the first placeholder: the empty string ''
  • Between the first and second placeholders: ' items cost $'
  • After the second placeholder: '.'

The second argument is the evaluated value of count, 10, which becomes the first element of the substitutions array. The last argument is the evaluated value of (count * price).toFixed(2), which is "2.50", the second element of the substitutions array. The number of elements in substitutions is always one less than the length of literals.

function passthru(literals, ...substitutions){
  let result = '';
  // Loop once per substitution
  for(let i = 0; i < substitutions.length; i++){
    result += literals[i];
    result += substitutions[i];
  }
  // Append the last literal
  result += literals[literals.length - 1];
  return result;
}
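The same interleaving pattern can serve the HTML escaping use case listed among the template literal features. The sketch below is one possible approach, not a complete sanitizer; escapeHtml and safeHtml are hypothetical names introduced for this example.

```javascript
// Hypothetical helper: escape characters that are special in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical tag: literals pass through unchanged, substitutions are escaped.
function safeHtml(literals, ...substitutions) {
  let result = "";
  for (let i = 0; i < substitutions.length; i++) {
    result += literals[i];
    result += escapeHtml(substitutions[i]);
  }
  // Append the last literal
  result += literals[literals.length - 1];
  return result;
}

const userInput = '<script>alert("x")</script>';
const html = safeHtml`<p>${userInput}</p>`;
console.log(html);
// <p>&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</p>
```

Only the substituted values are escaped, so markup written directly in the template stays intact.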

String.raw()

Template tags can also access raw string information, meaning that through template tags, you can access the raw string before character escapes are converted into equivalent characters. The simplest example is using the built-in String.raw() tag function.

let message1 = `Multiline\nstring`,
message2 = String.raw`Multiline\nstring`;
console.log(message1); // "Multiline
                       // string"
console.log(message2); // "Multiline\\nstring"

Raw string information is also passed to template tags. The first argument of a tag function is an array that has an additional property, raw, which is an array containing the raw equivalent information for each literal value. For example, literals[0] always has an equivalent literals.raw[0], which contains its raw string information.
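A custom tag can inspect both views of its literals. The tag name showRaw below is hypothetical; it simply returns the cooked and raw arrays so they can be compared.

```javascript
// Hypothetical tag: expose the cooked literals and their raw counterparts.
function showRaw(literals, ...substitutions) {
  return {
    cooked: Array.from(literals),
    raw: Array.from(literals.raw)
  };
}

const who = "world";
const parts = showRaw`Line1\n${who}\tend`;

console.log(parts.cooked.length);    // 2
console.log(parts.cooked[0].length); // 6, "\n" was converted to one newline character
console.log(parts.raw[0].length);    // 7, the backslash and the "n" are kept as two characters
console.log(parts.raw[1]);           // \tend
```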

Theme test article, for testing purposes only. Published by Walker. Please credit the source when reposting: https://walker-learn.xyz/archives/4309
