<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position by the
      correct amount for each successive glyph.
    </para>
    <para>
      But for other scripts (often unceremoniously called
      <emphasis>complex scripts</emphasis>), any combination of
      several shaping operations may be required, and the rules for how
      and when they are applied vary from script to script. HarfBuzz and
      other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
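    <para>
      The simple case described above, where the shaper only advances
      the horizontal position, can be sketched in a few lines of
      Python. This is an illustration only, not HarfBuzz code: the
      <literal>ADVANCES</literal> table and the
      <literal>layout_simple</literal> function are hypothetical, and
      the advance widths are invented rather than read from a font.
    </para>
    <programlisting language="python"><![CDATA[
```python
from itertools import accumulate

# Hypothetical advance widths, in font units, for a made-up font.
ADVANCES = {"H": 700, "i": 250, "!": 300}


def layout_simple(glyphs, advances):
    """Return (glyph, x_position) pairs for a left-to-right run."""
    widths = [advances[g] for g in glyphs]
    # Each glyph starts where the previous one ended.
    starts = [0] + list(accumulate(widths))[:-1]
    return list(zip(glyphs, starts))


print(layout_simple(["H", "i", "!"], ADVANCES))
```
]]></programlisting>
    <para>
      Scripts that need reordering, joining, or contextual changes
      cannot be laid out this way, which is why a shaping model is
      needed.
    </para>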
  </section>

  <section id="script-specific-shaping">
    <title>Script-specific shaping</title>
    <para>
      In many scripts, transforming the input sequence into the final
      layout requires some combination of operations, such as
      context-dependent substitutions, context-dependent mark
      positioning, glyph-to-glyph joining, glyph reordering, or glyph
      stacking.
    </para>
    <para>
      In some scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Other scripts do not require these
      operations. However, correctly shaping a text run in
      any script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          adjusts the horizontal and/or vertical position of a
          glyph. This adjustment is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
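    <para>
      Two of the operations listed above, reordering and substitution,
      can be illustrated with a small Python sketch that works on
      lists of glyph names. Everything here is hypothetical: the glyph
      names, the <literal>reorder_prebase</literal> and
      <literal>substitute</literal> functions, and the rule format are
      invented for illustration and are far simpler than the lookups
      that a real font and shaping engine use.
    </para>
    <programlisting language="python"><![CDATA[
```python
def reorder_prebase(glyphs, prebase):
    """Move each pre-base glyph in front of the glyph that precedes it.

    The input is in logical order; the result is in visual order.
    """
    out = []
    for g in glyphs:
        if g in prebase and out:
            out.insert(len(out) - 1, g)
        else:
            out.append(g)
    return out


def substitute(glyphs, rules):
    """Replace matching subsequences, trying the rules in listed order."""
    out, i = [], 0
    while i < len(glyphs):
        for pattern, replacement in rules:
            if tuple(glyphs[i:i + len(pattern)]) == pattern:
                out.append(replacement)
                i += len(pattern)
                break
        else:
            # No rule matched at this position; keep the glyph as-is.
            out.append(glyphs[i])
            i += 1
    return out


# Made-up glyph names: a vowel sign that is drawn before its consonant.
print(reorder_prebase(["ka", "i_sign"], {"i_sign"}))

# Made-up ligature rules; longer patterns are listed first.
RULES = [(("f", "f", "i"), "f_f_i"), (("f", "i"), "f_i")]
print(substitute(["f", "f", "i", "x"], RULES))
```
]]></programlisting>
    <para>
      Listing the longer pattern first is a design choice of this
      sketch so that the three-glyph sequence wins over the two-glyph
      one.
    </para>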
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
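    <para>
      As an illustration, the General Category of a codepoint can be
      inspected with the <literal>unicodedata</literal> module from
      the Python standard library. The sample characters and the
      <literal>major_and_minor</literal> helper are chosen for this
      sketch only; the results depend on the version of the Unicode
      Character Database bundled with the Python interpreter.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata

# Sample codepoints picked for this sketch.
SAMPLES = {
    "A": "Latin capital letter A",
    "\u02B0": "modifier letter small h",
    "\u0301": "combining acute accent",
    "\u093E": "Devanagari vowel sign AA",
    "\u20DD": "combining enclosing circle",
}


def major_and_minor(ch):
    """Return the Major category letter and the full two-letter code."""
    code = unicodedata.category(ch)
    return code[0], code


for ch, label in SAMPLES.items():
    major, code = major_and_minor(ch)
    print(f"U+{ord(ch):04X} {label}: major={major} code={code}")
```
]]></programlisting>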
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the relative visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified by regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>
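    <para>
      The idea of a syllable regular expression can be sketched as
      follows. Each codepoint is first mapped to a one-letter class,
      and the regular expression is then matched against the string
      of classes. This is a deliberately tiny toy: the
      <literal>classify</literal> table covers only a few Devanagari
      ranges, it does not use the real UISC or UIPC data, and the
      pattern is not the syllable grammar of any actual shaping model.
    </para>
    <programlisting language="python"><![CDATA[
```python
import re


def classify(ch):
    """Map a codepoint to a one-letter class (toy Devanagari table)."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant letter
    if cp == 0x094D:
        return "H"  # virama (halant)
    if 0x093E <= cp <= 0x094C:
        return "M"  # dependent vowel sign
    return "X"      # anything else


# Toy syllable: zero or more consonant+virama pairs, a consonant, then
# any vowel signs. Anything that does not fit becomes its own piece.
SYLLABLE = re.compile(r"(?:CH)*CM*|.")


def split_syllables(text):
    """Split text into pieces by matching the pattern on the classes."""
    classes = "".join(classify(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


# The Devanagari word "namaste", written with escapes for clarity.
word = "\u0928\u092E\u0938\u094D\u0924\u0947"
print(split_syllables(word))
```
]]></programlisting>
    <para>
      Because the class string has exactly one letter per codepoint,
      match offsets in the class string can be used directly as
      offsets into the original text.
    </para>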

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
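    <para>
      A rough sketch of segmenting text into runs by script is shown
      below. It is only a heuristic: the Python standard library does
      not expose the Unicode Script property, so the hypothetical
      <literal>rough_script</literal> helper guesses a label from the
      first word of each letter's character name. Non-letters inherit
      the script of the preceding letter. Direction, language, and
      font changes are not handled here.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata


def rough_script(ch):
    """Guess a script label from the first word of the character name.

    Only letters get a label. Everything else returns None so that it
    can inherit the script of the preceding letter.
    """
    if unicodedata.category(ch).startswith("L"):
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return None


def split_runs(text):
    """Split text into (script, characters) runs."""
    runs = []  # list of [script, characters] pairs
    for ch in text:
        script = rough_script(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1] += ch
        else:
            runs.append([script or "COMMON", ch])
    return [(script, chars) for script, chars in runs]


# "Hello" followed by a Hebrew word, written with escapes.
print(split_runs("Hello \u05E9\u05DC\u05D5\u05DD"))
```
]]></programlisting>
    <para>
      A production itemizer would instead rely on the Unicode Script
      property and the Unicode Bidirectional Algorithm to determine
      script and direction.
    </para>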
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides the following shaping models:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          scripts with no script-specific shaping model, and may also
          be used as a fallback for handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, and Telugu.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports scripts not covered by one of
          the script-specific shaping models above, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>
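    <para>
      The list above can be summarized as a lookup from a script to a
      shaping model, with the default model as the fallback. The
      sketch below is illustrative only: it keys on ISO 15924 script
      codes, covers just the scripts named in the list, and the
      <literal>SHAPING_MODELS</literal> table and
      <literal>shaping_model</literal> function are hypothetical names
      that do not reflect how any particular shaping engine selects a
      shaper.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Scripts named in the list above, as ISO 15924 codes.
SHAPING_MODELS = {
    "indic": ["Beng", "Deva", "Gujr", "Guru", "Knda", "Mlym", "Orya",
              "Taml", "Telu"],
    "arabic": ["Arab", "Mong", "Nkoo", "Syrc"],
    "thai/lao": ["Thai", "Laoo"],
    "khmer": ["Khmr"],
    "myanmar": ["Mymr"],
    "tibetan": ["Tibt"],
    "hangul": ["Hang"],
    "hebrew": ["Hebr"],
    "use": ["Java", "Bali", "Bugi", "Batk", "Cakm", "Lepc", "Modi",
            "Phag", "Tglg", "Sidd", "Sund", "Tale", "Lana", "Tavt"],
}

# Invert the table so that each script code maps to its model name.
MODEL_FOR_SCRIPT = {
    script: model
    for model, scripts in SHAPING_MODELS.items()
    for script in scripts
}


def shaping_model(script_code):
    """Return the model name for a script, or "default" as a fallback."""
    return MODEL_FOR_SCRIPT.get(script_code, "default")


for code in ("Deva", "Arab", "Lana", "Latn"):
    print(code, shaping_model(code))
```
]]></programlisting>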

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines that match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines that match
      glyph sequences and select the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
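    <para>
      The general idea of a finite-state machine that walks a glyph
      sequence and triggers a shaping operation can be sketched as
      follows. This is a toy model under stated assumptions: the
      states, glyph classes, <literal>TABLE</literal>, and
      <literal>run_machine</literal> are invented for illustration and
      do not reproduce the actual AAT or Graphite table formats.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Transition table: (state, glyph class) -> (next state, action).
# Any pair that is not listed returns to "start" with no action.
TABLE = {
    ("start", "f"): ("saw_f", None),
    ("saw_f", "f"): ("saw_f", None),
    ("saw_f", "i"): ("start", "ligate"),
}


def run_machine(glyphs):
    """Walk the glyph names once, forming an f_i ligature when found."""
    out, state = [], "start"
    for g in glyphs:
        cls = g if g in ("f", "i") else "other"
        state, action = TABLE.get((state, cls), ("start", None))
        if action == "ligate":
            # Replace the pending "f" with the ligature glyph.
            out[-1] = "f_i"
        else:
            out.append(g)
    return out


print(run_machine(["f", "i", "n"]))
print(run_machine(["f", "f", "i"]))
```
]]></programlisting>
    <para>
      Because the machine operates on glyph names rather than
      codepoints, the same sketch applies regardless of which
      characters produced those glyphs.
    </para>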
  </section>
</chapter>