<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
                  "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
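    <para>
      The simple case can be sketched in a few lines of Python. This
      is only an illustration: the <literal>TOY_FONT</literal> table,
      its glyph names, and its advance widths are invented for the
      example and do not come from any real font or from the HarfBuzz
      API.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical font data: glyph name and horizontal advance (in font units)
# for a few characters. A real shaper reads these from the font itself.
TOY_FONT = {
    "H": ("H", 700),
    "i": ("i", 250),
    "!": ("exclam", 300),
}


def shape_simple(text, font=TOY_FONT):
    """Map each character to a glyph and place it at the running x position.

    Returns a list of (glyph_name, x_position) pairs and the total advance.
    """
    x = 0
    placed = []
    for ch in text:
        glyph, advance = font[ch]
        placed.append((glyph, x))
        x += advance  # move the pen forward for the next glyph
    return placed, x


if __name__ == "__main__":
    placed, width = shape_simple("Hi!")
    print(placed, width)
```
]]></programlisting>
    <para>
      Each glyph is placed at the sum of the advances of the glyphs
      before it; no substitution, reordering, or vertical adjustment
      takes place.
    </para>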
    <para>
      For other scripts (often unceremoniously called
      <emphasis>complex scripts</emphasis>), however, any combination of
      several shaping operations may be required, and the rules for how
      and when they are applied vary from script to script. HarfBuzz and
      other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>

  <section id="script-specific-shaping">
    <title>Script-specific shaping</title>
    <para>
      In many scripts, transforming the input sequence into the final
      layout requires some combination of operations, such as
      context-dependent substitutions, context-dependent mark
      positioning, glyph-to-glyph joining, glyph reordering, or glyph
      stacking.
    </para>
    <para>
      In some scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Other scripts do not require these
      operations. However, correctly shaping a text run in
      any script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          adjusts the horizontal and/or vertical position of a
          glyph. This adjustment is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
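    <para>
      Two of these operations, reordering and contextual
      substitution, can be sketched as follows. The sketch is
      deliberately simplified and rests on stated assumptions: it
      works on codepoints and invented glyph names rather than real
      glyph IDs, the reordering moves a mark only past the single
      item before it (a real shaping model moves it past a whole
      consonant cluster), and the function names are hypothetical,
      not part of HarfBuzz.
    </para>
    <programlisting language="python"><![CDATA[
```python
# U+093F DEVANAGARI VOWEL SIGN I follows its consonant in logical order
# but is drawn to the left of it. The set is a toy stand-in for real data.
PRE_BASE = {"\u093F"}


def reorder_pre_base(seq):
    """Reordering: move each pre-base mark in front of the preceding item."""
    out = []
    for item in seq:
        if item in PRE_BASE and out:
            out.insert(len(out) - 1, item)  # logical -> visual position
        else:
            out.append(item)
    return out


def substitute_initial(glyphs, target, replacement):
    """Contextual substitution: replace `target` only when it comes first."""
    if glyphs and glyphs[0] == target:
        return [replacement] + list(glyphs[1:])
    return list(glyphs)


if __name__ == "__main__":
    # KA + VOWEL SIGN I in logical order becomes VOWEL SIGN I + KA visually.
    print(reorder_pre_base(["\u0915", "\u093F"]))
    # "a.init" is an invented glyph name for an initial form of "a".
    print(substitute_initial(["a", "b", "a"], "a", "a.init"))
```
]]></programlisting>
    <para>
      In the second call only the first <literal>a</literal> is
      replaced, because the substitution is conditioned on the
      glyph's position in the sequence.
    </para>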
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
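    <para>
      These categories can be inspected with the
      <literal>unicodedata</literal> module in Python's standard
      library, whose <literal>category()</literal> function returns
      the two-letter minor category; the first letter is the Major
      category. The helper name <literal>categories</literal> and the
      choice of sample codepoints are assumptions made for this
      illustration.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata


def categories(ch):
    """Return (major, minor) General Category codes for one character."""
    minor = unicodedata.category(ch)  # e.g. "Lu", "Mn"
    return minor[0], minor


if __name__ == "__main__":
    samples = [
        "A",       # LATIN CAPITAL LETTER A
        "\u02B0",  # MODIFIER LETTER SMALL H
        "\u0301",  # COMBINING ACUTE ACCENT
        "\u093F",  # DEVANAGARI VOWEL SIGN I
        "\u20DD",  # COMBINING ENCLOSING CIRCLE
    ]
    for ch in samples:
        major, minor = categories(ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {major} / {minor}")
```
]]></programlisting>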
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the relative visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified in regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>
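    <para>
      The idea of matching syllables with a regular expression over
      character classes can be sketched as follows. Everything
      specific here is an assumption made for illustration: the
      <literal>TOY_CLASSES</literal> table covers only a handful of
      Devanagari codepoints, the one-letter class names stand in for
      UISC values, and the pattern is far simpler than the syllable
      grammar of any real shaping model.
    </para>
    <programlisting language="python"><![CDATA[
```python
import re

# Toy classification (an assumption, not real UISC data):
# C = consonant, H = halant (virama), M = vowel sign (matra).
TOY_CLASSES = {
    "\u0915": "C",  # DEVANAGARI LETTER KA
    "\u0924": "C",  # DEVANAGARI LETTER TA
    "\u0937": "C",  # DEVANAGARI LETTER SSA
    "\u094D": "H",  # DEVANAGARI SIGN VIRAMA
    "\u093E": "M",  # DEVANAGARI VOWEL SIGN AA
    "\u093F": "M",  # DEVANAGARI VOWEL SIGN I
}

# A syllable is zero or more consonant+halant pairs, a final consonant,
# and an optional vowel sign. Anything else becomes a one-item cluster.
SYLLABLE = re.compile(r"(?:CH)*CM?|.")


def split_syllables(text):
    """Split text into toy syllables by matching on its class string."""
    # One class letter per codepoint, so match offsets index `text` directly.
    classes = "".join(TOY_CLASSES.get(ch, "X") for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


if __name__ == "__main__":
    word = "\u0915\u094D\u0937\u093F\u0924\u093F"  # KA VIRAMA SSA I, TA I
    for syllable in split_syllables(word):
        print([f"U+{ord(ch):04X}" for ch in syllable])
```
]]></programlisting>
    <para>
      Matching on a string of class letters rather than on the text
      itself keeps the pattern independent of any one script's
      codepoints.
    </para>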

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
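    <para>
      Segmentation by script can be sketched as below. Python's
      standard library does not expose the Unicode Script property,
      so the hypothetical <literal>rough_script</literal> helper
      guesses a script from the first word of a letter's character
      name; this is a crude assumption that is good enough only for
      the example. Codepoints that are not letters are attached to
      the run that precedes them. Direction and language, which must
      also be uniform within a run, are not handled here.
    </para>
    <programlisting language="python"><![CDATA[
```python
import unicodedata


def rough_script(ch):
    """Crude stand-in for the Script property (an assumption for this demo).

    Letters report the first word of their character name, e.g. "LATIN".
    Everything else (spaces, digits, punctuation, marks) reports None.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def split_runs(text):
    """Split text into (script, substring) runs with one script per run."""
    runs = []  # each entry is a mutable [script, substring] pair
    for ch in text:
        script = rough_script(ch)
        last = runs[-1] if runs else None
        if last and (script is None or last[0] is None or script == last[0]):
            if last[0] is None:
                last[0] = script  # leading neutral text adopts the script
            last[1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


if __name__ == "__main__":
    # Latin letters and a space, followed by four Hebrew letters.
    for script, chunk in split_runs("abc \u05e9\u05dc\u05d5\u05dd"):
        print(script, [f"U+{ord(ch):04X}" for ch in chunk])
```
]]></programlisting>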
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides shaping models for the following scripts:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          scripts with no script-specific shaping model, and may also
          be used as a fallback for handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, and Telugu.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

302
303 <listitem>
304 <para>
305 The <emphasis>Universal Shaping Engine</emphasis> (USE)
Khaled Hosny8d363002022-06-03 21:00:08 +0200306 shaping model supports scripts not covered by one of
Nathan Willis3a27e8f2018-10-12 18:23:26 -0500307 the above, script-specific shaping models, including
308 Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
309 Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
310 Viet, and many others.
311 </para>
312 </listitem>
313
314 <listitem>
315 <para>
316 Text runs that do not fall under one of the above shaping
317 models may still require processing by a shaping engine. Of
318 particular note is <emphasis>Emoji</emphasis> shaping, which
319 may involve variation-selector sequences and glyph
320 substitution. Emoji shaping is handled by the default
321 shaping model.
322 </para>
323 </listitem>
324
325 </itemizedlist>
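    <para>
      The list above amounts to a lookup from script to shaping
      model, with the default model as the fallback. A minimal sketch
      of such a lookup follows. The model labels (such as
      <literal>"indic"</literal> and <literal>"use"</literal>) and
      the function name are invented for the example, and the table
      contains only the scripts named in this section, so it is not
      a complete or authoritative mapping.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Scripts grouped by shaping model, taken from the list in this section.
# The model labels are hypothetical identifiers chosen for the sketch.
SCRIPTS_BY_MODEL = {
    "indic": ["Bengali", "Devanagari", "Gujarati", "Gurmukhi", "Kannada",
              "Malayalam", "Oriya", "Tamil", "Telugu"],
    "arabic": ["Arabic", "Mongolian", "N'Ko", "Syriac"],
    "thai-lao": ["Thai", "Lao"],
    "khmer": ["Khmer"],
    "myanmar": ["Myanmar"],
    "tibetan": ["Tibetan"],
    "hangul": ["Hangul"],
    "hebrew": ["Hebrew"],
    "use": ["Javanese", "Balinese", "Buginese", "Batak", "Chakma", "Lepcha",
            "Modi", "Phags-pa", "Tagalog", "Siddham", "Sundanese", "Tai Le",
            "Tai Tham", "Tai Viet"],
}

# Invert the grouping into a direct script -> model table.
MODEL_FOR_SCRIPT = {
    script: model
    for model, scripts in SCRIPTS_BY_MODEL.items()
    for script in scripts
}


def shaping_model(script):
    """Return the model label for a script, falling back to "default"."""
    return MODEL_FOR_SCRIPT.get(script, "default")


if __name__ == "__main__":
    for script in ["Devanagari", "Syriac", "Javanese", "Latin"]:
        print(script, "->", shaping_model(script))
```
]]></programlisting>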

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines to match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines that match
      glyph sequences and specify the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
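    <para>
      How a finite-state machine can pair glyph sequences with
      shaping operations is sketched below for a single ligature.
      This is a toy model built on assumptions: the states, the
      transition table, and the glyph names (<literal>f</literal>,
      <literal>i</literal>, <literal>f_i</literal>) are invented for
      the example, and the code does not reflect the actual table
      formats used by AAT or Graphite fonts.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Toy state machine: state 0 is the start state, state 1 means the
# previous glyph was "f". Each entry maps (state, glyph) to
# (next_state, ligature_to_emit_or_None). All names are hypothetical.
TRANSITIONS = {
    (0, "f"): (1, None),
    (1, "f"): (1, None),
    (1, "i"): (0, "f_i"),  # "f" followed by "i" forms a ligature
}


def run_fsm(glyphs):
    """Walk the glyph sequence and apply the ligature action when it fires."""
    out = []
    state = 0
    for glyph in glyphs:
        # Any unlisted (state, glyph) pair resets the machine with no action.
        state, ligature = TRANSITIONS.get((state, glyph), (0, None))
        if ligature is not None:
            out[-1] = ligature  # replace the pending "f" with the ligature
        else:
            out.append(glyph)
    return out


if __name__ == "__main__":
    print(run_fsm(["f", "i", "t"]))
    print(run_fsm(["o", "f", "f", "i"]))
```
]]></programlisting>
    <para>
      Because the machine consumes glyph names rather than
      codepoints, the same rule fires no matter which characters
      produced those glyphs.
    </para>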
  </section>
</chapter>