What is a TeX token list?
So, just what is a "TeX token list"?
In a previous article in this series on low-level TeXnicalities we explored the processes through which TeX scans your .tex file to generate new tokens: we examined the fundamental nature of a TeX token and how TeX creates them (see What is a “TeX token”?).
In this follow-up article we take a look at token lists: what they are and how TeX engines create and use them. Gaining an understanding of token lists can be tricky because they are stored deep in TeX’s internals: those details are hidden away from the user, although, today, this is not always true if you do more advanced programming with LuaTeX. For now, you can start to think of token lists as TeX’s way of storing a series of integer values, where each integer is a token derived from a character or command that TeX has read from your input file.
Token lists play a pivotal role in the internal operation of TeX, often in surprising ways, such as powering commands like \uppercase and \lowercase. One particularly important use of token lists is storing and executing macros, a topic we will examine in detail in a future article in this series.
TeX gets its input from files and token lists
TeX engines have three sources of input, two of which you may already know:
- physical text files stored on disk;
- text that a user types into the terminal (command line);
but they also have a third way of reading/obtaining input: token lists!
Token lists are, in effect, an internal data-storage facility that TeX uses as part of its operations. Because token lists act as a “storage facility” for previously created tokens, it makes sense for TeX to be able to re-use them as another source of input. When TeX needs to take its next input from a particular token list (or is instructed to do so), it temporarily halts reading input from a physical file (i.e., creating new tokens) and switches to obtaining its input from existing tokens: the in-memory location where the token list is stored. Clearly, with a token list the scanning and token-generation work has already taken place, so TeX just needs to look at each token in the list and decide what to do with it.
By way of a quick example, the low-level (TeX primitive) \toks command lets you create a list of tokens that TeX saves in memory for later re-use:
\toks100={Hello}
To retrieve those tokens (i.e., tell TeX to treat them as its next source of input) you’d issue a command such as
\the\toks100
This will cause TeX to switch from creating new tokens from your input file to getting its next tokens from where those tokens (created by \toks) are stored: a so-called token register, which is just an internal memory location known to TeX (here it is register 100).
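Putting the two together, here is a minimal, complete plain TeX sketch you can compile to see the round trip (the register number, 100, is arbitrary):

\toks100={Hello}  % store the tokens H, e, l, l, o in token register 100
\the\toks100      % re-read those stored tokens as input: typesets "Hello"
\bye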
In addition, token lists can be internally generated, on the fly, by a number of TeX commands. One example is the command \jobname, which generates a series of character tokens: one token for each character in the name of the main file that TeX is processing. Another example is the \string command; for example, \string\mymacro generates a character token for each letter in the name \mymacro, including the initial \ character. We take a closer look at some “token-generating commands” at the end of this article.
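As a quick sketch, compiling a short plain TeX file demonstrates both commands; the typeset results naturally depend on the name of your input file, and \mymacro need not even be defined because \string does not expand its argument:

Job name: \jobname \par
Command name: {\tt \string\mymacro} % \tt: a font in which '\' prints as a backslash
\bye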
Token lists: Explained by analogy
Unless you have a programming background and/or some knowledge of computer science, “token lists” may be a somewhat hazy concept and, perhaps, a little confusing. However, if you wish to become proficient in writing TeX/LaTeX macros then a good understanding of topics such as TeX tokens, token lists and category codes (\catcode) will prove extremely useful.
In this section we’re going to use an analogy to illustrate the core ideas of a TeX token list: how TeX stores tokens in memory. It’s worth taking the time to read this through because token lists are a fundamental aspect of TeX’s operation.
Token lists: An analogy (thought experiment)
We are going to work through a “thought experiment” to provide a basis for understanding TeX token lists. Imagine that you had access to a large set of containers, such as hundreds of tins; we can’t use the term “box” to describe our thought-experiment containers because, of course, “box” has a very specific meaning in TeX, quite unrelated to our discussion here. So we’ll call our containers “Tins”, where each Tin:
- has a unique identifying number printed on its exterior;
- is (internally) split into two compartments.
Those two compartments are designed as follows:
- the left-hand compartment holds the item you want to put in the Tin;
- the right-hand compartment is designed to hold a piece of paper onto which you can write a single number: the number identifying another Tin.
Suppose that you have a collection of, say, 5 items and you want to store that collection of items within those Tins; but, alas, each Tin can only hold 1 item of the type you want to store.
For simplicity, let’s assume we wanted to store 5 coloured circles.
Furthermore, when you go back to retrieve those items from your storage system (Tins) those items must be retrieved/found in a particular order—the order in which they were stored: that sequence must be preserved. How can you achieve this?
We can take advantage of the fact that each Tin:
- has a unique identifying number attached to its exterior;
- has 2 compartments, only 1 of which we will use to contain our item; the other will contain a piece of paper with another Tin’s number written on it.
We’ll assume every Tin is empty—but there’s nothing to stop you opening any particular Tin to check if it is empty; if it isn’t, try the next one until you find an empty Tin.
What we could do is as follows. Put our first item (dark green circle) in one of our Tins (e.g., Tin 124) and make a note of the number of this first Tin—it does not matter what number that first Tin has, all that matters is that we write it down somewhere and save it for later use.
Find a second Tin—any Tin number (e.g., Tin 432)—and take a note of its number. Write the number of that second Tin (432) on a piece of paper and place that note into the first Tin (Tin 124). We place our second item (light green circle) into the second Tin. So, we currently have the following situation:
- a written note—not stored in a Tin—stating that the first Tin is number 124 (it contains our first item);
- within Tin 124 we have added another note saying the next item is to be found in Tin 432.
In essence, we have linked our first two Tins: we know where to start (Tin 124) and that a note in Tin 124 tells us which Tin contains the next item (Tin 432).
We then find a third Tin, write down its number (e.g., Tin 543) on a piece of paper and place that in the second Tin (number 432). We then place our third item (red circle) into the third Tin.
Now we have linked three Tins in the sequence: our starting point, Tin 124 (dark green circle) → Tin 432 (light green circle) → Tin 543 (red circle) → …
Repeat this process for the final two items (light blue and dark blue circles) using Tin 213 (light blue circle) and Tin 102 (dark blue circle).
We now have all 5 Tins linked together (using the numeric identifier of each Tin) and are able to retrieve all our stored items—in the correct order—simply by visiting each Tin in turn, removing our item and looking at the note telling us which Tin contains our next item.
What about the last item in our list (Tin 102)?
Why should we be concerned by this one in particular? So far we have stored each item in a Tin, together with a note saying which Tin contains the next item: for the last item in our list, what should that note say, given that there is no next Tin?
When we reach the final Tin it has to be obvious that this Tin (containing the last item) is the final one in our list: we do not need to look for another Tin, because there isn’t one. One way to do that is to place a “special” Tin number inside our final Tin (102). We can use any number we wish provided it is unique and not the number of an actual Tin, for example “Tin -1” or “Tin 0”: it does not matter, so long as we know that “Tin -1” or “Tin 0” immediately tells us to stop looking; there are no more Tins and thus no more items to retrieve.
From “items” and “Tins” to tokens and TeX
We now need to move from our analogy to a description that is closer to TeX’s reality. Firstly, it should be clear that, instead of storing differently coloured circles, our imaginary Tins could just as easily store TeX tokens: simple integers. That’s the easier part of moving our analogy across to the realm of software (TeX). But what might be the software equivalent of our physical numbered Tins with “compartments”?
We don’t want to venture too far into programming concepts, but you can think of our “Tins” as representing a few bytes of computer memory which have been “packaged up” into a convenient unit of storage. Our analogy’s use of a numeric identifier for each Tin can be considered as the location inside computer memory where each little package of memory is located. Within TeX itself, those little packages of storage are called “memory words”, a term which reflects the era in which TeX was created (the 1970s). These “memory words” are the fundamental building block used within TeX, but we don’t need to explore them in any more detail here; anyone who wants further detail can refer to an article on the author’s personal blog.
In computer programming terms, what we have been discussing is called a linked list: a TeX token list is a linked list built from TeX’s storage containers called memory words (see the sketch after this list), where each memory word can be used to store:
- a value: the value of the token (an integer);
- a link: the memory location of the next memory word containing the next token in our list.
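You cannot inspect TeX’s memory words from within a document, but the following sketch gives a feel for the linking at work: each time we append to a token register, TeX re-reads the existing tokens and links the new ones onto the end of the chain. This assumes an e-TeX based engine (any modern distribution), so that the \showtokens command discussed at the end of this article is available:

\toks0={red}                           % a 3-token list: r, e, d
\toks0=\expandafter{\the\toks0 -green} % re-read the old tokens, link new ones on the end
\showtokens\expandafter{\the\toks0}    % the log shows something like: > red-green.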
Where does TeX use token lists?
Everywhere! This is true because a TeX/LaTeX macro definition (e.g., a LaTeX command) is stored as a (slightly specialized) form of token list: specialized in the sense that it contains tokens you don’t see in “standard” token lists (related to matching macro parameters etc.). Don’t worry about this because we’ll address those details in a future article.
An example macro
A macro can be thought of as comprising three parts:
\def\<macro name><parameter text>{<replacement text>}
Note that instead of \def you could have used \edef, \gdef or \xdef.
Note to LaTeX users: Here we are defining macros using raw, low-level TeX commands (called primitives). LaTeX users will be more familiar with creating macros via LaTeX’s \newcommand (which is itself a macro).
When you ask TeX to create (define) a macro it will create a token which represents the <macro name> and a token list which represents the combined <parameter text> and <replacement text>. TeX will carefully store everything so that the token representing <macro name> is linked to the token list representing its definition (<parameter text> and <replacement text>).
For example, if we define \mymacro like this:
\def\mymacro abc #1 defz{I typed "#1"!}
We can see that its constituent parts are:
- <macro name> = mymacro
- <parameter text> = abc #1 defz
- <replacement text> = I typed "#1"!
You could then call \mymacro like this:
\mymacro abc THIS TEXT defz
which results in I typed "THIS TEXT"! being typeset. The abc and defz are not typeset: they are sequences of character tokens used to delimit the macro parameter #1, and they are absorbed and discarded when your macro call is successfully processed by TeX.
When you defined \mymacro, the pattern of tokens contained in the stored <parameter text> acts as a “template” that TeX can use to work out:
- which tokens in your input are the delimiter tokens;
- which tokens in your input actually form the parameter(s) of your macro (here, whatever you are using for #1 in your call of \mymacro).
You have to call \mymacro with a <parameter text> containing delimiters that are identical to the ones used to define it; that includes using character delimiters with identical category codes. If the delimiters in the <parameter text> used to call \mymacro are different from the ones used to define it (the “template” stored in memory), then TeX can become rather confused: when it tries to process \mymacro it will not be able to match the “template” it has saved in its memory.
When TeX sees that you are calling a macro it will scan your input text to create new tokens and try, token by token, to match them against the <parameter text> template stored as part of your macro’s definition. If the delimiters used in your input text result in a series of tokens that don’t match the ones stored in the “template”, then TeX will usually throw an error.
TeX is very particular—remember that character tokens are a combination of character code and category code: if you change the category code of a character you get a different token value resulting from that character.
Suppose we changed the category code of z to, say, 12 (ordinarily it is 11) and try to call our macro like this:
\catcode`z=12
\mymacro abc THIS TEXT defz more text here...
This time it will not work because the category code of z has been changed. You will see an error such as this:
Runaway argument?
THIS TEXT defz
! Paragraph ended before \mymacro was complete.
<to be read again>
\par
l.22
When TeX reads and scans the z in defz it cannot recognize it as forming the end of \mymacro’s <parameter text>. Up until seeing that erroneous z, TeX had correctly matched the first 3 characters def, but that z (with category code 12) trips up TeX’s scanning. Assuming z had a category code of 11 when we defined \mymacro, that would result in a token value of 256×11 + 122 = 2938 being stored as part of \mymacro’s definition (i.e., stored as part of the <parameter text> “template”). However, with category code 12, z will now create a token value of 256×12 + 122 = 3194. Because the token value (for z) read in from your input (value 3194) does not match the z-token contained in the stored <parameter text> token list template (value 2938), TeX will carry on scanning your input. TeX will continue to scan the text following your macro call (more text here...) to look for additional tokens, trying to match the stored <parameter text> template against the tokens it finds. It probably won’t find the correct pattern of tokens, and errors will result as TeX “overshoots” your input and erroneously reads extra text to create additional tokens: those extra tokens should not have been read at this point and will almost certainly generate an error.
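If the category-code change is only needed temporarily, one common idiom, sketched here for illustration, is to restore the original category code before calling the macro, so that the tokens created from your input once again match the stored template:

\catcode`z=12   % z now yields token value 3194: calls to \mymacro will fail
% ... any work requiring z with catcode 12 goes here ...
\catcode`z=11   % restore z to catcode 11 (letter): token value 2938 once more
\mymacro abc THIS TEXT defz   % matches the stored template again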
We’ll go into this in more detail in a future article.
Other uses of token lists
Other commands used to create/store token lists include:
\toks<n>={...}
\everypar={...}
\everymath={...}
\everydisplay={...}
\everyhbox={...}
\everyvbox={...}
\output={...}
\everyjob={...}
\everycr={...}
\errhelp={...}
Each one of these commands creates a token list from the characters and commands within the braces ‘{...}’, and that list of tokens is intended to be re-used in certain circumstances. For example, \everypar={...} creates and stores a set of tokens (a token list) that TeX injects into the input just before it starts a new paragraph.
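For instance, this short plain TeX sketch (the marker text is purely illustrative) flags each new paragraph:

\everypar={\llap{* }} % these tokens are injected at the start of every paragraph
Every paragraph from here on is marked with an asterisk in the left margin.

Including this one.
\bye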
Hidden uses of token lists: examples
In this final section we’ll look at some practical examples of token lists being used in ways you might not expect.
Example 1: \uppercase{...} and \lowercase{...}—temporary token lists
In addition to explicit commands to generate token lists, there are circumstances when TeX generates a hidden and temporary internal token list in order to do some special processing. Remember that when TeX reads/processes your input, characters/commands are turned into tokens: the fundamental building blocks that TeX engines work with.
A good example is the pair of commands \uppercase{...} and \lowercase{...} because their operation can, on first encounter, be rather confusing. Once you understand what they are doing (deep inside TeX and invisible to the user), their operation becomes much easier to comprehend.
Suppose you have a simple series of letters, such as abcde, that you want to convert to uppercase: ABCDE. Well, it’s simple enough with TeX’s \uppercase command:
\uppercase{abcde}
will cause TeX to output ABCDE. Now let’s suppose we wanted to save our simple series of letters for use later on, i.e., we don’t want to output them straight away, so we’ll use TeX’s only internal (as opposed to external, file-based) mechanism for saving data: a token list. We can do that by either using an explicit token list command or creating a macro:
\toks100={abcde}
\def\mychars{abcde}
Then, at some point, you might decide that you’d like to re-use your series of letters but, this time, in uppercase; so you try
\uppercase{\the\toks100}
and
\uppercase{\mychars}
But, alas, neither of these works. Why is that?
Secret token lists!
To understand how the commands \uppercase{...} and \lowercase{...} actually work I needed to peek inside the inner workings of TeX, so the following explanation is derived from doing that.
When TeX detects either \uppercase{<material>} or \lowercase{<material>} in your input, the first thing it does is create a temporary internal token list from the <material> enclosed between the ‘{’ and ‘}’ following the \uppercase or \lowercase command.
A crucial point, and central to understanding how \uppercase{<material>} and \lowercase{<material>} actually work, is that any commands or macros contained in the <material> are not expanded: all that TeX does is generate tokens from the characters and commands placed between {...}. During the operation of \uppercase{<material>} or \lowercase{<material>} nothing between the braces is executed: it is simply turned into tokens.
After the <material> inside the {...} has been converted into a (temporary) token list, TeX re-visits every token in that list and tests whether it is a character token or a command token (using the numeric value of the token). If TeX detects a character token it modifies that token to adjust the case of the character (according to whether \uppercase or \lowercase is being processed). TeX simply ignores any command tokens: it does not “look into” them to see what they represent or contain (e.g., a macro containing characters); they are simply skipped over, and only character tokens are actually processed/affected by case-changing operations.
So, for example, if we issue a TeX command such as \uppercase{abcde}, TeX will create a token list from abcde containing nothing but character tokens: they are all adjusted to create a series of modified tokens representing A, B, C, D, and E. Those modified tokens are fed back into TeX’s input processor, which results in ABCDE being typeset. However, if we have stored our characters within a macro, for example \def\mychars{abcde}, and try to convert them to uppercase like this:
\uppercase{\mychars}
then it will fail and abcde will be typeset, not ABCDE as you might expect. If we instead store our characters in a token register, such as \toks0={abcde}, and do \uppercase{\the\toks0} then, once again, \uppercase will fail because the token list will consist entirely of tokens that are not affected by \uppercase.
Taking the example of our macro, \mychars: after TeX detects \uppercase in the input, it looks up the meaning of \uppercase and actions it, creating a temporary token list from {\mychars}. Clearly, that temporary token list contains just one token, which is not a character token but one that represents our macro command \mychars: hence, for the purposes of executing \uppercase, that token is ignored because \mychars does not represent a character token. However, as noted above, once \uppercase has done its work, the temporary token list (created by the action of \uppercase) is fed back into TeX’s full input-processing (scanning) mechanism. When TeX re-reads that token list it detects a token which represents our \mychars macro, which TeX executes (expands) to generate a series of characters, typesetting abcde, still in lowercase, because those characters were “wrapped up” inside a macro and thus invisible to the actions of \uppercase.
Once TeX has re-examined the temporary token list created for \uppercase{...} or \lowercase{...}, and processed any character tokens, it switches to using that temporary token list as its source of input: typesetting characters (processed character tokens) and executing commands and macros.
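A small sketch makes that behaviour visible: mixing bare characters with a macro in the same <material> shows exactly which tokens the case change touches (assuming \mychars is defined as above):

\def\mychars{abcde}
\uppercase{abc\mychars} % a, b, c are character tokens and are uppercased;
                        % the command token \mychars is skipped, then expanded
                        % only when the token list is re-read as input.
% Typesets: ABCabcde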
How can this be fixed?
Because \uppercase{...} or \lowercase{...} will only act upon character tokens, we need a way to “force the unpackaging” of the characters contained in our macro \mychars (or contained in a \toks register) before \uppercase{...} or \lowercase{...} acts on them. By “unpackaging” what we really mean is TeX’s process of expansion:
- replacing a TeX/LaTeX command (e.g., a macro) with the sequence of tokens that make up its definition, or
- producing the sequence of tokens a command is designed to generate. One example of a command that generates tokens is \jobname, which produces a sequence of character tokens representing the name of the main TeX file being processed.
Lower-level magic: scantoks(..., ...)
Here we are really probing into some darker corners of TeX’s inner workings so you can ignore this section unless you enjoy the details…
After TeX detects \uppercase or \lowercase in the input stream, it executes an internal function called scantoks(..., ...) whose job it is to generate the token list from the items between the opening ‘{’ and closing ‘}’; as discussed, that token list is subsequently examined to detect (and adjust) any character tokens so as to alter the character case as required. Note carefully that we are referring to scantoks(..., ...) as an internal function built into the source code of TeX engines; here, it is not the name of a control sequence.
As part of its work, scantoks(..., ...) can be instructed whether or not to expand the token list it is constructing; for \uppercase and \lowercase it does not expand the tokens: it merely creates them and puts them into a token list.
One of the first things that scantoks(..., ...) has to do is check for an opening ‘{’ (or any character of \catcode 1), because it has to ensure the user hasn’t made a syntax error and forgotten it: a character with category code 1 is required to delimit the start of the list of items to be tokenized.
And here’s the trick: the task of looking for an opening ‘{’ triggers scantoks(..., ...) to run TeX’s expansion process, which means that the following examples will work:
\let\ob={
\uppercase\ob abcde}
\def\obb{\ob}
\uppercase\obb xyz}
Taking the example of \obb, a macro: it is recognized as an expandable command and is duly expanded by TeX (via the scantoks(..., ...) function) in its search for an opening brace (any character with category code 1). What this means is that we can use the “\expandafter trick” to achieve our goal of “unpacking” our characters from the confines of our macro, i.e., expanding it. Note that \expandafter also falls into the category of expandable commands, so TeX actions it here and lets it do its work as part of hunting for an opening ‘{’ (or any character with category code 1).
So, if you define:
\toks0={abcde}
\def\mychars{abcde}
And do this:
\uppercase\expandafter{\mychars}
\uppercase\expandafter{\the\toks0}
in both cases you will now see ABCDE typeset because the \expandafter causes “unpackaging” (expansion) of \mychars and \the\toks0: both result in \uppercase seeing a stream of character tokens, which it can process to change the case.
Example 2: \string—more temporary token lists
Internally, TeX classifies \string as one of its so-called “convert” commands: commands performing the operation of “convert to text”. The \string command is designed to convert a token into a human-readable text version, i.e., typeset the string of characters from which that token was originally created.
For example, \string\hello creates a temporary token list which contains the characters \, h, e, l, l, o—yes, even including the initial ‘\’. Once that token list has been created it is then re-read by TeX and the text of the command “\hello” is typeset—yes, including ‘\’ if you choose the correct font…
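As a quick plain TeX sketch (\hello is, as above, just an example name), the typewriter font is the usual choice because it contains a printable backslash glyph, which the default roman text font does not:

{\tt \string\hello} % typesets: \hello
\bye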
You may wonder how/why TeX can typeset the escape character when it is usually used to trigger TeX’s scanner into creating a command token: why doesn’t it do that here? The answer has to do with category codes: usually, a ‘\’ character has catcode 0 (escape character), but when \string generates its internal token list it does something a little different. When it creates its character-token list it assigns category code 12 to all characters apart from the space character, which is assigned catcode 10; recall that character tokens are calculated as 256 × catcode + ASCII value. So, when TeX re-reads (inputs) the temporary token list that \string generated from \hello, TeX does not see an escape character because the token for ‘\’ was calculated with catcode 12, not 0: TeX just treats ‘\’ as a regular character and typesets it.
Strictly speaking, we should probably note that TeX does not actually generate a token for escape characters when it detects them in the input. Once it has recognized a character with category code 0, that character is just used to “trigger” the generation of a control-sequence token: once it has triggered TeX to do that, the escape character has done its work and is no longer considered.
Technical note
A command called \showtokens{...} (introduced by the e-TeX engine) can show token lists (in the log file); a short example follows the quotation below. From the e-TeX manual:
The command \showtokens{<token list>} displays the token list, and allows the display of quantities that cannot be displayed by \show or \showthe, e.g.: \showtokens\expandafter{\jobname}
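For instance, a short sketch (the register number, 42, is arbitrary; any e-TeX based engine, i.e., any modern TeX distribution, supports \showtokens):

\toks42={Hello \TeX}
\showtokens\expandafter{\the\toks42} % \expandafter delivers the register's contents
                                     % to \showtokens, which writes something like
                                     % "> Hello \TeX ." to the log/terminal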
In conclusion
In section 291 of the TeX source code (see page 122 of TeX: The Program) Knuth describes a token list as follows:
“A token list is a singly-linked list of one-word nodes in mem, where each word contains a token and a link. Macro definitions, output-routine definitions, marks, \write texts, and a few other things are remembered by TeX in the form of token lists, usually preceded by a node with a reference count in its “token_ref_count” field.”
On first reading this may not have been easy to understand but, hopefully, it now makes a little more sense.