Lexical Structure

This chapter specifies the lexical structure of Dada programs. A Dada source file is a sequence of Unicode characters, which the lexer converts into a sequence of tokens.

Source Encoding

Dada source files are encoded as UTF-8.

Tokens

The lexer produces a sequence of tokens. A token Token is one of the following kinds:


Token ::= Identifier
          | Keyword
          | Literal
          | Operator
          | Delimiter

Each token records whether it was preceded by whitespace, a newline, or a comment. This information is used by the parser but does not produce separate tokens.
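As an illustration of the rule above, a token can be pictured as a small record that carries its kind, its text, and flags describing what preceded it. The sketch below is not the Dada implementation: the `Token` class and its field names are hypothetical, and storing three separate booleans is an assumption made for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record; names are illustrative, not Dada's own."""
    kind: str  # "Identifier", "Keyword", "Literal", "Operator", or "Delimiter"
    text: str
    # What came before the token. None of these produce tokens themselves.
    preceded_by_whitespace: bool = False
    preceded_by_newline: bool = False
    preceded_by_comment: bool = False


# The input `a +b` yields three tokens; only `+` has whitespace before it.
tokens = [
    Token("Identifier", "a"),
    Token("Operator", "+", preceded_by_whitespace=True),
    Token("Identifier", "b"),
]
```

The parser can then distinguish `a +b` from `a + b` by inspecting the flags, without whitespace ever appearing in the token stream.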

Whitespace and Comments

Whitespace

Whitespace characters (spaces, tabs, and other Unicode whitespace excluding newlines) separate tokens but are otherwise not significant.

Newline characters (\n) are tracked by the lexer. Whether a token is preceded by a newline may affect how the parser interprets certain constructs.

Comments

A comment begins with # and extends to the end of the line.

The content of a comment, including the leading #, is ignored by the lexer. A comment implies a newline for the purpose of preceding-whitespace tracking.
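The whitespace and comment rules above can be sketched as a single routine that skips everything between two tokens and reports what it saw. This is a minimal sketch under stated assumptions: `skip_trivia` is a hypothetical helper, and Python's `str.isspace` is used as a stand-in for "Unicode whitespace", which may not match the exact character set Dada uses.

```python
def skip_trivia(src: str, pos: int) -> tuple[int, bool, bool]:
    """Skip whitespace and comments starting at `pos`.

    Returns (new_pos, saw_whitespace, saw_newline).
    """
    saw_whitespace = False
    saw_newline = False
    while pos < len(src):
        ch = src[pos]
        if ch == "\n":
            # Newlines are tracked separately from other whitespace.
            saw_newline = True
            pos += 1
        elif ch.isspace():
            saw_whitespace = True
            pos += 1
        elif ch == "#":
            # A comment runs to the end of the line and implies a newline.
            while pos < len(src) and src[pos] != "\n":
                pos += 1
            saw_newline = True
        else:
            break
    return pos, saw_whitespace, saw_newline
```

Note that a comment at the very end of the file still sets the newline flag, following the rule that a comment implies a newline.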

Identifier definition

An identifier Identifier begins with a Unicode alphabetic character or underscore (_), followed by zero or more Unicode alphanumeric characters or underscores, provided it is not a keyword Keyword:


Identifier ::= (Alphabetic | _) (Alphanumeric | _)*    (not a Keyword)
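The identifier rule can be checked with a short predicate. The sketch below is an approximation: `is_identifier` is a hypothetical helper, and Python's `str.isalpha` and `str.isalnum` are assumed to be close enough to the Unicode alphabetic and alphanumeric classes the grammar refers to, although they may differ on some code points. The keyword set is passed in as a parameter.

```python
def is_identifier(text: str, keywords: frozenset = frozenset()) -> bool:
    """Return True if `text` has identifier shape and is not a keyword."""
    if not text:
        return False
    first, rest = text[0], text[1:]
    # First character: alphabetic or underscore.
    if not (first.isalpha() or first == "_"):
        return False
    # Remaining characters: alphanumeric or underscore.
    if not all(ch.isalnum() or ch == "_" for ch in rest):
        return False
    return text not in keywords
```

For example, `is_identifier("_x1")` and `is_identifier("café")` both hold, while `is_identifier("1x")` does not because a digit cannot begin an identifier.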

Keyword definition

The following words are reserved as keywords:


Keyword ::= as
            | async
            | await
            | class
            | else
            | enum
            | export
            | false
            | fn
            | give
            | given
            | if
            | is
            | let
            | match
            | mod
            | mut
            | my
            | our
            | perm
            | pub
            | ref
            | return
            | self
            | share
            | shared
            | struct
            | true
            | type
            | unsafe
            | use
            | where
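One common way to apply a reserved-word list is to scan a word with the identifier rule first and then look it up in a set. The sketch below assumes that approach; `KEYWORDS` mirrors the list above, while `classify_word` is a hypothetical helper that performs an exact string comparison.

```python
# The reserved words listed in the Keyword grammar above.
KEYWORDS = frozenset(
    """
    as async await class else enum export false fn give given if is let
    match mod mut my our perm pub ref return self share shared struct
    true type unsafe use where
    """.split()
)


def classify_word(word: str) -> str:
    """Classify an identifier-shaped word as a Keyword or an Identifier."""
    return "Keyword" if word in KEYWORDS else "Identifier"
```

Because the lookup happens on the whole word, `given` is a keyword while `giver` remains an ordinary identifier.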

Operator definition

The following single characters are recognized as operator tokens:


Operator ::= + | - | * | / | % | = | !
           | < | > | & | | | : | , | . | ; | ?

Multi-character operators such as &&, ||, ==, <=, >=, and -> are formed by the parser from adjacent operator tokens.
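To illustrate how multi-character operators might be assembled from single-character tokens, the sketch below pairs up neighbouring operators. It rests on two assumptions that the text does not spell out: "adjacent" is taken to mean that the second token has no preceding whitespace, newline, or comment, and pairing is done greedily from left to right. `combine_operators` and `MULTI_CHAR` are hypothetical names, and a real parser would decide this per grammatical context rather than in one pass.

```python
# Pairs named in the text above; the table is illustrative, not exhaustive.
MULTI_CHAR = {
    ("&", "&"): "&&",
    ("|", "|"): "||",
    ("=", "="): "==",
    ("<", "="): "<=",
    (">", "="): ">=",
    ("-", ">"): "->",
}


def combine_operators(tokens):
    """Combine adjacent operator tokens.

    `tokens` is a list of (text, preceded_by_trivia) pairs.
    """
    out = []
    i = 0
    while i < len(tokens):
        text, _ = tokens[i]
        if i + 1 < len(tokens):
            next_text, gap_before_next = tokens[i + 1]
            if not gap_before_next and (text, next_text) in MULTI_CHAR:
                out.append(MULTI_CHAR[(text, next_text)])
                i += 2
                continue
        out.append(text)
        i += 1
    return out
```

Under these assumptions `->` is read as one operator, whereas `- >` stays as two because the `>` token records preceding whitespace.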

Delimiter definition

A delimited token contains a matched pair of brackets and their contents:


Delimiter ::= ( Token* ) | [ Token* ] | { Token* }

Delimiters must be balanced. An opening delimiter without a matching closing delimiter is an error.

The lexer tracks delimiter nesting. Content between matching delimiters is treated as a unit, which enables deferred parsing of function bodies and other nested structures.
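Delimiter nesting is typically tracked with a stack. The sketch below groups a flat list of symbols into nested units, so that the content of each bracket pair can be handed over as a whole and parsed later. All names are hypothetical, and treating a stray or mismatched closing delimiter as an error is an assumption; the text above only states that an unclosed opening delimiter is an error.

```python
CLOSERS = {"(": ")", "[": "]", "{": "}"}


class UnbalancedDelimiter(Exception):
    """Raised when delimiters do not balance."""


def group_delimiters(symbols):
    """Nest a flat list of symbols into (opener, children) units."""
    root = []
    stack = [(None, root)]  # (expected closer, children of the open group)
    for sym in symbols:
        if sym in CLOSERS:
            children = []
            stack[-1][1].append((sym, children))
            stack.append((CLOSERS[sym], children))
        elif sym in CLOSERS.values():
            expected, _ = stack[-1]
            if sym != expected:
                raise UnbalancedDelimiter(f"unexpected {sym!r}")
            stack.pop()
        else:
            stack[-1][1].append(sym)
    if len(stack) > 1:
        raise UnbalancedDelimiter(f"missing {stack[-1][0]!r}")
    return root
```

For the input `a(b[c])` this produces `["a", ("(", ["b", ("[", ["c"])])]`, where each tuple is one delimited unit.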

Literal definition

IntegerLiteral definition

An integer literal IntegerLiteral is a sequence of one or more ASCII decimal digits (0-9), optionally separated by underscores (_) that do not affect the value:


IntegerLiteral ::= Digit (_? Digit)*
Digit ::= 0 | 1 | ... | 9
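The grammar above translates directly into a regular expression: a digit, followed by any number of digits that may each be preceded by a single underscore. The sketch below validates a literal against that pattern and computes its value by dropping the underscores. `parse_integer_literal` is a hypothetical helper, and the result is an unbounded Python integer, since this section does not specify a range.

```python
import re

# Digit (_? Digit)*, restricted to ASCII digits.
INTEGER_LITERAL = re.compile(r"[0-9](?:_?[0-9])*")


def parse_integer_literal(text: str) -> int:
    """Return the value of an integer literal, ignoring underscores."""
    if INTEGER_LITERAL.fullmatch(text) is None:
        raise ValueError(f"not an integer literal: {text!r}")
    return int(text.replace("_", ""))
```

The pattern accepts `1_000_000` but rejects `_1`, `1_`, and `1__0`, because an underscore may only appear singly between two digits.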

BooleanLiteral definition

The keywords true and false are boolean literals:


BooleanLiteral ::= true | false

StringLiteral definition

Lexical Errors

Characters that do not begin a valid token are accumulated and reported as a single error spanning the invalid sequence.
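The accumulation rule can be illustrated with a simplified character-level scan that merges each run of invalid characters into one span. This is only a sketch: `invalid_spans` is a hypothetical helper, the caller supplies the `begins_token` predicate, and the assumption that whitespace ends an invalid run is not stated in the text above.

```python
def invalid_spans(src: str, begins_token) -> list:
    """Return (start, end) spans, one per run of invalid characters."""
    spans = []
    start = None  # start index of the invalid run in progress, if any
    for i, ch in enumerate(src):
        if begins_token(ch) or ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(src)))
    return spans
```

With a predicate that accepts only word characters, the input `a @@@ b` yields the single span `(2, 5)` rather than three separate errors.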