Lexical Structure
This chapter specifies the lexical structure of Dada programs. A Dada source file is a sequence of Unicode characters, which the lexer converts into a sequence of tokens.
Source Encoding
Dada source files are encoded as UTF-8.
Tokens
The lexer produces a sequence of tokens. A token Token is one of the following kinds:
Token ::= Identifier
| Keyword
| Literal
| Operator
| Delimiter
Each token records whether it was preceded by whitespace, a newline, or a comment. The parser uses this information, but whitespace, newlines, and comments do not themselves produce tokens.
Whitespace and Comments
Whitespace
Whitespace characters (spaces, tabs, and other Unicode whitespace excluding newlines) separate tokens but are otherwise not significant.
Newline characters (\n) are tracked by the lexer. Whether a token is preceded by a newline may affect how the parser interprets certain constructs.
Comments
A comment begins with # and extends to the end of the line.
The content of a comment, including the leading #, is ignored by the lexer.
A comment implies a newline for the purpose of preceding-whitespace tracking.
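The whitespace and comment rules above can be illustrated with a minimal Python sketch. It is not the Dada lexer: the names Preceding and skip_trivia are hypothetical, and keeping the whitespace and newline flags separate is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Preceding:
    """Hypothetical record of what appeared before a token."""
    whitespace: bool = False
    newline: bool = False
    comment: bool = False


def skip_trivia(src: str, pos: int) -> Tuple[int, Preceding]:
    """Skip whitespace and comments from pos; return the new position and flags."""
    pre = Preceding()
    while pos < len(src):
        ch = src[pos]
        if ch == "\n":
            pre.newline = True
            pos += 1
        elif ch.isspace():
            pre.whitespace = True
            pos += 1
        elif ch == "#":
            # A comment runs to the end of the line and implies a newline.
            pre.comment = True
            pre.newline = True
            while pos < len(src) and src[pos] != "\n":
                pos += 1
        else:
            break  # start of the next token
    return pos, pre
```

No token is emitted for the skipped text; only the flags survive, attached to whichever token starts at the returned position.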
Identifier definition
An identifier Identifier begins with a Unicode alphabetic character or underscore (_),
followed by zero or more Unicode alphanumeric characters or underscores,
provided it is not a keyword Keyword:
Identifier ::= (Alphabetic | _) (Alphanumeric | _)* (not a Keyword)
Identifiers are case-sensitive.
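As an illustration of the identifier rule, the following Python sketch checks whether a word is an Identifier. The function name is_identifier is hypothetical, and Python's str.isalpha and str.isalnum are used only as an approximation of "Unicode alphabetic" and "Unicode alphanumeric"; the exact character classes used by the Dada lexer may differ.

```python
# Keywords copied from the Keyword production in this chapter.
KEYWORDS = frozenset({
    "as", "async", "await", "class", "else", "enum", "export", "false",
    "fn", "give", "given", "if", "is", "let", "match", "mod", "mut", "my",
    "our", "perm", "pub", "ref", "return", "self", "share", "shared",
    "struct", "true", "type", "unsafe", "use", "where",
})


def is_identifier(word: str) -> bool:
    """Return True if word matches the Identifier production (approximately)."""
    if not word:
        return False
    first = word[0]
    if not (first.isalpha() or first == "_"):
        return False
    if not all(c.isalnum() or c == "_" for c in word[1:]):
        return False
    # The keyword check is case-sensitive, like identifiers themselves.
    return word not in KEYWORDS
```

Because the comparison is case-sensitive, fn is rejected as a keyword while Fn is accepted as an identifier.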
Keyword definition
The following words are reserved as keywords:
Keyword ::= as
| async
| await
| class
| else
| enum
| export
| false
| fn
| give
| given
| if
| is
| let
| match
| mod
| mut
| my
| our
| perm
| pub
| ref
| return
| self
| share
| shared
| struct
| true
| type
| unsafe
| use
| where
Operator definition
The following single characters are recognized as operator tokens:
Operator ::= + | - | * | / | % | = | !
| < | > | & | | | : | , | . | ; | ?
- .plus +
- .minus -
- .star *
- .slash /
- .percent %
- .equals =
- .bang !
- .less-than <
- .greater-than >
- .ampersand &
- .pipe |
- .colon :
- .comma ,
- .dot .
- .semicolon ;
- .question ?
Multi-character operators such as &&, ||, ==, <=, >=, and ->
are formed by the parser from adjacent operator tokens.
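A minimal Python sketch of how a parser might combine adjacent single-character operator tokens is shown here. It assumes that "adjacent" means the second token was not preceded by whitespace, a newline, or a comment; the function join_operators and the (character, preceded_by_trivia) pair representation are hypothetical.

```python
# Multi-character operators named in this chapter.
MULTI = {"&&", "||", "==", "<=", ">=", "->"}


def join_operators(ops):
    """Combine adjacent operator tokens.

    ops is a list of (char, preceded_by_trivia) pairs, one per operator
    token. Returns a list of operator strings.
    """
    out = []
    i = 0
    while i < len(ops):
        ch, _ = ops[i]
        if i + 1 < len(ops):
            nxt, nxt_preceded = ops[i + 1]
            # Join only when nothing separates the two tokens.
            if not nxt_preceded and ch + nxt in MULTI:
                out.append(ch + nxt)
                i += 2
                continue
        out.append(ch)
        i += 1
    return out
```

Under this assumption, `->` is read as one operator while `- >` remains two separate tokens.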
Delimiter definition
A delimited token contains a matched pair of brackets and their contents:
Delimiter ::= ( Token* ) | [ Token* ] | { Token* }
- .parentheses Parentheses: ( and ).
- .square-brackets Square brackets: [ and ].
- .curly-braces Curly braces: { and }.
Delimiters must be balanced. An opening delimiter without a matching closing delimiter is an error.
The lexer tracks delimiter nesting. Content between matching delimiters is treated as a unit, which enables deferred parsing of function bodies and other nested structures.
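The nesting behavior can be sketched in Python as follows. This is an illustration, not the actual implementation: group and UnbalancedDelimiter are hypothetical names, tokens are plain strings, and treating a stray closing delimiter as an error is an assumption (the text above only mentions unmatched opening delimiters).

```python
CLOSERS = {"(": ")", "[": "]", "{": "}"}


class UnbalancedDelimiter(Exception):
    """Raised when delimiters do not pair up."""


def group(tokens):
    """Nest a flat list of token strings into sublists, one per delimiter pair."""
    stack = [[]]    # groups under construction; stack[0] is the top level
    expected = []   # closing delimiters still awaited, innermost last
    for tok in tokens:
        if tok in CLOSERS:
            stack.append([tok])
            expected.append(CLOSERS[tok])
        elif tok in CLOSERS.values():
            if not expected or expected[-1] != tok:
                raise UnbalancedDelimiter(f"unexpected {tok!r}")
            expected.pop()
            done = stack.pop()
            done.append(tok)
            stack[-1].append(done)  # the finished group becomes one unit
        else:
            stack[-1].append(tok)
    if expected:
        raise UnbalancedDelimiter(f"missing {expected[-1]!r}")
    return stack[0]
```

Each delimited group ends up as a single element of its parent, so a consumer could set a function body aside and parse it later without rescanning for the matching brace.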
Literal definition
A literal Literal is one of the following:
Literal ::= IntegerLiteral
| BooleanLiteral
| StringLiteral
IntegerLiteral definition
An integer literal IntegerLiteral is a sequence of one or more ASCII decimal digits (0–9),
optionally separated by underscores (_) that do not affect the value:
IntegerLiteral ::= Digit (_? Digit)*
Digit ::= 0 | 1 | ... | 9
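The IntegerLiteral grammar translates directly into a regular expression. The Python sketch below is illustrative only; integer_value is a hypothetical helper, and it returns an unbounded Python int, which says nothing about how Dada represents or range-checks integer values.

```python
import re

# Digit (_? Digit)* with ASCII digits only ([0-9] rather than \d,
# because \d also matches non-ASCII Unicode digits).
INTEGER_LITERAL = re.compile(r"[0-9](?:_?[0-9])*")


def integer_value(text: str) -> int:
    """Return the value of an integer literal, ignoring underscores."""
    if not INTEGER_LITERAL.fullmatch(text):
        raise ValueError(f"not an integer literal: {text!r}")
    return int(text.replace("_", ""))
```

The grammar allows at most one underscore between digits, so 1_000 is accepted while 1__000, _1, and 1_ are rejected.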
BooleanLiteral definition
The keywords true and false are boolean literals:
BooleanLiteral ::= true | false
StringLiteral definition
String literal syntax is specified in String Literals.
Lexical Errors
Characters that do not begin a valid token are accumulated and reported as a single error spanning the invalid sequence.
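The accumulation of invalid characters can be sketched in Python as below. This is a simplification: it classifies each character independently through a caller-supplied predicate (the hypothetical can_start_token), whereas a real lexer would consume whole tokens between error runs. Whether whitespace ends an invalid run is also an assumption of this sketch.

```python
def find_invalid_spans(src, can_start_token):
    """Return (start, end) index pairs, one per run of invalid characters.

    can_start_token is a hypothetical predicate that returns True for any
    character the lexer can handle (token starts, whitespace, and so on).
    """
    spans = []
    start = None  # start index of the current invalid run, if any
    for i, ch in enumerate(src):
        if can_start_token(ch):
            if start is not None:
                spans.append((start, i))  # close the run as one error
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(src)))
    return spans
```

With a predicate that accepts only alphanumerics and whitespace, the input `a @@ b $` yields two spans, so the two adjacent @ characters are reported as a single error rather than one error each.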