🔤 Rust Lexical Structure

📋 Token Types Quick Reference

Keywords

fn, let, mut, if, else, match
Reserved language constructs

Identifiers

variable_name, FunctionName
Names for variables, functions, types

Literals

42, "hello", 'c', true
Direct values in source code

Operators

+, -, *, ==, &&, |
Mathematical and logical operations

Punctuation

{ } ( ) [ ] ; , . ::
Structure and grouping symbols

Comments

// line /* block */ /// doc
Code documentation and notes

🏗️ What is Lexical Structure?

Lexical structure defines how Rust source code is broken down into individual tokens - the smallest meaningful units of the language. Understanding lexical structure is fundamental to reading and writing Rust code effectively.

Token Classification
Input Text
Lexical Analysis
Tokens
Parser
AST
Identifier Grammar
XID_Start
XID_Continue*
|
_
XID_Continue+

⚙️ How Tokenization Works

Source Code
fn main() {
    let x = 42;
    println!("Hello");
}
Tokens Produced
KEYWORD(fn)
IDENT(main)
PUNCT(()
PUNCT())
PUNCT({)
KEYWORD(let)
IDENT(x)
PUNCT(=)
LITERAL(42)
PUNCT(;)
IDENT(println!)
PUNCT(()
LITERAL("Hello")
PUNCT())
PUNCT(;)
PUNCT(})

🏷️ Identifier Rules

✅ Valid Identifiers
// Basic identifiers
let variable_name = 42;
let CamelCase = "hello";
let _private = true;
let __internal = 0;

// Unicode identifiers
let 变量 = "Chinese";
let переменная = "Russian";
let μ = 3.14159;

// Raw identifiers
let r#fn = "function";
let r#match = "pattern";
❌ Invalid Identifiers
// Cannot start with numbers
// let 2fast = "error";

// Cannot use keywords without r#
// let fn = "error";
// let match = "error";

// Cannot use reserved symbols
// let @ = "error";
// let # = "error";

📝 Literal Types

Integer Literals
42, 0xFF, 0o755, 0b1010
Decimal, hex, octal, binary
Float Literals
3.14, 2.5e10, 1.0f32
Scientific notation supported
String Literals
"hello", r"raw", br"bytes"
UTF-8 strings with escapes
Character Literals
'a', '\n', '\u{41}', b'A'
Single Unicode code points
Boolean Literals
true, false
Built-in boolean values

💬 Comment Types

Comment Syntax
// Line comment

/* Block comment */

/*
 * Multi-line 
 * block comment
 */

/// Outer doc comment
//! Inner doc comment

/** Block doc comment */

/*! Inner block doc */
Usage Examples
/// Calculates factorial
/// 
/// # Examples
/// ```
/// assert_eq!(factorial(5), 120);
/// ```
fn factorial(n: u32) -> u32 {
    // Base case
    if n <= 1 {
        1
    } else {
        /* Recursive case:
           n! = n * (n-1)! */
        n * factorial(n - 1)
    }
}

🗂️ Token Categories

Keywords and Reserved Words +

Rust has strict keywordsWords that are always reserved and cannot be used as identifiers that cannot be used as identifiers, and weak keywordsWords that are contextually reserved in specific positions that are contextually reserved.

Strict Keywords

as
let x = y as i32;
Type casting
break
break 'outer;
Exit loops early
const
const MAX: u32 = 100;
Compile-time constants
continue
continue 'outer;
Skip to next iteration
crate
use crate::module;
Current crate root
else
if x { } else { }
Alternative branch
enum
enum Color { Red }
Sum types
extern
extern "C" fn foo();
Foreign functions
false
let x = false;
Boolean literal
fn
fn hello() {}
Function definition
for
for item in iter {}
Iterator loops
if
if condition {}
Conditional execution
impl
impl Trait for Type
Implementation blocks
in
for x in iter {}
Iterator binding
let
let x = 42;
Variable binding
loop
loop { break; }
Infinite loops
match
match x { _ => {} }
Pattern matching
mod
mod utils;
Module definition
move
move || x
Move closures
mut
let mut x = 42;
Mutable binding
pub
pub fn hello() {}
Public visibility
ref
let ref x = value;
Reference binding
return
return value;
Early return
self
fn method(&self) {}
Method receiver
Self
impl Self {}
Implementing type
static
static X: i32 = 0;
Static variables
struct
struct Point { x: f64 }
Product types
super
use super::module;
Parent module
trait
trait Display {}
Shared behavior
true
let x = true;
Boolean literal
type
type Result<T> = ...;
Type aliases
unsafe
unsafe { ... }
Unsafe operations
use
use std::collections;
Bring into scope
where
where T: Clone
Additional constraints
while
while condition {}
Conditional loops
Operators and Punctuation +

Arithmetic Operators

+
a + b
Addition
-
a - b
Subtraction
*
a * b
Multiplication
/
a / b
Division
%
a % b
Remainder

Comparison Operators

==
a == b
Equality
!=
a != b
Inequality
<
a < b
Less than
>
a > b
Greater than
<=
a <= b
Less or equal
>=
a >= b
Greater or equal

Logical Operators

&&
a && b
Logical AND
||
a || b
Logical OR
!
!a
Logical NOT

Assignment Operators

=
a = b
Assignment
+=
a += b
Add assign
-=
a -= b
Subtract assign
*=
a *= b
Multiply assign
/=
a /= b
Divide assign
%=
a %= b
Remainder assign
String and Character Literals +

String Literal Types

String Literals
// Basic string literal
"Hello, world!"

// Multi-line strings
"This is a \
long string"

// Escape sequences
"Line one\nLine two\tTabbed"

// Unicode escapes  
"Unicode: \u{1F980} \u{03B1}"

// Raw strings (no escapes)
r"C:\Users\Name\file.txt"
r#"String with "quotes" inside"#

// Byte strings
b"Hello"
br"Raw byte string"
Character Literals
// Basic characters
'a'
'Z'
'5'

// Special characters
' '  // space
'\n' // newline
'\t' // tab
'\r' // carriage return
'\'' // single quote
'\"' // double quote
'\\' // backslash

// Unicode characters
'α'
'🦀'
'\u{1F980}' // crab emoji

// Byte literals
b'A'
b'\n'
✅ Escape Sequences Reference
Escape Meaning Example
\\n Newline (LF) "Line 1\\nLine 2"
\\r Carriage return (CR) "Windows\\r\\n"
\\t Tab "Col1\\tCol2"
\\\\ Backslash "C:\\\\Users"
\\' Single quote '\\''
\\" Double quote "Say \\"Hello\\""
\\u{...} Unicode code point "\\u{1F980}"

🤖 For AI Coding Agents

{
  "rust_lexical_structure": {
    "tokenization_rules": {
      "whitespace": {
        "ignored": ["space", "tab", "newline", "carriage_return"],
        "significant": "Only for token separation"
      },
      "identifiers": {
        "pattern": "(XID_Start | _) (XID_Continue)*",
        "raw_form": "r#identifier_name",
        "unicode_support": true,
        "case_sensitive": true
      },
      "keywords": {
        "strict": [
          "as", "break", "const", "continue", "crate", "else", "enum",
          "extern", "false", "fn", "for", "if", "impl", "in", "let",
          "loop", "match", "mod", "move", "mut", "pub", "ref", "return",
          "self", "Self", "static", "struct", "super", "trait", "true",
          "type", "unsafe", "use", "where", "while"
        ],
        "weak": ["union", "dyn"],
        "reserved": ["abstract", "become", "box", "do", "final", "macro", "override", "priv", "typeof", "unsized", "virtual", "yield"]
      },
      "literals": {
        "integer": {
          "decimal": "123, 123_456",
          "hexadecimal": "0xFF, 0xff",
          "octal": "0o755",
          "binary": "0b1010_1111",
          "suffixes": ["u8", "u16", "u32", "u64", "u128", "usize", "i8", "i16", "i32", "i64", "i128", "isize"]
        },
        "float": {
          "format": "123.45, 1.23e10, 1.23E-10",
          "suffixes": ["f32", "f64"]
        },
        "string": {
          "basic": "\"hello world\"",
          "raw": "r\"no\\escapes\"",
          "raw_with_hashes": "r#\"can contain \"quotes\"\"#",
          "byte_string": "b\"bytes only\"",
          "raw_byte_string": "br\"raw bytes\""
        },
        "character": {
          "basic": "'a', 'Z', '5'",
          "escaped": "'\\n', '\\t', '\\''",
          "unicode": "'α', '\\u{1F980}'",
          "byte": "b'A'"
        },
        "boolean": ["true", "false"]
      },
      "operators": {
        "arithmetic": ["+", "-", "*", "/", "%"],
        "comparison": ["==", "!=", "<", ">", "<=", ">="],
        "logical": ["&&", "||", "!"],
        "bitwise": ["&", "|", "^", "<<", ">>"],
        "assignment": ["=", "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=", "<<=", ">>="],
        "special": ["->", "=>", "::", ".", "..", "...", "?", "@"]
      },
      "punctuation": {
        "brackets": ["(", ")", "[", "]", "{", "}"],
        "separators": [";", ",", ":"],
        "reference": ["&", "*"]
      },
      "comments": {
        "line": "// comment text",
        "block": "/* comment text */",
        "doc_outer": "/// documentation",
        "doc_inner": "//! inner doc",
        "doc_block_outer": "/** documentation */",
        "doc_block_inner": "/*! inner doc */"
      }
    },
    "parsing_context": {
      "token_precedence": "Operators have precedence rules for parsing",
      "contextual_keywords": "Some keywords are only reserved in specific contexts",
      "path_resolution": ":: is used for absolute and relative path resolution",
      "macro_invocation": "! suffix indicates macro calls"
    }
  }
}

🔍 Advanced Tokenization

Numeric Literals in Detail +

Integer Literal Formats

Format Examples
// Decimal (base 10)
42
1_000_000    // underscores for readability

// Hexadecimal (base 16)  
0xFF
0xDEAD_BEEF

// Octal (base 8)
0o755
0o644

// Binary (base 2)
0b1010_1111
0b1111_0000_1010_0101

// With type suffixes
42u32        // unsigned 32-bit
-42i64       // signed 64-bit
255u8        // byte value
Float Literals
// Basic float format
3.14159
2.5

// Scientific notation
1.23e10      // 1.23 × 10^10
6.022e23     // Avogadro's number
1.602e-19    // Electron charge

// Type suffixes
3.14f32      // 32-bit float
2.71828f64   // 64-bit float

// Edge cases
1.          // 1.0
.5          // Not valid in Rust!
1e10        // Scientific without decimal
Whitespace and Layout +

Rust is not indentation-sensitive like Python, but proper formatting improves readability.

✅ Whitespace Rules
// These are equivalent:
fn main(){let x=42;println!("{}",x);}

fn main() {
    let x = 42;
    println!("{}", x);
}

// Whitespace separates tokens
letx = 42;     // Error: 'letx' is not 'let x'
let x = 42;    // Correct

// Line continuation with backslash
let message = "This is a very \
               long string literal";

// Multi-line expressions
let result = some_long_function_name(
    argument_one,
    argument_two,
    argument_three,
);
Raw Identifiers

Use r# prefix to use keywords as identifiers:

// Useful for FFI or when interfacing with other languages
let r#type = "string";
let r#match = "pattern";
let r#fn = "function";

// But not needed for contextual keywords
let union = "allowed"; // 'union' is only reserved in specific contexts
Unicode Support

Rust fully supports Unicode in identifiers and strings:

// Variables can use any Unicode script
let 变量 = "Chinese variable";
let переменная = "Russian variable";
let μ = 3.14159;  // Greek letter mu
let 🦀 = "crab";   // Even emojis work (though not recommended)

// String literals support full Unicode
let emoji = "🦀🚀✨";
let math = "α² + β² = γ²";
Macro Invocation Syntax

Macros are distinguished by the ! suffix:

println!("This is a macro");     // Macro call
vec![1, 2, 3]                     // Macro call
format!("Hello {}", name)         // Macro call

// vs regular function calls
std::process::exit(0);            // Function call
String::from("hello");            // Function call
Lifetime Syntax

Lifetimes use apostrophe syntax but are not character literals:

// These are lifetime parameters, not character literals
fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() > y.len() { x } else { y }
}

// Named lifetimes for clarity
'outer: loop {
    'inner: loop {
        break 'outer; // Break the outer loop
    }
}
Tokenization Edge Cases +

Tricky Parsing Situations

❌ Common Tokenization Gotchas
// Floating point without leading zero
// let x = .5;  // Error! Must be 0.5

// Greater than vs right shift
Vec<Vec<i32>>    // Valid in modern Rust
// Vec<Vec<i32>>; // Used to require space: Vec<Vec<i32> >;

// Minus vs negative literal
let x = -1;      // Unary minus operator applied to literal 1
let y = - 1;     // Same thing with space
let z = 5-1;     // Binary minus: 5 - 1

// Path vs less than
Vec<String>      // Generic type parameter
if x < y { }     // Less than comparison
Documentation Comments +

Documentation Comment Types

Outer Documentation
/// This function adds two numbers together.
/// 
/// # Examples
/// 
/// ```
/// assert_eq!(add(2, 3), 5);
/// ```
/// 
/// # Panics
/// 
/// This function never panics.
fn add(a: i32, b: i32) -> i32 {
    a + b
}

/** Block-style documentation
 * Can span multiple lines
 * Often used for longer docs
 */
struct MyStruct {
    field: i32,
}
Inner Documentation
mod my_module {
    //! This is module-level documentation.
    //! 
    //! It documents the containing item (the module)
    //! rather than the following item.
    
    /*! Block-style inner documentation
        for the containing module.
        
        Used at the top of files or modules.
    */
    
    pub fn helper() {}
}

fn main() {
    //! This documents main() itself
    //! (though not commonly used)
}