🔤 Rust Lexical Structure

📋 Token Types Quick Reference

Keywords

fn, let, mut, if, else, match

Reserved language constructs

Identifiers

variable_name, FunctionName

Names for variables, functions, types

Literals

42, "hello", 'c', true

Direct values in source code

Operators

+, -, *, ==, &&, |

Mathematical and logical operations

Punctuation

{ } ( ) [ ] ; , . ::

Structure and grouping symbols

Comments

// line /* block */ /// doc

Code documentation and notes

🏗️ What is Lexical Structure?

Lexical structure defines how Rust source code is broken down into individual tokens - the smallest meaningful units of the language. Understanding lexical structure is fundamental to reading and writing Rust code effectively.

Token Classification

Input Text

Lexical Analysis

Tokens

Parser

AST

Identifier Grammar

XID_Start

XID_Continue*

XID_Continue+

⚙️ How Tokenization Works

Source Code

                    fn main() {
    let x = 42;
    println!("Hello");
}
                  

Tokens Produced

                    KEYWORD(fn)
IDENT(main)
PUNCT(()
PUNCT())
PUNCT({)
KEYWORD(let)
IDENT(x)
PUNCT(=)
LITERAL(42)
PUNCT(;)
IDENT(println!)
PUNCT(()
LITERAL("Hello")
PUNCT())
PUNCT(;)
PUNCT(})
                  

🏷️ Identifier Rules

✅ Valid Identifiers

                  // Basic identifiers
let variable_name = 42;
let CamelCase = "hello";
let _private = true;
let __internal = 0;

// Unicode identifiers
let 变量 = "Chinese";
let переменная = "Russian";
let μ = 3.14159;

// Raw identifiers
let r#fn = "function";
let r#match = "pattern";
                

❌ Invalid Identifiers

                  // Cannot start with numbers
// let 2fast = "error";

// Cannot use keywords without r#
// let fn = "error";
// let match = "error";

// Cannot use reserved symbols
// let @ = "error";
// let # = "error";
                

📝 Literal Types

Integer Literals

42, 0xFF, 0o755, 0b1010

Decimal, hex, octal, binary

Float Literals

3.14, 2.5e10, 1.0f32

Scientific notation supported

String Literals

"hello", r"raw", br"bytes"

UTF-8 strings with escapes

Character Literals

'a', '\n', '\u{41}', b'A'

Single Unicode code points

Boolean Literals

true, false

Built-in boolean values

💬 Comment Types

Comment Syntax

                    // Line comment

/* Block comment */

/*
 * Multi-line 
 * block comment
 */

/// Outer doc comment
//! Inner doc comment

/** Block doc comment */

/*! Inner block doc */
                  

Usage Examples

                    /// Calculates factorial
/// 
/// # Examples
/// ```
/// assert_eq!(factorial(5), 120);
/// ```
fn factorial(n: u32) -> u32 {
    // Base case
    if n <= 1 {
        1
    } else {
        /* Recursive case:
           n! = n * (n-1)! */
        n * factorial(n - 1)
    }
}
                  

🗂️ Token Categories

Keywords and Reserved Words +

Rust has strict keywordsWords that are always reserved and cannot be used as identifiers that cannot be used as identifiers, and weak keywordsWords that are contextually reserved in specific positions that are contextually reserved.

Strict Keywords

let x = y as i32;

Type casting

break

break 'outer;

Exit loops early

const

const MAX: u32 = 100;

Compile-time constants

continue

continue 'outer;

Skip to next iteration

crate

use crate::module;

Current crate root

else

if x { } else { }

Alternative branch

enum

enum Color { Red }

Sum types

extern

extern "C" fn foo();

Foreign functions

false

let x = false;

Boolean literal

fn hello() {}

Function definition

for

for item in iter {}

Iterator loops

if condition {}

Conditional execution

impl

impl Trait for Type

Implementation blocks

for x in iter {}

Iterator binding

let

let x = 42;

Variable binding

loop

loop { break; }

Infinite loops

match

match x { _ => {} }

Pattern matching

mod

mod utils;

Module definition

move

move || x

Move closures

mut

let mut x = 42;

Mutable binding

pub

pub fn hello() {}

Public visibility

ref

let ref x = value;

Reference binding

return

return value;

Early return

self

fn method(&self) {}

Method receiver

Self

impl Self {}

Implementing type

static

static X: i32 = 0;

Static variables

struct

struct Point { x: f64 }

Product types

super

use super::module;

Parent module

trait

trait Display {}

Shared behavior

true

let x = true;

Boolean literal

type

type Result<T> = ...;

Type aliases

unsafe

unsafe { ... }

Unsafe operations

use

use std::collections;

Bring into scope

where

where T: Clone

Additional constraints

while

while condition {}

Conditional loops

Operators and Punctuation +

Arithmetic Operators

a + b

Addition

a - b

Subtraction

a * b

Multiplication

a / b

Division

a % b

Remainder

Comparison Operators

a == b

Equality

a != b

Inequality

a < b

Less than

a > b

Greater than

a <= b

Less or equal

a >= b

Greater or equal

Logical Operators

a && b

Logical AND

a || b

Logical OR

Logical NOT

Assignment Operators

a = b

Assignment

a += b

Add assign

a -= b

Subtract assign

a *= b

Multiply assign

a /= b

Divide assign

a %= b

Remainder assign

String and Character Literals +

String Literal Types

String Literals

                    // Basic string literal
"Hello, world!"

// Multi-line strings
"This is a \
long string"

// Escape sequences
"Line one\nLine two\tTabbed"

// Unicode escapes  
"Unicode: \u{1F980} \u{03B1}"

// Raw strings (no escapes)
r"C:\Users\Name\file.txt"
r#"String with "quotes" inside"#

// Byte strings
b"Hello"
br"Raw byte string"
                  

Character Literals

                    // Basic characters
'a'
'Z'
'5'

// Special characters
' '  // space
'\n' // newline
'\t' // tab
'\r' // carriage return
'\'' // single quote
'\"' // double quote
'\\' // backslash

// Unicode characters
'α'
'🦀'
'\u{1F980}' // crab emoji

// Byte literals
b'A'
b'\n'
                  

✅ Escape Sequences Reference

Escape	Meaning	Example
\\n	Newline (LF)	"Line 1\\nLine 2"
\\r	Carriage return (CR)	"Windows\\r\\n"
\\t	Tab	"Col1\\tCol2"
\\\\	Backslash	"C:\\\\Users"
\\'	Single quote	'\\''
\\"	Double quote	"Say \\"Hello\\""
\\u{...}	Unicode code point	"\\u{1F980}"

🤖 For AI Coding Agents

{
  "rust_lexical_structure": {
    "tokenization_rules": {
      "whitespace": {
        "ignored": ["space", "tab", "newline", "carriage_return"],
        "significant": "Only for token separation"
      },
      "identifiers": {
        "pattern": "(XID_Start | _) (XID_Continue)*",
        "raw_form": "r#identifier_name",
        "unicode_support": true,
        "case_sensitive": true
      },
      "keywords": {
        "strict": [
          "as", "break", "const", "continue", "crate", "else", "enum",
          "extern", "false", "fn", "for", "if", "impl", "in", "let",
          "loop", "match", "mod", "move", "mut", "pub", "ref", "return",
          "self", "Self", "static", "struct", "super", "trait", "true",
          "type", "unsafe", "use", "where", "while"
        ],
        "weak": ["union", "dyn"],
        "reserved": ["abstract", "become", "box", "do", "final", "macro", "override", "priv", "typeof", "unsized", "virtual", "yield"]
      },
      "literals": {
        "integer": {
          "decimal": "123, 123_456",
          "hexadecimal": "0xFF, 0xff",
          "octal": "0o755",
          "binary": "0b1010_1111",
          "suffixes": ["u8", "u16", "u32", "u64", "u128", "usize", "i8", "i16", "i32", "i64", "i128", "isize"]
        },
        "float": {
          "format": "123.45, 1.23e10, 1.23E-10",
          "suffixes": ["f32", "f64"]
        },
        "string": {
          "basic": "\"hello world\"",
          "raw": "r\"no\\escapes\"",
          "raw_with_hashes": "r#\"can contain \"quotes\"\"#",
          "byte_string": "b\"bytes only\"",
          "raw_byte_string": "br\"raw bytes\""
        },
        "character": {
          "basic": "'a', 'Z', '5'",
          "escaped": "'\\n', '\\t', '\\''",
          "unicode": "'α', '\\u{1F980}'",
          "byte": "b'A'"
        },
        "boolean": ["true", "false"]
      },
      "operators": {
        "arithmetic": ["+", "-", "*", "/", "%"],
        "comparison": ["==", "!=", "<", ">", "<=", ">="],
        "logical": ["&&", "||", "!"],
        "bitwise": ["&", "|", "^", "<<", ">>"],
        "assignment": ["=", "+=", "-=", "*=", "/=", "%=", "&=", "|=", "^=", "<<=", ">>="],
        "special": ["->", "=>", "::", ".", "..", "...", "?", "@"]
      },
      "punctuation": {
        "brackets": ["(", ")", "[", "]", "{", "}"],
        "separators": [";", ",", ":"],
        "reference": ["&", "*"]
      },
      "comments": {
        "line": "// comment text",
        "block": "/* comment text */",
        "doc_outer": "/// documentation",
        "doc_inner": "//! inner doc",
        "doc_block_outer": "/** documentation */",
        "doc_block_inner": "/*! inner doc */"
      }
    },
    "parsing_context": {
      "token_precedence": "Operators have precedence rules for parsing",
      "contextual_keywords": "Some keywords are only reserved in specific contexts",
      "path_resolution": ":: is used for absolute and relative path resolution",
      "macro_invocation": "! suffix indicates macro calls"
    }
  }
}

🔍 Advanced Tokenization

Numeric Literals in Detail +

Integer Literal Formats

Format Examples

                    // Decimal (base 10)
42
1_000_000    // underscores for readability

// Hexadecimal (base 16)  
0xFF
0xDEAD_BEEF

// Octal (base 8)
0o755
0o644

// Binary (base 2)
0b1010_1111
0b1111_0000_1010_0101

// With type suffixes
42u32        // unsigned 32-bit
-42i64       // signed 64-bit
255u8        // byte value
                  

Float Literals

                    // Basic float format
3.14159
2.5

// Scientific notation
1.23e10      // 1.23 × 10^10
6.022e23     // Avogadro's number
1.602e-19    // Electron charge

// Type suffixes
3.14f32      // 32-bit float
2.71828f64   // 64-bit float

// Edge cases
1.          // 1.0
.5          // Not valid in Rust!
1e10        // Scientific without decimal
                  

Whitespace and Layout +

Rust is not indentation-sensitive like Python, but proper formatting improves readability.

✅ Whitespace Rules

                  // These are equivalent:
fn main(){let x=42;println!("{}",x);}

fn main() {
    let x = 42;
    println!("{}", x);
}

// Whitespace separates tokens
letx = 42;     // Error: 'letx' is not 'let x'
let x = 42;    // Correct

// Line continuation with backslash
let message = "This is a very \
               long string literal";

// Multi-line expressions
let result = some_long_function_name(
    argument_one,
    argument_two,
    argument_three,
);
                

Raw Identifiers

Use r# prefix to use keywords as identifiers:

              // Useful for FFI or when interfacing with other languages
let r#type = "string";
let r#match = "pattern";
let r#fn = "function";

// But not needed for contextual keywords
let union = "allowed"; // 'union' is only reserved in specific contexts
            

Unicode Support

Rust fully supports Unicode in identifiers and strings:

              // Variables can use any Unicode script
let 变量 = "Chinese variable";
let переменная = "Russian variable";
let μ = 3.14159;  // Greek letter mu
let 🦀 = "crab";   // Even emojis work (though not recommended)

// String literals support full Unicode
let emoji = "🦀🚀✨";
let math = "α² + β² = γ²";
            

Macro Invocation Syntax

Macros are distinguished by the ! suffix:

              println!("This is a macro");     // Macro call
vec![1, 2, 3]                     // Macro call
format!("Hello {}", name)         // Macro call

// vs regular function calls
std::process::exit(0);            // Function call
String::from("hello");            // Function call
            

Lifetime Syntax

Lifetimes use apostrophe syntax but are not character literals:

              // These are lifetime parameters, not character literals
fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() > y.len() { x } else { y }
}

// Named lifetimes for clarity
'outer: loop {
    'inner: loop {
        break 'outer; // Break the outer loop
    }
}
            

Tokenization Edge Cases +

Tricky Parsing Situations

❌ Common Tokenization Gotchas

                  // Floating point without leading zero
// let x = .5;  // Error! Must be 0.5

// Greater than vs right shift
Vec<Vec<i32>>    // Valid in modern Rust
// Vec<Vec<i32>>; // Used to require space: Vec<Vec<i32> >;

// Minus vs negative literal
let x = -1;      // Unary minus operator applied to literal 1
let y = - 1;     // Same thing with space
let z = 5-1;     // Binary minus: 5 - 1

// Path vs less than
Vec<String>      // Generic type parameter
if x < y { }     // Less than comparison
                

Documentation Comments +

Documentation Comment Types

Outer Documentation

                    /// This function adds two numbers together.
/// 
/// # Examples
/// 
/// ```
/// assert_eq!(add(2, 3), 5);
/// ```
/// 
/// # Panics
/// 
/// This function never panics.
fn add(a: i32, b: i32) -> i32 {
    a + b
}

/** Block-style documentation
 * Can span multiple lines
 * Often used for longer docs
 */
struct MyStruct {
    field: i32,
}
                  

Inner Documentation

                    mod my_module {
    //! This is module-level documentation.
    //! 
    //! It documents the containing item (the module)
    //! rather than the following item.
    
    /*! Block-style inner documentation
        for the containing module.
        
        Used at the top of files or modules.
    */
    
    pub fn helper() {}
}

fn main() {
    //! This documents main() itself
    //! (though not commonly used)
}