# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Rules
- Always run `dart test` and `dart analyze` before deciding that your solution is complete. A solution is complete only when all tests pass and no blocking issues remain.

## Development Commands

### Testing
- `dart test` - Run all 118+ language tests covering 85+ languages
- `dart test -n "pattern"` - Run specific tests (e.g., `dart test -n "Thai"` or `dart test -n "Config"`)
- `dart test --chain-stack-traces` - Run tests with detailed error output for debugging

### Code Quality
- `dart analyze` - Run static analysis and linting (uses `package:lints/recommended.yaml`)
- `dart doc` - Generate API documentation (outputs to `doc/api/`)

### Development
- `dart run example/word_count_example.dart` - Run usage examples showing basic functionality
- `dart run debug.dart` - Run an ad hoc debug script, if one has been created for testing specific cases

## Architecture Overview

This is a Dart library that provides multi-language word counting functionality, converted from a JavaScript implementation. The library supports 85+ languages across major writing systems including CJK (Chinese, Japanese, Korean), European, South Asian, African, and Middle Eastern languages.

### Core Structure

**Main Library Entry Point:**
- `lib/word_count.dart` - Public API that exports all functionality from the base implementation

**Core Implementation:**
- `lib/src/word_count_base.dart` - Contains all the word counting logic and data structures

### Key Components

**Configuration System:**
- `WordCountConfig` class with three options:
  - `punctuationAsBreaker` - Treat punctuation as a word separator instead of removing it
  - `disableDefaultPunctuation` - Use only custom punctuation list
  - `punctuation` - Custom punctuation characters to add/use
- `defaultPunctuation` constant - Extensive list of punctuation from multiple languages
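
One way to read the interaction between `disableDefaultPunctuation` and `punctuation` is sketched below. This is a hypothetical illustration, not the library's code: `sampleDefaults` and `effectivePunctuation` are invented names, and the three-character list only stands in for the much longer `defaultPunctuation` constant.

```dart
// Hypothetical sketch of how the options might combine. Not the library's code.

/// Tiny stand-in for the much longer `defaultPunctuation` list.
const sampleDefaults = [',', '.', '!'];

/// Builds the punctuation list that would apply for a given configuration.
/// The parameter names mirror the `WordCountConfig` options described above.
List<String> effectivePunctuation({
  bool disableDefaultPunctuation = false,
  List<String> punctuation = const [],
}) {
  return [
    // Defaults are included unless explicitly disabled.
    if (!disableDefaultPunctuation) ...sampleDefaults,
    // Custom characters are always included.
    ...punctuation,
  ];
}

void main() {
  // Custom characters are added to the defaults.
  print(effectivePunctuation(punctuation: ['#']));
  // Only the custom list is used.
  print(effectivePunctuation(
    disableDefaultPunctuation: true,
    punctuation: ['#'],
  )); // prints [#]
}
```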

**Result Types:**
- `WordCountResult` class - Structured result containing both the `words` list and the `count`
- `emptyResult` constant - Predefined empty result for null/empty inputs

**Public Functions:**
- `wordsCount(text, config)` - Returns the word count only (`int`)
- `wordsSplit(text, config)` - Returns the word list only (`List<String>`)
- `wordsDetect(text, config)` - Returns the full result with both count and words (`WordCountResult`)

### Multi-Language Processing Logic

The core algorithm in `wordsDetect()` handles different writing systems through:

1. **Punctuation Processing** - Configurable removal or replacement with spaces
2. **Symbol Cleaning** - Removes characters in the Unicode ranges `\uFF00-\uFFEF` (Halfwidth and Fullwidth Forms) and `\u2000-\u206F` (General Punctuation)
3. **Whitespace Normalization** - Collapses multiple spaces and splits on whitespace
4. **Multi-Script RegExp Matching** - Uses Unicode ranges for:
   - Latin/Cyrillic/Malayalam letters and numbers
   - CJK characters (Chinese Hànzì, Japanese Kanji/Hiragana/Katakana, Korean Hanja/Hangul)

   CJK text is tokenized one character at a time, while European languages are split into space-separated words.
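
The four steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in `lib/src/word_count_base.dart`: `sketchSplit`, `samplePunctuation`, and `_token` are hypothetical names, the punctuation list is a tiny sample, the CJK ranges cover only Han, Hiragana, and Katakana, and the non-CJK branch matches any run of non-space characters instead of the library's specific letter ranges.

```dart
// Simplified, hypothetical sketch of the pipeline. Not the library's code.

/// Small illustrative sample; the real `defaultPunctuation` list is far longer.
const samplePunctuation = [',', '.', '!', '?', ';', ':', '。', '、'];

/// Matches either a single CJK character or a run of non-CJK, non-space
/// characters. The ranges are an assumed subset (Han, Hiragana, Katakana).
final _token = RegExp(
  r'[\u4E00-\u9FFF\u3040-\u309F\u30A0-\u30FF]'
  r'|[^\s\u4E00-\u9FFF\u3040-\u309F\u30A0-\u30FF]+',
);

List<String> sketchSplit(String text, {bool punctuationAsBreaker = false}) {
  var cleaned = text;

  // 1. Punctuation: replace with a space (breaker) or remove outright.
  for (final mark in samplePunctuation) {
    cleaned = cleaned.replaceAll(mark, punctuationAsBreaker ? ' ' : '');
  }

  // 2. Symbol cleaning: drop characters in the two ranges named above.
  cleaned = cleaned.replaceAll(RegExp(r'[\uFF00-\uFFEF\u2000-\u206F]'), '');

  // 3. Whitespace normalization: collapse runs of whitespace and trim.
  cleaned = cleaned.replaceAll(RegExp(r'\s+'), ' ').trim();

  // 4. Tokenize: one token per CJK character, one per non-CJK run.
  return _token.allMatches(cleaned).map((m) => m.group(0)!).toList();
}

void main() {
  print(sketchSplit('Hello, world! 你好世界')); // 6 tokens
  print(sketchSplit('a,b')); // punctuation removed: 1 token
  print(sketchSplit('a,b', punctuationAsBreaker: true)); // 2 tokens
}
```

The last two calls show why `punctuationAsBreaker` matters: removing the comma joins `a` and `b` into one token, while replacing it with a space keeps them separate.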

### Test Architecture

**Test Coverage:**
- 118+ comprehensive tests covering 85+ languages and all configuration options
- Test groups organized by complexity: Simple → Basic → Config → Special → Split/Detector

**Language Test Categories:**
- **Simple Group** - Basic multilingual examples (English, Chinese, mixed)
- **Basic Group** - One test per supported language with expected word counts
- **Config Group** - All punctuation handling configurations
- **Special Group** - Edge cases like repeated punctuation
- **Split/Detector Group** - Tests for `wordsSplit()` and `wordsDetect()` functions

**Test Utilities:**
- `wordsAreExpected(List<String>, List<String>)` - Helper function for comparing word lists
- Language-specific expected counts account for writing system differences (e.g., Thai tokenization)
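
A comparison helper is needed because `==` on Dart lists checks identity, not contents. The sketch below shows one plausible shape for such a helper; `wordsAreExpectedSketch` is a hypothetical stand-in, and the `bool` return type and order-sensitive comparison are assumptions, not details confirmed by the test suite.

```dart
// Hypothetical stand-in for the test helper. The return type and the
// order-sensitive semantics are assumptions, not the repository's code.

/// Returns true when both lists hold the same words in the same order.
bool wordsAreExpectedSketch(List<String> actual, List<String> expected) {
  if (actual.length != expected.length) return false;
  for (var i = 0; i < actual.length; i++) {
    if (actual[i] != expected[i]) return false;
  }
  return true;
}

void main() {
  final a = ['hello', 'world'];
  final b = ['hello', 'world'];

  // Two distinct list objects are never `==`, even with equal contents.
  print(a == b); // prints false

  // Element-wise comparison gives the intended answer.
  print(wordsAreExpectedSketch(a, b)); // prints true
}
```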

### Development Notes

**When Adding New Languages:**
- Add test cases to the "Basic" group following existing pattern
- Verify that the expected word count matches the language's tokenization behavior
- Some languages, such as Thai, may have different tokenization expectations than the original JavaScript implementation

**When Modifying Word Detection Logic:**
- All 118+ tests must pass to ensure no regressions across languages
- Pay special attention to CJK character handling and punctuation processing
- Test both individual functions (`wordsCount`, `wordsSplit`) and the combined `wordsDetect`

**Documentation:**
- API documentation is generated via `dart doc` and includes usage examples
- `example/word_count_example.dart` demonstrates all three functions and the configuration options