# libfyaml Python Binding — API Reference

The `libfyaml` Python binding exposes the high-performance libfyaml C library directly. Parsed documents are represented as `FyGeneric` objects — lazy wrappers that defer conversion to Python types until you ask for them. This keeps memory low and lets you navigate large documents without materialising every node.

---

## Table of Contents

1. [Quick Start](#quick-start)
2. [Parsing](#parsing)
   - [Parse modes](#parse-modes)
   - [Parser options](#parser-options)
3. [The FyGeneric Type](#the-fygeneric-type)
   - [Type checking](#type-checking)
   - [Converting to Python](#converting-to-python)
   - [Container access](#container-access)
   - [Tags and anchors](#tags-and-anchors)
   - [Source markers](#source-markers)
   - [Comments](#comments)
   - [Diagnostics](#diagnostics)
4. [Serialisation](#serialisation)
   - [Scalar styles](#scalar-styles)
5. [Converting Python objects](#converting-python-objects)
6. [Path navigation](#path-navigation)
7. [Mutability](#mutability)
8. [FyDocumentState](#fydocumentstate)
9. [Memory management](#memory-management)
10. [Error handling](#error-handling)
11. [Comparison with PyYAML](#comparison-with-pyyaml)

---

## Quick Start

```python
import libfyaml as fy

# Parse a YAML string
doc = fy.loads("name: Alice\nage: 30")
print(doc["name"])        # FyGeneric wrapping "Alice"
print(str(doc["name"]))   # "Alice"
print(doc.to_python())    # {'name': 'Alice', 'age': 30}

# Parse a file
doc = fy.load("config.yaml")

# Serialise back to YAML
print(fy.dumps(doc))

# Parse JSON
data = fy.loads('{"x": 1}', mode='json')
```

---

## Parsing

### `loads(s, **options) → FyGeneric`

Parse a YAML or JSON **string**. Raises `ValueError` if the input contains more than one document — use `loads_all` for multi-document streams.

```python
doc = fy.loads("key: value")
docs = fy.loads_all("---\na: 1\n---\nb: 2")  # list of FyGeneric
```

### `load(file, **options) → FyGeneric`

Parse from a **file path** (string — uses mmap internally) or any **file-like object** with a `.read()` method.

```python
doc = fy.load("data.yaml")

with open("data.yaml") as f:
    doc = fy.load(f)
```

### `loads_all(s, **options) → list[FyGeneric]`
### `load_all(file, **options) → list[FyGeneric]`

Return all documents in a multi-document stream as a list.

```python
docs = fy.loads_all("---\n1\n---\n2\n---\n3")
# [FyGeneric(1), FyGeneric(2), FyGeneric(3)]
```

---

### Parse modes

The `mode` parameter controls which YAML dialect is accepted:

| Mode string | Meaning |
|---|---|
| `'yaml'`, `'yaml1.2'`, `'1.2'` | YAML 1.2 — the default |
| `'yaml1.1'`, `'1.1'` | YAML 1.1 (accepts merge keys `<<`, sexagesimal numbers, etc.) |
| `'yaml1.1-pyyaml'`, `'pyyaml'` | YAML 1.1 with PyYAML-compatible quirks (used by the compat layer) |
| `'json'` | Strict JSON |

```python
# Merge keys only work in YAML 1.1
doc = fy.loads("""
defaults: &defaults
  timeout: 30
server:
  <<: *defaults
  host: localhost
""", mode='yaml1.1')
```

---

### Parser options

All four parse functions accept the same keyword options:

| Option | Default | Description |
|---|---|---|
| `mode` | `'yaml'` | Dialect — see above |
| `dedup` | `True` | Use the deduplication allocator (saves memory for documents with repeated content) |
| `trim` | `True` | Release unused allocator memory after parsing |
| `mutable` | `False` | Produce mutable `FyGeneric` objects (required for `__setitem__` and `set_at_path`) |
| `collect_diag` | `False` | Attach parse diagnostics to the result instead of raising |
| `create_markers` | `False` | Record byte/line/column positions for every node |
| `keep_comments` | `False` | Preserve YAML comments in the document |
| `keep_style` | `False` | Preserve original scalar styles (literal, folded, quoted, …) |

---

## The FyGeneric Type

`FyGeneric` is the
type returned by all parse functions. It wraps a C `fy_generic` value without copying data. Conversion to Python only happens when you explicitly ask for it.

```python
doc = fy.loads("x: 42")
type(doc)        # FyGeneric
doc.__class__    # the Python equivalent class
```

### Type checking

Eight predicate methods, each returning `bool`:

```python
v = fy.loads("42")
v.is_null()       # False
v.is_bool()       # False
v.is_int()        # True
v.is_float()      # False
v.is_string()     # False
v.is_sequence()   # False
v.is_mapping()    # False
v.is_indirect()   # True if the value carries a tag or anchor
```

### Converting to Python

```python
doc = fy.loads("items: [1, 2, 3]")

# Recursive — the whole document becomes plain Python
doc.to_python()   # {'items': [1, 2, 3]}

# Scalar coercions
n = fy.loads("99")
int(n)     # 99
float(n)   # 99.0
bool(n)    # True
str(n)     # "99"
```

`to_python()` raises `TypeError` if a mapping key is unhashable (e.g. a nested mapping used as a key).

### Container access

Sequences and mappings support the standard Python container protocol:

```python
doc = fy.loads("fruits: [apple, banana, cherry]")
fruits = doc["fruits"]

len(fruits)          # 3
fruits[0]            # FyGeneric("apple")
str(fruits[0])       # "apple"
"banana" in fruits   # True (linear scan)

for item in fruits:
    print(str(item))

# Mappings
doc["fruits"]   # FyGeneric sequence
doc.keys()      # ['fruits']
doc.values()    # [FyGeneric sequence]
doc.items()     # [('fruits', FyGeneric sequence)]
```

Attribute access on mappings delegates to the underlying dict:

```python
doc = fy.loads("host: localhost\nport: 8080")
str(doc.host)   # "localhost"
int(doc.port)   # 8080
```

Numeric operations on integer and float values work directly:

```python
v = fy.loads("10")
v + 5   # 15
v * 2   # 20
v > 5   # True
```

### Tags and anchors

```python
doc = fy.loads("value: !!int '42'")
v = doc["value"]
v.has_tag()   # True
v.get_tag()   # "tag:yaml.org,2002:int"

doc2 = fy.loads("x: &myanchor hello\ny: *myanchor")
doc2["x"].has_anchor()   # True
doc2["x"].get_anchor()   # "myanchor"
```

### Source markers

Markers
record the byte offset, line, and column of each node in the original source. Enable them at parse time with `create_markers=True`.

```python
doc = fy.loads("host: localhost\nport: 8080", create_markers=True)

m = doc["host"].get_marker()
# (start_byte, start_line, start_col, end_byte, end_line, end_col)
# e.g. (6, 0, 6, 15, 0, 15)

doc["host"].has_marker()   # True
doc["port"].get_marker()   # (22, 1, 6, 26, 1, 10)
```

Lines and columns are zero-based, and the end position is exclusive: `8080` occupies bytes 22 to 25, so its marker ends at byte 26, column 10. `get_marker()` returns `None` when markers were not enabled.

### Comments

Preserve YAML comments by parsing with `keep_comments=True`.

```python
yaml_text = """\
# Server settings
host: localhost  # primary
port: 8080
"""

doc = fy.loads(yaml_text, keep_comments=True)
doc["host"].get_comment()   # "# primary"
doc["host"].has_comment()   # True
```

### Diagnostics

With `collect_diag=True`, parse errors are attached to the document rather than raised immediately. This lets you process partially valid input.

```python
doc = fy.loads("good: ok\nbad: {unclosed", collect_diag=True)
doc.has_diag()   # True
doc.get_diag()   # FyGeneric describing the error(s)
```

---

## Serialisation

### `dumps(obj, *, compact=False, json=False, style=None, indent=0) → str`

Serialise a `FyGeneric` or plain Python object to a YAML (or JSON) string.

```python
doc = fy.loads("name: Alice\nscores: [10, 20, 30]")

print(fy.dumps(doc))
# name: Alice
# scores:
# - 10
# - 20
# - 30

print(fy.dumps(doc, compact=True))
# {name: Alice, scores: [10, 20, 30]}

print(fy.dumps(doc, json=True))
# {"name": "Alice", "scores": [10, 20, 30]}
```

`indent` sets the indentation width (2–8 spaces; 0 uses the library default).

### `dump(file, obj, *, mode='yaml', compact=False)`

Write to a file path (string) or file-like object. `mode` accepts `'yaml'` or `'json'`.

```python
fy.dump("output.yaml", doc)

with open("output.json", "w") as f:
    fy.dump(f, doc, mode='json')
```

### `dumps_all(documents, *, compact=False, json=False, style=None) → str`
### `dump_all(file, documents, *, compact=False, json=False)`

Serialise a list of documents with `---` separators.

```python
docs = fy.loads_all("---\na: 1\n---\nb: 2")
print(fy.dumps_all(docs))
# ---
# a: 1
# ---
# b: 2
```

### Individual node serialisation

`FyGeneric` objects have their own `.dump()` method:

```python
import sys

doc = fy.loads("x: 1\ny: 2")
doc["x"].dump()                          # returns "1\n"
doc["x"].dump(strip_newline=True)        # returns "1"
doc["x"].dump("node.yaml")               # writes to file
doc["x"].dump(sys.stdout, mode='json')   # writes to file object
```

---

### Scalar styles

The `style` parameter controls how scalar values are written. Accepted values:

| Style | Effect |
|---|---|
| `None` or `'default'` | Library default (usually plain) |
| `'original'` | Preserve the style from the parsed input (requires `keep_style=True` at parse time) |
| `'block'` | Block scalars (literal `\|` or folded `>`) |
| `'flow'` | Flow / inline style |
| `'pretty'` | Readable multi-line format |
| `'compact'` | Compact single-line |
| `'oneline'` | Force everything onto one line |

```python
doc = fy.loads("text: 'hello world'")
print(fy.dumps(doc, style='block'))
print(fy.dumps(doc, style='flow'))
```

---

## Converting Python objects

### `from_python(obj, *, tag=None, style=None, mutable=False, dedup=True) → FyGeneric`

Convert a plain Python object (`dict`, `list`, `str`, `int`, `float`, `bool`, `None`) to a `FyGeneric`. Useful for attaching tags or styles before serialisation.

```python
# Attach a YAML tag
v = fy.from_python("hello", tag="!mytag")
print(fy.dumps(v))   # !mytag hello

# Control the scalar style
text = fy.from_python("line one\nline two\n", style='|')
print(fy.dumps(text))
# |
#   line one
#   line two
```

Scalar `style` values accepted by `from_python`:

| Style | Meaning |
|---|---|
| `'\|'` | Literal block scalar |
| `'>'` | Folded block scalar |
| `"'"` | Single-quoted |
| `'"'` | Double-quoted |
| `'plain'` or `''` | Plain (unquoted) |

---

## Path navigation

### `get_at_path(path) → FyGeneric`
### `get_at_unix_path(path_str) → FyGeneric`

Navigate into a nested document. A path is a list of keys (strings) and indices (integers).

```python
doc = fy.loads("""
servers:
  - host: web01
    port: 80
  - host: web02
    port: 443
""")

doc.get_at_path(["servers", 0, "host"])    # FyGeneric("web01")
doc.get_at_unix_path("/servers/0/host")    # FyGeneric("web01")
doc.get_at_unix_path("/servers/1/port")    # FyGeneric(443)
```

`get_at_path` raises `KeyError` if the path does not exist.

### `get_path() → tuple` / `get_unix_path() → str`

Return the path of a node within its document (useful when iterating):

```python
doc = fy.loads("a:\n  b:\n    c: 42")
v = doc.get_at_unix_path("/a/b/c")
v.get_unix_path()   # "/a/b/c"
v.get_path()        # ('a', 'b', 'c')
```

### Path utility functions

```python
fy.path_list_to_unix_path(["servers", 0, "host"])   # "/servers/0/host"
fy.unix_path_to_path_list("/servers/0/host")        # ["servers", 0, "host"]
```

---

## Mutability

By default `FyGeneric` objects are immutable. Pass `mutable=True` to the parse function (or `from_python`) to allow in-place modification.

```python
doc = fy.loads("x: 1\ny: 2", mutable=True)

doc["x"] = 99
str(doc["x"])   # "99"

doc.set_at_path(["y"], "updated")
doc.set_at_unix_path("/x", 0)

print(fy.dumps(doc))
# x: 0
# y: updated
```

Attempting to modify an immutable object raises `TypeError`.

---

## FyDocumentState

`FyDocumentState` carries the YAML directives that appeared before a document.
Access it via `FyGeneric.document_state`.

```python
doc = fy.loads("%YAML 1.2\n---\nkey: value")
ds = doc.document_state

ds.version            # (1, 2)
ds.version_explicit   # True
ds.json_mode          # False
ds.tags               # list of {'handle': ..., 'prefix': ...} dicts
ds.tags_explicit      # True if %TAG directives were present
```

`document_state` is `None` for values that are not document roots.

---

## Memory management

### Allocator strategy

The `dedup=True` default uses a deduplication allocator that stores only one copy of repeated strings or scalars. This is a significant win for large documents with repeated content (e.g. YAML files with many identical keys or values).

Set `dedup=False` to use the standard allocator, which may be faster for small documents or documents with little repetition.

### Trim

`trim=True` (default) releases unused allocator pages after parsing is complete. Disable with `trim=False` if you will be building on the document after parsing and want to avoid reallocation.

### Manual trim

```python
doc = fy.loads(large_yaml, trim=False)
# ... do some work ...
doc.trim()   # release unused memory now
```

### Clone

`clone()` creates an independent copy of a `FyGeneric` value, decoupled from the original document's allocator:

```python
original = fy.load("big.yaml")
part = original.get_at_unix_path("/config/server").clone()
del original   # can now be collected
```

---

## Error handling

| Exception | Raised when |
|---|---|
| `ValueError` | Parse error; invalid mode string; invalid style; multiple documents where one was expected |
| `TypeError` | Wrong argument type; mutation on an immutable object; unhashable mapping key in `to_python()` or `items()` |
| `KeyError` | Path not found in `get_at_path` / `get_at_unix_path` |
| `RuntimeError` | Internal builder or emitter failure; file write error |
| `AttributeError` | Attribute access on a non-mapping `FyGeneric` |
| `NotImplementedError` | `del` on a `FyGeneric` item |

```python
try:
    doc = fy.loads("key: [unclosed")
except ValueError as e:
    print(f"Parse error: {e}")

# Or collect errors without raising:
doc = fy.loads("key: [unclosed", collect_diag=True)
if doc.has_diag():
    print(doc.get_diag().to_python())
```

---

## Comparison with PyYAML

This section describes how the **core `libfyaml` binding** relates to PyYAML.

### Where they are similar

- **Function names**: `load`, `loads`, `dump`, `dumps` follow the same naming convention as PyYAML's `yaml.safe_load` / `yaml.dump`.
- **Python types out**: both ultimately produce `dict`, `list`, `str`, `int`, `float`, `bool`, and `None`. Call `.to_python()` on a `FyGeneric` to get the plain Python value.
- **YAML tag handling**: both support `!!str`, `!!int`, `!!float`, `!!bool`, `!!null`, `!!seq`, `!!map`, `!!binary`, and custom tags.
- **Multi-document streams**: both support `---`-separated documents via `load_all` / `loads_all`.

### Where they diverge

#### Return type

The most immediate difference: `loads` returns a `FyGeneric`, not a native Python object.
You must call `.to_python()` (or use the object directly via the container/numeric protocols) to get a plain `dict` or `list`.

```python
# PyYAML
import yaml
result = yaml.safe_load("x: 1")
type(result)   # dict

# libfyaml
import libfyaml as fy
result = fy.loads("x: 1")
type(result)               # FyGeneric
type(result.to_python())   # dict
```

#### API shape: mode instead of Loader

PyYAML selects behaviour through `Loader` classes (`SafeLoader`, `FullLoader`, `BaseLoader`). libfyaml uses a `mode` string:

```python
# PyYAML
yaml.load(s, Loader=yaml.SafeLoader)
yaml.safe_load(s)

# libfyaml
fy.loads(s)                          # YAML 1.2 (roughly equivalent to SafeLoader)
fy.loads(s, mode='yaml1.1-pyyaml')   # closest to PyYAML's SafeLoader behaviour
```

There are no Loader or Dumper classes in the core binding.

#### Default YAML version: 1.2 not 1.1

libfyaml defaults to **YAML 1.2**. PyYAML implements **YAML 1.1**. This affects implicit type resolution:

| Input | PyYAML (1.1) | libfyaml default (1.2) |
|---|---|---|
| `yes` / `no` / `on` / `off` | `True` / `False` | string |
| `0755` | `493` (octal int) | string |
| `1:30` (sexagesimal) | `90` (int) | string |
| `1.5e3` | `1500.0` | `1500.0` |
| `.inf` / `.nan` | `inf` / `nan` | `inf` / `nan` |

Use `mode='yaml1.1'` or `mode='yaml1.1-pyyaml'` to get YAML 1.1 resolution.

#### Strictness differences in YAML 1.1 mode

Even in `yaml1.1-pyyaml` mode a few corner cases differ because libfyaml follows the YAML specification more strictly than PyYAML does:

| Situation | PyYAML | libfyaml |
|---|---|---|
| Duplicate anchor (`&a 1 ... &a 2`) | `ComposerError` | accepted (spec §3.2.2.2 allows redefinition) |
| Unknown `%DIRECTIVE` | `ScannerError` | warning, continues (spec §6.8.1 says SHOULD warn) |
| `?` in anchor name (`&?foo`) | `ScannerError` | accepted (`?` is a valid `ns-anchor-char` per spec §6.9.2) |
| Sexagesimal integers (`190:20:30`) | `685230` | string (not resolved) |
| Sexagesimal floats (`190:20:30.15`) | `685230.15` | string (not resolved) |
| Single dot (`.`) | string | `0.0` (float — C library bug) |
| `---` as flow scalar | string | `null` (C library bug) |

#### Error messages

libfyaml and PyYAML produce different human-readable error messages for the same parse errors. Code that pattern-matches exception strings will need adjustment; code that only catches the exception type will be fine.

#### Block scalar emission

libfyaml follows the YAML spec strictly when choosing scalar styles, which means it will refuse to use a block scalar (`|` or `>`) in contexts where the spec does not permit one — for example as a value inside a flow collection. PyYAML emits block scalars in those contexts anyway, producing output that is technically non-conformant.

If you serialise a document that PyYAML would render with block scalars inside flow collections, libfyaml will choose a flow-compatible style (double-quoted) instead.

#### Unicode line separators (U+2028 / U+2029)

The YAML 1.2 spec (§6.5) classifies U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) as line-break characters. libfyaml honours this in block scalars, treating them as line breaks during both parsing and emission. PyYAML predates this clarification and treats them as ordinary non-breaking characters throughout.

If your data contains these code points, block-style round-trips will produce different results between the two libraries. Use double-quoted scalars to preserve them unambiguously in either library.
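
One way to find affected values before serialising is a small standard-library scan over plain Python data (for example, the result of `.to_python()`). This is only a sketch: the helper name `find_line_separators` is hypothetical and not part of libfyaml or PyYAML.

```python
# Hypothetical helper (not a libfyaml API): locate strings that contain
# U+2028 (LINE SEPARATOR) or U+2029 (PARAGRAPH SEPARATOR).
LINE_SEPARATORS = ("\u2028", "\u2029")


def find_line_separators(value, path=()):
    """Yield the path (tuple of keys/indices) of every affected string value."""
    if isinstance(value, str):
        if any(ch in value for ch in LINE_SEPARATORS):
            yield path
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_line_separators(item, path + (key,))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            yield from find_line_separators(item, path + (index,))


data = {"a": "plain", "b": ["x\u2028y", "z"], "c": {"d": "p\u2029q"}}
for hit in find_line_separators(data):
    print(hit)
```

Values reported by the scan are candidates for double-quoted emission, as recommended above.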

#### !!binary tag syntax

libfyaml accepts inline `!!binary` scalars (`!!binary aGVsbG8=`) in addition to the block form that PyYAML requires (`!!binary |\n aGVsbG8=`). Both forms decode to `bytes`.

#### Features not in PyYAML

The core binding provides capabilities that PyYAML has no equivalent for:

- **Source markers** (`create_markers=True`) — byte/line/column positions for every node, without the overhead of PyYAML's `Mark` objects on events.
- **Comment preservation** (`keep_comments=True`).
- **Style preservation** (`keep_style=True`) — round-trip the original scalar style (literal, folded, single-quoted, etc.).
- **Path navigation** — `get_at_unix_path`, `set_at_unix_path` for direct document surgery without tree traversal code.
- **Deduplication allocator** — dramatically lower memory usage for documents with repeated content.
- **`FyDocumentState`** — programmatic access to `%YAML` and `%TAG` directives.

---

## Appendix: Parse performance

### Methodology

Configurations were measured by running `docs/benchmark-parse.py` against two real-world YAML files. Each configuration runs in an **isolated subprocess** so that allocations from earlier runs cannot inflate later measurements. All libraries are imported **before** the baseline RSS is measured so that library load cost (the `.so` footprint) is excluded from the delta. The RSS delta therefore reflects only the memory added by parsing that specific file — the data structures created, the source text mapped, the allocator pages used.

Five timed repetitions were taken per configuration; the tables report the **median** parse time and **median peak RSS delta** across those runs.

The benchmark can be reproduced on any YAML file:

```
python3 docs/benchmark-parse.py [--runs N] [--multi]
```

Use `--multi` for files containing multiple `---`-separated documents.
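
The procedure described above can be approximated with a short, standard-library-only harness. This is a sketch of the idea, not the contents of `docs/benchmark-parse.py`: `json.loads` stands in for the parser under test, launching one subprocess per repetition is an assumption, and the `resource` module is Unix-only (`ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS).

```python
import json
import statistics
import subprocess
import sys

# Child program: import everything first, record the baseline peak RSS,
# then parse and report elapsed time plus the growth in peak RSS.
CHILD = """
import json, resource, sys, time

payload = sys.stdin.read()
baseline = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
start = time.perf_counter()
result = json.loads(payload)  # stand-in for the parser under test
elapsed_ms = (time.perf_counter() - start) * 1000.0
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(json.dumps({"ms": elapsed_ms, "rss_delta": peak - baseline}))
"""


def run_once(payload):
    """Run one measurement in a fresh interpreter and return its sample."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=payload, capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)


def benchmark(payload, runs=5):
    """Collect `runs` isolated samples and summarise them."""
    samples = [run_once(payload) for _ in range(runs)]
    times = [s["ms"] for s in samples]
    return {
        "median_ms": statistics.median(times),
        "min_ms": min(times),
        "median_rss_delta": statistics.median(s["rss_delta"] for s in samples),
    }


if __name__ == "__main__":
    print(benchmark(json.dumps({"items": list(range(10000))})))
```

Because `ru_maxrss` is a high-water mark, the delta is never negative; it is zero when parsing fits inside memory the interpreter had already touched.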

**Note on PyYAML compatibility.** PyYAML's `SafeLoader` and `CLoader` do not recognise `tag:yaml.org,2002:value`, the tag YAML 1.1 assigns to a bare `=` scalar. YAML 1.2 treats `=` as a plain string, and it appears legitimately in both test files (e.g. as an enum value in Kubernetes CRD schemas). The benchmark registers a one-line constructor fix so PyYAML can parse these files; libfyaml handles them correctly without any patching.

**Environment**

| Item | Version |
|---|---|
| CPU | AMD Ryzen 5 5600X |
| Python | 3.12.3 |
| PyYAML | 6.0.1 |
| libyaml (CLoader) | 0.2.5 |
| libfyaml | v0.9.3-278 (release build) |

### Results — 6.4 MB (`AtomicCards-2-cleaned-small.yaml`, single-doc)

Magic: The Gathering card database — highly varied text content with moderate key repetition.

```mermaid
xychart-beta horizontal
    title "Parse time — AtomicCards 6.4 MB (ms, lower is better)"
    x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
    y-axis "ms" 0 --> 7500
    bar [7155, 1228, 115, 102]
```

```mermaid
xychart-beta horizontal
    title "RSS delta — AtomicCards 6.4 MB (MB, lower is better)"
    x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
    y-axis "MB" 0 --> 175
    bar [164, 123, 28, 25]
```

| Configuration | Median | Min | RSS delta |
|---|---|---|---|
| PyYAML `safe_load` (pure Python) | 7155 ms | 7033 ms | +164 MB |
| PyYAML `CLoader` (libyaml) | 1228 ms | 1172 ms | +123 MB |
| libfyaml `dedup=True` (default) | 115 ms | 114 ms | +28 MB |
| libfyaml `dedup=False` | 102 ms | 101 ms | +25 MB |

### Results — 4.3 MB (`bundle.yaml`, multi-doc, 24 documents)

Prometheus Operator CRD bundle ([source](https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml)) — structured Kubernetes schemas with heavy key repetition (`name`, `type`, `description`, `properties`, `spec` recurring throughout).

```mermaid
xychart-beta horizontal
    title "Parse time — bundle.yaml 4.3 MB (ms, lower is better)"
    x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
    y-axis "ms" 0 --> 3200
    bar [2964, 274, 53, 48]
```

```mermaid
xychart-beta horizontal
    title "RSS delta — bundle.yaml 4.3 MB (MB, lower is better)"
    x-axis ["PyYAML safe_load", "PyYAML CLoader", "libfyaml dedup=on", "libfyaml dedup=off"]
    y-axis "MB" 0 --> 20
    bar [16, 14, 3, 10]
```

| Configuration | Median | Min | RSS delta |
|---|---|---|---|
| PyYAML `safe_load` (pure Python) | 2964 ms | 2919 ms | +16 MB |
| PyYAML `CLoader` (libyaml) | 274 ms | 267 ms | +14 MB |
| libfyaml `dedup=True` (default) | 53 ms | 52 ms | +3 MB |
| libfyaml `dedup=False` | 48 ms | 48 ms | +10 MB |

### Analysis

**Speed.** With the default `dedup=True`, libfyaml is **about 11× faster than CLoader** on the card database (115 ms vs 1228 ms) and **about 5× faster** on the CRD bundle (53 ms vs 274 ms). Against pure-Python PyYAML it is **56–62× faster**. The gap against the pure-Python loader is expected: PyYAML constructs every node as a heap-allocated Python object while iterating the event stream in interpreted bytecode. The gap against CLoader is more meaningful: both parsers are written in C, but libfyaml uses mmap for file I/O, a purpose-built allocator, and avoids the two-phase parse/construct split that libyaml's event model requires.

**Memory.** libfyaml uses **less RSS than PyYAML** for the parsed data structure in every configuration measured. PyYAML allocates a heap object (dict, list, str, int, …) for every node in the document; libfyaml stores values in its arena allocator with `FyGeneric` wrappers created lazily on access. On the card database, libfyaml uses **~77–80% less RSS than CLoader** (+25–28 MB vs +123 MB). On the CRD bundle it uses **~79% less** with the default `dedup=True` (+3 MB vs +14 MB) and ~29% less with `dedup=False` (+10 MB vs +14 MB). Note that libfyaml's `.so` file itself has a significant up-front import cost (~50 MB RSS), which is a fixed one-time overhead amortised across all subsequent `load()` calls and not included in the delta figures above.
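
The speed-up and memory ratios can be recomputed directly from the two result tables. The snippet below is a small arithmetic check using only the published medians; the dictionary layout and function names are illustrative.

```python
# (median parse time in ms, RSS delta in MB), copied from the result tables.
RESULTS = {
    "AtomicCards": {
        "safe_load": (7155, 164),
        "CLoader": (1228, 123),
        "dedup=True": (115, 28),
        "dedup=False": (102, 25),
    },
    "bundle.yaml": {
        "safe_load": (2964, 16),
        "CLoader": (274, 14),
        "dedup=True": (53, 3),
        "dedup=False": (48, 10),
    },
}


def speedup(dataset, baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return RESULTS[dataset][baseline][0] / RESULTS[dataset][candidate][0]


def rss_reduction(dataset, baseline, candidate):
    """Fraction of the baseline RSS delta that `candidate` avoids."""
    return 1.0 - RESULTS[dataset][candidate][1] / RESULTS[dataset][baseline][1]


for dataset in RESULTS:
    for candidate in ("dedup=True", "dedup=False"):
        print(
            f"{dataset:12s} {candidate:12s} "
            f"vs CLoader: {speedup(dataset, 'CLoader', candidate):5.1f}x faster, "
            f"{rss_reduction(dataset, 'CLoader', candidate):4.0%} less RSS; "
            f"vs safe_load: {speedup(dataset, 'safe_load', candidate):5.1f}x faster"
        )
```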

**dedup vs no-dedup.** On the card database, `dedup=True` costs ~13 ms and ~3 MB more than `dedup=False` (115 ms vs 102 ms, +28 MB vs +25 MB): the text content is highly varied, so the dedup allocator finds little to share and brings no memory benefit. On the CRD bundle, `dedup=True` *saves* 7 MB compared to `dedup=False` (+3 MB vs +10 MB) because Kubernetes schemas repeat the same field names (`name`, `type`, `description`, `properties`, …) thousands of times across 24 documents. The deduplication allocator is the right default for structured configuration and API-schema YAML; for documents with unique free-form text, `dedup=False` is marginally faster.
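
The trade-off can be illustrated with a toy model in plain Python. This is not how libfyaml's allocator is implemented; the `DedupStore` class is hypothetical and only shows why repeated keys benefit from deduplication while unique free-form text does not.

```python
class DedupStore:
    """Toy deduplicating store: each distinct string is kept only once."""

    def __init__(self):
        self._pool = {}
        self.requests = 0  # how many values were submitted for storage

    def store(self, value):
        """Return the pooled copy of `value`, adding it on first sight."""
        self.requests += 1
        return self._pool.setdefault(value, value)

    @property
    def unique(self):
        """Number of distinct values actually held."""
        return len(self._pool)


# Schema-like input: the same four field names repeated many times.
schema = DedupStore()
for _ in range(1000):
    for key in ("name", "type", "description", "properties"):
        schema.store(key)

# Free-form input: every string is different, so nothing can be shared.
prose = DedupStore()
for i in range(1000):
    prose.store(f"card text {i}")

print(schema.requests, schema.unique)
print(prose.requests, prose.unique)
```

In the schema-like case almost every store request is answered from the pool; in the free-form case the pool lookup is paid on every request with nothing saved, which matches the pattern seen in the two benchmark files.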