2025-107/2025-121 - SpectraCodec: A Hilbert curve-based method for encoding metadata in mass spectra for machine learning applications

Applications

Embeds experimental metadata directly into mass spectrometry (MS) spectral files for unified data handling
Streamlines machine learning (ML) workflows by preserving metadata with spectral data
Enables reproducible, version-controlled MS analyses across collaborative environments
Fully compatible with standard mzML formats and existing MS software tools

Advantages/Benefits

Eliminates separate metadata files, reducing complexity and risk of data loss
Maintains full compatibility with existing MS analysis software and pipelines
Open-source Python implementation integrates easily into ML workflows
Embeds metadata directly into mass spectrometry files using a Hilbert curve pattern

Background

Mass spectrometry is a key technology in systems biology and metabolomics. In practice, however, experimental metadata that is crucial to sample interpretation is often stored separately from the spectral data itself. This disconnect can lead to data loss, version control issues, and complications in machine learning (ML) workflows. These challenges are especially common in collaborative environments, where spectral data and metadata are often exchanged separately. SpectraCodec addresses this by encoding metadata directly within spectral files, keeping critical experimental details attached and ensuring consistent, integrated data throughout the analysis pipeline.

Technology Overview

Berkeley Lab researchers have developed SpectraCodec, a Python-based framework that embeds experimental metadata directly into mass spectrometry (MS) data files with high fidelity and compatibility. The system encodes metadata into the first spectrum of mzML files using a Hilbert space-filling curve, a mathematical mapping that preserves spatial relationships while maximizing use of the m/z–intensity space.

Encoding proceeds in stages. Compressed metadata undergoes base64 encoding followed by 7-bit binary conversion. Active bit coordinates are then mapped using two-dimensional Hilbert curve transformations. The resulting coordinate values are inserted as synthetic peaks at 5 m/z in the opening scan of LC-MS/MS data files, ensuring consistent dimensionality and reliable data recovery.

By integrating metadata within the spectral file itself, SpectraCodec eliminates the need for separate annotation files, reducing the risk of data fragmentation, version mismatches, and metadata loss. This unified file format enhances data accessibility, reproducibility, and machine learning integration by maintaining a persistent link between contextual and feature data. SpectraCodec is open-source, platform-independent, and fully compatible with existing MS analysis pipelines, making it easy to adopt within current research workflows.
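The encoding stages described in this overview (compression, base64 encoding, 7-bit binary conversion, and Hilbert mapping of the active bits) can be illustrated with a minimal Python sketch. This is not the SpectraCodec implementation. The function names, the use of JSON and zlib, the automatic choice of curve order, and the decoding rule are all assumptions made for illustration, and the sketch stops at (x, y) grid coordinates rather than writing peaks into an mzML file.

```python
import base64
import json
import zlib


def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/flip the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_xy2d(order, x, y):
    """Inverse mapping: (x, y) on a 2**order grid back to curve distance."""
    n = 1 << order
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def encode_metadata(metadata):
    """Hypothetical encoder: dict -> (curve order, coordinates of active bits)."""
    # Assumption: metadata is serialized as JSON and compressed with zlib.
    raw = json.dumps(metadata, sort_keys=True).encode("utf-8")
    payload = base64.b64encode(zlib.compress(raw))
    # Base64 output is plain ASCII, so each character fits in 7 bits.
    bits = "".join(format(byte, "07b") for byte in payload)
    # Assumption: use the smallest grid whose 4**order cells hold every bit.
    order = 1
    while 4 ** order < len(bits):
        order += 1
    coords = [hilbert_d2xy(order, i) for i, bit in enumerate(bits) if bit == "1"]
    return order, coords


def decode_metadata(order, coords):
    """Hypothetical decoder: recover the dict from the active-bit coordinates."""
    active = {hilbert_xy2d(order, x, y) for x, y in coords}
    # Every base64 character has at least one set bit, so the highest
    # active index always falls inside the final 7-bit character.
    n_chars = max(active) // 7 + 1
    bits = "".join("1" if i in active else "0" for i in range(n_chars * 7))
    payload = bytes(int(bits[i:i + 7], 2) for i in range(0, len(bits), 7))
    return json.loads(zlib.decompress(base64.b64decode(payload)))
```

In this sketch only the positions of the set bits are stored, which is why the round trip needs nothing beyond the curve order and the coordinate list. How those coordinates are scaled onto m/z and intensity values in the first spectrum is left out, since the listing does not specify it.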

Development Stage

TRL 7, full-scale prototype

Principal Investigator(s)

Benjamin Bowen
Trent Northen

Status

Patent pending

Opportunities

Available for licensing or collaborative research
