Under the Hood: Technical Insights from Building GGG Sankey

A technical deep-dive into the engineering challenges and solutions behind the microbial taxonomy visualization tool.

In our previous article, we introduced GGG-Sankey as a user-friendly tool designed to simplify the visualization of complex microbial data. While the interface is clean and straightforward, the logic operating behind the scenes deals with significant biological and technical complexity.

Developing a tool that transforms raw CSV data into interactive, scientifically accurate Sankey diagrams required us to solve several engineering hurdles. This article serves as a technical companion to our main overview, sharing the practical problems we encountered and the specific solutions we implemented to ensure reliability for researchers.

Challenge 1: Building a Robust Taxonomy Graph

At its core, a Sankey diagram represents flow between nodes. However, microbial taxonomy is not just simple flow; it is a hierarchical structure that must be strictly maintained (Kingdom → Phylum → Class → Order → Family → Genus → Species). One of the primary technical challenges we faced was constructing a graph backend that respects this biological hierarchy even when the input data is messy.

The Problem: Data Inconsistency

Real-world datasets are rarely perfect. We frequently encounter issues such as:

Inconsistent naming conventions across different databases.
Missing values (NA) at specific taxonomic levels (e.g., a sequence identified at Family level but unknown at the Genus level).
Disjointed hierarchies that break the visual flow of the diagram.

The Solution: Graph-Based Validation

To address this, GGG-Sankey constructs a directed graph structure in the backend before any visualization takes place. This process involves:

Unified Naming Conventions: The engine normalizes names across levels to prevent duplicate nodes caused by capitalization or whitespace differences.
Intelligent NA Handling: Instead of dropping rows with missing data (which would distort abundance calculations), the system creates placeholder nodes or terminates the flow gracefully at the last known rank. This ensures that the total abundance remains accurate throughout the plot.
Hierarchy Integrity: The algorithms enforce parent-child relationships, ensuring that a Species node always flows from its correct Genus parent, maintaining the scientific validity of the plot.
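
As an illustration only, the three validation steps above might be sketched as follows. The sketch is in Python for brevity, whereas the tool's backend runs in R, so the names `normalize`, `build_links`, `MISSING`, and the `abundance` column are hypothetical and not taken from the GGG-Sankey code. It assumes a lineage ends at the first missing rank and that lowercasing everything after the first letter is an acceptable normalization.

```python
from collections import defaultdict

RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
# Assumed spellings of "missing"; a real dataset may use others.
MISSING = {"", "na", "n/a", "nan"}


def normalize(name):
    """Collapse whitespace and unify capitalization; return None if missing."""
    if name is None:
        return None
    cleaned = " ".join(str(name).split())
    if cleaned.lower() in MISSING:
        return None
    return cleaned[0].upper() + cleaned[1:].lower()


def build_links(rows):
    """Return {(parent_id, child_id): summed abundance} for all lineages."""
    links = defaultdict(float)
    for row in rows:
        parent = None
        for rank in RANKS:
            name = normalize(row.get(rank))
            if name is None:
                break  # stop the flow at the last known rank; the row is not dropped
            # The node id carries the full parent path, so a child can only
            # ever hang under the lineage it was observed in.
            node = f"{rank}:{name}" if parent is None else f"{parent}|{rank}:{name}"
            if parent is not None:
                links[(parent, node)] += row["abundance"]
            parent = node
    return dict(links)


rows = [
    {"Kingdom": "Bacteria", "Phylum": "firmicutes ", "Class": "Bacilli", "abundance": 10.0},
    {"Kingdom": "bacteria", "Phylum": "Firmicutes", "Class": "NA", "abundance": 5.0},
]
links = build_links(rows)
```

Both rows contribute to the same Kingdom to Phylum link despite their different spellings, and the second row simply stops at Phylum instead of being discarded, so its abundance still counts at the higher ranks.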

Challenge 2: Managing Visual Complexity with Abundance Filtering

Microbiome datasets are characterized by high diversity, often containing hundreds or thousands of distinct taxa. While comprehensive, this density is fatal for visualization. A Sankey diagram with 500 distinct “ribbons” becomes an unreadable block of color, rendering the plot useless for analysis.

The Problem: The “Spaghetti” Effect

Without intervention, plotting raw amplicon sequence variant (ASV) or operational taxonomic unit (OTU) tables results in extreme visual clutter. The “long tail” of low-abundance organisms obscures the dominant signals that researchers are typically interested in.

The Solution: Dynamic Abundance Cutoffs

We implemented a user-controlled dynamic filtering system directly into the processing pipeline:

Adjustable Thresholds: Users can set cutoff values (e.g., 1% relative abundance). Taxa falling below this threshold are aggregated or hidden depending on the configuration.
Real-Time Recalculation: When the slider is moved, the backend immediately recalculates the relative abundances of the remaining taxa.
Balancing Detail vs. Clarity: This feature allows microbiologists to focus on the “core microbiome” or dominant drivers of a community without being distracted by rare biosphere noise.

Technical Note: This filtering is applied at the aggregation stage, meaning the visualization remains statistically representative of the displayed fraction, preventing misleading interpretations of the data.
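
A minimal sketch of such a cutoff, assuming the input is a mapping from taxon name to summed abundance at one rank. The function name `filter_by_abundance`, its parameters, and the `Other` label are hypothetical, and the Python is illustrative, not the tool's R code.

```python
def filter_by_abundance(counts, cutoff=0.01, aggregate=True, other_label="Other"):
    """Return relative abundances of taxa at or above the cutoff.

    Taxa below the cutoff are pooled under `other_label` when `aggregate`
    is True, and hidden otherwise. In both cases the returned values are
    renormalized so they sum to 1 over the displayed fraction.
    """
    total = sum(counts.values())
    if total <= 0:
        return {}
    kept = {taxon: c for taxon, c in counts.items() if c / total >= cutoff}
    rare = total - sum(kept.values())
    if aggregate and rare > 0:
        kept[other_label] = rare
    shown_total = sum(kept.values())
    return {taxon: c / shown_total for taxon, c in kept.items()}


counts = {"Firmicutes": 600, "Bacteroidota": 395, "Rare taxon": 5}
pooled = filter_by_abundance(counts, cutoff=0.01, aggregate=True)
hidden = filter_by_abundance(counts, cutoff=0.01, aggregate=False)
```

With `aggregate=False` the remaining taxa are rescaled against the displayed total only, which is one way to read "recalculates the relative abundances of the remaining taxa"; with `aggregate=True` the original proportions are preserved and the rare taxa appear as a single pooled entry.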

Challenge 3: Client-Side Rendering for Optimal Performance

A critical architectural decision was where to render the final visual output. While R Shiny is excellent for server-side data processing, rendering complex interactive graphics entirely on the server can lead to latency and poor responsiveness, especially with large datasets.

The Technical Decision

We chose to offload the final plot rendering to the client’s browser using JavaScript. The R server processes the CSV, handles the taxonomy logic, and calculates the graph nodes and links. These nodes and links are then sent to the browser as a lightweight JSON structure, where the actual drawing occurs.
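
To make the hand-off concrete, here is one plausible shape for such a payload. The article does not specify the actual schema, so the `nodes`/`links` keys and the `source`/`target`/`value` fields below are assumptions modeled on common Sankey layouts, and `to_sankey_payload` is a hypothetical helper written in Python purely for illustration.

```python
import json


def to_sankey_payload(links):
    """Convert {(source_name, target_name): value} into a nodes/links dict.

    Nodes are numbered in order of first appearance, and links refer to
    them by index to keep the serialized message small.
    """
    names = []
    index = {}
    for source, target in links:
        for name in (source, target):
            if name not in index:
                index[name] = len(names)
                names.append(name)
    return {
        "nodes": [{"name": name} for name in names],
        "links": [
            {"source": index[s], "target": index[t], "value": v}
            for (s, t), v in links.items()
        ],
    }


payload = to_sankey_payload({
    ("Bacteria", "Firmicutes"): 15.0,
    ("Firmicutes", "Bacilli"): 10.0,
})
message = json.dumps(payload)  # the string that would travel to the browser
```

Only names, indices, and numeric values cross the wire in this sketch; all positions and sizes are left for the browser to compute.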

Benefits and Practical Implications

Scalability: By reducing the computational load on the server, the app can support multiple concurrent users more effectively.
Smart Sizing Algorithm: The client-side script dynamically calculates the optimal height for each taxon box and the font size for text labels. This ensures that labels do not overlap and that the layout adapts to the available screen real estate.
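
The article does not give the sizing formula, so the following is only a guess at the general idea: scale box heights so one column fills the available height, then cap each label's font size by its box height. The function `layout_column`, the padding, the 0.8 ratio, and the 16-pixel cap are all hypothetical, and the sketch is in Python even though the real logic runs as JavaScript in the browser.

```python
def layout_column(values, viewport_height, padding=4.0, max_font=16.0, font_ratio=0.8):
    """Return (box_heights, font_sizes) in pixels for one column of taxa.

    Heights are proportional to abundance and, together with the padding
    between boxes, fill the viewport height exactly. Each font size is
    capped by its box height so a label never grows taller than its box.
    """
    if not values:
        return [], []
    usable = viewport_height - padding * (len(values) - 1)
    total = sum(values)
    if usable <= 0 or total <= 0:
        raise ValueError("viewport too small or no abundance to draw")
    scale = usable / total  # pixels per unit of abundance
    heights = [v * scale for v in values]
    fonts = [min(max_font, h * font_ratio) for h in heights]
    return heights, fonts


small_heights, small_fonts = layout_column([3, 1], viewport_height=24)
large_heights, large_fonts = layout_column([3, 1], viewport_height=204)
```

In this model a taller viewport raises the pixels-per-unit scale, so small taxa get larger boxes and more legible labels. A production version would likely also need a policy for boxes too small to label at a readable size, which this sketch leaves out.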

Practical User Tip

Because the rendering logic attempts to fit the graph into the current viewport, the aspect ratio of your browser window matters significantly. For the best experience, we recommend maximizing your browser window to full screen and then clicking “Reload” on the plot. This gives the smart sizing algorithm the maximum canvas size to work with, resulting in the cleanest separation between taxonomic flows.

Conclusion

The development of GGG-Sankey was not just an exercise in coding, but a response to the practical frustrations experienced by microbiologists working with high-dimensional taxonomy data. By building a robust graph backend, implementing smart data filtering, and leveraging client-side rendering, we have created a tool that bridges the gap between raw data and publication-ready visualization.

We believe that understanding these technical decisions helps users trust the visualizations they create. We invite you to try the web app and experience how these solutions work in practice to bring clarity to your research data.

Feel free to read our introduction article on GGG-Sankey here.

GGG Sankey is now available at: GGG Sankey