Filtering Data#

ARGscape provides several ways to filter and subset your tree sequence for visualization. This is essential for exploring large datasets and creating focused figures.

Setup#

Let’s simulate a tree sequence to demonstrate filtering. Because msprime simulates diploid individuals by default, samples=50 yields 100 sample nodes.

import msprime
import argscape

# Simulate 50 diploid individuals (100 sample nodes)
ts = msprime.sim_ancestry(
    samples=50,
    sequence_length=10_000,
    recombination_rate=1e-8,
    population_size=10_000,
    random_seed=42
)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=42)

print("Full dataset:")
print(f"  Samples: {ts.num_samples}")
print(f"  Nodes: {ts.num_nodes}")
print(f"  Edges: {ts.num_edges}")
print(f"  Trees: {ts.num_trees}")
print(f"  Sequence length: {ts.sequence_length:,.0f} bp")
Full dataset:
  Samples: 100
  Nodes: 214
  Edges: 256
  Trees: 19
  Sequence length: 10,000 bp

Sample Subsetting#

ARGscape provides multiple ways to select which samples to display:

  • max_samples with subset_mode: Limit to N samples using even spacing or random selection

  • samples (list): Select specific samples by their node IDs

  • samples (tuple): Select a range of samples by index as (start, end), where end is exclusive

# Visualize all 100 sample nodes (50 diploid individuals)
viz = argscape.visualize(ts, show_sample_ids=True, width=2000)
viz.display()
# Subset to 20 samples (evenly spaced)
viz = argscape.visualize(
    ts, 
    max_samples=20,
    show_sample_ids=True,
    height=700
)
viz.display()
# Random subset with seed for reproducibility
viz = argscape.visualize(
    ts,
    max_samples=10,
    subset_mode="random",
    subset_seed=123,
    height=500
)
viz.display()
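This guide does not spell out how the "even" and "random" modes choose samples. As a rough illustration only, the sketch below shows one way each strategy could work. The helpers pick_even and pick_random are hypothetical names, not part of ARGscape, and the exact IDs ARGscape selects may differ.

```python
import random


def pick_even(sample_ids, max_samples):
    """Return up to max_samples IDs spread evenly across sample_ids."""
    ids = list(sample_ids)
    if max_samples <= 0:
        raise ValueError("max_samples must be positive")
    if max_samples >= len(ids):
        return ids
    if max_samples == 1:
        return [ids[0]]
    # Integer arithmetic always keeps the first and last IDs
    # and avoids floating-point rounding ties.
    last = len(ids) - 1
    return [ids[i * last // (max_samples - 1)] for i in range(max_samples)]


def pick_random(sample_ids, max_samples, seed=None):
    """Return up to max_samples IDs drawn without replacement."""
    ids = list(sample_ids)
    if max_samples <= 0:
        raise ValueError("max_samples must be positive")
    if max_samples >= len(ids):
        return ids
    # A private generator leaves the global random state untouched,
    # so the same seed always gives the same subset.
    rng = random.Random(seed)
    return sorted(rng.sample(ids, max_samples))


# Stand-in for the 100 sample node IDs (with tskit: list(ts.samples())).
sample_ids = list(range(100))
even = pick_even(sample_ids, 20)
rand = pick_random(sample_ids, 10, seed=123)
print("even:  ", even)
print("random:", rand)
```

Even spacing preserves whatever ordering the samples already have, which helps when they are grouped by population. A seeded random draw is the better choice when that ordering could bias the picture.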

Direct Sample Selection#

You can specify exactly which samples to include by passing samples either a list of sample node IDs or a (start, end) tuple that selects a range of samples by index.

# Select specific samples by node ID
# Useful when you want to track particular individuals
selected_ids = [6, 7, 20, 21, 34, 35]
viz = argscape.visualize(
    ts,
    samples=selected_ids,
    show_sample_ids=True,
    width=900,
    height=600
)
viz.display()
# Select a range of samples by index (samples 0-29)
# Useful when samples are ordered meaningfully (e.g., by population)
viz = argscape.visualize(
    ts,
    samples=(0, 30),  # First 30 samples by index
    theme="liquid",
    show_sample_ids=True,
    width=900,
    height=700
)
viz.display()
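The list form refers to node IDs, while the tuple form refers to positions in the sample list. The two coincide only when sample node IDs happen to start at 0 and run consecutively. The sketch below makes the distinction explicit. resolve_samples is a hypothetical helper, not an ARGscape function, and the exclusive end is an assumption taken from the "samples 0-29" comment above.

```python
def resolve_samples(all_sample_ids, samples):
    """Turn a list of node IDs or a (start, end) index tuple into node IDs."""
    all_ids = list(all_sample_ids)
    if isinstance(samples, tuple):
        if len(samples) != 2:
            raise ValueError("a tuple must be (start, end)")
        start, end = samples
        if not 0 <= start < end <= len(all_ids):
            raise ValueError(f"index range {samples} is out of bounds")
        # Indices address positions in the sample list, end exclusive.
        return all_ids[start:end]
    # Anything else is treated as an explicit collection of node IDs.
    known = set(all_ids)
    unknown = [s for s in samples if s not in known]
    if unknown:
        raise ValueError(f"not sample nodes: {unknown}")
    return list(samples)


# Toy sample list whose node IDs do not start at 0
# (with tskit, this would be list(ts.samples())).
toy_ids = [10, 11, 12, 13, 14]
by_index = resolve_samples(toy_ids, (1, 3))
by_id = resolve_samples(toy_ids, [10, 14])
print(by_index)
print(by_id)
```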

Genomic Range Filtering#

Filter to a specific genomic region using genomic_range=(start, end).

# First 5kb only
viz = argscape.visualize(
    ts,
    genomic_range=(0, 5_000),
    max_samples=25,
    theme="liquid",
    width=900,
    height=400
)
viz.display()

To verify the filter, click the bar at the bottom of the figure to view the currently selected genomic range. You can also enable edge labels to see the span covered by each edge, though the labels can clutter a dense graph:

# First 5kb only, with edge labels showing each edge's span
viz = argscape.visualize(
    ts,
    genomic_range=(0, 5_000),
    max_samples=25,
    theme="liquid",
    width=1000,
    height=700,
    show_edge_labels=True
)
viz.display()
# Middle region: 3kb-7kb
viz = argscape.visualize(
    ts,
    genomic_range=(3_000, 7_000),
    max_samples=25,
    theme="liquid",
    width=900,
    height=400
)
viz.display()
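Conceptually, a genomic range filter keeps the edges whose span overlaps the window and clips them to it. The sketch below illustrates that idea on toy data. The Edge tuple and clip_edges are hypothetical names rather than ARGscape internals, and the window is assumed to be half-open, [start, end), like edge intervals in tskit.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    left: float   # inclusive start of the edge's span, in bp
    right: float  # exclusive end of the edge's span, in bp
    parent: int
    child: int


def clip_edges(edges, start, end):
    """Keep edges overlapping [start, end) and trim them to that window."""
    if not start < end:
        raise ValueError("start must be less than end")
    clipped = []
    for edge in edges:
        left = max(edge.left, start)
        right = min(edge.right, end)
        if left < right:  # an empty intersection means no overlap
            clipped.append(edge._replace(left=left, right=right))
    return clipped


# Toy edges on a 10 kb sequence.
edges = [
    Edge(0, 4_000, parent=5, child=0),
    Edge(4_000, 10_000, parent=6, child=0),
    Edge(6_000, 10_000, parent=7, child=1),
]
middle = clip_edges(edges, 3_000, 7_000)
for edge in middle:
    print(edge)
```

To trim the tree sequence itself before plotting, tskit's TreeSequence.keep_intervals performs a related operation on real data.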

Temporal Filtering#

Filter nodes by time using temporal_range=(min_time, max_time). Time 0 is the present, and larger values lie further in the past (measured in generations in this simulation).

# Get time range in the tree sequence
max_time = ts.tables.nodes.time.max()  # oldest node time, read from the node table
print(f"Time range: 0 to {max_time:.0f} generations")
Time range: 0 to 106741 generations
# Recent history only (most recent 25% of the time span)
viz = argscape.visualize(
    ts,
    temporal_range=(0, max_time * 0.25),
    max_samples=30,
    show_root_ids=True,
    width=900,
    height=500
)
viz.display()
# Deeper history (most recent 80% of the time span)
viz = argscape.visualize(
    ts,
    temporal_range=(0, max_time * 0.8),
    max_samples=30,
    show_root_ids=True,
    width=900,
    height=700
)
viz.display()
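A fraction of max_time is a fraction of the time span, not of the nodes. Coalescence times are typically skewed toward the present, so a window covering a quarter of the time depth can still retain most nodes. The sketch below counts the nodes inside a window. The node times are made-up illustrative values, count_in_window is a hypothetical helper, and treating both bounds as inclusive is an assumption about how temporal_range behaves.

```python
def count_in_window(node_times, min_time, max_time):
    """Count node times that fall inside [min_time, max_time]."""
    if min_time > max_time:
        raise ValueError("min_time must not exceed max_time")
    return sum(1 for t in node_times if min_time <= t <= max_time)


# Illustrative node times in generations: four samples at time 0 plus
# six ancestors, most of them recent and one very old.
# With tskit, the real values are in ts.tables.nodes.time.
node_times = [0.0, 0.0, 0.0, 0.0, 120.0, 450.0, 900.0, 2_500.0, 8_000.0, 40_000.0]
oldest = max(node_times)

for fraction in (0.25, 0.8, 1.0):
    cutoff = oldest * fraction
    kept = count_in_window(node_times, 0, cutoff)
    print(f"up to {fraction:.0%} of the time span: {kept} of {len(node_times)} nodes")
```

Checking the retained node count this way, before rendering, is a cheap way to choose a cutoff that actually simplifies the figure.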

Combining Filters#

Multiple filters can be combined to focus on specific regions of the data.

# Focused view: specific region, recent time, subset of samples
viz = argscape.visualize(
    ts,
    genomic_range=(2_000, 8_000),
    temporal_range=(0, max_time * 0.4),
    max_samples=20,
    subset_mode="even",
    theme="paper",
    show_mutations=True,
    width=900,
    height=500
)
viz.display()
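When several figures should share the same focus, it can help to build the filter keywords once and unpack them into each call. The sketch below is one way to do that. build_filters is a hypothetical helper, and its validation rules are assumptions rather than ARGscape's own checks. Only the keyword names genomic_range, temporal_range and max_samples come from the call above.

```python
def build_filters(genomic_range=None, temporal_range=None, max_samples=None):
    """Collect filter keywords into a dict, skipping any that are unset."""
    filters = {}
    if genomic_range is not None:
        start, end = genomic_range
        if not 0 <= start < end:
            raise ValueError("genomic_range must satisfy 0 <= start < end")
        filters["genomic_range"] = (start, end)
    if temporal_range is not None:
        min_time, max_time = temporal_range
        if not 0 <= min_time <= max_time:
            raise ValueError("temporal_range must satisfy 0 <= min_time <= max_time")
        filters["temporal_range"] = (min_time, max_time)
    if max_samples is not None:
        if max_samples < 1:
            raise ValueError("max_samples must be at least 1")
        filters["max_samples"] = max_samples
    return filters


focus = build_filters(
    genomic_range=(2_000, 8_000),
    temporal_range=(0, 40_000),
    max_samples=20,
)
print(focus)

# The same dict can then be reused across figures, for example:
#   argscape.visualize(ts, **focus, theme="paper")
#   argscape.visualize(ts, **focus, theme="liquid")
```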

3D Spatial Filtering#

All filtering options work identically in 3D spatial mode. Let’s load a tree sequence with spatial locations to demonstrate.

import tskit

# Load tree sequence with spatial coordinates and populations
ts_spatial = tskit.load("../data/population_remix_simplified.trees")

print("Spatial dataset:")
print(f"  Samples: {ts_spatial.num_samples}")
print(f"  Populations: {ts_spatial.num_populations}")
print(f"  Trees: {ts_spatial.num_trees}")
Spatial dataset:
  Samples: 60
  Populations: 4
  Trees: 6
# Full 3D view with population coloring
viz = argscape.visualize(
    ts_spatial,
    mode="spatial_3d",
    theme="tskit",
    color_by_population=True,
    sample_node_size=10,
    spatial_multiplier=160,
    temporal_multiplier=7,
    height=900
)
viz.display()
# Temporal filtering in 3D - focus on recent history
max_time_spatial = ts_spatial.tables.nodes.time.max()  # oldest node time

viz = argscape.visualize(
    ts_spatial,
    mode="spatial_3d",
    theme="tskit",
    color_by_population=True,
    temporal_range=(0, max_time_spatial * 0.1),  # Recent 10%
    sample_node_size=10,
    spatial_multiplier=160,
    temporal_multiplier=12,
    width=900,
    height=700
)
viz.display()
# Sample subsetting in 3D
viz = argscape.visualize(
    ts_spatial,
    mode="spatial_3d",
    color_by_population=True,
    max_samples=30,
    subset_mode="even",
    sample_node_size=12,
    spatial_multiplier=160,
    temporal_multiplier=12,
    width=900,
    height=800
)
viz.display()

Performance Considerations#

Filtering is essential for visualizing large tree sequences:

  • Edge count: Keep below 1000 edges for smooth interactivity

  • Genomic range: Narrower regions contain fewer trees and edges

  • Temporal range: A shallower time window includes fewer nodes

See the Performance Optimization Guide for more strategies.
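One way to apply the 1000-edge guideline is to narrow the genomic window until the number of overlapping edges fits the budget. The sketch below does this on toy data by repeatedly halving the window around its midpoint. Both helpers are hypothetical and not part of ARGscape. With a real tree sequence, the (left, right) pairs could be taken from ts.edges().

```python
def count_overlapping(edge_spans, start, end):
    """Count (left, right) spans that overlap the half-open window [start, end)."""
    return sum(1 for left, right in edge_spans if left < end and right > start)


def shrink_to_budget(edge_spans, start, end, budget=1000, min_width=1.0):
    """Halve the window around its midpoint until it fits the edge budget."""
    while (
        count_overlapping(edge_spans, start, end) > budget
        and (end - start) > min_width
    ):
        quarter = (end - start) / 4
        start, end = start + quarter, end - quarter
    return start, end


# Toy data: 10,000 edges, each spanning 1 bp of a 10 kb sequence.
# With tskit: edge_spans = [(e.left, e.right) for e in ts.edges()]
edge_spans = [(i, i + 1) for i in range(10_000)]

start, end = shrink_to_budget(edge_spans, 0, 10_000, budget=1000)
n_edges = count_overlapping(edge_spans, start, end)
print(f"window {start:,.1f}-{end:,.1f} bp overlaps {n_edges} edges")
```

The resulting window could then be passed as genomic_range. Halving around the midpoint is only one choice: centring the window on a region of interest would work the same way.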

Next Steps#