Skip to main content

Loading and exploring a graph

We start the tutorial by loading a pangenome graph object and exploring its properties. For this tutorial we will use the plasmids.json file that you can find in pypangraph repository under tests/data/plasmids.json

We can load the graph object with:

import pypangraph as pp

graph = pp.Pangraph.from_json("plasmids.json")
print(graph)
# pangraph object with 15 paths, 137 blocks and 1042 nodes

The components of a graph

As explained in the tutorial, a pangenome graph is composed of three main components: nodes, blocks and paths.

  • Blocks encode multiple sequence alignments that group together homologous parts of the input genomes.
  • Paths are representation of the input genomes as a sequences of blocks. More precisely, as sequence of nodes.
  • Nodes are the elements that connect blocks and paths. Each node connects a single element in a path with a single entry in a block alignment.

These are saved in three sub-properties of the graph object:

graph.blocks
graph.paths
graph.nodes

Paths

This is a graph made from 15 different plasmid sequences. Each sequence corresponds to a different path. We can get the path names with:

print(graph.paths.keys())
# ['RCS48_p1', 'RCS49_p1', 'RCS64_p2', 'RCS80_p1', ... ]

Paths are also numbered, and the connection between the name and the number can be retrieved with:

print(graph.paths.idx_to_name)
# {0: 'RCS48_p1', 1: 'RCS49_p1', 2: 'RCS64_p2', 3: 'RCS80_p1', ... }

We can recover a specific path with its identifier:

path = graph.paths["RCS48_p1"]
print(path)
# path object | name = RCS48_p1, n. nodes = 60, length = 80596 bp

And get the list of node ids that compose the path:

print(path.nodes)
# [11788816242159313242, 6289532891526049858, 9710696558260003146, ... ]

Nodes

Nodes are stored in a dataframe:

print(graph.nodes)
# node_id block_id path_id strand start end
# 11484376918084368 6227233701292645975 12 True 87911 88000
# 62802772372552842 14279814672519617104 10 True 30224 30923
# 66596091345916983 1967902255453418588 13 True 27906 28591
# ... ... ... ... ... ...
# [1042 rows x 5 columns]

In this dataframe each node, identified by its node_id, is associated with a block_id and a path_id. The strand, start and end columns indicate the orientation and position of the node in the input path genome.

Each node objects can be retrieved by its id:

node = graph.nodes[11788816242159313242]
print(node)
# block_id 14710008249239879492
# path_id 0
# strand True
# start 2358
# end 2552
# Name: 11788816242159313242, dtype: object

Blocks

Blocks encode multiple sequnce alignment. They can be accessed via their id.

block = graph.blocks[14710008249239879492]
print(block)
# block 14710008249239879492, consensus len = 183 bp, n. nodes = 4

Each block has a consensus sequence:

print(block.consensus())
# ATATATGGTGCGTTAATTTTTAAACCCTTATTTAATTTC...

And an alignment that connects nodes to sequences:

print(block.alignment.generate_alignment())
# { '4174336837421425166': 'ATATATGGTGCGTTAATTTTTAAACCCT...',
# '8533989107945450583': 'ATATATGGTGCGTTAATTTTTAAACCCT...',
# '11788816242159313242': 'ATATATGGTGCGTTAATTTTTAAACCCT...',
# '16194835320646696346': 'ATATATGGTGCGTTAATTTTTAAACCCT...'}

More details on alignments are provided in tutorial 3.