GSoC 2024 - Week 7 & 8 Progress

Graph Complete with Incomplete Structure


Week 7: Constructing the Trial Graph

In Week 7, I focused on constructing the graph for the TrialGraph. The first step was to write an SQL query to fetch the necessary data from the database:

SELECT 
    ROW_NUMBER() OVER (PARTITION BY c.whole_id ORDER BY c.id) - 1 AS index_,
    c.*, 
    cc.name, 
    cc.type
FROM code_component cc 
JOIN composition c
    ON c.part_id = cc.id
WHERE 
    c.trial_id = '0b5ff00f-74bb-49f8-8d11-e1737e0c7f84' 
    AND cc.trial_id = '0b5ff00f-74bb-49f8-8d11-e1737e0c7f84' 
    AND c.type != '*op_pos'
    AND cc.type != 'syntax';

This SQL query is designed to gather the components and their composition relationships for a specific trial. The ROW_NUMBER() function is used to assign an index to each row within the partition of the whole_id starting from 0, which helps in ordering and referencing components. This index is used later to establish relationships between components in the graph.

Next, I implemented the TrialDefinition class in Python, which is responsible for constructing the graph based on the fetched components and compositions:

class TrialDefinition:
    def __init__(self, definitions, def_dict):
        self.activations = defaultdict(lambda: defaultdict(int))
        self.children = []
        self.children_index = 0
        self.duration = defaultdict(int)
        self.full_tooltip = None
        self.name = None
        self.tooltips = defaultdict

        self.definitions = definitions
        self.def_dict = def_dict

        self.stack = []
        self.nodes = []

        self()

    def __call__(self):
        for component in self.components:
            if component.type == 'script' and not self.stack:
                last = self.insert_script(component)
                continue
            if component.type == 'return':
                pass

            node = self.construct_ast_node(component, self.def_dict)
            if node is None:
                continue
            self.node_dict[component.id] = node

        for composition in self.compositions:
            whole_node = self.node_dict.get(composition.whole_id)
            part_node = self.node_dict.get(composition.part_id)

            if whole_node is None or part_node is None:
                if composition.extra is None:
                    continue
            self.construct_ast_relationship(whole_node, part_node, composition)

        return self.node_dict[1]

    def add_edge(self, source, target, type, count=1):
        edge = TrialEdge(
            count={source.trial_id: count},
            source=source,
            target=target,
            type=type
        )

    def insert_node(self, definition, parent, label):
        if label is None:
            label = definition.name
        node = TrialNode(
            activations={definition.trial_id: [definition.id]},
            children=[],
            children_index=-1,
            duration={definition.trial_id: 1},
            full_tooltip=False,
            name=definition.name,
            tooltip={definition.trial_id: label},
            trial_ids=[definition.trial_id]
        )
        if parent is not None:
            node.parent_index = parent.index
            node.children_index = len(parent.children)
            parent.children.append(node)
        return node

    def insert_script(self, definition):
        self.root = node = self.insert_node(definition, None)
        self.add_edge(node, node, 'initial')
        self.stack.append(node)
        return self.call(definition, node)

    def insert_call(self, definition, last):
        pass

    def insert_sequence(self, definition, last):
        pass

    def insert_return(self, last):
        pass

Most of the logic defined here is adapted from the TrialAst class. The first for loop in the __call__ function is responsible for inserting nodes based on the type of edge that will be used. Nodes in the graph are represented by instances of the TrialNode class, which are inserted into the graph by linking them to their parent nodes.

The TrialNode class, which structures each node, is implemented using DotDict:

class DotDict(dict):
    """Dict that can be accessed by attributes"""
    def __getattr__(self, attr):
        if attr in self:
            return self[attr]
        return dict.__getattr__(self, attr)
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

DotDict, from the noWorkflow codebase, allows dictionary items to be accessed as attributes, simplifying the manipulation of node properties. This design makes the TrialNode class more intuitive and easier to work with, especially when constructing and managing complex graph structures.

Week 8: Enhancing the Definition Model

After discussions with my mentor, I realized that using the existing proxy models defined in the noWorkflow codebase would be more efficient than creating a new SQL query with similar functionalities. Consequently, I decided to discard the previous SQL query approach and instead implement the Model class from base.py.

Since the graph represents the script, I am using the CodeComponent model and referencing it with trial.code_components from relationship.py. This approach aligns better with the existing codebase and simplifies the implementation.

The updated model is now defined in definition.py, which focuses on creating a definition of the trial and displaying the graph accordingly.

# Copyright (c) 2016 Universidade Federal Fluminense (UFF)
# Copyright (c) 2016 Polytechnic Institute of New York University.
# This file is part of noWorkflow.
# Please, consult the license terms in the LICENSE file.
"""Definition Object"""
from __future__ import (absolute_import, print_function,
                        division, unicode_literals)

from collections import OrderedDict

from future.utils import viewkeys

from ..persistence.models.base import Model, proxy_gen
from ..persistence.models.trial import Trial
from .graphs.definition_graph import DefinitionGraph


class Definition(Model):
    """This model represents a definition of trials
    Initialize it by passing a trials id:
        definition = Definition(1)

    There are four visualization modes for the graph:
        tree: activation tree without any filters
            diff.graph.mode = 0
        no match: tree transformed into a graph by the addition of sequence and
                  return edges and removal of intermediate call edges
            diff.graph.mode = 1
        exact match: calls are only combined when all the sub-call match
            diff.graph.mode = 2
        namesapce: calls are combined without considering the sub-calls
            diff.graph.mode = 3


    You can change the graph width and height by the variables:
        diff.graph.width = 600
        diff.graph.height = 400
    """

    __modelname__ = "Definition"

    DEFAULT = {
        "graph.width": 500,
        "graph.height": 500,
        "graph.mode": 0,
        "graph.use_cache": False
    }

    REPLACE = {
        "graph_width": "graph.width",
        "graph_height": "graph.height",
        "graph_mode": "graph.mode",
        "graph_use_cache": "graph.use_cache"
    }

    def __init__(self, trial_ref, **kwargs):
        super(Definition, self).__init__(trial_ref, **kwargs)
        self.trial = Trial(trial_ref)
        self.graph = DefinitionGraph(self)
        self.initialize_default(kwargs)

    @property
    def modules(self):
        """Diff modules from trials"""
        return self.trial.modules

    @property
    def environment(self):
        """Diff environment variables"""
        return self.trial.environment_attrs

    @property
    def file_accesses(self):
        """Diff file accesses"""
        return self.trial.file_accesses

    def _ipython_display_(self):
        from IPython.display import display
        bundle = {
            'application/noworkflow.trial+json': self._modes[self.mode]()[1],
            'text/plain': 'Trial {}'.format(self.trial.id),
        }
        display(bundle, raw=True)

The Definition class is initialized by passing a trial ID and provides various properties to access different aspects of the trial data. It supports four visualization modes for the graph: tree, no match, exact match, and namespace. For our purposes, we use only the “tree” mode, as it suffices for displaying the script structure without needing to illustrate the execution flow, similar to an AST structure. Property wrappers in the class facilitate the display of relevant information such as used modules and local environment details.

Trial Inforgraphic

Next, I implemented the DefinitionGraph class, which defines the graph structure and includes methods for manipulating it:

# Copyright (c) 2016 Universidade Federal Fluminense (UFF)
# Copyright (c) 2016 Polytechnic Institute of New York University.
# This file is part of noWorkflow.
# Please, consult the license terms in the LICENSE file.
"""Trial Graph Module"""
from __future__ import (absolute_import, print_function,
                        division, unicode_literals)

import weakref
import ast

from collections import defaultdict

from future.utils import viewitems

from .structures import prepare_cache, Graph
from ...persistence.models.code_component import CodeComponent
from .trial_graph import Node

cache = prepare_cache(                                                           # pylint: disable=invalid-name
    lambda self, *args, **kwargs: "trial {}".format(self.trial.trial.id))


class DefinitionGraph(Graph):
    """Definition Graph Class
       Present definition graph on Jupyter"""

    def __init__(self, trial):
        self.trial = weakref.proxy(trial)

        self.match_id = 0
        self.use_cache = False
        self.width = 500
        self.height = 500
        self.mode = 0
        self._modes = {
            0: self.tree,
            1: self.no_match,
            2: self.exact_match,
            3: self.namespace_match
        }


    def define_node(self, preorder, relationship):
        for node in preorder:
            if node.type == 'script':
                root = last = self.insert_node(node, None)
                continue
            elif node.type != 'syntax':
                 for relation in relationship:
                    if last.node_id == relation.whole_id and node.id == relation.part_id:
                        last = self.insert_node(node, last)
                        relationship.remove(relation)
                        break
                    else:
                        match = self.calculate_match(node)
                        find = self.matches[last.parent_index].get(match)
                        if find is None:
                            last = self.insert_node(node, self.nodes[last.parent_index], match)
                            relationship.remove(relation)
                            break
        return root

    def add_edge(self, source, target, type_, count=1):
        """Add edge"""
        ids = target.trial_ids
        trial_id = 0 if len(ids) > 1 else next(iter(ids))
        self.edges[source.index][target.index][type_][trial_id] += count

    def insert_node(self, node_, parent, match=None):
        """Create node"""
        node = Node(
            index=self.index,
            name=node_.type,
            parent_index=-1,
            children_index=-1,
            children=[],
            node_id = node_.id,
            activations=defaultdict(list),
            duration=defaultdict(int),
            full_tooltip=True,
            tooltip=defaultdict(str),
            trial_ids=[],
            has_return=False,
        )
        # self.merge(node, node_)
        trial_id = node_.trial_id
        if trial_id not in node.trial_ids:
            node.trial_ids.append(trial_id)
        node.activations[trial_id].append(node_.id)
        node.duration[trial_id] += 0

        node.tooltip[trial_id] += "T{} - {}<br>Name: {}<br>Type: {}<br>".format(
            trial_id, node_.id, node_.name, node_.type
        )

        self.index += 1
        if parent is not None:
            node.parent_index = parent.index
            node.children_index = len(parent.children)
            parent.children.append(node)
            if match is not None:
                self.matches[parent.index][match] = node

        self.nodes.append(node)
        return node

    def calculate_match(self, node):
        """No match"""
        self.match_id += 1
        return self.match_id

    def result(self, components, compositions):
        """Get summarization graph result"""
        # return self.trial.finished, summarization.graph(
        #     {self.trial.id: 0}, self.width, self.height
        # ), summarization.nodes
        self.index = 0
        self.nodes = []
        self.matches = defaultdict(dict)
        self.edges = defaultdict(
            lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        )
        root = self.define_node(components, compositions)
        stack = [root]
        edges = []
        while stack:
            current = stack.pop()
            for index, child in enumerate(current.children):
                self.add_edge(current, child, 'call', index)
                stack.append(child)
        for source_nid, targets in viewitems(self.edges):
            for target_nid, types in viewitems(targets):
                for type_, count in viewitems(types):
                    edges.append({
                        'count': count,
                        'source': source_nid,
                        'target': target_nid,
                        'type': type_,
                    })
        min_duration = 0
        max_duration = 1
        trials = set()
        for node in self.nodes:
            for trial_id in node.trial_ids:
                trials.add(trial_id)
        tlist = list(trials)
        if not tlist:
            tlist.append(0)
        graph = {
            'root': root,
            'edges': edges,
            'min_duration': {self.trial.trial.id: min_duration},
            'max_duration': {self.trial.trial.id: max_duration},
            'colors': {self.trial.trial.id: 0},
            'trial1': tlist[0],
            'trial2': tlist[-1],
            'width': self.width,
            'height': self.height,
        }
        finished = self.trial.trial.finished
        return finished, graph, self.nodes

    @cache("tree")
    def tree(self):
        """Convert tree structure into dict tree structure"""
        return self.result(self.trial.trial.code_components, CodeComponent.compositions(self.trial.trial.id))

    @cache("no_match")
    def no_match(self):
        """Convert tree structure into dict graph without node matchings"""
        # ToDo if needed
        pass

    @cache("exact_match")
    def exact_match(self):
        """Convert tree structure into dict graph and match equal calls"""
        # ToDo if needed
        pass

    @cache("namespace_match")
    def namespace_match(self):
        """Convert tree structure into dict graph and match namespaces"""
        # ToDo if needed
        pass

    def _ipython_display_(self):
        from IPython.display import display
        bundle = {
            'application/noworkflow.trial+json': self._modes[self.mode]()[1],
            'text/plain': 'Trial {}'.format(self.trial.trial.id),
        }
        display(bundle, raw=True)

The DefinitionGraph class, defined in definition_graph.py, extends the Graph class and used for visualizing the structure of trial definitions. It utilizes a caching mechanism to optimize performance when processing large graphs. This class outlines how nodes and edges are managed and visualized:

  • Node Definition: Nodes in the graph represent various components of the trial. The insert_node method creates these nodes and manages their attributes. This method also handles the graphical representation of nodes, including tooltips and activation details.
  • Edge Management: Edges are established between nodes to illustrate relationships between each nodes. The add_edge method ensures that these relationships are accurately reflected in the graph.
  • Graph Construction: The define_node method builds the graph structure by linking nodes according to their relationships. It uses a preorder traversal approach to insert nodes and manage their hierarchical structure.
  • Visualization: The result method prepares the data for visualization, converting the graph structure into a format suitable for rendering. This includes managing node and edge details, trial IDs, and other attributes necessary for accurate representation.

The views.py file now includes a route for serving the definition graph as JSON. This endpoint allows users to retrieve the graph data in a format suitable for visualization or further processing.

@app.route("/experiments/<expCode>/definition/<tid>/<graph_mode>/<cache>.json")
@app.route("/definition/<tid>/<graph_mode>/<cache>.json")
def definition_graph(tid, graph_mode, cache,expCode=None):
    """Respond definition graph as JSON"""
    definition = Definition(tid)
    graph = definition.graph
    graph.use_cache &= bool(int(cache))
    _, tgraph, _ = getattr(graph, graph_mode)()
    return jsonify(**tgraph)

Some Root Problem

Incorrect Graph

As shown in the image above, the graph is indeed created, but there are some minor issues with its structure. Instead of child nodes referencing their parent nodes at the same level, the nodes continue to reference their parent nodes, leading to a recursive tree structure that becomes progressively deeper.

This problem can be resolved by implementing a stack mechanism to keep track of parent nodes during node insertion. This way, when a child node needs to reference a parent node, it will correctly point to the direct parent instead of continuing the recursive referencing.


I have write an issue regarding the hover mechanism in the noworkflow interface, which currently fails to display the correct information Tooltip Issue