GSoC 2024 - Week 3 Progress

Implementing code_component Visualization


Trial AST

This week, my focus shifted towards implementing the Trial Graph Visualization based on code_component, structured in the form of an Abstract Syntax Tree (AST). This week, my focus shifted towards implementing the Trial Graph Visualization based on code_component, structured in the form of an Abstract Syntax Tree (AST).

Initially, I began by reading the source code directly to understand the desired output format. This step was crucial for comprehending the structure and layout of the AST representation of the code components. The graph format is in DOT using Graphviz for simplicity in development.

The development process involved two main approaches:

  1. Reading from File: The first approach involved reading the entire script from a file and generating the AST. This method provided the correct output, as it accurately captured the structure and content of the code.
  2. Reading from Database: The second approach involved visualizing the AST using raw data queried directly from the database. This method aimed to integrate the visualization process with noWorkflow’s existing database structure but initially resulted in some issues with label accuracy.

Code Explanation

Database Model Definitions

The database models for CodeComponent and Composition are defined as follows:

from sqlalchemy import create_engine, Column, Integer, String, Text, ForeignKey, and_
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship

Base = declarative_base()

class CodeComponent(Base):
    __tablename__ = 'code_component'
    trial_id = Column(String, primary_key=True)
    id = Column(Integer, primary_key=True)
    name = Column(String)
    type = Column(String)
    mode = Column(String)
    first_char_line = Column(Integer)
    first_char_column = Column(Integer)
    last_char_line = Column(Integer)
    last_char_column = Column(Integer)
    container_id = Column(String)

class Composition(Base):
    __tablename__ = 'composition'
    trial_id = Column(String, primary_key=True)
    id = Column(Integer, primary_key=True)
    part_id = Column(Integer, ForeignKey('code_component.id'))
    whole_id = Column(Integer, ForeignKey('code_component.id'))
    type = Column(Text)
    position = Column(Integer)
    extra = Column(Text)
    part = relationship("CodeComponent", foreign_keys=[part_id])
    whole = relationship("CodeComponent", foreign_keys=[whole_id])

These classes define the schema for the code_component and composition tables, which store information about the code components and their compositions, respectively.

Querying the Database

The following code queries the database to retrieve the relevant CodeComponent and Composition entries:

engine = create_engine('sqlite:////content/db.sqlite')
Session = sessionmaker(bind=engine)
session = Session()

trial_id = 'b1ef84c0-b50f-4291-9b03-a961347dc9bc'

code_components = session.query(CodeComponent).filter(and_(CodeComponent.trial_id == trial_id, CodeComponent.type != "syntax")).all()
compositions = session.query(Composition).filter(and_(Composition.trial_id == trial_id, Composition.type != '*op_pos')).all()

The data will mainly be queried from the database. This involved querying code_components from the database, carefully filtering out entries where type is not equal to “syntax”. Similarly, for compositions, I filtered out entries where type is not equal to “*op_pos”, as these types do not contribute to the construction of AST nodes.

Building the AST from Database Entries

construct_node

The function below converts a CodeComponent entry into an AST node, based on its type:

def construct_node(component):
        if component.type == 'script':
            return ast.Module(name=component.name, body=[])
        elif component.type == "literal":
            return ast.Constant(value=component.name)
        elif component.type == 'function_def':
            return ast.FunctionDef(name=component.name, args=None, body=[], decorator_list=[], lineno=component.first_char_line,col_offset=component.first_char_column)
        elif component.type == 'arguments':
            args_list = component.name.split(',')
            arg_nodes = [ast.arg(arg=arg.strip(), annotation=None) for arg in args_list]
            return ast.arguments(args=arg_nodes, vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[])
        elif component.type == 'import':
            name_ = ' '.join(component.name.split()[1:])
            alias_name = name_.split('as ')[-1].strip() if ' as ' in name_ else name_
            return ast.Import(names=[ast.alias(name=component.name, asname=alias_name)])
        elif component.type == 'import_from':
            name_ = ' '.join(component.name.split()[1:])
            parts = name_.split()
            module_name = parts[1]
            alias_name = parts[3] if len(parts) > 3 else module_name
            return ast.ImportFrom(module=module_name, names=[ast.alias(name=module_name, asname=alias_name)], level=0)
        elif component.type == 'expr':
            return ast.Expr(value=None)
        elif component.type == 'assign':
            return ast.Assign(targets=[ast.Name(id=component.name.split('=')[0].strip(), ctx=ast.Store())], value=ast.Constant(value=None))
        elif component.type == "return":
            return ast.Return(value=None)
        elif component.type == "for":
            target = ast.Name(id=component.name.split()[1], ctx=ast.Store())
            iter_node = ast.Name(id=component.name.split()[3].strip(':'), ctx=ast.Load())
            body = []
            return ast.For(target=target, iter=iter_node, body=body, orelse=[])
        elif component.type == "if":
            return ast.If(test = None, body = [] ,orelse = [])
        elif component.type == "add":
            left, right = component.name.split(' + ')
            return ast.BinOp(left=ast.Name(id=left.strip(), ctx=ast.Load()), op=ast.Add(), right=ast.Name(id=right.strip(), ctx=ast.Load()))
        elif component.type == "sub":
            left, right = component.name.split(' - ')
            return ast.BinOp(left=ast.Name(id=left.strip(), ctx=ast.Load()), op=ast.Sub(), right=ast.Name(id=right.strip(), ctx=ast.Load()))
        elif component.type == "div":
            left, right = component.name.split(' / ')
            return ast.BinOp(left=ast.Name(id=left.strip(), ctx=ast.Load()), op=ast.Div(), right=ast.Name(id=right.strip(), ctx=ast.Load()))
        elif component.type == "mult":
            left, right = component.name.split(' * ')
            return ast.BinOp(left=ast.Name(id=left.strip(), ctx=ast.Load()), op=ast.Mult(), right=ast.Name(id=right.strip(), ctx=ast.Load()))
        elif component.type == "call":
            func_name = component.name.split('(')[0]
            args_str = component.name.split('(', 1)[1].rsplit(')', 1)[0]
            args = [arg.strip() for arg in args_str.split(',')]
            args_nodes = [ast.Constant(value=arg) if arg.isdigit() else ast.Name(id=arg, ctx=ast.Load()) for arg in args]
            return ast.Call(func=ast.Name(id=func_name, ctx=ast.Load()), args=args_nodes, keywords=[])
        elif component.type == "subscript":
            value_name = component.name.split('[')[0]
            slice_index = component.name.split('[')[1].rstrip(']')
            return ast.Subscript(value=ast.Name(id=value_name, ctx=ast.Load()), slice=ast.Index(value=ast.Constant(value=slice_index)), ctx=ast.Load())

Explanation:

  • Script (script): Creates an ast.Module node representing the script/module.
  • Literal (literal): Creates an ast.Constant node representing a literal value (e.g., numbers, strings).
  • Function Definition (function_def): Creates an ast.FunctionDef node representing a function definition.
  • Arguments (arguments): Creates an ast.arguments node representing arguments of a function.
  • Import (import): Creates an ast.Import node representing an import statement.
  • Import From (import_from): Creates an ast.ImportFrom node representing an import from statement.
  • Expression (expr): Creates an ast.Expr node representing an expression statement.
  • Assignment ‘assign): Creates an ast.Assign node representing an assignment statement.
  • Return (return): Creates an ast.Return node representing a return statement.
  • For Loop (for): Creates an ast.For node representing a for loop statement.
  • If Statement (if): Creates an ast.If node representing an if statement.
  • Arithmetic Operations (add, sub, div, mult): Creates ast.BinOp nodes representing addition, subtraction, division, and multiplication operations.
  • Function/Method Call (call): Creates an ast.Call node representing a function/method call.
  • Subscript Operation (subscript): Creates an ast.Subscript node representing a subscript operation.

build_ast

Now, let’s delve into the build_ast function, which constructs the entire AST by connecting nodes based on their relationships:

def build_ast(code_components, compositions):
    component_dict = {component.id: component for component in code_components}
    node_dict = {}

    for component in code_components:
        node = construct_node(component)
        if node is None:
            continue
        node_dict[component.id] = node


    for composition in compositions:
        whole_node = node_dict.get(composition.whole_id)
        part_node = node_dict.get(composition.part_id)

        if whole_node is None or part_node is None:
            continue

        if isinstance(whole_node, ast.Module):
            whole_node.body.append(part_node)
        elif isinstance(whole_node, ast.FunctionDef):
            if composition.type == '*body':
                whole_node.body.append(part_node)
            elif composition.type == 'args':
                if isinstance(part_node, ast.arguments):
                    whole_node.args = part_node
        elif isinstance(whole_node, ast.Expr):
            whole_node.value = part_node
        elif isinstance(whole_node, ast.Assign):
            if composition.type == 'value':
                whole_node.value = part_node
        elif isinstance(whole_node, ast.Return):
            whole_node.value = part_node
        elif isinstance(whole_node, ast.For) and composition.type == '*body':
            whole_node.body.append(part_node)
        elif isinstance(whole_node, ast.If) and composition.type == '*body':
            whole_node.body.append(part_node)

    return node_dict[1]

Explanation:

  • component_dict: This dictionary maps each CodeComponent entry by its id for efficient lookup during node construction.
  • node_dict: Stores AST nodes created from CodeComponent entries using the construct_node function.
  • construct_node(component): Converts each CodeComponent into an appropriate AST node based on its type.
  • Compositions: These define relationships between CodeComponent nodes, specifying how child nodes connect to parent nodes within the AST structure.

This function used for construction of the entire Abstract Syntax Tree (AST) by leveraging construct_node to create AST nodes and connect_nodes to establish relationships between them based on compositions.

Correct Output: Reading from File

The following ast_to_dot function produces the desired output when reading directly from the source file:

def ast_to_dot(node):
    """Converts AST node to DOT format for Graphviz."""
    def node_label(node):
        segment = ast.get_source_segment(code, node) if hasattr(node, 'lineno') else ''
        return f"{type(node).__name__}\n{segment}"

    def node_id(node):
        return f"node{str(id(node))}"

    dot = graphviz.Digraph()
    dot.attr(rankdir='TB', nodesep='0.75', ranksep='0.75', splines='true')
    root_id = node_id(node)

    def visit(node, parent=None):
        if not isinstance(node, ast.AST) or isinstance(node, ast.Load) or isinstance(node, ast.Store):
            return

        node_name = node_id(node)
        dot.node(node_name, node_label(node), margin='0.1,0.1', width='0.2', height='0.2', fontsize='10')

        if parent:
            dot.edge(parent, node_name, minlen='1', arrowsize='0.5')

        for child in ast.iter_child_nodes(node):
            visit(child, node_name)

    visit(node)
    return dot

This function reads the source code and uses the ast module to parse it. It then generates a DOT graph that accurately represents the structure of the code. The labels in this version include segments of the source code, providing a clear and correct visualization of the AST.

Correct AST Graph

Incorrect Output: Reading from Database

When reading data from the database, the ast_to_dot function produces the following output:

def ast_to_dot(node):
    """Converts AST node to DOT format for Graphviz."""
    def node_label(node):
        if isinstance(node, ast.Module):
            return f"Module\n{node.name}"
        elif isinstance(node, ast.FunctionDef):
            args = ', '.join(arg.arg for arg in node.args.args)
            body_content = '\n'.join(ast.dump(stmt) for stmt in node.body)
            return f"FunctionDef\ndef {node.name}({args}):\n{body_content}"
        elif isinstance(node, ast.arguments):
            return "arguments"
        elif isinstance(node, ast.arg):
            return f"arg\n{node.arg}"
        elif isinstance(node, ast.Import):
            return f"Import\n{node.names[0].name}"
        elif isinstance(node, ast.ImportFrom):
            return f"ImportFrom\n{node.names[0].name}"
        elif isinstance(node, ast.alias):
            return f"alias\n{node.name}"
        elif isinstance(node, ast.Expr):
            return f"Expr\n{ast.dump(node.value)}"
        elif isinstance(node, ast.Assign):
            return f"Assign\n{ast.dump(node.targets[0])} = {ast.dump(node.value)}"
        elif isinstance(node, ast.For):
            iter_name = ast.dump(node.iter)
            target_name = ast.dump(node.target)
            body_content = '\n'.join(ast.dump(stmt) for stmt in node.body)
            return f"For\n{target_name} in {iter_name}:\n{body_content}"
        elif isinstance(node, ast.If):
            return f"If\n{ast.dump(node.name)}"
        elif isinstance(node, ast.Return):
            return f"Return\n{node.value}"
        elif isinstance(node, ast.BinOp):
            return f"BinOp\n{type(node.op).__name__}"
        elif isinstance(node, ast.Call):
            return f"Call\n{ast.dump(node.func)}"
        elif isinstance(node, ast.Name):
            return f"Name\n{node.id}"
        elif isinstance(node, ast.Constant):
            return f"Constant\n{node.value}"
        elif isinstance(node, ast.Subscript):
            return "Subscript"
        elif isinstance(node, ast.Attribute):
            return f"Attribute\n{ast.dump(node.value)}.{node.attr}"
        return type(node).__name__

    def node_id(node):
        return f"node{str(id(node))}"

    dot = graphviz.Digraph()
    dot.attr(rankdir='TB', nodesep='0.75', ranksep='0.75', splines='true')
    root_id = node_id(node)

    def visit(node, parent=None):
        if not isinstance(node, ast.AST) or isinstance(node, ast.Load) or isinstance(node, ast.Store):
            return

        node_name = node_id(node)
        dot.node(node_name, node_label(node), margin='0.1,0.1', width='0.2', height='0.2', fontsize='10')

        if parent:
            dot.edge(parent, node_name, minlen='1', arrowsize='0.5')

        for child_name, child_node in ast.iter_fields(node):
            if isinstance(child_node, list):
                for child in child_node:
                    if isinstance(child, ast.AST):
                        visit(child, node_name)
            elif isinstance(child_node, ast.AST):
                visit(child_node, node_name)

    visit(node)
    return dot

In this version, the labels are based on the type and structure of the AST nodes, but they do not include the actual source code segments. As a result, the visualization is less accurate and does not fully represent the structure of the original code. The main issue lies in debugging the structure of the newly constructed AST to better visualize the label output of the actual code snippet. Ideally, the code snippet should contain the value of each child node, providing a clearer and more accurate representation of the source code.

Notably, the undesired labels only occurred on recursive nodes such as Assign and FunctionDef, whereas Constant, Value, and Import nodes are working quite fine.

Incorrect AST Graph

Currently, the AST captures the data sufficiently, but the visualization remains raw and unrefined. As a next step, I plan to enhance the AST by adding more Python constructs, such as classes and try-catch statements, to provide a better coverage of scripts.

Despite these difficulties, I am confident that with further debugging and refinement, the AST visualization will meet the desired outputs.


Moving forward, I plan to integrate the developed Trial Graph Visualization into noWorkFlow. Additionally, I aim to utilize the structured diagrams in vis.py to enhance the visualization output.