jQAssistant Language Concept Extractor Architecture
The Language Concept Extractor (LCE) architecture for jQAssistant provides a generic framework for building native tools that scan source code written in arbitrary programming languages and extract relevant language concepts from it. Such a tool then consolidates the extracted information into an easy-to-process JSON format that is consumed by a jQAssistant (jQA) plugin.
Key Goals
- extensibility: easily implement the detection and extraction of new language concepts
- maintainability: the implementation should be easily adaptable to changes in the programming language
- up-to-date: the APIs and libraries used need to closely follow the release cycles of the analyzed programming language to allow fast adoption of new syntax constructs, etc.
Solution
Core Idea:
- split the scanning process of source code into two parts:
  - processing of the AST using a tool implemented natively in the analyzed programming language, to easily extract/consolidate relevant information
  - graph generation from the consolidated information of the first part, using standard jQA scanner mechanisms
- usage of JSON as an intermediary format, as it can easily be processed on most platforms
Basic Overall Process:
flowchart LR
source[(Source Code)]
json[[JSON Representation]]
neo4j[(Neo4j Graph)]
source-->|LCE Tool|json
json-->|jQA Plugin|neo4j
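The two-stage split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual LCE tool or jQA plugin: it takes Python as the analyzed language, uses the standard-library `ast` and `json` modules as the "native" tooling, and replaces the Neo4j graph with plain lists of nodes and relationships. The function names `extract_to_json` and `build_graph`, the `DECLARES` relationship, and the JSON layout are all hypothetical.

```python
import ast
import json


def extract_to_json(source: str, file_name: str) -> str:
    """Stage 1 (stand-in for an LCE tool): parse natively, emit consolidated JSON."""
    tree = ast.parse(source)
    concepts = {
        "file": file_name,
        # Only top-level declarations are collected in this sketch.
        "functions": [n.name for n in tree.body if isinstance(n, ast.FunctionDef)],
        "classes": [n.name for n in tree.body if isinstance(n, ast.ClassDef)],
    }
    return json.dumps(concepts)


def build_graph(document: str):
    """Stage 2 (stand-in for a jQA plugin): turn the JSON into nodes and relationships."""
    data = json.loads(document)
    nodes = [{"label": "File", "name": data["file"]}]
    relationships = []
    for label, key in (("Function", "functions"), ("Class", "classes")):
        for name in data[key]:
            nodes.append({"label": label, "name": name})
            # Hypothetical relationship type, chosen only for illustration.
            relationships.append((data["file"], "DECLARES", name))
    return nodes, relationships


SAMPLE = "class Greeter:\n    pass\n\ndef greet():\n    return 'hi'\n"
nodes, relationships = build_graph(extract_to_json(SAMPLE, "greeter.py"))
```

The second stage never touches the source code or the AST. It depends only on the JSON document, which is what allows the two halves to run on different platforms.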
Language Concept Extraction Process (performed by the LCE Tool):
The Extractor API orchestrates the extraction process to obtain project objects, which are then exported to a JSON file. The orchestration encompasses the following steps:
- native tools and APIs are used to get an enriched, structured view of the source code in the form of ASTs and other data structures
- traversers traverse the ASTs of all source files and execute different processors to extract information on a file-by-file basis
- whether a processor is executed for a given AST node is decided by its execution condition
- the extracted information is stored in language concept objects that are organized in concept maps
- during the traversal/processing of the AST, a processing context is maintained that can be used to access and/or share all information necessary for processing
- all available traversers/processors are dynamically registered in central feature collections (which enables extensions)
- metadata assignment rules can be used to enrich language concept objects with additional information that can in turn be used by other processors further up the tree, or by post processors
- all extracted language concepts are bundled into individual project objects
- the project objects with the extracted language concepts are re-processed by post processors on a project-wide/cross-project basis, allowing for advanced resolution algorithms
- post processors have no access to the AST data; they only work on language concept objects (which may, however, carry metadata attached by metadata assignment rules)
- the processed project objects are then exported to a JSON file
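The steps above can be sketched as a small, self-contained Python program. It is an illustration under stated assumptions rather than the LCE implementation: Python's standard `ast` module plays the role of the native API, and every name in it (`Concept`, `Context`, `PROCESSORS`, `POST_PROCESSORS`, `processor`, `post_processor`, `traverse`, `extract`) is hypothetical. Metadata assignment is reduced to a single line inside a processor.

```python
import ast
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Concept:
    """A language concept object; `metadata` holds values added during traversal."""
    kind: str
    name: str
    file: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Context:
    """Processing context shared by all processors while one file is traversed."""
    file: str
    scope: list = field(default_factory=list)  # names of enclosing classes


PROCESSORS = []       # central feature collection: (execution condition, processor)
POST_PROCESSORS = []  # central feature collection for post processors


def processor(condition):
    """Register a processor together with its execution condition."""
    def register(func):
        PROCESSORS.append((condition, func))
        return func
    return register


def post_processor(func):
    POST_PROCESSORS.append(func)
    return func


@processor(lambda node, ctx: isinstance(node, ast.ClassDef))
def class_processor(node, ctx):
    return Concept("class", node.name, ctx.file)


@processor(lambda node, ctx: isinstance(node, ast.FunctionDef))
def function_processor(node, ctx):
    concept = Concept("function", node.name, ctx.file)
    # Simplified metadata assignment: remember the nearest enclosing class.
    concept.metadata["owner"] = ctx.scope[-1] if ctx.scope else None
    return concept


def traverse(node, ctx, concept_map):
    """Traverser: walk the AST and run every processor whose condition holds."""
    for condition, run in PROCESSORS:
        if condition(node, ctx):
            concept = run(node, ctx)
            concept_map.setdefault(concept.kind, []).append(concept)
    is_class = isinstance(node, ast.ClassDef)
    if is_class:
        ctx.scope.append(node.name)
    for child in ast.iter_child_nodes(node):
        traverse(child, ctx, concept_map)
    if is_class:
        ctx.scope.pop()


@post_processor
def mark_methods(project):
    """Post processor: sees only concept objects and their metadata, never the AST.

    Simplification: functions nested inside a method are also flagged.
    """
    for concept in project["concepts"].get("function", []):
        concept.metadata["is_method"] = concept.metadata["owner"] is not None


def extract(project_name, sources):
    """Extractor entry point: traverse each file, post-process, export JSON."""
    concept_map = {}
    for file_name, source in sources.items():
        traverse(ast.parse(source), Context(file_name), concept_map)
    project = {"name": project_name, "concepts": concept_map}
    for run in POST_PROCESSORS:
        run(project)
    exported = {kind: [asdict(c) for c in items] for kind, items in concept_map.items()}
    return json.dumps({"name": project_name, "concepts": exported}, indent=2)


SOURCES = {"shapes.py": "class Circle:\n    def area(self):\n        return 0\n\ndef main():\n    pass\n"}
result = json.loads(extract("demo", SOURCES))
```

Because processors and post processors register themselves in module-level collections, an extension in this sketch only needs to define a new decorated function. Neither the traverser nor the extractor entry point has to change.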
Concepts & Mechanisms
- Extractor API and Projects
- Native Tools and APIs
- Language Concept
- Traversers
- Processors
- Post Processors
- Feature Collections and Extensions