UNT User Guide

Overview

Feature Description

Big data engines like Spark, Hive, and Flink offer a limited set of functions, which often fall short of meeting customers' requirements. To bridge this gap, customers develop their own user-defined functions (UDFs) tailored to their specific needs.

For example, Flink DataStream users want not only the compatibility and flexibility of the open source standard, but also better performance and lower operating costs. Existing optimization efforts built on the open source Flink Java project have reached their performance limits. Breaking through these bottlenecks requires a native execution engine that pushes system efficiency further without compromising Flink compatibility. Spark has made notable strides in this kind of optimization, but Flink still lags behind. Stream processing with Flink is widely used, and more than 80% of service scenarios depend on UDFs. To meet this demand, we provide a system for automated UDF native compilation tailored for the native Flink engine. The system automatically translates user-loaded UDF bytecode into native binaries, which replace the original UDFs for execution in Flink and other big data engines.

UNT Features

  • It automatically translates the bytecode of service JAR packages into Intermediate Representation (IR) code, then extracts UDF code based on the rules of big data engines, and parses external dependencies of UDFs.
  • It optimizes UDF IR through automatic memory object management and hardware affinity acceleration.
    • Establishes lifecycle management rules for UDF objects, and automatically inserts code for memory object referencing and release based on the rules.
    • Establishes a pool for basic class objects, and automatically replaces memory allocation interfaces for basic classes with interfaces for obtaining objects from the pool.
    • Automatically matches the affinity libraries to hardware of the execution environment and automatically replaces calls in IR with calls to these hardware affinity libraries.
  • It automatically translates Java UDFs into C++ UDFs: the original Java programs are translated into C++ code, which is then compiled.
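
The object pool mentioned in the feature list can be pictured with a small sketch. This is not UNT's implementation: `SimplePool`, `acquire`, `release`, and `allocations` are hypothetical names chosen for illustration. It only shows the general idea of replacing a direct allocation with a call that obtains a recycled object from a pool.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical illustration of an object pool; not part of the UNT code base.
template <typename T>
class SimplePool {
public:
    // Hand out a recycled object if one is idle, otherwise allocate a new one.
    T *acquire()
    {
        if (free_.empty()) {
            ++allocations_;
            return new T();
        }
        T *obj = free_.back().release();
        free_.pop_back();
        return obj;
    }

    // Return an object to the pool instead of deleting it.
    void release(T *obj)
    {
        assert(obj != nullptr);
        free_.emplace_back(obj);
    }

    // Number of real heap allocations performed so far.
    std::size_t allocations() const { return allocations_; }

private:
    std::vector<std::unique_ptr<T>> free_;  // idle objects owned by the pool
    std::size_t allocations_ = 0;
};
```

In this sketch a recycled object keeps whatever state it had when it was released, so a real pool would also need to reset objects before handing them out again.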

Constraints

Before configuring this feature, you need to understand the following constraints on UNT.

  1. Overall specifications

    Function type constraints: Supports native translation for UDFs of types on a function trustlist.

    Member, method, and syntax constraints:

    text
    - Supports native translation for Java types on a type trustlist.
    - Supports native translation for Java statements on a statement trustlist.
    - Supports native translation for Java keywords on a keyword trustlist.
    - The name of a UDF member method cannot be the same as that of a basic library member method, for example, getRefCount and putRefCount.

    Member object constraints: For UDF native translation to succeed, the runtime type of a UDF member object must be the same as a statically defined type. Otherwise, the native translation will fail, and the UDF will revert to its original Java execution.

    Data transfer object constraints: Data objects in the data transfer object trustlist can be transferred across tasks.

    Packaging constraints: The input JAR package for native translation must be a fat JAR containing all dependencies.

    Return value constraints: Parent-child classes with the same interface must have identical return value attributes for that interface (a return value attribute of 0 indicates a void return, a basic type, a collection item, or a class object field, while an attribute of 1 represents all other cases).

    Automatic memory release constraints:

    text
    - Currently, local variables in user-defined functions cannot be used across loops.
    - Circular dependencies are not supported for user-defined classes.
  2. Function type trustlist

    Supported function types:

    • FlatMapFunction
    • KeySelector
    • MapFunction
    • ReduceFunction
    • RichFilterFunction
    • RichFlatMapFunction
  3. Java keyword translation trustlist

    Supported Java keywords:

    • abstract
    • boolean
    • break
    • byte
    • case
    • char
    • class
    • continue
    • default
    • do
    • while
    • double
    • if
    • else
    • for
    • extends
    • float
    • final
    • int
    • implements
    • import
    • interface
    • instanceof
    • long
    • new
    • package
    • private
    • protected
    • public
    • return
    • short
    • static
    • switch
    • this
    • void
    • volatile
  4. Java type translation trustlist

    Supported Java types:

    • boolean
    • byte
    • char
    • short
    • int
    • long
    • double
    • float
    • Array
    • null
    • void
    • Class
  5. Java statement translation trustlist

    • InvokeStmt (function/method calling statement, excluding dynamicinvoke)

    Example:

    java
    public class DemoClass{
        public void print(int x){
            int a = increment(x);
            System.out.println(a);
            a = increment(x);
            System.out.println(a);
        }
        public int increment(int x){
            return x+1;
        }
    }
    • IdentityStmt (assignment of this member)

    Example:

    java
    public class DemoClass{
        private int counter;
        public DemoClass(int counter){
            this.counter = counter;
        }
    }
    • AssignStmt (assignment statement)

    Example:

    java
    public class DemoClass{
        private int counter = 0;
        public int updateCounter(){
            counter = counter + 1;
            return counter;
        }
    }
    • IfStmt (if statement)

    Example:

    java
    public class DemoClass{
        public static void sampleMethod(int x){
            if(x % 2 == 0){
                System.out.println("Even");
            }else{
                System.out.println("Odd");
            }
        }
    }
    • Switch (switch statement)

    Example:

    java
    public class DemoClass{
        public void switchExample(int x){
            switch(x){
                case 1:
                    System.out.println("Input1");
                    break;
                case 2:
                    System.out.println("Input2");
                    break;
                default:
                    System.out.println("Input more than 2");
                    break;
            }
        }
    }
    • ReturnStmt (return statement)

    Example:

    java
    public class DemoClass{
        public int increment(int x){
            return x + 1;
        }
    }
    • GotoStmt (goto statement)

    Example:

    java
    public class DemoClass{
        public static void sampleMethod(){
            for(int i = 0; i < 5; i++){
                if(i == 3){
                    break;
                }
            }
        }
    }

Installation and Deployment

Software Requirements

  • jdk1.8
  • python3
  • maven3.6.3

Hardware Requirements

  • AArch64 architecture
  • x86_64 architecture

Software Installation

The UNT is installed and deployed using RPM.

shell
rpm -ivh UNT-1.0-5.oe2403sp2.noarch.rpm

After UNT is installed, the udf-translator folder is generated in the /opt/udf-trans-opt directory. The udf-translator folder is the working directory of UNT.

text
#Directory structure
bin: directory storing the execution script
conf: directory storing configuration files
lib: directory storing dependencies
cpp: directory storing generated cpp source files, where different subdirectories map to different JAR packages
log: directory storing logs generated during translation
output: directory storing the .so file generated after the compilation, where different subdirectories map to different JAR packages

The native_udf.py file for viewing translation information is generated in /usr/bin.

How to Use

The UNT translation depends on the configuration files.

Configuration files:

text
conf/depend_class.properties: mapping between Java and native class names
conf/depend_include.properties: header file path
conf/depend_interface.config: dependency interface

The conf directory is relative to the base directory. For details about the base directory configuration, see step 4 "Modify the UNT user configuration."

Procedure:

  1. Scan for missing interfaces.

    Run native_udf.py depend_info ${job_jar} to scan for missing interfaces.

    Example of the scanning result:

    text
    java.lang.String
    Methods:
        int length()

    The example indicates that the length() function interface is absent from String.

  2. Implement the missing interfaces.

    Native-side interfaces can be implemented according to the base library coding specifications. For example, you can declare the length() interface in String as follows and implement the interface.

    cpp
    // String header file.
    class String : public Object 
    {
    public:
        int32_t length() const;
    private:
        std::string inner;
    };
    cpp
    // String cpp file
    int32_t String::length() const
    {
        return static_cast<int32_t>(inner.size());
    }

    Once implemented, compile this code into a static library. The file name must be libbasictypes.a.

  3. Add interface configuration files.

    After the interface is implemented, you need to add the corresponding interface configuration file.

    • Add the mapping between Java and native class names to the depend_class.properties file.
    text
    java.lang.String=String

    The key indicates the String class in Java, and the value indicates the native class.

    • Add the header file path configuration to the depend_include.properties file.
    text
    java.lang.String=basictypes/String.h

    The key indicates the String class in Java, and the value indicates the relative path of the header file on the native side. For details about the base directory configuration, see step 4 "Modify the UNT user configuration."

    • Add the dependency interface configuration to the depend_interface.config file.
    text
    <java.lang.String: int length()>, 0

    The first element of the configuration is the signature of a Java function, and the second element is the memory semantics value. For details about memory semantics, see the memory semantics specifications in the native specifications.
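
To make the layout of a depend_interface.config entry concrete, here is a minimal sketch of how one line could be split into its two elements. `InterfaceEntry` and `parseInterfaceLine` are hypothetical names, not UNT APIs, and the sketch assumes the memory semantics value is a single digit, 0 or 1, as described in the native specifications.

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for one depend_interface.config entry; not a UNT type.
struct InterfaceEntry {
    std::string signature;  // e.g. "java.lang.String: int length()"
    int memorySemantics;    // 0 or 1
};

// Parse a line such as "<java.lang.String: int length()>, 0".
// Returns false if the line does not have that shape.
bool parseInterfaceLine(const std::string &line, InterfaceEntry &out)
{
    const std::size_t open = line.find('<');
    const std::size_t close = line.rfind('>');
    if (open == std::string::npos || close == std::string::npos || close <= open) {
        return false;
    }
    const std::size_t comma = line.find(',', close);
    if (comma == std::string::npos) {
        return false;
    }
    // The value must be exactly one non-blank character after the comma.
    const std::size_t valuePos = line.find_first_not_of(" \t", comma + 1);
    if (valuePos == std::string::npos) {
        return false;
    }
    if (valuePos != line.find_last_not_of(" \t\r\n")) {
        return false;
    }
    const char flag = line[valuePos];
    if (flag != '0' && flag != '1') {
        return false;
    }
    out.signature = line.substr(open + 1, close - open - 1);
    out.memorySemantics = flag - '0';
    return true;
}
```

Using the last `>` on the line as the end of the signature keeps the sketch working for signatures that contain angle brackets themselves.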

  4. Modify the UNT user configuration.

    By default, the configuration file is /opt/udf-trans-opt/udf-translator/conf/udf_tune.properties.

    The content of the configuration file is as follows:

    text
    basic_lib_path=/opt/udf-trans-opt/libbasictypes
    tune_level=0
    regex_lib_type=1
    regex_lib_path=/usr/local/ksl/lib/libKHSEL_ops.a
    compile_option=

    basic_lib_path is the base directory of the basic library and includes three subdirectories.

    • conf is the base directory for storing configuration files.
    • include is the base directory of the header files and stores the native header files implemented in step 2.
    • lib stores the .a static dependency files compiled by users.

    tune_level is used to configure the optimization levels. The optimization levels are described as follows:

    • level:0 indicates basic optimization, that is, automatic memory release. All further improvements are built upon this foundation.
    • level:1 indicates hardware acceleration, which is implemented by reading hardware acceleration library interface settings and integrating with this interface.
    • level:2 indicates memory allocation and deallocation acceleration.
    • level:4 indicates AI for Compiler Kit (AI4C) acceleration.

    Among the preceding optimizations, basic optimization is combined with all other optimizations by default. For the remaining optimizations, their respective level values are additive, representing the superposition of multiple optimization techniques (each level's value must be a power of 2).
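
One way to read the additive level values is as bit flags. The sketch below is an interpretation of the description above, not UNT's own parsing code; `TuneFlag` and `describeTuneLevel` are hypothetical names. Under this reading, tune_level=3 would combine hardware acceleration (1) with memory allocation and deallocation acceleration (2), on top of the basic optimization that is always applied.

```cpp
#include <string>
#include <vector>

// Level values come from the list above; the names are hypothetical.
enum TuneFlag : unsigned {
    kHardwareAcceleration = 1,
    kMemoryAcceleration = 2,
    kAi4cAcceleration = 4,
};

// List the optimizations that a given tune_level value would enable,
// assuming the values combine as bit flags.
std::vector<std::string> describeTuneLevel(unsigned tuneLevel)
{
    std::vector<std::string> enabled;
    // Basic optimization is combined with all other optimizations by default.
    enabled.push_back("basic optimization (automatic memory release)");
    if (tuneLevel & kHardwareAcceleration) {
        enabled.push_back("hardware acceleration");
    }
    if (tuneLevel & kMemoryAcceleration) {
        enabled.push_back("memory allocation and deallocation acceleration");
    }
    if (tuneLevel & kAi4cAcceleration) {
        enabled.push_back("AI4C acceleration");
    }
    return enabled;
}
```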

    regex_lib_type is used to configure whether to optimize the regular expression library.

    Set this parameter to 1 to enable regular expression library optimization, or set it to 0 to disable this optimization. The settings take effect only when tune_level is set to 1.

    regex_lib_path specifies the linking path for the regular expression library.

    Set this parameter to the path of the regular expression library used for optimization.

    compile_option is used to customize compilation options. You need to verify the settings for accuracy.

  5. Generate native source files and binary files using the translation command.

    text
    bash /opt/udf-trans-opt/udf-translator/bin/udf_translate.sh {JAR package path} flink

    After the command is executed, the source file is generated in the cpp directory, the SO file is generated in the output directory, and the log is generated in the log directory.

  6. View translation information.

    • native_udf.py source_info $

    You can specify a job_jar to view the location of the native source file.

    • native_udf.py list $

    You can specify a job_jar to view IDs of successfully native-compiled UDFs and their binary file information.

    The binary file is generated in the output directory under the UNT installation path. The hash value of the subdirectory can be obtained using source_info.

    • native_udf.py depend_info ${job_jar}

    You can specify a job_jar to view the interface information of the dependency libraries.

    Currently, lambda expressions cannot be scanned.

    • native_udf.py fail_info $

    You can specify a job_jar to view the native compilation failure cause, for example, the absence of the dependency interfaces.

    • native_udf.py tune_level $

    You can specify the UDF native optimization level.

Native Specifications

UNT helps create native code. Follow these specifications so that hand-written native code can be integrated into the translation automatically.

  1. Memory semantics specifications

    Release all objects except the output one. In addition, configure the memory semantics of each interface in depend_interface.config: if the returned object is newly created, set the value after the method signature to 1; otherwise, set it to 0.

    Example 1:

    cpp
    int32_t String::length() const
    {
        return static_cast<int32_t>(inner.size());
    }

    The length method returns a basic type, and no new object is involved. Therefore, set this parameter to 0.

    text
    <java.lang.String: int length()>, 0

    Example 2:

    cpp
    String *String::substring(const int32_t idx) const
    {
        std::string s = this->inner.substr(idx);
        return new String(std::move(s));
    }

    The substring method returns the String type, and the returned object is newly created. Therefore, set this parameter to 1.

    text
    <java.lang.String: java.lang.String substring(int)>, 1
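
The two semantics values can be seen from the caller's side in the following self-contained sketch. It rests on assumptions: `Object` here is a stripped-down stand-in that models only reference counting, `liveCount` and `tailLength` are hypothetical names added for illustration, and the idea that the caller releases a newly created return value with putRefCount() is an interpretation of the rule above rather than a statement about the code UNT generates.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Stripped-down stand-in for the base library Object; models reference counting only.
class Object {
public:
    virtual ~Object() = default;
    void getRefCount() { ++refCount; }
    void putRefCount()
    {
        if (--refCount == 0) {
            delete this;
        }
    }
    uint32_t refCount = 1;
};

class String : public Object {
public:
    explicit String(std::string s) : inner(std::move(s)) { ++liveCount; }
    ~String() override { --liveCount; }

    // Memory semantics 0: returns a basic type, no new object is created.
    int32_t length() const { return static_cast<int32_t>(inner.size()); }

    // Memory semantics 1: the returned object is newly created.
    String *substring(const int32_t idx) const { return new String(inner.substr(idx)); }

    static int liveCount;  // number of live String objects, for illustration only

private:
    std::string inner;
};

int String::liveCount = 0;

// Hypothetical caller: uses a temporary substring and releases it afterwards.
int32_t tailLength(const String &s, int32_t idx)
{
    String *tail = s.substring(idx);  // new object with refCount == 1
    const int32_t n = tail->length(); // basic type, nothing to release
    tail->putRefCount();              // refCount reaches 0, object is deleted
    return n;
}
```

If substring were wrongly configured with 0, nothing in this picture would release the temporary object, which is why the value in depend_interface.config has to match what the native method actually does.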
  2. Inheritance specifications

    All classes must be inherited from Object.

    The Object class is described as follows:

    Object header file:

    cpp
    class Object 
    {
    public:
        Object();
        
        Object(nlohmann::json jsonObj);
        
        virtual ~Object();
        
        virtual int hashCode();
        
        virtual bool equals(Object *obj);
        
        virtual std::string toString();
        
        virtual Object *clone();
        
        Object(const Object &obj);
        
        Object(Object &&obj);
        
        Object &operator=(const Object &obj);
        
        Object &operator=(Object &&obj);
        
        void putRefCount();
        
        void getRefCount();
        
        void setRefCount(uint32_t count);
        
        bool isCloned();
        
        uint32_t getRefCountNumber();
        
    public:
        std::recursive_mutex mutex;
        bool isClone = false;
        bool isPool = false;
        uint32_t refCount = 1;
    };

    Object cpp file:

    cpp
    Object::Object() = default;
    
    Object::Object(nlohmann::json jsonObj)
    {
        return;
    }
    
    Object::~Object() = default;
    
    int Object::hashCode()
    {
        return 0;
    }
    
    bool Object::equals(Object *obj)
    {
        return false;
    }
    
    std::string Object::toString()
    {
        return std::string();
    }
    
    Object * Object::clone()
    {
        return nullptr;
    }
    
    Object::Object(const Object &obj)
    {
        this->refCount = obj.refCount;
        this->isClone = obj.isClone;
    }
    
    Object::Object(Object &&obj)
    {
        this->refCount = obj.refCount;
        this->isClone = obj.isClone;
    }
    
    Object &Object::operator=(const Object &obj)
    {
        this->refCount = obj.refCount;
        this->isClone = obj.isClone;
        return *this;
    }
    
    Object &Object::operator=(Object &&obj)
    {
        this->refCount = obj.refCount;
        this->isClone = obj.isClone;
        return *this;
    }
    
    void Object::putRefCount()
    {
        if (__builtin_expect(--refCount != 0, true))
        {
            return;
        }
        delete this;
    }
    
    void Object::getRefCount()
    {
        ++refCount;
    }
    
    void Object::setRefCount(uint32_t count)
    {
        refCount = count;
    }
    
    bool Object::isCloned()
    {
        return isClone;
    }
    
    uint32_t Object::getRefCountNumber()
    {
        return refCount;
    }
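
As an illustration of the inheritance rule, the sketch below derives a user class from a trimmed stand-in for Object and overrides the virtual methods. `Counter` is a hypothetical class, and the stand-in omits the JSON constructor, cloning, and the mutex from the listing above so that the example stays self-contained.

```cpp
#include <cstdint>
#include <string>

// Trimmed stand-in for the Object class listed above.
class Object {
public:
    Object() = default;
    virtual ~Object() = default;
    virtual int hashCode() { return 0; }
    virtual bool equals(Object *) { return false; }
    virtual std::string toString() { return std::string(); }
    void getRefCount() { ++refCount; }
    void putRefCount()
    {
        if (--refCount == 0) {
            delete this;
        }
    }
    uint32_t getRefCountNumber() { return refCount; }

private:
    uint32_t refCount = 1;
};

// Hypothetical user class that follows the inheritance rule.
class Counter : public Object {
public:
    explicit Counter(int value) : value(value) {}

    int hashCode() override { return value; }

    // Two counters are equal when they hold the same value.
    bool equals(Object *obj) override
    {
        auto *other = dynamic_cast<Counter *>(obj);
        return other != nullptr && other->value == value;
    }

    std::string toString() override { return "Counter(" + std::to_string(value) + ")"; }

private:
    int value;
};
```

Each additional holder of a `Counter` would call getRefCount() when it takes a reference and putRefCount() when it is done; the object is deleted when the last reference is released.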