UNT User Guide
Overview
Feature Description
Big data engines like Spark, Hive, and Flink offer a limited set of functions, which often fall short of meeting customers' requirements. To bridge this gap, customers develop their own user-defined functions (UDFs) tailored to their specific needs.
For example, Flink DataStream users want not only the compatibility and flexibility of the open source standard, but also better performance and lower operating costs. For the Flink engine, existing optimization efforts built on the open source Flink Java project have reached their performance limits. Breaking through these bottlenecks requires a native execution engine that pushes system efficiency further without compromising Flink compatibility. While Spark has made notable strides in optimization, Flink still lags behind. Stream processing with Flink is widely used, and more than 80% of service scenarios depend on UDFs. To meet this demand, we provide a system for automated UDF native compilation tailored for the native Flink engine. The system automatically translates the user-loaded UDF bytecode into native binaries, which replace the original UDFs for execution in Flink and other big data engines.
UNT Features
- It automatically translates the bytecode of service JAR packages into Intermediate Representation (IR) code, extracts UDF code based on the rules of big data engines, and parses the external dependencies of UDFs.
- It optimizes UDF IR through automatic memory object management and hardware affinity acceleration:
  - It establishes lifecycle management rules for UDF objects and automatically inserts code for memory object referencing and release based on these rules.
  - It establishes a pool for basic class objects and automatically replaces memory allocation interfaces for basic classes with interfaces for obtaining objects from the pool.
  - It automatically matches affinity libraries to the hardware of the execution environment and replaces calls in IR with calls to these hardware affinity libraries.
- It automatically translates Java UDFs into C++ UDFs: the original Java programs are translated into C++ code, which is then compiled.
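The object pool mentioned in the feature list can be pictured with a small sketch. This is not code generated by UNT: `PooledString`, `StringPool`, and the free-list policy are hypothetical and chosen only for illustration. The sketch shows the general idea that a call site that would otherwise allocate with `new` obtains a recycled object from a pool instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a basic class object; not a UNT type.
struct PooledString {
    std::string value;
};

// Hypothetical free-list pool: reuses released objects when possible and
// falls back to `new` only when the free list is empty.
class StringPool {
public:
    StringPool() = default;
    StringPool(const StringPool &) = delete;
    StringPool &operator=(const StringPool &) = delete;

    ~StringPool() {
        for (PooledString *obj : freeList) {
            delete obj;
        }
    }

    // Stands in for a direct `new PooledString` at the call site.
    PooledString *acquire(std::string value) {
        PooledString *obj = nullptr;
        if (freeList.empty()) {
            obj = new PooledString();
            ++allocations;
        } else {
            obj = freeList.back();
            freeList.pop_back();
        }
        obj->value = std::move(value);
        return obj;
    }

    // Stands in for a direct `delete` at the call site.
    void release(PooledString *obj) {
        assert(obj != nullptr);
        obj->value.clear();
        freeList.push_back(obj);
    }

    // Number of objects that were really allocated with `new`.
    std::size_t allocationCount() const { return allocations; }

private:
    std::vector<PooledString *> freeList;
    std::size_t allocations = 0;
};
```

Acquiring, releasing, and acquiring again hands back the same object, so only one real allocation takes place. Objects that are still checked out when the pool is destroyed are not freed by this sketch.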
Constraints
Before configuring this feature, you need to understand the constraints on the UNT features.
Overall specifications
Function type constraints: Supports native translation for UDFs of types on the function type trustlist.

Member, method, and syntax constraints:

- Supports native translation for Java types on the type trustlist.
- Supports native translation for Java statements on the statement trustlist.
- Supports native translation for Java keywords on the keyword trustlist.
- The name of a UDF member method cannot be the same as that of a basic library member method, for example, `getRefCount` and `putRefCount`.

Member object constraints: For UDF native translation to succeed, the runtime type of a UDF member object must be the same as its statically defined type. Otherwise, the native translation fails and the UDF reverts to its original Java execution.

Data transfer object constraints: Data objects in the data transfer object trustlist can be transferred across tasks.

Packaging constraints: The input JAR package for native translation must be a fat JAR containing all dependencies.

Return value constraints: Parent-child classes with the same interface must have identical return value attributes for that interface (a return value attribute of `0` indicates a void return, a basic type, a collection item, or a class object field, while an attribute of `1` represents all other cases).

Automatic memory release constraints:

- Currently, local variables in user-defined functions cannot be used across loops.
- Circular dependencies are not supported for user-defined classes.

Function type trustlist
Supported function types:
- FlatMapFunction
- KeySelector
- MapFunction
- ReduceFunction
- RichFilterFunction
- RichFlatMapFunction
Java keyword translation trustlist
Supported Java keywords:
- abstract
- boolean
- break
- byte
- case
- char
- class
- continue
- default
- do
- while
- double
- if
- else
- for
- extends
- float
- final
- int
- implements
- import
- interface
- instanceof
- long
- new
- package
- private
- protected
- public
- return
- short
- static
- switch
- this
- void
- volatile
Java type translation trustlist
Supported Java types:
- boolean
- byte
- char
- short
- int
- long
- double
- float
- Array
- null
- void
- Class
Java statement translation trustlist
- InvokeStmt (function/method calling statement, excluding dynamicinvoke)
Example:
```java
public class DemoClass {
    public void print(int x) {
        int a = increment(x);
        System.out.println(a);
        a = increment(x);
        System.out.println(a);
    }

    public int increment(int x) {
        return x + 1;
    }
}
```

- IdentityStmt (assignment of this member)
Example:
```java
public class DemoClass {
    private int counter;

    // Constructor: no return type, so this is not an ordinary method.
    public DemoClass(int counter) {
        this.counter = counter;
    }
}
```

- AssignStmt (assignment statement)
Example:
```java
public class DemoClass {
    private int counter = 0;

    public int updateCounter() {
        counter = counter + 1;
        return counter;
    }
}
```

- IfStmt (if statement)
Example:
```java
public class DemoClass {
    public static void sampleMethod(int x) {
        if (x % 2 == 0) {
            System.out.println("Even");
        } else {
            System.out.println("Odd");
        }
    }
}
```

- Switch (switch statement)
Example:
```java
public class DemoClass {
    public void switchExample(int x) {
        switch (x) {
            case 1:
                System.out.println("Input1");
                break;
            case 2:
                System.out.println("Input2");
                break;
            default:
                System.out.println("Input more than 2");
                break;
        }
    }
}
```

- ReturnStmt (return statement)
Example:
```java
public class DemoClass {
    public int increment(int x) {
        return x + 1;
    }
}
```

- GotoStmt (goto statement)
Example:
```java
public class DemoClass {
    public static void sampleMethod() {
        for (int i = 0; i < 5; i++) {
            if (i == 3) {
                break;
            }
        }
    }
}
```
Installation and Deployment
Software Requirements
- jdk1.8
- python3
- maven3.6.3
Hardware Requirements
- AArch64 architecture
- x86_64 architecture
Software Installation
The UNT is installed and deployed using RPM.
```
rpm -ivh UNT-1.0-5.oe2403sp2.noarch.rpm
```

After the UNT is installed, the `udf-translator` folder is generated in the `/opt/udf-trans-opt` directory. This directory is the working directory of the UNT.

```
#Directory structure
bin: directory storing the execution script
conf: directory storing configuration files
lib: directory storing dependencies
cpp: directory storing generated cpp source files, where different subdirectories map to different JAR packages
log: directory storing logs generated during translation
output: directory storing the .so file generated after the compilation, where different subdirectories map to different JAR packages
```

The `native_udf.py` file for viewing translation information is generated in `/usr/bin`.
How to Use
The UNT translation depends on the configuration files.
Configuration files:
- `conf/depend_class.properties`: mapping between Java and native class names
- `conf/depend_include.properties`: header file path
- `conf/depend_interface.config`: dependency interface

`conf` is a relative directory. For details about the base directory configuration, see step 4 "Modify the UNT user configuration."
Procedure:
1. Scan for missing interfaces.

Run `native_udf.py depend_info ${job_jar}` to scan for missing interfaces. Example of the scanning result:

```text
java.lang.String Methods: int length()
```

The example indicates that the `length()` function interface is absent from `String`.

2. Implement the missing interfaces.
Native-side interfaces can be implemented according to the base library coding specifications. For example, you can declare the `length()` interface in `String` as follows and then implement it.

```cpp
// String header file.
class String : public Object {
public:
    int32_t length() const;

private:
    std::string inner;
};
```

```cpp
// String cpp file
int32_t String::length() const
{
    return static_cast<int32_t>(inner.size());
}
```

Once implemented, this code must be compiled into a static library. The file name must be `libbasictypes.a`.

3. Add interface configuration files.
After the interface is implemented, you need to add the corresponding entries to the interface configuration files.

- Add the mapping between Java and native class names to the `depend_class.properties` file.

  ```text
  java.lang.String=String
  ```

  The key indicates the String class in Java, and the value indicates the native class.

- Add the header file path configuration to the `depend_include.properties` file.

  ```text
  java.lang.String=basictypes/String.h
  ```

  The key indicates the String class in Java, and the value indicates the relative path of the header file on the native side. For details about the base directory configuration, see step 4 "Modify the UNT user configuration."

- Add the dependency interface configuration to the `depend_interface.config` file.

  ```text
  <java.lang.String: int length()>, 0
  ```

  The first element of the configuration indicates the signature of a Java function, and the second element indicates the memory semantics. For details about memory semantics, see the memory semantics specifications in the native specifications.
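To make the shape of a `depend_interface.config` entry concrete, the following sketch splits one entry into its parts. It is only an illustration: the format is inferred from the single example entry in this guide, `InterfaceEntry` and `parseInterfaceEntry` are hypothetical names, and UNT's own parser may accept or reject different inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical holder for one depend_interface.config entry.
struct InterfaceEntry {
    std::string javaClass;    // for example "java.lang.String"
    std::string method;       // for example "int length()"
    int memorySemantics = 0;  // 0 or 1
};

// Parses a line shaped like "<java.lang.String: int length()>, 0".
// Returns false if the line does not match that shape.
bool parseInterfaceEntry(const std::string &line, InterfaceEntry &out) {
    const std::size_t open = line.find('<');
    const std::size_t close = line.rfind('>');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        return false;
    }
    // Text between the angle brackets: "<class>: <return type> <method>(<args>)".
    const std::string signature = line.substr(open + 1, close - open - 1);
    const std::size_t colon = signature.find(": ");
    const std::size_t comma = line.find(',', close);
    if (colon == std::string::npos || comma == std::string::npos) {
        return false;
    }
    // Skip blanks between the comma and the memory semantics digit.
    const std::size_t digit = line.find_first_not_of(" \t", comma + 1);
    if (digit == std::string::npos || (line[digit] != '0' && line[digit] != '1')) {
        return false;
    }
    out.javaClass = signature.substr(0, colon);
    out.method = signature.substr(colon + 2);
    out.memorySemantics = line[digit] - '0';
    return true;
}
```

Passing the example entry from this step yields the class `java.lang.String`, the method `int length()`, and the memory semantics value `0`.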
4. Modify the UNT user configuration.

By default, the configuration file is `/opt/udf-trans-opt/udf-translator/conf/udf_tune.properties`. The content of the configuration file is as follows:

```text
basic_lib_path=/opt/udf-trans-opt/libbasictypes
tune_level=0
regex_lib_type=1
regex_lib_path=/usr/local/ksl/lib/libKHSEL_ops.a
compile_option=
```

`basic_lib_path` is the base directory of the basic library and includes three subdirectories:

- `conf` is the base directory for storing configuration files.
- `include` is the base directory of the header files and stores the native header files implemented in step 2.
- `lib` stores the `.a` static dependency files compiled by users.
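As a rough illustration of how the base directory relates to its three subdirectories, the sketch below reads `key=value` lines and joins `basic_lib_path` with a subdirectory name. `parseProperties` and `basicLibSubdir` are hypothetical helpers written for this example, not part of UNT, and the simple parsing rules (no comments, no escaping) are an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Parses "key=value" lines, the shape used by udf_tune.properties.
// Lines without '=' are skipped. A sketch only, not UNT's own parser.
std::map<std::string, std::string> parseProperties(const std::string &text) {
    std::map<std::string, std::string> props;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) {
            continue;
        }
        props[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return props;
}

// Joins basic_lib_path with one of the subdirectories conf, include, or lib.
std::string basicLibSubdir(const std::map<std::string, std::string> &props,
                           const std::string &name) {
    const std::map<std::string, std::string>::const_iterator it = props.find("basic_lib_path");
    assert(it != props.end());  // the sketch expects basic_lib_path to be set
    return it->second + "/" + name;
}
```

With the default configuration, `basicLibSubdir(props, "include")` evaluates to `/opt/udf-trans-opt/libbasictypes/include`. Under the assumption that header paths in `depend_include.properties` are resolved against this directory, `basictypes/String.h` would be looked up below it.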
`tune_level` is used to configure the optimization level. The optimization levels are described as follows:

- level:0 indicates basic optimization, that is, automatic memory release. All further improvements are built upon this foundation.
- level:1 indicates hardware acceleration, which is implemented by reading hardware acceleration library interface settings and integrating with this interface.
- level:2 indicates memory allocation and deallocation acceleration.
- level:4 indicates AI for Compiler Kit (AI4C) acceleration.

Basic optimization is always combined with every other optimization by default. The values of the remaining levels are additive: adding them together represents the superposition of multiple optimization techniques (each individual level value must be a power of 2).
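Because the level values are powers of 2 and add up, a `tune_level` value can be read like a bit mask. The sketch below decodes a value under that reading. The names `TuneFlag`, `describeTuneLevel`, and `isKnownTuneLevel` are hypothetical, and treating the value as a bit mask is an interpretation of the additive rule above rather than a documented UNT interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Flag values taken from the level list; the enumerator names are hypothetical.
enum TuneFlag : unsigned {
    kHardwareAcceleration = 1,
    kMemoryAcceleration = 2,
    kAi4cAcceleration = 4
};

// Lists the optimizations implied by a tune_level value.
std::vector<std::string> describeTuneLevel(unsigned tuneLevel) {
    std::vector<std::string> enabled;
    // Basic optimization (level 0) is always part of the result.
    enabled.push_back("automatic memory release");
    if (tuneLevel & kHardwareAcceleration) {
        enabled.push_back("hardware acceleration");
    }
    if (tuneLevel & kMemoryAcceleration) {
        enabled.push_back("memory allocation and deallocation acceleration");
    }
    if (tuneLevel & kAi4cAcceleration) {
        enabled.push_back("AI4C acceleration");
    }
    return enabled;
}

// True if the value is built only from the documented levels 1, 2, and 4.
bool isKnownTuneLevel(unsigned tuneLevel) {
    const unsigned known = kHardwareAcceleration | kMemoryAcceleration | kAi4cAcceleration;
    return (tuneLevel & ~known) == 0;
}
```

Under this reading, `tune_level=3` (1 + 2) combines basic optimization with hardware acceleration and memory allocation and deallocation acceleration, and `tune_level=7` enables everything listed.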
`regex_lib_type` is used to configure whether to optimize the regular expression library. Set this parameter to `1` to enable regular expression library optimization, or set it to `0` to disable this optimization. The setting takes effect only when `tune_level` is set to `1`.

`regex_lib_path` specifies the linking path for the regular expression library. Set it to the path of the regular expression library used for optimization.

`compile_option` is used to customize compilation options. You need to verify that the options you set are correct.

5. Generate native source files and binary files using the translation command.
```text
bash /opt/udf-trans-opt/udf-translator/bin/udf_translate.sh {JAR package path} flink
```

After the command is executed, the source file is generated in the `cpp` directory, the SO file is generated in the `output` directory, and the log is generated in the `log` directory.

6. View translation information.
- native_udf.py source_info $

  You can specify a `job_jar` to view the location of the native source file.

- native_udf.py list $

  You can specify a `job_jar` to view IDs of successfully native-compiled UDFs and their binary file information. The binary file is generated in the `output` directory under the UNT installation path. The hash value of the subdirectory can be obtained using source_info.

- native_udf.py depend_info $

  You can specify a `job_jar` to view the interface information of the dependency libraries. Currently, lambda expressions cannot be scanned.

- native_udf.py fail_info $

  You can specify a `job_jar` to view the native compilation failure cause, for example, the absence of the dependency interfaces.

- native_udf.py tune_level $

  You can specify the UDF native optimization level.
Native Specifications
The UNT tool helps create native code. Follow the specifications below so that this code is integrated into your translation automatically.
Memory semantics specifications
Release all objects except the output object. In addition, you need to configure the memory semantics in `depend_interface.config`. If the returned object is newly created, set the memory semantics value for the method signature to `1`. Otherwise, set it to `0`.

Example 1:

```cpp
int32_t String::length() const
{
    return static_cast<int32_t>(inner.size());
}
```

The `length` method returns a basic type, and no new object is involved. Therefore, set this parameter to `0`.

```text
<java.lang.String: int length()>, 0
```

Example 2:

```cpp
String *String::substring(const int32_t idx) const
{
    std::string s = this->inner.substr(idx);
    return new String(std::move(s));
}
```

The `substring` method returns the `String` type, and the returned object is newly created. Therefore, set this parameter to `1`.

```text
<java.lang.String: java.lang.String substring(int)>, 1
```

Inheritance specifications
All classes must inherit from `Object`.

The `Object` class is described as follows:

`Object` header file:

```cpp
// Headers needed by the declarations below.
#include <cstdint>
#include <mutex>
#include <string>
#include <nlohmann/json.hpp>

class Object {
public:
    Object();
    Object(nlohmann::json jsonObj);
    virtual ~Object();

    virtual int hashCode();
    virtual bool equals(Object *obj);
    virtual std::string toString();
    virtual Object *clone();

    Object(const Object &obj);
    Object(Object &&obj);
    Object &operator=(const Object &obj);
    Object &operator=(Object &&obj);

    // Reference counting: getRefCount() adds a reference,
    // putRefCount() drops one and deletes the object at zero.
    void putRefCount();
    void getRefCount();
    void setRefCount(uint32_t count);
    bool isCloned();
    uint32_t getRefCountNumber();

public:
    std::recursive_mutex mutex;
    bool isClone = false;
    bool isPool = false;
    uint32_t refCount = 1;
};
```

`Object` cpp file:

```cpp
Object::Object() = default;

Object::Object(nlohmann::json jsonObj) { return; }

Object::~Object() = default;

int Object::hashCode() { return 0; }

bool Object::equals(Object *obj) { return false; }

std::string Object::toString() { return std::string(); }

Object *Object::clone() { return nullptr; }

Object::Object(const Object &obj)
{
    this->refCount = obj.refCount;
    this->isClone = obj.isClone;
}

// Signature matches the declaration Object(Object &&obj) in the header.
Object::Object(Object &&obj)
{
    this->refCount = obj.refCount;
    this->isClone = obj.isClone;
}

Object &Object::operator=(const Object &obj)
{
    this->refCount = obj.refCount;
    this->isClone = obj.isClone;
    return *this;
}

Object &Object::operator=(Object &&obj)
{
    this->refCount = obj.refCount;
    this->isClone = obj.isClone;
    return *this;
}

void Object::putRefCount()
{
    // Common case: other references remain, so nothing is freed.
    if (__builtin_expect(--refCount != 0, true)) {
        return;
    }
    delete this;
}

void Object::getRefCount() { ++refCount; }

void Object::setRefCount(uint32_t count) { refCount = count; }

bool Object::isCloned() { return isClone; }

uint32_t Object::getRefCountNumber() { return refCount; }
```
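The memory semantics and inheritance specifications can be seen working together in a short sketch. To stay self-contained, it uses a reduced stand-in for `Object` (no JSON constructor, mutex, or clone support) and a hypothetical `liveStrings` counter that exists only to observe destruction. It illustrates the reference counting contract and is not the base library's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Reduced stand-in for the Object base class; only reference counting is kept.
class Object {
public:
    Object() = default;
    virtual ~Object() = default;

    void getRefCount() { ++refCount; }  // take one more reference
    void putRefCount() {                // drop a reference, free at zero
        if (--refCount != 0) {
            return;
        }
        delete this;
    }
    uint32_t getRefCountNumber() const { return refCount; }

private:
    uint32_t refCount = 1;
};

// Hypothetical counter of String objects that are currently alive.
static int liveStrings = 0;

class String : public Object {
public:
    explicit String(std::string value) : inner(std::move(value)) { ++liveStrings; }
    ~String() override { --liveStrings; }

    // Memory semantics 0: returns a basic type, no new object is created.
    int32_t length() const { return static_cast<int32_t>(inner.size()); }

    // Memory semantics 1: returns a newly created object, which the
    // caller releases with putRefCount() when it is no longer needed.
    String *substring(const int32_t idx) const {
        assert(idx >= 0 && static_cast<std::size_t>(idx) <= inner.size());
        return new String(inner.substr(static_cast<std::size_t>(idx)));
    }

private:
    std::string inner;
};
```

A caller that creates a `String`, calls `substring`, and then calls `putRefCount()` once on each object ends with no live objects. An extra `getRefCount()` keeps an object alive until a matching `putRefCount()` is issued.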