08/05/2024 | Press release
By: Ramesh Vanteru, Principal & Head of SAS COE
Enterprises worldwide are emphasizing flexibility, scalability, and cost-effectiveness to stay resilient and relevant. Disruptive advances in cloud computing and artificial intelligence are driving organizations to embrace change while keeping customer satisfaction a top priority. However, organizations are experiencing a significant gap between customer expectations and their operational landscapes, products, or services. They are re-evaluating their ecosystems, including technology, business processes, services, and products, and making hard strategic decisions, such as decommissioning long-standing solutions in favor of more modern, forward-looking offerings. These changes aim to improve customer reach, strengthen data transformation journeys, and enable future business use cases at a fraction of the cost.
SAS, renowned for its statistical, analytical, and domain-specific solutions, has been widely used across various industries. However, limitations around proprietary licensing, interoperability, integration, and constantly improving cloud and AI offerings are leading organizations to explore technology stacks beyond SAS. The modern requirement across industries is to transform SAS processes to non-SAS platforms, optimizing cost and technological efficiency. This transition brings the challenge of effectively converting SAS code to suit other platforms like PySpark, Snowflake, Databricks, and others.
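To give a flavor of the translation involved, consider a typical SAS DATA step that filters rows and derives a column. The sketch below is illustrative only: the SAS snippet and its possible PySpark rendering appear as comments, and the same row-wise semantics are expressed in plain Python so the behavior can be checked without a Spark cluster. All table and column names are hypothetical.

```python
# A typical SAS DATA step that filters rows and derives a column:
#
#   data work.high_value;
#       set work.orders;
#       where amount > 100;
#       tax = amount * 0.08;
#   run;
#
# One possible PySpark rendering (illustrative):
#
#   high_value = (orders
#                 .filter(F.col("amount") > 100)
#                 .withColumn("tax", F.col("amount") * 0.08))
#
# The same semantics in plain Python, for a cluster-free check:

def high_value(orders):
    """Keep orders over 100 and derive an 8% tax column."""
    out = []
    for row in orders:
        if row["amount"] > 100:
            out.append({**row, "tax": row["amount"] * 0.08})
    return out

orders = [{"id": 1, "amount": 250.0}, {"id": 2, "amount": 40.0}]
result = high_value(orders)
```

Even this tiny example shows why conversion is non-trivial: implicit row-by-row DATA-step semantics must be re-expressed as set-based DataFrame transformations.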
We have completed several prototypes and successful real-world SAS modernization projects, facing the challenges that inevitably arise. Our specialization includes SAS modernization to Databricks and PySpark-based platforms. To master the process and overcome the challenges of SAS code conversion, we designed a solution. It includes a migration approach for transforming an existing SAS ecosystem to a completely different stack, be it PySpark, Databricks, or Snowflake. We have also created accelerators for automated and accurate code conversion.
Going through the modernization journeys of several organizations has also given us insight into how things work and what works best. A Prudent Markets study indicates a 33.9% CAGR in the adoption of Apache Spark. The verdict is clear: PySpark has emerged as a widely accepted alternative to SAS owing to its capabilities and open-source architecture. Our proprietary accelerator, Scintilla, is a great companion on the SAS modernization journey, streamlining processes, speeding up code conversion, and ensuring efficiency and quality.
Scintilla is our flagship accelerator, designed to streamline and speed up code conversion during the SAS modernization process. It is a pattern-driven converter that learns from past conversion exercises and improves each subsequent iteration. The accelerator also unlocks the benefits of modern big data processing and analysis.
A smart analyzer included in Scintilla summarizes SAS coding standards and simplifies the complexities of moving workloads. Powered by generative AI (Gen AI), the accelerator offers a versatile approach that is both fast and accurate. Its capabilities span the modules described below, from assessment and transpilation to optimization, documentation, test case generation, and synthetic data generation.
Experts and enthusiasts worldwide are inclined to explore, adapt, and create applications based on Gen AI for better and faster outcomes. Well, so are we. We infused Scintilla with the capabilities of Gen AI and LLMs for SAS to PySpark code transpilation. During our research, we found that many of the proprietary LLMs and open-source foundation models we evaluated understood logic and pseudocode; however, they could not handle code conversion tasks accurately, especially complex SAS code. Given these limitations, the models could not be used as-is for complex SAS to PySpark migration. It was therefore crucial to select an appropriate LLM as the baseline and focus on tuning it. We also identified tools and methods to optimize and speed up the code conversion process.
Based on these results, we finalized and progressed with tuning the identified LLM, creating a distilled or child model for internal evaluation.
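Fine-tuning of this kind typically relies on supervised pairs of source and target code. A hedged sketch of what one training record might look like; the field names and prompt wording are illustrative, not Scintilla's actual schema:

```python
import json

def make_record(sas_code: str, pyspark_code: str) -> str:
    """Serialize one SAS -> PySpark training pair as a JSONL record."""
    return json.dumps({
        "input": f"Convert the following SAS code to PySpark:\n{sas_code}",
        "output": pyspark_code,
    })

record = make_record(
    "proc sort data=orders; by id; run;",
    'orders = orders.orderBy("id")',
)
```

Thousands of such verified pairs, curated from past conversion exercises, are what turn a general-purpose foundation model into a specialized code-conversion model.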
Scintilla now delivers enhanced conversion and documentation results with this newly fine-tuned model and its underlying artifacts. The integration of Scintilla and the LLM was seamless, fitting perfectly into Scintilla's code conversion and analysis processes.
Beyond strengthening the tool and accelerating the code conversion process, the integration with Gen AI and LLMs delivers further benefits across assessment, conversion, and documentation.
The following modules simplify the SAS code conversion process and make modernization faster.
UI module
This vibrant user interface is the gateway for all activities and process flows related to SAS to PySpark code conversion, documentation, and Gen AI LLM tasks. It is where the magic begins and ends.
Assessment module
SAS code is meticulously parsed and analyzed at the block level in this module to generate comprehensive assessment reports.
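A first-pass, block-level inventory of this kind can be approximated with simple pattern matching. A minimal sketch, bearing in mind that real SAS parsing is far more involved; the patterns below only count top-level constructs:

```python
import re

def assess_sas(source: str) -> dict:
    """Count block-level constructs in a SAS program for a rough inventory."""
    lower = source.lower()
    return {
        "data_steps": len(re.findall(r"^\s*data\s", lower, re.MULTILINE)),
        "proc_steps": len(re.findall(r"^\s*proc\s", lower, re.MULTILINE)),
        "macros":     len(re.findall(r"%macro\s", lower)),
        "sql_blocks": len(re.findall(r"^\s*proc\s+sql", lower, re.MULTILINE)),
    }

sample = """
data work.clean; set raw.orders; run;
proc sql; create table t as select * from work.clean; quit;
%macro tidy; proc sort data=work.clean; by id; run; %mend;
"""
report = assess_sas(sample)
```

Counts like these feed directly into effort estimation: macros and PROC SQL blocks typically convert very differently from plain DATA steps.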
Transpiler module
The heart of conversion, this utility transforms SAS code into clean, integrated, and syntactically accurate PySpark code. It combines native components with cutting-edge LLM integrations to ensure seamless transitions.
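The native, pattern-driven side of such a transpiler can be pictured as a set of rewrite rules. A heavily simplified sketch with two rules only; a production converter handles vastly more constructs and defers the hard cases to the LLM:

```python
import re

# Each rule maps a recognizable SAS pattern to a PySpark template.
RULES = [
    # proc sort data=<tbl>; by <col>; run;  ->  <tbl> = <tbl>.orderBy("<col>")
    (re.compile(r"proc sort data=(\w+);\s*by (\w+);\s*run;", re.IGNORECASE),
     r'\1 = \1.orderBy("\2")'),
    # data <out>; set <in>; run;  ->  <out> = <in>
    (re.compile(r"data (\w+);\s*set (\w+);\s*run;", re.IGNORECASE),
     r"\1 = \2"),
]

def transpile(sas_code: str) -> str:
    """Apply each rewrite rule in turn to produce PySpark-style code."""
    out = sas_code
    for pattern, template in RULES:
        out = pattern.sub(template, out)
    return out

converted = transpile("proc sort data=orders; by id; run;")
```

The strength of this rule-based core is determinism and repeatability; the LLM integration covers the long tail the rules cannot express.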
Core repository
This essential module houses a specialized code dictionary developed by our experts. It underpins the conversion of SAS to PySpark code and continuously evolves to enhance accuracy and capability.
AI Repository powered by Google Gemini Pro LLM
This module fine-tunes the LLM to handle complex SAS logic that may be difficult for the core repository, ensuring precise PySpark code generation. For this purpose, it leverages a sophisticated code dictionary we have developed in-house.
Code Optimizer powered by Google Gemini Pro LLM
This module uses the code dictionary to fine-tune the LLM, optimizing the PySpark code generated by the Transpiler module for peak performance.
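Rule-based optimization of generated code can be sketched the same way as transpilation, here flagging one common PySpark anti-pattern: checking emptiness via a full count. The rule is illustrative only (`DataFrame.isEmpty` is available in recent PySpark releases):

```python
import re

def optimize(pyspark_code: str) -> str:
    """Rewrite `df.count() > 0` into the cheaper `not df.isEmpty()`."""
    return re.sub(r"(\w+)\.count\(\)\s*>\s*0", r"not \1.isEmpty()", pyspark_code)

optimized = optimize("if orders.count() > 0:")
```

Automatically generated code tends to accumulate exactly these kinds of inefficiencies, which is why a dedicated optimization pass follows the transpiler.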
TechWriter powered by Google Gemini Pro LLM
This module analyzes both SAS and PySpark code to produce detailed technical documentation. Enhanced by LLM, it offers superior code analysis and reduces the need for manual intervention.
Test case generation using Gen AI
This module uses an LLM to generate test cases for the PySpark code produced by the optimization step.
Synthetic data generation
This module generates synthetic data using the dbldatagen library. Data can be generated in two ways: from sample data or from a schema alone. Generation can also be customized with minimum and maximum values and other parameters.
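The schema-driven idea behind such generation can be sketched with the standard library (dbldatagen itself requires a running Spark session); the schema, field names, and ranges below are illustrative:

```python
import random

def generate_rows(schema: dict, n: int, seed: int = 42) -> list:
    """Generate n synthetic rows; schema maps column -> (type, min, max)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col, (kind, lo, hi) in schema.items():
            row[col] = rng.randint(lo, hi) if kind == "int" else rng.uniform(lo, hi)
        rows.append(row)
    return rows

schema = {"customer_id": ("int", 1, 1000), "amount": ("float", 0.0, 500.0)}
data = generate_rows(schema, n=5)
```

In dbldatagen the equivalent is expressed as column specifications with value constraints, which lets migration teams exercise converted pipelines without touching production data.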
Scintilla's innovative integration of generative AI algorithms and models has transformed the code conversion process, making it more efficient and accurate. Each module, from the vibrant UI to the meticulous TechWriter, plays a crucial role in this transformation. With the seamless integration of cutting-edge LLM technology, Scintilla simplifies SAS to PySpark migration and ensures high-quality documentation and optimized code performance. This comprehensive approach positions Scintilla as a powerful tool for modern data engineering needs.
Principal & Head of SAS COE
Ramesh is a SAS SME with close to two decades of experience in architecting and building data and SAS analytical applications. He is SAS Advanced Certified and a Snowflake SnowPro Certified Architect, providing technical advisory on building SAS and Snowflake data platforms, with deep expertise in consulting, taxation, BFSI, healthcare, and the implementation of cloud data platforms.