LTIMindtree Ltd.

08/05/2024 | Press release | Distributed by Public on 08/05/2024 07:45

From SAS to PySpark: Scintilla's AI-Driven Transformation

August 5, 2024

By: Ramesh Vanteru, Principal & Head of SAS COE

Enterprises worldwide are emphasizing flexibility, scalability, and cost-effectiveness to stay resilient and relevant. Disruptive technological advancements in cloud computing and artificial intelligence are driving organizations to embrace change while maintaining customer satisfaction as a top priority. However, organizations are experiencing a significant gap between customer expectations and their operational landscapes, products, or services. They are re-evaluating their ecosystems, including technology, business processes, services, and products. They are also making stringent strategic decisions, such as decommissioning long-standing solutions in favor of more futuristic and trending offerings. These changes aim to improve customer reach, strengthen data transformation journeys, and empower future business use cases at a fraction of the cost.

SAS, renowned for its statistical, analytical, and domain-specific solutions, has been widely used across various industries. However, limitations around proprietary licensing, interoperability, and integration, together with constantly improving cloud and AI offerings, are leading organizations to explore technology stacks beyond SAS. The modern requirement across industries is to move SAS processes to non-SAS platforms, optimizing cost and technological efficiency. This transition brings the challenge of converting SAS code effectively for targets such as PySpark, Snowflake, and Databricks.

SAS modernization gaining ground

We have completed several prototypes and successful real-world SAS modernization projects, and we have worked through the challenges that inevitably come with them. Our specialization includes SAS modernization to Databricks and PySpark-based platforms. To overcome the challenges of SAS code conversion, we designed a solution that includes a migration approach for moving the existing SAS ecosystem to a completely different stack, be it PySpark, Databricks, or Snowflake. We have also created accelerators for automated and accurate code conversion.

Going through the modernization journeys of several organizations has also given us insights into how things work and what works best. A Prudent Markets study indicates a 33.9% CAGR in the adoption of Apache Spark. The verdict is clear. PySpark has emerged as a widely accepted alternative to SAS owing to its capabilities and open-source architecture. Our proprietary accelerator, Scintilla, is a great companion en route to the SAS modernization process, streamlining processes, speeding up code conversion, and ensuring efficiency and quality.

Accelerating SAS modernization with Scintilla

Scintilla is our flagship accelerator that we designed to streamline and speed up code conversion during the SAS modernization process. It is a pattern-driven converter that learns from past conversion exercises and advances the progress of upcoming iterations. Our accelerator also unlocks the benefits of modern big data processing and analysis.

A smart analyzer included in Scintilla summarizes SAS coding standards and simplifies the complexities of moving workloads. Powered by generative AI (Gen AI), the accelerator offers a versatile approach that is speedy and accurate. Its capabilities include:

  • SAS Code Analysis and Lineage Assessment Reports
  • SAS Code Transpilation to PySpark
  • PySpark Code Optimization
  • PySpark Code Analysis and Documentation
  • Synthetic Data Maker
  • Test Case Generation

Integrating Gen AI and LLM with Scintilla

Experts and enthusiasts worldwide are inclined to explore, adapt, and create applications based on Gen AI for better and faster outcomes. Well, so are we. We infused Scintilla with the capabilities of Gen AI and LLMs for SAS to PySpark code transpilation. During our research, we evaluated many proprietary LLMs and open-source foundation models and found that they understood logic and pseudocode. However, they could not handle code conversion tasks accurately, especially for complex SAS code, so they could not be used as-is for complex SAS to PySpark migration. It was therefore crucial to select an appropriate LLM as the baseline and focus on tuning it. We identified the following tools and methods to optimize and speed up the code conversion process:

  • Efficient model for SAS code conversion and documentation: Google's Gemini Pro or Gemini Flash
  • Suitable tuning methods: prompt engineering, few-shot tuning
  • Training methods evaluated: retrieval-augmented generation (RAG)
  • Training tools evaluated: Vertex AI, Google Colab
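To illustrate what few-shot prompting can look like for code conversion, here is a minimal sketch. It only assembles a prompt from hand-verified SAS and PySpark pairs and makes no model call. The class name, function name, prompt wording, and example pair are all hypothetical and do not reproduce the prompts used inside Scintilla.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionExample:
    """One hand-verified SAS snippet with its PySpark equivalent (hypothetical)."""
    sas: str
    pyspark: str


def build_few_shot_prompt(examples, sas_code):
    """Join an instruction, the worked examples, and the new SAS code into one prompt."""
    parts = ["Convert the SAS code to equivalent PySpark code. Return only code."]
    for number, example in enumerate(examples, start=1):
        parts.append(
            f"Example {number}\nSAS:\n{example.sas}\nPySpark:\n{example.pyspark}"
        )
    # The final block leaves the PySpark side empty for the model to complete.
    parts.append(f"SAS:\n{sas_code}\nPySpark:")
    return "\n\n".join(parts)


examples = [
    ConversionExample(
        sas="proc sort data=sales out=sales_sorted; by region; run;",
        pyspark='sales_sorted = sales.orderBy("region")',
    ),
]
prompt = build_few_shot_prompt(
    examples, "proc sort data=orders out=orders_sorted; by order_date; run;"
)
print(prompt)
```

The resulting string would then be sent to whichever model is chosen as the baseline. Adding more verified pairs for harder constructs is the usual way to extend this kind of prompt.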

Based on these results, we finalized and progressed with tuning the identified LLM, creating a distilled or child model for internal evaluation.

Scintilla now delivers enhanced conversion and documentation results with this newly fine-tuned model and its underlying artifacts. The integration of Scintilla and the LLM was seamless, fitting perfectly into Scintilla's code conversion and analysis processes.

Advantages of enhancing Scintilla with Gen AI and LLM

Besides strengthening the tool and accelerating the code conversion process, integrating with Gen AI and LLM:

  • Reduces and optimizes the overall cost incurred while communicating and connecting with the LLM
  • Enhances Scintilla's outcome and improves accuracy, specifically for complex SAS code snippets
  • Improves Scintilla's documentation and analytical capabilities and generates accurate documentation with less effort
  • Reduces the effort and the SAS/PySpark skills needed for manual remediation

Key components of Scintilla

The following modules simplify the SAS code conversion process and make modernization faster.

UI module

This vibrant user interface is the gateway for all activities and process flows related to SAS to PySpark code conversion, documentation, and Gen AI LLM tasks. It is where the magic begins and ends.

Assessment module

In this module, SAS code is meticulously parsed and analyzed at the block level to generate comprehensive assessment reports.
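As a rough illustration of block-level analysis, the sketch below splits SAS source into DATA and PROC steps and counts them by type. It is a simplified stand-in under stated assumptions, not the parser used in Scintilla: the regular expression, the function name, and the sample program are hypothetical, and macros, comments, and string literals are not handled.

```python
import re
from collections import Counter

# A step starts at a DATA or PROC keyword at the beginning of a line
# and ends at the first RUN; or QUIT; that follows it.
STEP_PATTERN = re.compile(
    r"^\s*(data|proc\s+\w+)\b.*?\b(?:run|quit)\s*;",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def summarize_blocks(sas_source):
    """Count SAS steps by type, for example 'DATA', 'PROC SORT', 'PROC SQL'."""
    counts = Counter()
    for match in STEP_PATTERN.finditer(sas_source):
        # Normalize case and inner whitespace: 'proc   sort' -> 'PROC SORT'.
        step_type = " ".join(match.group(1).upper().split())
        counts[step_type] += 1
    return counts


sample = """
data work.clean;
    set raw.sales;
    if amount > 0;
run;

proc sort data=work.clean out=work.sorted;
    by region;
run;

proc sql;
    create table work.totals as
    select region, sum(amount) as total from work.sorted group by region;
quit;
"""
report = summarize_blocks(sample)
for step_type, count in sorted(report.items()):
    print(step_type, count)
```

A count like this is only the starting point of an assessment report. It gives a first view of which step types dominate a code base and therefore where conversion effort is likely to concentrate.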

Transpiler module

The heart of conversion, this utility transforms SAS code into clean, integrated, and syntactically accurate PySpark code. It combines native components with cutting-edge LLM integrations to ensure seamless transitions.
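The idea of pattern-driven conversion can be sketched with a single rule. The example below recognizes one narrow form of PROC SORT and emits the corresponding PySpark source as text, so it runs without a Spark installation. The rule, the function names, and the sample statement are hypothetical and are not taken from Scintilla. Dropping the SAS library prefix to form a Python variable name is likewise an assumption.

```python
import re

# One hypothetical rule: PROC SORT with DATA=, OUT=, and a BY list of ascending keys.
PROC_SORT = re.compile(
    r"proc\s+sort\s+data\s*=\s*(?P<src>[\w.]+)\s+out\s*=\s*(?P<dst>[\w.]+)\s*;"
    r"\s*by\s+(?P<keys>[\w\s]+?)\s*;\s*run\s*;",
    re.IGNORECASE,
)


def to_identifier(sas_name):
    """Drop the SAS library prefix (work.sales -> sales) to form a Python name."""
    return sas_name.split(".")[-1]


def convert_proc_sort(sas_code):
    """Return PySpark source for a matching PROC SORT, or None if the rule does not apply."""
    match = PROC_SORT.search(sas_code)
    if match is None:
        return None
    keys = match.group("keys").split()
    if any(key.lower() == "descending" for key in keys):
        return None  # Out of scope for this rule; leave it for another rule or a reviewer.
    columns = ", ".join(f'"{key}"' for key in keys)
    source = to_identifier(match.group("src"))
    target = to_identifier(match.group("dst"))
    return f"{target} = {source}.orderBy({columns})"


sas = "proc sort data=work.sales out=work.sales_sorted; by region sale_date; run;"
generated = convert_proc_sort(sas)
print(generated)  # sales_sorted = sales.orderBy("region", "sale_date")
```

Returning None for anything the rule does not recognize keeps the rule honest: unmatched code is passed on rather than converted by guesswork.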

Core repository

This essential module houses a specialized code dictionary developed by our experts. It plays a role similar to an LLM, facilitating the conversion of SAS to PySpark code, and it evolves continuously to enhance accuracy and capability.

AI Repository powered by Google Gemini Pro LLM

This module fine-tunes the LLM to handle complex SAS logic that may be difficult for the core repository and ensures precise PySpark code generation. For this purpose, it leverages a sophisticated code dictionary we have developed in-house.
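One way to picture a code dictionary that hands harder cases to an LLM is a lookup with an explicit fallback route. The sketch below is illustrative only. The dictionary entries, the route labels, and the function name are hypothetical, and the mapping covers function names only. Argument order and edge-case semantics would still need to be checked for each entry.

```python
# Hypothetical dictionary entries: SAS function name -> pyspark.sql.functions name.
FUNCTION_DICTIONARY = {
    "UPCASE": "upper",
    "LOWCASE": "lower",
    "STRIP": "trim",
    "ABS": "abs",
    "SQRT": "sqrt",
}


def lookup(sas_function):
    """Return (pyspark_name, route), where route records which path handles the call."""
    name = FUNCTION_DICTIONARY.get(sas_function.upper())
    if name is not None:
        return f"F.{name}", "dictionary"
    # Unknown functions are not guessed; they are routed onward for
    # LLM-based conversion or manual review.
    return None, "needs_llm_or_review"


for sas_function in ("upcase", "STRIP", "intnx"):
    print(sas_function, lookup(sas_function))
```

The `F.` prefix assumes the common convention `from pyspark.sql import functions as F` in the generated code.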

Code Optimizer powered by Google Gemini Pro LLM

This module uses the code dictionary to fine-tune the LLM, optimizing the PySpark code generated by the Transpiler module for peak performance.

TechWriter powered by Google Gemini Pro LLM

This module analyzes both SAS and PySpark code to produce detailed technical documentation. Enhanced by the LLM, it offers superior code analysis and reduces the need for manual intervention.

Test case generation using Gen AI

This module uses the LLM to generate test cases for the optimized PySpark code.

Synthetic data generation

This module generates synthetic data using the dbldatagen library. Data can be generated in two ways: from sample data or from a schema alone. Generation can also be customized through minimum and maximum values and other parameters.
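The schema-only mode with minimum and maximum values can be sketched as follows. To keep the example runnable anywhere, it uses only the Python standard library rather than the library named above, so none of its calls reflect that library's API. The schema layout, column names, and function name are hypothetical.

```python
import random

# Hypothetical schema: column name -> (type, minimum, maximum).
SCHEMA = {
    "customer_id": ("int", 1000, 9999),
    "order_amount": ("float", 5.0, 500.0),
}


def generate_rows(schema, row_count, seed=42):
    """Generate rows whose values fall inside each column's [minimum, maximum] range."""
    rng = random.Random(seed)  # A fixed seed keeps the generated data reproducible.
    rows = []
    for _ in range(row_count):
        row = {}
        for column, (kind, low, high) in schema.items():
            if kind == "int":
                row[column] = rng.randint(low, high)
            elif kind == "float":
                row[column] = round(rng.uniform(low, high), 2)
            else:
                raise ValueError(f"Unsupported column type: {kind}")
        rows.append(row)
    return rows


rows = generate_rows(SCHEMA, row_count=5)
for row in rows:
    print(row)
```

Reproducible synthetic rows of this kind are useful for comparing the output of the original SAS job with the converted PySpark job without exposing production data.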

Conclusion

Scintilla's innovative integration of generative AI algorithms and models has transformed the code conversion process, making it more efficient and accurate. Each module, from the vibrant UI to the meticulous TechWriter, plays a crucial role in this transformation. With the seamless integration of cutting-edge LLM technology, Scintilla simplifies SAS to PySpark migration and ensures high-quality documentation and optimized code performance. This comprehensive approach positions Scintilla as a powerful tool for modern data engineering needs.

Blogger's Profile

Ramesh Vanteru

Principal & Head of SAS COE

Ramesh is a SAS SME with close to two decades of experience in architecting and building data and SAS analytical applications. He is SAS Advanced Certified and a Snowflake SnowPro certified architect who provides technical advisory on building SAS and Snowflake data platforms, with deep expertise in consulting, taxation, BFSI, healthcare, and the implementation of cloud data platforms.
