
A Comprehensive Guide to Snowpark - Part 2

June 30, 2023

By: Poorna Chand Addala, Senior Data Engineer, Snowflake Center of Excellence, LTIMindtree

Background:

We're glad to have you back for the second part of our extensive Snowpark guide. In the previous blog, we explored the different execution modes of Snowpark code and gained a deeper understanding of how Snowpark executes code under the hood. In this article, we shift our attention to the constraints of language-specific Stored Procedures (SPs) and to the way User-Defined (Table) Functions (UD(T)Fs) overcome them to unlock Snowpark's full potential.

Limitations of SPs

Snowpark SPs have certain limitations that should be considered:

  • Firstly, SPs are primarily designed to function on a single node. During the execution of an SP, Snowpark DataFrames and SQL statements are transmitted to the Snowflake SQL engine, which provides performance improvements and scale-out benefits. However, language-specific constructs within the SP execute only on a single node of the Snowflake virtual warehouse. Consequently, if the entire code runs on the Python Virtual Machine (PVM) of a single node, all the data must be transferred to that node for processing, which can cause notable performance difficulties, especially when handling extensive datasets (the sketch after this list contrasts the two execution paths).
  • Moreover, SPs lack support for concurrent queries and process creation. As a result, each statement or query within an SP executes sequentially, regardless of their interdependence.
  • Another constraint pertains to memory usage. SPs have memory limitations, and if the code requires more memory than what has been allocated, the execution may fail. Thus, carefully managing the code's memory requirements is crucial for successful execution.
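
To make the first limitation concrete, here is a minimal sketch of a Snowpark stored procedure; the table, column, and procedure names are hypothetical. Work expressed through the DataFrame API is pushed down to the SQL engine and scales across the warehouse, whereas anything pulled into pandas must fit on the single node running the SP.

```python
# A minimal sketch of a Snowpark stored procedure; the table, column, and procedure
# names are hypothetical. DataFrame operations are pushed down to the SQL engine,
# while anything pulled into pandas runs on the single node executing the SP.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_


def my_proc(session: Session) -> str:
    df = session.table("my_table")  # lazy reference, nothing is pulled yet

    # Pushed down: compiled to SQL and executed across the whole virtual warehouse.
    agg = df.group_by("region").agg(sum_(col("amount")).alias("total"))
    agg.write.save_as_table("region_totals", mode="overwrite")

    # Pulled in: would materialize every row on the SP's single node, which is
    # exactly where the memory and performance limits described above bite.
    # pdf = df.to_pandas()

    return "done"


# Registration sketch (assumes an active session and appropriate packages):
# session.sproc.register(my_proc, name="my_proc", replace=True,
#                        packages=["snowflake-snowpark-python"])
```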

So, now you might be wondering if there is a way to run your language-specific code in Snowflake, taking advantage of all the nodes of your virtual warehouse. UD(T)Fs are there to help you out.

How do UD(T)Fs solve the SPs problem?

Fig 10: UD(T)F Execution Flow

Figure 10 illustrates the execution flow of UD(T)Fs:

  • Python employs a Global Interpreter Lock (GIL) to ensure the sequential execution of threads, preventing the simultaneous utilization of multiple processor cores in a multi-threaded environment.
  • To overcome the limitations of the GIL, Snowflake employs multiple interpreter processes to execute Python functions within a query. Each process runs an independent Python interpreter instance, enabling parallel execution across processes. For instance, when dealing with a million-row table like "my_table," Snowflake may spawn multiple interpreter processes to parallelize the computation. Each process concurrently executes the UDF on a subset of rows, and the results are merged to produce the final outcome (a minimal UDF sketch follows this list).
  • When a Python function runs in Snowflake, the platform first initializes the Python interpreter. This initialization entails configuring the Python environment, importing modules, and performing the other steps needed to run Python code.
  • After the Python interpreter initialization, Snowflake forks additional processes to execute the Python function. This approach reduces the time required to initialize the interpreter for each function's execution.
  • Upon completion of the query, packages and imports remain stored in the local cache, while the sandbox and interpreters are cleaned up. Subsequent executions of the same query, or of another query using the same packages and imports, benefit from faster execution because the initialization steps are avoided.
  • In summary, Snowflake strives to leverage the full processing capability of the warehouse when running queries with Python or Java UDFs. This involves redistributing rows among nodes in the warehouse to parallelize resource-intensive computations.
  • Furthermore, Snowflake collects historical statistics and assesses the performance of UDFs (whether fast or slow) to optimize workload distribution among workers, aiming to achieve the fastest possible query execution.
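
As a quick illustration of the flow above, here is a minimal sketch of a scalar Python UDF (the function, table, and column names are hypothetical). The parallelism comes entirely from Snowflake running the handler in many interpreter processes, not from anything inside the function.

```python
# A minimal sketch of a scalar Python UDF; the function, table, and column names are
# hypothetical. Registration assumes an active Snowpark session.
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import IntegerType, StringType


@udf(name="score_length", input_types=[StringType()], return_type=IntegerType(), replace=True)
def score_length(text: str) -> int:
    # Plain per-row Python. The parallelism comes from Snowflake running this handler
    # in many interpreter processes across the warehouse, not from the function itself.
    return len(text or "")


# Typical use from the DataFrame API (assuming my_table has a TEXT column):
# session.table("my_table").select(score_length("text")).show()
```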

UD(T)Fs address the SPs problem by allowing distributed execution on multiple nodes. When a UDTF is invoked in a query, Snowflake automatically partitions the input data and distributes it among the nodes in the warehouse. Each node then independently executes the UDTF on its assigned data subset. This parallel execution allows multiple nodes to process different input portions simultaneously, significantly improving performance.

Once the UDTF execution is complete on each node, the intermediate results are combined or merged to produce the final output. Snowflake handles this merging process transparently, providing a unified result set to the query. By leveraging the distributed computing capabilities of the Snowflake data warehouse, UD(T)Fs provide a scalable and performant solution to the SPs problem. They allow for the efficient utilization of multiple nodes and enable parallel processing, thereby overcoming the limitations of executing code on a single node.
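
The following is a minimal sketch of what that looks like in practice; the handler, table, and column names are hypothetical. A Python UDTF is registered once and then invoked in SQL, where Snowflake partitions the input, runs the handler on each partition in parallel, and merges the results.

```python
# A minimal sketch of a Python UDTF; handler, table, and column names are hypothetical.
from snowflake.snowpark.types import IntegerType, StringType, StructType, StructField


class WordCount:
    def process(self, text: str):
        # Called once per input row, on whichever node holds that row's partition.
        for word in (text or "").split():
            yield (word, 1)


output_schema = StructType([
    StructField("word", StringType()),
    StructField("cnt", IntegerType()),
])

# Registration (assumes an active Snowpark session):
# session.udtf.register(WordCount, output_schema=output_schema,
#                       input_types=[StringType()], name="word_count", replace=True)

# Invocation in SQL; Snowflake partitions the input and spreads the work across nodes:
# SELECT t.word, SUM(t.cnt) AS total
# FROM my_table,
#      TABLE(word_count(my_table.text) OVER (PARTITION BY my_table.region)) AS t
# GROUP BY t.word;
```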

Summary of Snowpark objects

Snowpark best practices

  • Utilizing Snowpark DataFrame instead of Pandas DataFrame offers a significant performance boost. It is considered a best practice to maximize the use of Snowpark DataFrames whenever possible.
  • Vectorized UDFs offer a performance boost for numerical computations, making them ideal in such scenarios. For non-numeric computations, however, vectorized UDFs can degrade performance, so regular (non-vectorized) UDFs are the better choice there (see the sketch after this list).
  • It is recommended to minimize data retrieval directly into language-specific stored procedures. SPs execute on a single node within the virtual warehouse, limiting computational resources. SPs are most effective for controlling flow, acting as the main script, and initializing objects and maintaining state. Data-intensive computations within SPs should use the DataFrame API or separate UD(T)Fs for improved performance.
  • Caching Snowpark DataFrames can optimize performance by avoiding repetitive transformations. However, excessive use should be avoided as it may lead to performance issues. Selectively applying cache operations and considering query frequency helps prevent disk spillage and unnecessary processing time.
  • While the Snowpark client API bears similarities to the PySpark API, it is crucial to remember that Snowpark generates an underlying SQL query. Therefore, when using Snowpark, it is advisable to follow SQL best practices.
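
As a minimal sketch of the vectorized-UDF point above (names are hypothetical), the handler below receives a pandas Series per batch instead of one value per call, so the numeric work is done as whole-batch arithmetic:

```python
# A minimal sketch of a vectorized (batch) UDF; names are hypothetical and
# registration assumes an active Snowpark session with pandas available.
import pandas as pd
from snowflake.snowpark.functions import pandas_udf
from snowflake.snowpark.types import FloatType


@pandas_udf(name="fahrenheit_to_celsius", input_types=[FloatType()],
            return_type=FloatType(), replace=True)
def fahrenheit_to_celsius(temps: pd.Series) -> pd.Series:
    # The handler receives a whole batch as a pandas Series, so the numeric work is
    # done as vectorized arithmetic rather than a per-row Python loop.
    return (temps - 32.0) * 5.0 / 9.0
```

The caching recommendation maps to DataFrame.cache_result(), which materializes an intermediate result in a temporary table so that downstream queries reuse it instead of recomputing the transformations; as noted above, apply it selectively.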

Conclusion

In the second part of our Snowpark guide, we examined the limitations of Snowpark Stored Procedures (SPs) and the solution offered by User-Defined (Table) Functions (UD(T)Fs) to overcome these limitations. We looked into the execution flow of UD(T)Fs, which involves Snowflake using multiple interpreter processes to execute Python functions concurrently. Furthermore, we shared best practices for Snowpark usage. By adhering to these practices and harnessing the distributed computing capabilities of Snowpark, users can fully unleash Snowpark's potential for scalable, high-performance, and efficient data processing and analytics.


Blogger's Profile

Poorna Chand Addala

Senior Data Engineer, Snowflake Center of Excellence, LTIMindtree

Poorna is a valued member of LTIMindtree's Snowflake COE, where he serves as a Senior Data Engineer. He passionately invests his time in augmenting his expertise and refining his mastery in the realm of data engineering. Poorna's insatiable curiosity drives him to dive deep into his areas of interest, helping him stay at the forefront of advancements. He relishes the challenge of tackling complex technical problems and applies his creativity to find innovative solutions.
