
Snowflake has announced Polaris Catalog, an open catalog implementation for Apache Iceberg, and says it will open source it within the next 90 days. This gives enterprises and the Iceberg community new levels of choice, flexibility, and control over their data.
While Snowflake offers its own highly efficient proprietary database table format, there are compelling reasons to use an open standard like Apache Iceberg. For example, it means you can store data in an open way in one repository and connect a wide, diverse range of tools to it, without having to transport and duplicate data all over the place.
Snowflake's announcement about the Polaris Catalog means organisations of all sizes can store their data using the open standard of choice for data lakehouses, data lakes, and other modern architectures, with full enterprise security and without vendor lock-in. Polaris Catalog will interoperate equally well with AWS, Confluent, Dremio, Google Cloud, Microsoft Azure, Salesforce, and more.
Going back a step, when it comes to open formats for tabular data, the perennial comma-separated values (CSV) format is always up there. Of course, CSV is slow: if your data is in CSV, every row needs to be scanned in full when searching for data. Thus rose Parquet, a column-based file format that lets data tools scan files more rapidly because each column's values are stored together with well-defined start and end points, so a query can read only the columns it needs. Yet this still pales in comparison to the performance of a proprietary database table with indexes, statistics, and a myriad of other capabilities. In came Apache Iceberg, the clear frontrunner in open table standards, which essentially augments Parquet with additional metadata to greatly accelerate performance. While Apache Iceberg only graduated to a top-level Apache project in May 2020, it has surged in popularity due to its efficiency and, importantly, the peace of mind it gives enterprises, which can store vast quantities of data without the risk that their data will be trapped in a closed ecosystem.
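To make the difference concrete, here is a minimal sketch in plain Python. It is not how Parquet or Iceberg are actually implemented; the sample data, the file names, and the function names are all invented for illustration. It contrasts a row-by-row CSV scan with a column-oriented read, then shows how per-file minimum and maximum statistics, similar in spirit to the metadata a table format keeps, let a query skip files entirely.

```python
import csv
import io

# Hypothetical sample data, for illustration only.
CSV_TEXT = """order_id,region,amount
1,APAC,120
2,EMEA,80
3,APAC,200
4,AMER,50
"""


def total_from_csv(text: str, column: str) -> int:
    """Row-oriented scan: every row is parsed in full to read one field."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(int(row[column]) for row in reader)


def to_columns(text: str) -> dict:
    """Pivot the rows into a column-oriented layout (one list per column)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: [row[name] for row in rows] for name in rows[0]}


def total_from_columns(columns: dict, column: str) -> int:
    """Columnar scan: only the requested column is touched."""
    return sum(int(value) for value in columns[column])


# Hypothetical per-file statistics, standing in for table-level metadata.
FILES = [
    {"name": "part-0", "min_amount": 50, "max_amount": 120},
    {"name": "part-1", "min_amount": 200, "max_amount": 900},
]


def files_to_scan(files: list, threshold: int) -> list:
    """Keep only files that could hold an amount >= threshold."""
    return [f["name"] for f in files if f["max_amount"] >= threshold]


if __name__ == "__main__":
    print(total_from_csv(CSV_TEXT, "amount"))
    print(total_from_columns(to_columns(CSV_TEXT), "amount"))
    print(files_to_scan(FILES, 150))
```

Both totals come out the same; what differs is how much data has to be read to get there. The last function is the part that grows in value with scale: the more files a table has, the more a query saves by ruling them out from metadata alone.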
“Organisations want open storage and interoperable query engines without lock-in. Now, with the support of industry leaders, we are further simplifying how any organisation can easily access their data across diverse systems with increased flexibility and control,” said Snowflake EVP product Christian Kleinerman. “Polaris Catalog extends Snowflake’s commitment to Apache Iceberg as the open standard of choice, and signals the intent from industry leaders in enabling customers and the wider Iceberg community to harness their data through an open and neutral approach, empowering cross-engine interoperability on that data.”
Polaris Catalog isn't Snowflake's first foray into Iceberg. The company announced initial support for Iceberg over two years ago as part of its external tables offering. That is, a company could store its data in the Iceberg format in cheap, commodity cloud storage like AWS S3 or Cloudian, then connect to it and work with it from Snowflake by setting up an external stage and defining external tables. Back then, what was possible with an external table was limited compared to native Snowflake tables - no row-level security, for example - but this could be worked around with materialised views or other techniques. The biggest benefits Iceberg external tables gave Snowflake customers were the flexibility to use the exact same data store that other data tools were using, and a shift in cost. Let's say you're loading data from a range of sources into Snowflake and then acting on it. As Snowflake charges based on usage, you pay for that data load whether or not users query the data. Using Iceberg in external storage, you only pay for Snowflake compute when you do something with the data.
It wasn't without other cost and effort, however. You needed to pay your storage provider, and you still had to load data into the Iceberg format, which invariably meant spinning up Apache Spark. So now you had more vendors and more servers.
Polaris Catalog is not the Iceberg external tables model of the past; it brings a single, centralised place for any engine to find and access an organisation's Iceberg tables with full, open interoperability. Polaris Catalog implements Iceberg's open-source REST protocol, so any engine that supports the API, such as Apache Spark, Apache Flink, Dremio, Python, Trino, and others, can use it to access and retrieve data. Or, you can use none of these and have Polaris Catalog manage your Iceberg data with no other server to stand up.
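As a rough idea of what connecting from Python might look like, the sketch below uses the open-source PyIceberg library, which can talk to any catalog that speaks the Iceberg REST protocol. It is an assumption that a Polaris Catalog endpoint would be reached this way; the endpoint URL, warehouse name, credential, and catalog name are placeholders, not values published by Snowflake.

```python
def rest_catalog_properties(uri: str, warehouse: str, credential: str) -> dict:
    """Build the connection properties PyIceberg expects for a REST catalog."""
    return {
        "type": "rest",            # use the Iceberg REST catalog client
        "uri": uri,                # base URL of the catalog's REST endpoint
        "warehouse": warehouse,    # catalog/warehouse identifier
        "credential": credential,  # "client_id:client_secret" for OAuth
    }


def open_catalog(name: str, properties: dict):
    """Connect to the catalog. Requires `pip install pyiceberg`."""
    # Imported lazily so the helper above works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **properties)


if __name__ == "__main__":
    # Placeholder values: replace with details from your own deployment.
    props = rest_catalog_properties(
        uri="https://example.invalid/api/catalog",
        warehouse="demo_warehouse",
        credential="my_client_id:my_client_secret",
    )
    print({key: value for key, value in props.items() if key != "credential"})
    # Uncomment once the placeholders point at a real REST catalog:
    # catalog = open_catalog("polaris", props)
    # print(catalog.list_namespaces())
```

The point of the REST protocol is that nothing in this snippet is specific to one vendor's engine: the same endpoint and credentials could, in principle, be handed to Spark, Flink, or Trino instead.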
Snowflake says Polaris Catalog will be available as a Snowflake-hosted option in public preview soon, and is also available for self-hosting in your own infrastructure as a containerised app. Snowflake also says it will open source Polaris Catalog within 90 days, and organisations can then freely swap their hosting infrastructure and avoid any vendor lock-in.
Snowflake has further committed to contributing to the Apache Iceberg standard and continuing to build on its existing partnership with Microsoft that allows seamless interoperability between Snowflake and Microsoft Fabric.
“From day one at Microsoft, we’ve been focused on empowering every user on the planet to achieve more, and this starts with a strong data foundation. Through our support and contributions to open data standards, including Delta Parquet, Apache Iceberg, and Apache XTable, we’re furthering this mission by enabling organisations with a new level of open data interoperability, so they can do more with their data,” said Microsoft corporate VP Azure Data Arun Ulagaratchagan. “Snowflake continues to serve as a strategic partner of ours, and we’re excited by their willingness to work with the Iceberg community on an open catalog to empower our joint customers and the wider open-source community with more flexibility and control over their open Iceberg data.”
Snowflake further reiterated its commitment to its proprietary table format. Snowflake EVP engineering Greg Czajkowski confirmed to iTWire that the company had no plans to deprecate it. "More than half of the Fortune 500 are Snowflake customers," he said, explaining that customers can have very long development cycles and require stability for numerous years.
Snowflake head of Data Lake and Iceberg Ron Ortloff added that Iceberg ticks the box for three kinds of customer concerns.
- It is an open table format, and thus serves customers who see value in things that avoid lock-in, and give freedom of choice and architecture.
- It eliminates the need to make redundant copies of data. He says some customers tell him they have a three-petabyte data lake, but if it were de-duplicated, it's closer to one petabyte. Using Iceberg, customers can have one copy of their data accessed by multiple tools, without needing to copy data to more and more apps.
- It allows customers to use their data with whatever engine makes sense for them. "You never see an architecture diagram with only one logo," Ortloff joked.
However, Ortloff notes, Iceberg writes to a cloud storage account that requires separate administration. "There is still a class of customer that values a single platform, single bill, and end-to-end play," he said.
Ortloff said there is still a lot on the Snowflake roadmap for Iceberg - "credential delegation, encryption, handing off encryption keys, a ton of stuff will continue to grow," he said.
In other open-source Snowflake news, the company also recently announced Snowflake Arctic, one of the most open, enterprise-grade large language models (LLMs) on the market. As part of Snowflake’s commitment to open source, it released not only Arctic’s weights under an Apache 2.0 license but also extensive details of how it was trained, through a series of cookbooks. In addition, Snowflake supports the Streamlit open-source community, which now has over 275K monthly active developers and over six million monthly application views. Since Snowflake acquired Streamlit in March 2022, the open-source community has continued to flourish, growing over 500% in the past two years, as Snowflake and Streamlit continue to invest in cutting-edge open-source advancements for developers.