

Israeli startup Treeverse is developing dataset copy management and version control for data pipeline builders with its open source Lakefs product.

Analytics and AI/ML data supply pipelines depend on the consistent, repeatable, and reliable delivery of clean datasets extracted from source data lakes. Such pipelines are the equivalent of software programs, and they take effort and time to develop and test. The testing effort requires datasets for the pipelines to operate on, and these are mostly copies of a source dataset. If snapshot copies of the source dataset are made, they can be used to create virtual copies: timestamped sets of pointers to the source dataset that can be used for pipeline development. Lakefs says it brings such version control to datasets.

Co-founder and CEO Einat Orr told an IT Press Tour audience in Tel Aviv: “LakeFS is an open source project that provides data engineers with versioning and branching capabilities on their data lakes, through a Git-like version control interface.” This enables zero-copy dev/test isolated environments, continuous quality validation, atomic rollback on bad data, reproducibility, and more, the company said.

Treeverse is creating software tools to aid engineering best practices for data practitioners. Orr and co-founder CTO Oz Katz worked at SimilarWeb, where, they told us, a small bug caused 1 PB of data to be lost. The recovery of thousands of tables took three weeks. It would have been much simpler to do that if the datasets involved had had version control, said Orr. That realization spurred the two to start up their business and create the Lakefs software product.

The same problem exists with data warehouses and their structured data, Orr said, but there it was relatively minor. With data lakes, unstructured data joined the structured data, and dataset complexity and scale grew and grew.

The Lakefs concept starts with a main dataset, from which snapshot (virtual) copies – branches – are made. These branches proliferate and are managed as if in a tree, with the Lakefs software traversing the tree. The accessing systems used by pipeline developers run a Lakefs client, which contacts a Lakefs server.
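As a sketch of how that Git-like workflow looks in practice, the snippet below creates a zero-copy branch, writes to it, commits, and merges it back. It assumes the high-level lakefs Python SDK (pip install lakefs) with the server endpoint and credentials picked up from the client's usual configuration; the repository, branch, and path names are illustrative, and the method names follow the SDK's published quickstart, so check them against the current documentation.

```python
import lakefs  # high-level lakeFS Python SDK: pip install lakefs

# Open an existing repository; "example-repo" is a placeholder name.
repo = lakefs.Repository("example-repo")

# Zero-copy branch: only pointer metadata is created, the source data
# on main is not duplicated.
dev = repo.branch("dev-test").create(source_reference="main")

# Write a pipeline test artifact to the isolated branch.
with dev.object(path="reports/summary.csv").writer(content_type="text/csv") as out:
    out.write("metric,value\nrows_processed,12345\n")

# Committing gives this branch state a referencable, timestamped identity.
dev.commit(message="pipeline test run")

# If validation passes, fold the change back into main atomically.
dev.merge_into(repo.branch("main"))
```

Because a new branch is only a set of pointers until something is written to it, creating one is effectively instantaneous regardless of dataset size, which is what makes the isolated dev/test environments zero-copy.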

The source data is stored in object buckets – S3 in AWS, Azure Blob, GCP, and MinIO – with Lakefs creating deduplicated metadata and pointers to the source data, and managing and operating on those.

Describing the technology, Orr said: “Lakefs brings Git-like operations to data with version control. When a virtual copy is made of a source dataset, it becomes data (a set of pointers) as far as Lakefs is concerned, and both this and the source dataset metadata are stored in an S3 bucket. This provides access to dataset metadata.”
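Lakefs also documents an S3-compatible endpoint in front of those buckets, so existing S3 tooling can be pointed at a Lakefs server unchanged. The sketch below uses boto3; the endpoint, credentials, repository, and key names are all placeholders, with the repository addressed as the bucket and the branch as the first component of the object key.

```python
import boto3  # standard AWS SDK; lakeFS speaks the S3 wire protocol

# Endpoint and keys are placeholders: point them at your lakeFS server
# and the credentials it issued, not at AWS itself.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

# Read a file from the "main" branch of the "example-repo" repository.
obj = s3.get_object(Bucket="example-repo", Key="main/events/part-000.parquet")
payload = obj["Body"].read()

# Writing the same path on a "dev-test" branch leaves main untouched
# until (and unless) the branch is merged back.
s3.put_object(Bucket="example-repo", Key="dev-test/events/part-000.parquet", Body=payload)
```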
