Data is growing fast, and that’s a fact. Every action we take, whether online or offline, is inevitably ingested into some database somewhere for some reason. Companies are collecting and generating hoards of data, but all this big data only means something if it can be contained, processed, queried, and analyzed… oh, and located!
The extreme popularity of the data science profession is a testament to the competition companies face to ensure their data is primed for value-added, actionable insights. However, the growth in the volume, variety and velocity of data runs in parallel with the risk that companies lose control over the latent insights contained within their data stores. Why might this be? It could be as simple as not understanding which data is available in the company, where it’s located, what it’s useful for, and whether it’s trustworthy.
Lyft’s solution to this dilemma is Amundsen: “a data discovery application built on top of a meta data engine…. improving the productivity of our data users by providing a search interface for data”.
On July 9th, almost two hundred people tuned in to listen to Amundsen product manager Mark Grover speak with Neo4j about the success and vision of the product. Truthfully, I was surprised more people hadn’t watched the live chat!
In it, Mark Grover discussed how important context and variety are to the work skilled professionals do within the organization when working toward data-driven insights.
Importantly, users within the organization were having trouble discovering the data that would enable the insights needed to successfully compete and operate the ride-sharing app, Lyft.
Does the data needed exist? Has someone used it for analysis? Is it validated? How is it shaped? What are its dimensions? Is it related to other data sets? Without the ability to answer these questions properly and efficiently, analysts were limited in what they could do, and this led to the problem of data discovery.
Neo4j is a graph database platform. A graph database uses graph structures for semantic queries, with nodes, edges and properties representing and storing the data. This is quite different from a relational database, which stores data in tables and is typically queried with SQL. Graph databases are usually counted as part of the NoSQL family.
Graph databases treat the relationships between data as equally important to the data itself. Neo4j, then, holds data without constricting it to a pre-defined model but instead shows how each entity connects or is related to others.
A great example of graph databases is in representing social networks, where the relationships between data points (i.e. people) are as important as the information contained on the entity itself. After all, we live in a connected world, and to capture this we need the proper database! (It really is a beautiful symmetry!)
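To make the social network example a little more concrete, here is a minimal sketch in plain Python of a property graph: nodes with properties, and typed edges that carry properties of their own. This is not Neo4j's API or storage model. The class `PropertyGraph`, the `FRIEND` relationship type, and the people in it are all made up for illustration.

```python
from collections import defaultdict


class PropertyGraph:
    """A toy in-memory property graph (illustrative only, not Neo4j)."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties dict
        self.edges = defaultdict(list)  # node id -> [(rel_type, target id, properties)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, rel_type, target, **props):
        self.edges[source].append((rel_type, target, props))

    def neighbors(self, node_id, rel_type):
        # Follow outgoing edges of one relationship type.
        return [t for (r, t, _) in self.edges.get(node_id, []) if r == rel_type]


def befriend(graph, a, b, since):
    # Friendship is mutual, so store the relationship in both directions.
    graph.add_edge(a, "FRIEND", b, since=since)
    graph.add_edge(b, "FRIEND", a, since=since)


def friends_of_friends(graph, person):
    # Walk two FRIEND hops, then drop direct friends and the person themselves.
    direct = set(graph.neighbors(person, "FRIEND"))
    second = set()
    for friend in direct:
        second.update(graph.neighbors(friend, "FRIEND"))
    return sorted(second - direct - {person})


graph = PropertyGraph()
for name in ["alice", "bob", "carol", "dave"]:
    graph.add_node(name, kind="Person")

befriend(graph, "alice", "bob", since=2015)
befriend(graph, "bob", "carol", since=2017)
befriend(graph, "carol", "dave", since=2018)

print(friends_of_friends(graph, "alice"))  # ['carol']
```

The point of the sketch is that the question "who are my friends' friends?" is answered by following relationships directly, rather than by joining tables.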
Amundsen’s full-fledged metadata repository, built on top of Neo4j, makes it possible to build applications on this architecture, such as those for 1) trust, 2) compliance, and 3) ETL and data quality across the organization.
Lyft wanted trust to be embodied in the solution itself: can an analyst or data scientist find and locate the data they require, and trust its quality?
Amundsen uses the graph database platform to study the relationships between the metadata entities it describes: data stores, dashboards, events, schemas, stream jobs, processing jobs, and the people in the company.
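As a rough illustration of what relationships between metadata entities could look like, the sketch below models a few of them as labeled nodes and typed relationships. It is an assumption-laden toy, not Amundsen's actual schema: the labels (`Table`, `User`, `Dashboard`), the relationship types (`OWNS`, `READS`, `BUILT_ON`), and every name in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str  # kind of metadata entity, e.g. "Table" or "User" (hypothetical labels)
    name: str


@dataclass(frozen=True)
class Relationship:
    source: Node
    rel_type: str
    target: Node


# Hypothetical metadata entities.
rides = Node("Table", "rides")
drivers = Node("Table", "drivers")
alice = Node("User", "alice")
bob = Node("User", "bob")
carol = Node("User", "carol")
weekly_rides = Node("Dashboard", "weekly_rides")

# Hypothetical relationships between them.
relationships = [
    Relationship(alice, "OWNS", rides),
    Relationship(bob, "READS", rides),
    Relationship(carol, "READS", rides),
    Relationship(weekly_rides, "BUILT_ON", rides),
    Relationship(bob, "READS", drivers),
]


def sources(target, rel_type):
    """Names of all nodes that point at `target` with the given relationship type."""
    return sorted(
        r.source.name
        for r in relationships
        if r.target == target and r.rel_type == rel_type
    )


print(sources(rides, "OWNS"))      # ['alice']
print(sources(rides, "READS"))     # ['bob', 'carol']
print(sources(rides, "BUILT_ON"))  # ['weekly_rides']
```

With people, tables, and dashboards in one graph, "who owns this table?" and "what depends on it?" become the same kind of lookup.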
Not only has Amundsen seen a really high adoption rate and Customer Satisfaction (CSAT) score, it has also removed a major barrier to data discovery, driving the time to discover an artifact down to 5% of the pre-Amundsen baseline. Users are discovering more data in a shorter time, and with a higher degree of trust.
“At Lyft, what we observed was that the while we wanted the majority of the time to be spent in model development (aka prototyping) and productization, a lot of the time was being spent in data discovery.” – Mark Grover
Amundsen is a solution to this. Amundsen uses Neo4j as a comprehensive, centralized graph backend for data discovery. Its ranking operates like PageRank: the more a table is queried, the higher that data set’s trust ranking. Other factors, like how many times that page was populated, also serve important functions in the trust and reliability metrics.
A table’s ‘details’ page shows the up-to-date schema and descriptions. Additional metadata is offered, such as who owns the table and who uses it. Analysts get a quick preview of the data, its profile, and its shape. The goal of the data page is to help you quickly figure out whether you trust this data. Then the user can run an analysis using the data.
Describing the future experience of the data analyst or data scientist, Mark Grover envisions a work space where “information is pushed to you that is relevant to you” instead of searching around for tribal knowledge and getting caught up in data search.
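The idea that heavily queried tables should surface first can be sketched in a few lines. This is only a simplified stand-in that sorts search matches by query count. It is not PageRank itself and not Amundsen's actual ranking, and the table names and counts are invented.

```python
# Hypothetical query counts per table (for example, gathered from query logs).
query_counts = {
    "rides": 1200,
    "rides_archive_2017": 15,
    "ride_estimates": 340,
    "drivers": 800,
}


def search(term, counts):
    """Return tables whose name contains `term`, most-queried first."""
    matches = [name for name in counts if term in name]
    # Sort by descending query count; break ties alphabetically for stable output.
    return sorted(matches, key=lambda name: (-counts[name], name))


print(search("ride", query_counts))
# ['rides', 'ride_estimates', 'rides_archive_2017']
```

A rarely queried archive table still appears in the results, but it sinks below the tables that colleagues actually rely on.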
For more detailed information on the product development of Amundsen, Mark Grover’s blog for Lyft and the corresponding GitHub repositories can be found below.
Below are relevant links to Lyft engineering, Neo4j and the Amundsen project GitHub repository.