Whether to implement a Data Warehouse or a Data Lake is often a debatable question. A few weeks ago a client asked me exactly that. The direct answer is: it depends. So I thought it would be better to share my thoughts, explain a little about both the Data Warehouse and the Data Lake, and describe why you would choose each, so that it is easier for you to decide which implementation is right for your requirement.
A Data Warehouse can stay on-premises; it is not really necessary to move it to the cloud unless there is a requirement such as scalability or faster processing of complex queries with the help of an MPP (massively parallel processing) architecture. In a data warehouse we store cleansed, aggregated, filtered, structured data, and we always implement it for a target audience, meaning we know who the end users are and what we are going to produce for them. For example, finance data is stored in the Data Warehouse to serve analytic reports and dashboards to the CEO, CFO or finance team. In general, though, there will be multiple users at different levels of an organization. In the education domain, a properly designed Data Warehouse in a school can serve insights to everyone from students and parents to the principal. So here we know which sort of analytics we are going to deliver, and based on that we model the data warehouse with the right granularities.
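To make the idea of "right granularities" a bit more concrete, here is a minimal sketch in plain Python. The sample rows, the column order and the `average_by` helper are all made up for illustration; a real warehouse would do this with dimensional tables and SQL rather than in-memory tuples. It only shows how the same cleansed data is rolled up differently once we know the audience.

```python
from collections import defaultdict

# Hypothetical cleansed source rows: (student, term, subject, score)
rows = [
    ("Asha", "T1", "Math", 80),
    ("Asha", "T1", "Science", 70),
    ("Ben", "T1", "Math", 60),
    ("Ben", "T1", "Science", 90),
]


def average_by(rows, key):
    """Average the score column, grouped by whatever `key` extracts from a row."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of scores, row count]
    for row in rows:
        group = key(row)
        totals[group][0] += row[3]
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Grain for students and parents: one figure per student per term.
per_student = average_by(rows, key=lambda r: (r[0], r[1]))

# Grain for the principal: one figure per subject per term, across the school.
per_subject = average_by(rows, key=lambda r: (r[2], r[1]))
```

Because the audience is known up front, both roll-ups can be designed and stored ahead of time instead of being worked out at query time.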
In contrast to a Data Warehouse, a Data Lake is a pool of raw data, predominantly implemented in a cloud such as Azure or AWS. This means a data lake will contain data in different formats from various sources such as operational databases, real-time events, social media feeds, and media such as images, video and audio. Essentially, we are ingesting and storing Big Data here. An interesting thing is that we really don’t know who the end users are or who will be using it in the future. One of the key advantages of a data lake implementation is that different users can apply different transformation logic based on their own unique requirements. With a data lake you can serve analytic reports to typical business users, and it can also be a source for data scientists who create Machine Learning models for image processing, sentiment analysis of Twitter feeds, or predictive analytics for corporate finance. Theoretically, there is no limit to the use cases.
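The point about different users applying their own transformation logic can be sketched roughly as follows. The event fields (`source`, `amount`, `text`, `region`) and the function names are assumptions made up for this example, and a real lake would sit on cloud object storage rather than a Python list. The idea is only that the raw data is stored once, as it arrived, and each consumer shapes it when reading.

```python
import json

# Hypothetical raw events kept as-is in the lake, one JSON document per line.
raw_events = [
    '{"source": "pos", "amount": 120.5, "region": "west"}',
    '{"source": "twitter", "text": "Love the new store!", "region": "west"}',
    '{"source": "pos", "amount": 80.0, "region": "east"}',
]


def read_lake(lines):
    """Parse the raw lines; no schema is enforced when the data is stored."""
    return [json.loads(line) for line in lines]


def sales_by_region(events):
    """Business-user view: total point-of-sale amount per region."""
    totals = {}
    for event in events:
        if event["source"] == "pos":
            region = event["region"]
            totals[region] = totals.get(region, 0.0) + event["amount"]
    return totals


def tweet_texts(events):
    """Data-scientist view: raw tweet text, e.g. as input to a sentiment model."""
    return [event["text"] for event in events if event["source"] == "twitter"]


events = read_lake(raw_events)
report = sales_by_region(events)
corpus = tweet_texts(events)
```

Neither consumer had to be known when the events were ingested, which is what lets new use cases show up later.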
Summary: If your plan is to serve traditional Business Intelligence reporting, and moving data into the cloud is a concern for reasons like data privacy, then a Data Warehouse solution would work for you. But if you plan to have BI reporting plus more advanced workloads, such as processing streaming sensor data, processing web click activity, real-time social media analytics like sentiment analysis, or advanced analytics using Machine Learning, then a Data Lake would be the solution for your scenario.
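As a rough rule of thumb, the summary above could be written down like this. The workload labels and the `recommend` function are purely illustrative and heavily simplified; a real decision would also weigh cost, skills, data privacy and existing infrastructure.

```python
# Hypothetical labels for the "more advanced" workloads named in the summary.
ADVANCED_WORKLOADS = {
    "streaming_sensor_data",
    "web_click_activity",
    "social_media_analytics",
    "machine_learning",
}


def recommend(workloads):
    """Suggest a platform from a collection of planned workload labels."""
    if set(workloads) & ADVANCED_WORKLOADS:
        return "Data Lake"
    return "Data Warehouse"


bi_only = recommend(["bi_reporting"])
bi_plus_ml = recommend(["bi_reporting", "machine_learning"])
```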
Blog courtesy: Nisal Mihiranga