Difference between Hadoop and Google's Cloud Dataflow

Hey all, could anyone please tell me the major differences between Hadoop and Google’s Cloud Dataflow?

Hey Michelle, here are the major differences between Hadoop and Google’s Cloud Dataflow.

Google’s Cloud Dataflow is a fully managed service for creating data pipelines that ingest, transform, and analyze data in both batch and streaming modes. It is the successor to MapReduce and is based on Flume and MillWheel. It makes it easy to get actionable insights from data while lowering operational costs, without the hassle of deploying, maintaining, or managing infrastructure. It can be used for use cases like ETL, batch data processing, and streaming analytics, and it automatically optimizes, deploys, and manages the code and resources required.

Google Cloud Dataflow is meant to replace MapReduce, the software at the heart of Hadoop and other big data processing systems. MapReduce was originally developed at Google and later open-sourced, but it is no longer used by Google; it has been replaced by Flume and MillWheel. The former lets you manage parallel pipelines for data processing, which MapReduce does not provide on its own. The latter is described as “a framework for building low-latency data processing applications”. Dataflow is also touted as superior to MapReduce in the amount of data it can process efficiently, since MapReduce performs poorly once data reaches the multi-petabyte range.

The greatest distinction between Hadoop and Google Cloud Dataflow, though, lies in where and how each is most likely to be deployed. In Hadoop, the data is processed where it sits, so Hadoop is a data store as well as a data processing system. Cloud Dataflow, by contrast, is more likely to be used to enhance applications already written for Google Cloud, where the data resides directly in Google’s systems or is being collected there.
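If it helps to see what a Dataflow pipeline looks like, here is a minimal word-count sketch using the Apache Beam Python SDK (the SDK that Dataflow pipelines are written with today). The bucket paths and the word-count logic are just hypothetical examples; the same code runs locally by default or on Cloud Dataflow if you point the options at your project.

```python
# Minimal Apache Beam word-count sketch (hypothetical paths).
# By default this uses the local DirectRunner; to run on Cloud Dataflow,
# construct the options with e.g.
#   PipelineOptions(runner="DataflowRunner", project="my-project", region="us-central1",
#                   temp_location="gs://my-bucket/tmp")
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # defaults to the local DirectRunner

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")   # ingest
        | "Split" >> beam.FlatMap(lambda line: line.split())           # transform
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)                    # aggregate
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output")
    )
```

The nice part is that the pipeline code itself doesn’t change between batch and streaming or between runners; Dataflow takes care of provisioning and scaling the workers for you.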
Hope this helps you :slight_smile: