Why wait to process data in hourly or daily batches when you can have correct results now? Streaming technologies have reached a level of maturity sufficient for mainstream adoption. With this practical book, data engineers, data scientists, and developers will learn how to work with streaming data in a conceptual and platform-agnostic way.
This handy pocket reference explains the what, where, when, and how of processing real-time data streams. You'll learn:
- Core principles and concepts behind robust out-of-order data processing
- Strategies for choosing data processing windows
- How watermarks track progress and completeness in infinite datasets
- How the concepts of streams and tables form the foundations of both batch and streaming data processing
- How time-varying relations provide a link between stream processing and the world of SQL and relational algebra
- Modern technologies used in the streaming ecosystem
For a more detailed look at stream processing, check out O'Reilly's Streaming Systems by Tyler Akidau, Slava Chernyak, and Reuven Lax.
About the Author: Tyler Akidau is a principal software engineer at Snowflake. Previously a senior staff software engineer at Google, he was the technical lead for the Data Processing Languages & Systems group, responsible for Google's Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. He is also a founding member of the Apache Beam PMC. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer in batch and streaming as two sides of the same coin, with the real endgame for data processing systems being the seamless merging of the two. He is the author of the 2015 Dataflow Model paper and the Streaming 101 and Streaming 102 articles on the O'Reilly website. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.
Slava Chernyak is a senior software engineer at Google Seattle. Slava spent over five years working on Google's internal massive-scale streaming data processing systems and has since become involved with designing and building Windmill, Google Cloud Dataflow's next-generation streaming backend, from the ground up. Slava is passionate about making massive-scale stream processing available and useful to a broader audience. When he is not working on streaming systems, Slava is out enjoying the natural beauty of the Pacific Northwest.
Reuven Lax is a senior staff software engineer at Google Seattle and has spent the past nine years helping to shape Google's data processing and analysis strategy. For much of that time he has focused on Google's low-latency, streaming data processing efforts, first as a long-time member and lead of the MillWheel team, and more recently founding and leading the team responsible for Windmill, the next-generation stream processing engine powering Google Cloud Dataflow. He's very excited to bring Google's data processing experience to the world at large, and proud to have been a part of publishing both the MillWheel paper in 2013 and the Dataflow Model paper in 2015. When not at work, Reuven enjoys swing dancing, rock climbing, and exploring new parts of the world.
Austin Bennett designs data systems that help move and share data, gather insights, and develop data products efficiently.