Budapest Data 2015: Full Schedule

13:30 CEST

Designing Agile Data Pipelines

Agile software development values responding to change over following a plan. Responding to change means allowing data scientists to experiment with data and allowing developers to easily modify data processing, and even make mistakes, without taking huge risks. Well-designed data pipelines give organizations flexible data analysis. In this session we'll show how to design architectures that make it easy and safe to extend and modify data analysis software.

We will look at how to design an agile data processing architecture using Apache Hadoop, Apache Kafka and stream processing frameworks. The architectures we’ll discuss make it easy to add new data sources, experiment with new analysis algorithms and correct data processing errors. All this makes the data pipeline both flexible and safe.

Speakers

Ashish Singh

Software Engineer, Cloudera

Ashish Singh is a Software Engineer working with Cloudera to empower the Hadoop ecosystem to answer bigger questions. He contributes to Apache Kafka, Hive, Parquet and Sentry. Prior to joining Cloudera, he worked on optimizing MPI collective communications on High Performance Computing...

Thursday June 4, 2015 13:30 - 14:00 CEST
Mátyás II.

Stream

14:05 CEST

Real-time data processing with Apache Flink

Flink Streaming is a distributed, fault-tolerant, real-time data processing engine provided by the Apache Flink data analytics platform. It is currently programmable in Java and Scala using stateful functional operators including map, aggregations and temporal joins amongst many others. The streaming API also features flexible windowing semantics to express a wide variety of business logic.

In the Flink runtime layer, both batch and streaming jobs are executed as a common data flow graph, thus unifying batch and stream processing in an elegant way. Flink provides a more straightforward and transparent approach than the lambda architecture or other state-of-the-art solutions. Flink also provides exactly-once processing guarantees for streaming programs through a combination of upstream backup and consistent user state snapshots.

The highly efficient runtime layer offers competitive performance compared to current streaming solutions, together with a rich and expressive API. This talk will focus on the API and runtime features of Flink Streaming in comparison with current industry-standard streaming solutions.

Speakers

Gyula Fóra

Researcher, Distributed Systems, SICS

Gyula is a committer and PMC member for the Apache Flink project, currently working as a researcher at the Swedish Institute of Computer Science. His main expertise and interest are real-time distributed data processing frameworks and their connections to other big data applications...

Thursday June 4, 2015 14:05 - 14:35 CEST
Mátyás II.

Stream

14:40 CEST

STREAMLINE: learning from data streams with Apache Flink

STREAMLINE is the research project of TU Berlin, SICS Stockholm and SZTAKI Budapest for reducing system and human latencies in the analytics of high-speed data streams. On top of Apache Flink, we:

Develop automatic optimization, parallelization, and system adaptation technologies that reduce the programming expertise required of data scientists, thereby enabling them to focus more freely on domain-specific matters.

Overcome the complexity of the so-called ‘lambda architecture’ by delivering simplified operations that jointly support “data at rest” and “data in motion” in a single system that is compatible with the Hadoop ecosystem.

Develop new machine learning technologies capable of reacting very quickly to changes in the stream.

In the presentation, we show the results of our experiments on telecommunication and recommendation use cases.

Speakers

Benczúr András

Head of Big Data Research Group, MTA SZTAKI

András Benczúr is the head of the Informatics Laboratory, a group of 30 doctoral students, post-docs and developers. Benczúr received his Ph.D. at the Massachusetts Institute of Technology in 1997; since then, his interest has turned to Information Retrieval and Web Search. He was representing SZTAKI...

Thursday June 4, 2015 14:40 - 15:10 CEST
Mátyás II.

Stream

15:40 CEST

Bootstrap Real Time pipeline in 30 minutes

In a world where every "Thing" is producing lots of data, ingesting and processing that large volume of data becomes a big problem. In today's dynamic world, firms have to react to changing conditions very fast or, even better, in real time. In this talk we will take on this interesting challenge using the latest and greatest tools from the Big Data community. We will try to combine the awesomeness of Kafka, a resilient pub-sub messaging system, with the power of Spark Streaming for scalable, high-throughput, fault-tolerant processing of live data streams. Combining different systems to get an even more powerful system is great, but it brings its own complexity. With a demo of building a pipeline to ingest and process real-time data using these systems, we will explore how the two can be intertwined to make the most of the combined system.

Speakers

Ashish Singh

Software Engineer, Cloudera

Thursday June 4, 2015 15:40 - 16:10 CEST
Mátyás II.

Stream
