How Azure Data Factory can help process huge files
One of our clients recently faced an issue with loading and summarizing increasing volumes of data using a legacy database. In addition to a sizable bulk of historic data, large amounts of additional data were being generated regularly. By this point the data had become so voluminous as to push the limits of the existing system’s processing capacity. It was starting to consume unreasonable amounts of processing time as well as the end users’ time for report generation.
A wider issue
Provision of services, billing, transaction processing, and almost every facet of customer interaction is becoming increasingly data-driven. Organizations are generating ever increasing amounts of more versatile and complex data. Moreover, they are starting to generate data at multiple simultaneous locations. All this data needs to be collated and processed quickly to enable informed business decisions. Many of our clients have been using very popular and effective legacy data processing systems successfully for years. These systems are now proving to be simply incapable of meeting these new requirements at scale. It is time to think in a radical new direction.
The problem statement
After some analysis and discussion on this case we listed the following key considerations:
- A replacement was required for the legacy data processing system.
- The solution had to be capable of processing large data files every day.
- The new system had to be able to process a one-time load of historical data along with newly created files in an automated fashion.
- Downloading and ingesting new data also had to be automated.
- It was important to keep track of file status to ensure accuracy of each file and record.
- The system had to be able to cope with breaks in connections and processes, which are inherent and are expected to occur from time to time.
- The system had to be resource-efficient, low-cost, low-maintenance, and low-effort for end users.
The obvious choice
In this case we recommended a solution using Azure Data Factory (ADF). There were multiple reasons that made Azure the natural choice for this situation.
- ETL (Extract, Transform, Load) is a native service in Azure. It easily ties into the other services which Microsoft offers. This is an advantage over other data migration solutions.
- A key functionality of Azure Data Factory is the ability to define pipelines for moving data automatically when a new file is available in a source location. This can be done in a web interface with minimal need for writing code. Definitions and schedules are therefore simpler to setup.
- Pipelines allow setting up schedules for ingesting data, thereby automating this process. Schedules can either be event-based or setup to occur at specified time intervals. Also, being a native service, it is a server-less solution. This means there is no need to worry about where and how the pipelines are running. Repetitive configuration and management requirements are minimized.
- Azure Data Factory allows for programming parameters such as loops and waits for entire pipelines. This gives the necessary flexibility for designing workflows.
- Parallelism is enabled by default in Azure. This allows end users to run parallel queries. Parallelism on large data sets using conventional solutions would have required prohibitively expensive and inefficient scaling of hardware. Azure Data Factory gets this done at a fraction of the cost.
- The platform allows for effortless scaling. This is something clients will inevitably need, considering the increasing rate at which they continue to generate data.
We started with a PoC (proof of concept) that involved combining ADF with Snowflake, Azure’s cloud data warehouse.
- Migrating and validating the client’s data to cloud database was relatively simple once we had it in flat files in a shared location.
- We updated the schedules to start monitoring for new data files regularly.
- We setup the ADF pipelines and started a monitoring schedule at regular intervals.
- We mapped all the data to staging tables. Then we applied standard and prescribed validations and loaded the data into production tables.
- We transformed the data with additional schema changes to match the client’s requirement.
The PoC showed that it was now easy for the client to run analyses on structured data using their preferred BI tools. It also proved to be surprisingly cost-efficient compared to any conventional solution. Below is a high level architectural diagram showing the entire work flow.
Once Source and destination are created then Pipeline is required to transfer the data from source to destination.
With no code needed we can set up the pipelines in Azure, set the source and destination and voila we have the solution ready to go.
Step 1: Properties set in Pipeline
Step 2: Source set in Pipeline
Step 3: Destination set in Pipeline
Step 4: Set output settings
Step 5: Validate the Summary
Step 6: Deploy the Pipeline
Step 7: Deployment Watch window
ADF does a masterful job of ETL, combining multiple sources and types of data usably. In addition to processing it allows for monitoring. ADF sends alerts and makes it easy to take corrective actions where needed. Combining the best of SaaS, PaaS and IaaS, ADF is an excellent example of leveraging the massive power and scale of the cloud, making possible (and affordable) things that are simply unimaginable with traditional infrastructure.Talk to