Developed at the University of California, Berkeley starting in 2009, Spark is a powerful cluster-computing engine known for its fast, in-memory, large-scale data processing capability. Spark was donated to the Apache Software Foundation in 2013 and is available as open source technology. Beyond its raw processing capability, Apache Spark provides APIs in multiple programming languages (Scala, Java, Python, R, and SQL), which makes it flexible enough for business applications across multiple industry verticals.
This article identifies five important trends that indicate the acceptance, adoption, and application of Apache Spark we can expect over the next few years.
Trend #1: The shift from storage to computational power
The era of data warehouse modernization was driven by large organizations focused on distributed storage mechanisms using Hadoop. More recently, businesses have shifted their attention to deriving value from analysis of that big data (translating data into actionable insights that provide a competitive advantage). As a result, the processing power and RAM dedicated to analyzing data have begun to outpace the resources dedicated to storing it.
Spark, with its large-scale, in-memory data processing capability, is at the center of this smart-computation evolution. We should expect to see significant growth in Spark investment, especially in highly competitive industry sectors such as financial services, manufacturing, and pharmaceuticals.
Trend #2: Improved cloud-based infrastructures
Organizations adopt Spark to leverage its rapid innovation cycles, fueled by contributions from the open source community. Upgrading to newer versions of software is significantly faster in the cloud than in any on-premises implementation.
One way for organizations to get up and running quickly on Spark is to use a cloud-based implementation. Historically, however, this was viable only for smaller companies and start-ups with modest data volumes. For enterprises with sizable data volumes or large investments in their own data centers, moving data into the cloud was expensive. Such organizations often opted for a hybrid strategy: a cloud implementation of Spark analyzed streaming data, while an on-premises Spark cluster analyzed historical and aggregated data.
Cloud infrastructure has improved significantly in the last few years, with considerable investment from Amazon, Google, and Microsoft. Scalability, elasticity, and ease of use are now the pillars of mainstream cloud platforms, and migrating to the cloud has never been easier. Given these improvements, even organizations with large data volumes can now adopt an entirely cloud-based Spark implementation, which should lead to more widespread adoption of Spark.