An important trend with broad implications is the extent to which data analysis tasks are becoming computationally demanding. The problem is that data volumes are growing exponentially, driven by Moore's-law improvements in the devices that generate and capture data; meanwhile, many interesting analyses depend on the intercomparison of data items, and thus have a cost that grows faster than linearly with the amount of data. So even exponentially improving processors can't keep up. The fact that storage costs are currently falling faster than computing costs makes matters worse, since it encourages us to keep ever more data around.
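To make the superlinear-cost point concrete, here is a tiny, purely illustrative Python sketch (the item counts are made up). All-pairs intercomparison of n items requires n(n-1)/2 comparisons, so each doubling of the data roughly quadruples the work:

```python
# Illustrative only: pairwise intercomparison grows quadratically,
# so doubling the number of items ~quadruples the comparisons needed.
for n in [1_000, 2_000, 4_000, 8_000]:
    comparisons = n * (n - 1) // 2
    print(f"n={n:>5}: {comparisons:>12,} pairwise comparisons")
```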
We see the impact of these issues in the attached figure, from a nice article by Folker Meyer of the Argonne/Chicago Computation Institute in CTWatch, which contrasts the number of genetic sequences obtained with the number of annotations generated. The issues here are not solely computational, as many annotations are generated manually. Nevertheless, it is striking to see how fast we're falling behind.
As always, a solution to this problem will need to combine improvements in hardware, software, and algorithms:
- Hardware: Because individual devices are no longer getting faster very rapidly, we will see increasing parallelism in storage, computation, and networks. We hear about these trends a great deal from places like Google, but the approach is becoming widespread.
- Software: As the number of devices and the amount of work to do both increase, software needs to get smarter. We need to orchestrate massively parallel computations across many devices and manage the flow of data into, out of, and among those computations; wherever possible, we should avoid computation altogether via caching and other techniques (see the first sketch after this list).
- Algorithms: Neither hardware nor software improvements can overcome the basic exponentials. Thus we need better algorithms. Probabilistic algorithms that perform sampling to extract "good enough" knowledge will become important. So will the ability to evaluate how "good" a particular conclusion really is (see the second sketch after this list).
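On the software point, here is a minimal sketch of the orchestrate-and-cache idea. The `analyze` function and its inputs are hypothetical stand-ins for a real analysis step; the pattern is to eliminate duplicate work first (a simple form of caching) and then fan the remaining computations out across worker processes:

```python
# A minimal sketch, not a production system: dedupe inputs so nothing
# is computed twice, then run the remaining work in parallel.
from concurrent.futures import ProcessPoolExecutor

def analyze(item: int) -> int:
    # Stand-in for an expensive per-item computation.
    return sum(i * i for i in range(item))

def run(items: list[int]) -> list[int]:
    unique = sorted(set(items))          # avoid recomputation: each input once
    with ProcessPoolExecutor() as pool:  # orchestrate parallel workers
        cache = dict(zip(unique, pool.map(analyze, unique)))
    return [cache[i] for i in items]     # answer repeats from the cache

if __name__ == "__main__":
    print(run([10_000, 20_000, 10_000, 30_000]))
```

Real systems add layers this sketch omits (persistent caches, data movement among nodes), but the division of labor is the same: the software decides what not to compute before deciding where to compute.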
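And on the algorithms point, a minimal sketch of sampling for "good enough" answers, with an explicit measure of how good. The data here is synthetic; the idea is to estimate a statistic from a small sample and report a confidence interval rather than scanning everything:

```python
# A minimal sketch with synthetic data: estimate the mean from a 0.1%
# sample, and report a 95% confidence interval so the quality of the
# "good enough" conclusion can be judged.
import math
import random

random.seed(42)
population = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

sample = random.sample(population, 1_000)
n = len(sample)
mean = sum(sample) / n
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
stderr = math.sqrt(variance / n)

# 95% confidence interval under the normal approximation.
print(f"estimated mean: {mean:.2f} +/- {1.96 * stderr:.2f}")
```

The sample touches a thousandth of the data, yet the interval tells us exactly how much trust the estimate deserves; tightening it is a matter of sampling more, not scanning all.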