Regardless of what we develop, what platform, what use case, everything we do at some point comes down to data. Without data there's not much point to using any application.
Lots of applications require copious amounts of data that is reliably accessible, reliably re-creatable, and confirmed to contain content that meets the needs of our tests. At the same time, we may want to generate loads of fresh data to drive our applications, inform our decisions, or give us a sense of security that the data we create is safe and protected from prying eyes. Regardless of who you are and where you are in the application development cycle, data is vital, and its care, feeding and protection are critical.
Smita Mishra is giving us a rundown on Big Data, the Enterprise Data Warehouse, the ETL process (Extract, Transform, Load) and Business Intelligence practices. We can test with 1KB of data or 1TB of data. The principles of testing are the same, but the order of magnitude difference can be huge.
In my world view, Big Data describes really large data sets, so large that they cannot easily fit into a standard database or file system. Smita points out that 5 Petabytes or more defines "Big Data". Smita also showed us an example of an "Internet Minute" and what happens and transmits during a typical minute over the Internet. Is anyone surprised that the largest bulk of data comes from Netflix ;)?
Big Data requires different approaches for storage and processing. Large parallel systems, databases of databases, distributed cloud system implementations, and large scale aggregation tools all come into play. Ultimately, it's just a broad array of tools designed to work together to cut massive amounts of data down to some manageable level.
In my own world, I have not yet had to get into really big data, but I do have to consider data that spans multiple machines and instances. Additionally, while I sometimes think Business Intelligence is an overused term and a bit flighty in its meaning, it really comes down to data mining, analytical processing, querying and reporting. That process itself is not too hard to wrap your head around, but again, the order of magnitude with Big Data applications makes it a more challenging endeavor. Order of operations and aggregation/drill down become essential. Consider it a little bit like making Tradizionale Balsamic Vinegar, in the sense that your end product effectively gets moved to ever smaller barrels as the concentration level increases. I'll admit, that's a weird comparison, but in a sense it's apt. The Tradizionale process can't go backwards, and your data queries can't either.
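To make the "ever smaller barrels" idea concrete, here is a minimal Python sketch of a two-stage roll-up. The sample rows, the `roll_up` helper and the day/region/month keys are all made up for illustration and are not from Smita's talk. Each stage produces a smaller summary than the one before, and the detail rows cannot be rebuilt from the summary.

```python
from collections import defaultdict


def roll_up(pairs, key_fn):
    """Sum amounts into coarser buckets chosen by key_fn (hypothetical helper)."""
    totals = defaultdict(float)
    for key, amount in pairs:
        totals[key_fn(key)] += amount
    return dict(totals)


# Invented detail rows: ((date, region), amount)
detail = [
    (("2015-03-01", "west"), 10.0),
    (("2015-03-01", "west"), 5.0),
    (("2015-03-02", "east"), 7.5),
    (("2015-04-01", "west"), 2.5),
]

# Stage 1: one total per day and region.
daily = roll_up(detail, lambda key: key)

# Stage 2: one total per month (first 7 characters of the date, "YYYY-MM").
monthly = roll_up(daily.items(), lambda key: key[0][:7])

print(len(detail), len(daily), len(monthly))  # each stage is smaller
print(monthly)
```

Drilling down from a monthly figure means going back to the stage that still holds the detail, since the monthly totals alone no longer say which day or region contributed what.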
There are some unique challenges related to data warehouse testing. Consistency and quality of data are problematic. Even in small sample sets, inconsistent data formats and missing values can mess up tests and deployments. Big Data takes those same issues and makes them writ large. We need to consider the entry points of our data, and to extend that Tradizionale example, each paring down step and aggregation entry point needs to verify consistency. If you find an error, stop that error at the point that you find it, and don't allow it to be passed on and cascade through the system.
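As a rough illustration of stopping bad data at the entry point, here is a small Python sketch. The field names, the date format and the `validate_record` and `load_stage` helpers are assumptions of mine for the example, not anything shown in the talk. It checks each record for missing values and inconsistent formats before a stage accepts it, and sets the failures aside instead of passing them downstream.

```python
from datetime import datetime


class ValidationError(Exception):
    """Raised when a record fails a consistency check at a stage boundary."""


# Hypothetical schema for the example.
REQUIRED_FIELDS = ("customer_id", "order_date", "amount")


def validate_record(record, stage):
    """Return the record if it is consistent, otherwise raise ValidationError."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            raise ValidationError(f"{stage}: missing value for {field!r}")
    try:
        # Assumed convention: dates arrive as YYYY-MM-DD.
        datetime.strptime(record["order_date"], "%Y-%m-%d")
    except (TypeError, ValueError):
        raise ValidationError(f"{stage}: bad date {record['order_date']!r}")
    try:
        float(record["amount"])
    except (TypeError, ValueError):
        raise ValidationError(f"{stage}: bad amount {record['amount']!r}")
    return record


def load_stage(records, stage):
    """Split records into accepted and rejected at this entry point."""
    accepted, rejected = [], []
    for record in records:
        try:
            accepted.append(validate_record(record, stage))
        except ValidationError as err:
            rejected.append((record, str(err)))
    return accepted, rejected


incoming = [
    {"customer_id": "c1", "order_date": "2015-03-01", "amount": "12.50"},
    {"customer_id": "", "order_date": "2015-03-01", "amount": "3.00"},
    {"customer_id": "c3", "order_date": "03/01/2015", "amount": "8.00"},
]

accepted, rejected = load_stage(incoming, stage="extract")
for record, reason in rejected:
    print("rejected:", reason)
```

Running the same kind of check at every aggregation step keeps an error pinned to the stage where it first shows up, and the rejected list gives you something to report on rather than a silent failure three steps later.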