A Big Data tester will be working with semi-structured or unstructured data, so the tester will have to derive the structure dynamically in order to test it. Since HDFS makes it possible to store huge volumes of varied data and to run queries over the entire data set with results in reasonable time, applying transformations and business rules to the data becomes easier.
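To make "deriving the structure dynamically" concrete, the sketch below infers a schema from a handful of semi-structured JSON records (the sample data and function names are hypothetical, not from any specific tool) and flags fields whose types are inconsistent across records, which is a typical validation check on such data.

```python
import json
from collections import defaultdict

# Hypothetical sample of semi-structured records, as might be read from HDFS.
records = [
    '{"id": 1, "name": "Alice", "tags": ["a", "b"]}',
    '{"id": 2, "name": "Bob"}',
    '{"id": "3", "name": "Carol", "tags": []}',
]

def derive_schema(lines):
    """Infer the set of field names and the observed types for each field."""
    schema = defaultdict(set)
    for line in lines:
        for key, value in json.loads(line).items():
            schema[key].add(type(value).__name__)
    return dict(schema)

schema = derive_schema(records)
# A test can then flag fields with inconsistent types across records.
inconsistent = {k: sorted(v) for k, v in schema.items() if len(v) > 1}
print(inconsistent)  # {'id': ['int', 'str']}
```

Note that the derived schema also reveals optional fields (here, `tags` is missing from one record), so the same pass can drive completeness checks as well as type checks.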
In order to test such data, testers need a test environment based on HDFS. There are no dedicated validation tools; the tools currently available in the Hadoop ecosystem range from pure programming frameworks such as MapReduce to wrappers built on top of MapReduce such as HiveQL and Pig Latin. HiveQL is suitable only for flat data structures and cannot handle complex nested data structures; for such scenarios, the statement-based Pig Latin can be used. AWS has also come up with a testing approach called Big Data Testing Drive (aws.amazon.com/testdrive/bigdata/), which allows you to test Big Data solutions on top of AWS.
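Whatever wrapper is used, the underlying validation pattern is the MapReduce one the paragraph describes. As a minimal sketch, assuming a common testing task of reconciling record counts between a source and a target data set, the map and reduce phases can be imitated in plain Python (the data and function names here are illustrative only):

```python
from itertools import groupby

# Hypothetical source and target data sets to reconcile.
source = [{"region": "us", "id": 1}, {"region": "us", "id": 2}, {"region": "eu", "id": 3}]
target = [{"region": "us", "id": 1}, {"region": "eu", "id": 3}]

def map_phase(records):
    # Emit (key, 1) pairs, as a MapReduce mapper would.
    return [(r["region"], 1) for r in records]

def reduce_phase(pairs):
    # Group by key and sum the counts, as a reducer would.
    pairs = sorted(pairs)
    return {k: sum(c for _, c in grp) for k, grp in groupby(pairs, key=lambda p: p[0])}

src_counts = reduce_phase(map_phase(source))
tgt_counts = reduce_phase(map_phase(target))
mismatches = {k: (src_counts.get(k, 0), tgt_counts.get(k, 0))
              for k in set(src_counts) | set(tgt_counts)
              if src_counts.get(k, 0) != tgt_counts.get(k, 0)}
print(mismatches)  # {'us': (2, 1)}
```

On a real cluster the same group-and-count logic would typically be expressed as a HiveQL `GROUP BY` query for flat data, or as Pig Latin `GROUP`/`FOREACH` statements when the records are nested.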