Hey Gayithri, I would suggest S3, as it is easy to pull your data into a fresh cluster. All you have to do is point the Hadoop S3 file system at your Amazon S3 account and specify a URL for your data (e.g. s3://my-bucket/my-files); the data then streams directly into your job. When you are done, you can easily save the results back to S3.
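Concretely, this usually comes down to a couple of configuration properties plus an S3 URL for your input path. A minimal sketch, assuming the hadoop-aws (s3a) connector; the bucket name and credential values are placeholders:

```xml
<!-- core-site.xml: credentials for the s3a connector (hadoop-aws module) -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With that in place, a job can read `s3a://my-bucket/my-files` directly as an input path, or you can pull a copy onto the cluster first with something like `hadoop distcp s3a://my-bucket/my-files hdfs:///input`.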
There are a few tips you can follow to get better performance from S3:
1. Organize the files in your S3 bucket for better performance; a sensible key-prefix layout makes listing and filtering faster.
2. Combine your data into fewer, larger files; this minimizes the time spent listing files in your S3 bucket.
3. Avoid underscores in bucket names, as they are not allowed in DNS-compliant bucket names and can cause access problems.
4. Stream your data directly to S3 with EMR. EMR's S3 file system also adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel. Together, streaming and multipart upload significantly improve performance and resiliency, since your jobs upload data in parallel without waiting for processing to complete.
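To see why multipart upload helps, here is a rough, self-contained Python sketch of the pattern: the payload is split into fixed-size parts, the parts are "uploaded" in parallel (to an in-memory dict standing in for S3), and the upload is completed by reassembling them in order. Names like `upload_part` and `PART_SIZE` are hypothetical; this illustrates the idea, not EMR's actual implementation:

```python
import concurrent.futures

PART_SIZE = 1024  # hypothetical part size; real multipart parts are multi-MB

# In-memory stand-in for S3: part number -> bytes
_store = {}

def upload_part(part_number, chunk):
    """Pretend-upload one part; real code would call the S3 multipart API."""
    _store[part_number] = chunk
    return part_number

def multipart_upload(data):
    """Split data into parts and upload them in parallel, then reassemble."""
    parts = [data[i:i + PART_SIZE] for i in range(0, len(data), PART_SIZE)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        pool.map(upload_part, range(len(parts)), parts)
    # "Complete" the upload by joining the parts back in order
    return b"".join(_store[i] for i in range(len(parts)))

payload = bytes(range(256)) * 20  # 5120 bytes -> 5 parts
assert multipart_upload(payload) == payload
```

Because each part is independent, a slow or failed part can be retried on its own, which is where the resiliency comes from.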
Hope this helps.