Term:
Definition: Hadoop Distributed File System - high-performance distributed file system for storing data.

Term:
Definition: MapReduce 2.0 - splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into the ResourceManager and the ApplicationMaster.

Term:
Definition: Used for migrating data between structured data stores and HDFS/Hadoop storage.

Term:
Definition: Interpreted language layered over MapReduce - a high-level language for data analysis.

Term:
Definition: Data warehouse facilitating querying and management of large datasets - mimics relational-database syntax.

Term:
Definition: Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer.

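The utility described here is Hadoop Streaming: input records are piped to the mapper and reducer processes over standard input and output, so any executable can play either role. Below is a minimal word-count sketch in Python; the script name and the idea of dispatching map/reduce from one file are illustrative assumptions rather than anything from the source.

    #!/usr/bin/env python3
    # wordcount.py - hypothetical Hadoop Streaming job; run as "wordcount.py map"
    # for the mapper and "wordcount.py reduce" for the reducer.
    import sys

    def mapper():
        # Emit "word<TAB>1" for every word; the framework sorts by key before the reduce.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Input arrives grouped by key, so sum each run of identical words.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

Such a script is typically submitted with the streaming jar using options like -mapper "wordcount.py map", -reducer "wordcount.py reduce", and -files wordcount.py, plus -input and -output paths.
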
Term:
Definition: Distributed, scalable big data store - stores data as sorted key/value pairs, where the key is built from the row key and column identifiers - used for fast lookups.

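A store with this sorted row-key/column data model (e.g., Apache HBase) is read and written by row key, which is what makes point lookups fast. A minimal sketch using the third-party happybase Python client; the host, table name, column family, and row key below are assumptions.

    import happybase  # Thrift-based client; assumes the store's Thrift gateway is running

    # Hypothetical connection details and table layout.
    connection = happybase.Connection("localhost")
    table = connection.table("metrics")

    # Write one cell: row key plus column family:qualifier mapped to a value (all bytes).
    table.put(b"sensor-42|2024-01-01", {b"cf:temp": b"21.5"})

    # Fast point lookup by row key; returns a dict of {column: value}.
    row = table.row(b"sensor-42|2024-01-01")
    print(row.get(b"cf:temp"))
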
Term:
Definition: Robust, scalable, high-performance key/value store for data storage and retrieval, with cell-based access controls.

Term:
Definition: Serialization framework that compresses and serializes data for storage or transfer; relies heavily on schemas.

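Because the framework described here (Apache Avro) is schema-driven, every file carries or references the schema its records conform to. A minimal write/read sketch using the third-party fastavro package; the schema, file name, and codec choice are assumptions.

    import fastavro

    # Hypothetical record schema; every record must conform to it.
    schema = fastavro.parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    })

    records = [{"name": "ada", "age": 36}, {"name": "alan", "age": 41}]

    # Serialize with a compression codec, then read the records back.
    with open("users.avro", "wb") as out:
        fastavro.writer(out, schema, records, codec="deflate")

    with open("users.avro", "rb") as inp:
        for record in fastavro.reader(inp):
            print(record)
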
Term:
Definition: Columnar storage format for Hadoop.

Term:
Definition: Machine learning library for building scalable machine learning algorithms, implemented on top of Hadoop MapReduce.

Term:
Definition: Distributed real-time computation system - processes streaming data in memory in real time, making it extremely fast.

Term:
Definition: Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

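The coordination service described here (Apache ZooKeeper) keeps small pieces of data in a tree of named znodes, which is how configuration, naming, and synchronization are built on top of it. A minimal sketch using the third-party kazoo Python client; the ensemble address, paths, and values are assumptions.

    from kazoo.client import KazooClient

    # Hypothetical ensemble address and znode layout.
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Store a piece of configuration under a named path in the znode tree.
    zk.ensure_path("/app/config")
    if not zk.exists("/app/config/db_url"):
        zk.create("/app/config/db_url", b"postgres://db.internal:5432/app")

    # Any client in the group can read (and watch) the same value.
    value, stat = zk.get("/app/config/db_url")
    print(value.decode(), stat.version)

    zk.stop()
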
Term:
Definition: Open-source in-memory key/value stores.

Term:
Definition: Fast, general engine for large-scale data processing.

Term:
Definition: Batch workflow job scheduler to run Hadoop jobs.

Term:
Definition: NoSQL database for managing large amounts of structured, semi-structured, and unstructured data.

Term:
Definition: Design pattern - group records together by a field or set of fields and calculate a numerical aggregate per group; implemented with a mapper, a partitioner, and a reducer.

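In this pattern the mapper emits a group-key/value pair per record, the partitioner (the default hash partitioner is usually enough) routes each group to a single reducer, and the reducer computes the aggregate. A minimal reducer-side sketch in Python, assuming the mapper has already emitted tab-separated "group<TAB>value" lines.

    import sys

    # Keys arrive sorted by group after the shuffle, so a running aggregate per group suffices.
    current, count, total, low, high = None, 0, 0.0, None, None

    def emit():
        if current is not None:
            print(f"{current}\tcount={count}\tsum={total}\tmin={low}\tmax={high}")

    for line in sys.stdin:
        group, value = line.rstrip("\n").split("\t")
        v = float(value)
        if group != current:
            emit()  # flush the previous group's aggregate
            current, count, total, low, high = group, 0, 0.0, v, v
        count += 1
        total += v
        low, high = min(low, v), max(high, v)
    emit()  # flush the final group
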
Term:
Definition: Design pattern - generate an index from a data set to enable fast searches or data enrichment. Building the index takes time, but it greatly reduces search times, and the output can be ingested into a key/value store.

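For this pattern the mapper emits (term, document id) pairs and the reducer concatenates the document ids for each term into a posting list, which can then be bulk-loaded into a key/value store. A minimal streaming-style sketch, assuming input lines of the form "doc_id<TAB>text".

    import sys

    def mapper():
        # Emit "term<TAB>doc_id" for every distinct term in the document body.
        for line in sys.stdin:
            doc_id, _, text = line.rstrip("\n").partition("\t")
            for term in set(text.lower().split()):
                print(f"{term}\t{doc_id}")

    def reducer():
        # Terms arrive sorted; collect the document ids for each term into one posting list.
        current, docs = None, []
        for line in sys.stdin:
            term, doc_id = line.rstrip("\n").split("\t")
            if term != current and current is not None:
                print(f"{current}\t{','.join(sorted(set(docs)))}")
                docs = []
            current = term
            docs.append(doc_id)
        if current is not None:
            print(f"{current}\t{','.join(sorted(set(docs)))}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()
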
Term:
Definition: Design pattern - used to perform concatenation prior to the reduce phase.

Term:
Definition: Design pattern - use the MapReduce framework's counter utility to calculate a global sum entirely on the map side, producing no output.

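Hadoop Streaming exposes the counter utility through a stderr convention: a task writes lines of the form "reporter:counter:<group>,<counter>,<amount>" to standard error, and the framework aggregates them into job-level counters. A minimal map-only sketch; the counter group and names are made up.

    import sys

    # Map-only counting: increment framework counters and emit no map output at all;
    # the job's final counter totals are the result, so no reduce phase is needed.
    for line in sys.stdin:
        record = line.rstrip("\n")
        sys.stderr.write("reporter:counter:RecordStats,TotalRecords,1\n")
        if not record:
            sys.stderr.write("reporter:counter:RecordStats,EmptyRecords,1\n")
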
Term:
Definition: Filtering pattern - map-side filtering: each map task evaluates records against a condition and emits only those that pass, so no reduce phase is required.

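Map-side filtering needs no reducer: each map task evaluates a predicate and passes through only the records that satisfy it. A minimal sketch that keeps lines matching a made-up regular expression, suitable for a map-only streaming job.

    import re
    import sys

    # Hypothetical predicate: keep only records that mention an error.
    PATTERN = re.compile(r"\bERROR\b")

    for line in sys.stdin:
        if PATTERN.search(line):
            sys.stdout.write(line)  # pass the record through unchanged
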
Term:
Definition: Filtering pattern - keep records that are members of a large predefined set of values, with a tiny possibility of false positives. Example: filtering out comments that don't contain a keyword of interest.

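A Bloom filter represents the predefined set as a compact bit array; membership tests can return false positives but never false negatives, so records can only be wrongly kept, never wrongly dropped. A self-contained map-side sketch; the hashing scheme and keyword set are illustrative assumptions.

    import hashlib
    import sys

    class BloomFilter:
        """Compact set-membership test with a small false-positive rate and no false negatives."""

        def __init__(self, num_bits=8192, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            # Derive several bit positions from salted hashes of the item.
            for salt in range(self.num_hashes):
                digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    # Build the filter from a hypothetical predefined keyword set, then use it map-side:
    # keep only comments that contain at least one keyword of interest.
    keywords = BloomFilter()
    for word in ("hadoop", "mapreduce", "hdfs"):
        keywords.add(word)

    for line in sys.stdin:
        if any(word in keywords for word in line.lower().split()):
            sys.stdout.write(line)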