Term:
Definition: Hadoop Distributed File System - high-performance distributed file system for storing data.

Term:
Definition: MapReduce 2.0 - splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into the ResourceManager and the ApplicationMaster.

Term:
Definition: Used for migrating data between structured data stores and HDFS/Hadoop storage.

Term:
Definition: Interpreted language layered over MapReduce - a high-level language for data analysis.

Term:
Definition: Data warehouse facilitating querying and management of large datasets - mimics relational-database syntax.

Term:
Definition: Utility to create and run MapReduce jobs with any executable or script as the mapper or reducer.

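The utility described here is Hadoop Streaming: input records are piped to the mapper and reducer processes over standard input and output, so any executable can play either role. Below is a minimal word-count sketch in Python; the script name and the idea of dispatching map/reduce from one file are illustrative assumptions rather than anything from the source.

    #!/usr/bin/env python3
    # wordcount.py - hypothetical Hadoop Streaming job; run as "wordcount.py map"
    # for the mapper and "wordcount.py reduce" for the reducer.
    import sys

    def mapper():
        # Emit "word<TAB>1" for every word; the framework sorts by key before the reduce.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Input arrives grouped by key, so sum each run of identical words.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t", 1)
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

Such a script is typically submitted with the streaming jar using options like -mapper "wordcount.py map", -reducer "wordcount.py reduce", and -files wordcount.py, plus -input and -output paths.
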
Term:
Definition: Distributed, scalable big data store - stores data as sorted key/value pairs, where the key is built from the row key and column identifiers - used for fast lookups.

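A store with this sorted row-key/column data model (e.g., Apache HBase) is read and written by row key, which is what makes point lookups fast. A minimal sketch using the third-party happybase Python client; the host, table name, column family, and row key below are assumptions.

    import happybase  # Thrift-based client; assumes the store's Thrift gateway is running

    # Hypothetical connection details and table layout.
    connection = happybase.Connection("localhost")
    table = connection.table("metrics")

    # Write one cell: row key plus column family:qualifier mapped to a value (all bytes).
    table.put(b"sensor-42|2024-01-01", {b"cf:temp": b"21.5"})

    # Fast point lookup by row key; returns a dict of {column: value}.
    row = table.row(b"sensor-42|2024-01-01")
    print(row.get(b"cf:temp"))
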
Term:
Definition: Robust, scalable, high-performance key/value store for data storage and retrieval, with cell-based access controls.

Term:
Definition: Serialization framework that compresses and serializes data for storage or transfer; relies heavily on schemas.

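Because the framework described here (Apache Avro) is schema-driven, every file carries or references the schema its records conform to. A minimal write/read sketch using the third-party fastavro package; the schema, file name, and codec choice are assumptions.

    import fastavro

    # Hypothetical record schema; every record must conform to it.
    schema = fastavro.parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    })

    records = [{"name": "ada", "age": 36}, {"name": "alan", "age": 41}]

    # Serialize with a compression codec, then read the records back.
    with open("users.avro", "wb") as out:
        fastavro.writer(out, schema, records, codec="deflate")

    with open("users.avro", "rb") as inp:
        for record in fastavro.reader(inp):
            print(record)
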
Term:
Definition: Columnar storage format for Hadoop.

Term:
Definition: Machine learning library for building scalable machine learning algorithms, implemented on top of Hadoop MapReduce.

Term:
Definition: Distributed real-time computation system - processes streaming data in memory in real time, making it extremely fast.

Term:
Definition: Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

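The coordination service described here (Apache ZooKeeper) keeps small pieces of data in a tree of named znodes, which is how configuration, naming, and synchronization are built on top of it. A minimal sketch using the third-party kazoo Python client; the ensemble address, paths, and values are assumptions.

    from kazoo.client import KazooClient

    # Hypothetical ensemble address and znode layout.
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # Store a piece of configuration under a named path in the znode tree.
    zk.ensure_path("/app/config")
    if not zk.exists("/app/config/db_url"):
        zk.create("/app/config/db_url", b"postgres://db.internal:5432/app")

    # Any client in the group can read (and watch) the same value.
    value, stat = zk.get("/app/config/db_url")
    print(value.decode(), stat.version)

    zk.stop()
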
Term:
Definition: Open-source in-memory key/value stores.

Term:
Definition: Fast, general engine for large-scale data processing.

Term:
Definition: Batch workflow job scheduler to run Hadoop jobs.

Term:
Definition: NoSQL database for managing large amounts of structured, semi-structured, and unstructured data.

Term:
Definition: Design pattern - group records together by a field or set of fields and calculate a numerical aggregate per group; implemented with a mapper, a partitioner, and a reducer.

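In this pattern the mapper emits a group-key/value pair per record, the partitioner (the default hash partitioner is usually enough) routes each group to a single reducer, and the reducer computes the aggregate. A minimal reducer-side sketch in Python, assuming the mapper has already emitted tab-separated "group<TAB>value" lines.

    import sys

    # Keys arrive sorted by group after the shuffle, so a running aggregate per group suffices.
    current, count, total, low, high = None, 0, 0.0, None, None

    def emit():
        if current is not None:
            print(f"{current}\tcount={count}\tsum={total}\tmin={low}\tmax={high}")

    for line in sys.stdin:
        group, value = line.rstrip("\n").split("\t")
        v = float(value)
        if group != current:
            emit()  # flush the previous group's aggregate
            current, count, total, low, high = group, 0, 0.0, v, v
        count += 1
        total += v
        low, high = min(low, v), max(high, v)
    emit()  # flush the final group
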
Term:
Definition: Design pattern - generate an index from a data set to enable fast searches or data enrichment. Building the index takes time, but it greatly reduces search times, and the output can be ingested into a key/value store.

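For this pattern the mapper emits (term, document id) pairs and the reducer concatenates the document ids for each term into a posting list, which can then be bulk-loaded into a key/value store. A minimal streaming-style sketch, assuming input lines of the form "doc_id<TAB>text".

    import sys

    def mapper():
        # Emit "term<TAB>doc_id" for every distinct term in the document body.
        for line in sys.stdin:
            doc_id, _, text = line.rstrip("\n").partition("\t")
            for term in set(text.lower().split()):
                print(f"{term}\t{doc_id}")

    def reducer():
        # Terms arrive sorted; collect the document ids for each term into one posting list.
        current, docs = None, []
        for line in sys.stdin:
            term, doc_id = line.rstrip("\n").split("\t")
            if term != current and current is not None:
                print(f"{current}\t{','.join(sorted(set(docs)))}")
                docs = []
            current = term
            docs.append(doc_id)
        if current is not None:
            print(f"{current}\t{','.join(sorted(set(docs)))}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()
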
Term:
Definition: Design pattern - used to perform concatenation prior to the reduce phase.

Term:
Definition: Design pattern - use the MapReduce framework's counter utility to calculate a global sum entirely on the map side, producing no output.

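Hadoop Streaming exposes the counter utility through a stderr convention: a task writes lines of the form "reporter:counter:<group>,<counter>,<amount>" to standard error, and the framework aggregates them into job-level counters. A minimal map-only sketch; the counter group and names are made up.

    import sys

    # Map-only counting: increment framework counters and emit no map output at all;
    # the job's final counter totals are the result, so no reduce phase is needed.
    for line in sys.stdin:
        record = line.rstrip("\n")
        sys.stderr.write("reporter:counter:RecordStats,TotalRecords,1\n")
        if not record:
            sys.stderr.write("reporter:counter:RecordStats,EmptyRecords,1\n")
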
Term:
Definition: Filtering pattern - map-side filtering: each map task evaluates records against a condition and emits only those that pass, so no reduce phase is required.

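Map-side filtering needs no reducer: each map task evaluates a predicate and passes through only the records that satisfy it. A minimal sketch that keeps lines matching a made-up regular expression, suitable for a map-only streaming job.

    import re
    import sys

    # Hypothetical predicate: keep only records that mention an error.
    PATTERN = re.compile(r"\bERROR\b")

    for line in sys.stdin:
        if PATTERN.search(line):
            sys.stdout.write(line)  # pass the record through unchanged
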
Term:
Definition: Filtering pattern - keep records that are members of a large predefined set of values, with a tiny possibility of false positives. Example: filtering out comments that don't contain a keyword of interest.

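A Bloom filter represents the predefined set as a compact bit array; membership tests can return false positives but never false negatives, so records can only be wrongly kept, never wrongly dropped. A self-contained map-side sketch; the hashing scheme and keyword set are illustrative assumptions.

    import hashlib
    import sys

    class BloomFilter:
        """Compact set-membership test with a small false-positive rate and no false negatives."""

        def __init__(self, num_bits=8192, num_hashes=5):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            # Derive several bit positions from salted hashes of the item.
            for salt in range(self.num_hashes):
                digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    # Build the filter from a hypothetical predefined keyword set, then use it map-side:
    # keep only comments that contain at least one keyword of interest.
    keywords = BloomFilter()
    for word in ("hadoop", "mapreduce", "hdfs"):
        keywords.add(word)

    for line in sys.stdin:
        if any(word in keywords for word in line.lower().split()):
            sys.stdout.write(line)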