PySpark for Big Data

Participant costs: € 2,964.50 incl. VAT

Tuition: € 2,450.00
Total excl. VAT: € 2,450.00
VAT: € 514.50
Total incl. VAT: € 2,964.50

Start dates, in a group on location and online

Location: Houten
Start: 15-12-2025

Location: Amsterdam
Start: 15-12-2025

Location: Rotterdam
Start: 15-12-2025

Location: Eindhoven
Start: 15-12-2025

In the PySpark for Big Data course, participants learn to use Apache Spark from Python.

Spark Architecture

The PySpark for Big Data course covers the architecture of Spark, the Spark Cluster Manager, and the difference between batch and stream processing.

Hadoop

After a discussion of the Hadoop Distributed File System, the course turns to parallel operations and working with RDDs (Resilient Distributed Datasets). Configuring PySpark applications through SparkConf and SparkContext is also covered.

MapReduce and SQL

The operations available on RDDs, including map and reduce, are covered in detail. The use of SQL in Spark is also discussed, as are the GraphX library and DataFrames. Iterative algorithms are covered as well.
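To give an impression of the map and reduce operations mentioned above, here is a minimal sketch in plain Python. It is not PySpark code and not taken from the course material: the `partitions` list, `square`, and `add` are hypothetical names chosen for illustration, and the nested lists only loosely mimic how an RDD is split across worker nodes.

```python
from functools import reduce

# Hypothetical input: a dataset already split into partitions,
# loosely mimicking how an RDD is divided across worker nodes.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]


def square(x):
    """Function applied to every element in the map step."""
    return x * x


def add(a, b):
    """Associative and commutative function used in the reduce step."""
    return a + b


# Map step: apply square to each element, one partition at a time.
mapped = [[square(x) for x in part] for part in partitions]

# Reduce step: combine within each partition first,
# then combine the partial results into one value.
partials = [reduce(add, part) for part in mapped]
total = reduce(add, partials)

print(partials)  # [14, 41, 230]
print(total)     # 285
```

Because `add` is associative and commutative, the partial results can be combined in any order, which is what makes this kind of reduction suitable for parallel execution. In PySpark the same shape is usually expressed with the `map` and `reduce` methods of an RDD.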

MLlib library

Finally, the PySpark for Big Data course covers machine learning with the MLlib library.

Target audience for the PySpark for Big Data course

The PySpark for Big Data course is intended for developers and aspiring data analysts who want to learn to use Apache Spark from Python.

Prerequisites for the PySpark for Big Data course

Some programming experience helps in following the material. Prior knowledge of Python or of big data handling with Apache Spark is not required.

Delivery of the PySpark for Big Data course

Theory is covered through presentations, and illustrative demos clarify the concepts discussed. Theory and practice alternate, with ample opportunity to practice. Course hours are 9:30 to 16:30.

Certification for the PySpark for Big Data course

After successfully completing the course, participants receive an official PySpark for Big Data certificate.

Modules

Module 1 : Python Primer

Python Syntax
Python Data Types
Lists, Tuples, Dictionaries
Python Control Flow
Functions and Parameters
Modules and Packages
Comprehensions
Iterators and Generators
Python Classes
Anaconda Environment
Jupyter Notebooks
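
As an illustration of two topics from this module, comprehensions and generators, the following is a minimal sketch. The sample data and the names `read_records`, `raw_lines`, `frequent`, and `lengths` are hypothetical and do not come from the course material.

```python
def read_records(lines):
    """Generator: parses and yields one (word, count) pair at a time."""
    for line in lines:
        word, count = line.split(",")
        yield word.strip(), int(count)


# Hypothetical sample input in "word, count" form.
raw_lines = ["spark, 3", "hadoop, 1", "spark, 2"]

# A generator produces values on demand; next() pulls the first one.
first = next(read_records(raw_lines))

# List comprehension: keep the words whose count is at least 2.
frequent = [word for word, count in read_records(raw_lines) if count >= 2]

# Dict comprehension: map each distinct word to its length.
lengths = {word: len(word) for word in set(frequent)}

print(first)     # ('spark', 3)
print(frequent)  # ['spark', 'spark']
print(lengths)   # {'spark': 5}
```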

Module 2 : Spark Intro

What is Apache Spark?
Spark and Python
PySpark
Py4j Library
Data Driven Documents
RDDs
Real Time Processing
Apache Hadoop MapReduce
Cluster Manager
Batch versus Stream Processing
PySpark Shell

Module 3 : HDFS

Hadoop Environment
Environment Setup
Hadoop Stack
Hadoop Yarn
Hadoop Distributed File System
HDFS Architecture
Parallel Operations
Working with Partitions
RDD Partitions
HDFS Data Locality
DAG (Directed Acyclic Graph)

Module 4 : SparkConf

SparkConf Object
Setting Configuration Properties
Uploading Files
SparkContext.addFile
Logging Configuration
Storage Levels
Serialize RDD
Replicate RDD partitions
DISK_ONLY
MEMORY_AND_DISK
MEMORY_ONLY

Module 5 : SparkContext

Main Entry Point
Executor
Worker Nodes
LocalFS
SparkContext Parameters
Master
RDD serializer
batchSize
Gateway
JavaSparkContext instance
Profiler

Module 6 : RDDs

Resilient Distributed Datasets
Key-Value pair RDDs
Parallel Processing
Immutability and Fault Tolerance
Transformation Operations
Filter, groupBy and Map
Action Operations
Caching and persistence
PySpark RDD Class
count, collect, foreach, filter
map, reduce, join, cache
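
The distinction between transformation and action operations in this module can be sketched in plain Python. This is an analogy under stated assumptions, not PySpark code: `lazy_filter` and `totals_by_key` are hypothetical helpers, and a Python generator stands in for a lazily evaluated transformation.

```python
# Hypothetical (key, value) records, comparable to a key-value pair RDD.
pairs = [("spark", 3), ("hadoop", 1), ("spark", 2), ("hdfs", 4)]

evaluated = []  # records every pair the predicate has actually inspected


def is_large(pair):
    """Predicate that also logs each call, so laziness becomes visible."""
    evaluated.append(pair)
    return pair[1] >= 2


def lazy_filter(predicate, records):
    """Transformation-like step: returns a generator, so no work happens yet."""
    return (record for record in records if predicate(record))


def totals_by_key(records):
    """Action-like step: iterating the records forces the pending filter to run."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals


filtered = lazy_filter(is_large, pairs)
evaluated_before_action = len(evaluated)  # 0: the filter has not run

totals = totals_by_key(filtered)
evaluated_after_action = len(evaluated)   # 4: every pair was inspected

print(evaluated_before_action, evaluated_after_action)  # 0 4
print(totals)  # {'spark': 5, 'hdfs': 4}
```

The point of the sketch is only the ordering: describing the filter costs nothing, and the work is done when a result is requested.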

Module 7 : Spark Processing

SQL support in Spark
Spark 2.0 DataFrames
Defining tables
Importing datasets
Querying DataFrames using SQL
Storage formats
JSON / Parquet
GraphX
GraphX library overview
GraphX APIs

Module 8 : Broadcast and Accumulator

Performance Tuning
Serialization
Network Traffic
Disk Persistence
MarshalSerializer
Data Type Support
Python’s Pickle Serializer
DStreams
Sliding Window Operations
Multi Batch and State Operations
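
The idea behind sliding window operations on a stream of batches can be sketched in plain Python. This is an illustration only and does not use the Spark streaming API: `sliding_window_sums`, the sample `batches`, and a window that slides by one batch at a time are all assumptions made for the example.

```python
from collections import deque


def sliding_window_sums(batches, window_length):
    """Sum the values over the most recent `window_length` batches.

    One result is produced per incoming batch, starting as soon as
    the window is full (so the window slides by one batch each time).
    """
    window = deque(maxlen=window_length)  # oldest batch sum drops out automatically
    sums = []
    for batch in batches:
        window.append(sum(batch))
        if len(window) == window_length:
            sums.append(sum(window))
    return sums


# Hypothetical stream: each inner list is the data that arrived in one batch interval.
batches = [[1, 2], [3], [4, 5], [6]]

result = sliding_window_sums(batches, window_length=3)
print(result)  # [15, 18]
```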

Module 9 : Algorithms

Iterative Algorithms
Graph Analysis
Machine Learning API
mllib.classification
Random Forest
Naive Bayes
Decision Tree
mllib.clustering
mllib.linalg
mllib.regression

Course information

Course type: Training
Delivery method: In a group on location and online
Type of certificate/diploma: Certificate
Course duration: 3 days
Max. participants:
Study load: 18 hours per course
Time of day: Daytime
Language of instruction: Dutch

Provider

SpiralTrain is a training institute that focuses primarily on training for software developers and on topics related to software development.

SpiralTrain BV

NRTO