Sebastian Brestin, Developer in Cluj-Napoca, Cluj County, Romania

Sebastian Brestin

Verified Expert in Engineering

Azure Databricks Developer

Location
Cluj-Napoca, Cluj County, Romania
Toptal Member Since
March 12, 2018

Since 2012, Sebastian has been developing distributed systems for platforms ranging from Solaris, IBM AIX, and HP-UX to Linux and Windows. He's worked with technologies such as Apache Spark, Elasticsearch, PostgreSQL, RabbitMQ, Django, and Celery to build data-intensive, scalable software. Sebastian is passionate about delivering high-quality solutions and is extremely interested in big data challenges.

Portfolio

Fortune 500 Retail Company (via Toptal)
Jupyter Notebook, Pandas, Amazon Web Services (AWS), Parquet, Python, PySpark
Reconstrukt (via Toptal)
Amazon Web Services (AWS), Asyncio, Tornado, Python
Spyhce
Jenkins, Docker, Cassandra, Redis, Elasticsearch, PostgreSQL, Apache Kafka...

Experience

Availability

Part-time

Preferred Environment

Amazon Web Services (AWS), Apache Kafka, Spark, Linux, Hadoop

The most amazing...

...MapReduce system I built using Docker, Django, PostgreSQL, and Celery.

Work Experience

Data Engineer

2019 - 2021
Fortune 500 Retail Company (via Toptal)
  • Redesigned a Spark application to make it more robust and flexible for data scientists and software engineers.
  • Researched ways to increase Spark S3 Parquet write performance. To analyze Spark jobs, I used Spark History Server and Ganglia to pinpoint where time was spent and to compare configuration fixes.
  • Implemented a robust, generic testing framework for Spark pipelines.
  • Redesigned a monolithic Spark application into a modular one that writes intermediate results and can be run in parallel.
  • Built a Spark application that ingests 1TB of data from multiple sources and generates 40,000 candidate features on which the data science team can perform EDA and test models.
  • Implemented new application APIs to help other teams improve their productivity.
Technologies: Jupyter Notebook, Pandas, Amazon Web Services (AWS), Parquet, Python, PySpark

Python Developer

2018 - 2019
Reconstrukt (via Toptal)
  • Implemented a concurrent orchestrator for a real-time video rendering system using Python Tornado.
  • Used HTTP, WebSockets, raw TCP connections, AWS S3, and NAS storage to build the pipelines needed to process content.
Technologies: Amazon Web Services (AWS), Asyncio, Tornado, Python

Data Engineer

2016 - 2018
Spyhce
  • Matched mutable objects (which users can create and update) with millions of immutable objects in real time (or as close as possible) by creating three Spark-based apps. Additional details about this project can be found in my portfolio section.
  • Built a task manager Django application over Celery. The application allows administrators to easily manage tasks and view progress/statistics without an additional monitoring service.
  • Developed a Django audit application over the Django ORM in order to keep track of all of the user actions.
Technologies: Jenkins, Docker, Cassandra, Redis, Elasticsearch, PostgreSQL, Apache Kafka, RabbitMQ, Celery, Django, Python, Spark

Software Engineer

2012 - 2016
Hewlett-Packard
  • Developed a Python-based build system for a virtual appliance, allowing HP customers to deploy the product into production with little effort.
  • Maintained the project.
  • Introduced time-windowed software/patch installation, which the server agents use to avoid loading or rebooting the server during critical hours.
  • Led the upgrade from SSL to TLS across all server and client components.
  • Redesigned the strategy the server agents use to select the IP address for communicating with the core components.
  • Refactored legacy code to support custom installation paths for the Windows server agent.
Technologies: OpenSSL, Windows, Unix, Linux, PostgreSQL, Oracle Database, C++, Python, Spring, WebLogic, WildFly, Java
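The time-windowed installation check could be sketched like this (a minimal sketch; the window boundaries and function names are illustrative, not HP's actual implementation):

```python
# Hedged sketch of a time-windowed installation check: the agent only
# installs (and possibly reboots) inside an allowed maintenance window.
# The default window boundaries below are illustrative.
from datetime import time

def in_maintenance_window(now: time, start: time = time(1, 0), end: time = time(5, 0)) -> bool:
    """Return True if `now` falls inside the allowed installation window."""
    if start <= end:
        return start <= now <= end
    # Window crosses midnight, e.g. 22:00-04:00.
    return now >= start or now <= end
```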

Software Developer

2012 - 2012
GFI Software
  • Improved patch download speed by using a cache of LAN neighbors.
  • Enhanced the build system for a better UX.
  • Maintained the project.
  • Redesigned the product's legacy architecture so new features can be added easily.
  • Added discovery of Android and iOS devices on the local network.
Technologies: Microsoft SQL Server, Delphi, .NET, C#, C++

Personalization Engine Optimization (Fortune 500 Retail Company)

Context:
We wanted to improve the user and developer experience. The Spark application had performance problems, and tech debt had become a blocker for adding new functionality.

Solution:
Redesigned the Spark application to make it more robust and flexible for data scientists and software engineers:
Partitioned data to avoid shuffles and data skew
Decreased the number of partitions while increasing partition size to avoid data skew and driver congestion
Used strategic caching to avoid recomputation and speed up computation by cutting down the query plan
Refactored the application into a pipeline so engineers can add features as plugins
Generated deterministic Spark output for rows with equal comparison values

Results:
Reduced execution time by 50% and increased configuration flexibility, which in turn improved the data scientists' productivity
Reduced tech debt, improving development time and the developer experience
Improved functional tests by having a deterministic output

Personalization Engine Write Optimization (Fortune 500 Retail Company)

Context:
We wanted to improve the availability of the cluster, which was being blocked by long-running Spark jobs.

Solution:
Researched ways to increase Spark S3 Parquet write performance. To analyze Spark jobs, I used Spark History Server and Ganglia to pinpoint where time was spent and to compare configuration fixes.

Results:
Increased write performance by a factor of three.

Personalization Engine Testing Framework (Fortune 500 Retail Company)

Context:
We wanted to improve the product's stability and our development process.

Solution:
Implemented a robust, generic functional testing framework for Spark pipelines:
Allowed multiple instances of a test to run in parallel
Allowed tests to run as part of a suite in a pipeline
Simulated production runs by using a representative dataset
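One building block of such a framework can be sketched in plain Python: namespacing each test run under a unique id so multiple instances can run in parallel without clobbering each other's output. The paths and names are illustrative, not the project's:

```python
import uuid

def make_run_paths(base: str) -> dict:
    """Give every test run its own namespaced output locations.

    The shared fixture (a representative sample of production data) is
    read-only, so all runs can point at the same input, while output and
    checkpoint locations are unique per run for parallel safety.
    """
    run_id = uuid.uuid4().hex[:8]
    return {
        "input": f"{base}/fixtures/representative_sample",
        "output": f"{base}/runs/{run_id}/output",
        "checkpoint": f"{base}/runs/{run_id}/checkpoint",
    }

paths_a = make_run_paths("/tmp/pipeline-tests")
paths_b = make_run_paths("/tmp/pipeline-tests")
assert paths_a["output"] != paths_b["output"]  # two instances never collide
```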

Results:
Faster POC development
Reduced development time by 50%
Increased product stability
Reduced AWS costs by simulating production runs with fewer resources

Personalization Engine Pipeline Optimization (Fortune 500 Retail Company)

Context:
Some of our Spark applications ran for as long as nine hours. Within that time frame, nodes could run out of memory or lose connectivity. We wanted our applications to run faster and to be easier to recover when a failure occurred.

Solution:
Redesigned a monolithic Spark application into a modular one that writes intermediate results and can be run in parallel.
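The recovery property can be sketched with a toy module runner that persists each module's result and skips modules whose intermediate output already exists. The module names and the in-memory "storage" are illustrative; the real application wrote intermediate results to durable storage:

```python
# Minimal sketch of the modular design: each module persists its result,
# so a failed run can be resumed by re-running only the missing modules.
def run_pipeline(modules, storage):
    for name, fn in modules:
        if name in storage:   # intermediate result already written
            continue          # -> skipped on recovery re-runs
        storage[name] = fn(storage)
    return storage

modules = [
    ("ingest", lambda s: [1, 2, 3]),
    ("features", lambda s: [x * 10 for x in s["ingest"]]),
]
storage = {}
run_pipeline(modules, storage)
```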

Results:
No failures due to out-of-memory errors
Easier recovery by re-running only the failed module

Personalization Engine Churn Model (Fortune 500 Retail Company)

Context:
We wanted a churn model that we could use to make predictions about our members and integrate into other Spark applications.

Solution:
Built a Spark application that ingests 1TB of data from multiple sources and generates 40,000 candidate features on which the data science team can perform EDA and test models.
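The mechanics of generating thousands of candidate features from a few base columns can be illustrated with a toy cross-product of metrics, aggregations, and time windows; all the names below are invented for the sketch, not the project's feature set:

```python
from itertools import product

# A handful of base inputs multiplies out quickly: the real application
# produced ~40k candidates this way from 1TB of multi-source data.
base_metrics = ["spend", "visits", "returns"]
aggregations = ["sum", "mean", "max"]
windows = ["7d", "30d", "90d"]

features = [
    f"{metric}_{agg}_{window}"
    for metric, agg, window in product(base_metrics, aggregations, windows)
]
# 3 metrics x 3 aggregations x 3 windows = 27 candidate feature names
```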

Personalization Engine Application Development (Fortune 500 Retail Company)

Context:
The data science team needed to use the data and our systems as effectively as possible to be productive.

Solution:
Implemented new application APIs to help other teams improve their productivity:
1. Wrote better APIs for complex Spark and cloud functionality
2. Wrote wrappers for complex application configuration
3. Implemented integrations with adjacent applications
4. Wrote documentation and unit tests
5. Ran data analysis and data quality checks
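Item 2 above might look like the following hypothetical wrapper: a small typed API that hides a nested, error-prone raw configuration. All key names and paths are invented for the sketch:

```python
from dataclasses import dataclass

@dataclass
class JobConfig:
    """Typed facade over the raw key/value configuration the app consumes."""
    input_path: str
    output_path: str
    shuffle_partitions: int = 200

    def to_raw_conf(self) -> dict:
        # Flatten into the string key/value form expected downstream,
        # so callers never hand-write (or typo) the raw keys.
        return {
            "app.input.path": self.input_path,
            "app.output.path": self.output_path,
            "spark.sql.shuffle.partitions": str(self.shuffle_partitions),
        }

conf = JobConfig("s3://bucket/in", "s3://bucket/out").to_raw_conf()
```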

Django MapReduce System

For a minimum viable product, I designed and implemented a custom Django MapReduce system for matching one object against millions of others. The system ran on 60 Docker nodes, and a single matching task completed within 60 seconds. To test the system, I extended the Django test framework so the custom MapReduce system could be tested locally, using multiple processes and cores instead of spawning virtual machines.
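The map/reduce shape described above can be sketched with the standard library; the real system fanned map tasks out over 60 Docker nodes via Celery, so a thread pool stands in for the workers here to keep the sketch self-contained, and word counting stands in for the matching logic:

```python
from collections import Counter
from multiprocessing.pool import ThreadPool

def map_phase(chunk):
    # Toy stand-in for scoring one chunk of candidate objects.
    return Counter(chunk)

def reduce_phase(partials):
    # Merge the per-worker partial results into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

chunks = [["a", "b"], ["b", "c"], ["a", "a"]]
with ThreadPool(3) as pool:
    mapped = pool.map(map_phase, chunks)   # fan out: one task per chunk
result = reduce_phase(mapped)              # fan in: single reducer
```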

Spyhce | Mutable Object Matching Project

This project aimed to match mutable objects (which users can create/update) with millions of immutable objects in real time (or as close as possible).

We chose Apache Spark because of its Python support, rich analytics toolkit, and streaming capabilities.

The solution was split into three Spark apps:
01. Retrieves mutable data from the data source using Kafka and Spark Streaming, extracts features, and saves the result to Cassandra.
02. Retrieves immutable data from the data source using Kafka and Spark Streaming, extracts features, and saves the result to disk as Parquet files.
03. Loads data from the Parquet files and computes a match percentage between all immutable objects and a single mutable object from Cassandra.

To query data in a reasonable amount of time, I denormalized PostgreSQL tables (billions of records) into Elasticsearch. This improved read performance by two orders of magnitude at the cost of write performance. Because only parts of the documents changed frequently, the write problem was solved using Elasticsearch bulk partial updates, with Groovy scripts for the complex fields.
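The bulk partial-update pattern can be sketched by building the action/body pairs for Elasticsearch's _bulk API; the index and field names are illustrative, and complex fields would carry a "script" entry instead of a "doc" entry:

```python
def bulk_partial_update_actions(index, changes):
    """Build action/body pairs for the Elasticsearch _bulk API.

    Each "update" action is followed by a body containing only the
    changed fields ("doc"), so the whole document is not reindexed.
    """
    lines = []
    for doc_id, changed_fields in changes.items():
        lines.append({"update": {"_index": index, "_id": doc_id}})
        lines.append({"doc": changed_fields})
    return lines

# Illustrative usage: one document with a single frequently-changing field.
actions = bulk_partial_update_actions("objects", {"42": {"status": "matched"}})
```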

Single Sign-on Platform

I led the implementation of a single sign-on platform using OpenID Connect, Django, PostgreSQL, Angular/TypeScript, and AWS; it integrates the customer's entire product suite.

Machine Learning

While working as a contractor, I used NLTK, Scikit-learn, AWS Transcribe, and AWS Lambda to run sentiment analysis to improve customer support.

High-content-streaming Platform

I led the implementation of a high-content-streaming platform using Django, PostgreSQL, Elasticsearch, and AWS. The system allows content to be pushed and pulled efficiently through a RESTful API.

Spark Secondary Sort

http://www.qwertee.io/blog/spark-secondary-sort/
While working with Spark, I noticed a lack of proper documentation on PySpark. This motivated me to put together an article thoroughly explaining the secondary sort design pattern using PySpark.

The article covers two solutions. The first uses Spark's groupByKey, which sorts in the reducer phase; the second uses Spark's repartitionAndSortWithinPartitions, which leverages the shuffle phase and iterator-to-iterator transformations to sort more efficiently without running out of memory.
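The second approach can be mimicked in plain Python to show the idea of partitioning by key and then sorting within each partition; this is a toy stand-in for illustration, not the article's PySpark implementation:

```python
from itertools import groupby
from operator import itemgetter

records = [("b", 3), ("a", 2), ("b", 1), ("a", 5)]

def secondary_sort(records, num_partitions=2):
    # "Shuffle": route every record to a partition by its key, so all
    # values for a key land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    out = {}
    for part in partitions:
        # Sort within the partition by the composite (key, value), as
        # repartitionAndSortWithinPartitions does during the shuffle.
        part.sort()
        # Iterator-to-iterator transformation: stream each key's group
        # without materializing the whole dataset per key.
        for key, group in groupby(part, key=itemgetter(0)):
            out[key] = [v for _, v in group]
    return out

# secondary_sort(records) -> {"a": [2, 5], "b": [1, 3]}
```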

PostgreSQL Data Partitioning and Django

http://www.qwertee.io/blog/postgres-data-partitioning-and-django/
I wrote this article about PostgreSQL and data partitioning in general with practical examples using the Django web framework.

The first part of the article covers the reasons to partition data and how partitioning can be implemented in PostgreSQL.

The second part describes a few solutions one can use to take advantage of PostgreSQL partitioning while using a web framework such as Django.

PostgreSQL B-Tree Index Explained | Part 1

http://www.qwertee.io/blog/postgresql-b-tree-index-explained-part-1/
Working with PostgreSQL for many years has inspired me to create an article about indexes that any developer should know.

In the first section of the article, I explain the fundamentals of the PostgreSQL B-tree index structure, focusing on the B-tree data structure and its main components (leaf nodes and internal nodes), their roles, and how they are accessed when executing a query (a process also known as an index lookup). The section ends with a classification of indexes, with a broader view of index key arity, which helps explain the impact on a query's performance.

In the second section, I give a quick introduction to query plans so that the later examples are easier to follow.

In the third section, I introduce the concept of predicates, and the explanations cover the process PostgreSQL uses to classify predicates depending on index definition and feasibility.

The last section dives into the mechanics of scans and how the index definition, data distribution, and even predicate usage affect a query's performance.

Languages

Python, C++, Java, Delphi, C#, TypeScript, JavaScript

Frameworks

Hadoop, Apache Spark, JSON Web Tokens (JWT), OAuth 2, Django REST Framework, Django, .NET, Spring, Spark, Twisted, Spring MVC, Angular

Libraries/APIs

PySpark, NumPy, Pandas, Asyncio, OpenSSL, Scikit-learn, ZeroMQ, Natural Language Toolkit (NLTK)

Tools

Celery, RabbitMQ, WildFly, Ganglia, Amazon Elastic MapReduce (EMR), VMware vSphere, Apache Avro, Subversion (SVN), Git, Jenkins

Platforms

Amazon Web Services (AWS), Apache Kafka, Windows, Linux, Ubuntu, Oracle Database, Unix, Jupyter Notebook, HP-UX, Solaris, Docker

Storage

Elasticsearch, PostgreSQL, Microsoft SQL Server, Cassandra, Amazon S3 (AWS S3), Memcached, Redis, Oracle RDBMS, MongoDB, SQL Server 2012, MySQL

Other

RESTful Web Services, Data Mining, Data Engineering, OpenID Connect (OIDC), Azure Databricks, WebLogic, Parquet, VMware ESXi, Apache Cassandra, NATS, Apache Flume, Cryptography, Tornado

Paradigms

Scrum

2010 - 2013

Bachelor's Degree in Computer Science

University of Babeș-Bolyai - Cluj-Napoca, Romania

MAY 2014 - MAY 2016

Certified Scrum Master

Scrum Alliance

APRIL 2014 - PRESENT

Oracle Certified Associate, Java SE 7 Programmer

Net BRINEL SA
