Uwe’s Blog

My writing about data engineering, open-source development, general programming, and engineering culture.

  • Fast JDBC access in Python using pyarrow.jvm

    While most databases are accessible via ODBC, where turbodbc gives us an efficient way to turn results into a pandas.DataFrame, there are nowadays a lot of databases that either come solely with a JDBC driver or whose non-JDBC drivers are not part of the free or open-source offering. To access these databases, you can use...

  • Taking DuckDB for a spin

    TL;DR: Recently, DuckDB, a database that promises to become the SQLite of analytics, was released, and I took it for an initial test drive. Install it via conda install python-duckdb or pip install duckdb.

  • Writing a boolean array for pandas that can deal with missing values

    When working with missing data in pandas, one often runs into issues, as the main way of handling it is to convert the data into float columns. pandas provides efficient, native support for boolean columns through numpy.dtype('bool'). Sadly, this dtype only supports True/False as possible values and offers no possibility for storing missing...

  • Data Engineers: The best friends of Data Scientists you forgot to hire.

    At the moment, there are two hot topics in Computer Science: AI and Blockchain. Behind these two buzzwords, there are industries striving to build successful products. Currently, I work in the sector often labelled as AI. Usually, it is also described with other terms like Machine Learning or Big Data. In this sector, the most sought-after job is currently the...

  • Data Science I/O - A baseline benchmark for 2019

    Data Science and Machine Learning are tasks that have their own requirements on I/O. Like many other tasks, they start out on tabular data in most cases. In contrast to a typical reporting task, they don’t work on aggregates but require the data at the most granular level. Some machine learning algorithms are able to work directly on aggregates but...

  • PyFlame: profiling running Python processes

    Identifying performance bottlenecks in long-running processes often involves careful instrumentation ahead of time or guessing where the root of the problem may be. A very welcome set of tools are those that help you diagnose problems in live systems without modifying them. One important tool I recently came across is the pyflame profiler.

  • Use Numba to work with Apache Arrow in pure Python

    Apache Arrow is an in-memory format for columnar data. In more “plain” English, it is a standard for how to store DataFrames/tables in memory, independent of the programming language. One of its most prominent uses is for the @pandas_udf decorator in Apache Spark to move data quickly between Scala and Python/pandas.

  • AHL Python Hackathon April 2018

    Three weeks ago, Man AHL organised an open-source hackathon at their London office. As part of the hackathon, participants were to contribute to one of the PyData artifacts they regularly use. To support them in making their first contribution, AHL also arranged for several core committers of open-source projects to be present at the event. I joined in as the representative...