
Bitwise Operations for Data Engineers

Ugh. Cursed bitwise operations … something usually reserved for the mythical low-level engineers writing code no one should have to write. I’ve escaped them all but twice in my meager existence; recently I had to use a bitwise operation while converting a Python hashing algorithm into PySpark code. It made my brain hurt. What is this wizardry all about anyway? It got me thinking: I should really try to learn something about bitwise operations, since the topic only comes up once every ten years.

bitwise operations

What are bitwise operations? Anyone who writes code, and plenty of people who don’t, knows that all the strings, characters, and numbers we see and use every day have a binary, or bit, representation; a group of bits (0s and 1s) can represent just about anything.

integers ….

>>> for x in range(1,11):
...     print('{i} is {x}'.format(i=bin(x), x=x))
... 
0b1 is 1
0b10 is 2
0b11 is 3
0b100 is 4
0b101 is 5
0b110 is 6
0b111 is 7
0b1000 is 8
0b1001 is 9
0b1010 is 10

or strings …

>>> bin(ord('a'))
'0b1100001'
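
And the bits round-trip back to the character, so nothing is lost along the way:

>>> chr(int('0b1100001', 2))
'a'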

But bitwise operations … what are they? They are operations that work on the individual bits of a value.
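
As a quick refresher, here is what the core operators do to a couple of small integers in plain Python (5 is 0b101, 3 is 0b011):

>>> 5 & 3    # AND:  0b101 & 0b011 = 0b001
1
>>> 5 | 3    # OR:   0b101 | 0b011 = 0b111
7
>>> 5 ^ 3    # XOR:  0b101 ^ 0b011 = 0b110
6
>>> 5 << 1   # shift left:  0b101 -> 0b1010
10
>>> 5 >> 1   # shift right: 0b101 -> 0b10
2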

They are low-level and fast. Faster than doing simple multiplication, division, or addition. Mostly because the processor can do them directly. At least that’s what people say.

Is there any actual real-world use for them in everyday data engineering work? Very rarely; I’ve only had to do bitwise operations twice in all my years of data engineering. But the more I’ve learned about them, the more attractive they sound.

bitwise operations are always confusing

Once you start digging into bitwise operations, especially for math, you will notice they don’t behave the way you’d expect, and they can get confusing fast.

Take the example of simple addition in Python using bitwise operators.

>>> (1 & 1) << 1
2
>>> (2 & 2) << 1
4

I mean other than the strange-looking syntax…

>>> (2 & 3) << 1
4

Hold up. Since when does 2 + 3 = 4??

It’s because of something called carry. The expression (a & b) << 1 only computes the carry bits, and it only happens to equal the sum when both numbers are the same. Real bitwise addition also needs XOR for the bits that don’t carry, applied over and over until the carry disappears. Nothing with bitwise operators is straightforward.
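
Here is what full bitwise addition actually looks like, a minimal sketch in plain Python (it assumes non-negative integers):

>>> def bitwise_add(a, b):
...     while b:
...         carry = (a & b) << 1  # bits that overflow into the next position
...         a = a ^ b             # sum of the bits, ignoring carries
...         b = carry             # keep going until nothing is left to carry
...     return a
... 
>>> bitwise_add(2, 3)
5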

When working with Big Data, hundreds of TBs and more, calculations take time and cost compute, so why not make them as fast as possible?

How could this actually be done?

bitwise operations with PySpark.

Spark and PySpark offer the typical bitwise operations that are available in most programming languages; a quick sketch of the syntax follows the list.

  • bitwiseAND
  • bitwiseOR
  • bitwiseXOR
  • shiftLeft
  • shiftRight
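
In PySpark the first three are methods on the Column object, while the shifts live in pyspark.sql.functions, so the calls look roughly like this (the DataFrame here is just a throwaway example):

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(5, 3)], ['a', 'b'])
>>> df.select(
...     F.col('a').bitwiseAND(F.col('b')).alias('and'),
...     F.col('a').bitwiseOR(F.col('b')).alias('or'),
...     F.col('a').bitwiseXOR(F.col('b')).alias('xor'),
...     F.shiftLeft('a', 1).alias('shl'),
...     F.shiftRight('a', 1).alias('shr'),
... ).show()
+---+---+---+---+---+
|and| or|xor|shl|shr|
+---+---+---+---+---+
|  1|  7|  6| 10|  2|
+---+---+---+---+---+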

What would this look like in real life? What if we had a 200TB+ data source and we needed to do some simple multiplication on some integers?

“A left shift by n bits is equivalent to multiplication by pow(2, n)”

Python docs.
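
A quick sanity check in plain Python:

>>> 8 << 3
64
>>> 8 * pow(2, 3)
64
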
# BITWISE multiplication
>>> df = spark.createDataFrame([(2,),(4,),(8,)], ['test'])
>>> df.withColumn('mult', F.shiftLeft('test', 1)).show()
+----+----+
|test|mult|
+----+----+
|   2|   4|
|   4|   8|
|   8|  16|
+----+----+

# STANDARD multiplication
>>> df.withColumn('mult', F.col('test')*2).show()
+----+----+
|test|mult|
+----+----+
|   2|   4|
|   4|   8|
|   8|  16|
+----+----+

I’m curious: is it measurably faster? Everyone says it is. Let’s download some Divvy bike trip data; maybe half a gig will be enough to notice a difference.

>>> df = spark.read.csv('data/*.csv', header='true')
>>> df.show()
+----------------+-------------+-------------------+-------------------+--------------------+----------------+--------------------+--------------+---------+----------+---------+----------+-------------+
|         ride_id|rideable_type|         started_at|           ended_at|  start_station_name|start_station_id|    end_station_name|end_station_id|start_lat| start_lng|  end_lat|   end_lng|member_casual|
+----------------+-------------+-------------------+-------------------+--------------------+----------------+--------------------+--------------+---------+----------+---------+----------+-------------+
|8CD5DE2C2B6C4CFC|  docked_bike|2020-06-13 23:24:48|2020-06-13 23:36:55|Wilton Ave & Belm...|             117|Damen Ave & Clybo...|           163| 41.94018| -87.65304|41.931931|-87.677856|       casual|
|9A191EB2C751D85D|  docked_bike|2020-06-26 07:26:10|2020-06-26 07:31:58|Federal St & Polk St|              41|  Daley Center Plaza|            81|41.872077|-87.629543|41.884241|-87.629634|       member|
|F37D14B0B5659BCF|  docked_bike|2020-06-23 17:12:41|2020-06-23 17:21:14|  Daley Center Plaza|              81|State St & Harris...|             5|41.884241|-87.629634|41.874053|-87.627716|       member|

>>> df = df.withColumn('started_at', F.col('started_at').cast('timestamp'))
>>> df = df.withColumn('ended_at', F.col('ended_at').cast('timestamp'))
>>> df = df.withColumn('seconds', F.unix_timestamp(F.col('ended_at')) - F.unix_timestamp(F.col('started_at')))
>>> df.select('seconds').show()
+-------+
|seconds|
+-------+
|    727|
|    348|
|    513|
+-------+

Let’s try an example script and see what the difference is.

from datetime import datetime
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
df = spark.read.csv('data/*.csv', header='true')
df = df.withColumn('started_at', F.col('started_at').cast('timestamp'))
df = df.withColumn('ended_at', F.col('ended_at').cast('timestamp'))
df = df.withColumn('seconds', F.unix_timestamp(F.col('ended_at')) - F.unix_timestamp(F.col('started_at')))

t1 = datetime.now()
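# bit-shift multiplication; swap this line for F.col('seconds') * 2 to time the standard version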
df = df.withColumn('mult', F.shiftLeft('seconds', 1))
df.write.csv('results', header='true')
t2 = datetime.now()
print(t2-t1)

0:00:08.350034 for the bit-shift multiplication, 0:00:08.488893 for normal multiplication.

Slightly faster. Not bad; I suppose it might be more noticeable at a larger scale. What if we multiply our files so we have about 2.5GB?

0:00:26.320752 for the bit-shift and 0:00:25.154538 for normal multiplication.

So apparently in Spark, you aren’t really going to see much difference in performance using bitwise operators; most of the runtime here is spent reading and writing the CSV files, not doing the arithmetic.

Thoughts on bitwise operations and Data Engineering.

I’m sure there might be legitimate reasons for bitwise operations to come up in day-to-day data engineering, but I somehow doubt it. Still, I think it’s a good exercise to understand, at least somewhat, what bitwise operators can and can’t do, and their possible uses.

It seems like the most obvious use would be doing simple math on integers, although I’m curious why anyone would bother, since it just makes the code, and what’s happening in it, less obvious.

I mean, if you don’t like your co-workers and want to throw them for a loop, I’m sure it’s a good idea.