When I first got started in tech on a data analytics team in 2012, the company I worked for was getting ready to implement Hadoop and move all our data out of Oracle. One team member gathered us in the conference room for an hour to pass down the wisdom he’d received in a four-day Cloudera training.
“It’s called MapReduce,” he said. “It’s a way to do processing on a lot of files at once. There’s a name node and worker nodes, and they all do work in parallel. It’s much more powerful than relational databases.”
This idea of multiple machines controlled by a single, central machine, all doing work together, seemed important and novel. But I didn’t receive any more context on Hadoop at work—my team members were as new to distributed computing as I was. We spent months recreating the star schemas we’d so laboriously put together in Oracle, and countless hours trying to get HDFS, Hive, and Pig to do what we’d previously done in a relational setting. As an analyst, I was on my own, running Hive queries and finding the new data in HDFS. For months I struggled alone to understand this new architecture, supported only by equally lost question-askers on Stack Overflow and the Apache mailing lists.
We weren’t leveraging the power of Hadoop effectively. The benefit of the MapReduce paradigm is that you can run a large number of computations quickly, but it works best with a small number of large files; with many small files, the overhead of coordinating parallel processes across a distributed system starts to outweigh the gains. When you migrate your data from relational databases, you also lose something powerful: the ability to query things very quickly to get answers. Optimizing Hive queries, which are really SQL-like syntax compiled into Java MapReduce jobs, requires a different programming paradigm than traditional relational SQL queries. But in order to know this, you’d have to understand the assumptions under which Hadoop operates, and how the relational databases that came before it were different. You’d also need to understand the use cases of each.
My team didn’t have this context. We struggled to get up and running, and at one point we weren’t getting any good data out of Oracle (because we were focused on moving it all to Hadoop) or out of Hadoop (because we didn’t entirely understand how it worked yet). We weren’t alone: In those early days, numerous companies, such as Blackboard, had been lured by Hadoop’s power, only to find themselves spinning their wheels as they couldn’t get data out or efficiently process and query the data already in the platform.
My team’s problem was really twofold: Not only was I, a new entrant to tech, thrown in at the deep end, but also many companies—including mine—were moving to adopt Hadoop so quickly that they did so without hiring or consulting people who had experience with the architecture.
I was still thinking about repeatable software patterns and mentorship last year when my husband and I came across a naval commissioning ceremony at the Intrepid Museum, an aircraft carrier-turned-exhibition space that’s docked in New York City. During the final act, the newly commissioned officers returned their first salutes from the enlisted service members who had assisted in their officer candidate process. These service members are often mentor figures who have helped the newly promoted officers during their training. After the salute is returned, the newly commissioned officers shake hands with their new subordinates and pass them a silver dollar, a traditional token of respect and gratitude that dates back to the Revolutionary War.
I was struck by this act. There’s no similar tradition in the tech industry wherein developers are paired with a senior team member before they deploy to production for the first time. Indeed, technologists often enter the industry without any knowledge of how the work has been done before—a fact that isn’t helped by the breathless pace of technological change. Early-career engineers may have to evaluate Redshift versus S3 versus HDFS versus Postgres for storing their data within the cadence of a sprint cycle, not realizing that, while all have their specific use cases, Redshift and Postgres are driven by the same relational hierarchy, S3 and HDFS share a similar folder-like architecture, and most people choose to move to relational models in the long run. (Though that won’t stop them from running into the perennial temporal database problem.)
Because few come into their roles with the historical and contextual knowledge they need to design and implement effective software architecture, developers must instead gather that information proactively. Keeping a rotating list of architectural patterns (and their backgrounds) in mind can head off the need to rework and prevent architectural misfit.
For example, the MapReduce architecture works by running your data through a simple pattern. Take word count: The initial document is split into multiple parts; a function then runs over each part, counting the words in it. Those partial counts are then added together across all the parts. The developers of MapReduce did not discover this pattern; rather, they looked to the map and reduce functions already available in functional programming. The citations listed in the original 2004 MapReduce paper make clear that the model is not new: It arose from the work of hundreds of academics across a number of different platforms. (For example, one of the citations is from a paper on “parallel prefix computation” published all the way back in 1980.) The map and reduce pattern also exists as functions in languages like Scala, Scheme, and Clojure. In 2011, Hadley Wickham called this the “split-apply-combine” pattern. Dask and Spark work on this same principle, with slightly different implementations.
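To make the pattern concrete, here is a minimal single-machine sketch of word count in Python. It is not Hadoop’s implementation, and the function names (`split_document`, `count_words`, `merge_counts`, `word_count`) are invented for illustration; the sketch only assumes that a document can be cut into chunks of whole lines and that words are separated by whitespace.

```python
from collections import Counter
from functools import reduce


def split_document(text: str, parts: int) -> list[str]:
    """Split: cut the document into roughly equal chunks of whole lines."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // parts))  # ceiling division, at least 1
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def count_words(chunk: str) -> Counter:
    """Map: count the words in one chunk, independently of the others."""
    return Counter(chunk.lower().split())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce: add two partial counts together."""
    return left + right


def word_count(text: str, parts: int = 3) -> Counter:
    """Run split, map, and reduce in sequence on a single machine."""
    chunks = split_document(text, parts)
    partial_counts = map(count_words, chunks)
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    sample = "the cat\nthe dog\nthe end"
    print(word_count(sample).most_common(1))
```

Because each chunk is counted independently, the map step is the part a distributed system would farm out to worker nodes; the reduce step only needs the partial counts, not the original text.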
You could even use this pattern without using a distributed system. (I’ve previously written about how running map and reduce patterns on your laptop can be just as programmatically efficient.) For example, when my former team and I were struggling to understand Hadoop, we spent a lot of time learning how the system worked by implementing word count, the canonical “Hello, World!” in big data. It was only later, when I was reading Programming Pearls by Jon Bentley, that I saw he had implemented a word count program in C++. Programs to count words have existed and been implemented in languages as old as FORTRAN, but somehow I’d never thought about them outside of the context of Hadoop.
Often, a software developer’s first encounter with a system is a highly practical one: Something is broken and needs to be fixed. There’s no time to learn from the past when the current state is a broken build and the business needs new features. As a result, in the early 2010s many people ended up solving problems from scratch in Hadoop that had already been solved in previous iterations of distributed systems and map and reduce paradigms.
This lack of institutional memory has also resulted in the repeated rise and fall and rise and fall of SQL. Relational databases have existed since the 1970s, but around the same time Hadoop was released in the late 2000s, companies started experimenting with NoSQL patterns of storage and processing. The resulting products included nonrelational stores like MongoDB, HDFS, and Cassandra. In these new paradigms, you could store your data without any entity mapping. No relationships or indices were needed, and the solutions prided themselves on quick writes.
This was fantastic—until analysts and data scientists needed to count things. It was nearly impossible to query by writing MapReduce scripts against folders and folders of disaggregated logs with the same efficiency as SQL and indices. After the log-based files got stuck in HDFS purgatory, the industry again built up SQL-based retrieval engines: Presto on Hadoop, KSQL aggregation for Kafka streaming, and CQL for Cassandra.
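The contrast can be sketched on a small scale. In the example below, SQLite (from Python’s standard library) stands in for any relational engine, and the `events` table, its columns, and the sample rows are hypothetical. The first function counts by scanning every record by hand, roughly what a script over raw logs has to do; the second declares the same question in SQL and lets the engine use an index.

```python
import sqlite3

# Hypothetical event log: (user_name, action) pairs.
EVENTS = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "purchase"),
    ("alice", "login"),
]


def count_logins_by_scanning(rows):
    """Walk every record and tally matching ones by hand."""
    counts = {}
    for user_name, action in rows:
        if action == "login":
            counts[user_name] = counts.get(user_name, 0) + 1
    return counts


def count_logins_with_sql(rows):
    """Load the same records into a table and ask the question in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_name TEXT, action TEXT)")
    # The index lets the engine find 'login' rows without a full scan.
    conn.execute("CREATE INDEX idx_events_action ON events (action)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT user_name, COUNT(*) FROM events "
        "WHERE action = ? GROUP BY user_name",
        ("login",),
    ).fetchall()
    conn.close()
    return dict(result)


if __name__ == "__main__":
    print(count_logins_by_scanning(EVENTS))
    print(count_logins_with_sql(EVENTS))
```

Both functions return the same counts. On four rows the difference is invisible; the point is only that the SQL version states what to count and leaves the how to the engine.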
In the rare cases where historical memory does get passed on, the benefits are clear. Look to the experience of former Python Benevolent Dictator for Life Guido van Rossum, who took a job at Dropbox in 2012. As the initial designer and developer of Python, van Rossum not only had arguably the most experience with the language, but also the institutional memory of why design changes were made. At Dropbox, he encouraged the development of mypy, a static type checker for Python.
Because of van Rossum’s experience with the Python codebase and community, he recognized the desire, and need, for type hints as larger companies ran into the limitations of running Python without type checking. He hired Jukka Lehtosalo to undertake the project and, while encouraging the build-out of the type-checking architecture, promoted its adoption among various teams at Dropbox. Under van Rossum’s guidance, Dropbox has been migrating to Python 3, the latest version of the language, a process that many other large companies have struggled with. (Python 2 is slated to be sunset in January 2020.)
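For readers who haven’t used type hints, here is a small illustrative sketch; the functions and data are hypothetical and have nothing to do with Dropbox’s codebase. The annotations change nothing at runtime, but a static checker such as mypy can use them to catch a class of mistakes before the code ever runs.

```python
from typing import Optional


def find_user_id(user_ids: dict[str, int], name: str) -> Optional[int]:
    """Return the ID for a name, or None if the name is not present."""
    return user_ids.get(name)


def describe_user(user_ids: dict[str, int], name: str) -> str:
    """Build a display string, handling the missing-user case explicitly."""
    user_id = find_user_id(user_ids, name)
    # Without this check, a type checker would be expected to flag the
    # arithmetic below, because user_id may be None rather than an int.
    if user_id is None:
        return f"{name}: unknown"
    return f"{name}: id {user_id + 1000}"


if __name__ == "__main__":
    ids = {"ada": 1}
    print(describe_user(ids, "ada"))
    print(describe_user(ids, "grace"))
```

In an unannotated codebase, forgetting the `None` case typically surfaces as a `TypeError` at runtime; with annotations, the same omission can be reported during development.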
Cutting across time and place, even small bits of institutional knowledge can save a company hundreds of hours. For example, our analyst team’s efforts to learn Hadoop were greatly accelerated when an experienced Unix admin joined the team. Mentorship can also help people become more productive developers when they may otherwise have left the company or stagnated in their roles. Today, some organizations offer formal mentorship programs, including Airbnb, Pinterest, and Adobe.
A handful of years after Hadoop gained popularity, the platform is losing ground as alternatives like Spark (also based on MapReduce) and cloud-based ETL solutions take its place. (Even now, some are looking to Dask in favor of Spark.) We see similar trends in software engineering at large: companies shuttling between monoliths and microservices, or between single, on-prem servers and the cloud. Iteration and incremental improvement are natural features of technology: Everyone wants devices to be smaller and code to run more quickly. However, these technological iterations are also due in part to the lack of historical continuity that formal mentorship practices would provide. Much of the time no one steps in to say, “Wait, we just did this five years ago.” If they did, perhaps we’d see tech stacks change at a more reasonable pace, and work with more stable architectures as a result.