Software Engineering Team CU Dept. of Biomedical Informatics

Blog

2024

Parquet: Crafting Data Bridges for Efficient Computation

Apache Parquet is a columnar, strongly typed tabular data storage format built for scalable processing and compatible with many data models, programming languages, and software systems. Parquet files (usually denoted with a .parquet filename extension) are typically compressed within the format itself and are often used in embedded or cloud-based high-performance scenarios. The format has grown in popularity since its introduction in 2013 and serves as a core data storage technology in many organizations. This article will introduce the Parquet format from a research data engineering perspective.

Navigating Dependency Chaos with Lockfiles

Writing software often entails using code from other people to solve common challenges and take advantage of existing work. External software used by a specific project can be called a “dependency” (the software “depends” on that external work to accomplish tasks). Collections of software are oftentimes made available as “packages” through various platforms. Package management for dependencies, the task of managing collections of dependencies for a specific project, is a specialized area of software development that can involve the use of unique tools and files. This article will cover package dependency management through special files generally referred to as “lockfiles”.
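As a rough illustration of the idea behind lockfiles, the sketch below checks a dependency against a pinned version and content hash. The `lock` dictionary, the package name `example-package`, and the `matches_lock` helper are all hypothetical and do not reflect the file layout of any particular package manager; real lockfile formats differ in structure and detail.

```python
import hashlib

# Pretend bytes of a downloaded package archive (illustrative only).
artifact = b"pretend package archive bytes"

# Hypothetical lockfile content: an exact version plus the expected
# SHA-256 digest of the artifact for each dependency.
lock = {
    "example-package": {
        "version": "1.2.3",
        "sha256": hashlib.sha256(artifact).hexdigest(),
    }
}


def matches_lock(name, version, payload, lock):
    """Return True only if the version and content hash both match the lock entry."""
    entry = lock.get(name)
    if entry is None:
        return False
    digest = hashlib.sha256(payload).hexdigest()
    return entry["version"] == version and digest == entry["sha256"]


print(matches_lock("example-package", "1.2.3", artifact, lock))
print(matches_lock("example-package", "1.2.3", artifact + b"!", lock))
```

Pinning both the version and the hash means that an install either reproduces the recorded dependency exactly or fails the check, which is the reproducibility property lockfiles aim to provide.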

Python Memory Management and Troubleshooting

Have you ever run Python code only to find it taking forever to complete, or sometimes abruptly ending with an error like: 123456 Killed or killed (program exited with code: 137)? You may have experienced memory resource or management challenges associated with these scenarios. This post will cover some computer memory definitions, how Python makes use of computer memory, and some tools which may help with these types of challenges.
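One way to explore memory questions like these is the standard library's `tracemalloc` module. The sketch below is a minimal example, not necessarily the approach the post takes: the `peak_bytes` helper and the value of `N` are assumptions made for illustration. It also shows the common convention that exit code 137 corresponds to 128 plus signal number 9 (SIGKILL on Linux), which is often how an out-of-memory kill appears.

```python
import tracemalloc


def peak_bytes(build):
    """Return the peak bytes allocated by Python while calling build()."""
    tracemalloc.start()
    try:
        build()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


N = 200_000  # illustrative workload size

# Materializes every square in a list before summing.
list_peak = peak_bytes(lambda: sum([i * i for i in range(N)]))

# Streams squares one at a time through a generator expression.
gen_peak = peak_bytes(lambda: sum(i * i for i in range(N)))

# Assumption: signal number 9 is SIGKILL, as on Linux.
SIGKILL_NUMBER = 9
kill_exit_code = 128 + SIGKILL_NUMBER

print(f"list peak: {list_peak} bytes, generator peak: {gen_peak} bytes")
print(f"exit code after SIGKILL: {kill_exit_code}")
```

The generator version keeps only one value alive at a time, so its peak allocation stays far below the list version for the same result.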

2023

Tip of the Week: Codesgiving - Open-source Contribution Walkthrough

Thanksgiving is a holiday celebrated in many countries which focuses on gratitude for the good harvests of the preceding year. In the United States, we celebrate Thanksgiving on the fourth Thursday of November each year, often by eating meals we create together with others. This post channels the spirit of Thanksgiving by giving our thanks through code as a “Codesgiving”, acknowledging and creating better software together.

Tip of the Week: Data Quality Validation through Software Testing Techniques

Data-oriented software development can benefit from a specialized focus on varying aspects of data quality validation. We can use software testing techniques to validate certain qualities of the data in order to meet a declarative standard (where one doesn’t need to guess or rediscover known issues). These techniques come in a number of forms and generally follow existing software testing concepts which we’ll expand upon below. This article will cover a few tools which leverage these techniques for addressing data quality validation testing.
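To make the idea of a declarative data standard concrete, here is a minimal sketch using only plain Python rather than any specific validation tool. The column names, the 0 to 120 age range, and the `find_problems` helper are assumptions chosen for illustration.

```python
# Declared expectations: each column name maps to its required type.
EXPECTED_TYPES = {"id": int, "age": int, "name": str}


def find_problems(rows):
    """Return a list of data quality problems found in rows (empty if none)."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        # Type check: every declared column must hold the expected type.
        for column, expected in EXPECTED_TYPES.items():
            if not isinstance(row.get(column), expected):
                problems.append(f"row {index}: {column} is not {expected.__name__}")
        # Range check: ages must fall within a plausible interval.
        age = row.get("age")
        if isinstance(age, int) and not 0 <= age <= 120:
            problems.append(f"row {index}: age {age} outside 0-120")
        # Uniqueness check: ids must not repeat.
        row_id = row.get("id")
        if row_id in seen_ids:
            problems.append(f"row {index}: duplicate id {row_id}")
        seen_ids.add(row_id)
    return problems


clean = [{"id": 1, "age": 34, "name": "a"}, {"id": 2, "age": 51, "name": "b"}]
dirty = [{"id": 1, "age": 34, "name": "a"}, {"id": 1, "age": 300, "name": None}]

for problem in find_problems(dirty):
    print(problem)
```

Checks like these can run inside an ordinary test suite, so a dataset that violates the declared standard fails the same way a broken function would.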

Tip of the Week: Python Packaging as Publishing

Python packaging is the craft of preparing for and reaching distribution of your Python work to wider audiences. Following conventions for packaging helps your software work become more understandable, trustworthy, and connected (to others and their work). Taking advantage of common packaging practices also strengthens our collective superpower: collaboration. This post will cover preparation aspects of packaging, readying software work for wider distribution.

Tip of the Week: Using Python and Anaconda with the Alpine HPC Cluster

This post demonstrates the use of Python on Alpine, a High Performance Compute (HPC) cluster hosted by the University of Colorado Boulder’s Research Computing. We use Python here by way of Anaconda environment management to run code on Alpine. This post will cover background on the technologies and how to use the contents of an example project repository as though it were a project you were working on and wanting to run on Alpine.

Tip of the Week: Automate Software Workflows with GitHub Actions

There are many routine tasks which can be automated to help save time and increase reproducibility in software development. GitHub Actions provides one way to accomplish these tasks using code-based workflows and related workflow implementations. This type of automation is commonly used to perform tests, builds (preparing for the delivery of the code), or delivery itself (sending the code or related artifacts where they will be used).

Tip of the Week: Branch, Review, and Learn

Git provides a feature called branching which facilitates parallel and segmented programming work through commits with version control. Using branching enables both work concurrency (multiple people working on the same repository at the same time) and a chance to isolate and review specific programming tasks. This article covers some conceptual best practices with branching, reviewing, and merging code using GitHub.

Tip of the Week: Software Linting with R

This article covers using the software technique of linting on R code in order to improve code quality, development velocity, and collaboration.

Tip of the Week: Timebox Your Software Work

Programming often involves long periods of problem solving which can sometimes lead to unproductive or exhausting outcomes. This article covers one way to avoid unproductive time or protect yourself from exhaustion through a technique called “timeboxing” (also sometimes referred to as “timeblocking”).

Tip of the Week: Linting Documentation as Code

Software documentation is sometimes treated as a less important or secondary aspect of software development. Treating documentation as code allows developers to version control the shared understanding and knowledge surrounding a project. Leveraging this paradigm also enables the use of tools and patterns which have been used to strengthen code maintenance. This article covers one such pattern: linting, or static analysis, for documentation treated like code.

2022

Tip of the Week: Remove Unused Code to Avoid Software Decay

The act of creating software often involves many iterations of writing, personal collaborations, and testing. During this process it’s common to lose awareness of code which is no longer used, and thus may not be tested or otherwise linted. Unused code may contribute to “software decay”, the gradual diminishment of code quality or functionality. This post will cover software decay and strategies for addressing unused code to help keep your code quality high.

Tip of the Week: Data Engineering with SQL, Arrow and DuckDB

Apache Arrow is a language-independent and high-performance data format useful in many scenarios. DuckDB is an in-process SQL-based data management system which is Arrow-compatible. In addition to providing a SQLite-like database format, DuckDB also provides a standardized and high-performance way to work with Arrow data where otherwise one may be forced to use language-specific data structures or transforms.

Tip of the Week: Diagrams as Code

Diagrams can be a useful way to illuminate and communicate ideas. Free-form drawing or drag and drop tools are one common way to create diagrams. With this tip of the week we introduce another option: diagrams as code (DaC), or creating diagrams by using code.

Tip of the Week: Use Linting Tools to Save Time

Have you ever found yourself spending hours formatting your code so it looks just right? Have you ever caught a duplicative import statement in your code? We recommend using open source linting tools to help avoid common issues like these and save time.
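As a small illustration of what one lint check can look like, the sketch below uses Python's standard `ast` module to flag names imported more than once. The `duplicate_imports` helper and the `SAMPLE` source are hypothetical, and this is a simplified stand-in for the checks that dedicated linting tools perform far more thoroughly.

```python
import ast
from collections import Counter


def duplicate_imports(source):
    """Return a sorted list of names imported more than once in source."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # Covers statements such as: import os
            counts.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Covers statements such as: from math import pi
            module = node.module or "."
            counts.update(f"{module}.{alias.name}" for alias in node.names)
    return sorted(name for name, count in counts.items() if count > 1)


SAMPLE = "import os\nimport sys\nimport os\nfrom math import pi\n"

print(duplicate_imports(SAMPLE))
```

Because the check works on the parsed syntax tree rather than on raw text, it is not confused by comments or strings that merely mention an import.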