By Tara Vickers
I have done a lot of work with BNIA on automating their data processing and analysis. Along the way, I came across some issues with the code housed on BNIA’s GitHub, though this is not the fault of the original authors. As many know, technology is an ever-evolving field, where new innovations, however minuscule, are tested and launched incredibly fast. This is also true for programming languages.
Programming languages tend to be updated every few months to a year, at least in the case of popular languages such as Python – the primary language used within BNIA to automate tasks. Python has gone through many changes over the past few years, including some aspects of the language going through a process called deprecation.
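Deprecation is a process wherein a function or method in a programming language is marked as discouraged, usually because a newer method replaces it, and is scheduled for removal in a later release. This does not mean that it stops working right away: a deprecated function typically still runs and emits a warning. It can, however, cause errors when used with newer versions of other technologies, such as NumPy (a scientific computing library), and once it is removed entirely, any code that calls it fails.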
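To make the idea concrete, here is a minimal sketch of how a library author might mark an old function as deprecated using Python's built-in warnings module. The functions add_row and append_row are hypothetical names made up for this illustration; they are not from pandas or from the BNIA GitHub.

```python
import warnings


def add_row(table, row):
    """Current function: return a new list with the row added."""
    return table + [row]


def append_row(table, row):
    """Older name, kept working for now but marked as deprecated."""
    warnings.warn(
        "append_row is deprecated; use add_row instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return add_row(table, row)


# Record warnings so they can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = append_row([["Canton", 3]], ["Baltimore City", 3])
```

The old function still returns the right answer, but it also raises a flag that it may disappear in a future release. That warning is the cue to switch to the replacement before the old name is removed.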
Deprecation extends to other programming technologies, such as programming libraries. A library imported in almost every program within the BNIA GitHub is Pandas, a data analysis library. Pandas has also deprecated methods within the past few years, such as append (which has been replaced by concat).
To put it simply, the code within the BNIA GitHub tends to be around 2-3 years old. This results in a lot of code that, while usable in the past, now needs to be updated to be functional long term.
Skills/Tools/Techniques to Solve the Problem
Updating the entire GitHub would be laborious, to say the least. And, depending on the next update, those updates may all be moot. Therefore, communication and proper research are key skills for future programmers at BNIA. This is true for programming in general, but these skills are especially important when dealing with established code.
The point is to work smarter, not harder. Much of the code within the GitHub is incredibly valuable for performing specific tasks and creating certain indicators. I propose a document that lists the versions of the languages and technologies currently used within the GitHub. This can be as simple as linking to the official changelogs for Python and Pandas, the two most popular technologies within it, and for NumPy, which is often used in tandem with Pandas.
https://docs.python.org/3/whatsnew/changelog.html
https://pandas.pydata.org/docs/whatsnew/index.html
https://numpy.org/doc/stable/release.html
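Alongside those links, the document could record which versions are actually installed on the machine running the code. Below is a minimal sketch of one way to collect that using only the Python standard library. The function name report_versions and the default package list are my own assumptions for illustration, not an existing BNIA script.

```python
import platform
from importlib import metadata


def report_versions(packages=("pandas", "numpy")):
    """Return a dict mapping each tool to its installed version string."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = "not installed"
    return versions


versions = report_versions()
for name, version in versions.items():
    print(f"{name}: {version}")
```

Pasting this output at the top of the version document (or of a script) would give the next programmer a quick way to compare their own setup with the one the code was last run on.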
There is a note to be made about version converters. If, for example, a piece of Python code is incredibly old (such as being written for Python Version 2 instead of 3), there are converters that can translate the code for you. These only help with the jump from Version 2 to Version 3. They do not apply to changes between releases within Version 3 (for example, Version 3.8 to Version 3.9).
Key Takeaways
Programming is a behind-the-scenes job when it comes to data science. However, it is incredibly important when optimizing workflow and automating data processing. As with many other skills and professions, communication is key. It arguably matters even more in programming, especially when the current programmer is less experienced in a language than the person who wrote the code.
We have yet to see how such a document affects the workflow for future BNIA programmers, but I know that, for myself, having one would have saved me time on the code I was working on, even if it simply reminded me that SOME FUNCTIONS OR METHODS HAVE BEEN DEPRECATED.
Here is an example of some code I used for the Vital Signs 2024 project. Let’s say I wanted to add a new row to a DataFrame (a fancy word for a table) in the Python file FDIC_banks.py.
banks.loc[len(banks.index)] = ['Baltimore City', banks['count'].sum()]
As you can see, we use loc to add a row for Baltimore City to the banks DataFrame. Using loc is perfectly fine for adding rows. However, it can be very, very picky about the index and columns of the DataFrame, and it can lead to this error if you happen to move the line of code:
banks.loc[len(banks.index)] = ['Baltimore City', banks['count'].sum()]
~~~~~~~~~^^^^^^^^^^^^^^^^^^
ValueError: cannot set a row with mismatched columns
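This error appears when the list on the right-hand side does not have one value for every column in the DataFrame. The sketch below reproduces it on made-up data. The column name 'CSA', the extra 'year' column, and the numbers are assumptions for illustration only; the real banks table in FDIC_banks.py may look different.

```python
import pandas as pd

# Made-up stand-in for the banks table: two columns.
banks = pd.DataFrame({"CSA": ["Canton", "Fells Point"], "count": [3, 2]})

# Works: two values for a two-column DataFrame.
banks.loc[len(banks.index)] = ["Baltimore City", banks["count"].sum()]

# Suppose the line is moved to a point where the table has a third column.
wide = banks.assign(year=2024)

message = ""
try:
    # Fails: two values for a three-column DataFrame.
    wide.loc[len(wide.index)] = ["Baltimore City", wide["count"].sum()]
except ValueError as err:
    message = str(err)

print(message)
```

Checking that the number of values matches len(banks.columns) at the point where the line runs is a quick first step when this error shows up.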
Therefore, other methods may be used, especially if the structure of the code must be heavily modified. The append method is one of the options that shows up in search results for alternatives to loc (NOT concat, its replacement), and the code would look something like this (approximately):
banks = pd.DataFrame(banks).append('Baltimore City', ignore_index=True)
This, however, results in the following error:
AttributeError: 'DataFrame' object has no attribute 'append'
This happens because append was deprecated in Pandas 1.4 and then removed entirely in Pandas 2.0, with concat as its replacement (and loc as the alternative). Therefore, I wasted precious time on an outdated function, and I was no closer to properly adding a new row within the new code structure. If I had been given a small document or link covering past deprecations, I would have spent that time homing in on the issues with the code structure and properly using loc rather than trying a piece of code that no longer exists.
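For reference, here is a minimal sketch of how the same row could be added with concat instead. The data and the column name 'CSA' are made up for illustration and may not match the real banks table.

```python
import pandas as pd

# Made-up stand-in for the banks table.
banks = pd.DataFrame({"CSA": ["Canton", "Fells Point"], "count": [3, 2]})

# Build the new row as its own one-row DataFrame, naming each column.
total_row = pd.DataFrame(
    [{"CSA": "Baltimore City", "count": banks["count"].sum()}]
)

# concat stacks the two tables; ignore_index=True renumbers the rows.
banks = pd.concat([banks, total_row], ignore_index=True)

print(banks)
```

Because the new row names its columns explicitly, this approach is less sensitive to column order than assigning a bare list with loc, though the column names still need to match the existing table.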
What I have learned about data science for social good
The problems and skills I illustrated in this post relate heavily to the process of working with data. Still, even on the programming side, it is clear to see how data science can be, and has been, used for social good. With the data I’ve worked with, I’ve noticed many things: how communities have grown or shrunk, and how they have improved or faltered. Important indicators such as those related to banks or rehabilitation centers were prevalent in my work. This data holds important information that can be used to improve the communities of Baltimore in a myriad of ways.
Addressing the problems I described, through the solutions I proposed, is essential to improving data processing and analysis so that this important information can reach the public.