– Kaitlyn Baker, BNIA Data Science Corps, Summer 2021
Web scraping is a valuable tool for acquiring data from online sources. A web scraper can be designed to extract data from one or more websites for uses such as product pricing and comparison, movie and theatre listings, or gaming data. The web scraper in this project was built for Return Home, a Baltimore organization dedicated to helping previously incarcerated citizens return to Baltimore. Return Home has developed a website that compiles a wide range of resources for returning citizens, such as housing, food, job training, addiction counseling, and more. More information can be found on their website at returnhome.org.
The system that Return Home currently uses to display contact information for the resources on its website relies on static addresses and phone numbers that have been manually entered into a spreadsheet. The web scraper currently in place is triggered each day and sends out a report noting whether each site is responding. The scraper is written in Python, and Google Sheets is used to store the data.
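As a rough illustration of the daily "is each site responding" report, here is a minimal sketch using only the Python standard library. The function names `fetch_status` and `build_report`, the user-agent string, and the report wording are all hypothetical; the actual Return Home scraper, its scheduling, and how it sends the report are not shown in this post.

```python
from typing import Callable, Dict, Iterable
from urllib.request import Request, urlopen


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL.

    urlopen raises an exception (a subclass of OSError) when the site
    cannot be reached or answers with an error status such as 404.
    """
    request = Request(url, headers={"User-Agent": "resource-checker-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.status


def build_report(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, str]:
    """Map each URL to a short, human-readable status line."""
    report = {}
    for url in urls:
        try:
            status = fetch(url)
            report[url] = "responding" if 200 <= status < 400 else f"HTTP {status}"
        except (OSError, ValueError) as error:
            # Covers unreachable hosts, timeouts, HTTP errors, and malformed URLs.
            report[url] = f"not responding ({error.__class__.__name__})"
    return report
```

Passing the fetch function as a parameter is a design choice for this sketch: it lets the report logic be tried out with a stand-in function instead of real network requests.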
The issue with the current system is that there is no way to know when an organization has changed its contact information, since the displayed contact values are manually entered. A change could only be detected by manually cross-checking each organization’s contact info against the info in the spreadsheet. To resolve this issue, we turned to XPath expressions to keep Return Home’s website updated with the most current information. XPath is a query language for selecting specific pieces of data within an XML document (or an HTML page parsed into a similar tree), and an XPath expression is the path that targets one particular piece of data within that document.
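To make the idea of an XPath expression concrete, the sketch below pulls an address and a phone number out of a small, made-up page. Everything in it is an assumption for illustration: the sample markup, the `contact` id, the address, and the phone number are invented, and `extract_text` is a hypothetical helper. It uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; real web pages often need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# A made-up, well-formed page standing in for a resource's website.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="contact">
      <p>123 Example St, Baltimore, MD 21201</p>
      <p>(410) 555-0100</p>
    </div>
  </body>
</html>
"""


def extract_text(document: str, xpath: str) -> Optional[str]:
    """Return the text of the first element matching the XPath, or None."""
    root = ET.fromstring(document.strip())
    node = root.find(xpath)
    return node.text if node is not None else None


# ElementTree paths are relative to the root element, hence the leading ".".
ADDRESS_XPATH = ".//*[@id='contact']/p[1]"
PHONE_XPATH = ".//*[@id='contact']/p[2]"

address = extract_text(SAMPLE_PAGE, ADDRESS_XPATH)
phone = extract_text(SAMPLE_PAGE, PHONE_XPATH)
```

Each expression reads as a path: find the element whose `id` is `contact`, then take its first or second `p` child. If the page's structure changes, the expression may stop matching, in which case `extract_text` returns `None`.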
For each resource Return Home adds to its website, the web scraper uses the XPath expressions for the address and phone number listed on that resource’s website to collect the contact information and display it on the Return Home website. This helps ensure that Return Home’s website stays up to date with the information it provides to users.
The XPath expressions for both the address and the phone number were copied from each resource’s website and added to the Google spreadsheet. To find the XPath expression for a specific piece of data, right-click that piece of data on the page and click “Inspect” to pull up the developer tools. From there, right-click the highlighted element in the developer tools, hover over “Copy”, and click “Copy XPath”, similar to the figure displayed here. This copies an expression that points the web scraper to the data it needs to gather.
To further improve the accuracy of this web scraper, we added code that compares the currently scraped values to the previously scraped values. This is displayed in the code snippet above, in the green brackets. The intention is that if any collected value changes, the webmaster is alerted and can check whether any action is needed.
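The comparison step could look something like the following minimal sketch. The function name `find_changes`, the field names, and the sample values are hypothetical, and how the webmaster is actually alerted is not covered here.

```python
from typing import Dict, Optional, Tuple


def find_changes(
    previous: Dict[str, str],
    current: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {field: (old_value, new_value)} for every field that differs."""
    changes = {}
    for field, new_value in current.items():
        old_value = previous.get(field)  # None if the field was never scraped before
        if old_value != new_value:
            changes[field] = (old_value, new_value)
    return changes


# Made-up values standing in for one resource's row in the spreadsheet.
previous_values = {"address": "123 Example St", "phone": "(410) 555-0100"}
current_values = {"address": "123 Example St", "phone": "(410) 555-0199"}

changed = find_changes(previous_values, current_values)
if changed:
    # In a real scraper this is where the webmaster would be notified.
    for field, (old, new) in changed.items():
        print(f"{field} changed: {old!r} -> {new!r}")
```

Returning both the old and new value for each changed field gives the webmaster enough context to judge whether the change is a genuine update or a scraping error.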
We then clean the data by removing any extra spaces and letters from the collected values (red arrows), and the cleaned values are written to the corresponding columns of the spreadsheet (green arrows). The updated contact information is then shown on the Return Home website.
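A cleaning step along these lines might be sketched as follows. The helper names `clean_address` and `clean_phone` are hypothetical, and the exact rules are assumptions: this version collapses extra whitespace everywhere and strips letters (plus a trailing label colon) only from phone numbers, since addresses need their letters.

```python
import re


def clean_address(raw: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(raw.split())


def clean_phone(raw: str) -> str:
    """Drop letters and colons (e.g. a 'Phone:' label), then tidy whitespace."""
    without_letters = re.sub(r"[A-Za-z:]", "", raw)
    return " ".join(without_letters.split())


# Made-up scraped values with the kind of noise a page might contain.
cleaned_address = clean_address("123  Example St,\n Baltimore, MD  21201 ")
cleaned_phone = clean_phone("  Phone:  (410) 555-0100 ")
```

Note that stripping letters would also remove an extension marker such as "ext", so a real scraper may need rules tailored to the formats it actually encounters.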