Processing Government Data: ZIP Codes, Python, and OpenRefine

Donnelly, Frank
July 2014
Code4Lib Journal;7/21/2014, Issue 25, p10
Academic Journal
While there is a vast amount of useful US government data on the web, some of it is in a raw state that is not readily accessible to the average user. Data librarians can improve accessibility and usability for their patrons by processing data to create subsets of local interest and by appending geographic identifiers to help users select and aggregate data. This case study illustrates how census geography crosswalks, Python, and OpenRefine were used to create spreadsheets of non-profit organizations in New York City from the IRS Tax-Exempt Organization Masterfile. It demonstrates the utility of Python for data librarians and should be particularly useful to those who work with address-based data.
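
The general approach the abstract describes (subsetting records by ZIP code and appending a geographic identifier from a crosswalk) might look roughly like the sketch below. This is not the author's script: the column names `NAME` and `ZIP`, the sample rows, the three-entry crosswalk, and the helper functions are all hypothetical, chosen only to illustrate the idea with the Python standard library.

```python
import csv
import io

# Hypothetical crosswalk: 5-digit ZIP code -> county FIPS code.
# Illustrative subset only; a real crosswalk would come from a census geography file.
ZIP_TO_COUNTY = {
    "10001": "36061",  # New York County (Manhattan)
    "11201": "36047",  # Kings County (Brooklyn)
    "10451": "36005",  # Bronx County
}


def normalize_zip(raw):
    """Reduce a ZIP or ZIP+4 value to five characters, restoring lost leading zeros."""
    zip5 = str(raw).strip().split("-")[0]  # drop a hyphenated +4 suffix if present
    return zip5.zfill(5)[:5]               # pad short values, trim unhyphenated ZIP+4


def subset_with_county(rows, crosswalk, zip_field="ZIP"):
    """Yield only rows whose ZIP is in the crosswalk, with county FIPS appended."""
    for row in rows:
        zip5 = normalize_zip(row.get(zip_field, ""))
        county = crosswalk.get(zip5)
        if county is not None:
            yield {**row, "ZIP5": zip5, "COUNTY_FIPS": county}


# Made-up sample records standing in for a much larger source file.
SAMPLE = """NAME,ZIP
Example Arts Council,10001-1234
Example Food Pantry,11201
Example Out-of-Area Org,90210-0001
"""

rows = csv.DictReader(io.StringIO(SAMPLE))
local = list(subset_with_county(rows, ZIP_TO_COUNTY))

for record in local:
    print(record["NAME"], record["ZIP5"], record["COUNTY_FIPS"])
```

Treating ZIP codes as strings rather than numbers is the main design choice here: it preserves leading zeros and lets ZIP+4 values be matched against a five-digit crosswalk.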


Related Articles

  • Python in ChIP-Seq data analysis. Li Zhang; Yuansen Hu; Jinshui Wang; Guangle Zhang // Journal of Chemical & Pharmaceutical Research;2014, Vol. 6 Issue 3, p1002 

    Python is an interpreted programming language that is simple, clear and powerful. To many scientists in life sciences, Python has become their favorite choice to perform routine work, such as text processing, image plotting, basic statistics, GUI programming and even prototype development. In...

  • Use of Python in data manipulation and interfacing spreadsheets (Excel). Boon Kwee Chan // Python Papers Monograph;2010, Vol. 2, p1 

    The article focuses on the importance of using Python in data manipulation and interfacing spreadsheets (Excel). It states that Python provides a more analytical view of the data using the simplest and fastest approach. It mentions that it has a variety of formats to output the final information...

  • PyXNAT: XNAT in Python. Schwartz, Yannick; Barbot, Alexis; Thyreau, Benjamin; Frouin, Vincent; Varoquaux, Gaël; Siram, Aditya; Marcus, Daniel S.; Poline, Jean-Baptiste // Frontiers in Neuroinformatics;May2012, p1 

    As neuroimaging databases grow in size and complexity, the time researchers spend investigating and managing the data increases at the expense of data analysis. As a result, investigators rely more and more heavily on scripting using high-level languages to automate data management and...

  • Stereo pairs in Astrophysics. Vogt, Frédéric; Wagner, Alexander // Astrophysics & Space Science;Jan2012, Vol. 337 Issue 1, p79 

    Stereoscopic visualization is seldom used in Astrophysical publications and presentations compared to other scientific fields, e.g., Biochemistry, where it has been recognized as a valuable tool for decades. We put forth the view that stereo pairs can be a useful tool for the Astrophysics...

  • Bioinformatic pipelines in Python with Leaf. Francesco Napolitano; Renato Mariani-Costantini; Roberto Tagliaferri // BMC Bioinformatics;2013, Vol. 14 Issue 1, p1 

    Background: An incremental, loosely planned development approach is often used in bioinformatic studies when dealing with custom data analysis in a rapidly changing environment. Unfortunately, the lack of a rigorous software structuring can undermine the maintainability, communicability and...

  • Spyke Viewer: a flexible and extensible platform for electrophysiological data analysis. Pröpper, Robert; Obermayer, Klaus // Frontiers in Neuroinformatics;Nov2013, Vol. 7, p1 

    Spyke Viewer is an open source application designed to help researchers analyze data from electrophysiological recordings or neural simulations. It provides a graphical data browser and supports finding and selecting relevant subsets of the data. Users can interact with the selected data using...

  • Pyvolve: A Flexible Python Module for Simulating Sequences along Phylogenies. Spielman, Stephanie J.; Wilke, Claus O. // PLoS ONE;9/23/2015, Vol. 10 Issue 9, p1 

    We introduce Pyvolve, a flexible Python module for simulating genetic data along a phylogeny using continuous-time Markov models of sequence evolution. Easily incorporated into Python bioinformatics pipelines, Pyvolve can simulate sequences according to most standard models of nucleotide,...

  • WHY PYTHON IS THE NEXT WAVE IN EARTH SCIENCES COMPUTING. WEI-BING LIN, JOHNNY // Bulletin of the American Meteorological Society;Dec2012, Vol. 93 Issue 12, p1823 

    The article discusses Python as the future language in Earth sciences computing. Python is a modern, object-oriented, open-source language used in software engineering and, as of 2012, considered an essential tool in all kinds of atmospheric sciences work ranging from data analysis to...

  • Seleção de Atributos de Dados Inconsistentes em ambiente HDF5+Python na cloud INCD. Apolónia, João; Cavique, Luís // Revista de Ciências da Computação;2019, Vol. 14, p85 

    The treatment of large datasets is an issue that is often addressed today and whose task is not simple, given the computational limitations that still exist. One possible approach is to perform a feature selection that allows a considerable reduction of data size without increasing...

