{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading data into Python\n", "**by [Richard W. Evans](https://sites.google.com/site/rickecon/), May 2019**\n", "\n", "This notebook focuses on saving and loading data using Python's `pickle` library, `NumPy`, and `pandas`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Saving and loading using pickle\n", "\n", "From the documentation for the [`pickle`](https://docs.python.org/3/library/pickle.html) library:\n", "> “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” [1] or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.\n", "\n", "The `pickle` library provides methods for serializing Python objects to a compact binary file and for deserializing (\"unpickling\") them back into their native Python form. Below is a list of differences between the `.pkl` (pickle) serialization format and the popular `.json` (JSON) format.\n", "* JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format.\n", "* JSON is human-readable, while pickle is not.\n", "* JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific.\n", "* JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs).\n", "\n", "This last point is one of the main reasons why Python `.pkl` files are preferred in many cases. 
However, `.pkl` files are not as universally accessible as `.json` files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you have a Python dictionary (`dict1`) in which you have stored a number of objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict1 = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n", " 'year': [2000, 2001, 2002, 2001, 2002],\n", " 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}\n", "dict1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can \"pickle\" this dictionary for storage on your hard disk by importing the `pickle` library and using the `pickle.dump()` function. You will notice that the second argument in the function is an [`open()`](https://docs.python.org/3/library/functions.html#open) built-in function. The `open()` function takes as its first argument the file name or complete path where the file will be stored. The second argument is a mode string made up of single-character flags. In this case, `wb` signifies `w` = \"open for writing, truncating the file first\" and `b` = \"binary mode\". " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "pickle.dump(dict1, open('DataFiles/dict1.pkl', 'wb'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now load that `.pkl` file using the `pickle.load()` function. In this case, the argument of the `pickle.load()` function is the `open()` built-in function. The `'rb'` specification in the second argument signifies `r` = \"open for reading (default)\" and `b` = \"binary mode\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict2 = pickle.load(open('DataFiles/dict1.pkl', 'rb'))\n", "dict2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Saving and loading in NumPy\n", "\n", "The primary \"save\" and \"read\" commands in the NumPy environment are `numpy.save()` and `numpy.load()`, respectively. The [NumPy I/O documentation page](https://docs.scipy.org/doc/numpy/reference/routines.io.html) has many other options available." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we first tried to save the dictionary `dict1` from above as a `.npy` NumPy format binary file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "np.save('dict1.npy', dict1)\n", "print(type(dict1))\n", "dict1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could then try to read in that file as a new object. Note that newer versions of NumPy require `allow_pickle=True` in order to load object arrays such as this saved dictionary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict3 = np.load('dict1.npy', allow_pickle=True)\n", "print(type(dict3))\n", "print(dict3.shape)\n", "dict3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the dictionary comes back as a 0-dimensional object array; you would call `dict3.item()` to recover the original `dict`. The NumPy I/O commands are best suited for N-dimensional arrays of a single type, most commonly strings, integers, or floats." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Saving and loading in pandas\n", "\n", "The [pandas I/O tools page](http://pandas.pydata.org/pandas-docs/stable/io.html#io-stata-reader) has a comprehensive list of input and output options with pandas. We cover a few of the options here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Reading in data in pandas\n", "The `pandas` library has a number of nice functions for reading data in various formats into a pandas `DataFrame` object. Below is a table of some of the commonly used `pandas` read-in functions (taken partially from Table 6.1 of McKinney, 2013).\n", "```\n", "| Function | Description |\n", "| -------------- | --------------------------------------------------------------- |\n", "| read_csv | Load delimited data from a file, URL, or file-like object. 
|\n", "| | Use comma as default delimiter. |\n", "| read_table | Load delimited data from a file, URL, or file-like object. |\n", "| | Use tab ('\\t') as default delimiter. |\n", "| read_fwf | Read data in fixed-width column format (that is, no delimiters) |\n", "| read_clipboard | Version of `read_table` that reads data from the clipboard. |\n", "| | Useful for converting tables from webpages. |\n", "| read_stata | Load .dta format Stata data file as a DataFrame |\n", "| read_excel | Load .xls or .xlsx Excel data file as a DataFrame |\n", "| read_sas | Load .sas SAS data file as a DataFrame |\n", "| read_json | Load .json data file as a DataFrame |\n", "| read_pickle | Load .pkl Python pickle data object file as a DataFrame |\n", "```\n", "Type inference is one of the more important features of these functions; that means you don't have to specify which columns are numeric, integer, boolean, or string. Handling dates and other custom types requires a bit more effort, though." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first example, we read in a comma-separated data file (`ex1.csv`) that has the following structure.\n", "```\n", "a,b,c,d,message\n", "1,2,3,4,hello\n", "5,6,7,8,world\n", "9,10,11,12,foo\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('DataFiles/ex1.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could also have called `read_csv` and specified the delimiter explicitly with the `sep` argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/ex1.csv', sep=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A file will not always have a header row. 
Consider the comma-separated file, `ex2.csv`, that has the following structure.\n", "```\n", "1,2,3,4,hello\n", "5,6,7,8,world\n", "9,10,11,12,foo\n", "```\n", "You can let pandas assign default (integer) column names by passing `header=None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/ex2.csv', header=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can specify the column names by supplying a list using the `names` option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you wanted the `message` column to be the index of the returned DataFrame. You can refer to the column either by its position (4) or by its name (`message`) in the `index_col` argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names = ['a', 'b', 'c', 'd', 'message']\n", "pd.read_csv('DataFiles/ex2.csv', names=names, index_col='message')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may want a hierarchical index (multiple levels of indices) from multiple columns. The comma-separated data file, `csv_mindex.csv` has the following structure.\n", "```\n", "key1,key2,value1,value2\n", "one,a,1,2\n", "one,b,3,4\n", "one,c,5,6\n", "one,d,7,8\n", "two,a,9,10\n", "two,b,11,12\n", "two,c,13,14\n", "two,d,15,16\n", "```\n", "You can import the file as a DataFrame with a hierarchical index by passing a list of column numbers or names into the `index_col` option."
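, "\n", "Once the file is read in this way, the hierarchical index makes group selection straightforward; as one possible use, `.loc` can pull out all of the rows for a single value of `key1`:\n", "```python\n", "mult_ind = pd.read_csv('DataFiles/csv_mindex.csv', index_col=['key1', 'key2'])\n", "mult_ind.loc['one']  # all rows where key1 == 'one', indexed by key2\n", "```"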
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "one_ind = pd.read_csv('DataFiles/csv_mindex.csv')\n", "one_ind" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mult_ind = pd.read_csv('DataFiles/csv_mindex.csv', index_col=['key1', 'key2'])\n", "mult_ind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [pandas documentation for `read_csv()`](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table) has a comprehensive list of parser function arguments available. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another powerful option is the `skiprows` argument. You will often want to skip a number of rows at the beginning of a file as well as, potentially, rows in the interior of the data file. The file `ex4.csv` has the following structure.\n", "```\n", "# hey!\n", "a,b,c,d,message\n", "# just wanted to make things more difficult for you\n", "# who reads CSV files with computers, anyway?\n", "1,2,3,4,hello\n", "5,6,7,8,world\n", "9,10,11,12,foo\n", "```\n", "From looking at this file, we want to skip rows with index 0, 2, and 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/ex4.csv', skiprows=[0, 2, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some *sentinel* value. By default, `pandas` uses a set of commonly occurring sentinels, such as `NA`, `-1.#IND`, and `NULL`. 
The file `ex5.csv` has the following structure.\n", "```\n", "something,a,b,c,d,message\n", "one,1,2,3,4,NA\n", "two,5,6,,8,world\n", "three,9,10,11,12,foo\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = pd.read_csv('DataFiles/ex5.csv')\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.isnull(result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.notnull(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `na_values` option can take either a list or set of strings to consider as missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = pd.read_csv('DataFiles/ex5.csv', na_values=['NULL'])\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different `NA` sentinels can be specified for each column in a `dict`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentinels = {'message': ['foo', 'NA'], 'something': ['two']}\n", "pd.read_csv('DataFiles/ex5.csv', na_values=sentinels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Saving (writing) data in pandas\n", "Any pandas DataFrame or Series has a `to_csv` method. If you wanted to save a collection of DataFrames or Series, you would collect them in a dictionary or list and use the `pickle` approach from Section 1 of this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.to_csv('DataFiles/out.csv')\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A strange thing arises when we read the data back in. We get a new column."
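, "\n", "(If you do not need to keep the index, one way to avoid this at write time is to pass `index=False` to `to_csv()`; the file name below is hypothetical.)\n", "```python\n", "result.to_csv('DataFiles/out_noindex.csv', index=False)  # omits the index column\n", "```"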
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result2 = pd.read_csv('DataFiles/out.csv')\n", "result2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is because `to_csv` wrote the DataFrame's index out as another column. This happens because the index in this DataFrame was not a named column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat DataFiles/out.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One easy way to solve this is to just specify the index column in the `read_csv()` call." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/out.csv', index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you could just specify the columns that you want using the `usecols` option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/out.csv', usecols=['something', 'a', 'b', 'c', 'd', 'message'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But a cleaner way to do this is to use the pandas `.to_pickle()` method to save and `pd.read_pickle()` to load. The pickle file carefully preserves all of the Python object information, including the index and the data type of each column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.to_pickle('DataFiles/out2.pkl')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result3 = pd.read_pickle('DataFiles/out2.pkl')\n", "result3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Dealing with large data\n", "\n", "Later in the year, we will discuss the methods that `pandas` has for reading in large datasets. Pandas has a nice interface to the HDF5 data format as well as excellent \"chunking\" functionality."
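, "\n", "As a preview, here is a minimal sketch of the \"chunking\" idea, using the `chunksize` argument of `read_csv()` (the file name here is hypothetical):\n", "```python\n", "total_rows = 0\n", "# read the file 1,000 rows at a time; each chunk is a DataFrame\n", "for chunk in pd.read_csv('DataFiles/big_file.csv', chunksize=1000):\n", "    total_rows += len(chunk)\n", "```"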
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "* McKinney, Wes, Python for Data Analysis, O'Reilly Media, Inc. (2013).\n", "* [Python labs](http://www.acme.byu.edu/?page_id=2067), Applied and Computational Mathematics Emphasis (ACME), Brigham Young University." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }