{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading data into Python\n", "**by [Richard W. Evans](https://sites.google.com/site/rickecon/), May 2019**\n", "\n", "This notebook focuses on saving and loading data using Python's `pickle` library, `NumPy`, and `pandas`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Saving and loading using pickle\n", "\n", "From the documentation for the [`pickle`](https://docs.python.org/3/library/pickle.html) library:\n", "> “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” [1] or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.\n", "\n", "The `pickle` library provides methods for serializing Python objects to a compact binary file and for deserializing (\"unpickling\") them back into their native Python form. Below is a list of differences between the `.pkl` (pickle) serialization format and the popular `.json` (JSON) format.\n", "* JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format.\n", "* JSON is human-readable, while pickle is not.\n", "* JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific.\n", "* JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs).\n", "\n", "This last point is one of the main reasons why Python `.pkl` files are preferred in many cases. 
However, `.pkl` files are not as universally accessible as `.json` files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you have a Python dictionary (`dict1`) in which you have stored a number of objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict1 = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],\n", " 'year': [2000, 2001, 2002, 2001, 2002],\n", " 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}\n", "dict1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can \"pickle\" this dictionary for storage on your hard disk by importing the `pickle` library and using the `pickle.dump()` function. You will notice that the second argument in the function is an [`open()`](https://docs.python.org/3/library/functions.html#open) built-in function. The `open()` function takes as its first argument the file name or complete path where the file will be stored. The second argument is a mode string made up of single-character flags. In this case, `wb` signifies `w` = \"open for writing, truncating the file first\" and `b` = \"binary mode\". " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "pickle.dump(dict1, open('DataFiles/dict1.pkl', 'wb'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now load that `.pkl` file using the `pickle.load()` function. In this case, the argument of the `pickle.load()` function is the `open()` built-in function. The `'rb'` specification in the second argument signifies `r` = \"open for reading (default)\" and `b` = \"binary mode\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict2 = pickle.load(open('DataFiles/dict1.pkl', 'rb'))\n", "dict2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Saving and loading in NumPy\n", "\n", "The primary \"save\" and \"read\" commands in the NumPy environment are `numpy.save()` and `numpy.load()`, respectively. The [NumPy I/O documentation page](https://docs.scipy.org/doc/numpy/reference/routines.io.html) has many other options available." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we first tried to save the dictionary `dict1` from above as a `.npy` NumPy format binary file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "np.save('dict1.npy', dict1)\n", "print(type(dict1))\n", "dict1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could then try to read in that file as a new object. Note that newer versions of NumPy require `allow_pickle=True` in order to load object arrays such as this saved dictionary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dict3 = np.load('dict1.npy', allow_pickle=True)\n", "print(type(dict3))\n", "print(dict3.shape)\n", "dict3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the dictionary comes back as a 0-dimensional object array; you would call `dict3.item()` to recover the original `dict`. The NumPy I/O commands are best suited for N-dimensional arrays of a single type, most commonly strings, integers, or floats." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Saving and loading in pandas\n", "\n", "The [pandas I/O tools page](http://pandas.pydata.org/pandas-docs/stable/io.html#io-stata-reader) has a comprehensive list of input and output options with pandas. We cover a few of the options here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Reading in data in pandas\n", "The `pandas` library has a number of nice functions for reading data in various formats into a pandas `DataFrame` object. Below is a table of some of the commonly used `pandas` read-in functions (taken partially from Table 6.1 of McKinney, 2013).\n", "```\n", "| Function | Description |\n", "| -------------- | --------------------------------------------------------------- |\n", "| read_csv | Load delimited data from a file, URL, or file-like object. 
|\n", "| | Use comma as default delimiter. |\n", "| read_table | Load delimited data from a file, URL, or file-like object. |\n", "| | Use tab ('\\t') as default delimiter. |\n", "| read_fwf | Read data in fixed-width column format (that is, no delimiters) |\n", "| read_clipboard | Version of `read_table` that reads data from the clipboard. |\n", "| | Useful for converting tables from webpages. |\n", "| read_stata | Load .dta format Stata data file as a DataFrame |\n", "| read_excel | Load .xls or .xlsx Excel data file as a DataFrame |\n", "| read_sas | Load .sas SAS data file as a DataFrame |\n", "| read_json | Load .json data file as a DataFrame |\n", "| read_pickle | Load .pkl Python pickle data object file as a DataFrame |\n", "```\n", "Type inference is one of the more important features of these functions; that means you don't have to specify which columns are numeric, integer, boolean, or string. Handling dates and other custom types requires a bit more effort, though." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first example, we read in a comma-separated data file (`ex1.csv`) that has the following structure.\n", "```\n", "a,b,c,d,message\n", "1,2,3,4,hello\n", "5,6,7,8,world\n", "9,10,11,12,foo\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('DataFiles/ex1.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could also have called `read_csv` and specified the delimiter explicitly with the `sep` argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/ex1.csv', sep=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A file will not always have a header row. 
Consider the comma-separated file, `ex2.csv`, that has the following structure.\n", "```\n", "1,2,3,4,hello\n", "5,6,7,8,world\n", "9,10,11,12,foo\n", "```\n", "You can let pandas assign default (integer) column names by passing `header=None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/ex2.csv', header=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you can specify the column names by supplying a list using the `names` option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you wanted the `message` column to be the index of the returned DataFrame. You can refer to the column either by its position (4) or by its name (`message`) in the `index_col` argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "names = ['a', 'b', 'c', 'd', 'message']\n", "pd.read_csv('DataFiles/ex2.csv', names=names, index_col='message')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may want a hierarchical index (multiple levels of indices) from multiple columns. The comma-separated data file, `csv_mindex.csv` has the following structure.\n", "```\n", "key1,key2,value1,value2\n", "one,a,1,2\n", "one,b,3,4\n", "one,c,5,6\n", "one,d,7,8\n", "two,a,9,10\n", "two,b,11,12\n", "two,c,13,14\n", "two,d,15,16\n", "```\n", "You can import the file as a DataFrame with a hierarchical index by passing a list of column numbers or names into the `index_col` option."
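, "\n", "Once the file is read in this way, the hierarchical index makes group selection straightforward; as one possible use, `.loc` can pull out all of the rows for a single value of `key1`:\n", "```python\n", "mult_ind = pd.read_csv('DataFiles/csv_mindex.csv', index_col=['key1', 'key2'])\n", "mult_ind.loc['one']  # all rows where key1 == 'one', indexed by key2\n", "```"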
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "one_ind = pd.read_csv('DataFiles/csv_mindex.csv')\n", "one_ind" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mult_ind = pd.read_csv('DataFiles/csv_mindex.csv', index_col=['key1', 'key2'])\n", "mult_ind" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [pandas documentation for `read_csv()`](http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table) has a comprehensive list of parser function arguments available. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another powerful option is the `skiprows` argument. You will often want to skip a number of rows at the beginning of a file as well as, potentially, rows in the interior of the data file. The file `ex4.csv` has the following structure.\n", "```\n", "# hey!\n", "a,b,c,d,message\n", "# just wanted to make things more difficult for you\n", "# who reads CSV files with computers, anyway?\n", "1,2,3,4,hello\n", "5,6,7,8,world\n", "9,10,11,12,foo\n", "```\n", "From looking at this file, we want to skip rows with index 0, 2, and 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/ex4.csv', skiprows=[0, 2, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some *sentinel* value. By default, `pandas` uses a set of commonly occurring sentinels, such as `NA`, `-1.#IND`, and `NULL`. 
The file `ex5.csv` has the following structure.\n", "```\n", "something,a,b,c,d,message\n", "one,1,2,3,4,NA\n", "two,5,6,,8,world\n", "three,9,10,11,12,foo\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = pd.read_csv('DataFiles/ex5.csv')\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.isnull(result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.notnull(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `na_values` option can take either a list or set of strings to consider as missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = pd.read_csv('DataFiles/ex5.csv', na_values=['NULL'])\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Different `NA` sentinels can be specified for each column in a `dict`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentinels = {'message': ['foo', 'NA'], 'something': ['two']}\n", "pd.read_csv('DataFiles/ex5.csv', na_values=sentinels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Saving (writing) data in pandas\n", "Any pandas DataFrame or Series has a `to_csv` method. If you wanted to save a collection of DataFrames or Series, you would collect them in a dictionary or list and use the `pickle` approach from Section 1 of this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.to_csv('DataFiles/out.csv')\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A strange thing arises when we read the data back in. We get a new column."
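, "\n", "(If you do not need to keep the index, one way to avoid this at write time is to pass `index=False` to `to_csv()`; the file name below is hypothetical.)\n", "```python\n", "result.to_csv('DataFiles/out_noindex.csv', index=False)  # omits the index column\n", "```"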
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result2 = pd.read_csv('DataFiles/out.csv')\n", "result2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is because `to_csv` wrote the DataFrame's index out as another column. This happens because the index in this DataFrame was not a named column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat DataFiles/out.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One easy way to solve this is to just specify the index column in the `read_csv()` call." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/out.csv', index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or you could just specify the columns that you want using the `usecols` option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv('DataFiles/out.csv', usecols=['something', 'a', 'b', 'c', 'd', 'message'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But a cleaner way to do this is to use the pandas `.to_pickle()` method to save and `pd.read_pickle()` to load. The pickle file carefully preserves all of the Python object information, including the index and the data type of each column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.to_pickle('DataFiles/out2.pkl')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result3 = pd.read_pickle('DataFiles/out2.pkl')\n", "result3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Dealing with large data\n", "\n", "Later in the year, we will discuss the methods that `pandas` has for reading in large datasets. Pandas has a nice interface to the HDF5 data format as well as excellent \"chunking\" functionality."
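, "\n", "As a preview, here is a minimal sketch of the \"chunking\" idea, using the `chunksize` argument of `read_csv()` (the file name here is hypothetical):\n", "```python\n", "total_rows = 0\n", "# read the file 1,000 rows at a time; each chunk is a DataFrame\n", "for chunk in pd.read_csv('DataFiles/big_file.csv', chunksize=1000):\n", "    total_rows += len(chunk)\n", "```"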
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "* McKinney, Wes, Python for Data Analysis, O'Reilly Media, Inc. (2013).\n", "* [Python labs](http://www.acme.byu.edu/?page_id=2067), Applied and Computational Mathematics Emphasis (ACME), Brigham Young University." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }