

Next Data Science Steps and What I've Been Up To Recently

I haven't blogged in about two weeks. Here's what I've been up to with data science and programming recently.

  • I had a small but interesting consulting project with a client I've previously worked with: a proof of concept to read and write records via the APIs of a proprietary CRM system. It was my first opportunity to work with OAuth (OAuth2, actually). I was pleasantly surprised that the requests library (which I've used before and liked) has OAuth2 support via the requests-oauthlib package, which made working with OAuth2 much easier than I expected.

  • Here are the details on requests' support for the Web Application Flow (which is the flow I needed for this project); a rough sketch of that flow appears after this list.

  • And here is an article from Digital Ocean about the different types of OAuth2 flows which was recommended to me by someone with more OAuth2 experience than I have.

This was also a chance to work directly with the API, as there isn't a client library for this CRM that I'm aware of. In contrast, when I previously worked with Salesforce's APIs I used the simple_salesforce Python package, which worked well and which I was grateful for, but it introduced a layer of abstraction that left me with a slightly more superficial understanding.

  • I had another helpful informational interview with a data scientist doing machine learning. From this, I realized that I want to test whether I'm interested in machine learning, because that will influence my learning path. He recommended Andrew Ng's Machine Learning Coursera course. I respect his recommendation because he has many years of programming experience, completed a data science master's degree at a prestigious university, and is working in the industry. Although the course material is fairly old, he thinks Andrew explains it really well. It's a 55-hour course (free!) and I've recently gotten started on it.

  • Finally, another client I've been assisting with work that's tangentially related to programming has involved me more directly in programming. I've been dipping my toes into Ruby on Rails, and it's my first time working with a production web app. It's been interesting trying to wrap my brain around all the parts of a web app (still more to learn!) and to see how things are set up at a startup where the founders are all accomplished developers.
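
To give a flavor of the OAuth2 work, here's a minimal sketch of the Web Application Flow with requests-oauthlib. The client credentials, URLs, and API endpoint below are all placeholders for illustration, not the CRM's actual values.

from requests_oauthlib import OAuth2Session

# Placeholder credentials and endpoints (not the actual CRM's values).
client_id = 'my-client-id'
client_secret = 'my-client-secret'
authorization_base_url = 'https://example-crm.com/oauth/authorize'
token_url = 'https://example-crm.com/oauth/token'
redirect_uri = 'https://localhost/callback'

# Step 1: build the authorization URL and send the user there to approve access.
oauth = OAuth2Session(client_id, redirect_uri=redirect_uri)
authorization_url, state = oauth.authorization_url(authorization_base_url)
print('Please go to', authorization_url, 'and authorize access.')

# Step 2: the provider redirects back with a code in the URL;
# exchange that redirect URL for an access token.
redirect_response = input('Paste the full redirect URL here: ')
token = oauth.fetch_token(token_url,
                          client_secret=client_secret,
                          authorization_response=redirect_response)

# Step 3: the session now sends the access token on every request.
response = oauth.get('https://example-crm.com/api/records')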

Web scraping a site with embedded JSON

Here's the code snippet that scrapes the list of companies from the page's HTML table:

import requests
from bs4 import BeautifulSoup


def scrape_list():
    """Scrape the cleantech list and put companies into a list of dicts."""
    cleantech100_url = 'https://i3connect.com/gct100/the-list'
    cleantech100_base_url = 'https://i3connect.com'
    request = requests.get(cleantech100_url)
    bs = BeautifulSoup(request.content, "html.parser")
    table = bs.table
    # The HTML table has the headers: COMPANY, GEOGRAPHY, FUNDING, SECTOR,
    # YEAR FOUNDED
    header = [
        'cleantech_url',     # from COMPANY
        'company_country',   # from GEOGRAPHY
        'company_funding',   # from FUNDING
        'company_sector',    # from SECTOR
        'company_year_founded',  # from YEAR FOUNDED
        'company_region',    # Column exists but is not displayed in the header.
        'company_video'      # Column exists but is not displayed in the header.
    ]
    companies = []
    for row in table.tbody.find_all('tr'):
        company = {}
        index = 0
        if 'id' in row.attrs and row.attrs['id'] == 'gct-table-no-results':
            # Last row of table should be skipped because it's just got this:
            # <tr id="gct-table-no-results">
            #               <td colspan="7">No results found.</td>
            continue
        for cell in row.find_all('td'):
            # co_key is the key to use within company dict,
            # e.g. company[co_key] could point to company['cleantech_url']
            co_key = header[index]
            if cell.string is None:
                # The first and last columns of the HTML table have no text
                # (cell.string is None).
                try:
                    # The first column of the HTML table holds a link to the
                    # company detail page. This is handled by the try statement,
                    # as there is an `href` attribute within the <a> tag.
                    company[co_key] = cleantech100_base_url + cell.a.get('href')
                except AttributeError:
                    # The last column of the HTML table holds iframe links to
                    # videos. There is no <a> tag within the cell, only a
                    # <span> element with a 'data-video-iframe' attribute.
                    # This is handled by the except clause.
                    video_url = cell.span.get('data-video-iframe')
                    if len(video_url) > 10:
                        company[co_key] = video_url
            else:
                company[co_key] = cell.string
            index += 1
        companies.append(company)
    return companies

pandas string methods

This blog post is based on lesson 12 ("How do I use string methods in pandas?") from Data School's pandas video series.

pandas has many string methods available on a Series via .str.some_method()

For example:

  • df.series_name.str.upper() changes all the strings in the Series called series_name (in the DataFrame called df) to uppercase.
  • df.series_name.str.title() changes the strings to title case (the first character of each word is capitalized).
  • String methods on a Series return a Series. In the case of

df.series_name.str.contains('bar')

the .contains() method returns a Series of True and False values: True if the string in series_name contains bar, and False if it does not.

  • You can use that True/False Series to filter a DataFrame. For example:

df[df.series_name.str.contains('bar')]

will return a new DataFrame filtered to only those rows in which the series_name Series (aka the column called series_name) contains the string bar.

You can see all of the str methods available in the pandas API reference.

String methods can be chained together

For example:

df.series_name.str.replace('[', '').str.replace(']', '')

will operate on the Series called series_name in the DataFrame called df. The first .replace() method will replace [ with nothing and the second .replace() method will replace ] with nothing, allowing you to remove the brackets from the strings in the Series.

Many pandas string methods accept regular expressions

The two chained .replace() methods in the previous example can be replaced with a single regex-based .replace(), like this:

df.series_name.str.replace(r'[\[\]]', '', regex=True)

Here, the .replace() method takes the regex pattern

r'[\[\]]'

and replaces every match with nothing. (The raw string prefix r avoids escaping issues with the backslashes, and regex=True explicitly tells pandas to treat the pattern as a regular expression.) That regular expression can be deconstructed as follows:

  • the outer brackets [ and ] define a character class, meaning that any of the characters within those character class brackets will be replaced
  • inside the outer brackets is \[\]. It represents the two characters [ and ] which will be replaced. However, since brackets have a special meaning in regular expressions, they need to be escaped with backslashes \. So the bracket characters to be replaced end up looking like this:
\[\]

You can see working code for all of the above examples in my Jupyter notebook.

pandas vs. SQL

As I'm starting to learn pandas, I've noticed that it seems capable of doing everything that SQL can do, and more. So I was curious whether I would need SQL in a data science career, or whether pandas could suffice.

A quick search turned up this article about PostgreSQL vs. pandas. My takeaway is that, not surprisingly, the situation is more complex than one tool simply being better than the other: SQL is best suited to certain tasks, and pandas to others.

In particular, SQL is faster for typical database tasks, such as joins, whereas pandas is better for complex operations like string manipulation or statistics.

While, in theory, you could do practically everything with either tool, it wouldn't be the most performant solution (pandas and SQL are each faster at certain tasks, under certain situations), nor the most desirable solution (e.g. SQL is a lingua franca and compatible with lots of programming languages, so pandas could be limiting in that regard).
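
To make that division of labor concrete, here's a minimal sketch using Python's built-in sqlite3 module and hypothetical orders and customers tables: the join and the filter are pushed to SQL so only the needed rows reach the client, and the string manipulation and statistics are done in pandas.

import sqlite3

import pandas as pd

conn = sqlite3.connect('example.db')  # hypothetical database file

# Let the database do what it's good at: joining and filtering,
# so only the rows we need are sent to the client.
query = """
    SELECT c.name, o.item, o.amount
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.amount >= 100
"""
df = pd.read_sql(query, conn)

# Let pandas do what it's good at: string manipulation and statistics.
df['name'] = df['name'].str.title()
print(df.groupby('name')['amount'].describe())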

The article, titled "PostgreSQL vs. pandas — how to balance tasks between server and client side", offers these ten rules of thumb for evaluating whether to use SQL or pandas for an analytics task:

  1. If doing a task in SQL can cut the amount of data returned to the client (e.g. filtering to a smaller subset of data), then the task belongs on the server.
  2. If the amount of data returned to the client stays the same or grows when done in SQL (e.g. adding complex calculated columns, cross-joins, etc.), the task belongs in client-side code.
  3. Test different setups on the server and client side to see which is more efficient. In the worst case, you’ll learn something.
  4. Never do in code what the SQL server can do well for you: Data extraction (CRUD, joins and set operations) & simple data analysis.
  5. If it's painful or ugly, do it in client-side code: complex data analysis belongs in code. Leave formatting and math to the client side. The database exists mainly to facilitate fast extraction of data.
  6. Minimise SQL complexity: Split overly complex, non-performant queries. Two simpler queries will save the headache of maintaining one mega-query. Alternatively, split it into a simple query and handle the complexity in client-side code.
  7. Minimise database round trips: Try to do as much as you can in one operation. Every semicolon is one round trip and adds another I/O operation.
  8. Configure your database carefully (e.g. for Postgres). Otherwise, you default to sub-optimal algorithms, which is expensive.
  9. It’s well worth investing time in database schema optimisation.
  10. Same goes for setting optimal foreign/sort/distribution keys and properly normalised tables to maintain the integrity of data.

I think the full article is a worthwhile read.

pandas Import and Iteration Tips

This post is based on my takeaways from a Q & A video from Data School.

Import tips

Reading only certain columns during import with usecols argument

The usecols argument can take a list of column names (strs) or column positions (ints). For example:

# column names
foo = pd.read_csv(data_file,
                  usecols=['col_name1', 'col_name2'])
# column positions
foo = pd.read_csv(data_file,
                  usecols=[0, 3])

Reading a sample of rows during import with nrows argument

The nrows argument reads the first n rows of a file. For example, to only import the first three rows from a file:

foo = pd.read_csv(data_file,
                  nrows=3)

Importing columns based on data type using .select_dtypes(), include, and numpy

The select_dtypes() method on a DataFrame allows you to select which datatypes you want to keep in a DataFrame. The include argument on the select_dtypes() method indicates which columns you want to include. In this example, using numpy allows you to specify np.number which includes number types int64, float64, etc.:

import numpy as np
foo = pd.read_csv(data_file)
foo = foo.select_dtypes(include=[np.number])

Iteration tips

Iteration through a Series

This is incredibly easy, and just like iterating over any iterable in Python. For example:

for city in foo.City:  # foo is a pandas DataFrame
    print(city)

Iteration through a DataFrame with .iterrows()

Using the DataFrame .iterrows() method is similar to using enumerate in Python:

for index, row in foo.iterrows():  # foo is a pandas DataFrame
    print(index, row.City, row.State)

pandas Tips

Dot notation vs. bracket notation to select a Series

Each pandas Series is essentially a column within a pandas DataFrame. Whenever a column is added to a DataFrame, the column name is also added as an attribute on the DataFrame object. A Series can be accessed through either dot notation or bracket notation:

# See this code snippet in context at: https://youtu.be/zxqjeyKP2Tk
import pandas as pd
ufo = pd.read_table('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv',
                   sep=',')
ufo_city = ufo['City']  # bracket notation
ufo_city = ufo.City  # dot notation

You must use bracket notation whenever there is:

  • a space in the name of a Series
  • a Series name that conflicts with a pandas method
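
For example, the ufo DataFrame loaded above has a column with a space in its name ('Colors Reported'), so it can only be selected with bracket notation:

ufo['Colors Reported']   # works
# ufo.Colors Reported    # SyntaxError: the space breaks dot notation

# Likewise, a column named 'shape' would have to be selected as ufo['shape'],
# because ufo.shape already refers to the DataFrame's (rows, columns) attribute.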

You must also use bracket notation when you want to make a new Series:

# See this code snippet in context at: https://youtu.be/zxqjeyKP2Tk
import pandas as pd
ufo = pd.read_table('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv',
                   sep=',')
ufo['City_State'] = ufo['City'] + ', ' + ufo['State']
ufo.head()

Keyboard tricks in Jupyter notebooks

  • You can access all of an object's methods and attributes by typing <name_of_the_pandas_object>. + TAB-key. (That's the name of the pandas object, followed by a period and the tab key.)
  • To see the optional or required parameters for a function or a method, when inside of the parentheses, hit SHIFT + TAB one, two, three, or four times.

A few helpful methods and attributes

  • .describe(include='all') is a method that returns summary statistics about a DataFrame.
  • .shape is an attribute that returns a tuple with the number of rows and columns.
  • .dtypes is an attribute that returns the data type of each Series in the DataFrame.

Note that, just as in plain Python, methods must be followed by () and attributes must not.
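
For example, with the ufo DataFrame loaded earlier in this post:

ufo.describe(include='all')   # method: summary statistics for every column
ufo.shape                     # attribute: (number_of_rows, number_of_columns)
ufo.dtypes                    # attribute: the data type of each Series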

Renaming and Removing Columns in pandas DataFrames

Renaming columns

Four methods for renaming columns in a pandas DataFrame:

1. Rename specific columns

foo.rename(columns={}, inplace=True)

Pass a dict of columns to be renamed. For example:

foo.rename(columns={'old_col1': 'new_col1',
                    'old_col2': 'new_col2'},
           inplace=True)

2. Rename all the column names

Set .columns to a list of all of the new column names. For example:

foo.columns = ['new_col1', 'new_col2']

3. Rename the columns when reading in a file

For example:

foo_new_names = ['new_col1', 'new_col2']
foo = pd.read_csv('data_file.csv',
                  names=foo_new_names,
                  header=0)

Note that in addition to setting the names parameter to a list of the new column names, you must also set header=0 to indicate that you're replacing the existing column names in the 0th row (if the 0th row is a header row).

4. Replace existing spaces in column names with underscores:

foo.columns = foo.columns.str.replace(' ', '_')

Removing columns (and rows)

Here is the general syntax:

foo.drop(col_str_or_list_of_strs,
         axis=1,  # 0 axis is rows, 1 axis is cols
         inplace=True)

to drop a single column:

foo.drop('col_name',
         axis=1,
         inplace=True)

to drop multiple columns:

foo.drop(['col1', 'col2'],
         axis=1,
         inplace=True)

to drop a single row (data row 0):

foo.drop(0,  # int identifying the data row
         axis=0,
         inplace=True)

to drop multiple rows (data rows 1 and 2):

foo.drop([1, 2],  # list of ints identifying the data rows
         axis=0,
         inplace=True)

Sorting and Filtering in pandas

Sorting

Sorting a Series using dot notation

This is as simple as:

foo.col_name.sort_values()

To sort in descending order:

foo.col_name.sort_values(ascending=False)

Note that the .sort_values() method does not change the order of the underlying data. It is a method on a Series that returns a new, sorted Series.

Sorting a DataFrame by a single Series

This is even simpler than the first example!

foo.sort_values('col_name')

Sorting a DataFrame by multiple Series

Simply pass a list of the column names you want to sort by:

foo.sort_values(['col_name1', 'col_name2'])

Filtering

Filtering rows based on a single column criterion

Filtering by a single column criterion is easy in pandas:

foo[foo.col_name >= 42]

To understand how this works, please see my Jupyter notebook about filtering, which is based on Data School's pandas video about filtering.

Filtering rows based on multiple column criteria

Each criterion must be enclosed in parentheses, and criteria are combined with either & (AND) or | (OR). For example:

foo[(foo.col_name1 >= 42) & (foo.col_name2 == 'Bar')]

To use multiple OR criteria for the same column, you can use the .isin() method. For example:

foo[(foo.col_name1 >= 42) & (foo.col_name2.isin(['Bar', 'Baz', 'Boz']))]