Next Data Science Steps and What I've Been Up To Recently

I haven't blogged in about two weeks. Here's what I've been up to with data science and programming recently.

  • I had a small but interesting consulting project with a client I've previously worked with. It involved a proof of concept to read and write records via the APIs of a proprietary CRM system. It was an opportunity for me to work with OAuth (OAuth2, actually) for the first time. I was pleasantly surprised that the requests library (which I've used before and liked) supports OAuth2 via the requests-oauthlib package, which made working with OAuth2 much easier than I expected.

  • Here are the details on requests' support for the Web Application Flow (which is the flow I needed to use for this project).

  • And here is an article from Digital Ocean about the different types of OAuth2 flows which was recommended to me by someone with more OAuth2 experience than I have.

This was also a chance for me to work directly with the API, as there isn't a library available that I'm aware of. In contrast, when I've previously worked with Salesforce's APIs I used the simple_salesforce Python package, which worked well and which I was grateful for, but it did introduce a layer of abstraction that led to a slightly more superficial understanding.
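
For anyone curious what the Web Application Flow looks like in code, here's a minimal sketch using requests-oauthlib; the client credentials, endpoint URLs, and API path below are all placeholders rather than the actual CRM's details:

from requests_oauthlib import OAuth2Session

# All of these values are placeholders for illustration only.
client_id = 'my-client-id'
client_secret = 'my-client-secret'
redirect_uri = 'https://localhost/callback'
authorization_base_url = 'https://crm.example.com/oauth/authorize'
token_url = 'https://crm.example.com/oauth/token'

oauth = OAuth2Session(client_id, redirect_uri=redirect_uri)

# Step 1: build the URL the user visits to grant access.
authorization_url, state = oauth.authorization_url(authorization_base_url)
print('Please authorize access here:', authorization_url)

# Step 2: after authorizing, the user is redirected back with a code;
# exchange the full redirect URL for an access token.
redirect_response = input('Paste the full callback URL: ')
token = oauth.fetch_token(token_url,
                          client_secret=client_secret,
                          authorization_response=redirect_response)

# Step 3: the session now attaches the token to requests automatically.
records = oauth.get('https://crm.example.com/api/records')
print(records.json())

The nice part is that once the token is fetched, the API calls look just like ordinary requests calls.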

  • I had another helpful informational interview with a data scientist doing machine learning. From this, I realized that I want to test whether I'm interested in machine learning, because that will influence my learning path. He recommended Andrew Ng's Machine Learning Coursera course. I respect this person's recommendation because he has many years of programming experience, completed a data science master's degree at a prestigious university, and is working in the industry. Although the course material is fairly old, he thinks that Andrew explains the material really well. It's a 55-hour course (free!) and I've recently gotten started on it.

  • Finally, another client I've been assisting with work that's tangentially related to programming has more directly involved me in programming. I've been getting my toes wet with Ruby on Rails and it's my first time working with a production web app. It's been interesting trying to wrap my brain around all the parts of a web app (still more to learn!) and to see how things are set up at a startup where the founders are all accomplished developers.

Web scraping a site with embedded JSON

Here's the code snippet:

import requests
from bs4 import BeautifulSoup


def scrape_list():
    """Scrape the cleantech list and put companies into a list of dicts."""
    cleantech100_url = 'https://i3connect.com/gct100/the-list'
    cleantech100_base_url = 'https://i3connect.com'
    request = requests.get(cleantech100_url)
    bs = BeautifulSoup(request.content, "html.parser")
    table = bs.table
    # The HTML table has the headers: COMPANY, GEOGRAPHY, FUNDING, SECTOR,
    # YEAR FOUNDED
    header = [
        'cleantech_url',     # from COMPANY
        'company_country',   # from GEOGRAPHY
        'company_funding',   # from FUNDING
        'company_sector',    # from SECTOR
        'company_year_founded',  # from YEAR FOUNDED
        'company_region',    # Column exists but is not displayed in the header.
        'company_video'      # Column exists but is not displayed in the header.
    ]
    companies = []
    for row in table.tbody.find_all('tr'):
        company = {}
        index = 0
        if 'id' in row.attrs and row.attrs['id'] == 'gct-table-no-results':
            # Last row of table should be skipped because it's just got this:
            # <tr id="gct-table-no-results">
            #               <td colspan="7">No results found.</td>
            continue
        for cell in row.find_all('td'):
            # co_key is the key to use within company dict,
            # e.g. company[co_key] could point to company['cleantech_url']
            co_key = header[index]
            if cell.string is None:
                # The first and last columns of the HTML table have no text
                # (cell.string is None).
                try:
                    # The first column of the HTML table holds a link to the
                    # company detail page. This is handled by the try statement,
                    # as there is an `href` attribute within the <a> tag.
                    company[co_key] = cleantech100_base_url + cell.a.get('href')
                except AttributeError:
                    # The last column of the HTML table holds iframe links to
                    # videos. Therefore, there is no <a> tag within the cell,
                    # only a <span> element with a 'data-video-iframe' attribute.
                    # This is handled by the except statement.
                    video_url = cell.span.get('data-video-iframe')
                    if len(video_url) > 10:
                        company[co_key] = video_url
            else:
                company[co_key] = cell.string
            index += 1
        companies.append(company)
    return companies

pandas string methods

This blog post is based on lesson 12 ("How do I use string methods in pandas?") from Data School's pandas video series.

pandas has many string methods available on a Series via .str.some_method()

For example:

  • df.series_name.str.upper() changes all the strings in the Series called series_name (in the DataFrame called df) to uppercase.
  • df.series_name.str.title() changes the strings to title case (the first character of each word is capitalized).
  • String methods on a Series return a Series. In the case of

df.series_name.str.contains('bar')

the .contains() method returns a Series of Trues and Falses: True where the string in series_name contains bar, and False where it does not.

  • You can easily use the True/False Series returned by the .contains() method above to filter a DataFrame. For example:

df[df.series_name.str.contains('bar')]

will return a new DataFrame filtered to only those rows in which the series_name Series (aka the column called series_name) contains the string bar.

You can see all of the str methods available in the pandas API reference.

String methods can be chained together

For example:

df.series_name.str.replace('[', '').str.replace(']', '')

will operate on the Series called series_name in the DataFrame called df. The first .replace() method will replace [ with nothing and the second .replace() method will replace ] with nothing, allowing you to remove the brackets from the strings in the Series.

Many pandas string methods accept regular expressions

The two chained .replace() methods in the previous example can be replaced with a single regex-based .replace(), like this:

df.series_name.str.replace('[\[\]]', '')

Here, the .replace() method takes the regex pattern

'[\[\]]'

and replaces any match with nothing (an empty string). That regular expression can be deconstructed as follows:

  • the outer brackets [ and ] define a character class, meaning that any of the characters within those character class brackets will be replaced
  • inside the outer brackets is \[\], which represents the two literal characters [ and ] to be replaced. Since brackets have a special meaning in regular expressions, they need to be escaped with backslashes \, so the bracket characters to be replaced end up looking like \[\]
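
If you want to try that replacement on a toy Series, here's a small sketch (the column name is made up). One caveat: in newer pandas releases the .str.replace() default changed to treat the pattern as a literal string, so it's safest to pass regex=True explicitly:

import pandas as pd

df = pd.DataFrame({'series_name': ['[foo]', 'bar', '[baz]']})

# regex=True tells pandas to treat the pattern as a regular expression
# (in newer pandas versions the default is a plain-string replacement).
cleaned = df.series_name.str.replace(r'[\[\]]', '', regex=True)
print(cleaned.tolist())  # ['foo', 'bar', 'baz']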

You can see working code for all of the above examples in my Jupyter notebook.

pandas vs. SQL

As I'm starting to learn pandas, I noticed that it seems capable of doing everything that SQL can do, and more. So I was curious whether I would need SQL in a data science career, or if pandas could suffice.

A quick search turned up this article about PostgreSQL vs. pandas. My takeaway is that, not surprisingly, the situation is more complex than one tool simply being better than the other. Instead, SQL is better suited for certain tasks, and pandas for others.

In particular, SQL is faster for typical database tasks, such as joins, whereas pandas is better for complex operations like string manipulation or statistics.

While, in theory, you could do practically everything with either tool, it wouldn't be the most performant solution (pandas and SQL are each faster at certain tasks, under certain situations), nor the most desirable solution (e.g. SQL is a lingua franca and compatible with lots of programming languages, so pandas could be limiting in that regard).

The article, titled "PostgreSQL vs. pandas — how to balance tasks between server and client side", offers these ten rules of thumb for evaluating whether to use SQL or pandas for an analytics task:

  1. If doing a task in SQL can cut the amount of data returned to the client (e.g. filtering to a smaller subset of data), then the task belongs on the server.
  2. If the amount of data returned to the client remains unchanged or grows by doing it in SQL (e.g. adding complex calculated columns, cross-joins, etc.), the task belongs in client-side code.
  3. Test different setups on the server and client side to see which is more efficient. In the worst case, you’ll learn something.
  4. Never do in code what the SQL server can do well for you: Data extraction (CRUD, joins and set operations) & simple data analysis.
  5. If it’s painful or ugly, do it in client-side code: complex data analysis belongs in code. Leave formatting or math for the client side. The database exists mainly to facilitate fast extraction of data.
  6. Minimise SQL complexity: Split overly complex, non-performant queries. Two simpler queries will save the headache of maintaining one mega-query. Alternatively, split it into a simple query and handle the complexity in client-side code.
  7. Minimise database round trips: Try to do as much as you can in one operation. Every semicolon is one round trip and adds another I/O operation.
  8. Configure your database carefully (e.g. for Postgres). Otherwise, you'll default to sub-optimal algorithms, which is expensive.
  9. It’s well worth investing time in database schema optimisation.
  10. Same goes for setting optimal foreign/sort/distribution keys and properly normalised tables to maintain the integrity of data.
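
To make rule 1 concrete, here's a minimal sketch of pushing a filter to the database instead of filtering client-side in pandas; SQLite stands in for a real database server, and the database file, table, and column names are made up:

import sqlite3

import pandas as pd

conn = sqlite3.connect('companies.db')  # placeholder database

# Server side: only the rows and columns you need are returned to the client.
recent = pd.read_sql_query(
    'SELECT company, funding FROM companies WHERE year_founded >= 2015',
    conn)

# Client side: the whole table crosses the wire, then pandas filters it.
all_companies = pd.read_sql_query('SELECT * FROM companies', conn)
recent_too = all_companies[all_companies.year_founded >= 2015]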

I think the full article is a worthwhile read.

pandas Import and Iteration Tips

This post is based on my takeaways from a Q & A video from Data School.

Import tips

Reading only certain columns during import with usecols argument

The usecols argument can take a list of column names (strs) or column positions (ints). For example:

# column names
foo = pd.read_csv(data_file,
                  usecols=['col_name1', 'col_name2'])
# column positions
foo = pd.read_csv(data_file,
                  usecols=[0, 3])

Reading a sample of rows during import with nrows argument

The nrows argument reads the first n rows of a file. For example, to only import the first three rows from a file:

foo = pd.read_csv(data_file,
                  nrows=3)

Importing columns based on data type using .select_dtypes(), include, and numpy

The select_dtypes() method on a DataFrame allows you to select which data types you want to keep in a DataFrame. The include argument indicates which column types to include. In this example, using numpy allows you to specify np.number, which covers the numeric types int64, float64, etc.:

import numpy as np
foo = pd.read_csv(data_file)
foo = foo.select_dtypes(include=[np.number])

Iteration tips

Iteration through a Series

This is incredibly easy, and just like iterating over any iterable in Python. For example:

for city in foo.City:  # foo is a pandas DataFrame
    print(city)

Iteration through a DataFrame with .iterrows()

Using the DataFrame .iterrows() method is similar to using enumerate in Python:

for index, row in foo.iterrows():  # foo is a pandas DataFrame
    print(index, row.City, row.State)

pandas Tips

Dot notation vs. bracket notation to select a Series

Each pandas Series is essentially a column within a pandas DataFrame. Whenever a column is added to a DataFrame, the column name is also added as an attribute on the DataFrame object, pointing to that Series. A Series can be accessed through either dot notation or bracket notation:

# See this code snippet in context at: https://youtu.be/zxqjeyKP2Tk
import pandas as pd
ufo = pd.read_table('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv',
                   sep=',')
ufo_city = ufo['City']  # bracket notation
ufo_city = ufo.City  # dot notation

You must use bracket notation whenever there is:

  • a space in the name of a Series
  • a Series name that conflicts with a pandas method
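
For example, with a made-up DataFrame whose column names illustrate both cases:

import pandas as pd

df = pd.DataFrame({'Unit Price': [1.5, 2.0], 'shape': ['round', 'square']})

df['Unit Price']  # bracket notation required: the name contains a space
df['shape']       # bracket notation required: .shape is already a DataFrame attribute
df.shape          # dot notation here returns the (rows, columns) tuple, not the column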

You must also use bracket notation when you want to make a new Series:

# See this code snippet in context at: https://youtu.be/zxqjeyKP2Tk
import pandas as pd
ufo = pd.read_table('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv',
                   sep=',')
ufo['City_State'] = ufo['City'] + ', ' + ufo['State']
ufo.head()

Keyboard tricks in Jupyter notebooks

  • You can access all of an object's methods and attributes by typing <name_of_the_pandas_object>. + TAB-key. (That's the name of the pandas object, followed by a period and the tab key.)
  • To see the optional or required parameters for a function or a method, when inside of the parentheses, hit SHIFT + TAB one, two, three, or four times.

A few helpful methods and attributes

  • .describe(include='all') is a method that returns descriptive statistics about a DataFrame.
  • .shape is an attribute that returns a tuple with the number of rows and columns.
  • .dtypes is an attribute that returns the data type of each Series in the DataFrame.

Note that -- just like in Python -- methods must be followed by () and attributes must not.
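
A quick way to see all three, using the same ufo.csv dataset as the snippets above:

import pandas as pd

ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
print(ufo.describe(include='all'))  # descriptive statistics for every column
print(ufo.shape)                    # (number of rows, number of columns)
print(ufo.dtypes)                   # the data type of each Series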

Renaming and Removing Columns in pandas DataFrames

Renaming columns

Four methods for renaming columns in a pandas DataFrame:

1. Rename specific columns

foo.rename(columns={}, inplace=True)

Pass a dict of columns to be renamed. For example:

foo.rename(columns={'old_col1': 'new_col1',
                    'old_col2': 'new_col2'},
           inplace=True)

2. Rename all the column names

Set .columns to a list of all of the new column names. For example:

foo.columns = ['new_col1', 'new_col2']

3. Rename the columns when reading in a file

For example:

foo_new_names = ['new_col1', 'new_col2']
foo = pd.read_csv(data_file,
                  names=foo_new_names,
                  header=0)

Note that in addition to setting the names parameter to a list of the new column names, you must also set header=0 to indicate that you're replacing the existing column names in the 0th row (if the 0th row is a header row).

4. Replace existing spaces in column names with underscores:

foo.columns = foo.columns.str.replace(' ', '_')

Removing columns (and rows)

Here is the general syntax:

foo.drop(col_str_or_list_of_strs,
         axis=1,  # 0 axis is rows, 1 axis is cols
         inplace=True)

to drop a single column:

foo.drop('col_name',
         axis=1,
         inplace=True)

to drop multiple columns:

foo.drop(['col1', 'col2'],
         axis=1,
         inplace=True)

to drop a single row (data row 0):

foo.drop(0,  # int identifying the data row
         axis=0,
         inplace=True)

to drop multiple rows (data rows 1 and 2):

foo.drop([1, 2],  # list of ints identifying the data rows
         axis=0,
         inplace=True)

Sorting and Filtering in pandas

Sorting

Sorting a Series using dot notation

This is as simple as:

foo.col_name.sort_values()

To sort in descending order:

foo.col_name.sort_values(ascending=False)

Note that the .sort_values() method does not change the underlying data. It is a method on a Series that returns a new, sorted Series.
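
So if you want to keep the sorted result, assign it to a name (a one-line sketch using the placeholder names above):

sorted_col = foo.col_name.sort_values()  # foo itself is left untouched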

Sorting a DataFrame by a single Series

This is even simpler than the first example!

foo.sort_values('col_name')

Sorting a DataFrame by multiple Series

Simply pass a list of the column names you want to sort by:

foo.sort_values(['col_name1', 'col_name2'])

Filtering

Filtering rows based on a single column criterion

Filtering by a single criterion is easy in pandas:

foo[foo.col_name >= 42]

To understand how this works, please see my Jupyter notebook about filtering, which is based on Data School's pandas video about filtering.

Filtering rows based on multiple column criteria

Each criterion must be enclosed in parentheses, and the criteria chained together with either & (AND) or | (OR). For example:

foo[(foo.col_name1 >= 42) & (foo.col_name2 == 'Bar')]

To use multiple OR criteria for the same column, you can use the isin method. For example:

foo[(foo.col_name1 >= 42) & (foo.col_name2.isin(['Bar', 'Baz', 'Boz']))]

Thoughts on JetBrains' 2018 Data Science Survey

As I’m considering pursuing a career in data science, I found JetBrains 2018 Data Science Survey interesting because it gives me a sense (albeit an imperfect one) of which tools and technologies might be most useful to learn.

Here are my takeaways from the survey:

  • The most popular programming languages regularly used for data analysis are:
      • Python 72%
      • Java 62%
      • R 23%
  • As an aside, Kotlin runs on the Java Virtual Machine, integrates with Hadoop and Spark, and is more concise than Java. Kotlin is sponsored by JetBrains, and the survey acknowledges that this likely introduces some bias, but Kotlin may be an up-and-coming language.
  • Spark is most popular for big data, followed closely by Hadoop.
  • Jupyter notebooks and PyCharm are the most popular IDEs/editors.
  • TensorFlow is the most popular deep learning library. (TensorFlow is lower-level than scikit-learn, according to these Quora answers.)
  • Spreadsheet editors and Tableau are the most popular statistics packages for analyzing and visualizing data.
  • The most popular operating systems are:
      • Windows 62%
      • Linux 44%
      • macOS 37%
  • Computations are performed on:
      • local machines 78%
      • clusters 36%
      • cloud services 32%
  • The most popular cloud services are:
      • Amazon Web Services (AWS) 56%
      • Google Cloud Platform 41%
      • Microsoft Azure 28%
  • The correlation seems to be that the more expertise one’s manager has in data science, the more one tends to agree with this statement: "My manager gives me realistic assignments that are relevant to my skills and responsibilities, with a clear and specific description of the requirements."

It’s nice that I already have experience with Python, Jupyter, PyCharm, spreadsheet editors, Windows and Linux, and AWS.

I intend to next learn pandas.

After that, my priorities would probably be:

  • scikit-learn
  • Spark (Hadoop?)
  • TensorFlow
  • Tableau
  • Java

Getting Started with Jupyter notebook

Here are the steps I took to get started with Jupyter notebook:

Make and activate a new conda virtual environment with Python 3

conda create --name jupyter "python>=3"

This creates a new conda virtual environment named jupyter with Python 3 installed. As of December 2018, the latest version is 3.7.1, which is the version that gets installed with the above command. And if the latest version when you're reading this happens to be 3.8.2, then you'll have 3.8.2 in your virtual environment.

conda activate jupyter

This activates the new jupyter virtual environment.

Configure the environment to allow Python 3 for the Jupyter notebooks

When I first launched Jupyter, I found that I could only create Jupyter notebooks using Python 2. Python 2 is old and approaching the end of its life, so I wanted to be able to create Jupyter notebooks that use Python 3; plus, I prefer using Python 3's syntax.

Thanks to this Stack Overflow answer, I discovered that a couple more lines of configuration did the trick:

conda install notebook ipykernel
ipython kernel install --user

The first line installs the notebook and ipykernel packages into the active virtual environment (IPython is Jupyter's old name).

I don't fully understand the second line, but I believe it registers the IPython kernel so that it's available to Jupyter, and the --user flag just means that you're doing it only for your user account.

Make a directory and optionally initialize git

I think it's a good idea to have a directory for one's Jupyter notebooks so that they're in one place (unless you want to file them in different directories). It also allows you to then use git on the directory for version control.

I made a directory called jupyter, but choose whatever directory name you want.

mkdir jupyter
cd jupyter

You can also (optionally) initialize git in order to start using version control on the directory:

git init

I'm not going to cover how to use git because that's beyond the scope of this post, but there are some resources here on GitHub's site.

Start the Jupyter notebook server and make a notebook

jupyter notebook

This starts the server locally on your computer and automatically opens up a browser tab pointing to http://localhost:8888/tree, which is the default location and port (8888) for Jupyter notebooks.

If you already have created any files or notebooks in the directory then you'll see them there, but if not you'll see an empty directory.

In either case, you can make a new Jupyter notebook right in the browser. In the upper-right corner of the page you'll see a dropdown menu called New. If you click on it, you should see an option to create a new notebook in whatever language(s) are available in this virtual environment. If you followed the directions above and they (hopefully) worked for you, then you should have an option to create a notebook using Python 3, and possibly Python 2. If you have other kernels installed, you might also see options to create notebooks in R, Julia, or other languages.

Start using the notebook

Your new notebook will be blank and look something like this:

blank Jupyter notebook

Click on Untitled and give your notebook a name, such as hello_world.

Click in the first blank, gray rectangle (called a cell) and enter some Python code, such as this:

squares = [num ** 2 for num in range(10)]
print("Hello world")
print(f"Here is a list of squares: {squares}")

Type SHIFT-ENTER to execute the code. Your notebook should now look something like this:

Jupyter notebook with some Python code

Click the first icon which looks like a floppy disk to save your notebook (Jupyter will also autosave your notebook).

You can now optionally commit your changes to version control if you'd like.

Congratulations, you installed, configured, and created your first Jupyter notebook!