2018

Thoughts on JetBrains' 2018 Data Science Survey

As I’m considering pursuing a career in data science, I found JetBrains' 2018 Data Science Survey interesting because it gives me a sense (albeit an imperfect one) of which tools and technologies might be most useful to learn.

Here are my takeaways from the survey:

  • The most popular programming languages regularly used for data analysis are:
      • Python 72%
      • Java 62%
      • R 23%
  • As an aside, Kotlin runs on the Java Virtual Machine, integrates with Hadoop and Spark, and is more concise than Java. Kotlin is sponsored by JetBrains, and the survey acknowledges that its results likely have some bias, but Kotlin may be an up-and-coming language.
  • Spark is most popular for big data, followed closely by Hadoop.
  • Jupyter notebooks and PyCharm are the most popular IDEs/editors.
  • TensorFlow is the most popular deep learning library. (TensorFlow is lower-level than scikit-learn, according to these Quora answers.)
  • Spreadsheet editors and Tableau are the most popular statistics packages for analyzing and visualizing data.
  • The most popular operating systems are:
      • Windows 62%
      • Linux 44%
      • macOS 37%
  • Computations are performed on:
      • local machines 78%
      • clusters 36%
      • cloud service 32%
  • The most popular cloud services are:
      • Amazon Web Services (AWS) 56%
      • Google Cloud Platform 41%
      • Microsoft Azure 28%
  • The correlation seems to be that the more expertise one’s manager has in data science, the more one tends to agree with this statement: "My manager gives me realistic assignments that are relevant to my skills and responsibilities, with a clear and specific description of the requirements."

It’s nice that I already have experience with Python, Jupyter, PyCharm, spreadsheet editors, Windows and Linux, and AWS.

I intend to learn pandas next.

After that, my priorities would probably be:

  • scikit-learn
  • Spark (Hadoop?)
  • TensorFlow
  • Tableau
  • Java

Getting Started with Jupyter notebook

Here are the steps I took to get started with Jupyter notebook:

Make and activate a new conda virtual environment with Python 3

conda create --name jupyter "python>=3"

This creates a new conda virtual environment named jupyter with Python 3 installed. As of December 2018, the latest version is 3.7.1, which is the version that gets installed with the above command. And if the latest version when you're reading this happens to be 3.8.2, then you'll have 3.8.2 in your virtual environment.

conda activate jupyter

This activates the new jupyter virtual environment.
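
One way to double-check that the activation worked is to ask Python itself. Here's a minimal sketch of that idea, not something conda provides: the function name describe_interpreter is my own, and it assumes that conda sets the CONDA_DEFAULT_ENV environment variable while an environment is active.

```python
import os
import sys


def describe_interpreter():
    """Return basic facts about the Python interpreter that is running."""
    return {
        # Version as a dotted string, e.g. "3.7.1"
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # Full path to the interpreter binary
        "executable": sys.executable,
        # Assumption: conda sets this on activation; None if it is not set
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

If the environment is active, I'd expect conda_env to say jupyter and the executable path to point inside that environment's directory.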

Configure the environment to allow Python 3 for the Jupyter notebooks

When I first launched Jupyter, I found that I could only create Jupyter notebooks using Python 2. Python 2 is old and approaching the end of its life, so I wanted to be able to create Jupyter notebooks that use Python 3; plus, I prefer using Python 3's syntax.

Thanks to this Stack Overflow answer, I discovered that a couple more lines of configuration did the trick:

conda install notebook ipykernel
ipython kernel install --user

The first line installs the notebook and ipykernel packages into the active virtual environment (the Jupyter Notebook used to be called the IPython Notebook, which is why IPython shows up in these names).

I don't fully understand the second line, but as far as I can tell it registers the IPython kernel so that Jupyter can find it, and the --user flag means that it's installed only for your user account.
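
To see what that registration leaves behind, here's a rough sketch that looks for kernel.json files in a per-user kernels directory. The function name list_user_kernels is my own invention, and the default path below is an assumption: it's where I'd expect per-user kernels to live on Linux, and other systems use different locations.

```python
import json
from pathlib import Path


def list_user_kernels(kernels_dir):
    """Map each kernel directory name to the display name in its kernel.json."""
    kernels = {}
    root = Path(kernels_dir)
    if not root.is_dir():
        return kernels
    for spec_file in sorted(root.glob("*/kernel.json")):
        with spec_file.open(encoding="utf-8") as handle:
            spec = json.load(handle)
        kernels[spec_file.parent.name] = spec.get("display_name", "")
    return kernels


# Assumption: the usual per-user location on Linux; adjust for your system.
default_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
print(list_user_kernels(default_dir))
```

An empty dictionary just means nothing is registered in that particular directory; kernels can also be installed system-wide or inside the environment itself.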

Make a directory and optionally initialize git

I think it's a good idea to have a directory for one's Jupyter notebooks so that they're in one place (unless you want to file them in different directories). It also allows you to then use git on the directory for version control.

I made a directory called jupyter, but choose whatever directory name you want.

mkdir jupyter
cd jupyter

You can also (optionally) initialize git in order to start using version control on the directory:

git init

I'm not going to cover how to use git because that's beyond the scope of this post, but there are some resources here on GitHub's site.

Start the Jupyter notebook server and make a notebook

jupyter notebook

This starts the server locally on your computer and automatically opens up a browser tab pointing to http://localhost:8888/tree, which is the default location and port (8888) for Jupyter notebooks.
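
If the browser tab doesn't open, a quick way to check whether anything is listening on that port is a plain TCP connection attempt. This is only a sketch: is_port_open is a name I made up, and a True result tells you that something accepts connections on the port, not that it's Jupyter.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


# 8888 is the default port mentioned above
print(is_port_open("localhost", 8888))
```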

If you have already created any files or notebooks in the directory, you'll see them there; if not, you'll see an empty directory.

In either case, you can make a new Jupyter notebook right in the browser. In the upper-right corner of the page you'll see a dropdown menu called New. If you click on that you should see an option to create a new notebook in whatever language(s) are available in this virtual environment. If you followed the directions above and they (hopefully) worked for you, then you should have an option to create a notebook using Python 3 and possibly Python 2. If you have other languages, you might see options to create notebooks in R, Julia, or perhaps other languages.

Start using the notebook

Your new notebook will be blank and look something like this:

[Image: blank Jupyter notebook]

Click on Untitled and give your notebook a name, such as hello_world.

Click in the first blank, gray rectangle (called a cell) and enter some Python code, such as this:

squares = [num ** 2 for num in range(10)]
print("Hello world")
print(f"Here is a list of squares: {squares}")

Press Shift+Enter to execute the code. Your notebook should now look something like this:

[Image: Jupyter notebook with some Python code]

Click the first icon, which looks like a floppy disk, to save your notebook (Jupyter will also autosave your notebook).

You can now optionally commit your changes to version control if you'd like.

Congratulations, you've installed and configured Jupyter and created your first notebook!

Getting Started with conda

There are some good resources for getting started with conda on the conda website.

Here are a few commands that I found useful as a complete newbie to conda:

| Command | What it does |
|---|---|
| conda info | displays information about your setup |
| conda info --envs | lists your virtual environments |
| conda create --name jupyter python=3.7 | creates a new virtual environment named jupyter with Python version 3.7.x |
| conda create --name jupyter "python>=3.7" | same as above, except that the Python version is greater than or equal to 3.7 (in this case, it has the same effect as the previous command); note that you need quotes (I used double quotes) when using >=, because otherwise the shell treats > as output redirection |
| conda activate jupyter | activates the virtual environment named jupyter; it's possible that your command might be different. You can also try source activate jupyter, or activate jupyter in Windows. |
| conda deactivate | deactivates the current virtual environment; it's possible that your command might be different. You can also try source deactivate, or deactivate in Windows. |
| conda env remove --name jupyter | deletes the virtual environment named jupyter and everything in it |
| python -V | checks which version of Python you are using |
| which python | finds the path to the version of Python you are using |
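
The difference between python=3.7 and "python>=3.7" confused me at first, so here's a toy illustration of how I understand it. This is not conda's actual matching logic, which handles many more cases; matches_spec is a hypothetical helper that only understands these two forms.

```python
def matches_spec(version, spec):
    """Simplified check of a version against 'python=X.Y' or 'python>=X.Y'.

    Illustration only: conda's real version matching is far more general.
    """
    parts = tuple(int(p) for p in version.split("."))
    if ">=" in spec:
        wanted = tuple(int(p) for p in spec.split(">=")[1].split("."))
        # Any version at or above the requested one qualifies
        return parts >= wanted
    wanted = tuple(int(p) for p in spec.split("=")[1].split("."))
    # 'python=3.7' means any 3.7.x release
    return parts[: len(wanted)] == wanted


print(matches_spec("3.7.1", "python=3.7"))   # True
print(matches_spec("3.8.2", "python=3.7"))   # False
print(matches_spec("3.8.2", "python>=3.7"))  # True
```

So the two commands only give the same result for as long as the newest available Python is still a 3.7.x release.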

Installing Anaconda in Ubuntu

Why I chose to install Anaconda in Ubuntu Linux

I want to start learning data science and there are a bunch of resources I want to try out (more on those another time). I watched the first two videos of Kevin Markham's Data School video series about pandas. But in order to get going with pandas, I needed to install it. He recommends using the Anaconda distribution of Python, which is supposed to have the easiest package installation (numpy, pandas, etc.). From what I've read online, the ease-of-installation argument is less important for Linux machines where you have admin privileges than for, say, Windows machines where you don't. However, I figured I'd go with the recommended distribution so I could eliminate potential problems and also because I may end up working on a Windows machine later on, so I'd prefer the cross-platform ease of installation of Anaconda.

The PATH variable problem that tripped me up

I found installing Anaconda to be really easy. I just went to the installation page, chose the Installing on Linux option and followed the instructions, including verifying the MD5 hash.

I followed the FAQ's advice and let the installer add the path to Anaconda to my system's PATH variable. This works fine from Anaconda's perspective, but it changes the default system Python to the version of Python installed with Anaconda. In other words, if you type python at the command line, you'll be running the version of python in the Anaconda directory instead of the default directory (in my case, /usr/bin/python).

After way too much time trying to resolve this, the solution I stumbled on turned out to be very simple. Here's the code block that Anaconda added at the end of my ~/.bashrc file:

# added by Anaconda3 5.3.1 installer
# >>> conda init >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$(CONDA_REPORT_ERRORS=false '/home/seth/anaconda3/bin/conda' shell.bash hook 2> /dev/null)"
if [ $? -eq 0 ]; then
    \eval "$__conda_setup"
else
    if [ -f "/home/seth/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/home/seth/anaconda3/etc/profile.d/conda.sh"
        CONDA_CHANGEPS1=false conda activate base
    else
        \export PATH="/home/seth/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda init <<<

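The \export PATH="/home/seth/anaconda3/bin:$PATH" line puts the Anaconda directory at the beginning of your PATH variable, ahead of its existing values (it's the fallback branch of the block, but as far as I can tell the other branches have the same effect on PATH by activating the base environment). This is great, except that when the system looks for the Python binaries, it searches the PATH directories in order, finds them in the Anaconda directory first, and stops looking.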
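
Here's a small sketch of that "first match wins" behavior. It's a simplification I wrote for illustration (find_first is my own name): it only checks that a file with the right name exists, whereas a real shell also requires it to be executable. Python's standard library has shutil.which for the real thing.

```python
import os


def find_first(command, path_value):
    """Return the first file named `command` in the directories of path_value.

    Simplified: a real shell lookup also checks execute permission.
    """
    for directory in path_value.split(os.pathsep):
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate):
            # Stop at the first hit; later directories are never consulted
            return candidate
    return None


print(find_first("python", os.environ.get("PATH", "")))
```

Whichever directory comes first in PATH and contains a python file is the one that gets used, which is why the order matters so much here.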

The solution I came up with (thanks to Piotr Dobrogost's comment on this Stack Overflow question) is to add the default Python directory to the beginning of the PATH variable, prior to the Anaconda directory.

To accomplish that, here are the two extra lines of code that I added after the Anaconda code block:

# (the Anaconda "conda init" block shown above stays here, unchanged)

# add /usr/bin to beginning of PATH so that python, python3, python2 use default system python not Anaconda python 
export PATH="/usr/bin:$PATH"

Is there a better, cleaner solution than this? Probably. I notice that this solution ends up putting /usr/bin in my PATH variable twice, which doesn't seem great. But at the moment this seems to be working, so I'm writing it up as a solution to this problem that I ran into.
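
For what it's worth, here's a sketch of how the duplicate entries could be cleaned up. I haven't wired this into my .bashrc; dedupe_path is just an illustrative helper. It keeps the first occurrence of each directory, so the lookup order (and therefore which python wins) stays the same. Note that it also drops empty entries, which some shells treat as the current directory.

```python
import os


def dedupe_path(path_value):
    """Drop repeated directories from a PATH-style string, keeping first occurrences."""
    seen = set()
    unique = []
    for directory in path_value.split(os.pathsep):
        if directory and directory not in seen:
            seen.add(directory)
            unique.append(directory)
    return os.pathsep.join(unique)


# Example with /usr/bin appearing twice, as in my setup
sample = os.pathsep.join(
    ["/usr/bin", "/home/seth/anaconda3/bin", "/usr/local/bin", "/usr/bin"]
)
print(dedupe_path(sample))
```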

Useful commands

  • echo $PATH will print the contents of the PATH variable to the screen. Note that directories are separated by colons.
  • source ~/.bashrc will re-run your .bashrc file in the current shell and update your PATH variable, although note what the Anaconda FAQ has to say about this:

If you have any terminal windows open, close them all then open a new one. You may need to restart your computer for the PATH change to take effect.

Disclaimer

I'm definitely not a command line or bash expert. Your setup may be different than mine and so these instructions may not apply to you. Of course, if you notice anything that's wrong or could be improved in this post, I'd be glad to hear!