pandas string methods
This blog post is based on lesson 12 ("How do I use string methods in pandas?") from Data School's pandas video series.
pandas has many string methods available on a Series via .str.some_method()
For example:
- df.series_name.str.upper() changes all the strings in the Series called series_name (in the DataFrame called df) to uppercase
- df.series_name.str.title() changes the strings to title case (the first character of each word is capitalized)
- String methods on a Series return a Series. In the case of df.series_name.str.contains('bar'), the .contains() method returns a Series of True and False values: True if the string in series_name contains bar, and False if it does not.
- You can use the True/False Series returned by the .contains() method above to filter a DataFrame. For example, df[df.series_name.str.contains('bar')] returns a new DataFrame filtered to only those rows in which the series_name Series (that is, the column called series_name) contains the string bar.
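As a minimal sketch of the methods listed above, here they are applied to a small made-up DataFrame. The data is hypothetical; df and series_name simply mirror the placeholder names used in this post.

```python
import pandas as pd

# Hypothetical example data (not from the original lesson)
df = pd.DataFrame({"series_name": ["foo bar", "hello world", "bar none"]})

# Each string method returns a new Series; df itself is not modified
upper = df.series_name.str.upper()
title = df.series_name.str.title()

# Boolean Series: True where the string contains 'bar'
mask = df.series_name.str.contains("bar")

# Use the boolean Series to keep only the matching rows
filtered = df[mask]
print(filtered)
```

If the column has missing values, .str.contains() returns missing values for those rows, and passing na=False is one way to treat them as non-matches before filtering.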
You can see all of the str methods available in the pandas API reference.
String methods can be chained together
For example:
df.series_name.str.replace('[', '').str.replace(']', '')
will operate on the Series called series_name in the DataFrame called df. The first .replace() method replaces [ with nothing and the second .replace() method replaces ] with nothing, which removes the brackets from the strings in the Series.
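The chained call can be tried on made-up data, as in the sketch below. The DataFrame contents are hypothetical, and regex=False is spelled out as an assumption on my part so that the brackets are matched as literal characters regardless of the installed pandas version's default.

```python
import pandas as pd

# Hypothetical example data with bracketed strings
df = pd.DataFrame({"series_name": ["[1, 2]", "[3]", "no brackets"]})

# Each .str.replace() returns a Series, so another .str method can follow it
no_brackets = (
    df.series_name
    .str.replace("[", "", regex=False)   # remove every literal '['
    .str.replace("]", "", regex=False)   # remove every literal ']'
)
print(no_brackets)
```

Because every step returns a Series, each link in the chain has to go back through the .str accessor.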
Many pandas string methods accept regular expressions
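The two chained .replace() methods in the previous example can be replaced with a single regex .replace(), like this:
df.series_name.str.replace(r'[\[\]]', '', regex=True)
Here, the .replace() method takes the regex string r'[\[\]]' and replaces every match with nothing. Passing regex=True is needed in pandas 2.0 and later, where .str.replace() treats the pattern as a literal string by default, and the r prefix makes it a raw string so that Python leaves the backslashes alone. That regular expression can be deconstructed as follows: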
- the outer brackets [ and ] define a character class, meaning that any of the characters within those character class brackets will be replaced
- inside the outer brackets is \[\]. It represents the two characters [ and ] which will be replaced. However, since brackets have a special meaning in regular expressions, they need to be escaped with backslashes (\). So the bracket characters to be replaced end up looking like this: \[\]
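Here is a small sketch of that character-class pattern on made-up data. The DataFrame contents are hypothetical, and regex=True is passed explicitly so the pattern is interpreted as a regular expression whatever the pandas version's default is.

```python
import re

import pandas as pd

# Hypothetical example data with bracketed strings
df = pd.DataFrame({"series_name": ["[1, 2]", "[3]", "no brackets"]})

# Character class matching either a literal '[' or a literal ']'
pattern = r"[\[\]]"

# One regex replacement does the work of the two chained calls
cleaned = df.series_name.str.replace(pattern, "", regex=True)
print(cleaned)

# The same pattern behaves identically in Python's built-in re module
plain = re.sub(pattern, "", "[1, 2]")
```

Writing the pattern as a raw string (the r prefix) keeps Python from trying to interpret \[ and \] as string escape sequences.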
You can see working code for all of the above examples in my Jupyter notebook.