Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas text matching like SQL's LIKE?

Is there a way to do something similar to SQL's LIKE syntax on a pandas text DataFrame column, such that it returns a list of indices, or a list of booleans that can be used for indexing the dataframe? For example, I would like to be able to match all rows where the column starts with 'prefix_', similar to WHERE <col> LIKE prefix_% in SQL.

like image 419
naught101 Avatar asked Mar 10 '14 03:03

naught101


People also ask

Is Panda faster than SQL?

This main difference can mean that the two tools are separate, however, you can also perform several of the same functions in each respective tool, for example, you can create new features from existing columns in pandas, perhaps easier and faster than in SQL.

Is PyArrow faster than pandas?

There's a better way. It's called PyArrow — an amazing Python binding for the Apache Arrow project. It introduces faster data read/write times and doesn't otherwise interfere with your data analysis pipeline. It's the best of both worlds, as you can still use Pandas for further calculations.

Is Panda like SQL?

Pandas is a Python library for data analysis and manipulation. SQL is a programming language that is used to communicate with a database. Most relational database management systems (RDBMS) use SQL to operate on tables stored in a database.

Is pandas query faster than LOC?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.

Can pandas query simulate like operator?

Pandas queries can simulate Like operator as well. Let's find a simple example of it. Here is the moment to point out two points: naming columns with reserved words like class is dangerous and might cause errors the other culprit for errors are None values.

Can I use a dummy dataset for PANDAS query?

The dataset used in this analysis and tutorial for pandas query is a dummy dataset created to mimic a dataframe with both text and numeric features. Feel free to use your own .csv file with either or both text and numeric columns to follow the tutorial.

How to use like in SQL?

How to Use LIKE in SQL? Suppose you have to retrieve some records based on whether a column contains a certain group of characters. As you know, in SQL the WHERE clause filters SELECT results. By itself, WHERE finds exact matches. But what if you need to find something using a partial match? In that case, you can use LIKE in SQL.

How do I use pandas query in Jupyter Notebook?

The query function from pandas is an easy and quick way to manipulate your dataframe. You can use SQL-like clauses that return certain rows from satisfying the conditions that you determine. It is beneficial if you are already in your Jupyter Notebook.ipynb file or.py file rather than having to re-upload or execute SQL commands on a SQL platform.


3 Answers

  1. To find all the values from the series that starts with a pattern "s":

SQL - WHERE column_name LIKE 's%'
Python - column_name.str.startswith('s')

  1. To find all the values from the series that ends with a pattern "s":

SQL - WHERE column_name LIKE '%s'
Python - column_name.str.endswith('s')

  1. To find all the values from the series that contains pattern "s":

SQL - WHERE column_name LIKE '%s%'
Python - column_name.str.contains('s')

For more options, check : https://pandas.pydata.org/pandas-docs/stable/reference/series.html

like image 39
H Raihan Avatar answered Sep 21 '22 12:09

H Raihan


You can use the Series method str.startswith (which takes a regex):

In [11]: s = pd.Series(['aa', 'ab', 'ca', np.nan])

In [12]: s.str.startswith('a', na=False)
Out[12]: 
0     True
1     True
2    False
3    False
dtype: bool

You can also do the same with str.contains (using a regex):

In [13]: s.str.contains('^a', na=False)
Out[13]: 
0     True
1     True
2    False
3    False
dtype: bool

So you can do df[col].str.startswith...

See also the SQL comparison section of the docs.

Note: (as pointed out by OP) by default NaNs will propagate (and hence cause an indexing error if you want to use the result as a boolean mask), we use this flag to say that NaN should map to False.

In [14]: s.str.startswith('a')  # can't use as boolean mask
Out[14]:
0     True
1     True
2    False
3      NaN
dtype: object
like image 54
Andy Hayden Avatar answered Sep 22 '22 12:09

Andy Hayden


you can use

s.str.contains('a', case = False)
like image 7
sushmit Avatar answered Sep 22 '22 12:09

sushmit