pyspark.sql.functions.regexp_extract_all

pyspark.sql.functions.regexp_extract_all(str, regexp, idx=None)

Extracts every substring of str that matches the Java regex regexp, returning for each match the text captured by the regex group at index idx.

New in version 3.5.0.

Parameters
str : Column or column name

target column to work on.

regexp : Column or column name

Java regex pattern to apply.

idx : Column or int, optional

index of the regex group to extract from each match. When omitted, group 1 is used.

Returns
Column

an array containing, for every match of the Java regex in str, the string captured by the group at index idx.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("100-200, 300-400", r"(\d+)-(\d+)")], ["str", "regexp"])
>>> df.select('*', sf.regexp_extract_all('str', sf.lit(r'(\d+)-(\d+)'))).show()
+----------------+-----------+---------------------------------------+
|             str|     regexp|regexp_extract_all(str, (\d+)-(\d+), 1)|
+----------------+-----------+---------------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                             [100, 300]|
+----------------+-----------+---------------------------------------+
>>> df.select('*', sf.regexp_extract_all('str', sf.lit(r'(\d+)-(\d+)'), sf.lit(1))).show()
+----------------+-----------+---------------------------------------+
|             str|     regexp|regexp_extract_all(str, (\d+)-(\d+), 1)|
+----------------+-----------+---------------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                             [100, 300]|
+----------------+-----------+---------------------------------------+
>>> df.select('*', sf.regexp_extract_all('str', sf.lit(r'(\d+)-(\d+)'), 2)).show()
+----------------+-----------+---------------------------------------+
|             str|     regexp|regexp_extract_all(str, (\d+)-(\d+), 2)|
+----------------+-----------+---------------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                             [200, 400]|
+----------------+-----------+---------------------------------------+
>>> df.select('*', sf.regexp_extract_all('str', sf.col("regexp"))).show()
+----------------+-----------+----------------------------------+
|             str|     regexp|regexp_extract_all(str, regexp, 1)|
+----------------+-----------+----------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                        [100, 300]|
+----------------+-----------+----------------------------------+
>>> df.select('*', sf.regexp_extract_all(sf.col('str'), "regexp")).show()
+----------------+-----------+----------------------------------+
|             str|     regexp|regexp_extract_all(str, regexp, 1)|
+----------------+-----------+----------------------------------+
|100-200, 300-400|(\d+)-(\d+)|                        [100, 300]|
+----------------+-----------+----------------------------------+
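
The role of the group index in the examples above can be illustrated without a Spark session. The following is a minimal sketch in plain Python, not Spark's implementation: `extract_all_groups` is a hypothetical helper introduced here for illustration, and it uses Python's `re` module, which is assumed to behave like the Java regex engine for this simple pattern (the two engines differ in other cases).

```python
import re


def extract_all_groups(text, pattern, idx=1):
    """Hypothetical helper: collect group `idx` from every non-overlapping match.

    Mirrors the documented default of group 1; this is an illustration only,
    not how Spark evaluates regexp_extract_all.
    """
    return [m.group(idx) for m in re.finditer(pattern, text)]


text = "100-200, 300-400"
pattern = r"(\d+)-(\d+)"

# Group 1 is the first parenthesised sub-pattern of each match.
first_groups = extract_all_groups(text, pattern)
# Group 2 is the second parenthesised sub-pattern of each match.
second_groups = extract_all_groups(text, pattern, 2)
# In Python's re, group 0 is the entire match.
whole_matches = extract_all_groups(text, pattern, 0)

print(first_groups)   # ['100', '300']
print(second_groups)  # ['200', '400']
print(whole_matches)  # ['100-200', '300-400']
```

The first two results line up with the idx=1 and idx=2 outputs shown in the Spark examples above; the third shows what index 0 means in Python's `re`.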